Thursday, August 20, 2009

Spell-checking in JavaScript

I just heard of Atwood's Law a few days ago. Apparently Jeff Atwood first published his discovery a couple of years ago, which is like a century in Internet time, but I can't keep up with everything. The Law states that "any application that can be written in JavaScript, will eventually be written in JavaScript". As Atwood explains, it is based on Tim Berners-Lee's Principle Of Least Power:

Computer Science spent the last forty years making languages which were as powerful as possible. Nowadays we have to appreciate the reasons for picking not the most powerful solution but the least powerful. The less powerful the language, the more you can do with the data stored in that language.


I think there is a common theme between Atwood's Law, Zawinski's Law ("Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can") and Greenspun's Tenth Rule ("Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp"). The inevitability of the specified outcome makes the world seem like a much simpler place. There is no need to argue. Believe and ye shall be redeemed.

I might have been in this state of mind, when I came across Peter Norvig's "How to Write a Spelling Corrector", because I knew instantly I had to port the algorithm to JavaScript. The algorithm is quite simple and has already been ported to many different languages, so I seized the opportunity to study the differences in expressiveness, performance and style, between these languages and my latest affection, JavaScript.

After a few night's work I have it working and free (as in beer and speech) for anyone to use. Check out the speller project on GitHub. If you want to understand how the algorithm works, you should go and read Norvig's article, although you might get some hints from the comments in my code. The modestly condensed version of the code is 53 lines:


var speller = {};
speller.train = function (text) {
var m;
while ((m = /[a-z]+/g.exec(text.toLowerCase()))) {
speller.nWords[m[0]] = speller.nWords.hasOwnProperty(m[0]) ? speller.nWords[m[0]] + 1 : 1;
}
};
speller.correct = function (word) {
if (speller.nWords.hasOwnProperty(word)) return word;
var candidates = {}, list = speller.edits(word);
list.forEach(function (edit) {
if (speller.nWords.hasOwnProperty(edit)) candidates[speller.nWords[edit]] = edit;
});
if (speller.countKeys(candidates) > 0) return candidates[speller.max(candidates)];
list.forEach(function (edit) {
speller.edits(edit).forEach(function (w) {
if (speller.nWords.hasOwnProperty(w)) candidates[speller.nWords[w]] = w;
});
});
return speller.countKeys(candidates) > 0 ? candidates[speller.max(candidates)] : word;
};
speller.nWords = {};
speller.countKeys = function (object) {
var attr, count = 0;
for (attr in object)
if (object.hasOwnProperty(attr))
count++;
return count;
};
speller.max = function (candidates) {
var candidate, arr = [];
for (candidate in candidates)
if (candidates.hasOwnProperty(candidate))
arr.push(candidate);
return Math.max.apply(null, arr);
};
speller.letters = "abcdefghijklmnopqrstuvwxyz".split("");
speller.edits = function (word) {
var i, results = [];
for (i=0; i < word.length; i++)
results.push(word.slice(0, i) + word.slice(i+1));
for (i=0; i < word.length-1; i++)
results.push(word.slice(0, i) + word.slice(i+1, i+2) + word.slice(i, i+1) + word.slice(i+2));
for (i=0; i < word.length; i++)
speller.letters.forEach(function (l) {
results.push(word.slice(0, i) + l + word.slice(i+1));
});
for (i=0; i <= word.length; i++)
speller.letters.forEach(function (l) {
results.push(word.slice(0, i) + l + word.slice(i));
});
return results;
};


It may not be as succinct as Norvig's Python version (21 lines), or the record-holding Awk and F# versions (15 lines), but is much better than C (184 lines), C++, Perl, PHP, Rebol and Erlang. There is even a Java version with 372 lines. It must contain some sort of spell-checking framework in there, or something. The condensed version above, although correct, has terrible performance in most JavaScript engines, however. For real-world use you should pick the regular version which may be slightly longer, but performs much better.

Since this was a toy project of mine, I wanted to play with ServerJS modules as well, in order to run it as a shell script. I turned the code into a securable module, so you can run it from the command line, using narwhal. I have a couple of scripts to that end. Of course since this is JavaScript, you can try it from your browser, by visiting the demo page, courtesy of GitHub Pages. Be sure to use a modern browser, like Firefox 3.5, Safari 4 or Chrome 3 (beta), otherwise you won't be able to run the test suite, since I used the brand-new Web Workers to make the long-running tasks run in the background.

Norvig's Python implementation took 16 seconds for test 1 and the best I got was 25 seconds with Safari 4 on my Mac. Narwhal is using Rhino by default, so it is definitely not competitive in such tests (139 seconds), but I'm planning to fix support for v8cgi and give that a try.

And it goes without saying that I'd love to hear about ways to improve the performance or the conciseness of the code. If you have any ideas, don't be shy, leave a comment or even better fork the code and send me a pull request on GitHub.

Creative Commons License Unless otherwise expressly stated, all original material in this weblog is licensed under a Creative Commons Attribution 3.0 License.