Thursday, August 20, 2009

Spell-checking in JavaScript

I just heard of Atwood's Law a few days ago. Apparently Jeff Atwood first published his discovery a couple of years ago, which is like a century in Internet time, but I can't keep up with everything. The Law states that "any application that can be written in JavaScript, will eventually be written in JavaScript". As Atwood explains, it is based on Tim Berners-Lee's Principle Of Least Power:

"Computer Science spent the last forty years making languages which were as powerful as possible. Nowadays we have to appreciate the reasons for picking not the most powerful solution but the least powerful. The less powerful the language, the more you can do with the data stored in that language."


I think there is a common theme between Atwood's Law, Zawinski's Law ("Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can") and Greenspun's Tenth Rule ("Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp"). The inevitability of the specified outcome makes the world seem like a much simpler place. There is no need to argue. Believe and ye shall be redeemed.

I might have been in this state of mind when I came across Peter Norvig's "How to Write a Spelling Corrector", because I knew instantly that I had to port the algorithm to JavaScript. The algorithm is quite simple and has already been ported to many different languages, so I seized the opportunity to study the differences in expressiveness, performance and style between these languages and my latest affection, JavaScript.

After a few nights' work I have it working and free (as in beer and speech) for anyone to use. Check out the speller project on GitHub. If you want to understand how the algorithm works, you should go and read Norvig's article, although you might get some hints from the comments in my code. The modestly condensed version of the code is 53 lines:


var speller = {};
speller.train = function (text) {
  var m, re = /[a-z]+/g; // one regex object, so lastIndex carries over between exec() calls
  while ((m = re.exec(text.toLowerCase()))) {
    speller.nWords[m[0]] = speller.nWords.hasOwnProperty(m[0]) ? speller.nWords[m[0]] + 1 : 1;
  }
};
speller.correct = function (word) {
  if (speller.nWords.hasOwnProperty(word)) return word; // known word: nothing to fix
  var candidates = {}, list = speller.edits(word);
  list.forEach(function (edit) { // known words one edit away, keyed by their count
    if (speller.nWords.hasOwnProperty(edit)) candidates[speller.nWords[edit]] = edit;
  });
  if (speller.countKeys(candidates) > 0) return candidates[speller.max(candidates)];
  list.forEach(function (edit) { // otherwise look at words two edits away
    speller.edits(edit).forEach(function (w) {
      if (speller.nWords.hasOwnProperty(w)) candidates[speller.nWords[w]] = w;
    });
  });
  return speller.countKeys(candidates) > 0 ? candidates[speller.max(candidates)] : word;
};
speller.nWords = {};
speller.countKeys = function (object) {
  var attr, count = 0;
  for (attr in object)
    if (object.hasOwnProperty(attr))
      count++;
  return count;
};
speller.max = function (candidates) {
  var candidate, arr = [];
  for (candidate in candidates)
    if (candidates.hasOwnProperty(candidate))
      arr.push(candidate);
  return Math.max.apply(null, arr);
};
speller.letters = "abcdefghijklmnopqrstuvwxyz".split("");
speller.edits = function (word) {
  var i, results = [];
  for (i=0; i < word.length; i++) // deletions
    results.push(word.slice(0, i) + word.slice(i+1));
  for (i=0; i < word.length-1; i++) // transpositions
    results.push(word.slice(0, i) + word.slice(i+1, i+2) + word.slice(i, i+1) + word.slice(i+2));
  for (i=0; i < word.length; i++) // replacements
    speller.letters.forEach(function (l) {
      results.push(word.slice(0, i) + l + word.slice(i+1));
    });
  for (i=0; i <= word.length; i++) // insertions
    speller.letters.forEach(function (l) {
      results.push(word.slice(0, i) + l + word.slice(i));
    });
  return results;
};
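
As a rough sanity check on what the edits step generates: for a word of length n it yields n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions, that is 54n+25 strings in total, duplicates included. The sketch below is a standalone copy of that logic so it can be run on its own; the name `edits1` is made up for this example and is not part of the project.

```javascript
// Standalone sketch: edits1 is a hypothetical name used only for this example.
var letters = "abcdefghijklmnopqrstuvwxyz".split("");

function edits1(word) {
  var i, results = [];
  // deletions: drop the character at position i (n strings)
  for (i = 0; i < word.length; i++)
    results.push(word.slice(0, i) + word.slice(i + 1));
  // transpositions: swap the characters at i and i+1 (n-1 strings)
  for (i = 0; i < word.length - 1; i++)
    results.push(word.slice(0, i) + word.charAt(i + 1) + word.charAt(i) + word.slice(i + 2));
  // replacements: put each letter in place of the character at i (26n strings)
  for (i = 0; i < word.length; i++)
    letters.forEach(function (l) {
      results.push(word.slice(0, i) + l + word.slice(i + 1));
    });
  // insertions: put each letter before position i, including the end (26(n+1) strings)
  for (i = 0; i <= word.length; i++)
    letters.forEach(function (l) {
      results.push(word.slice(0, i) + l + word.slice(i));
    });
  return results;
}

var word = "speling";
var all = edits1(word);
console.log(all.length === 54 * word.length + 25); // true (403 strings for n = 7)
console.log(all.indexOf("spelling") !== -1);       // true: one insertion away
```

The second pass in `correct` applies the same step to every one of these strings, which is why looking two edits away is so much more expensive than looking one edit away.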


It may not be as succinct as Norvig's Python version (21 lines), or the record-holding Awk and F# versions (15 lines), but it is much shorter than the versions in C (184 lines), C++, Perl, PHP, Rebol and Erlang. There is even a Java version with 372 lines; it must contain some sort of spell-checking framework in there, or something. However, the condensed version above, although correct, has terrible performance in most JavaScript engines. For real-world use you should pick the regular version, which may be slightly longer but performs much better.
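
One likely reason for the slowdown is the training loop in the condensed version, which lowercases the entire text again on every pass. A minimal sketch of the cheaper approach, lowercasing once and reusing a single regular expression object, could look like this; `countWords` is a hypothetical name chosen for the example, not the project's API.

```javascript
// Sketch only: countWords is a made-up name, not taken from the speller project.
function countWords(text) {
  var counts = {}, m;
  var token = /[a-z]+/g;          // one regex object; exec() resumes from token.lastIndex
  var lower = text.toLowerCase(); // lowercase the whole text once, not once per match
  while ((m = token.exec(lower))) {
    counts[m[0]] = counts.hasOwnProperty(m[0]) ? counts[m[0]] + 1 : 1;
  }
  return counts;
}

var counts = countWords("The quick brown fox. The lazy dog; the END.");
console.log(counts.the); // 3
console.log(counts.end); // 1
```

With a large training text the difference matters: the work per matched word no longer grows with the size of the whole text.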

Since this was a toy project of mine, I wanted to play with ServerJS modules as well, in order to run it as a shell script. I turned the code into a securable module, so you can run it from the command line, using narwhal. I have a couple of scripts to that end. Of course since this is JavaScript, you can try it from your browser, by visiting the demo page, courtesy of GitHub Pages. Be sure to use a modern browser, like Firefox 3.5, Safari 4 or Chrome 3 (beta), otherwise you won't be able to run the test suite, since I used the brand-new Web Workers to make the long-running tasks run in the background.
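
For readers who have not used Web Workers yet, the general shape is roughly as follows. This is a sketch under assumptions rather than the demo's actual code: the file name "worker.js", the message shape and the `correctWord` stand-in are all invented for illustration.

```javascript
// Sketch only: "worker.js", the { id, word } message shape and correctWord
// are assumptions made for this example, not the demo's real code.

// Stand-in for the real corrector so the sketch is self-contained.
function correctWord(word) {
  return word === "speling" ? "spelling" : word;
}

// Worker side: turn one incoming message into one reply.
function handleRequest(data) {
  return { id: data.id, word: data.word, correction: correctWord(data.word) };
}

// Inside a worker there is no window, but importScripts is defined.
if (typeof importScripts === "function") {
  self.onmessage = function (event) {
    self.postMessage(handleRequest(event.data));
  };
}

// Page side: guarded so the file also loads where Worker is unavailable.
if (typeof window !== "undefined" && typeof Worker !== "undefined") {
  var worker = new Worker("worker.js");
  worker.onmessage = function (event) {
    console.log(event.data.word + " -> " + event.data.correction);
  };
  worker.postMessage({ id: 1, word: "speling" });
}
```

The page stays responsive because the slow work happens inside the worker's `onmessage` handler, and the two sides only exchange messages.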

Norvig's Python implementation took 16 seconds for test 1, and the best I got was 25 seconds with Safari 4 on my Mac. Narwhal uses Rhino by default, so it is definitely not competitive in such tests (139 seconds), but I'm planning to fix support for v8cgi and give that a try.

And it goes without saying that I'd love to hear about ways to improve the performance or the conciseness of the code. If you have any ideas, don't be shy: leave a comment or, even better, fork the code and send me a pull request on GitHub.

9 comments:

Sud said...

If we use then this algorithm is not working properly

Anonymous said...

Amazing :)

Anonymous said...

Incredible, !!!!!

Albert Benatov said...

hi, just letting you know that current version gets browser stuck.

this is due to using /[a-z]+/g.exec(...)

rather than something like

var r_token = /[a-z]+/g;
r_token.exec(...)

and another small improvement -

text && (text = text.toLowerCase());

...

r_token.exec(text)


instead of calling toLowerCase() on each match,

regards,
Albert

past said...

You really don't want to use the condensed version for real world use. See the regular one that I mention in the post that already contains both of your suggestions.

Xena said...

Hi,
Does your github source code only work on javascript environments? I tried it on WAMP but the program just stalled.

past said...

Not sure what you mean by that, but there is an example in the demo/ directory on github that runs on browsers.

Unless otherwise expressly stated, all original material in this weblog is licensed under a Creative Commons Attribution 3.0 License.