views: 80
answers: 2

Surprisingly, I've been unable to find anyone else really doing this, but surely someone has. I'm currently working on a Python project that involves spell checking some 16 thousand words. That number of words is only going to grow, unfortunately. Right now I'm pulling words from Mongo, iterating through them, and then spell checking them with pyenchant. I've removed Mongo as the potential bottleneck by grabbing all my items from there first. That leaves me with around 20 minutes to process 16k words, which is obviously longer than I want to spend. This leaves me with a couple of ideas/questions:

  1. Obviously I could leverage threading or some form of parallelism. Even if I chop this into 4 pieces, I'm still looking at roughly 5 minutes assuming peak performance.

  2. Is there a way to tell what spelling library Enchant is using underneath pyenchant? Enchant's website seems to imply it'll use all available spelling libraries/dictionaries when spell checking. If so, then I'm potentially running each word through three or four spelling dicts. This could be my issue right here, but I'm having a hard time proving that's the case. Even if it is, is my option really to uninstall the other libraries? Sounds unfortunate.
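
For what it's worth, pyenchant has a Broker object that can describe the providers Enchant can see, and each Dict reports the provider backing it, which should tell you whether more than one backend is actually in play. A minimal sketch, assuming the standard pyenchant API and an en_US dictionary (swap in whatever tag you actually load):

    import enchant

    # List every provider Enchant can see (aspell, myspell/hunspell, etc.)
    broker = enchant.Broker()
    for provider in broker.describe():
        print(provider)

    # An individual Dict is backed by a single provider; see which one you get
    d = enchant.Dict("en_US")
    print(d.provider)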

So, any ideas on how I can squeeze at least a bit more performance out of this? I'm fine with chopping this into parallel tasks, but I'd still like to get the core piece of it to be a bit faster before I do.

Edit: Sorry, posting before morning coffee... Enchant generates a list of suggestions for me if a word is incorrectly spelled. That would appear to be where I spend most of my time in this processing portion.
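
For concreteness, a minimal sketch of the kind of loop described above, assuming pyenchant's Dict API and that the words have already been pulled out of Mongo into a plain list; the suggest() call is the part that dominates the runtime:

    import enchant

    d = enchant.Dict("en_US")  # assumption: an English dictionary

    def spellcheck(words):
        """Return a dict mapping misspelled words to Enchant's suggestions."""
        results = {}
        for word in words:
            if not d.check(word):
                # Generating suggestions is where most of the time goes
                results[word] = d.suggest(word)
        return results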

+3  A: 

I think we agree that the performance bottleneck here is Enchant; for this size of dataset it's nearly instantaneous to do a boolean isSpeltCorrectly. So, why not:

  1. Build a set in memory of correctly-spelt words, using the dictionaries that Enchant does or fetching your own (e.g. OpenOffice's).

    Optionally, uniquify the document's words, say by putting them in a set. This probably won't save you very much.

  2. Check whether each word is in the set or not. This is fast, because it's just a set lookup. (Python sets are hash tables, so membership tests are O(1) on average, not O(log N).)

  3. If it isn't, then ask Enchant to recommend a word for it. This is necessarily slow.

This assumes that most of your words are spelt correctly; if they aren't, you'll have to be cleverer.
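
A rough sketch of this approach, assuming a plain one-word-per-line word list on disk (e.g. exported from an OpenOffice/hunspell dictionary; the wordlist.txt path below is just a placeholder) and keeping pyenchant only for suggestions on the misses:

    import enchant

    def load_wordlist(path):
        # One correctly-spelt word per line; lower-cased for case-insensitive lookups
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def spellcheck(words, wordlist_path="wordlist.txt"):  # placeholder path
        known = load_wordlist(wordlist_path)
        d = enchant.Dict("en_US")          # assumption: English dictionary
        results = {}
        for word in set(words):            # uniquify: each distinct word checked once
            if word.lower() not in known:          # O(1) average-case set lookup
                results[word] = d.suggest(word)    # only the misses hit Enchant
        return results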

katrielalex
Down to 6 minutes going this route. It's still long, but it's better. Just a lot of bad spelling to be found when crawling the web :)
f4nt
You could also try parallelising the Enchant lookups, maybe even with a separate process running as an interface to Enchant to handle caching.
katrielalex
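
A rough sketch of the parallelising idea from the comment above, assuming the expensive part is suggest() on the misspelled words: a multiprocessing pool where each worker builds its own Dict rather than trying to pass one across process boundaries (the en_US tag and the pool size are assumptions):

    import multiprocessing

    import enchant

    _dict = None

    def _init_worker():
        # Each worker process gets its own Dict
        global _dict
        _dict = enchant.Dict("en_US")

    def _suggest(word):
        return word, _dict.suggest(word)

    def parallel_suggestions(misspelled_words, processes=4):
        # Run under "if __name__ == '__main__':" on platforms that spawn workers
        with multiprocessing.Pool(processes, initializer=_init_worker) as pool:
            return dict(pool.map(_suggest, misspelled_words))
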
+1  A: 

Perhaps a better way of doing this would be to compress the document, as this would remove any repeated instances of words, which you actually only need to spell check once. I only suggest this as it would probably perform faster than writing your own unique word finder.

The compressed version should have references to the unique words somewhere within its file; you might have to look up how such files are structured.

You can then spell check all the unique words. I hope you are not checking them with individual SQL queries or something like that; you should load a dictionary in the form of a trie into memory and then check words against that.

Once this is done, simply uncompress it and, hey presto, it's all spell checked. This should be a fairly fast solution.

Or perhaps you don't need to go through the whole zipping process at all if spell checking really is as fast as the comments suggest; in that case the slowness would point to a problem in the implementation.
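
If you do go the load-a-dictionary-into-memory route, here is a minimal trie sketch for the membership test (in practice a plain Python set gives the same check with less code, as the other answer points out; this just illustrates the trie form):

    def build_trie(words):
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True          # end-of-word marker
        return root

    def in_trie(trie, word):
        node = trie
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

    # usage
    trie = build_trie(["hello", "help", "world"])
    print(in_trie(trie, "help"))   # True
    print(in_trie(trie, "hel"))    # False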

Tom Gullen
Erm, I'm not sure I understand. Do you mean zip the document and parse the binary zip file? Are you sure that works?
katrielalex
The only reasonable way to make this work would be to make a Huffmanesque tree where the lexemes were entire words. This is computationally equivalent to katrielalex's answer. Any other compression that operated at the sub-word level would be insanely complicated yet add no utility.
msw