Surprisingly, I've been unable to find anyone else really doing this, but surely someone has. I'm working on a Python project that involves spell checking some 16 thousand words, and unfortunately that number is only going to grow. Right now I'm pulling the words from Mongo, iterating through them, and spell checking each one with pyenchant. I've ruled Mongo out as the bottleneck by grabbing all my items from there up front; even so, it takes around 20 minutes to process the 16k words, which is obviously longer than I want to spend. Here's a simplified version of the core loop (database/collection/field names changed):
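```python
import enchant
from pymongo import MongoClient

# Pull everything out of Mongo up front so the DB isn't part of the timing
# ("mydb", "words", and the "word" field are placeholders for my real names)
client = MongoClient()
words = [doc["word"] for doc in client.mydb.words.find()]

d = enchant.Dict("en_US")
suggestions = {}
for word in words:
    if not d.check(word):
        suggestions[word] = d.suggest(word)
```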
This leaves me with a couple of ideas/questions:

Obviously I could leverage threading or some other form of parallelism. Even if I chop the work into four pieces, I'm still looking at roughly five minutes assuming peak performance. Something like this rough multiprocessing sketch, where each worker builds its own Dict since enchant objects can't be shared across processes:
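```python
import multiprocessing

import enchant

def init_worker():
    # Build one Dict per worker process in the initializer so it isn't
    # re-created for every word
    global checker
    checker = enchant.Dict("en_US")

def check_word(word):
    # Only pay for the expensive suggest() call when the word is misspelled
    if checker.check(word):
        return word, None
    return word, checker.suggest(word)

if __name__ == "__main__":
    words = ["exmaple", "wrold", "correct"]  # stand-in for my real 16k words
    with multiprocessing.Pool(processes=4, initializer=init_worker) as pool:
        results = dict(pool.map(check_word, words))
```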
Is there a way to tell which spelling library Enchant is using underneath pyenchant? Enchant's website seems to imply it will use all available spelling libraries/dictionaries when spell checking. If so, then I'm potentially running each word through three or four spelling dictionaries, which could be my issue right here, but I'm having a hard time proving that's the case. I did find that pyenchant exposes the underlying broker, so presumably something like this would at least show which provider a Dict resolved to, and let me pin one explicitly (this is just my reading of the docs, untested):
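```python
import enchant

broker = enchant.Broker()

# List every backend enchant can see on this machine (aspell, myspell, etc.)
for provider in broker.describe():
    print(provider.name, provider.desc, provider.file)

# Which provider a given dictionary actually resolved to
d = enchant.Dict("en_US")
print(d.provider.name)

# And, if it helps, force a single backend for a language
broker.set_ordering("en_US", "aspell")
d2 = broker.request_dict("en_US")
```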
Even if that is the problem, is my option really to uninstall the other libraries? That sounds unfortunate.

So, any ideas on how I can squeeze at least a bit more performance out of this? I'm fine with chopping the work into parallel tasks, but I'd still like the core piece to be a bit faster before I do.
Edit: Sorry, posting before my morning coffee... Enchant generates a list of suggestions for me when a word is spelled incorrectly, and that appears to be where most of the time in this processing step goes. Here's the rough timing harness I used to pin that down (simplified):
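```python
import time

import enchant

d = enchant.Dict("en_US")
words = ["exmaple", "speling", "correct"]  # stand-in for the real word list

check_time = suggest_time = 0.0
for word in words:
    t0 = time.perf_counter()
    ok = d.check(word)
    check_time += time.perf_counter() - t0
    if not ok:
        t0 = time.perf_counter()
        d.suggest(word)  # the expensive call
        suggest_time += time.perf_counter() - t0

print(f"check: {check_time:.2f}s, suggest: {suggest_time:.2f}s")
```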