I recently used Adobe Acrobat Pro's OCR feature to process a Japanese kanji dictionary. The overall quality of the output is generally quite a bit better than I'd hoped, but word boundaries in the English portions of the text have often been lost. For example, here's one line from my file:

softening;weakening(ofthemarket)8 CHANGE [transform] oneselfINTO,takethe form of; disguise oneself

I could go around and insert the missing word boundaries everywhere, but this would be adding to what is already a substantial task. I'm hoping that there might exist software which can analyze text like this, where some of the words run together, and split the text on probable word boundaries. Is there such a package?

I'm using Emacs, so it'd be extra-sweet if the package in question were already an Emacs package or could be readily integrated into Emacs, so that I could simply put my cursor on a line like the above and repeatedly invoke some command that splits the line on word boundaries in decreasing order of probable correctness.

+1  A: 

I am unaware of anything that already exists.

The simplest method is to match the longest dictionary words against your string. Of course, a run of letters can often be split in more than one way, so you'd have to consider the possible combinations. It's computationally expensive done this way, but fairly quick to write.
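
As a minimal sketch of that idea in Python (assuming a Unix-style word list at /usr/share/dict/words; adjust the path for your system), this tries the longest dictionary prefix first and backtracks when the remainder can't be segmented:

    from functools import lru_cache

    # Load a dictionary; the path is an assumption -- use whatever word list you have.
    with open("/usr/share/dict/words") as f:
        WORDS = {line.strip().lower() for line in f if line.strip()}

    def segment(text):
        """Return one plausible segmentation of text, or None if none exists."""
        @lru_cache(maxsize=None)
        def go(s):
            if not s:
                return ()
            # Try the longest prefix first so longer words win over short ones.
            for i in range(len(s), 0, -1):
                if s[:i].lower() in WORDS:
                    rest = go(s[i:])
                    if rest is not None:
                        return (s[:i],) + rest
            return None
        return go(text)

    print(segment("ofthemarket"))  # one plausible split, e.g. ('of', 'the', 'market')

Note that longest-match-first can still pick a wrong split (with some word lists it prefers "oft he" over "of the"); scoring candidate segmentations by word frequency is the usual refinement. But something this simple may be enough to drive an interactive, line-at-a-time workflow like the one you describe.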

Pestilence
A: 

I couldn't find anything either, and ended up going with a more interactive approach.

Sean