Need good OCR for printed source code listing, any ideas?

+1 Q:

At my work, I sometimes have to take some printed source code and manually type the source code into a text editor. Do not ask why.

Obviously, typing it up takes a long time, and it always takes extra time to debug the typing errors (oops, missed a "$" sign there).

I decided to try some OCR solutions like:

Microsoft Document Imaging - has built in OCR
- Result: Missed all the leading whitespace, missed all the underscores, interpreted many of the punctuation characters incorrectly.
- Conclusion: Slower than manually typing in code.
Various online web OCR apps
- Result: Similar to or worse than Microsoft Document Imaging
- Conclusion: Slower than manually typing in code.

I feel like source code would be very easy to OCR given the font is sans serif and monospace.

Have any of you found a good OCR solution that works well on source code?

Maybe I just need a better OCR solution (not necessarily source code specific)?

Printed text is usually easier to OCR than handwriting, but it all depends on your source image. I generally find that capturing in PNG format with reduced colors (grayscale is best), plus some manual cleanup (removing any image noise introduced by scanning, etc.), works best.
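As a rough illustration of that cleanup, here is a minimal sketch in plain Python. It assumes the scan has already been loaded as rows of (R, G, B) tuples; the function names and the 3x3 median window are illustrative choices, not part of any OCR product. In practice an imaging library or the scanner software would do this step.

```python
from statistics import median

def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) pixels to 0-255 luminance values (BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_rows]

def median_filter(gray):
    """Replace each pixel with the median of its neighbourhood (3x3, smaller at the edges).

    This removes isolated specks, which is the kind of scanner noise
    that confuses OCR engines.
    """
    height, width = len(gray), len(gray[0])
    cleaned = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [gray[j][i]
                      for j in range(max(0, y - 1), min(height, y + 2))
                      for i in range(max(0, x - 1), min(width, x + 2))]
            row.append(int(median(window)))
        cleaned.append(row)
    return cleaned

# Hypothetical 3x3 white scan with a single black speck in the middle.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
scan = [[WHITE, WHITE, WHITE],
        [WHITE, BLACK, WHITE],
        [WHITE, WHITE, WHITE]]
cleaned = median_filter(to_grayscale(scan))
```

A median filter is used here instead of a blur because it removes isolated specks without smearing character edges; very thin strokes such as underscores can still suffer, so check the result by eye.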

Most OCR engines are similar in performance and accuracy. One that can be trained or corrected would be best.

Darknight 2009-12-11 14:58:34

+1 A:

In general, I have had very good results with FineReader. Most products have a trial version available, so try as many as you can.

Now, program source code can be tricky:

leading whitespace: running a code pretty-printer as a post-processing step may help
underscores and punctuation: a good product can probably be trained for these
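To illustrate the post-processing idea for leading whitespace, here is a minimal sketch of a re-indenter for brace-delimited code (C, Java, PHP and the like). The `reindent` function is a hypothetical helper, not a feature of any OCR product, and it is deliberately naive: it counts braces even inside strings and comments, and it cannot help with languages such as Python where the indentation itself carries meaning. A real pretty-printer for the language in question would be more robust.

```python
def reindent(lines, indent="    "):
    """Re-indent brace-delimited code whose leading whitespace was lost by OCR."""
    depth = 0
    result = []
    for raw in lines:
        line = raw.strip()
        if not line:
            result.append("")
            continue
        starts_with_close = line.startswith("}")
        # A line that begins with "}" is printed at the enclosing level.
        if starts_with_close:
            depth = max(depth - 1, 0)
        result.append(indent * depth + line)
        # Adjust the depth for the following lines by the net brace count,
        # skipping the leading "}" that was already handled above.
        opens = line.count("{")
        closes = line.count("}") - (1 if starts_with_close else 0)
        depth = max(depth + opens - closes, 0)
    return result

# Hypothetical OCR output with all indentation stripped.
ocr_output = [
    "int main() {",
    "if (x) {",
    "y = 1;",
    "} else {",
    "y = 2;",
    "}",
    "return 0;",
    "}",
]
print("\n".join(reindent(ocr_output)))
```

This only restores layout. Missing underscores and misread punctuation still have to be fixed in the OCR step or by hand.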

PeterMmm 2009-12-11 15:07:02

+1 A:

With OCR, there are currently three options:

ABBYY FineReader and OmniPage. Both are commercial products that are roughly on par in features and OCR results. I can't say much about OmniPage, but FineReader does come with support for reading source code (for example, it has a Java language library).
The best OSS OCR engine is Tesseract. It's much harder to use, and you'll probably need to train it for your language.

I rarely do OCR, but I've found that spending the $150 on the commercial software more than makes up for the time otherwise wasted.

Aaron Digulla 2009-12-11 15:11:05

I tried Tesseract. It failed when I first downloaded it. The online readme specifies that it doesn't come with any training data, so I downloaded the English training data from the website and untarred it into the tessdata subdirectory. But it still complained: "could not find eng.unicharset". How am I messing this up?

Trevor Boyd Smith 2009-12-11 16:15:33

See what I mean? Tesseract is only free if your time costs nothing. But you can post questions in the tesseract user group. They are friendly there and your input will help to make it easier for the next person to set this beast up.

Aaron Digulla 2009-12-12 12:41:14

+1 A:

OCRopus is also a good open source option. But like Tesseract, there's a rather steep learning curve to use and integrate it effectively.

clartaq 2009-12-11 15:24:45

Try emailing the scanned image to [email protected] and you will get the OCR results back by email. It uses a high-quality OCR engine (ABBYY) on the back-end. If you like the results, check out the API as well: http://www.webservius.com/corp/docs/wisetrend.pdf

Eugene Osovetsky 2010-03-16 20:57:35

Sounds like the OCR would work well. But the code in question is proprietary... so your solution would violate every security/IP-law etc. known to man...

Trevor Boyd Smith 2010-03-17 14:05:36

I am not a lawyer, but I think there must be a way to make this work with the right agreements in place. The same company (WiseTrend) handles OCR of litigation support documents - these are often *very* sensitive, and legal firms trust WiseTrend with them. And it's certainly possible to make this work technically (SSL everywhere for transfers, not storing the data beyond a certain time limit, etc.) So if your volume is high enough to justify this, you can email the support email at the very bottom of the PDF to figure this out.

Eugene Osovetsky 2010-03-17 17:18:38
