At my work, I sometimes have to take some printed source code and manually type the source code into a text editor. Do not ask why.

Obviously, typing it up takes a long time, and there is always extra time spent debugging typing errors (oops, missed a "$" sign there).

I decided to try some OCR solutions like:

  • Microsoft Document Imaging - has built in OCR
    • Result: Missed all the leading whitespace, missed all the underscores, interpreted many of the punctuation characters incorrectly.
    • Conclusion: Slower than manually typing in code.
  • Various online web OCR apps
    • Result: Similar to or worse than Microsoft Document Imaging
    • Conclusion: Slower than manually typing in code.

I feel like source code would be very easy to OCR given the font is sans serif and monospace.

Have any of you found a good OCR solution that works well on source code?

Maybe I just need a better OCR solution (not necessarily source code specific)?

A: 

Printed text is usually easier to OCR than handwriting, but the result depends heavily on your source image. I generally find that capturing in PNG format with reduced colors (grayscale is best), plus some manual cleanup to remove image noise from scanning, works best.

Most OCR engines are similar in performance and accuracy. An engine that you can train or correct would be best.
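
As a rough illustration of the cleanup described above (grayscale, fewer colors, noise removal), here is a minimal sketch in plain Python. It assumes the scan has already been loaded as rows of (r, g, b) tuples; the function names, the luminance weights, and the threshold of 128 are illustrative choices, not something any particular OCR product requires. In practice an image library or editor would do this step.

```python
def to_grayscale(pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 grey values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


def binarize(gray, threshold=128):
    """Reduce colors further: every grey value becomes black (0) or white (255)."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray]


def remove_specks(binary):
    """Whiten isolated black pixels, i.e. those with no black 4-neighbour."""
    height = len(binary)
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(len(binary[y])):
            if binary[y][x] != 0:
                continue
            neighbours = [
                binary[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < height and 0 <= nx < len(binary[ny])
            ]
            # Only a pixel completely surrounded by white counts as noise here.
            if all(value == 255 for value in neighbours):
                cleaned[y][x] = 255
    return cleaned
```

Note that a speck filter this crude can also erase genuinely small marks such as a dot drawn as a single pixel, so for source code it should only be used on scans with high enough resolution that punctuation spans several pixels.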

Darknight
+1  A: 

In general, I have had very good results with FineReader. Most products have a trial version available, so try as many as you can.

Now, program source code can be tricky:

  • leading whitespace: running a pretty printer over the OCR output as a post-processing step may help
  • underscores and punctuation: a good product can probably be trained to recognize them
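
To illustrate the pretty-printer idea, here is a minimal sketch that restores indentation after OCR has dropped the leading whitespace. It assumes the code is brace-delimited (C, Java, Perl, PHP and similar); the function name `reindent` and the four-space indent are hypothetical choices. It does not understand braces inside strings or comments, and it cannot help with languages where indentation itself carries meaning.

```python
def reindent(text, indent="    "):
    """Re-indent brace-delimited code whose leading whitespace was lost."""
    depth = 0
    result = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            result.append("")
            continue
        # A line that starts with a closing brace belongs one level out.
        starts_with_close = line.startswith("}")
        if starts_with_close:
            depth = max(depth - 1, 0)
        result.append(indent * depth + line)
        # The remaining braces on this line set the depth for later lines.
        opens = line.count("{")
        closes = line.count("}") - (1 if starts_with_close else 0)
        depth = max(depth + opens - closes, 0)
    return "\n".join(result)


if __name__ == "__main__":
    flat = "if (x) {\nfoo();\n} else {\nbar();\n}"
    print(reindent(flat))
```

A real formatter for the language in question will do a better job; the point is only that lost indentation is usually recoverable, whereas missing underscores and misread punctuation are not.
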
PeterMmm
+1  A: 

With OCR, there are currently three options:

  • ABBYY FineReader and OmniPage. Both are commercial products that are roughly on par in features and OCR results. I can't say much about OmniPage, but FineReader does come with support for reading source code (for example, it has a Java language library).
  • The best OSS OCR engine is Tesseract. It's much harder to use, and you'll probably need to train it for your language.

I rarely do OCR, but I've found that the $150 spent on the commercial software is far outweighed by the time it saves.

Aaron Digulla
I tried Tesseract. It failed when I first downloaded it. The online readme specifies that it doesn't come with any training data. I downloaded the English training data from the website and untarred it into the tessdata subdirectory, but it still complained about "could not find eng.unicharset". How am I messing this up?
Trevor Boyd Smith
See what I mean? Tesseract is only free if your time costs nothing. But you can post questions in the tesseract user group. They are friendly there and your input will help to make it easier for the next person to set this beast up.
Aaron Digulla
+1  A: 

OCRopus is also a good open source option. But like Tesseract, there's a rather steep learning curve to use and integrate it effectively.

clartaq
A: 

Try emailing the scanned image to [email protected] and you will get the OCR results back by email. It uses a high-quality OCR engine (ABBYY) on the back-end. If you like the results, check out the API as well: http://www.webservius.com/corp/docs/wisetrend.pdf

Eugene Osovetsky
Sounds like the OCR would work well. But the code in question is proprietary... so your solution would violate every security/IP-law etc. known to man...
Trevor Boyd Smith
I am not a lawyer, but I think there must be a way to make this work with the right agreements in place. The same company (WiseTrend) handles OCR of litigation support documents - these are often *very* sensitive, and legal firms trust WiseTrend with them. And it's certainly possible to make this work technically (SSL everywhere for transfers, not storing the data beyond a certain time limit, etc.) So if your volume is high enough to justify this, you can email the support email at the very bottom of the PDF to figure this out.
Eugene Osovetsky