tags:

views:

126

answers:

2

One challenging topic in computer vision is processing document scans. Typically this involves a number of steps, like noise removal, color analysis, binarization, text block identification, OCR, and then maybe some context analysis and correction.

I'm curious if anyone understands, knows or can point me to literature on how Google identifies text blocks prior to the OCR stage. Any insights?

A: 

This is second-hand information from the digitization specialist in my library, but it seems that Google's approach is to just throw everything through the automated process, ocr anything that looks like text and to not fuss too much about cropping individual images or doing much semantic analasys to look for image captions, etc. They may be doing subtle things that aren't obvious but on the surface they are definitely gunning for quantity over quality, which is smart for them to do for their purposes, IMO.

alxp
+1  A: 

I believe Google uses the Tesseract OCR engine in conjunction with another tool called Ocropus, both of which are open-source. I don't know anything about how they work but you may be interested in checking out the code, available at the above links.

emddudley