Hi

I am extracting text from OCRed TIFF files using a library and dumping it into a database. The text I am extracting comes from forms with fields like NAME, DOB, COUNTRY, etc. Since the OCR does not know the difference between a label and its actual value, it just dumps all the text. I now have text in the DB in the following format:

Name: MyName Address: My Address

etc

The next step is to extract values like MyName and My Address from the DB. The document types may vary, so a single generic parser might not work.

What would you suggest to deal with this situation? Should I write different parsers? Could ANTLR help me? If yes, then how? Kindly guide me.

I am working in .NET.

+3  A: 

Hello. This is a common question, and the OCR industry found a generic solution for it years ago. The solution branches into two separate directions: using OCR for form processing, otherwise known as data extraction, can follow one of the following two methods.

TEXT PARSING - considered an old approach that still works in many situations. Obviously you are experienced in it and know the pros and cons, so I will be brief here. The pro is that it requires no other technology, just generic programming. The cons are that a) it requires programming, b) it is not very adaptive to variations, c) if the formatting changes over time, you may have to rewrite some spaghetti or legacy code, and d) it requires a near-perfect OCR result in order to find data successfully (i.e. a misrecognized label may result in missing data). In other words, it is great for quick and simple solutions, but not very adaptive to variations and changes. I did a lot of it back in my school and early programming days.
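To make the text-parsing approach concrete, here is a minimal sketch in Python (the same regular expression logic should port to .NET's Regex class). The label list `LABELS` and the function `parse_fields` are hypothetical names chosen for illustration, and the sketch assumes the labels are known in advance and were recognized without OCR errors.

```python
import re

# Hypothetical list of labels expected on one form type.
LABELS = ["Name", "Address", "DOB", "Country"]

def parse_fields(text, labels=LABELS):
    """Split flat OCR text into {label: value}, using known labels as delimiters."""
    # Match any known label followed by a colon, case-insensitively.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\s*:",
        re.IGNORECASE,
    )
    matches = list(pattern.finditer(text))
    canonical = {label.lower(): label for label in labels}
    fields = {}
    for i, match in enumerate(matches):
        # A value runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[canonical[match.group(1).lower()]] = text[match.end():end].strip()
    return fields

print(parse_fields("Name: MyName Address: My Address"))
```

Con d) shows up immediately with this kind of parser: if the OCR reads "Name:" as "Narne:", the label is not matched and the field is silently dropped.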

DYNAMIC DATA CAPTURE - using some special technology to dynamically locate data. Some technologies do it on the image level and feed clean data to your database. Other technologies do it on the post-OCR text level. I am most familiar with data capture on the image level, as it has several key benefits for the complex projects I have done, so I will talk more about that. The only con is that you may need to invest in a specialized software tool, but that tool provides a lot of benefit. Even a plumber has to invest in tools to do his job. The benefit of image-based data extraction is that it does not depend on perfect post-OCR text. OCR output is not always perfect, so any text-based extractor has to accommodate mistakes, something the old text-parsing approach cannot do. Also, in text parsing you can use only text, while in image parsing you have a ton of other information, such as lines (like in table columns), white gaps between texts (such as paragraph separators), pictures, logos, checkboxes, etc.

For example, I heavily use ABBYY FlexiCapture for these types of extraction (http://www.wisetrend.com/abbyy_flexicapture.shtml). That tool allows me to define what data I need to extract and how it should be extracted. For example, you would do something like this:

  1. Identify the format style, if more than one. If you have multiple formats, you can apply a different set of extraction rules per format.

  2. Locate the label "Name:" or some variation of it, using fuzzy search or rules to accommodate OCR mistakes, if any. Restrict the search to a certain area if more than one name occurs on the page.

  3. Locate the area that contains chars of certain type next to the found label Name. Those chars have to fit certain criteria to be accepted as MyName field, and all those criteria are defined through UI (or scripting if you want).

  4. OCR the content of the area with the MyName chars. Another benefit here is that you no longer use generic OCR. You can use very specific OCR settings that apply only to your MyName area, which increases the accuracy of the OCR and the data. This is most useful for specialized data, such as part numbers, codes, addresses, etc. You can use regular expressions, dictionaries, and rules, and you can be specific per field. That is not possible when full-page OCR is used.

  5. Send the clean data to DB. Before you send the data, if you want to guarantee OCR quality, most tools usually have some kind of Verification capability to visually check (requires a human) OCRed text against the image.
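Steps 2 and 3 can be approximated on plain text to show the idea. The following Python sketch is not FlexiCapture's API; it is an illustration under stated assumptions: `FIELD_RULES`, `find_label`, and `extract_field` are hypothetical names, labels are assumed to end with a colon, values are assumed to be a single token, and the fuzzy match uses the standard library's `difflib`.

```python
import difflib
import re

# Hypothetical per-field rules: a pattern the value must fully match.
FIELD_RULES = {
    "Name": re.compile(r"[A-Za-z][A-Za-z .'-]*"),
    "DOB": re.compile(r"\d{2}/\d{2}/\d{4}"),
}

def find_label(tokens, label, cutoff=0.7):
    """Return the index of the colon-terminated token most similar to the label."""
    target = label.lower() + ":"
    best_index, best_score = None, cutoff
    for i, token in enumerate(tokens):
        if not token.endswith(":"):
            continue  # rule: only tokens that look like labels are candidates
        score = difflib.SequenceMatcher(None, token.lower(), target).ratio()
        if score >= best_score:
            best_index, best_score = i, score
    return best_index

def extract_field(text, label, rules=FIELD_RULES):
    """Locate a possibly misrecognized label, then validate the token after it."""
    tokens = text.split()
    index = find_label(tokens, label)
    if index is None or index + 1 >= len(tokens):
        return None
    candidate = tokens[index + 1]
    # Accept the candidate only if it fits the criteria defined for this field.
    return candidate if rules[label].fullmatch(candidate) else None

print(extract_field("Narne: MyName D0B: 01/02/1990", "Name"))
print(extract_field("Narne: MyName D0B: 01/02/1990", "DOB"))
```

Here the misread labels "Narne:" and "D0B:" are still located, and a value that does not fit its field pattern is rejected rather than stored. A real tool would apply the same idea to image regions instead of tokens.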

In general, setting up these processes is much quicker and more liberating than code-based text parsing. There is plenty of scripting and APIs available for those who want to go past UI or need additional automation.

I scratched the surface, but hopefully that provides a start for your research and decision. If I have not addressed anything, please feel free to let me know.

Ilya Evdokimov, Data Capture Expert for 10+ years, CDIA+ Certified

My blog with more data capture stuff is here: http://wisetrend.com/ocr_and_data_capture_blog/

Ilya Evdokimov
Ilya, thanks for the answer. In the past I have used IBM FileNET, which provided the option of zonal OCR and adding metadata for each document. The thing is that I am not using FileNET now, and I have to build something in .NET. FlexiCapture seems to be a separate system, while I have to make my code a component that will be integrated as part of another system. As for image-based data extraction, I will have to look at how smart it is at locating and extracting text, and from TIFF format at that.
Volatil3
Yes, FC is a separate application, but it can run as an automated service. The setup is minimal but has everything needed: (1) establish the FlexiCapture "service" on some machine; (2) your app drops your image into an 'input' folder monitored by the service; (3) your app checks an 'output' folder for a CSV or XML file containing your field data. This is a minimal, loose integration. Contact me through the website for a video if you like. If you prefer a much closer integration, per Andrey, FlexiCapture Engine is good: same capability, with a full API and the ability to integrate and hide it entirely within your application.
Ilya Evdokimov
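The loose, folder-based integration described in the comment above could look roughly like the following Python sketch on the application side. Everything specific here is an assumption for illustration: the function names, the convention that the result file shares the image's base name with an `.xml` extension, and the flat `<fields>` XML layout. The actual output format depends on how the capture service is configured.

```python
import shutil
import time
import xml.etree.ElementTree as ET
from pathlib import Path

def submit_image(image_path, input_dir):
    """Copy an image into the folder watched by the capture service."""
    destination = Path(input_dir) / Path(image_path).name
    shutil.copy(image_path, destination)
    return destination

def wait_for_result(stem, output_dir, timeout=60.0, poll_interval=1.0):
    """Poll the output folder until <stem>.xml appears or the timeout expires."""
    result_path = Path(output_dir) / f"{stem}.xml"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if result_path.exists():
            return result_path
        time.sleep(poll_interval)
    return None

def read_fields(result_path):
    """Read an assumed flat layout: <fields><Name>...</Name>...</fields>."""
    root = ET.parse(result_path).getroot()
    return {child.tag: (child.text or "").strip() for child in root}
```

A caller would run `submit_image`, then `wait_for_result` with the image's base name, then `read_fields` on the returned path. In practice the poller should also guard against reading a file that the service is still writing.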
* To clarify, the phrase "image-based data extraction" may be a bit misleading. I meant to say that some form processing tools like FlexiCapture use the image and the location of data on the image in conjunction with the OCR text values from the image to locate data. This way you have both media to work with. Text parsing uses only text and may lose formatting benefits or relationships between different data items, like distances and white gaps. If it were purely image-based, it would be impossible to locate the keyword "Name:", for example.
Ilya Evdokimov
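As a rough illustration of combining OCR text with position, the Python sketch below assumes the OCR library can return each word with pixel coordinates. The `Word` class, the `value_right_of` function, and the gap thresholds are hypothetical and only show how white gaps can delimit a field where flat text cannot.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    left: int   # x coordinate of the left edge, in pixels
    right: int  # x coordinate of the right edge
    top: int    # y coordinate of the top edge

def value_right_of(words, label, max_gap=40, line_tolerance=5):
    """Collect words on the label's line, to its right, until a wide white gap."""
    anchor = next((w for w in words if w.text == label), None)
    if anchor is None:
        return None
    same_line = sorted(
        (w for w in words
         if abs(w.top - anchor.top) <= line_tolerance and w.left >= anchor.right),
        key=lambda w: w.left,
    )
    collected, edge = [], anchor.right
    for w in same_line:
        if w.left - edge > max_gap:
            break  # a wide gap suggests the next field or column starts here
        collected.append(w.text)
        edge = w.right
    return " ".join(collected) or None

words = [
    Word("Name:", 0, 50, 60), Word("MyName", 60, 130, 60),
    Word("Address:", 0, 80, 100), Word("My", 90, 115, 100),
    Word("Address", 120, 190, 100), Word("Country:", 300, 370, 100),
]
print(value_right_of(words, "Address:"))
```

In this made-up layout the value stops before "Country:" because of the wide gap, even though no parser rule about the next label was written.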
A: 

Hi all!

Ilya, great explanation of DataCapture technology!

Volatil, yes, ABBYY FlexiCapture is a separate system, but there is also the FlexiCapture Engine, which is pretty much the same technology in the form of an SDK: http://www.abbyy.com/flexicapture_engine/

Best regards, Andrey

Tomato