Any interesting OCR/NLP-related projects for a CS final-year project?

My background is in the commercial side of OCR, and in my experience writing anything but a simple OCR engine would take a fair amount of time. To get even reasonable results, your input files would have to contain very clean text characters, or you would need lots of marked-up training data to train the engine. This would limit your usable input to high-quality printed documents and computer-generated documents, such as a Word document exported to a TIFF image. Commercial OCR engines do a much better job of reading standard scanned invoices and letters than even Tesseract OCR, and they still make mistakes.

You could write a simple OCR engine and use NLP and language analysis to show how they can improve the OCR results. Most OCR engines already do this, but it could still be an interesting project. The commercial engines have had years of fine-tuning to improve their recognition accuracy, and they use every trick they can think of.
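
To give a rough idea of what "language analysis improving OCR results" could look like, here is a minimal sketch of dictionary-based post-correction. It is only one possible approach, not how any particular engine does it. The word list, the 0.75 similarity cutoff, and the function names are all assumptions made up for illustration; a real project would load a much larger vocabulary and probably use word frequencies or a language model as well.

```python
import difflib

# Hypothetical toy vocabulary; a real project would load a large word list.
VOCABULARY = ["invoice", "total", "amount", "payment", "due", "date", "number"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest vocabulary word, or the token unchanged if none is close enough."""
    lowered = token.lower()
    if lowered in vocabulary:
        # Already a known word: leave it exactly as the OCR engine produced it.
        return token
    # get_close_matches ranks candidates by difflib's similarity ratio.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text, vocabulary=VOCABULARY):
    """Correct each whitespace-separated token of raw OCR output independently."""
    return " ".join(correct_token(t, vocabulary) for t in text.split())


if __name__ == "__main__":
    # Typical OCR confusions: 'l' for 'i', '1' for 'l', '0' for 'o'.
    print(correct_text("lnvoice tota1 am0unt"))
```

For the project itself, you could run the OCR output through something like this and measure the word error rate before and after, which would show how much the language step actually helps.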

This article may give you some ideas on one way to write an OCR engine:

http://www.codeproject.com/KB/dotnet/simple_ocr.aspx

You may be able to contribute to the Tesseract project, but you would first need to research what has already been implemented, what has not, and whether anyone else is working on the same problem.

Thanks, what OCR-related books would you recommend for a novice?

Thura 2010-10-22 07:12:03

I am not involved in writing OCR engines. A good Google search should find some interesting books.

Andrew Cash 2010-10-26 06:54:15

Also, this may be interesting: http://www.codeproject.com/KB/cs/neural_network_ocr.aspx

Andrew Cash 2010-10-26 06:54:31
