views:

5534

answers:

7

Are there any good OCR (optical character recognition) SDKs or APIs in Java that can convert TIFF files to text files (HTML would also be good enough) with some sort of format retention? The challenge is to read a typical news magazine article and recognize that it has a header and a certain number of paragraphs and pictures.

I am OK with looking at both commercial and open source SDKs/APIs. Any help is appreciated.

+1  A: 

Looks like this does what you may want.

McWafflestix
+2  A: 

Have a look at these

Conrad
I downloaded Asprise and played with it. I was a little scared after looking at the results. I gave it a TIFF file (a news magazine article) and it could not convert it. All it says is that there is a picture, and no text is extracted from it. Any ideas?
EclipseGuru
+1  A: 

Talking about good OCR, you should definitely look into a professional SDK, like the ABBYY OCR SDK. It is not native Java as Asprise is, but in exchange it will provide the recognition results you would expect.

Tomato
+5  A: 

I researched several OCRs and here is my compilation. Hope this helps many of you.

SimpleOCR

SimpleOCR is the popular freeware OCR software with hundreds of thousands of users worldwide. SimpleOCR is also a royalty-free OCR SDK for developers to use in their custom applications. If you have a scanner and want to avoid retyping your documents, SimpleOCR is the fast, free way to do it. The SimpleOCR freeware is 100% free and not limited in any way. Anyone can use SimpleOCR for free--home users, educational institutions, even corporate users. Our own freeware OCR application provides acceptable accuracy for those who just need to convert a few pages and can't justify the cost of commercial OCR software. Developers can use the command-line and SDK versions to integrate SimpleOCR with their custom applications.

ABBYY FineReader

FineReader Professional is a highly accurate and easy-to-use OCR software that includes a host of features, including digital camera OCR, intelligent document layouts, image enhancement, barcode recognition and command line integration. FineReader 9 is our pick for OCR software because its document layout retention will save you much time in reformatting documents you convert for editing.

IRIS ReadIRIS [has server software...]

Affordable OCR software for business and home users. ReadIRIS Pro provides an extremely accurate OCR recognition rate at a low cost, but still has some of the advanced features that higher priced professional OCR software includes.

Nuance OmniPage

OmniPage is widely considered the fastest, most accurate and fully featured OCR software. OmniPage 17 Professional has a unique new feature that lets you convert any type of document to searchable PDF or Word. OmniPage does not have a downloadable demo. Nuance also does not provide free technical support after the first call. For these reasons we recommend the ABBYY and IRIS products instead.

OmniPage is an optical character recognition application available from Nuance Communications. Nuance Communications was acquired by ScanSoft, which also took over its name in October 2005. OmniPage converts images, such as scanned paper documents and PDF files, into file formats used by computer applications such as Microsoft Word, Excel, Adobe Acrobat, or HTML files. OmniPage is in competition with ExperVision (TypeReader), Readiris and ABBYY FineReader, as well as free software such as GOCR and Tesseract.

http://code.google.com/p/tesseract-ocr

[In computer software, Tesseract is a free optical character recognition engine. It was originally developed as proprietary software at Hewlett-Packard between 1985 and 1995. After ten years without any development taking place, Hewlett-Packard and UNLV released it as open source in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0.]
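Since Tesseract is a native engine rather than a Java library, one simple way to try it from Java is to call its command-line tool. Below is a minimal sketch, not an official API: the class `TesseractCli` and its methods are hypothetical names, and it assumes a reasonably recent `tesseract` binary is installed and on the PATH. The `hocr` argument asks Tesseract for HTML-like hOCR output, which carries layout information; without it, plain text is written.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tesseract): wraps the `tesseract` command-line tool.
class TesseractCli {

    // Builds the command line: tesseract <image> <outputBase> [hocr]
    // Without "hocr", Tesseract writes plain text to <outputBase>.txt.
    // With "hocr", it writes an hOCR file (the extension depends on the Tesseract version).
    static List<String> buildCommand(Path image, Path outputBase, boolean hocr) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image.toString());
        cmd.add(outputBase.toString());
        if (hocr) {
            cmd.add("hocr");
        }
        return cmd;
    }

    // Runs tesseract and returns its exit code (0 normally means success).
    // Assumption: the `tesseract` binary can be found on the PATH.
    static int run(Path image, Path outputBase, boolean hocr)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, hocr))
                .inheritIO() // show Tesseract's own messages on the console
                .start();
        return process.waitFor();
    }
}
```

For example, `TesseractCli.run(Paths.get("article.tif"), Paths.get("article"), true)` would ask Tesseract to write hOCR output for `article.tif`. How well the header, paragraphs and pictures of a magazine page are separated still depends on the engine and the scan quality.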

http://jmagick.wiki.sourceforge.net

[JMagick is an open source Java interface to ImageMagick. It is implemented as a Java Native Interface (JNI) layer over the ImageMagick API. JMagick does not attempt to make the ImageMagick API object-oriented. It is merely a thin interface layer over the ImageMagick API. JMagick currently implements only a subset of the ImageMagick APIs. Should you require unimplemented features in JMagick, please join the mailing list and make a request. JMagick is licensed under the LGPL (GNU Lesser General Public License).]

http://www.expervision.com [Speed / Quality seems to be good ???]

[The award-winning TypeReader converts scanned documents into electronic files at a speed of 8,000 pages per hour with maximum reliability. Desktop 7.0 offers added flexibility to handle color and grayscale images, with duplex scanning support to process documents in English, French, German, Italian, Portuguese, Spanish, Dutch, Danish, Swedish, Norwegian, Finnish, Polish, Hungarian and Polynesian. It employs an unparalleled recognition technology to support 2618 fonts. Users can choose to output to various formats including PDF, MS Word, Excel, Lotus 1-2-3, HTML, etc.]

http://www.edocfile.com [Not all Documents ???]

[Tiff to Text is designed to perform Optical Character Recognition (OCR) in a batch process. The program utilizes the OCR engine from Nuance (owners of OmniPage, formerly ScanSoft) that is included with Microsoft Office Document Imaging (MODI).]
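The batch idea itself is easy to reproduce in Java around whichever engine you end up choosing. The sketch below only covers the file handling: it collects the TIFF files in a folder and derives a `.txt` target name for each. The class `TiffBatch` and its methods are hypothetical names for illustration, the OCR call is deliberately left out, and none of this reflects how the Tiff to Text product works internally.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: file handling for a batch OCR run (the OCR step is not included).
class TiffBatch {

    // True for names ending in .tif or .tiff, ignoring case.
    static boolean isTiff(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".tif") || name.endsWith(".tiff");
    }

    // Lists the TIFF files directly inside dir (no recursion), sorted by path.
    static List<Path> findTiffs(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile)
                        .filter(TiffBatch::isTiff)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    // Maps e.g. scans/a.tif to scans/a.txt, where the recognized text would be stored.
    static Path textTarget(Path tiff) {
        String name = tiff.getFileName().toString();
        String base = name.substring(0, name.lastIndexOf('.'));
        return tiff.resolveSibling(base + ".txt");
    }
}
```

A batch run would then loop over `findTiffs(dir)`, pass each file to the OCR engine, and write the result to `textTarget(file)`.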

http://www.simpleocr.com/OCR%5FSoftware%5FGuide.asp

EclipseGuru
A: 

If you're willing to call an external web API, take a look at this (based on the ABBYY engine): http://www.wisetrend.com/wisetrend_ocr_cloud.shtml - sign up at http://www.webservius.com/cons/subscribe.aspx?p=wisetrend&s=wiseocr

Eugene Osovetsky
A: 

Tesseract is the one that Google is using for their Google Books project.

http://www.socialseo.com/tesseract-googles-new-ocr-engine.html

http://code.google.com/p/tesseract-ocr/

jasimmk