From your experience, what is the most accurate open-source Optical Character Recognition (OCR) library/software to read Japanese text?

I just tried nhocr; its error rate is over 2%, even on an extremely clean, high-definition document.
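
For anyone who wants to compare engines on the same footing, here is a minimal sketch of how a rate like the 2% quoted above could be measured. It assumes the rate means character error rate (edit distance between a hand-checked reference text and the OCR output, divided by the reference length); the post does not say how the figure was obtained, and the function names are purely illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    reference = "光学的文字認識"   # hand-checked ground truth
    ocr_output = "光学的丈字認識"  # hypothetical OCR result with one wrong kanji
    print(f"{character_error_rate(reference, ocr_output):.1%}")
```

Python strings are sequences of Unicode code points, so each kanji, hiragana or katakana character counts as one unit here.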

Keywords: kanji, hiragana, katakana, scan, recognize, 光学式文字読取り装置, 光学的文字認識

A: 

Haven't tried it myself, but perhaps you should take a look at tesseract.

baol
Japanese is not available, even as a separate download: http://code.google.com/p/tesseract-ocr/downloads The readme briefly mentions that Japanese has been removed and is available somewhere, but actually it is nowhere to be found :-( http://code.google.com/p/tesseract-ocr/wiki/ReadMe On the mailing list, a user reported some success training Tesseract on 60 Japanese characters, but it is clearly experimental. In conclusion, it might be possible, but in practice nobody uses Tesseract for Japanese.
Nicolas Raoul
I don't know Japanese, but the fact that they had a Japanese group seemed interesting: http://groups.google.co.jp/group/tesseract-ocr/ (though, looking at it, it might just be a Japanese version of the international one; sorry if I wasted your time)
baol
@Nicolas I've opened issue http://code.google.com/p/tesseract-ocr/issues/detail?id=291 about the missing CJK data files
SamB
Nicolas Raoul
@baol: Indeed, if you replace .co.jp with .com, you can see that the questions and answers are the same. Only the Google interface is translated into Japanese. There doesn't seem to be any Japanese Tesseract community.
Nicolas Raoul
A: 

I have had some R&D experience with ABBYY's solution, FineReader Engine. It was version 8.1 at the time, and I am not up to date with their newer releases, but back then it was simply the best I could find for our handheld scanner product. I highly recommend it.

BTW, you can get a free version of the ABBYY OCR package for end users when purchasing a XEROX PE220 printer, which comes bundled with it. That printer was on my desk for several years. There must be other printers that come bundled with it. Xerox was betting on their OCR as the best as well.

Etamar L.
FineReader is NOT open-source. And the version you were using did NOT support Japanese: http://www.abbyy.com/Default.aspx?DN=b6d671c1-6da6-4bec-8c06-0ad362f6a7e9
Nicolas Raoul
Sorry, didn't see the open-source request. It is not open-source. The version I was using had CJK support (Chinese, Japanese and Korean), which is an add-on to the engine. We were using it to demonstrate our technology to South-eastern buyers. See: http://www.ocr.gr/downloads/Engine%208.1%20What's%20New.pdf (copy the URL because SO breaks it)
Etamar L.
+2  A: 

Based on the lack of answers it sounds like nhocr IS the most accurate open-source OCR for Japanese.

Peter
A: 

Please try WeOCR. Both a server version and a downloadable version are available.

kmugitani
If I understand correctly, WeOCR is just a Web front-end for other OCR engines. In particular, it uses nhocr for Japanese. So I guess it is not more accurate than nhocr, right?
Nicolas Raoul
See http://weocr.ocrgrid.org/#todo (one of the TODO items is "Develop an OCR for Japanese", and it links to nhocr)
Nicolas Raoul
Yes, that is correct. Just a couple of months ago, I tried their online server version, but it was far from accurate. Japanese cellphones, especially Sharp cellphones, have excellent OCR capability, but I did not find any other free OCR software. Of course, Sharp does not sell its OCR software at this point.
kmugitani