How to extract English text from a PDF where English and Japanese are mixed

Hello. I have a set of PDF reports that all contain similar information. Among other things, these reports store a few fields that are important to me and that I need to find and extract. For example:

Name: John Smith

Date of Visit: 01.02.10

For English, French, German and a few other languages I simply parse the PDF and extract all "(Bla-bla)Tj" occurrences. After that extraction it's not hard to find the needed fields with regular expressions.
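To make the approach above concrete, here is a minimal sketch of it in Python. It assumes the page content stream has already been decompressed into a bytes object, and that the strings contain no nested unescaped parentheses. The names `extract_tj_literals` and `find_field` are made up for this illustration and are not part of any library.

```python
import re
from typing import List, Optional

# A literal string operand followed by the Tj operator. Backslash escapes such
# as \( and \) are allowed inside the string; nested unescaped parentheses are not.
TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)


def extract_tj_literals(content: bytes) -> List[str]:
    """Return the text of every (...)Tj operand in a decompressed content stream."""
    texts = []
    for match in TJ_LITERAL.finditer(content):
        raw = match.group(1)
        # Undo the simple escapes \( \) and \\ ; octal escapes are left untouched.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        # Assumption: a single-byte encoding, which is only safe for Latin text.
        texts.append(raw.decode("latin-1"))
    return texts


def find_field(texts: List[str], label: str) -> Optional[str]:
    """Return the value after 'label:' in the first matching string, else None."""
    pattern = re.compile(r"^\s*" + re.escape(label) + r"\s*:\s*(.+?)\s*$")
    for text in texts:
        match = pattern.match(text)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    sample = b"BT /F1 12 Tf (Name: John Smith)Tj ET\nBT (Date of Visit: 01.02.10)Tj ET"
    strings = extract_tj_literals(sample)
    print(find_field(strings, "Name"))
    print(find_field(strings, "Date of Visit"))
```

This only covers the plain `(...)Tj` form described above. Strings shown with other operators, or split across several operands, would need extra handling.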

When I open a Japanese-language report in a PDF viewer I see "Some Japanese characters: John Smith". But the internal PDF representation is different for Japanese. The text data is stored in occurrences like "<43a343d7438343834356>Tj". It should be Unicode (UTF-16BE is mentioned in the specification).

I've tried to figure out where the fields I need are located. The main problem is that I can't find the string "John" in the PDF representation. I can't even find the character "J", which is "004A" in UTF-16BE or "4A00" in UTF-16LE.
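The check described above can be sketched in Python as follows. It collects the `<...>Tj` hex operands, splits them into two-byte codes, and tests whether a given string appears in them as UTF-16BE. The function names are hypothetical, and the content stream is again assumed to be decompressed. One possible explanation for the failed search, which nothing in this post confirms, is that the two-byte codes are character codes defined by the font's own encoding rather than Unicode. In that case the Unicode values would have to come from the font's ToUnicode CMap, and a direct search like this one would find nothing.

```python
import re
from typing import List

# A hexadecimal string operand followed by the Tj operator.
HEX_TJ = re.compile(rb"<([0-9A-Fa-f\s]*)>\s*Tj")


def hex_tj_operands(content: bytes) -> List[bytes]:
    """Return the raw bytes of every <...>Tj operand in a content stream."""
    operands = []
    for match in HEX_TJ.finditer(content):
        digits = re.sub(rb"\s+", b"", match.group(1))
        if len(digits) % 2:
            digits += b"0"  # PDF treats a missing final hex digit as 0
        operands.append(bytes.fromhex(digits.decode("ascii")))
    return operands


def two_byte_codes(data: bytes) -> List[int]:
    """Split operand bytes into big-endian two-byte codes."""
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data) - 1, 2)]


def contains_utf16be(data: bytes, text: str) -> bool:
    """True if 'text', encoded as UTF-16BE, occurs at an even byte offset."""
    needle = text.encode("utf-16-be")
    return any(data[i:i + len(needle)] == needle for i in range(0, len(data), 2))


if __name__ == "__main__":
    sample = b"BT /F2 10 Tf <43a343d7438343834356>Tj ET"
    for operand in hex_tj_operands(sample):
        print([format(code, "04x") for code in two_byte_codes(operand)])
        print(contains_utf16be(operand, "J"))
```

Printing the codes this way makes it easy to see whether any of them fall in the Basic Latin range (0x0020 to 0x007E), which is what a UTF-16BE encoding of "John" would require.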

If anyone could tell me how I should look for English words in a Japanese PDF, I would be very grateful!
