Hello. I have a set of PDF reports that contain similar information. Among other things, these reports hold a few fields that are important to me and that I need to find and extract. For example:
Name: John Smith
Date of Visit: 01.02.10
For English, French, German and a few other languages I simply parse the PDF and extract all "(Bla-bla)Tj" occurrences. After that extraction it is not hard to find the needed fields with regular expressions.
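To make the approach concrete, here is a minimal sketch of what I do for the Latin-script reports. It assumes the content stream has already been decompressed into bytes; the names `TJ_LITERAL` and `extract_tj_strings` and the sample stream are made up for illustration, and octal escapes inside strings are not handled.

```python
import re

# A PDF literal string immediately followed by the Tj operator.
# Backslash escapes such as \( and \) are allowed inside the string.
TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_tj_strings(content: bytes) -> list[str]:
    """Return the text of every (...)Tj occurrence in an uncompressed stream."""
    out = []
    for m in TJ_LITERAL.finditer(content):
        raw = m.group(1)
        # Undo the simple escapes \( \) and \\ ; octal escapes are left as-is.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        out.append(raw.decode("latin-1"))
    return out


# Hypothetical content stream built from the example fields above.
sample = b"BT (Name: John Smith)Tj ET BT (Date of Visit: 01.02.10)Tj ET"
text = "\n".join(extract_tj_strings(sample))

# Once the strings are extracted, the fields are easy to pick out.
name = re.search(r"Name:\s*(.+)", text).group(1)
visit = re.search(r"Date of Visit:\s*(\d{2}\.\d{2}\.\d{2})", text).group(1)
print(name, visit)
```

This works as long as the text is stored as literal strings in parentheses, which is the case for the English, French and German reports.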
When I open a Japanese-language report in a PDF viewer I see "Some Hieroglyphs: John Smith". But the internal PDF representation is different for Japanese. The text data is stored in something like "<43a343d7438343834356>Tj" occurrences. It should be Unicode (UTF-16BE is mentioned in the specification).
I've tried to figure out where the fields I need are located. The main problem is that I can't find the string "John" in the PDF representation. I can't even find the character "J", which would be "004A" or "4A00" in UTF-16, depending on the byte order.
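The check I ran can be sketched roughly as follows. It decodes the hex strings in front of Tj and looks for the UTF-16 code unit of "J" in both byte orders. The helper names `hex_tj_strings` and `contains_code_unit` are made up for illustration, and the sample is just the hex string quoted above, not a full report. It only reproduces the observation; it does not explain how the bytes map to characters.

```python
import re

# A PDF hex string immediately followed by the Tj operator.
HEX_TJ = re.compile(rb"<([0-9A-Fa-f\s]*)>\s*Tj")


def hex_tj_strings(content: bytes) -> list[bytes]:
    """Return the raw bytes of every <...>Tj occurrence."""
    out = []
    for m in HEX_TJ.finditer(content):
        digits = re.sub(rb"\s+", b"", m.group(1)).decode("ascii")
        if len(digits) % 2:
            digits += "0"  # an odd-length hex string is padded with a trailing 0
        out.append(bytes.fromhex(digits))
    return out


def contains_code_unit(data: bytes, unit: bytes) -> bool:
    """Check for a 2-byte unit at 2-byte-aligned offsets only."""
    return any(data[i:i + 2] == unit for i in range(0, len(data) - 1, 2))


sample = b"BT <43a343d7438343834356>Tj ET"
decoded = hex_tj_strings(sample)[0]

j_be = "J".encode("utf-16-be")  # bytes 00 4A
j_le = "J".encode("utf-16-le")  # bytes 4A 00
found = contains_code_unit(decoded, j_be) or contains_code_unit(decoded, j_le)
print(decoded.hex(), found)
```

For the string above, neither byte order turns up a "J", which is exactly the problem.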
If anyone could tell me how I should look for English words in a Japanese PDF, I would be very grateful!