ansaurus

Question

Programmatically rip text from a PDF File (by hand) - Missing some text

Answer 1

+1 A:

Extracting content out of a PDF file can get a little complex. I do this as my daily job, and I think I can point you in the right direction.

What you are trying to do (extracting the strings between parentheses) works only with simple single-byte encodings such as WinAnsi or MacRoman, used with Type1 or TrueType fonts. Unfortunately, these single-byte encodings cannot represent full Unicode content. Your sample document uses Type0 (aka CID) fonts, where each character is identified by a glyph index. These are often non-standard, ad hoc encodings, where the designer of the font may assign a glyph index to any character in an arbitrary way. Sometimes the producer of the PDF intentionally mangles the encoding.

The way it works is that, starting with the catalog, you parse the page tree. Once you identify a page object, you parse its contents as well as its resources. The resources dictionary contains a list of the fonts used by the page. Each CID font object should contain a ToUnicode stream, a cmap (character map) that establishes the relationship between the glyph indexes and their Unicode values. For example:

<01> <0044>
<02> <0061>
<03> <0074>
<04> <0020>

This means the glyph 01 is Unicode U+0044, the glyph 02 is U+0061, and so on. You have to use this lookup table to translate glyph IDs back into Unicode.
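As a rough illustration of that lookup, here is a minimal Python sketch that builds the table from pairs like the ones in the example above and translates a list of glyph IDs. The function names `parse_bfchar_pairs` and `decode_codes` are made up for this example, and the sketch assumes the cmap text has already been decompressed and contains only simple `<src> <dst>` pairs; real ToUnicode cmaps can also contain range entries, which this does not handle.

```python
import re

# Matches one "<src> <dst>" pair, as in the example above.
PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar_pairs(cmap_text):
    """Build {glyph ID: Unicode string} from '<src> <dst>' pairs.

    Hypothetical helper: assumes simple pairs only, with an even
    number of hex digits in each destination value.
    """
    table = {}
    for src, dst in PAIR.findall(cmap_text):
        # The destination is UTF-16BE and may hold more than one code unit.
        table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table

def decode_codes(codes, table):
    """Translate glyph IDs to text; unmapped IDs become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in codes)

sample = """
<01> <0044>
<02> <0061>
<03> <0074>
<04> <0020>
"""

table = parse_bfchar_pairs(sample)
text = decode_codes([1, 2, 3, 2], table)
print(text)  # Data
```

Marking unmapped glyph IDs with U+FFFD instead of dropping them makes it easier to see where a lookup table is incomplete.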

The page content itself has two important operators for you. Tf is the font selector, which is important because it identifies the font object. Each font has its own ToUnicode cmap, so you must use a different lookup table depending on the currently selected font.

The other interesting operator is text show (typically TJ or Tj). With Type0 (CID) fonts, Tj doesn't contain human-readable text, but instead a sequence of glyph IDs that you are supposed to map into Unicode with the help of the above-mentioned cmap. Often Tj uses a hex string, such as <000100a50056> Tj, instead of the more typical (Hello, World) Tj that you are familiar with. Either way, the string is not human-readable and cannot be extracted without fully parsing the page, including all of its font resources, especially the ToUnicode cmap. The cmap is itself a PostScript object, but you only care about the hex portions.
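To make the interplay of Tf and Tj concrete, here is a small Python sketch that walks an already-decompressed content stream, tracks the current font, and decodes hex-string Tj operands with that font's lookup table. It is only a sketch under several assumptions: the function name `extract_text`, the font names F1 and F2, and the two tables are invented for the example; glyph IDs are assumed to be a fixed 2 bytes wide; and TJ arrays, literal (parenthesized) strings, and everything else a real content-stream parser must handle are ignored.

```python
import re

# Two tokens of interest: "/F1 12 Tf" (font selection)
# and "<hex> Tj" (text show with a hex string operand).
TOKEN = re.compile(
    r"/(?P<font>\S+)\s+[\d.]+\s+Tf\b"
    r"|<(?P<hex>[0-9A-Fa-f\s]*)>\s*Tj\b"
)

def extract_text(content, cmaps, code_bytes=2):
    """Decode hex-string Tj operands using the current font's table.

    Hypothetical helper: `cmaps` maps a font resource name to a
    {glyph ID: Unicode string} table; `code_bytes` is an assumed
    fixed glyph ID width.
    """
    out = []
    table = {}
    for m in TOKEN.finditer(content):
        if m.group("font") is not None:
            # Tf: switch to the lookup table of the newly selected font.
            table = cmaps.get(m.group("font"), {})
            continue
        digits = "".join(m.group("hex").split())
        if len(digits) % 2:
            digits += "0"  # an odd trailing hex digit is padded with 0
        raw = bytes.fromhex(digits)
        for i in range(0, len(raw), code_bytes):
            code = int.from_bytes(raw[i:i + code_bytes], "big")
            out.append(table.get(code, "\ufffd"))
    return "".join(out)

# The same glyph ID maps to different characters in different fonts.
cmaps = {
    "F1": {1: "D", 2: "a", 3: "t"},
    "F2": {1: "a"},
}
content = "BT /F1 12 Tf <000100020003> Tj /F2 9 Tf <0001> Tj ET"
result = extract_text(content, cmaps)
print(result)  # Data
```

Note that glyph ID 0001 decodes to "D" under F1 and to "a" under F2, which is why the font selected by Tf has to be tracked while reading the text-show operators.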

Of course, I have oversimplified the process, because there are dozens of different standard encodings and custom encodings (differential or ToUnicode), and we haven't even touched Arabic, Hindi, vertical Japanese fonts, Type3 fonts, etc. Sometimes the text cannot be extracted at all, because it's intentionally mangled.

Tamas Demjen 2010-10-29 01:17:16

Thank you. That explains a lot of the parts that were confusing me.

Brian 2010-10-29 04:06:44
