Sample PDF file that I cannot parse (2.6MB Zip File)

Note: I am not interested in using a parsing library. This is for my own entertainment.

I've been experimenting with ripping text out of PDF files for a search gizmo, but I'm unable to extract text from some PDF files.

Note that this is a much easier problem than straight-up parsing: I don't care if I inadvertently include some garbage in my output, nor do I really care whether the formatting of the document stays intact. I don't even care whether the words come out in order.

As a first step, I created a very simple PDF parser using the strategy found in this project. Basically, all it does is search PDF files for zlib streams, inflate them, and pull out any text it finds in parentheses. This fails to parse data stuck inside << >> blocks, but my understanding is that those hold hex-encoded blobs of data, which don't seem to be in the test file I am failing to parse...or at least I don't see them.
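For concreteness, the whole strategy fits in a few lines of Python. This is a minimal sketch, assuming plain FlateDecode (zlib) streams with no predictor, and ignoring escape sequences and nested parentheses inside strings:

import re
import zlib

def rip_text(path):
    data = open(path, "rb").read()
    found = []
    # zlib streams usually begin 0x78 0x01 / 0x78 0x9c / 0x78 0xda.
    for m in re.finditer(rb"\x78[\x01\x9c\xda]", data):
        try:
            inflated = zlib.decompressobj().decompress(data[m.start():])
        except zlib.error:
            continue  # false positive: not actually a zlib stream
        # PDF literal strings are delimited by parentheses.
        found += re.findall(rb"\(([^()]*)\)", inflated)
    return b" ".join(found)

print(rip_text("sample.pdf"))

Scanning for raw zlib headers instead of the stream keyword keeps the parser ignorant of the surrounding PDF syntax, which is exactly the point.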

Similarly, iText.Net also fails, though PDFMiner and PDFBox succeed. However, the latter two projects have too many layers of indirection to be easily examined; I had trouble figuring out exactly what they were doing, in part because I don't use either language (Python and Java, respectively) enough to be comfortable debugging it.

My goal is to create a text ripper that grabs text out of a PDF file with as little understanding of the PDF format itself as possible (e.g. my test parser grabs text out of parentheses, but has no idea which portion of the PDF it is examining is the header).

A:

Extracting content from a PDF file can get a little complex. I do this as my daily job, and I think I can point you in the right direction.

What you are trying to do (extracting strings between parentheses) only works with simple WinAnsi or MacRoman encodings, used with Type1 or TrueType fonts. Unfortunately, these single-byte encodings do not support proper Unicode content. Your sample document uses Type0 (aka CID) fonts, where each character is identified by a glyph index. These are non-standard, ad-hoc encodings, where the designer of the font may assign a glyph index to any character in an arbitrary way. Sometimes the producer of the PDF intentionally mangles the encoding.

The way it works is that, starting with the catalog, you parse the page tree. Once you identify a page object, you parse its contents as well as its resources. The resources dictionary contains a list of the fonts used by the page. Each CID font object contains a ToUnicode stream, which is a cmap (character map) establishing the relationship between the glyph indexes and their Unicode values. For example:

<01> <0044>
<02> <0061>
<03> <0074>
<04> <0020>

This means glyph 01 is Unicode U+0044 ('D'), glyph 02 is U+0061 ('a'), and so on; taken together, these four entries spell "Dat ". You have to use this lookup table to translate glyph IDs back into Unicode.
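A minimal sketch of that lookup step, assuming the ToUnicode stream has already been inflated and uses only the simple one-to-one beginbfchar form shown above (real cmaps also use beginbfrange sections, multi-character targets, and surrogate pairs; the glyph IDs here are one byte to match the example, though CID fonts more typically use two):

import re

def parse_tounicode(cmap_bytes):
    table = {}
    # Each beginbfchar block lists <glyph> <unicode> pairs in hex.
    for block in re.findall(rb"beginbfchar(.*?)endbfchar", cmap_bytes, re.S):
        for src, dst in re.findall(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            table[int(src, 16)] = chr(int(dst, 16))
    return table

cmap = b"""4 beginbfchar
<01> <0044>
<02> <0061>
<03> <0074>
<04> <0020>
endbfchar"""

table = parse_tounicode(cmap)
# Decode the one-byte glyph IDs <01020304> through the table -> "Dat "
print("".join(table.get(g, "?") for g in bytes.fromhex("01020304")))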

The page content itself has two important operators for you. Tf is the font selector; it matters because it identifies the font object, and since each font has its own ToUnicode cmap, the current font determines which lookup table you must use.

The other interesting operator is the text show (typically Tj or TJ). With Type0 (CID) fonts, the Tj operand doesn't contain human-readable text, but rather a sequence of glyph IDs that you are supposed to map into Unicode with the help of the above-mentioned cmap. Often the Tj uses a hex string, such as <000100a50056> Tj, instead of the more familiar (Hello, World) Tj. Either way, the string is not human readable, and cannot be extracted without fully parsing the page, including all of its font resources, especially the ToUnicode cmap, which is itself a PostScript object, though you only care about the hex portions.
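A rough sketch of how Tf and Tj interact, assuming the content stream is already inflated, that every glyph code is two bytes wide, and that cmaps maps resource names such as F1 to tables built by parse_tounicode above. The particular glyph-to-letter assignments here are made up for illustration, and TJ arrays, literal strings, and escapes are all ignored:

import re

def rip_page(content, cmaps):
    out = []
    table = {}
    # Match either a font selector "/F1 12 Tf" or a hex show "<...> Tj".
    for m in re.finditer(rb"/(\w+)\s+[\d.]+\s+Tf|<([0-9A-Fa-f]+)>\s*Tj", content):
        font, hexstr = m.group(1), m.group(2)
        if font:
            table = cmaps.get(font.decode(), {})  # switch lookup table
        else:
            codes = bytes.fromhex(hexstr.decode())
            # Consume two-byte glyph IDs and map each through the cmap.
            for i in range(0, len(codes) - 1, 2):
                gid = int.from_bytes(codes[i:i+2], "big")
                out.append(table.get(gid, "\ufffd"))
    return "".join(out)

cmaps = {"F1": {0x0001: "H", 0x00A5: "i", 0x0056: "!"}}  # hypothetical mapping
print(rip_page(rb"/F1 12 Tf <000100a50056> Tj", cmaps))  # -> Hi!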

Of course I have oversimplified the process, because there are dozens of different standard encodings, custom encodings (differential or ToUnicode), and we haven't even touched Arabic, Hindi, vertical Japanese fonts, Type3 fonts, etc. Sometimes the text cannot be extracted at all, because it's intentionally mangled.

Tamas Demjen
Thank you. That explains a lot of the parts that were confusing me.
Brian