Sample PDF file that I cannot parse (2.6MB Zip File)
Note: I am not interested in using a parsing library. This is for my own entertainment.
I've been experimenting with ripping text out of PDF files for a search gizmo, but I am unable to extract text from some PDF files.
Note that this is a much easier problem than straight up parsing; I don't care if I inadvertently include some garbage in my output, nor do I really care if the formatting of the document is intact. I don't even care if the words come out in order.
As a first step, I created a very simple PDF parser using the strategy found in this project. Basically, all it does is search PDF files for zlib streams, inflate them, and pull out any text it finds in parentheses. This fails to parse data stuck inside of << >> blocks, but my understanding is that this is for hex-encoded blobs of data, which don't seem to be in the test file that I am failing to parse... or at least I don't see them.
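For illustration, here is a minimal Python sketch of the strategy described above. It is not the code from the linked project or from my parser; the names `inflate_streams` and `rip_text` are hypothetical, and it assumes every interesting stream is a plain zlib (FlateDecode) stream delimited by the `stream` and `endstream` keywords, with text stored as literal strings in parentheses.

```python
import re
import zlib

# Anything between the "stream" and "endstream" keywords is treated as a
# candidate compressed blob; no attempt is made to read the object dictionary.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# Literal strings: text in parentheses, allowing backslash escapes such as \)
# but (as a simplification) not nested, unescaped parentheses.
PAREN_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield the inflated contents of every stream that zlib can decompress."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not a zlib stream (image data, another filter, ...): skip it.
            continue


def rip_text(pdf_bytes):
    """Return every parenthesized string found in the inflated streams."""
    pieces = []
    for content in inflate_streams(pdf_bytes):
        for match in PAREN_RE.finditer(content):
            # latin-1 maps every byte to a character, so decoding never fails.
            pieces.append(match.group(1).decode("latin-1"))
    return pieces
```

Calling `rip_text(open(path, "rb").read())` on a file returns a flat list of string fragments, garbage included. Anything not stored as a parenthesized literal inside a zlib stream is silently missed, which matches the failure I am describing.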
Similarly, iText.Net also fails, though PDFMiner and PDFBox succeed. However, the latter two projects have too many layers of indirection to be easily examined; I had trouble figuring out exactly what they were doing, in part because I don't really use either language enough to be accustomed to debugging it in any significant manner.
My goal is to create a text ripper that grabs text out of a PDF file with as little understanding of the PDF format itself as possible (e.g. my test parser grabs text out of parentheses, but has no understanding of which portion of the PDF it is examining, such as whether it is in the header).