My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.

I have tried a few different things, but I did not get very far with any of them:

  • Convert PDF to text. It does not work for me as I lose images and the structure of the document.
  • Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation-wise, but I haven't been able to successfully parse the HTML.
  • Convert PDF to XML. Same problem as above; a sketch of the direction I was attempting follows this list.
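
For what it's worth, here is a minimal sketch of parsing pdftohtml's -xml output (its fontspec and text elements), assuming Python and a simple font-size heuristic. The HEADING_RATIO threshold and the output.xml filename are placeholders I made up, and the heuristic is only a guess at structure, not anything the format guarantees:

    import xml.etree.ElementTree as ET
    from collections import Counter

    # Sketch only: assumes "output.xml" came from `pdftohtml -xml input.pdf output`
    # and that headings use a larger font than body text. HEADING_RATIO is an
    # arbitrary threshold chosen for illustration.
    HEADING_RATIO = 1.2

    root = ET.parse("output.xml").getroot()

    # pdftohtml emits <fontspec id=... size=...> entries describing each font.
    font_sizes = {spec.get("id"): float(spec.get("size"))
                  for spec in root.iter("fontspec")}

    # Collect every <text> element with its resolved font size, in reading order.
    lines = []
    for el in root.iter("text"):
        content = "".join(el.itertext()).strip()
        if content:
            lines.append((font_sizes.get(el.get("font"), 0.0), content))
    if not lines:
        raise SystemExit("no text elements found in output.xml")

    # Treat the most common font size as the body size; larger lines are headings.
    body_size = Counter(size for size, _ in lines).most_common(1)[0][0]

    paragraph = []
    for size, content in lines:
        if size > body_size * HEADING_RATIO:
            if paragraph:
                print("PARA:", " ".join(paragraph))
                paragraph = []
            print("HEAD:", content)
        else:
            paragraph.append(content)
    if paragraph:
        print("PARA:", " ".join(paragraph))

The font-size heuristic is crude (it will misclassify bold body text or headings set at the body size), but it roughly recovers the headings-and-paragraphs split I'm after.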

Does anyone have suggestions on how to tackle this problem?

A: 

If you are interested in using third-party SDKs, this works OK.

Kb
A: 

I haven't tried it, but converting the PDF to PostScript may make the result easier to parse.

I've found the Xpdf tools invaluable for PDF processing.
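
Both conversions are single commands; a sketch of driving them from Python, assuming the Xpdf (or Poppler) command-line tools are on the PATH, with placeholder filenames:

    import subprocess

    # pdftops and pdftotext ship with Xpdf (and Poppler); "input.pdf" is a placeholder.
    subprocess.run(["pdftops", "input.pdf", "output.ps"], check=True)              # PDF -> PostScript
    subprocess.run(["pdftotext", "-layout", "input.pdf", "output.txt"], check=True)  # layout-preserving text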

Jarod Elliott
+1  A: 

Unless it is Marked Content, a PDF does not have a structure... You have to 'guess' it, which is what the various tools are doing. There is a good blog post explaining the issues at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text
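
A quick way to check whether a given PDF actually is Marked Content before resorting to guessing: tagged PDFs declare it in the document catalog. A minimal sketch, assuming the pypdf library (my choice; any library exposing the catalog works) and a placeholder filename:

    from pypdf import PdfReader

    # Sketch assuming pypdf; "input.pdf" is a placeholder filename.
    reader = PdfReader("input.pdf")
    catalog = reader.trailer["/Root"]

    # Tagged (Marked Content) PDFs declare /MarkInfo << /Marked true >> and a
    # /StructTreeRoot structure tree in the document catalog.
    print("structure tree present:", "/StructTreeRoot" in catalog)
    if "/MarkInfo" in catalog:
        print("marked:", catalog["/MarkInfo"].get("/Marked", False))
    else:
        print("marked: False")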

A: 

The problem is that the creator of the PDF document did not want the form contents to be savable. That is the most inane and churlish argument I have heard for protecting an 'artistic' creation. What purpose does it serve other than increasing the user's agony?

nev