How to extract data from a PDF file while keeping track of its structure?

+2 Q:

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.

I have tried a few different things, but I did not get very far with any of them:

Convert PDF to text. It does not work for me as I lose images and the structure of the document.
Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation-wise, but I haven't been able to successfully parse the HTML.
Convert PDF to XML. Same as above.

Does anyone have any suggestions on how to tackle this problem?

If you are interested in using third-party SDKs, this works OK.

Kb 2009-06-02 04:56:20

I haven't tried it, but converting the PDF to PostScript may make the result easier to parse.

I've found the Xpdf programs invaluable for dealing with PDF processing.
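As a rough sketch of how the Xpdf tools could cover the "text and images" part of the question: pdftotext with -layout writes the text while trying to keep the page layout, and pdfimages dumps the embedded images. The helper names (pdftotext_cmd, pdfimages_cmd, extract) and the output file names below are made up for illustration, and the sketch assumes both programs are installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf, out_txt):
    """Build the pdftotext command line (hypothetical helper)."""
    # -layout asks pdftotext to keep the physical page layout as far as it can.
    return ["pdftotext", "-layout", str(pdf), str(out_txt)]


def pdfimages_cmd(pdf, image_root):
    """Build the pdfimages command line (hypothetical helper)."""
    # -j writes JPEG-encoded images as .jpg files; image_root is the
    # filename prefix that pdfimages uses for every extracted image.
    return ["pdfimages", "-j", str(pdf), str(image_root)]


def extract(pdf, out_dir):
    """Run both tools, writing text.txt and img-* files into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = (
        pdftotext_cmd(pdf, out_dir / "text.txt"),
        pdfimages_cmd(pdf, out_dir / "img"),
    )
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
        # check=True raises CalledProcessError if the tool exits non-zero.
        subprocess.run(cmd, check=True)
```

Calling extract("input.pdf", "out") would then leave the text and the images side by side in one directory. This does not recover headings or paragraphs; it only separates text from images.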

Jarod Elliott 2009-06-02 05:14:21

+1 A:

Unless it uses Marked Content, a PDF does not have a structure. You have to 'guess' it, which is what the various tools are doing. There is a good blog post explaining the issues at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text
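To make the 'guessing' concrete, here is a minimal sketch of one common heuristic: treat the font size that carries the most characters as body text, and call anything larger a heading. It assumes the XML produced by pdftohtml -xml looks roughly like SAMPLE_XML below (fontspec, text and image elements inside page, listed in reading order); the sample itself is invented, and real files may need more care (columns, footers, bold-only headings).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample, shaped like the assumed output of "pdftohtml -xml".
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<fontspec id="0" size="22" family="Times" color="#000000"/>
<fontspec id="1" size="12" family="Times" color="#000000"/>
<text top="100" left="80" width="300" height="24" font="0"><b>Introduction</b></text>
<text top="140" left="80" width="500" height="14" font="1">PDF stores glyph positions,</text>
<text top="156" left="80" width="500" height="14" font="1">not paragraphs or headings.</text>
<image top="200" left="80" width="200" height="100" src="doc-1_1.png"/>
<text top="320" left="80" width="500" height="14" font="1">A second block of body text.</text>
</page>
</pdf2xml>"""


def classify(xml_text):
    """Return a list of (kind, content) tuples in document order."""
    root = ET.fromstring(xml_text)
    sizes = {f.get("id"): int(f.get("size")) for f in root.iter("fontspec")}

    # The size that carries the most characters is assumed to be body text.
    chars = Counter()
    for t in root.iter("text"):
        chars[sizes[t.get("font")]] += len("".join(t.itertext()))
    body_size = chars.most_common(1)[0][0]

    items = []
    for page in root.iter("page"):
        prev_top = None
        for el in page:
            if el.tag == "image":
                items.append(("image", el.get("src")))
                prev_top = None  # an image always ends the current paragraph
            elif el.tag == "text":
                # itertext() also picks up text nested in <b>, <i> or <a>.
                content = "".join(el.itertext()).strip()
                if not content:
                    continue
                top, height = int(el.get("top")), int(el.get("height"))
                is_heading = sizes[el.get("font")] > body_size
                kind = "heading" if is_heading else "paragraph"
                # Lines that sit close together are merged into one paragraph.
                close = prev_top is not None and top - prev_top <= 1.5 * height
                if kind == "paragraph" and close and items and items[-1][0] == "paragraph":
                    items[-1] = ("paragraph", items[-1][1] + " " + content)
                else:
                    items.append((kind, content))
                prev_top = top
    return items


for kind, content in classify(SAMPLE_XML):
    print(f"{kind}: {content}")
```

The 1.5 line-height threshold for merging lines is an arbitrary choice for this sketch; it would need tuning per document.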

2009-06-02 07:11:14

The problem is that the creator of the PDF document did not want the PDF form contents to be savable. That is the most inane and churlish argument I have heard for protecting the "artistic" creation called SHIT. What purpose does it serve, other than increasing the agony of the user?

nev 2010-10-07 16:14:08
