My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.
I have tried a few different things, but I did not get very far with any of them:
- Convert PDF to text. It does not work for me as I lose images and the structure of the document.
- Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation-wise, but I haven't been able to parse the resulting HTML successfully.
- Convert PDF to XML. Same as above.
Does anyone have any suggestions on how to tackle this problem?
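For reference, here is a rough sketch of the kind of parsing I have in mind for the XML route. It assumes the output of `pdftohtml -xml` contains `<fontspec id=... size=...>`, `<text font=...>` and `<image src=...>` elements; those element and attribute names are my assumption and should be checked against the real file. The sample XML, the `classify` function and the "larger than body font means heading" rule are all hypothetical illustrations, not a tested solution.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample that imitates the assumed pdftohtml -xml layout.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<fontspec id="0" size="22" family="Times" color="#000000"/>
<fontspec id="1" size="12" family="Times" color="#000000"/>
<text top="100" left="80" width="300" height="24" font="0"><b>Introduction</b></text>
<text top="140" left="80" width="700" height="14" font="1">First line of a paragraph.</text>
<text top="156" left="80" width="700" height="14" font="1">Second line of it.</text>
<image top="200" left="80" width="400" height="300" src="doc-1_1.png"/>
</page>
</pdf2xml>"""


def classify(xml_string):
    """Return (blocks, images): blocks is a list of (kind, text) tuples."""
    root = ET.fromstring(xml_string)

    # Map each font id to its size.
    sizes = {f.get("id"): float(f.get("size")) for f in root.iter("fontspec")}

    # Collect (font size, text) per line; itertext() flattens nested <b>/<i>.
    lines = []
    for t in root.iter("text"):
        content = "".join(t.itertext()).strip()
        if content:
            lines.append((sizes.get(t.get("font"), 0.0), content))

    # Treat the size that covers the most characters as the body font.
    weight = Counter()
    for size, content in lines:
        weight[size] += len(content)
    body_size = weight.most_common(1)[0][0] if weight else 0.0

    blocks = []
    for size, content in lines:
        kind = "heading" if size > body_size else "paragraph"
        # Crude merge: consecutive body-size lines become one paragraph.
        if kind == "paragraph" and blocks and blocks[-1][0] == "paragraph":
            blocks[-1] = ("paragraph", blocks[-1][1] + " " + content)
        else:
            blocks.append((kind, content))

    images = [img.get("src") for img in root.iter("image")]
    return blocks, images


if __name__ == "__main__":
    blocks, images = classify(SAMPLE_XML)
    for kind, content in blocks:
        print(kind, "|", content)
    print("images:", images)
```

This merges every run of body-size lines into a single paragraph, so real paragraph breaks would probably need the `top` coordinates as well (for example, a larger-than-usual vertical gap between lines).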