I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available utilities to do this, it loses all formatting, and all the tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying positions, etc.?

Thanks.

A: 

A PDF does not contain tabular data as such unless it includes structured content. Otherwise the file only records where each piece of text is drawn on the page, so some tools use heuristics to guess the table structure and reconstruct it. I wrote a blog article explaining the issues with PDF text extraction at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text

mark stephens
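
To illustrate the kind of heuristic mentioned above, here is a minimal Python sketch that rebuilds rows from positioned text. It assumes you have already obtained a list of `(x, y, text)` tuples from some PDF library; that input format, the function name `group_into_rows`, and the `y_tolerance` parameter are all made up for illustration. It also assumes PDF-style coordinates, where y grows upward from the bottom of the page.

```python
from typing import List, Tuple

# Hypothetical input format: (x, y, text) for each word on a page.
Word = Tuple[float, float, str]


def group_into_rows(words: List[Word], y_tolerance: float = 2.0) -> List[List[str]]:
    """Group words into rows by similar y, then order each row left to right."""
    rows = []  # each entry: (y of the first word in the row, [(x, text), ...])
    # Larger y is higher on the page, so sort by descending y for reading order.
    for x, y, text in sorted(words, key=lambda w: (-w[1], w[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within a row, sort by x so the cells come out left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


words = [
    (50, 700, "Name"), (200, 700.5, "Qty"),
    (50, 680, "Apple"), (200, 679.8, "3"),
]
for row in group_into_rows(words):
    print("\t".join(row))
```

This is only a guess at the layout: merged cells, wrapped text, or rows that sit closer together than the tolerance will confuse it, which is why such tools cannot be fully reliable.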
Is there a way to check whether a PDF is tagged with Adobe's Structured Content, as you wrote in your blog post? Thank you,
Mridang Agarwalla
You need to see if the tags are present.
mark stephens
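
One rough way to look for the tags from Python, sketched under some assumptions: a tagged PDF has a `/StructTreeRoot` entry in its document catalog and a `/MarkInfo` dictionary with `/Marked true`. The function name `looks_tagged` is made up for illustration, and the check simply scans the raw bytes, so it will miss files where the catalog is stored in a compressed object stream or where `/MarkInfo` is an indirect reference. A proper PDF parser is the more reliable route.

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Crude check for Tagged PDF markers in the raw, uncompressed file bytes."""
    # The structure tree root holds the logical structure (the tags).
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    # /MarkInfo << /Marked true >> declares that the file follows Tagged PDF rules.
    marked = re.search(rb"/MarkInfo\s*<<[^>]*/Marked\s+true", pdf_bytes) is not None
    return has_struct_tree and marked
```

To use it, read the file in binary mode, for example `looks_tagged(open(path, "rb").read())`. A `False` result does not prove the file is untagged, for the reasons given above.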