Extract text from PDF | ansaurus

tags:

pdf
python

Q:

Extract text from PDF

I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available utilities to do this, it loses all formatting, and all the tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying positions, etc.?

Thanks.

A:

A PDF does not contain tabular data unless it includes structured content. Some tools use heuristics to try to guess the data structure and put it back. I wrote a blog article explaining the issues with PDF text extraction at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text

mark stephens 2010-07-01 07:09:04
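
To illustrate the kind of position-based heuristic the answer mentions, here is a minimal sketch in Python. It is not taken from the answer or from any particular tool. It assumes you have already obtained each word together with its coordinates from a PDF library of your choice, and that y is measured from the top of the page. The `(x, y, text)` tuple layout, the function names, the `y_tol` tolerance and the column `boundaries` are all hypothetical and would need tuning for each document.

```python
from typing import List, Sequence, Tuple

# Hypothetical word record: (left x, y measured from the top of the page, text).
Word = Tuple[float, float, str]


def group_rows(words: Sequence[Word], y_tol: float = 2.0) -> List[List[Word]]:
    """Cluster words into rows: a word joins the current row when its y
    lies within y_tol of the first word of that row."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1][0][1] - word[1]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order the words of each row from left to right.
    return [sorted(row, key=lambda w: w[0]) for row in rows]


def split_columns(row: Sequence[Word], boundaries: Sequence[float]) -> List[str]:
    """Put each word into a column according to how many of the
    (assumed, hand-picked) x boundaries lie at or to the left of it."""
    cells = ["" for _ in range(len(boundaries) + 1)]
    for x, _, text in row:
        col = sum(1 for b in boundaries if x >= b)
        cells[col] = (cells[col] + " " + text).strip()
    return cells


def extract_table(words: Sequence[Word], boundaries: Sequence[float],
                  y_tol: float = 2.0) -> List[List[str]]:
    """Rebuild a table as a list of rows, each a list of cell strings."""
    return [split_columns(row, boundaries) for row in group_rows(words, y_tol)]


if __name__ == "__main__":
    # Made-up sample data standing in for words read from a PDF page.
    sample = [
        (50, 100.4, "Widget"), (200, 100.0, "4"), (300, 100.2, "9.99"),
        (50, 85.0, "Item"), (200, 85.0, "Qty"), (300, 85.0, "Price"),
    ]
    for cells in extract_table(sample, boundaries=[150, 250]):
        print("\t".join(cells))
```

Fixed column boundaries are the simplest choice and only work when every page shares the same layout. Tools that guess the structure automatically typically have to infer those boundaries from gaps between words, which is where the jumbled output tends to come from.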

Is there a way to check whether a PDF is tagged with Adobe's Structured Content, as you wrote in your blog post? Thank you.

Mridang Agarwalla 2010-07-11 18:01:41

You need to see if the tags are present.

mark stephens 2010-07-12 08:57:47
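
As a rough illustration of "seeing if the tags are present", the sketch below scans the raw bytes of a file for the two names the PDF specification uses for tagged documents: `/StructTreeRoot` (the structure tree in the document catalog) and `/MarkInfo` (the dictionary that carries the `/Marked` flag). This is a crude check and not the commenter's method. It assumes the catalog is stored uncompressed, so it can miss tags in files that keep the catalog inside a compressed object stream. A real PDF parser that reads the catalog gives a more reliable answer. The function name `find_tag_markers` is made up for this example.

```python
from pathlib import Path
from typing import Dict


def find_tag_markers(pdf_path: str) -> Dict[str, bool]:
    """Report whether the raw file bytes contain the names used for tagged PDF.

    Crude heuristic: a False result does not prove the file is untagged,
    because the catalog may sit inside a compressed object stream.
    """
    data = Path(pdf_path).read_bytes()
    return {
        "StructTreeRoot": b"/StructTreeRoot" in data,
        "MarkInfo": b"/MarkInfo" in data,
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_tags.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        print(name, find_tag_markers(name))
```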
