views:

902

answers:

5

Does anyone have experience extracting data from PDF files programmatically, in particular embedded tables? What tools did you use? Is this always a one-off process that depends on the file, or are there tools that work against all sorts of different files?

+7  A: 

I haven't done this myself, but iTextSharp would likely work. I've not seen a more complete PDF tool that's also free or cheap. It's available for .NET (iTextSharp) and Java (iText).

jcollum
iText is a great library for reading and creating PDF documents.
Mark Robinson
+2  A: 

Do not underestimate the power of copy-paste. A standard copy will usually lose the table formatting (more precisely, it loses the vertical dividers), so it is not that effective. The secret to getting data out of a table in a PDF file with copy and paste is to copy the columns individually. In Adobe Acrobat, holding the Alt key lets you do this. Generally, the horizontal dividers survive as newlines.

If it's just a one-off, this solution is often much easier and faster than programming (but then again, so is retyping the data yourself).

Brian
A: 

I've mostly solved this now - it turns out Acrobat Professional has an "extract to table" menu option which does this pretty nicely.

Simon Willison
Your question was about extracting data from PDF files programmatically...
DrG
+3  A: 

I've used XPDF's pdftotext (free) with much success. It has several options (including -raw and -layout) depending on whether you prefer maintaining approximate geometry or semantics.
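To make that concrete, here is a minimal sketch of driving pdftotext from Python. It assumes the `pdftotext` binary is on your PATH and that passing `-` as the output file writes the text to stdout. The helper names (`pdftotext_command`, `extract_text`, `split_row`) are made up for illustration, and splitting rows on runs of two or more spaces is just one simple way to use the `-layout` output, not something pdftotext does for you.

```python
import re
import subprocess


def pdftotext_command(pdf_path, mode="layout"):
    """Build the argument list for pdftotext (hypothetical helper).

    mode="layout" keeps the approximate geometry of the page;
    mode="raw" keeps the text in content-stream order.
    """
    if mode not in ("layout", "raw"):
        raise ValueError("mode must be 'layout' or 'raw'")
    # The trailing "-" asks pdftotext to write to stdout instead of a file.
    return ["pdftotext", "-" + mode, pdf_path, "-"]


def extract_text(pdf_path, mode="layout"):
    """Run pdftotext and return the extracted text as one string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, mode),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


def split_row(line):
    """Split one line of -layout output into cells.

    Assumption: cells are separated by two or more spaces, while
    single spaces belong to the cell text.
    """
    return re.split(r"\s{2,}", line.strip())
```

Something like `[split_row(l) for l in extract_text("report.pdf").splitlines() if l.strip()]` then gives a rough list of rows, where `report.pdf` is a placeholder file name. Expect to tweak the splitting per document, since cells that contain wide gaps or columns that sit close together will break the two-space rule.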

Jason S
A: 

As I understand it, there is no such thing as a table in a PDF document (in the HTML markup sense), just a collection of line and text primitives laid out to look like a table. I've seen some tools that attempt to discover tables in the text heuristically, but I doubt they're foolproof.
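As a rough illustration of what such a heuristic might look like (this is a sketch, not how any particular tool works), the code below takes lines of text that already preserve horizontal layout and looks for character positions that are blank in every row. Runs of such positions at least `min_gap` wide are treated as column separators. The function names and the `min_gap` default are assumptions made for the example.

```python
def column_spans(lines, min_gap=2):
    """Guess column boundaries as (start, end) character ranges.

    A position is "occupied" if any row has a non-space character there.
    Occupied runs separated by fewer than min_gap blank positions are
    merged, so a single space inside a cell does not split the column.
    """
    if not lines:
        return []
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    occupied = [any(row[i] != " " for row in padded) for i in range(width)]

    spans = []
    i = 0
    while i < width:
        if occupied[i]:
            start = i
            while i < width and occupied[i]:
                i += 1
            if spans and start - spans[-1][1] < min_gap:
                # Gap too narrow to be a separator: extend the previous span.
                spans[-1] = (spans[-1][0], i)
            else:
                spans.append((start, i))
        else:
            i += 1
    return spans


def slice_table(lines, min_gap=2):
    """Cut every line at the guessed column boundaries."""
    spans = column_spans(lines, min_gap)
    width = max((len(line) for line in lines), default=0)
    return [
        [line.ljust(width)[start:end].strip() for start, end in spans]
        for line in lines
    ]
```

This only works when the rows passed in really belong to one table and the columns never overlap horizontally. A heading or paragraph that spans the full width will fill in the gaps and merge everything into one column, which is exactly the kind of case where these heuristics fall over.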

See:

AndrewR