Hi, I need to extract the text from a PDF file. This text will likely be in a table format, and it is going to be used for automatic transfer of data between an external party and our systems.

Can anyone suggest a command-line tool (e.g. a PDF-to-text converter) or a library that would be good for this?

Language options:

  • C# (preferred)
  • Java (if I must)

I found some ideas here, but I think the guy was talking more about a one-off situation; I'm talking about something more like a daily import:

http://stackoverflow.com/questions/488089/extracting-tables-from-pdf-files

+3  A: 

Try this:

http://www.codeproject.com/KB/cs/PDFToText.aspx

Bye

RRUZ
That uses iTextSharp, for later reference.
Chris
+3  A: 

pdftotext seems to do the trick quite nicely.

pdftotext file.pdf [textfile.txt]

Edit: I'm not sure how you would like to retain information about the tables. The best looking output (to my human eye, at least) is produced by

pdftotext -layout file.pdf [textfile.txt]

This maintains the original layout of the document as best as possible. In particular, the tables still look pretty good in the text output. The default is to interpret the columns of the table as columns of text (terrible). Another option that doesn't look as good to me, but might still be useful, is the -raw option.
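For a daily import, the -layout output could be turned into rows and cells with something like the rough Java sketch below. It rests on assumptions that are not part of this answer: that `pdftotext` is on the PATH, that passing `-` as the output file makes it write to stdout, and that columns in the output are separated by at least two spaces. The class and method names (`LayoutTableParser`, `splitColumns`, `parseRows`) are made up for illustration. The two-space rule is only a heuristic: empty cells, or cell text that itself contains double spaces, will shift the columns.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of pdftotext or any library.
final class LayoutTableParser {

    // Runs "pdftotext -layout <pdf> -" and returns its stdout as a string.
    // Assumes pdftotext is installed and that "-" selects stdout.
    static String extractLayoutText(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdftotext", "-layout", pdfPath, "-")
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let error messages pass through
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    // Splits one line into cells on runs of two or more whitespace characters
    // (assumed column gap); single spaces inside a cell are kept.
    static List<String> splitColumns(String line) {
        String trimmed = line.strip();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s{2,}"));
    }

    // Parses the whole text into rows, skipping blank lines.
    // \R also matches the form feed that separates pages.
    static List<List<String>> parseRows(String layoutText) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : layoutText.split("\\R")) {
            List<String> cells = splitColumns(line);
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample resembling -layout output; no PDF is needed to try the parsing.
        String sample = "Item          Qty    Unit price\n"
                      + "Blue widget   3            4.50\n";
        for (List<String> row : parseRows(sample)) {
            System.out.println(row);
        }
    }
}
```

In a real import you would call `parseRows(extractLayoutText("file.pdf"))` and then check each row against the column count you expect before loading it, since the table structure is only being guessed from spacing.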

Anton Geraschenko
Do you mean the Xpdf tool?
Chris
According to Wikipedia, `xpdf` does have an implementation of `pdftotext`. The one I have came in the `poppler-utils` package. I can't seem to find a pdf with a table in it to test what the output looks like. What kind of output would you like?
Anton Geraschenko
Looks like poppler is a fork of xpdf, so it's probably the same tool.
Chris
I used the xpdf version of this and was very happy with the result. The -layout flag _really_ helped as Anton notes above.
Tim Perry
A: 

Try the open-source Java PDF library:

http://www.lowagie.com/iText/docs.html

janetsmith
+1  A: 

Hi there,

I can't provide a solution but only offer general advice. My advice to you is to open a PDF document in Notepad or another Plain Text editor and study the formatting codes. They're very easy to understand. For example, //par is a Paragraph and //tab is a Tab. Once you know the formatting codes for table layouts, it'll be very easy for you to come up with your own solution to extract anything from a PDF document.

baeltazor
It's not that easy. There's a lot of work involved in extracting text from a document in a human-readable format. The task becomes a bit easier if you just need to extract text from the same document every time, but if you need to extract text from random documents, from varying sources, it's not easy at all. So I wouldn't recommend this option unless you want to spend quite a bit of time perfecting it and really cannot use any third-party libraries.
Rowan
+1  A: 

There are also PDFBox and JPedal for Java. Tables do not exist in the PDF file format, so any software will be 'guessing' them.

mark stephens
+1  A: 

Apache Tika is an open-source Java toolkit that specializes in what you are looking for: extracting structured content from various documents, including PDF.

It uses PDFBox for the PDF file format, but provides a level of abstraction that is ideal for extracting structured content.

It contains a command-line utility; see here.

grigory