ansaurus

Question

Extracting text from a PDF file

Answer 1

+3 A:

Try this:

http://www.codeproject.com/KB/cs/PDFToText.aspx

Bye

RRUZ 2009-08-14 04:27:41

For later reference: that uses iTextSharp.

Chris 2009-08-14 05:03:09

Answer 2

+3 A:

pdftotext seems to do the trick quite nicely.

pdftotext file.pdf [textfile.txt]

Edit: I'm not sure how you would like to retain information about the tables. The best-looking output (to my human eye, at least) is produced by

pdftotext -layout file.pdf [textfile.txt]

This maintains the original layout of the document as well as it can. In particular, the tables still look pretty good in the text output. Without -layout, the default is to interpret the columns of the table as columns of text, which looks terrible. Another option that doesn't look as good to me, but might still be useful, is the -raw option.
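If you want to call the tool from a script, here is a minimal Python sketch of the invocations above. The function names are hypothetical, and it assumes `pdftotext` (from xpdf or poppler-utils) is on your PATH. It also passes `-enc UTF-8` and uses `-` as the output name to get the text on stdout; both are choices of this sketch, not something the answer above relies on.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path=None, mode="layout"):
    """Build the argument list for pdftotext.

    mode is "layout", "raw", or None for the tool's default output.
    """
    if mode not in ("layout", "raw", None):
        raise ValueError(f"unsupported mode: {mode!r}")
    command = ["pdftotext"]
    if mode is not None:
        command.append(f"-{mode}")  # becomes -layout or -raw
    # Ask for UTF-8 explicitly so the output can be decoded predictably.
    command.extend(["-enc", "UTF-8"])
    command.append(pdf_path)
    # "-" asks pdftotext to write the text to stdout instead of a file.
    command.append(txt_path if txt_path is not None else "-")
    return command


def extract_text(pdf_path, mode="layout"):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, mode=mode),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("file.pdf")` would then correspond to the -layout command above, and `extract_text("file.pdf", mode="raw")` to the -raw variant.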

Anton Geraschenko 2009-08-14 04:40:04

Do you mean the Xpdf tool?

Chris 2009-08-14 04:41:00

According to Wikipedia, `xpdf` does have an implementation of `pdftotext`. The one I have came in the `poppler-utils` package. I can't seem to find a PDF with a table in it to test what the output looks like. What kind of output would you like?

Anton Geraschenko 2009-08-14 04:54:27

Looks like poppler is a fork of xpdf, so it's probably the same tool.

Chris 2009-08-14 06:39:34

I used the xpdf version of this and was very happy with the result. The -layout flag _really_ helped as Anton notes above.

Tim Perry 2010-07-07 23:18:41

Answer 3

A:

Try the open-source Java PDF library iText:

http://www.lowagie.com/iText/docs.html

janetsmith 2009-08-14 04:42:05

Answer 4

+1 A:

Hi there,

I can't provide a solution, only general advice: open a PDF document in Notepad or another plain-text editor and study the formatting codes. They're very easy to understand. For example, //par is a paragraph and //tab is a tab. Once you know the formatting codes for table layouts, it'll be very easy for you to come up with your own solution to extract anything from a PDF document.

baeltazor 2009-08-14 04:52:54

It's not that easy. There's a lot of work involved in extracting text from a document in a human-readable format. The task becomes a bit easier if you just need to extract text from the same document every time, but if you need to extract text from random documents, from varying sources, it's not easy at all. So I wouldn't recommend this option unless you want to spend quite a bit of time perfecting it and really cannot use any third-party libraries.

Rowan 2009-08-14 23:56:57

Answer 5

+1 A:

There are also PDFBox and JPedal for Java. Tables do not exist in the PDF file format, so any software will be 'guessing' them.
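To give a feel for what that 'guessing' can look like, here is a rough Python sketch (not how PDFBox or JPedal work internally). It takes text that already preserves the page layout, such as the output of pdftotext -layout, and treats any run of two or more spaces as a column gap. The function name, the two-space threshold, and the sample data are all assumptions made for illustration; real documents will need more careful heuristics.

```python
import re

# Assumption: two or more consecutive spaces mark a gap between columns.
# Single spaces are kept because they usually separate words inside a cell.
COLUMN_GAP = re.compile(r" {2,}")


def guess_table_rows(layout_text):
    """Split each non-blank line of layout-preserving text into guessed cells."""
    rows = []
    for line in layout_text.splitlines():
        stripped = line.strip()
        if stripped:  # skip blank lines between rows
            rows.append(COLUMN_GAP.split(stripped))
    return rows


# Made-up sample resembling layout-preserving output for a small table.
sample = (
    "Item          Qty    Unit price\n"
    "Blue widget    12          3.50\n"
    "\n"
    "Red widget      7         10.00\n"
)
rows = guess_table_rows(sample)
for row in rows:
    print(row)
```

This breaks down as soon as a cell is empty or a column is separated by only one space, which is exactly why every tool's table output is a guess.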

mark stephens 2009-08-14 06:12:05

Answer 6

+1 A:

Apache Tika is an open-source Java toolkit that specializes in what you are looking for: extracting structured content from various documents, including PDF.

It uses PDFBox for the PDF file format but provides a level of abstraction that is ideal for extracting structured content.

It also includes a command-line utility - see here.

grigory 2009-08-14 07:10:43
