ansaurus

Question

Parsing PDF files (especially with tables) with PDFBox

Answer 1

A:

I'm not familiar with PDFBox, but you could try looking at iText. Even though the homepage talks about PDF generation, it can also do PDF manipulation and extraction. Have a look and see if it fits your use case.

Paul Sanwald 2010-07-08 13:53:26

Do you have an example of using iText to extract file content?

matheus.emm 2010-07-08 17:36:10

I found a simple way to read the content using iText, but it didn't help me. Using PdfTextExtractor I get a result similar to what I get with PDFBox. :-(

matheus.emm 2010-07-08 18:36:33

It's been a while, but isn't it PdfReader and then .getContent()?

Paul Sanwald 2010-07-08 19:16:40

Answer 2

+1 A:

Extracting data from PDF is bound to be fraught with problems. Are the documents created through some kind of automatic process? If so, you might consider converting the PDFs to uncompressed PostScript (try pdf2ps) and seeing if the PostScript contains some sort of regular pattern which you can exploit.

Todd Owen 2010-07-09 14:15:18

Answer 3

+1 A:

Searching for "PDF to Excel" turned up http://www.cogniview.com/pdf2xl.php.

(It may be useful to go back a step and check how the PDF is generated. If you are lucky, you may be able to extract the data at the source.)

Jayan 2010-07-17 14:21:36

I tried PDF2XL and it does an awesome job of extracting tabular data from PDF files, but it isn't a library (it is a desktop application). It would be good if my app's users could buy PDF2XL, use it to convert the PDF, and import the XLS file.

matheus.emm 2010-07-19 13:34:38

Answer 4

+1 A:

How about printing to image and doing OCR on that?

Sounds terribly inefficient, but it's practically the very purpose of PDF to make text inaccessible, so you gotta do what you gotta do.

Carl Smotricz 2010-07-17 14:26:09

Don't suppose you could elaborate on which OCR software can read tables?

markdigi 2010-08-18 16:51:00

@markdigi: I have very little experience with OCR software. I've used something very clumsy called Readiris that came free with my HP printer, and a surprisingly capable yet reasonably priced product called ABBYY FineReader (I think). If I remember correctly, both can convert documents to MS Word format, tables included. Please take this info as a hint for further exploration, not a concrete recommendation.

Carl Smotricz 2010-08-18 18:42:14

Answer 5

+1 A:

You will need to devise an algorithm to extract the data in a usable format. Regardless of which PDF library you use, you will need to do this. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the screen and draw the glyph for character 'c'.

I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions for your table. Then it's a simple matter of setting up text regions and determining which numbers/letters/characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.
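The region lookup described above can be sketched in plain Java, independent of any PDF library. This is only an illustration under assumptions: the TableGrid and Fragment names are hypothetical, the x positions of the vertical lines are assumed to have been collected already (for example from the intercepted line segments), and each text fragment is assumed to carry the x coordinate where it was drawn.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: maps x coordinates to table columns using vertical line positions. */
final class TableGrid {

    /** Hypothetical holder for a piece of text and the x position where it was drawn. */
    static final class Fragment {
        final double x;
        final String text;

        Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    private final double[] boundaries; // sorted x positions of the vertical lines

    TableGrid(double[] verticalLineXs) {
        boundaries = verticalLineXs.clone();
        Arrays.sort(boundaries);
    }

    /** Returns the zero-based column containing x, or -1 if x lies outside the table. */
    int columnOf(double x) {
        if (boundaries.length < 2 || x < boundaries[0] || x >= boundaries[boundaries.length - 1]) {
            return -1;
        }
        int idx = Arrays.binarySearch(boundaries, x);
        // For a value that is not an exact boundary, binarySearch returns -(insertionPoint) - 1.
        int insertionPoint = idx >= 0 ? idx + 1 : -idx - 1;
        return insertionPoint - 1;
    }

    /** Concatenates the fragments of one text row into one string per column, in drawing order. */
    String[] toCells(List<Fragment> row) {
        String[] cells = new String[Math.max(0, boundaries.length - 1)];
        Arrays.fill(cells, "");
        for (Fragment f : row) {
            int col = columnOf(f.x);
            if (col >= 0) {
                cells[col] += f.text;
            }
        }
        return cells;
    }
}
```

Real PDFs often draw the same cell border more than once, so the line positions would probably need to be de-duplicated (with some tolerance) before being handed to a helper like this. Rows could be handled the same way using the y positions of the horizontal lines.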

Also, the reason you may not get spaces between text that is visually separated is that, very often, the PDF does not draw a space character at all. Instead, the text matrix is updated and a 'move' command positions the next character a "space width" away from the last one.
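One way to put the missing spaces back is to compare the gap between consecutive glyphs with the width of a space. The sketch below is plain Java with hypothetical names (SpaceRebuilder, Glyph); it assumes you have already collected each glyph's x position and width, in reading order, for a single line of text. The threshold of half a space width is an assumed heuristic that would need tuning per document.

```java
import java.util.List;

/** Hypothetical helper: rebuilds a text line, inserting spaces where glyphs are far apart. */
final class SpaceRebuilder {

    /** Hypothetical holder for one drawn glyph: left edge, advance width, and its text. */
    static final class Glyph {
        final double x;
        final double width;
        final String text;

        Glyph(double x, double width, String text) {
            this.x = x;
            this.width = width;
            this.text = text;
        }
    }

    /**
     * Joins glyphs from a single line, given in left-to-right order.
     * A space is inserted when the gap to the previous glyph exceeds
     * half of spaceWidth (an assumed threshold, not a fixed rule).
     */
    static String join(List<Glyph> glyphs, double spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * 0.5) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

For table data, an unusually large gap (several space widths) is also a hint that the next glyph belongs to a different column, which can serve as a fallback when the table has no drawn lines.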

Good luck.

purecharger 2010-08-12 21:47:34

Answer 6

A:

http://www.pdflib.com/products/tet/

They do an OK job of extracting tables. They have an XML output format that represents tables as well. But it hasn't worked for all PDFs.

kaushalc 2010-10-26 13:02:12

Answer 7

A:

http://swftools.org/ These guys have a pdf2swf component, and it is also able to show tables. The source is available too, so you could check it out.

kaushalc 2010-10-26 13:26:43
