Hi! I need to parse a PDF file that contains tabular data. I'm using PDFBox to extract the file's text as a String so that I can parse it later. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file that contains a table like this (7 columns: the first two always have data, only one Complexity column has data, and only one Financing column has data):

+----------------------------------------------------------------+
| AIH | Value | Complexity                     | Financing       |
|     |       | Medium | High | Not applicable | MAC/Other | FAE |
+----------------------------------------------------------------+
| xyz | 12.43 | 12.43  |      |                | 12.43     |     |
+----------------------------------------------------------------+
| abc | 1.56  |        | 1.56 |                |           | 1.56|
+----------------------------------------------------------------+

Then I use PDFBox:

PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);

Those two data rows are extracted like this:

xyz 12.43 12.4312.43
abc 1.56 1.561.56

There are no white spaces between the last two numbers, but this is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I don't have the relation between the numbers and their columns.

I'm not required to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.

Thanks in advance.

A: 

I'm not familiar with PDFBox, but you could try looking at iText. Even though the homepage talks about PDF generation, you can also use it for PDF manipulation and extraction. Have a look and see if it fits your use case.

Paul Sanwald
Do you have an example of using iText to extract file content?
matheus.emm
I found a simple way to read the content using iText, but it didn't help me. Using PdfTextExtractor I get a result similar to the one from PDFBox. :-(
matheus.emm
It's been a while, but isn't it PdfReader and then .getContent()?
Paul Sanwald
+1  A: 

Extracting data from PDF is bound to be fraught with problems. Are the documents created through some kind of automatic process? If so, you might consider converting the PDFs to uncompressed PostScript (try pdf2ps) and seeing if the PostScript contains some sort of regular pattern which you can exploit.

Todd Owen
+1  A: 

A search for "PDF to Excel" turned up http://www.cogniview.com/pdf2xl.php.

(It may be useful to go back a bit and check how the PDF is generated. If you are lucky, you may be able to apply the data extraction at the source.)

Jayan
I tried PDF2XL and it does an awesome job of extracting tabular data from PDF files, but it isn't a library (it is a desktop application). It would be good if my app's users could buy PDF2XL, use it to convert the PDF, and import the XLS file.
matheus.emm
+1  A: 

How about printing to image and doing OCR on that?

Sounds terribly inefficient, but it's practically the very purpose of PDF to make text inaccessible. You gotta do what you gotta do.

Carl Smotricz
Don't suppose you could elaborate on which OCR software can read tables?
markdigi
@markdigi: I have very little experience with OCR software: something very clumsy called Readiris that came free with my HP printer, and a surprisingly capable yet reasonably priced product called ABBYY FineReader (I think). If I remember correctly, both are able to read documents into MS Word format, and that included tables. Please take this info as a hint for further exploration, not a concrete recommendation.
Carl Smotricz
+1  A: 

You will need to devise an algorithm to extract the data in a usable format. Regardless of which PDF library you use, you will need to do this. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the screen and draw the glyph for character 'c'.

I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions for your table. Then it's a simple matter of setting up text regions and determining which numbers/letters/characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.
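
The region lookup described above can be sketched in plain Java without any PDF library. This is only an illustration under stated assumptions: the class ColumnAssigner, its methods, and the boundary coordinates are hypothetical, and in practice the boundaries would come from the vertical line segments you intercept and the x positions from the text drawing operations.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: maps text fragments to table columns
// by comparing each fragment's x position with the column boundaries.
class ColumnAssigner {
    // Column names taken from the table in the question.
    static final String[] COLUMN_NAMES = {
        "AIH", "Value", "Medium", "High", "Not applicable", "MAC/Other", "FAE"
    };

    // Returns the index of the column whose [left, right) span contains x, or -1.
    // boundaries must be sorted ascending and hold (number of columns + 1) entries.
    static int columnIndex(double[] boundaries, double x) {
        for (int i = 0; i < boundaries.length - 1; i++) {
            if (x >= boundaries[i] && x < boundaries[i + 1]) {
                return i;
            }
        }
        return -1;
    }

    // Places each text fragment into the cell of the column that contains its x position.
    static String[] assignRow(double[] boundaries, double[] xs, String[] texts) {
        String[] cells = new String[boundaries.length - 1];
        Arrays.fill(cells, "");
        for (int i = 0; i < xs.length; i++) {
            int col = columnIndex(boundaries, xs[i]);
            if (col >= 0) {
                cells[col] += texts[i];
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up x coordinates of the vertical rules (assumption for illustration).
        double[] boundaries = {0, 40, 90, 140, 180, 280, 360, 400};
        // Made-up x positions for the fragments of the "xyz" row.
        double[] xs = {5, 45, 95, 285};
        String[] texts = {"xyz", "12.43", "12.43", "12.43"};
        String[] cells = assignRow(boundaries, xs, texts);
        for (int i = 0; i < cells.length; i++) {
            System.out.println(COLUMN_NAMES[i] + " = " + cells[i]);
        }
    }
}
```

Cells that receive no fragment stay empty, which is what tells you that a value belongs to, say, "High" rather than "Medium".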

Also, the reason you may not have spaces between text that is visually separated is that very often the PDF does not draw a space character at all. Instead, the text matrix is updated and a 'move' command is issued so that the next character is drawn a "space width" apart from the last one.
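
One way to compensate for the missing space characters is to look at the horizontal gap between consecutive fragments and insert a space yourself when the gap is large enough. The following is a minimal sketch in plain Java; the GapSpacer and Fragment classes and the "half a space width" threshold are assumptions for illustration, not anything PDFBox provides under these names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: rebuilds a text line from positioned fragments,
// inserting a space wherever the horizontal gap suggests one was intended.
class GapSpacer {
    // A piece of text with its left x position and its drawn width (assumed known).
    static final class Fragment {
        final String text;
        final double x;
        final double width;

        Fragment(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Joins fragments (already sorted left to right). A space is added when the gap
    // to the previous fragment exceeds half of spaceWidth; the 0.5 factor is an
    // arbitrary tolerance chosen for this sketch.
    static String join(List<Fragment> line, double spaceWidth) {
        StringBuilder sb = new StringBuilder();
        double prevEnd = Double.NaN;
        for (Fragment f : line) {
            if (!Double.isNaN(prevEnd) && f.x - prevEnd > spaceWidth * 0.5) {
                sb.append(' ');
            }
            sb.append(f.text);
            prevEnd = f.x + f.width;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up positions: two numbers drawn in different table columns.
        List<Fragment> line = Arrays.asList(
            new Fragment("1.56", 100, 20),
            new Fragment("1.56", 300, 20));
        System.out.println(join(line, 3.0));
    }
}
```

The same gap measurement could also feed the column lookup, since a fragment's x position is already needed to decide which region it falls in.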

Good luck.

purecharger
A: 

http://www.pdflib.com/products/tet/

They do an OK job of getting tables. They have an XML output format that shows tables as well, but it hasn't worked for all PDFs.

kaushalc
A: 

http://swftools.org/ has a pdf2swf component that is also able to show tables. The source is available too, so you could check how they do it.

kaushalc