I have to read a PDF file which contains a table with several columns. Using iTextSharp I am able to read the file, but I get a bunch of unformatted text. I am not able to structure the data so that I can insert it into a database.

Any suggestions?

A: 

If I understand it correctly, PDF text is stored positionally, so it has no concept of rows or columns. That means you have to use heuristics based on the "likelihood" that you're reading from a different column.
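To make the "positional" point concrete, here is a minimal sketch of rebuilding rows from positioned words. It assumes you can already get each word as an `(x, y, text)` tuple from your PDF library; that tuple shape, the `group_into_rows` name, and the `y_tolerance` value are all hypothetical and not part of any library's API. It also assumes the default PDF coordinate system, where y grows upward.

```python
def group_into_rows(words, y_tolerance=2.0):
    """Group (x, y, text) words into rows by similar y position.

    Assumption: y grows upward (default PDF user space), so a larger y
    means higher on the page. y_tolerance is a guess you would tune.
    """
    rows = []
    # Visit words top to bottom, then left to right.
    for x, y, text in sorted(words, key=lambda w: (-w[1], w[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["words"].append((x, text))
        else:
            rows.append({"y": y, "words": [(x, text)]})
    # Within each row, order the words left to right and keep only the text.
    return [[text for _, text in sorted(row["words"])] for row in rows]


words = [
    (10, 700.0, "Name"), (100, 700.5, "Qty"),
    (10, 680.0, "Widget"), (100, 680.2, "3"),
]
print(group_into_rows(words))
```

Each row is compared against the y of the first word placed in it, so a slowly drifting baseline could still split a row; that is one of the places where the heuristic nature shows.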

You can try doing this by comparing the amount of space between the words. (I'm not familiar with the iTextSharp interface, so please forgive me if I'm mentioning things it's not capable of; I'm mostly familiar with pdfNet.)
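A rough sketch of the spacing idea, assuming the words of a single line are available as `(x_start, x_end, text)` tuples. The function name `split_into_cells` and the `gap_threshold` default are made up for illustration; a real threshold would depend on the font size and layout of your file.

```python
def split_into_cells(words, gap_threshold=15.0):
    """Split one line of (x_start, x_end, text) words into cells.

    A horizontal gap wider than gap_threshold (a hypothetical value)
    is treated as a column break; smaller gaps are ordinary spaces.
    """
    cells = []
    current = []
    prev_end = None
    for x_start, x_end, text in sorted(words):
        if prev_end is not None and x_start - prev_end > gap_threshold:
            cells.append(" ".join(current))
            current = []
        current.append(text)
        prev_end = x_end
    if current:
        cells.append(" ".join(current))
    return cells


line = [(10, 40, "John"), (44, 80, "Smith"), (150, 170, "42"), (260, 300, "Paris")]
print(split_into_cells(line))  # ['John Smith', '42', 'Paris']
```

This breaks down when a cell is empty or when a column is narrow enough that the gap looks like a normal space, so treat the result as a guess to be checked.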

Another idea that just came to me: if the text has visual cues, such as vertical lines separating the columns, you should be able to come up with heuristics to determine whether the text is to the left or right of each column line.
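If the x positions of those vertical lines can be found (how to get them depends on your library, so here they are simply assumed to be known), placing a word in a column is a lookup. In this sketch `assign_columns`, the `(x, text)` word shape, and the boundary values are all hypothetical.

```python
from bisect import bisect_right


def assign_columns(words, boundaries):
    """Place (x, text) words into columns split by vertical lines.

    boundaries holds the x positions of the separator lines (assumed
    known); n lines give n + 1 columns, numbered left to right.
    """
    boundaries = sorted(boundaries)
    columns = [[] for _ in range(len(boundaries) + 1)]
    for x, text in sorted(words):
        # bisect_right counts how many separator lines lie at or left of x.
        columns[bisect_right(boundaries, x)].append(text)
    return [" ".join(column) for column in columns]


row = [(12, "Widget"), (60, "Large"), (130, "3"), (210, "9.99")]
print(assign_columns(row, [120, 200]))  # ['Widget Large', '3', '9.99']
```

Unlike the gap approach, an empty cell comes back as an empty string, so the column count stays stable from row to row.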

...

However, the best thing to do, if possible, is to get hold of the data in a more database-friendly format. This will likely save heartache in the long run.

-- Jason

Jason D
+1  A: 

Unless it's structured text, there is no tagging to show columns. Tools like PdfBox make 'guesses' to try to extract the table.

There is an article explaining why text extraction is so hard at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text

mark stephens
A: 

I am concluding there is no straightforward way to do this, at least not for reading the data in tabular format. I tried the suggestions provided by Mark, but they do not seem feasible for my requirements.

Vadi