  1. I have a few PDF files that were created from Word or Excel files.

  2. I need to get the information that's in the tables.

  3. The text in the document is not an image, so I'm able to extract it using tools such as PDFBox.

  4. Once I have the text, I have no way of knowing which table cell it belongs to, because I don't know where the table borders are.

  5. I've tried a few desktop tools such as ABBYY or Solid PDF Converter, and they are able to convert the files into nice Word documents, but this doesn't suit my needs as I want to be able to do this programmatically in C#.

  6. Some of the tables have nested tables, which I think makes this a little more difficult.

I appreciate your help.

+1  A: 

The difficulty here is caused by the fact that the text in the PDF is not contained within any table. It might look like it is, but underneath the surface, it is not.

There are a couple of options that I can think of, but neither is going to be quite as satisfying as you'd probably like.

  1. There are some companies that offer SDKs for PDF to Excel/Word conversion. Investintech and Iceni are a couple of examples. But these solutions are not free.
  2. If you know the exact layout of the PDF files that you need to extract the table data from, then you can use any SDK that lets you extract text from a PDF and also tells you the exact co-ordinates of the extracted text. Using this method you need to know in advance where the text is going to be, so that you can extract text from a specific area on the page. It obviously won't work if you need to process any random document.
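
To make option 2 a bit more concrete, here is a minimal sketch in Python (the logic should carry over to C# more or less line for line). It assumes your PDF SDK has already handed you a list of text fragments with page coordinates, and that you have measured the cell rectangles by hand beforehand. The `Fragment` and `Region` types, the `extract_cells` helper, and all of the numbers are made up for illustration and do not come from any particular library. It also assumes `y` is measured from the top of the page; many SDKs report it from the bottom, in which case you would need to flip it.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One piece of text plus the position reported for it (hypothetical shape)."""
    text: str
    x: float  # left edge, in page units
    y: float  # distance from the top of the page, in page units


@dataclass
class Region:
    """A rectangle measured in advance for one table cell (hypothetical values)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, frag: Fragment) -> bool:
        return self.left <= frag.x < self.right and self.top <= frag.y < self.bottom


def extract_cells(fragments, regions):
    """Group fragments by the known region they fall in, in reading order."""
    cells = {region.name: [] for region in regions}
    # Sort top-to-bottom, then left-to-right, so multi-line cells read correctly.
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        for region in regions:
            if region.contains(frag):
                cells[region.name].append(frag.text)
                break  # each fragment belongs to at most one cell
    return {name: " ".join(parts) for name, parts in cells.items()}


# Made-up layout: a one-row table with two cells.
REGIONS = [
    Region("name", left=50, top=100, right=200, bottom=140),
    Region("amount", left=200, top=100, right=300, bottom=140),
]

# Made-up fragments, as a text extractor with positions might report them.
FRAGMENTS = [
    Fragment("Smith", x=60, y=122),
    Fragment("42.00", x=210, y=110),
    Fragment("John", x=60, y=110),
    Fragment("Page 1", x=50, y=700),  # outside every region, so it is ignored
]

print(extract_cells(FRAGMENTS, REGIONS))
# prints {'name': 'John Smith', 'amount': '42.00'}
```

The whole approach stands or falls on the `REGIONS` list: if a cell grows or moves from one document to the next, the rectangles no longer match and the text ends up in the wrong cell or is dropped.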

It's a difficult task, but hopefully this will give you a starting point.

Rowan
Thank you for your response. 1. The programs you mentioned don't give a good result. I don't mind going with a solution that isn't free, but I have to be sure it will work 100%. 2. I tried playing around with the solution of using the co-ordinates, but I don't see how I can use it without knowing the co-ordinates of the borders. The location of the text in the tables changes (nested tables, multiple lines in a cell).
pooky