Hi All,

I have a bunch of PDF docs with tabular data in them which I need to extract into a more readable format to store in a spreadsheet, database or whatever.

Is there anything out there (preferably free) that can get tabular data out of PDFs into a more readable format in bulk, whether natively integrated with an app, passively via the command line, or by looping the process in code (.NET)?

Can be any format really (doc, html) just as long as the tables are maintained.

Anything I've found so far is either a one-off (it only does one doc at a time; I have hundreds, so that ain't happening) or does not maintain the table structure.

If you have any ideas, please post.

A: 

When you say

Anything I've found so far ... only does one doc at a time

I'll assume you mean "is a GUI app, without a programming interface."

In this case you could use Microsoft UI Automation to programmatically control the app and make it do what you want.

UIA ... provides a means for exposing and collecting information about user interface elements and controls to support user interface accessibility and software test automation ... and is compatible with both Win32 and the .NET Framework.

Hugh Allen
+2  A: 

This is a giant hassle. In general, extracting the text content of a PDF file is running against the grain of what PDF wants you to do.

Start by trying to get the text out. This may be more or less successful, depending on how the PDF is built. One place to start is GhostScript or pstotext. If that fails you, this guy has a list of text extraction tools. Once you have the text stream, you could then try to reassemble the tabular structure programmatically.
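As a rough illustration of the "reassemble the tabular structure" step, here is a minimal Python sketch. It assumes the text was extracted with the column layout preserved as runs of spaces (for example with something like `pdftotext -layout`, which is an assumption about your toolchain, not something the question specifies). The function names and the sample text are made up for the example, and real PDFs will need more tailoring than this.

```python
import re

def split_columns(line, min_gap=2):
    """Split one layout-preserved line into cells on runs of >= min_gap spaces."""
    pattern = r" {%d,}" % min_gap
    return [cell.strip() for cell in re.split(pattern, line.strip()) if cell.strip()]

def text_to_rows(text, min_gap=2):
    """Keep only lines that split into more than one cell (likely table rows)."""
    rows = []
    for line in text.splitlines():
        cells = split_columns(line, min_gap)
        if len(cells) > 1:
            rows.append(cells)
    return rows

# Hypothetical extracted text: a heading followed by a small table.
sample = (
    "Quarterly report\n"
    "Item        Qty    Price\n"
    "Widget A     10     2.50\n"
    "Widget B      3    14.00\n"
)
rows = text_to_rows(sample)
```

Single spaces inside a cell ("Widget A") survive because only gaps of two or more spaces count as column breaks. That heuristic fails on tables with empty cells or tightly packed columns, which is part of why there is no general solution.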

Finally, if you are in seriously bad shape, and if the PDFs don't cooperate, you could do the OCR thing. The right long-term solution is to get the data into the right format at the outset, either by doing a single, massive, painful, and probably partially manual conversion, or by going to the source and suggesting that the data be provided in a more usable form.

If you can give a more specific PDF example file, there may be a better or more precise answer. There is NO general solution to this; if it's possible at all, it will need to be tailored to your specific source data.

Note this rather pointed response to the general question. It doesn't help with the fact that you have the problem in front of you, but maybe it would provide useful top cover when explaining to your boss why there isn't an obvious answer? ;-)

A new SO question popped up, and referred to this library -- iTextSharp -- which looks possibly related. SO question: Best way to extract...

andersoj
+1  A: 

Considering your requirement, the straightforward answer to your question is that it is not really possible. The reason is that, unlike Word or Excel, the PDF specification does not have an object called Table. The tables you see in those PDF documents are just a series of rectangles drawn so that they look like a table, and the details are up to the PDF writer that created the files; some might draw the table structure using a series of lines instead.

You could possibly write your own parser based on the PDF file specification, but that is a daunting task, and it will take several months to get one that works with quite a few PDF documents.

In case you do decide to write your own parser, the article below would give you a jump start. Code Project Article

Karthik
There are a bunch of PDF toolsets out there... I don't know how this helps answer the question.
andersoj
@andersoj, thanks for your feedback. I've been developing a commercial PDF solution for the past 2 years. My response is based on my knowledge and experience with the PDF file format, and this question has been asked by several of our customers in the past. Hence I gave my straightforward response. Also, as far as I know, there are no such components available on the market. But there are some commercial solutions available which export PDF as a Word document, and I know how far they are reliable ;) Cheers,
Karthik
Ah, that's similar to the LaTeX to Word approach? Generate one bitmap for each page, place it on the page, and your Word document is ready?
Stephan Eggermont
@Karthik -- I removed my downvote. As a PDF guru, you know that the question isn't answerable in its current form -- suppose these tables were encoded as embedded images? Suppose they used a non-standard font/font encoding? Given that PDF has little in the way of semantics, and the haphazard ways PDF output has been structured by various producers, these problems are rife... We need sample data to answer the question.
andersoj
@Stephan, no, those tools don't use a bitmap-based approach. Instead, they parse the given PDF file and extract the text and its positions during the first pass; then, based on the text XY positions retrieved from the PDF document, they create a new Word document. This approach works fine with some documents (where you will get output similar to what exists in the PDF), but there is no guarantee that it will work reliably with all PDF documents.
Karthik
@andersoj, if the tables are encoded as embedded images, then we could extract the images from the PDF file with some small tweaks to the iTextSharp library code. But most PDF producers don't typically do this, because if you encode a table as an image, the text within the table will not be selectable or searchable in a PDF reader (for example, Adobe Reader). I am not quite sure what you meant by sample data. If @markdigi could share a couple of the PDF files from which he wants to extract tables, then I could share some further details. Thank you so much ;)
Karthik
@Karthik -- re: sample data, that's exactly what I meant. Beyond pointing the questioner to some toolkits, vaguely, we'd need a sample PDF to see if any of them would really apply. Agreed that most contemporary PDF producers wouldn't embed images, but if the questioner was working with a contemporary producer, he could probably get the data in a much more suitable form than PDF! Several times I've wanted to do this to extract hundreds of pages of protocol spec from a *recent* (~2002 era) MILSPEC document, only to find that I had to OCR the whole thing b/c it was all images.
andersoj
A: 

The PDF format is built as a collection of letters, which have no inherent format. You can think of a PDF as a page that has already been through OCR, and you are taking it from there: the letters and their coordinates are there, and the rest is up to you, namely figuring out the layout, formats, columns, and eventually tables.
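To make the "letters and their coordinates" idea concrete, here is a small Python sketch of one way to turn positioned text into a grid. It is only an illustration under stated assumptions: the input is a hypothetical list of `(x, y, text)` tuples that some PDF library has already produced, `y` grows down the page, and the tolerances `y_tol` and `x_tol` are invented values that would need tuning per document.

```python
def group_into_rows(words, y_tol=2.0):
    """Group (x, y, text) tuples into rows of (x, text) whose y values are within y_tol."""
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [sorted(row["cells"]) for row in rows]

def column_starts(rows, x_tol=5.0):
    """Cluster left edges: a new column starts when x jumps by more than x_tol."""
    starts = []
    for x in sorted(x for row in rows for x, _ in row):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts

def to_grid(words, y_tol=2.0, x_tol=5.0):
    """Place each word in the column whose start is nearest to its x position."""
    rows = group_into_rows(words, y_tol)
    starts = column_starts(rows, x_tol)
    grid = []
    for row in rows:
        cells = [""] * len(starts)
        for x, text in row:
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - x))
            cells[col] = (cells[col] + " " + text).strip()
        grid.append(cells)
    return grid

# Hypothetical positioned words, with slightly jittered coordinates.
words = [
    (10, 100, "Item"), (80, 100, "Qty"),
    (10.5, 115.4, "Widget"), (81, 114.8, "10"),
    (10, 130, "Gadget"), (80, 130.5, "3"),
]
grid = to_grid(words)
```

This handles left-aligned columns only; right-aligned numbers, merged cells, and multi-line cells all break the nearest-left-edge rule, so treat it as a starting point rather than a general answer.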

Daniel Mošmondor