Hi, I want to convert PDF data into a file format that follows our own specification. Please help me choose the right API for PDF parsing in Java or .NET. The parser should extract every component (element) from the PDF pages.
A:
There's a library called iText that does what you want. It's sort of the #1 product out there and is free as in beer.
I've worked with iText before, extracting content from PDFs, and while it's not super-duper automatic, it allows you to get at everything.
Recommended, in other words.
Carl Smotricz
2010-07-13 08:27:17
@Naimur it's licensed under the AGPL, so you may want to check license compatibility with your program.
streetpc
2010-07-13 08:31:31
In addition to that, you will need the PDF Reference to understand the format. You can find it here (ISO charges, free links at end of page): http://www.adobe.com/devnet/pdf/pdf_reference.html
Stroboskop
2010-07-13 08:33:43
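To give a feel for the kind of low-level detail the PDF Reference covers, here is a minimal sketch in plain Java (no iText involved). It only checks the two outermost markers of a PDF file: the `%PDF-x.y` header at the start and the `%%EOF` marker near the end. The class and method names (`PdfHeaderSniffer`, `headerVersion`, `hasEofMarker`) and the 1024-byte tail window are my own choices for illustration, not part of any library, and a real parser has to handle far more than this.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of iText or any other library.
class PdfHeaderSniffer {

    // Returns the version string from a leading "%PDF-x.y" header, if present.
    static Optional<String> headerVersion(byte[] data) {
        // ISO_8859_1 maps every byte to exactly one char, so binary data decodes safely.
        String head = new String(data, 0, Math.min(data.length, 16), StandardCharsets.ISO_8859_1);
        if (!head.startsWith("%PDF-")) {
            return Optional.empty();
        }
        int end = 5; // first character after "%PDF-"
        while (end < head.length()) {
            char c = head.charAt(end);
            if ((c >= '0' && c <= '9') || c == '.') {
                end++;
            } else {
                break;
            }
        }
        return end > 5 ? Optional.of(head.substring(5, end)) : Optional.empty();
    }

    // Checks whether "%%EOF" occurs in the last 1024 bytes (window size is an assumption).
    static boolean hasEofMarker(byte[] data) {
        int from = Math.max(0, data.length - 1024);
        String tail = new String(data, from, data.length - from, StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    public static void main(String[] args) {
        // Tiny in-memory stand-in for a file; a real PDF has objects, an xref table and a trailer.
        byte[] sample = "%PDF-1.4\n... objects ...\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("version: " + headerVersion(sample).orElse("none"));
        System.out.println("has EOF marker: " + hasEofMarker(sample));
    }
}
```

For a real file you would read the bytes with `java.nio.file.Files.readAllBytes` and pass them in. Everything between those two markers (objects, streams, cross-reference table) is where the PDF Reference becomes necessary.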
Thanks a lot, Carl. Actually, I have already started using iText. I want to know how to get text that is stored in a non-readable form (some complex PDF structures). I also need to determine the type of PDF form: whether it is a scannable form, a pre-printed form, or something else. Thanks in advance.
Naimur
2010-07-13 08:33:49
@streetpc: ... or get an older version, which is "just as good" and was LGPL as far as I remember
Stroboskop
2010-07-13 08:34:47
My own work with PDF was rather simple, so I can't advise on details. I had the impression, though, that iText will allow you to get at everything if you just know how to ask. You will certainly need to understand the file structure in detail. Stroboskop's comment about the PDF Reference may help you there.
Carl Smotricz
2010-07-13 08:38:24
Thanks, guys. If you have any other suggestions, please let me know.
Naimur
2010-07-13 08:42:36