I need some recommendations on processing PDF documents. These documents are annual statements and contain dollar amounts that I need to reconcile.

I have seen recommendations for the following:

1) iTextSharp
2) PDFBox (via IKVM)
3) PDFSharp
4) PDFEdit API (from Adobe)

Which of these would you recommend, and are there any limitations that I should be aware of? Besides open source, I do not mind paying for a commercial product as long as it is well supported and fully featured.

*Other information:* The PDFs are all generated by the same third-party vendor. Not all the PDFs have the same structure; there are about 10 different structures (templates).
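To illustrate how the roughly 10 templates might be handled once the text has been extracted by one of the libraries above, here is a minimal sketch. It assumes each layout can be recognised by a fixed marker string and that the total sits behind a fixed label. The template names, markers, and labels are hypothetical placeholders, not the vendor's actual wording.

```python
import re
from decimal import Decimal

# Hypothetical templates: the markers and label patterns below are made up
# for illustration and would need to be replaced after inspecting real files.
TEMPLATES = {
    "standard": {
        "marker": "Annual Statement",
        "total": re.compile(r"Total Amount:\s*\$([\d,]+\.\d{2})"),
    },
    "summary": {
        "marker": "Year-End Summary",
        "total": re.compile(r"Closing Balance\s+\$([\d,]+\.\d{2})"),
    },
}


def identify_template(text):
    """Return the name of the first template whose marker appears in the text."""
    for name, spec in TEMPLATES.items():
        if spec["marker"] in text:
            return name
    return None


def extract_total(text):
    """Pick the matching template and pull the total using its own pattern."""
    name = identify_template(text)
    if name is None:
        raise ValueError("unrecognised statement layout")
    match = TEMPLATES[name]["total"].search(text)
    if match is None:
        raise ValueError(f"no total found for template {name!r}")
    # Decimal avoids floating-point rounding when comparing money values.
    return name, Decimal(match.group(1).replace(",", ""))
```

Raising an error for an unrecognised layout, rather than guessing, means a new or changed template shows up as a failure instead of a silently wrong figure.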

I do not have a write requirement on PDF.

Many thanks in advance.

+2  A: 

My vote would be PDFSharp for the following reasons...

  • Easier to use than iTextSharp (subjective opinion)
  • Permissive licence (X11 licence)
  • I had never heard of PDFBox before ;-)
Tim Jarvis
Thanks Tim. PDFBox has since been taken over by Apache: http://pdfbox.apache.org/
Syd
A: 

They all have different strengths and weaknesses. What exactly are you trying to do?

mark stephens
@mark, I have a requirement to perform financial reconciliation on the total dollar amount stated in the PDFs (each PDF is an annual statement letter to a customer). The requirement is not unlike running OCR on printed documents to extract the metadata.
Syd
+1  A: 

You could also look at PDFText. We use it in many cases for extracting raw data from PDF files. The developer also offers other inexpensive libraries to help with other aspects of PDF manipulation.

This assumes that the document is not scanned and has data that can be extracted.
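Once a library has produced plain text, the reconciliation step itself is small. The following is a rough sketch in Python, not tied to any of the libraries mentioned here. It assumes amounts are written like `$1,234.56` and that parentheses or a leading minus sign mark a negative figure. The function names `parse_amounts` and `reconcile` are made up for this example.

```python
import re
from decimal import Decimal

# Matches amounts such as $1,234.56, $99 or ($45.50).
# Groups: open paren, minus sign, whole digits, cents, close paren.
AMOUNT_RE = re.compile(
    r"(\()?(-)?\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?(\))?"
)


def parse_amounts(text):
    """Return every dollar amount found in the text as a Decimal."""
    amounts = []
    for m in AMOUNT_RE.finditer(text):
        open_paren, minus, digits, cents, close_paren = m.groups()
        value = Decimal(digits.replace(",", "") + (cents or ""))
        # Assumption: accounting-style parentheses or a minus sign mean negative.
        if minus or (open_paren and close_paren):
            value = -value
        amounts.append(value)
    return amounts


def reconcile(text, expected_total):
    """Sum the amounts in the text and compare against an expected total."""
    total = sum(parse_amounts(text), Decimal("0"))
    return total, total == expected_total
```

For example, `reconcile(statement_text, Decimal("1166.84"))` returns the computed total together with a flag saying whether it matches. Using `Decimal` rather than `float` keeps cents exact.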

Douglas Anderson
@douglas. thanks for the link (+1). i will add to my research. one question, why did you choose this option instead of what i have listed above?
Syd
@Syd. We chose this for another project that needed to pull data from thousands of PDF files from different origins. It turned out to be the only library that worked with all the files, especially the ones from Oracle XML Publisher, which were all malformed. Since it worked so well, we turn to it every time we need PDF text extraction, and we have written an entire set of wrappers for it to pull text from different zones, etc. For the price, we find it very useful. Support from the developer has been good too.
Douglas Anderson
Thanks Douglas for giving the extra reason (+1 for your extra comments).
Syd
+1  A: 

Check out http://www.pdftron.com/. We use it to both read and write PDF documents, and it has been very reliable.

unclepaul84
@Uncle Paul84. thanks for the link (+1). i will add to my research. one question, why did you choose this option instead of what i have listed above?
Syd