I am in the classic scenario where the business hands you a bunch of new PDF forms for the new year with no revision notes whatsoever, and you are supposed to figure out what has changed from the previous year's versions.

I am talking about loads of forms here, so I am trying to find a way to compare PDFs and highlight the differences without having people manually go through each and every one of them.

My idea was to extract all the text from the PDFs, dump it into .txt files, and then run a diff on the text files, but that sounds horrible.

My question says programmatically, but I'd be happy with any reliable tool for comparing PDFs, and I'm mainly looking to learn from people's experiences. I'm also willing to entertain any programmatic solution (preferably in C#, but please throw out any ideas).

+3  A: 

There are quite a few software products that claim to diff PDFs. I've never needed to use one, but if this is going to be a recurring process, I think it would be wise for your company to invest in one of them. Just Google "pdf diff" for a bunch of potential applications.

Additionally, your situation is very similar to this question: http://stackoverflow.com/questions/145657/how-to-compare-two-pdf-files. The discussion there may help.

Sorax
Thanks for that. That question is indeed very similar (for some reason it didn't pop up when I composed mine).
JohnIdol
+2  A: 

I took the approach of getting the raw data out of the PDF and then using Word, TortoiseSVN, WinMerge, etc. to take care of the comparison piece. In my case I did the comparison in a RichTextBox in C#, coloring the differences and so on, since we wanted it all within our app.
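As a rough illustration of the extract-then-diff approach (sketched in Python rather than C#, purely to keep it short): the snippet below assumes the text has already been pulled out of each PDF by whatever parser you choose, and only shows the comparison step using the standard library's difflib. The function name diff_extracted_text and the sample strings are made up for the example.

```python
import difflib


def diff_extracted_text(old_text, new_text,
                        old_name="old.txt", new_name="new.txt"):
    """Return unified-diff lines for two blocks of already-extracted PDF text.

    Assumption: the caller has extracted the text with some PDF parser;
    this function knows nothing about the PDF format itself.
    """
    # Strip trailing whitespace so extraction noise is not reported as a change.
    old_lines = [line.rstrip() for line in old_text.splitlines()]
    new_lines = [line.rstrip() for line in new_text.splitlines()]
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_name, tofile=new_name,
                                     lineterm=""))


if __name__ == "__main__":
    # Hypothetical sample text standing in for two years of the same form.
    last_year = "Form A\nValid for last year\nField 1: Name"
    this_year = "Form A\nValid for this year\nField 1: Name"
    print("\n".join(diff_extracted_text(last_year, this_year)))
```

An empty result means the extracted text is identical, which makes it easy to batch over many form pairs and only hand the non-empty diffs to a person. Note that this only catches text changes; layout or field-position changes would go unnoticed.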

Here is what I did: PDF comparison. I was trying to compare mixed documents, both Word and PDF.

However, I would recommend PDFBox for the parsing, as it is a bit more elegant, although iTextSharp worked out OK.

Aaron
+1  A: 

I wrote a blog post suggesting some approaches to comparing PDF files at http://www.jpedal.org/PDFblog/?p=561

mark stephens
+2  A: 

I am a developer of the Docotic.Pdf library. We use PDF comparison in our unit tests to check that a test produces the expected PDF. A PDF is a collection of special objects, and we compare all of the PDF objects while ignoring some properties such as trailer IDs and creator info. This implementation works fine.
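To make the idea of "compare all objects but ignore volatile properties" concrete, here is a minimal Python sketch. It is not Docotic.Pdf's implementation; it assumes each document has already been parsed into nested dicts and lists by some PDF library, and the set of ignored keys is only a guess at what one might want to skip. The names strip_volatile and documents_equivalent are hypothetical.

```python
# Assumed set of keys to ignore; adjust to whatever your parser exposes.
IGNORED_KEYS = frozenset({"ID", "Creator", "Producer", "CreationDate", "ModDate"})


def strip_volatile(obj, ignored=IGNORED_KEYS):
    """Recursively drop ignored keys from nested dicts and lists."""
    if isinstance(obj, dict):
        return {key: strip_volatile(value, ignored)
                for key, value in obj.items() if key not in ignored}
    if isinstance(obj, list):
        return [strip_volatile(item, ignored) for item in obj]
    return obj


def documents_equivalent(doc_a, doc_b, ignored=IGNORED_KEYS):
    """True when two parsed documents match after volatile keys are removed."""
    return strip_volatile(doc_a, ignored) == strip_volatile(doc_b, ignored)


if __name__ == "__main__":
    # Hypothetical parsed structures standing in for two generated PDFs.
    first = {"ID": ["abc"], "Info": {"Creator": "tool 1", "Title": "Form"},
             "Pages": [{"Contents": "BT (Hello) Tj ET"}]}
    second = {"ID": ["xyz"], "Info": {"Creator": "tool 2", "Title": "Form"},
              "Pages": [{"Contents": "BT (Hello) Tj ET"}]}
    print(documents_equivalent(first, second))
```

Like the equality check described here, this gives only a yes/no answer; it does not say where the documents differ.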

You can try the method PdfDocument.DocumentsAreEqual. This method only tells you whether the documents are equal; it does not report specific differences. You may contact us if you need more functionality.

Vitaliy Shibaev