I need to find the difference between two PDF files. Does any Python-related tool have a feature that directly gives the diff of the two PDFs?

+3  A: 

What do you mean by "difference"? A difference in the text of the PDF, or some layout change (e.g. an embedded graphic was resized)? The first is easy to detect; the second is almost impossible to get (PDF is a VERY complicated file format that offers endless formatting capabilities).

If you want to get the text diff, just run a PDF-to-text utility on the two PDFs and then use Python's built-in difflib module to get the difference between the converted texts.

This question deals with PDF-to-text conversion in Python: http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text.
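A minimal sketch of the text-diff approach, assuming the external pdftotext command-line tool is installed and on your PATH (any other converter would do). The helper names pdf_to_text and text_diff are made up for illustration:

```python
import difflib
import subprocess


def pdf_to_text(pdf_path):
    """Extract text by calling the external `pdftotext` tool (assumed installed)."""
    # The trailing "-" tells pdftotext to write the text to stdout, not a file.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def text_diff(old_text, new_text, old_name="old.pdf", new_name="new.pdf"):
    """Return a unified diff of two text blobs as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )
```

You would then call something like text_diff(pdf_to_text("a.pdf"), pdf_to_text("b.pdf")) and print the resulting lines; an empty list means the extracted texts are identical.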

The reliability of this method depends on the PDF generators you are using. If you use, e.g., Adobe Acrobat and some Ghostscript-based PDF creator to make two PDFs from the SAME Word document, you might still get a diff even though the source document was identical.

This is because there are dozens of ways to encode the information of the source document in a PDF, and each converter uses a different approach. Often the PDF-to-text converter can't figure out the correct text flow, especially with complex layouts or tables.

Franz
Just the text will do. The pdf generator should not be a problem.
Goutham
A: 

Check this out, it can be useful: http://pybrary.net/pyPdf/

mtasic
pyPdf was not very robust in my tests. It crashed on PDFs created by Illustrator/InDesign and other vector drawing programs. However, it might be OK for simple PDFs from Office apps. A far more solid alternative is pdftotext from the xpdf toolkit.
Franz
A: 

I do not know your use case, but for regression tests of a script that generates PDFs using ReportLab, I diff the PDFs by

  1. Converting each page to an image using Ghostscript
  2. Diffing each page image against the corresponding page image of the reference PDF, using PIL

e.g.

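from PIL import Image, ImageChops

im1 = Image.open(imagePath1)
im2 = Image.open(imagePath2)

imDiff = ImageChops.difference(im1, im2)
# getbbox() is None when the difference image is all black, i.e. the pages match
pagesDiffer = imDiff.getbbox() is not None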

This works in my case for flagging any changes introduced by code changes.
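For the first step, here is a rough sketch of how the page rendering could be scripted, assuming the Ghostscript executable is available as gs on your PATH. The function names, the default resolution, and the output file pattern are my own choices for illustration, not anything Ghostscript requires:

```python
import subprocess


def build_gs_command(pdf_path, out_pattern, dpi=100):
    """Build a Ghostscript command line that renders every page to a PNG.

    out_pattern should contain a page-number placeholder such as
    "new-%03d.png", which Ghostscript expands once per page.
    """
    return [
        "gs",
        "-q",                # suppress startup messages
        "-dNOPAUSE",         # do not wait for a keypress between pages
        "-dBATCH",           # exit after the last page
        "-sDEVICE=png16m",   # 24-bit RGB PNG output
        "-r%d" % dpi,        # rendering resolution
        "-sOutputFile=%s" % out_pattern,
        pdf_path,
    ]


def render_pages(pdf_path, out_pattern, dpi=100):
    """Run Ghostscript; raises CalledProcessError if rendering fails."""
    subprocess.run(build_gs_command(pdf_path, out_pattern, dpi), check=True)
```

Render both PDFs at the same resolution, otherwise the page images will differ in size and the pixel comparison is meaningless.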

Anurag Uniyal