I'm looking for a utility that will help me find duplicate PDFs. The problem: I have thousands of PDF files, and some are duplicates. They are not easy to detect because of differing file names and small differences in file size. Is there a utility, algorithm, or library that can help me find the duplicates, or show me which files are very similar (ideally with a degree of difference)?

+1  A: 

DiffPDF looks like something that might help you.

Anthony Labarre
DiffPDF compares two files side by side. Unfortunately, I have thousands of files to compare, so an automated solution would be best.
Elvin
+1  A: 

Create an MD5 hash for each file and store it in a database. Identical files will then sort next to each other, or you can quickly search for a pre-existing key.
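
A minimal sketch of this idea in Python, assuming an in-memory dictionary stands in for the database. The function names `file_md5` and `find_identical_files` are made up for illustration. Note that a hash only matches files that are byte-for-byte identical, so files with small size differences will not be grouped.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(root, pattern="*.pdf"):
    """Group files under root by digest; return only groups with 2+ members."""
    groups = defaultdict(list)  # stands in for the database keyed by hash
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            groups[file_md5(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_identical_files("/path/to/pdfs")` returns a list of lists, each holding the paths that share one digest.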

Jaydee
+1  A: 

If the files were created by different tools, they could look the same but produce very different results when compared, because they are structured totally differently internally. I made some suggestions in a blog article at http://www.jpedal.org/PDFblog/?p=561

mark stephens
+1  A: 

I remember that there is a UNIX utility called pdftotext (see the package poppler-utils). You can try extracting the text from the files and running a textual diff.
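
A rough sketch of the text-comparison approach, assuming poppler's `pdftotext` is installed and on the PATH. The helper names (`extract_text`, `normalize`, `similarity`) are hypothetical, and the whitespace and case normalization is my own assumption about what counts as "the same" text. The ratio from `difflib` gives a degree of difference rather than a yes/no answer.

```python
import difflib
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text ("-" sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def normalize(text):
    """Lowercase and collapse whitespace so layout differences matter less."""
    return " ".join(text.lower().split())


def similarity(text_a, text_b):
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are identical."""
    matcher = difflib.SequenceMatcher(
        None, normalize(text_a), normalize(text_b), autojunk=False
    )
    return matcher.ratio()
```

For two files this would be used as `similarity(extract_text("a.pdf"), extract_text("b.pdf"))`. Comparing every pair is quadratic in the number of files, so with thousands of PDFs it may be worth hashing the normalized text first to catch exact text matches cheaply, and reserving the ratio for the remaining candidates.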

Luca Martini