I'm looking for a utility that will help me find duplicate PDFs. The problem: I have thousands of PDF files, and some are duplicates. They are not easy to detect because of differing file names and small differences in file size. Is there a utility, algorithm, or library that can help me find the duplicates, or show me which files are very similar (ideally with a degree of difference)?
DiffPDF compares two files side by side, but I have thousands of files to compare, so an automated solution would be best.
Elvin
2010-10-03 15:44:39
+1
A:
Create an MD5 hash for each file and store it in a database. Identical files will then sort next to each other, or you can quickly search for a pre-existing key.
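A minimal Python sketch of this idea, using an in-memory dictionary in place of a database. The function names `file_md5` and `find_exact_duplicates` are made up for illustration. Note that a hash only matches byte-identical files, so copies that differ slightly in size will not be grouped together.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group the *.pdf files under root by hash; return groups of 2 or more."""
    groups = defaultdict(list)
    for pdf in sorted(Path(root).rglob("*.pdf")):
        groups[file_md5(pdf)].append(pdf)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("/path/to/pdfs")` returns a list of lists, each holding the paths that share one hash.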
Jaydee
2010-10-04 12:25:25
+1
A:
If the files were created by different tools, they could look the same but produce very different results when compared, because they are structured totally differently internally. I made some suggestions in a blog article at http://www.jpedal.org/PDFblog/?p=561
mark stephens
2010-10-08 07:02:55
+1
A:
I remember that there is a UNIX utility called pdftotext (see the package poppler-utils). You can try extracting the text from the files and running a textual diff.
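A rough Python sketch of that approach, assuming the `pdftotext` command from poppler-utils is installed and on the PATH (passing `-` as the output file sends the text to stdout). The helper names and the whitespace normalization are my own assumptions, and `difflib` stands in for a textual diff to give a similarity score between 0 and 1.

```python
import difflib
import subprocess


def pdf_text(path):
    """Extract text from a PDF by running the external pdftotext command."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def normalize(text):
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def text_similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are equal."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

For two files, `text_similarity(pdf_text("a.pdf"), pdf_text("b.pdf"))` gives the degree of similarity. Comparing every pair among thousands of files this way would be slow, so it is probably worth hashing the normalized text first and only running the ratio on the files that are left over.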
Luca Martini
2010-10-08 07:08:12