Find duplicate PDFs

+1 Q:

I'm looking for a utility that will help me find duplicate PDFs. The problem: I have thousands of PDF files, and some are duplicates. They are not easy to detect because of differing file names and small differences in file size. Is there a utility, algorithm, or library that can help me find the duplicates, or show me files that are very similar (or their degree of difference)?

+1 A:

DiffPDF looks like something that might help you.

Anthony Labarre 2010-10-03 15:19:07

DiffPDF compares two files side by side. Unfortunately, I have thousands of files to compare, so an automated solution would be best.

Elvin 2010-10-03 15:44:39

+1 A:

Create an MD5 hash for each file and store it in a database. Identical files will then sort next to each other, or you can quickly search for a pre-existing key.
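
A minimal sketch of the hashing idea in Python, assuming the PDFs live under a single directory tree and using an in-memory dictionary in place of a database. The function names `file_md5` and `find_exact_duplicates` are hypothetical. Note that this only groups byte-for-byte identical files; copies that differ even slightly in size will get different hashes.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group *.pdf files under `root` by digest; return groups of 2 or more."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*.pdf")):
        if path.is_file():
            groups[file_md5(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("/path/to/pdfs")` (a placeholder path) returns a list of lists, each holding the paths that share one digest.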

Jaydee 2010-10-04 12:25:25

+1 A:

If the files were created by different tools, they could look the same but generate very different results, because they are structured totally differently internally. I made some suggestions in a blog article at http://www.jpedal.org/PDFblog/?p=561

mark stephens 2010-10-08 07:02:55

+1 A:

I remember that there is a UNIX utility called pdftotext (see the package poppler-utils). You can try to extract the text from the files and do a textual diff.
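
A rough sketch of that approach in Python, assuming the `pdftotext` command from poppler-utils is installed and on the PATH. The helper names and the 0.9 threshold are illustrative assumptions, not tuned values. Text is extracted once per file, whitespace is collapsed, and `difflib.SequenceMatcher` gives a similarity ratio between 0 and 1.

```python
import difflib
import subprocess
from itertools import combinations


def extract_text(pdf_path):
    """Extract text by running `pdftotext <file> -` (output goes to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def normalize(text):
    """Collapse all runs of whitespace so layout differences matter less."""
    return " ".join(text.split())


def similarity(text_a, text_b):
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are equal."""
    matcher = difflib.SequenceMatcher(None, normalize(text_a), normalize(text_b))
    return matcher.ratio()


def similar_pairs(texts, threshold=0.9):
    """Given {name: text}, return (name_a, name_b, score) for close pairs."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs
```

A typical use would be to build `texts = {str(p): extract_text(p) for p in pdf_paths}` and pass it to `similar_pairs`. Comparing every pair is quadratic in the number of files, so for thousands of PDFs it may be worth removing exact duplicates first, or comparing only files with a similar text length.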

Luca Martini 2010-10-08 07:08:12
