Checking for document duplicates and similar documents in a document management application

views:

226

answers:

Checking for document duplicates and similar documents in a document management application

Update: I have now written a PHP extension called php_ssdeep for the ssdeep C API to facilitate fuzzy hashing and hash comparisons in PHP natively. More information can be found over at my blog. I hope this is helpful to people.

I am involved in writing a custom document management application in PHP on a Linux box that will store various file formats (potentially 1000's of files) and we need to be able to check whether a text document has been uploaded before to prevent duplication in the database.

Essentially when a user uploads a new file we would like to be able to present them with a list of files that are either duplicates or contain similar content. This would then allow them to choose one of the pre-existing documents or continue uploading their own.

Similar documents would be determined by looking through their content for similar sentances and perhaps a dynamically generated list of keywords. We can then display a percentage match to the user to help them find the duplicates.

Can you recommend any packages for this process and any ideas of how you might have done this in the past?

The direct duplicate I think can be done by getting all the text content and

Stripping whitespace
Removing punctuation
Convert to lower or upper case

then form an MD5 hash to compare with any new documents. Stripping those items out should help prevent dupes not being found if the user edits a document to add in extra paragraph breaks for example. Any thoughts?

This process could also potentially run as a nightly job and we could notify the user of any duplicates when they next login if the computational requirement is too great to run in realtime. Realtime would be preferred however.

I'm working on a similar problem in web2project and after asking around and digging, I came to the conclusion of "the user doesn't care". Having duplicate documents doesn't matter to the user as long as they can find their own document by its own name.

That being said, here's the approach I'm taking:

Allow a user to upload a document associating it with whichever Projects/Tasks they want;
The file should be renamed to prevent someone getting at it via http.. or better stored outside the web root. The user will still see their filename in the system and if they download it, you can set the headers with the "proper" filename;
At some point in the future, process the document to see if there are duplicates.. at this point though, we are not modifying the document. After all, there could be important reasons the whitespace or capitalization is changed;
If there are dupes, delete the new file and then link to the old one;
If there aren't dupes, do nothing;
Index the file for search terms - depending on the file format, there are lots of options, even for Word docs;

Throughout all of this, we don't tell the user it was a duplicate... they don't care. It's us (developers, db admins, etc) that care.

And yes, this works even if they upload a new version of the file later. First, you delete the reference to the file, then - just like in garbage collection - you only delete the old file if there are zero references to it.

CaseySoftware 2009-11-13 13:31:55

Interesting ideas. We must not have duplicates in our system however as the documents are used across many different sites from one central location and all sites must be updated simultaneously.I was not suggesting modify the document itself. Just so the hash match could match similar documents as much as possible. Should there be a match then I would ask the user to either accept the presently available file or update it with the new file they are uploading or simply add it as another file they absolutely must.I cannot delete an old file as its not transparent to the user whats happening.

Treffynnon 2009-11-13 14:40:35

Sorry, I misphrased my response a bit.My point was that by modifying it before you compare the documents, means that you're not actually comparing the documents... you're comparing the modified documents. For example, is "Hello, my name is keith" the same sentence as "Hello, my name is Keith". Conceptually they're the same, but not capitalizing my name in the first one is probably a typo. Your proposed idea would treat these documents the same and flag one as a duplicate.

CaseySoftware 2009-11-15 19:03:16

Thats exactly what I was hoping to do. :) They are essentially the same sentence. One just has a typo in it. Therefore I want them to update the existing document rather than uploading a new one.

Treffynnon 2009-11-16 09:21:05

And my point is that you don't know which one is a Typo. ;)

CaseySoftware 2009-11-16 13:22:44

I have found a program that does what its creator, Jesse Kornblum, calls "Fuzzy Hashing". Very basically it makes hashes of a file that can be used to detect similar files or identical matches.

The theory behind it is documented here: Identifying almost identical files using context triggered piecewise hashing

ssdeep is the name of the program and it can be run on Windows or Linux. It was intended for use in forensic computing, but it seems suited enough to our purposes. I have done a short test on an old Pentium 4 machine and it takes about 3 secs to go through a hash file of 23MB (hashes for just under 135,000 files) looking for matches against two files. That time includes creating hashes for the two files I was searching against as well.

Treffynnon 2009-11-13 16:23:28

ansaurus

tags:

views:

answers:

Checking for document duplicates and similar documents in a document management application

related questions