Deleting (near) Duplicate Files


What's the best scripted way to delete (near) duplicate files based on filespec in Windows (XP in this case)? I am thinking of regex and some VBScript, but if there is a better way...

Examples include filenames that differ slightly, with one or two (known) extra characters at the end or beginning, but are identical in size; files that also differ slightly in size; etc.

Is regex the best way to handle these variances if the boundaries are known?

+2 A:

No, I don't think regex is the right tool here. It sounds a bit dangerous, if you ask me. Anyway, you could calculate the Levenshtein distance between the two file names and, if it is sufficiently small (be careful with file names that consist of just a couple of characters!), delete one of the two.

The size comparison can be done using simple arithmetic.
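As a rough illustration of the idea above (a sketch, not code from the original answer), the following Python computes the Levenshtein distance between two names and combines it with a relative size check. The function names and the thresholds `max_edits`, `min_name_len`, and `size_tolerance` are assumptions chosen for the example and would need tuning against real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def looks_like_duplicate(name_a, size_a, name_b, size_b,
                         max_edits=2, min_name_len=6, size_tolerance=0.01):
    """Hypothetical rule: close names AND sizes within a relative tolerance."""
    # Very short names are skipped: a distance of 2 means little for "a.txt".
    if min(len(name_a), len(name_b)) < min_name_len:
        return False
    if levenshtein(name_a.lower(), name_b.lower()) > max_edits:
        return False
    largest = max(size_a, size_b)
    if largest == 0:
        return True  # both files are empty
    return abs(size_a - size_b) / largest <= size_tolerance
```

For example, `looks_like_duplicate("report_1.doc", 1000, "report.doc", 1005)` is accepted under these thresholds, while two five-character names are rejected outright because of the minimum length guard.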

Bart Kiers 2009-10-07 15:04:34

I share the concern about the danger of regex (it's easy to overmatch). Levenshtein may be what you're looking for, if things like character swaps/replacements are okay. If a prefix/suffix is all you expect, though, it'd be better to check only for that.
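If only a short prefix or suffix is expected, a stricter check along these lines might be enough. This is a minimal sketch: `differs_only_by_affix` and `max_extra` are hypothetical names, and it assumes the extension must match exactly.

```python
import os


def differs_only_by_affix(name_a, name_b, max_extra=2):
    """True if one stem equals the other plus 1..max_extra leading or trailing chars."""
    stem_a, ext_a = os.path.splitext(name_a.lower())
    stem_b, ext_b = os.path.splitext(name_b.lower())
    if ext_a != ext_b:
        return False
    shorter, longer = sorted((stem_a, stem_b), key=len)
    extra = len(longer) - len(shorter)
    if extra < 1 or extra > max_extra:
        return False
    # The shorter stem must appear intact at the start or the end of the longer one.
    return longer.startswith(shorter) or longer.endswith(shorter)
```

Unlike an edit-distance test, this rejects names that differ in the middle, such as `photo.jpg` and `phoXto.jpg`.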

Jefromi 2009-10-07 15:30:54

Can this method be used on all relevant file attributes to produce an overall quantifiable metric, or would it be better to use regex, assign each attribute an individual metric on arbitrary scales of equal size (weighted with, say, an importance multiplier), and then sum them?
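One way to picture the weighted-sum idea in this question is the sketch below. It is only an assumption about how such a score could be built: it uses `difflib.SequenceMatcher` from the Python standard library for name similarity (a different measure from Levenshtein distance), and the weights `name_weight` and `size_weight` are arbitrary placeholders.

```python
from difflib import SequenceMatcher


def similarity_score(name_a, size_a, name_b, size_b,
                     name_weight=0.7, size_weight=0.3):
    """Weighted average of per-attribute similarities, each scaled to 0..1."""
    name_sim = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    largest = max(size_a, size_b)
    size_sim = 1.0 if largest == 0 else 1.0 - abs(size_a - size_b) / largest
    total = name_weight + size_weight
    return (name_weight * name_sim + size_weight * size_sim) / total
```

Putting every attribute on the same 0..1 scale before weighting keeps the multipliers comparable; a cutoff on the final score would then decide which pairs count as near duplicates.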

MaSuGaNa 2009-10-07 15:35:12

@Jefromi - it's not only prefix/suffix, unfortunately, or I'd use simple string manipulation (Left/Right/Mid, etc.).

MaSuGaNa 2009-10-07 15:36:47

You can use regex to match (or near match) the filenames.

I would use regex to match the names and build a list of the matching files' sizes. You can then decide on an allowed size variance and keep only the files whose sizes fall within it.

After you have built the list of matching files, you can access the different file attributes (size, date, etc.) to flag which files to delete.
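A sketch of this approach, under stated assumptions, might look like the following. The `PREFIX` and `SUFFIX` patterns are hypothetical examples of "known" extra characters, and the rule of keeping the oldest file in each group is an arbitrary choice for illustration; the function only returns names to flag and deletes nothing.

```python
import os
import re
from collections import defaultdict

# Hypothetical "known" decorations; adjust to the real naming scheme.
PREFIX = re.compile(r"^copy of ", re.IGNORECASE)
SUFFIX = re.compile(r"(?: \(\d+\)|_\d|-copy)$", re.IGNORECASE)


def canonical_name(filename):
    """Strip the known prefix/suffix from the stem and lowercase the result."""
    stem, ext = os.path.splitext(filename)
    stem = SUFFIX.sub("", PREFIX.sub("", stem))
    return (stem + ext).lower()


def flag_for_deletion(files, size_tolerance=0.01):
    """files: iterable of (name, size, mtime). Returns names flagged as duplicates."""
    groups = defaultdict(list)
    for name, size, mtime in files:
        groups[canonical_name(name)].append((name, size, mtime))
    flagged = []
    for members in groups.values():
        members.sort(key=lambda m: m[2])  # oldest first; the oldest is kept
        keeper_size = members[0][1]
        for name, size, _ in members[1:]:
            largest = max(size, keeper_size)
            if largest == 0 or abs(size - keeper_size) / largest <= size_tolerance:
                flagged.append(name)
    return flagged
```

Note that a pattern like `_\d` also collapses `chapter_1` and `chapter_2` into the same group, which is the overmatching risk raised in the other answer; the size check is what keeps such pairs from being flagged when their contents differ in length.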

Xetius 2009-10-07 15:06:17
