Hello

I don't know enough about VB.Net (2008, Express Edition) yet, so I wanted to ask whether there is a better way to find files with different names but the same contents, i.e. duplicates.

In the following code, I use GetFiles() to retrieve all the files in a given directory and, for each file, hash its contents with MD5 and check whether that hash is already stored in a dictionary. If it is, the file is a duplicate and I'll delete it; if not, I add the filename/hash pair to the dictionary for later:

'Get all files from directory
Dim currfile As String
For Each currfile In Directory.GetFiles("C:\MyFiles\", "File.*")
    'Check whether this hash is already stored as a value, i.e. the file is a duplicate
    If StoreItem.ContainsValue(ReadFileMD5(currfile)) Then
        'Delete duplicate
    Else
        'This hash is not yet in the dictionary -> add it
        StoreItem.Add(currfile, ReadFileMD5(currfile))
    End If
    End If
Next

Is this a good way to solve the issue of finding duplicates, or is there a better way I should know about?

Thank you.

A: 

You can optimize this routine a bit by calculating the MD5 hash only once per file (either it's a typo in the question, or you really are computing it twice).

Additionally, you can compare file lengths before calculating any hash: files of different lengths cannot have identical contents, so they never need to be hashed against each other (their hash values could theoretically still collide, but that is extremely unlikely).
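A minimal sketch of the hash-once idea, assuming a console module. ReadFileMD5 is not shown in the question, so the version here is only a hypothetical stand-in, and FindDuplicates, firstSeen and the module name are names made up for illustration. The sketch also keys the dictionary by hash rather than by filename (a change from the question's code), so the lookup is ContainsKey instead of ContainsValue, which has to scan every stored value.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

Module HashOnceSketch
    ' Hypothetical stand-in for the question's ReadFileMD5:
    ' returns the MD5 hash of the file contents as a hex string.
    Function ReadFileMD5(ByVal filePath As String) As String
        Using md5Hasher As MD5 = MD5.Create()
            Using fs As FileStream = File.OpenRead(filePath)
                Return BitConverter.ToString(md5Hasher.ComputeHash(fs))
            End Using
        End Using
    End Function

    ' Returns every file whose hash matches that of an earlier file.
    Function FindDuplicates(ByVal folder As String, ByVal pattern As String) As List(Of String)
        ' Maps hash -> first file seen with that hash.
        Dim firstSeen As New Dictionary(Of String, String)
        Dim duplicates As New List(Of String)

        For Each currfile As String In Directory.GetFiles(folder, pattern)
            ' The hash is computed once and reused for both lookup and insert.
            Dim hash As String = ReadFileMD5(currfile)
            If firstSeen.ContainsKey(hash) Then
                duplicates.Add(currfile)
            Else
                firstSeen.Add(hash, currfile)
            End If
        Next

        Return duplicates
    End Function
End Module
```

Calling FindDuplicates("C:\MyFiles\", "File.*") would return the candidates for deletion; the sketch deliberately leaves the deletion itself to the caller.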

Anton Gogolev
+3  A: 

You can optimise this as follows:

  • Iterate all the files and record the filename and length
  • Then compare (MD5) each file only with those that are the same length
  • This is one of those tasks that is called embarrassingly parallel: each comparison is independent, so you should be able to use multiple threads to do it more efficiently
  • You only need to compare one file to another once, not both ways round, i.e. if you do compare(f1, f2) then you don't need to do compare(f2, f1)
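The first two points above can be sketched roughly as follows. This is only an illustration under assumptions: HashFile, FindDuplicates, byLength and the module name are hypothetical, and the code is single-threaded. Files are first bucketed by length without reading their contents, and MD5 is computed only inside buckets that hold two or more files. Using a per-bucket dictionary of hashes also means each file is checked once, rather than pairwise in both directions.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

Module LengthFirstSketch
    ' Hypothetical helper: MD5 hash of the file contents as a hex string.
    Function HashFile(ByVal filePath As String) As String
        Using md5Hasher As MD5 = MD5.Create()
            Using fs As FileStream = File.OpenRead(filePath)
                Return BitConverter.ToString(md5Hasher.ComputeHash(fs))
            End Using
        End Using
    End Function

    Function FindDuplicates(ByVal folder As String, ByVal pattern As String) As List(Of String)
        ' Pass 1: bucket file names by length; no file contents are read here.
        Dim byLength As New Dictionary(Of Long, List(Of String))
        For Each filePath As String In Directory.GetFiles(folder, pattern)
            Dim size As Long = New FileInfo(filePath).Length
            If Not byLength.ContainsKey(size) Then
                byLength.Add(size, New List(Of String))
            End If
            byLength(size).Add(filePath)
        Next

        ' Pass 2: hash only files that share their length with another file.
        Dim duplicates As New List(Of String)
        For Each bucket As List(Of String) In byLength.Values
            If bucket.Count < 2 Then Continue For

            ' Maps hash -> first file in this bucket with that hash.
            Dim seen As New Dictionary(Of String, String)
            For Each candidate As String In bucket
                Dim hash As String = HashFile(candidate)
                If seen.ContainsKey(hash) Then
                    duplicates.Add(candidate)
                Else
                    seen.Add(hash, candidate)
                End If
            Next
        Next

        Return duplicates
    End Function
End Module
```

Because the buckets do not depend on each other, the second pass is the part that could be split across threads if hashing turns out to be the bottleneck.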

I'm sure there are many other optimisations.

Preet Sangha
Thanks guys for the tips.
OverTheRainbow