answers:

5

Hi all,

I'm writing a small program to find duplicate files.

I iterate through each file in a directory,

then I load the file path and the MD5 hash of that file into a dictionary (the file path being the key).

Next I want to walk through each value in the dictionary to see if any values match, then show the two or more matching keys in a display window.

However, I'm not sure how to avoid displaying duplicate findings. For example, with these entries (key followed by hash):

1a
2b
3a
4c

If I use a foreach loop over the key/value pairs, I would get an entry for 1 matching 3 and then another for 3 matching 1.

If I had a search that only read everything below the current entry, I wouldn't have to worry about that (plus I believe it would be more efficient).

Is there a name for this type of loop? (Please excuse my lack of formal knowledge.)

Or would the best practice be to remove dictionary entries as they are found?

Thanks so much for your help.

A: 

If I understand what you are trying to do correctly:

Create a class containing the file path and MD5 hash, and make it implement the IComparable interface such that the CompareTo method compares the MD5 hashes.

Iterate through the files, creating a new object for each and adding it to an ArrayList. Then sort the ArrayList. Now all the files with the same MD5 hash will be located consecutively, so you can very easily see which files are duplicates.
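
A minimal sketch of this idea, with some assumptions: it swaps in the generic List<T> and IComparable<T> for the ArrayList and non-generic IComparable mentioned above, the names HashedFile and SortedDuplicates are made up for illustration, and the hash is assumed to be held as a string.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class: pairs a file path with its MD5 hash (as a string).
public sealed class HashedFile : IComparable<HashedFile>
{
    public string Path { get; }
    public string Md5 { get; }

    public HashedFile(string path, string md5)
    {
        Path = path;
        Md5 = md5;
    }

    // Compare by hash only, so equal hashes end up adjacent after sorting.
    public int CompareTo(HashedFile other)
    {
        if (other is null) return 1;
        return string.CompareOrdinal(Md5, other.Md5);
    }
}

public static class SortedDuplicates
{
    // Sorts the list in place, then collects each run of two or more
    // consecutive entries that share a hash.
    public static List<List<string>> Find(List<HashedFile> files)
    {
        files.Sort(); // uses HashedFile.CompareTo
        var groups = new List<List<string>>();
        int i = 0;
        while (i < files.Count)
        {
            // Advance j to the end of the run that starts at i.
            int j = i + 1;
            while (j < files.Count && files[j].CompareTo(files[i]) == 0) j++;
            if (j - i > 1)
            {
                var run = new List<string>();
                for (int k = i; k < j; k++) run.Add(files[k].Path);
                groups.Add(run);
            }
            i = j;
        }
        return groups;
    }
}
```

With the 1a, 2b, 3a, 4c example from the question, this should report a single group containing files 1 and 3. List<T>.Sort is not a stable sort, so the order of paths within a group is not guaranteed.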

Phil
How do I create an IComparable interface?
Crash893
+2  A: 

If I understand you correctly, you are using the hash to decide if two files are identical, and you are using the hash as the dictionary key. You can't have duplicate keys in a dictionary, so you'd want to have a Dictionary<Hash, IList<string>> and add any files to the list for each hash value.
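
A rough sketch of the hash-keyed dictionary, assuming the hash is stored as a hex string rather than a dedicated Hash type. DuplicateFinder, HashFile and GroupByHash are illustrative names, not something from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class DuplicateFinder
{
    // Returns the MD5 hash of a file's contents as a hex string ("AB-CD-...").
    public static string HashFile(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }

    // Maps each hash to the list of all paths whose contents produce that hash.
    public static Dictionary<string, List<string>> GroupByHash(IEnumerable<string> paths)
    {
        var byHash = new Dictionary<string, List<string>>();
        foreach (string path in paths)
        {
            string hash = HashFile(path);
            // Create the list the first time a hash is seen.
            if (!byHash.TryGetValue(hash, out List<string> list))
            {
                list = new List<string>();
                byHash[hash] = list;
            }
            list.Add(path);
        }
        return byHash;
    }
}
```

Any entry whose list holds more than one path is a set of duplicates, so each set shows up exactly once and the "1 matches 3, then 3 matches 1" problem does not arise. The paths could come from something like Directory.GetFiles on the folder being scanned.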

Lee
He's using the path as the key, but you've hit on a better way of counting the duplicates here.
grenade
If you use Lee's suggestion of hashes as keys and paths as values the counting will already be done for you when the dictionary is populated.
grenade
That is a good idea.
Crash893
A: 

It really depends on whether you want to keep the 'duplicate' data and just not print it out, or whether you truly do not want the data in the dictionary. That's a decision only you can make in relation to your program.

cyberconte
A: 

When you read the files and create their hashes, you could simply employ a second list that you throw your hash values into. Before inserting, you would then check whether the list already contains an item with the new value.

This approach has a little memory overhead but saves some loop iterations.
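
One possible sketch of that check, assuming a HashSet<string> stands in for the second list (its Add method returns false when the value is already present, so no separate Contains call is needed). SeenHashes and FindRepeats are hypothetical names.

```csharp
using System.Collections.Generic;

public static class SeenHashes
{
    // Takes (path, hash) pairs in the order the files are read and returns
    // the paths whose hash had already been seen on an earlier file.
    public static List<string> FindRepeats(IEnumerable<KeyValuePair<string, string>> pathsAndHashes)
    {
        var seen = new HashSet<string>();
        var repeats = new List<string>();
        foreach (KeyValuePair<string, string> kvp in pathsAndHashes)
        {
            // Add returns false when the hash is already in the set.
            if (!seen.Add(kvp.Value))
            {
                repeats.Add(kvp.Key);
            }
        }
        return repeats;
    }
}
```

Note that this only flags the later copies: for the 1a, 2b, 3a, 4c example it reports file 3 but not file 1, so it suits cases where knowing that a file is a repeat is enough.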

Frank Bollack
A: 

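Assuming that dict is a Dictionary<string, string> with the filename as the key and the MD5 hash as the value, you could use the following code to display duplicate files (it needs using directives for System, System.Collections.Generic and System.Linq):

// Group the entries by hash and keep only the hashes shared by more than one file.
var groupedByHash = from kvp in dict
                    group kvp by kvp.Value into grp
                    where grp.Count() > 1
                    select grp;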

foreach (IGrouping<string,KeyValuePair<string,string>> grp in groupedByHash)
{
    Console.WriteLine("Hashcode : {0}", grp.Key);
    foreach(KeyValuePair<string,string> kvp in grp)
    {
        Console.WriteLine("\tFilename = {0}", kvp.Key);
    }
    Console.WriteLine();
}
Thomas Levesque