views: 223
answers: 5

I'm currently writing a program that needs to compare each file in an ArrayList of variable size against every other. Right now, the way I'm doing this is with a nested loop:

    if (tempList.size() > 1) {
        for (int i = 0; i <= tempList.size() - 1; i++)
            // Nested loops.  I should feel dirty?
            for (int j = i + 1; j <= tempList.size() - 1; j++) {
                //*Gets sorted.
                System.out.println(checkBytes(tempList.get(i), tempList.get(j)));
            }
    }

I've read a few differing opinions on the necessity of nested loops, and I was wondering if anyone had a more efficient alternative.

At a glance, each comparison is going to need to be done, either way, so the performance should be fairly steady, but I'm moderately convinced there's a cleaner way to do this. Any pointers?

EDIT:: This is only a part of the function, for clarity. The files have already been compared and put into buckets based on length; after going through the map and finding a bucket with more than one file in it, it runs this. So these are all files of the same size. I will be doing a checksum comparison before I get to bytes as well, but right now I'm just trying to clean up the loop.

Also, holy cow this site responds fast. Thanks, guys.

EDIT2:: Sorry, for further clarification: I think I have a decent grasp of the file-handling part. First I compare and sort by length, then by checksum, then by bytes. The issue I have is how to properly deal with comparing all the files in the ArrayList efficiently, assuming they all need to be compared. If a nested loop is sufficient for this, that's cool; I just wanted to check that this is a suitable method, convention-wise.

+1  A: 

Comparing everything with everything else like that is bound to be O(n²). But there are tricks you can try. The main one is to make comparisons cheaper; this can be done by generating a hash code for each file and comparing those first, which will at least avoid the majority of comparisons (use a good enough algorithm and you'll avoid virtually every one). You can also speed things up if you don't need to retain information about which files are equal; produce a Set of hashcodes of each file and at the end test to see if the size of the set is the same as the size of the list of files.
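
For instance, that last check might be sketched like this (a sketch only: CRC32 is just a placeholder for whatever content hash you choose, and the class and method names are made up for illustration):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.CRC32;

class DistinctCheck {
    // True if no two files in the list share a content hash, i.e. (very
    // probably) no duplicates exist.  CRC32 stands in for a better hash.
    static boolean allDistinct(List<File> files) throws IOException {
        Set<Long> hashes = new HashSet<Long>();
        for (File f : files) {
            CRC32 crc = new CRC32();
            crc.update(Files.readAllBytes(f.toPath()));
            hashes.add(crc.getValue());
        }
        return hashes.size() == files.size();
    }
}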

Donal Fellows
Note that I'm assuming you're comparing for equality here. If not, and you can't capture the essence of what you're comparing for in a hash, then you've already got the best basic algorithm.
Donal Fellows
Depending on the content of the files, this could actually be slower (it will be when they're very long and the contents are random), because comparisons can terminate early while a typical hashCode() implementation would look at the entire file. Of course you could just hash a part of the file, but then you could get lots of collisions; and the comparison doesn't necessarily have to be sequential either.
Michael Borgwardt
+2  A: 

A good optimization would be to first calculate all the hashes of the files and then do a single loop over the list.

This is basically because you'd anyway have to check each pair of files in your list, but with precomputed hashes each check costs only O(1) instead of recalculating a lot of things for every pair you compare.

You can do something like this:

HashSet<YourFile> fileSet = new HashSet<YourFile>();
ArrayList<YourFile> files = new ArrayList<YourFile>();

class YourFile
{
  int hashcode = -1;

  public int hashCode()
  {
     // Override it to provide a hashcode based on the file contents.
     // The value is cached so it is only calculated once.

     if (hashcode == -1)
       hashcode = calculateIt();

     return hashcode;
  }

  public boolean equals(Object other)
  {
     // Needed so that HashSet.contains() can match two different YourFile
     // objects: equality here is based solely on the content hash, which is
     // the tricky part (see the comments below about collisions).

     return other instanceof YourFile
         && ((YourFile) other).hashCode() == hashCode();
  }
}

// fill up files
files.add(...);

// do comparisons
for (YourFile f : files)
{
  if (fileSet.contains(f))
  {
    // A file with the same content hash is already in the set, so f is
    // treated as a duplicate.  This is a tricky use of hashCode()/equals(),
    // so be careful with it!
  }
  else
  {
    // There's no file with the same hashcode yet, so just add this one.
    fileSet.add(f);
  }
}

This actually drops the inner loop: when you use fileSet.contains it effectively checks against all the already-added files, but with O(1) complexity.

As doublep pointed out, you have to be careful about performance, since when you plainly check bytes you can stop as soon as you find two different bytes, while calculating the hash requires reading the whole file. This works well when you have many files or when the files are rather small; the best thing to do would be to benchmark both approaches and see if there are notable differences.

Jack
This algorithm is a bit wrong. Your code doesn't deal with the case where the `hashCode` for two files is equal, but the files are not equal. Since you are using `hashCode` which returns only `2**32` distinct values, the probability of this happening cannot be ignored.
Stephen C
According to the birthday paradox you'd need at least 2*(2**16) files to have a collision with considerable probability. In practice, however, you'll end up with a small number of them (or at least I suppose we're not talking about millions of files), so we can just check the files with a normal comparison if their hashes turn out equal. That shouldn't kill performance.
Jack
+2  A: 

Depending on what exactly you are doing, you might get considerable speedup by never comparing files of different sizes. Among files of the same size compare only those with the same hash (by whatever algorithm), as suggested in other answers.
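
For instance, the size bucketing might be sketched like this (names are illustrative only; the question's EDIT notes the files are already grouped by length in much this way):

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SizeBuckets {
    // Bucket the files by length; only buckets with more than one file
    // need any content comparison at all.
    static Map<Long, List<File>> groupBySize(List<File> allFiles) {
        Map<Long, List<File>> bySize = new HashMap<Long, List<File>>();
        for (File f : allFiles) {
            List<File> bucket = bySize.get(f.length());
            if (bucket == null) {
                bucket = new ArrayList<File>();
                bySize.put(f.length(), bucket);
            }
            bucket.add(f);
        }
        return bySize;
    }
}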

EDIT:

Computing the hash can be counterproductive, though. First, never do it if you are comparing one file against just one other: you need to read the file completely to build a hash, and one read is already enough for the comparison, so you will gain nothing.

Second, if you rarely expect a match and the files will actually differ considerably (and early on), computing the hash might be counterproductive regardless of the number of files to compare. That's because a comparison in such a situation will fail early (i.e. without reading the whole file), while building a hash requires a full read. Alternatively, you may build a "partial" hash (e.g. a hash of the first 10 KB of a file), but then remember to use equal-sized chunks of all files.
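
A partial hash along those lines might look roughly like this (a sketch using MD5 over at most the first 10 KB; the method name is made up, and every file must be hashed over the same-sized chunk for the comparison to be meaningful):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class PartialHash {
    // Hash of at most the first 10 KB of the file: cheap to compute, but
    // files that only differ later on will collide, so a full comparison
    // must still follow a match.
    static byte[] hashFirst10Kb(File f) throws IOException, NoSuchAlgorithmException {
        byte[] buffer = new byte[10 * 1024];
        FileInputStream in = new FileInputStream(f);
        try {
            int read = in.read(buffer);               // may be less than 10 KB for small files
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(buffer, 0, Math.max(read, 0));  // read == -1 for an empty file
            return md.digest();
        } finally {
            in.close();
        }
    }
}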

doublep
A: 

One tiny cleanup would be to remove the initial size test: if the size is less than 2, the loops simply fall through without doing any comparisons. Better adherence to Java coding conventions would be, in the loops, to compare i < tempList.size() instead of i <= tempList.size() - 1; that will simply make your code easier for other programmers to understand. Neither of these changes has any impact on performance.

for (int i = 0; i < tempList.size(); i++)
    for (int j = i + 1; j < tempList.size(); j++) {
        //*Gets sorted.
        System.out.println(checkBytes(tempList.get(i), tempList.get(j)));
    }
Carl Manaster
Thank you, that was a bit daft of me.
KGVT
Subquestion: This function executes multiple times over the course of the program, and I'm expecting most of the ArrayLists produced to be of size 1, since this program is checking for duplicate files and most files will (hopefully) be unique. Removing the if statement means it checks and enters the first for loop, then checks and fails the second for loop, so it has performed two comparisons instead of one. It's relatively minor, but is this still an appropriate change to make? Or does expecting it to fail most of the time negate the need for it?
KGVT
@KGVT: I don't believe this can make a measurable difference on any modern system. Yes, technically, it's one extra comparison in the common case, but it's just not big enough to matter. If your program is too slow, instrument it and find the bottlenecks; I don't believe this will be one of them.
Carl Manaster
The first loop should actually have the condition `i < tempList.size() - 1`; otherwise, on the last pass through the first loop, the second loop never runs.
Gabe
@Gabe: There's no harm in a loop running zero times.
Carl Manaster
+2  A: 

My answer to your EDIT2 question is in two parts.

The first part is that if you have a small number of files, then your nested loop approach should be fine. The performance is O(N**2) and the optimal solution is O(N). However, if N is small enough, it won't make much difference which approach you use. You only need to consider an alternative solution if you are sure that N can be large.

The second part spells out an algorithm that exploits file hashes to get an O(N) solution for detecting duplicates. This is what the previous answers alluded to.

  1. Create a FileHash class to represent file hash values. This needs to define equals(Object) and hashCode() methods that implement byte-wise equality of the file hashes.

  2. Create a HashMap<FileHash, List<File>> map instance.

  3. For each File in your input ArrayList:

    1. Calculate the hash for the file, and create a FileHash object for it.
    2. Look up the FileHash in the map.
    3. If you found an entry, perform a byte-wise comparison of your current file with each of the files in the list you got from the map. If you find a duplicate file in the list, BINGO! Otherwise, add the current file to the list.
    4. If you didn't find an entry, create a new map entry with the FileHash as the key, and the current file as the first element of the value list.

(Note that the map above is really a multi-map, and that there are 3rd party implementations available; e.g. in Apache commons collections and Google collections. I've presented the algorithm in the form above for the sake of simplicity.)
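
By way of illustration, here is a rough sketch of the algorithm above (SHA-1 over the full file contents stands in for the cryptographic hash, and sameBytes is a stand-in for the question's checkBytes):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Step 1: a hash value wrapper with byte-wise equals() and hashCode().
class FileHash {
    private final byte[] bytes;

    FileHash(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof FileHash && Arrays.equals(bytes, ((FileHash) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}

class DuplicateDetector {
    static void findDuplicates(List<File> files) throws IOException, NoSuchAlgorithmException {
        Map<FileHash, List<File>> map = new HashMap<FileHash, List<File>>();  // step 2
        for (File file : files) {                                             // step 3
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(Files.readAllBytes(file.toPath()));
            FileHash key = new FileHash(digest);                              // step 3.1
            List<File> sameHash = map.get(key);                               // step 3.2
            if (sameHash != null) {                                           // step 3.3
                boolean duplicate = false;
                for (File candidate : sameHash) {
                    if (sameBytes(candidate, file)) {
                        System.out.println("Duplicate: " + candidate + " and " + file);
                        duplicate = true;
                        break;
                    }
                }
                if (!duplicate) {
                    sameHash.add(file);
                }
            } else {                                                          // step 3.4
                List<File> value = new ArrayList<File>();
                value.add(file);
                map.put(key, value);
            }
        }
    }

    // Stand-in for the question's checkBytes(): byte-wise equality of contents.
    private static boolean sameBytes(File a, File b) throws IOException {
        return Arrays.equals(Files.readAllBytes(a.toPath()), Files.readAllBytes(b.toPath()));
    }
}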

Some performance issues:

  • If you use a good cryptographic hash function to generate your file hashes, then the chances of finding an entry in step 3.3 that has more than one element in the list are vanishingly small, and the chances that the byte-wise comparison will then find the files to be different are also vanishingly small. However, the cost of calculating the crypto hash will be greater than the cost of calculating a lower-quality hash.

  • If you do use a lower quality hash, you can mitigate the potential cost of comparing more files by looking at the file sizes before you do the byte-wise comparison. If you do that you can make the map type HashMap<FileHash, List<FileTuple>> where FileTuple is a class that holds both a File and its length.

  • You could potentially decrease the cost of hashing by using a hash of just (say) the first block of each file. But that increases the probability that two files may have the same hash but still be different; e.g. in the 2nd block. Whether this is significant depends on the nature of the files. (But for example if you just checksummed the first 256 bytes of a collection of source code files, you could get a huge number of collisions ... due to the presence of identical copyright headers!)

Stephen C