tags:
views: 350
answers: 8

What's the easiest way to find out programmatically whether two text files are different? Given two files, I just need to know whether they differ. This is for a quick tool to help with a particularly nasty merge (one branch switched languages from VB to C# (yay!) and the other made many changes); it won't be going into production.

Possible solutions:

  1. Hash both files and compare the hash
  2. Pull the files in and just do a string compare
  3. Call out to an external diff tool (unfortunately Winmerge doesn't have a CLI for this)

If it could ignore whitespace that would be awesome, but I don't care that much about it. The main thing is that it needs to be quick and easy.

I'm using .NET 3.5 SP1, by the way. Thanks for any ideas or pointers.

+9  A: 

The fastest way is to compare the files byte by byte from a stream. Hashing both files takes too long for large files, and so do a string compare and external tools.

Comparing byte by byte is best for you, since it stops at the first differing byte and only reads to EOF when both files are identical.

With a hash compare, string compare, or external tool you have to read both files in full every time you compare; a byte-by-byte compare reads everything only when the files are identical.

Tufo
+1 : simple, efficient, 100% correct, and clearly the fastest
chburd
Technically not the fastest. The fastest approach checks file sizes first for trivial rejection. Also, the time to compute a simple hash may be minimal compared to I/O time; first make sure you are using file caching with decent-sized reads. A hash compare does not have to parse the whole file to reject, either: you can chunk the data and just compare the hashes of the chunks. A chunked hash compare using processor cache prefetching can be 2-3x faster than a naive byte compare (but it likely won't be as fast as a SIMD/SIAR compare with prefetching). Plus you can easily multithread the hashes or compares.
Adisak
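The size check plus chunked compare that this comment describes can be sketched roughly like this (method name and buffer size are illustrative, not from either answer):

    using System.IO;

    public static bool AreDifferent(string path1, string path2)
    {
        // Trivial rejection: different lengths means different files.
        if (new FileInfo(path1).Length != new FileInfo(path2).Length)
            return true;

        const int BufferSize = 64 * 1024;
        using (var s1 = File.OpenRead(path1))
        using (var s2 = File.OpenRead(path2))
        {
            var buf1 = new byte[BufferSize];
            var buf2 = new byte[BufferSize];
            int read1;
            while ((read1 = s1.Read(buf1, 0, BufferSize)) > 0)
            {
                // Fill buf2 with the same number of bytes from the second stream.
                int read2 = 0;
                while (read2 < read1)
                {
                    int n = s2.Read(buf2, read2, read1 - read2);
                    if (n == 0) return true; // can't happen after the length check, but be safe
                    read2 += n;
                }
                for (int i = 0; i < read1; i++)
                    if (buf1[i] != buf2[i]) return true;
            }
        }
        return false;
    }

Reading in blocks keeps the I/O sequential and large, and the inner loop still exits at the first differing byte.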
+2  A: 

Would an MD5 hash do to compare the two files? Here's an example.

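A minimal sketch of the MD5 approach (the method name is illustrative). Note that it always reads both files in full, so it can't exit early on a difference:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    public static bool FilesHaveSameMd5(string path1, string path2)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash1, hash2;
            using (var s1 = File.OpenRead(path1))
                hash1 = md5.ComputeHash(s1);   // ComputeHash resets the instance, so reuse is fine
            using (var s2 = File.OpenRead(path2))
                hash2 = md5.ComputeHash(s2);
            return BitConverter.ToString(hash1) == BitConverter.ToString(hash2);
        }
    }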
Hope this helps, Best regards, Tom.

tommieb75
+1 I've implemented this solution. Works great every time.
George
Yeah, great results, but bad performance: try comparing two 100 MB files and it will take a long time no matter how different they are, whereas a byte-by-byte compare stops the whole process at the first differing byte it finds.
Tufo
+1 @Tufo -- good point.
George
@Tufo: Do the hash in chunks and compare the hash for each chunk. You can do it in 4K or 16K chunks and run them in different threads. If you have multiple cores, it may run faster than a naive byte compare.
Adisak
+10  A: 

There is an article in the Microsoft Knowledge Base; I hope it helps. It compares the bytes to see whether two files are different: How to create a File-Compare function in Visual C#

Steffen
A: 

From the question - Easiest & Text file

using (StreamReader sr1 = new StreamReader(filePath1))
using (StreamReader sr2 = new StreamReader(filePath2))
{
    if (sr1.ReadToEnd() == sr2.ReadToEnd())
    {
        // do stuff
    }
}

It isn't fast or pretty, but it's easy.

Russell Steen
+4  A: 

Check byte by byte, here's some code:

public static bool AreFilesIdentical(string path1, string path2)
{
    using (FileStream file1 = new FileStream(path1, FileMode.Open, FileAccess.Read))
    using (FileStream file2 = new FileStream(path2, FileMode.Open, FileAccess.Read))
    {
        if (file1.Length != file2.Length)
        {
            return false;
        }
        while (file1.Position < file1.Length)
        {
            if (file1.ReadByte() != file2.ReadByte())
            {
                return false;
            }
        }
        return true;
    }
}
Alex LE
I'd suggest decorating the FileStream with a BufferedStream, or reading the stream in blocks.
Rafa Castaneda
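A sketch of that suggestion, wrapping each FileStream in a BufferedStream so most ReadByte calls hit memory instead of going through the stream machinery (FileStream does some internal buffering of its own, so measure before relying on this):

    using System.IO;

    public static bool AreFilesIdenticalBuffered(string path1, string path2)
    {
        using (var file1 = new BufferedStream(File.OpenRead(path1)))
        using (var file2 = new BufferedStream(File.OpenRead(path2)))
        {
            int b1, b2;
            do
            {
                b1 = file1.ReadByte();
                b2 = file2.ReadByte();
                if (b1 != b2)
                    return false; // covers a length mismatch too: one stream hits -1 first
            } while (b1 != -1);   // -1 means both streams reached EOF together
            return true;
        }
    }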
A: 

if ( $file1 != $file2 ) return true;

Of course this varies between VB and C#

polbek
+1  A: 

It also depends on what you are trying to solve. Are you trying to answer "in this directory of N files, find all the exact duplicates", or "are these two files exactly the same"?

If you are just comparing two files, a byte-by-byte check is more efficient.

But if you are trying to find all duplicate pairs among N files, an MD5 hash is better, because you can compute and store each file's hash once and compare those much smaller values pairwise. Otherwise you would be iterating over each file's byte stream once for every other file in the directory.
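The N-files case can be sketched by grouping paths under their hash, so each file is read exactly once (method name is illustrative):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;

    // Group the files in a directory by MD5; any group with more than
    // one path holds exact duplicates.
    public static Dictionary<string, List<string>> GroupByMd5(string directory)
    {
        var groups = new Dictionary<string, List<string>>();
        using (var md5 = MD5.Create())
        {
            foreach (string path in Directory.GetFiles(directory))
            {
                string key;
                using (var stream = File.OpenRead(path))
                    key = BitConverter.ToString(md5.ComputeHash(stream));

                List<string> list;
                if (!groups.TryGetValue(key, out list))
                    groups[key] = list = new List<string>();
                list.Add(path);
            }
        }
        return groups;
    }

This is O(N) file reads instead of the O(N²) reads a pairwise byte compare would need.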

Turbo
+1  A: 

I implemented a very specialized version of diff a year ago (I had files of more than 6 GB and had to compare them), so I know the internal workings of diff (lots of copy & paste, of course). Some thoughts:

  • If you simply want to know whether they are different, compare them byte by byte. Optimize by first checking whether their sizes (lengths) differ, then read the files one byte at a time and check for a difference. You don't have to worry about buffering, since your file API should do that for you (.NET does).
  • If there are rules you'd like to apply to the comparison:
    • If you want to ignore whitespace or any other character, then as you read each byte, check whether it should be ignored. If so, read the next byte, but only from that file.
    • If there are rules that apply line-wise, read the file line by line, then hash each line, ignoring whatever you want to ignore.
    • Remember that a line can be defined as a variable-length record with a newline as its terminator (separator). So you can define a line to be whatever you want, read exactly that, hash it, and compare.
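
The line-wise rule from the list above might look like this: strip whatever you want to ignore (all whitespace, in this sketch) from each line before comparing. Method names are illustrative:

    using System.IO;

    // Compare two text files line by line, ignoring whitespace and blank lines.
    public static bool EqualIgnoringWhitespace(string path1, string path2)
    {
        using (var r1 = new StreamReader(path1))
        using (var r2 = new StreamReader(path2))
        {
            while (true)
            {
                string line1 = ReadSignificantLine(r1);
                string line2 = ReadSignificantLine(r2);
                if (line1 == null || line2 == null)
                    return line1 == line2; // equal only if both files ended together
                if (line1 != line2)
                    return false;
            }
        }
    }

    // Returns the next line with spaces and tabs stripped, skipping lines
    // that become empty; null at end of file.
    private static string ReadSignificantLine(StreamReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string stripped = line.Replace(" ", "").Replace("\t", "");
            if (stripped.Length > 0)
                return stripped;
        }
        return null;
    }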

I can contribute code if you want. Diffing files is more complex, because you also have to output what is different.

Bruno Brant