views: 452

answers: 7
I have some code that is really slow. I knew it would be, and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine whether I have already read a file, I hash its bytes and compare that to a list of hashes of already processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:

using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class ProgramExtensions
{
    // Hash the entire file contents with SHA-256.
    public static byte[] ToSHA256Hash(this FileInfo file)
    {
        using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
        {
            using (SHA256 hasher = new SHA256Managed())
            {
                return hasher.ComputeHash(fs);
            }
        }
    }
    // Format a byte array as a hex string prefixed with "0x".
    public static string ToHexString(this byte[] p)
    {
        char[] c = new char[p.Length * 2 + 2];
        byte b;

        c[0] = '0'; c[1] = 'x';

        for (int y = 0, x = 2; y < p.Length; ++y, ++x)
        {
            b = ((byte)(p[y] >> 4));
            c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);

            b = ((byte)(p[y] & 0xF));
            c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        }

        return new string(c);
    }
}

class Program
{
    static void Main(string[] args)
    {
        var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");

        List<string> readFileHashes = GetReadFileHashes();

        List<FileInfo> filesToRead = new List<FileInfo>();

        foreach (var file in allFiles)
        {
            // A hash we haven't seen before means the file is new.
            if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
                filesToRead.Add(file);
        }

        //read new files
    }
}

Is there any way I can speed this up?

A: 

I'd do a quick CRC hash check first, as it is less expensive. If the CRC matches, continue on with a more "reliable" hash test such as SHA to confirm.
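
A rough sketch of that two-stage idea, with a hand-rolled CRC-32 since the classic framework has no built-in one; the bitwise variant and the knownCrcs/knownShas lookups below are illustrative assumptions, not a tuned implementation:

static uint Crc32(byte[] data)
{
    // CRC-32 (IEEE polynomial, reflected form); slow bitwise variant, sketch only.
    uint crc = 0xFFFFFFFF;
    foreach (byte b in data)
    {
        crc ^= b;
        for (int i = 0; i < 8; i++)
            crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
    }
    return ~crc;
}

// Cheap CRC check first; only on a CRC match fall back to SHA-256 to rule out
// a coincidental collision. knownCrcs/knownShas are hypothetical lookup sets.
static bool AlreadyProcessed(FileInfo file, HashSet<uint> knownCrcs, HashSet<string> knownShas)
{
    byte[] bytes = File.ReadAllBytes(file.FullName);
    if (!knownCrcs.Contains(Crc32(bytes)))
        return false;               // CRC never seen => definitely a new file
    return knownShas.Contains(file.ToSHA256Hash().ToHexString());
}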

Michael G
+7  A: 

I believe you can achieve the most significant performance improvement by simply checking the file size first: if the file size does not match, you can skip the entire file and not even open it.

Instead of just saving a list of known hashes, you would also keep a list of known file sizes and only do a content comparison when the file sizes match. When the file size doesn't match, you save yourself from even looking at the file content.
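
A minimal sketch of how that could slot into the question's loop; GetReadFileSizes() is a hypothetical companion to GetReadFileHashes() that returns the sizes of the already processed files:

var knownHashes = new HashSet<string>(GetReadFileHashes());
var knownSizes  = new HashSet<long>(GetReadFileSizes());   // hypothetical helper

var filesToRead = new List<FileInfo>();

foreach (var file in new DirectoryInfo("c:\\temp").GetFiles("*.*"))
{
    if (!knownSizes.Contains(file.Length))
    {
        // No processed file has this size, so this one must be new: skip hashing.
        filesToRead.Add(file);
        continue;
    }

    // Same size as something already seen: fall back to the expensive hash check.
    if (!knownHashes.Contains(file.ToSHA256Hash().ToHexString()))
        filesToRead.Add(file);
}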

Depending on how large your files generally are, a further improvement can be worthwhile:

  • Either do a binary compare with an early abort as soon as the first differing byte is found (this saves reading the entire file, which can be a very significant improvement if your files are generally large; any hash algorithm has to read the whole file, while detecting that the first byte differs saves you from reading the rest - see the sketch after this list). If your lookup list likely contains many files of the same size, so that you would have to do a binary comparison against several files, instead consider:

  • Hashing in blocks of, say, 1 MB each. First check only the first block against the precalculated first-block hash in your lookup; only compare the second block if the first block matches. In most cases this saves reading beyond the first block for files that differ. Both options are only really worth the effort when your files are large.
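
A sketch of the early-abort comparison from the first bullet, reading both files in small chunks and bailing out at the first differing byte (the 4 KB buffer size is an arbitrary choice):

static bool FilesAreEqual(FileInfo a, FileInfo b)
{
    if (a.Length != b.Length)
        return false;                       // different sizes can never match

    const int BufferSize = 4096;
    var bufA = new byte[BufferSize];
    var bufB = new byte[BufferSize];

    using (var fsA = a.OpenRead())
    using (var fsB = b.OpenRead())
    {
        int readA;
        while ((readA = fsA.Read(bufA, 0, BufferSize)) > 0)
        {
            int readB = fsB.Read(bufB, 0, BufferSize);
            if (readB != readA)
                return false;

            for (int i = 0; i < readA; i++)
                if (bufA[i] != bufB[i])
                    return false;           // first difference found: stop reading
        }
    }
    return true;
}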

I doubt that changing the hashing algorithm itself (e.g. first doing a CRC check as suggested) would make any significant difference. Your bottleneck is likely disk I/O, not CPU, so avoiding disk I/O is what will give you the most improvement. But as always with performance, do measure.

Then, if this is still not enough (and only then), experiment with asynchronous I/O (remember though that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance).

Ben Schwehn
I need something a little more reliable than that. The files are binary files with fixed-size records. Many files can have the same number of records, making them the same size.
scottm
I certainly didn't mean to *only* check the file size, but to quickly discard files when the file size doesn't match and only further investigate files that do match using a more computationally expensive comparison. I've edited my answer to be clearer.
Ben Schwehn
I see. I also like the early bail idea because I think the first 20 bytes or so are usually different.
scottm
+1  A: 
  • Create a file list
  • Sort the list by filesize
  • Eliminate files with unique sizes from the list
  • Now do the hashing (a fast hash first might improve performance as well; see the sketch below)
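
A hedged sketch of that pipeline with LINQ (needs using System.Linq), comparing the files in one directory against each other:

var files = new DirectoryInfo("c:\\temp").GetFiles("*.*");

// Files whose size is unique cannot be duplicates of anything else, so only
// members of same-size groups need to be hashed at all.
var candidates = files
    .GroupBy(f => f.Length)
    .Where(g => g.Count() > 1)
    .SelectMany(g => g)
    .ToList();

foreach (var file in candidates)
{
    string hash = file.ToSHA256Hash().ToHexString();
    // ... compare/record the hash as before ...
}
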
Vinko Vrsalovic
A: 

Your description of the problem still isn't clear enough.

The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.

You might want to try checking the modification time, which does not change when a file is renamed:

http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx

Or you might want to monitor the folder for any new file changes:

http://www.codeguru.com/forum/showthread.php?t=436716
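
For the .NET side of that second suggestion, a minimal FileSystemWatcher sketch (the folder path is just an example); new or renamed files get handled as they appear instead of rescanning the whole directory on every run:

using System;
using System.IO;

class Watcher
{
    static void Main()
    {
        var watcher = new FileSystemWatcher("c:\\temp")
        {
            NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite
        };

        // React to newly created or renamed files instead of rescanning.
        watcher.Created += (s, e) => Console.WriteLine("New file: " + e.FullPath);
        watcher.Renamed += (s, e) => Console.WriteLine("Renamed:  " + e.FullPath);

        watcher.EnableRaisingEvents = true;

        Console.WriteLine("Watching... press Enter to quit.");
        Console.ReadLine();
    }
}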

Unknown
+1  A: 
  • Use a data structure for your readFileHashes store that has an efficient lookup capability (hashing or binary search). I think a HashSet<string> (or a sorted set) would serve you better here; see the sketch after this list.

  • Use an appropriate checksum (hash) function. SHA256 is a cryptographic hash that is probably overkill. CRC is less computationally expensive; it was originally intended for catching unintentional/random changes (transmission errors), but it is susceptible to changes that are designed/intended to be hidden. What fits the differences between the files you are scanning?

    See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes

    Would a really simple checksum via sampling (e.g. checksum = (first 10 bytes and last 10 bytes)) work?
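
Two small sketches of those points (independent fragments): a HashSet<string> makes each lookup roughly O(1) instead of an O(n) List.Contains scan, and a sampled signature built from the length plus the first and last 10 bytes avoids reading whole files; whether that is distinctive enough depends entirely on the file format:

// O(1) membership tests instead of List<string>.Contains (O(n) per file).
var readFileHashes = new HashSet<string>(GetReadFileHashes());

// Crude sampled signature: file length plus first and last 10 bytes.
static string SampleSignature(FileInfo file)
{
    const int Sample = 10;
    var head = new byte[Sample];
    var tail = new byte[Sample];

    using (var fs = file.OpenRead())
    {
        fs.Read(head, 0, (int)Math.Min(Sample, fs.Length));
        if (fs.Length > Sample)
        {
            fs.Seek(-Sample, SeekOrigin.End);
            fs.Read(tail, 0, Sample);
        }
    }

    return file.Length + ":" + BitConverter.ToString(head) + ":" + BitConverter.ToString(tail);
}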

Bert F
I'd be very surprised (but very happy to be shown otherwise) if the efficiency of the hashing algorithm made a significant difference compared to the cost of disk I/O. A modern CPU should not have any problems doing SHA256 at the speed a typical HDD can read data. Fast SSDs in RAID, perhaps... But I like your sampling idea, because it saves you from doing the I/O.
Ben Schwehn
A very good point - I just hate to see expensive algorithms used, possibly unnecessarily, just because newer is "better". So, when iPhone/LinuxOnANasChip/NetBook programmers are looking for answers on StackOverflow in the future, I'd like them to consider which algorithm they should use in a given context. :-)
Bert F
Yeah, it's easy to forget how complex an algorithm like SHA256 is when it's so easy to use in modern environments.
Ben Schwehn
A: 

First group the files by file size - this will leave you with smaller groups of files. What to do next depends on the group sizes and the file sizes. You could just start reading all files in a group in parallel until you find a difference. If there is a difference, split the group into smaller groups having the same value at the current position. If you have information about how the files differ, you can use it - start reading at the end, don't read and compare byte by byte if larger clusters change, or whatever you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel, causing random disk access.

You could also calculate hash values for all files in each group and compare them. You don't necessarily have to process the whole files at once - just calculate the hash of the first few bytes (maybe a 4 KiB cluster or whatever fits your file sizes) and check whether there are already differences. If not, calculate the hashes of the next few bytes. This gives you the possibility of processing larger blocks of each file without requiring you to keep one such large block per file in a group in memory.

So it's all about a time-memory (disk I/O vs. memory) trade-off. You have to find your way between reading all files in a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read too much data) and reading the files byte by byte and comparing only the last byte read (low memory requirement, slow random access, reads only the required data). Further, if the groups are very large, comparing the files byte by byte will become slower - comparing one byte from n files is an O(n) operation - and it might become more efficient to calculate hash values first and then compare only the hash values.
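
A rough sketch of the block-wise hashing idea (needs using System.Linq and System.Security.Cryptography; the 4 KiB chunk size is an arbitrary assumption). Chunk hashes are produced lazily, so comparing two files stops as soon as their chunk hashes diverge and the rest of the data is never read:

// Yields one hex hash per 4 KiB chunk; consumers can stop pulling chunks early.
static IEnumerable<string> ChunkHashes(FileInfo file, int chunkSize = 4096)
{
    var buffer = new byte[chunkSize];
    using (var fs = file.OpenRead())
    using (var sha = SHA256.Create())
    {
        int read;
        while ((read = fs.Read(buffer, 0, chunkSize)) > 0)
            yield return BitConverter.ToString(sha.ComputeHash(buffer, 0, read));
    }
}

// Compare two files chunk by chunk; SequenceEqual bails out at the first mismatch.
static bool ProbablySame(FileInfo a, FileInfo b)
{
    return a.Length == b.Length &&
           ChunkHashes(a).SequenceEqual(ChunkHashes(b));
}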

Daniel Brückner
A: 

updated: Definitely do NOT make file size your only check. If your OS version allows it, use FileInfo.LastWriteTime.

I've implemented something similar for an in-house project compiler/packager. We have over 8k files, so we store the last-modified dates and hash data in a SQL database. Then on subsequent runs we query first against the modified date of any specific file, and only then against the hash data... that way we only calculate new hash data for those files that appear to have been modified...

.NET has a way to check the last-modified date, in the FileInfo class... I suggest you check it out. EDIT: here is the link: LastWriteTime

Our packager takes about 20 secs to find out what files have been modified.
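
A stripped-down sketch of that approach, with an in-memory dictionary standing in for the SQL database (the cache shape is an assumption):

// Cache of path -> (last write time, hash). In the real system this lives in
// a database; here it is just an in-memory stand-in.
var cache = new Dictionary<string, Tuple<DateTime, string>>();

foreach (var file in new DirectoryInfo("c:\\temp").GetFiles("*.*"))
{
    Tuple<DateTime, string> entry;
    if (cache.TryGetValue(file.FullName, out entry) &&
        entry.Item1 == file.LastWriteTimeUtc)
    {
        continue;   // unchanged since the last run: reuse the stored hash
    }

    // New or modified file: recompute and store the hash.
    string hash = file.ToSHA256Hash().ToHexString();
    cache[file.FullName] = Tuple.Create(file.LastWriteTimeUtc, hash);
    // ... process the file ...
}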

justlost
The idea is to discard a file when the sizes are *not* equal, not to confirm a match without further checks when they are equal (no one suggested the latter; in Vinko's answer, as I understand it, both candidate and comparison files share the same list, so a unique length means a unique file - my answer might not be clear enough).
Ben Schwehn
Yeah, I got that from the answer, but it still makes no sense to me to check the file size at all, even if there are more checks after that. I'm not trying to bust anyone here... just saying the one and only check necessary is the modified date, then check the hash if you really want to be 100% sure.
justlost
Often checking LastWriteTime gives fewer false positives. But beware that in some (rare) cases the LastWriteTime can change without the content changing (e.g. when copying a file from NTFS to FAT or vice versa; NTFS stores times in UTC, FAT in local time). Or a program might decide not to update the timestamp, or update the timestamp without changing the content. So discarding solely on a different timestamp without further checks can make you miss duplicate files. How likely that is, and how bad it would be, depends on your environment. Checking the file size is reliable, so it makes some sense after all.
Ben Schwehn