tags:

views: 157
answers: 7

I'm looking for an efficient way to tell whether or not a string (or a file) has changed since the last time we looked at it.

So, we run this function against 1,000,000 files/strings (each file/string is less than 1000 bytes), and store the output for each file/string.

I'll then wait a few days, run this again, and find out whether or not each file/string has changed...

Should I calculate CRC32s for each file? MD5? Something else more efficient?

Is CRC32 good enough for telling me whether or not a file/string has changed?

EDIT: It has to work with both files and strings, so timestamps on the files are out of the question.

+1  A: 

For files, do you have to look at the content? The filesystem will track a modified timestamp.

Joe
It has to work with both strings and files. Added clarification in the post.
Keith Palmer
A: 

In Java you can do:

// returns the last-modified time in milliseconds since the epoch
// (0L if the file does not exist or an I/O error occurs)
File file = new File(filePath);
long lastModified = file.lastModified();
Read the post again, it has to work for files and strings, so timestamps don't cut it.
Keith Palmer
A: 

I use MD5 for this type of thing, seems to work well enough. If you're using .NET, see System.Security.Cryptography.MD5CryptoServiceProvider.
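The same idea sketched in Python's standard `hashlib` (the logic mirrors what the .NET class above gives you): store each digest on the first pass, then recompute and compare on the next.

```python
import hashlib

def digest(data: bytes) -> str:
    """MD5 hex digest of a byte string (16-byte digest, 32 hex chars)."""
    return hashlib.md5(data).hexdigest()

# First pass: store the digest. Later pass: recompute and compare.
stored = digest(b"hello world")
changed = digest(b"hello world!") != stored   # content differs -> True
```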

jsr
+1  A: 

CRC32 or CRC64 will do the job just fine.

You might even be able to use it as a basis for some sort of hash lookup.
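A rough sketch of that in Python using `zlib.crc32` from the standard library; the checksum doubles as a cheap key for a hash lookup (names here are illustrative):

```python
import zlib

def checksum(data: bytes) -> int:
    # CRC32 of the payload, masked to a 32-bit unsigned value
    return zlib.crc32(data) & 0xFFFFFFFF

# Using the CRC as the key of a hash lookup, as suggested above.
index = {}
for s in (b"alpha", b"beta", b"alpha"):
    index.setdefault(checksum(s), []).append(s)
```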

EvilTeach
+1  A: 

For the files you could use the timestamp.

For the strings, you could keep a backup copy.

Just comparing them and re-writing the backup might be as fast as CRC or MD5.
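One way to sketch that compare-and-rewrite cycle in Python (the in-memory `backup` dict here is a stand-in for whatever store you actually use):

```python
def changed_since_last_check(key, current: bytes, backup: dict) -> bool:
    """Compare `current` against the stored copy, then refresh the copy."""
    previous = backup.get(key)
    backup[key] = current          # re-write the backup copy
    return previous != current

backup = {}
first = changed_since_last_check("doc1", b"v1", backup)   # no copy yet -> True
second = changed_since_last_check("doc1", b"v1", backup)  # identical -> False
```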

Mike Dunlavey
Storing an extra copy of the strings will *double* the size of the database. I'd rather avoid that if possible.
Keith Palmer
@Keith: Well that's a time-space trade-off you can decide. I assume since the strings are not files then they are in memory, so are somewhere between 10^6 and 10^8 bytes, so having a copy might be unpleasant but not outrageous, especially if the copy is on disk, but obviously that's your call. I was just thinking for performance it's hard to beat **memcmp**.
Mike Dunlavey
A: 

You said the data would be around one million 1 kB strings/files and you want to check it every few days. If that's true you really don't have to worry about performance: processing 1 GB of data won't take that long, whether you use crc32 or md5.

I suggest using md5, because it's less likely to collide than crc32. Crc32 will do the job, but you can get a better result without investing much more.

Edit: As someone else stated, comparing the strings to a backup copy is faster, because you can abort as soon as two chars differ. This is not 100% true if you have to read the string out of a file. If we assume the strings come out of files and you use md5, you'll have to read 32 bytes (the stored hex digest) plus the string itself for every string you want to compare. When you compare byte by byte, you'll have to read a minimum of 2 bytes and a maximum of two times the string length. So if many of your strings share a long common prefix (more equal chars than 32 plus the average string length), you'll be faster with a hash. (Correct me if I'm wrong.) Because this is a theoretical case, you'll be fine sticking with a char-by-char comparison. If the average string length is bigger than 32 bytes, you'll also save disk space by using a hash ;-).
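The arithmetic above can be made concrete with some hypothetical numbers (an average string length of 1000 bytes, a 32-byte stored hex digest):

```python
L = 1000                 # hypothetical average string length in bytes
hash_read = 32 + L       # stored 32-byte hex digest + the string to hash
best_compare = 2         # byte-by-byte: first byte of each copy differs
worst_compare = 2 * L    # byte-by-byte: both copies read in full
```

So hashing reads a fixed 1032 bytes per string, while direct comparison reads anywhere from 2 to 2000 bytes depending on where the first difference sits.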

But as I already stated above, performance won't be your problem when dealing with that amount of data.

svens
By "more accurate", do you mean "less likely to collide"?
Rob
Yes, exactly. I was missing the correct word in English.
svens
A: 

String comparison will be more efficient than either crc32 or md5, or any other hash algorithm proposed.

For starters you can bail out of a string comparison as soon as the two strings are different, whereas with a hashing algorithm you have to hash the entire contents of the file before you can make a comparison.

What is more, hashing algorithms have operations they must perform to generate the hash, whereas a string comparison is checking for equality between two values.

I'd imagine a string-based comparison of the files/strings that short-circuits on the first failure (per-file/string) will get you good performance.
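A short-circuiting file comparison along those lines might look like this in Python (the chunk size is an arbitrary choice):

```python
def files_differ(path_a: str, path_b: str, chunk: int = 4096) -> bool:
    # Read both files in parallel chunks; bail out on the first mismatch.
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            ca, cb = a.read(chunk), b.read(chunk)
            if ca != cb:
                return True      # differing content (or lengths)
            if not ca:
                return False     # both files exhausted, no difference found
```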

fbrereto
String-to-string comparisons will require fewer CPU operations, but twice as many memory operations. In modern processors, CPU operations are orders of magnitude faster than memory operations, so it may be that a checksum would come out on top.
Commodore Jaeger
The data has to come out of memory and into the processor for either algorithm. I don't see how a hash avoids going out to memory any more than a string compare would?
fbrereto