tags:

views: 157
answers: 7

I'm looking for an efficient way to tell whether or not a string (or a file) has changed since the last time we looked at it.

So, we run this function against 1,000,000 files/strings (each file/string is less than 1000 bytes), and store the output for each file/string.

I'll then wait a few days, run this again, and find out whether or not each file/string has changed...

Should I calculate CRC32s for each file? MD5? Something else more efficient?

Is CRC32 good enough for telling me whether or not a file/string has changed?

EDIT: It has to work with both files and strings, so timestamps on the files are out of the question.

+1  A: 

For files, do you have to look at the content? The filesystem will track a modified timestamp.

Joe
It has to work with both strings and files. Added clarification in the post.
Keith Palmer
A: 

In Java you can do:

// returns the last-modified time in milliseconds since the epoch
// (0L if the file does not exist or an I/O error occurs)
File file = new File(filePath);
long lastModified = file.lastModified();
Read the post again, it has to work for files and strings, so timestamps don't cut it.
Keith Palmer
A: 

I use MD5 for this type of thing, seems to work well enough. If you're using .NET, see System.Security.Cryptography.MD5CryptoServiceProvider.
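The same idea sketched in Python's standard `hashlib` (the logic mirrors what the .NET class above gives you): store each digest on the first pass, then recompute and compare on the next.

```python
import hashlib

def digest(data: bytes) -> str:
    """MD5 hex digest of a byte string (16-byte digest, 32 hex chars)."""
    return hashlib.md5(data).hexdigest()

# First pass: store the digest. Later pass: recompute and compare.
stored = digest(b"hello world")
changed = digest(b"hello world!") != stored   # content differs -> True
```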

jsr
+1  A: 

CRC32 or CRC64 will do the job just fine.

You might even be able to use it as a basis for some sort of hash lookup.
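A rough sketch of that in Python using `zlib.crc32` from the standard library; the checksum doubles as a cheap key for a hash lookup (names here are illustrative):

```python
import zlib

def checksum(data: bytes) -> int:
    # CRC32 of the payload, masked to a 32-bit unsigned value
    return zlib.crc32(data) & 0xFFFFFFFF

# Using the CRC as the key of a hash lookup, as suggested above.
index = {}
for s in (b"alpha", b"beta", b"alpha"):
    index.setdefault(checksum(s), []).append(s)
```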

EvilTeach
+1  A: 

For the files you could use the timestamp.

For the strings, you could keep a backup copy.

Just comparing them and re-writing the backup might be as fast as CRC or MD5.
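One way to sketch that compare-and-rewrite cycle in Python (the in-memory `backup` dict here is a stand-in for whatever store you actually use):

```python
def changed_since_last_check(key, current: bytes, backup: dict) -> bool:
    """Compare `current` against the stored copy, then refresh the copy."""
    previous = backup.get(key)
    backup[key] = current          # re-write the backup copy
    return previous != current

backup = {}
first = changed_since_last_check("doc1", b"v1", backup)   # no copy yet -> True
second = changed_since_last_check("doc1", b"v1", backup)  # identical -> False
```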

Mike Dunlavey
Storing an extra copy of the strings will *double* the size of the database. I'd rather avoid that if possible.
Keith Palmer
@Keith: Well that's a time-space trade-off you can decide. I assume since the strings are not files then they are in memory, so are somewhere between 10^6 and 10^8 bytes, so having a copy might be unpleasant but not outrageous, especially if the copy is on disk, but obviously that's your call. I was just thinking for performance it's hard to beat **memcmp**.
Mike Dunlavey
A: 

You said the data would be around one million 1 kB strings/files and you want to check it every few days. If that's true you really don't have to worry about performance: processing 1 GB of data won't take that long, whether you use crc32 or md5.

I suggest using md5, because it's less likely to collide than crc32. Crc32 will do the job, but you can get a better result without investing much more.

Edit: As someone else stated, comparing the strings to a backup copy is faster, because you can abort as soon as two chars differ. This is not 100% true if you have to read the string out of a file. If we assume the strings come out of files and you use md5, you'll have to read 32 bytes (the stored hex digest) plus the string itself for every string you want to compare. When you compare byte by byte, you'll have to read a minimum of 2 bytes and a maximum of two times the string length. So if many of your strings share a long common prefix (more equal chars than 32 plus the average string length), you'll be faster with a hash. (Correct me if I'm wrong.) Because this is a theoretical case, you'll be fine sticking with a char-by-char comparison. If the average string length is bigger than 32 bytes, you'll also save disk space by using a hash ;-).
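The arithmetic above can be made concrete with some hypothetical numbers (an average string length of 1000 bytes, a 32-byte stored hex digest):

```python
L = 1000                 # hypothetical average string length in bytes
hash_read = 32 + L       # stored 32-byte hex digest + the string to hash
best_compare = 2         # byte-by-byte: first byte of each copy differs
worst_compare = 2 * L    # byte-by-byte: both copies read in full
```

So hashing reads a fixed 1032 bytes per string, while direct comparison reads anywhere from 2 to 2000 bytes depending on where the first difference sits.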

But as I already stated above, performance won't be your problem when dealing with that amount of data.

svens
By "more accurate", do you mean "less likely to collide"?
Rob
Yes, exactly. I was missing the correct word in English.
svens
A: 

String comparison will be more efficient than either crc32 or md5, or any other hash algorithm proposed.

For starters you can bail out of a string comparison as soon as the two strings are different, whereas with a hashing algorithm you have to hash the entire contents of the file before you can make a comparison.

What is more, hashing algorithms have operations they must perform to generate the hash, whereas a string comparison is checking for equality between two values.

I'd imagine a string-based comparison of the files/strings that short-circuits on the first failure (per-file/string) will get you good performance.
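A short-circuiting file comparison along those lines might look like this in Python (the chunk size is an arbitrary choice):

```python
def files_differ(path_a: str, path_b: str, chunk: int = 4096) -> bool:
    # Read both files in parallel chunks; bail out on the first mismatch.
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            ca, cb = a.read(chunk), b.read(chunk)
            if ca != cb:
                return True      # differing content (or lengths)
            if not ca:
                return False     # both files exhausted, no difference found
```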

fbrereto
String-to-string comparisons will require fewer CPU operations, but twice as many memory operations. In modern processors, CPU operations are orders of magnitude faster than memory operations, so it may be that a checksum would come out on top.
Commodore Jaeger
The data has to come out of memory and into the processor for either algorithm. I don't see how a hash avoids going out to memory any more than a string compare would?
fbrereto