Hi all,

I find myself needing to generate a checksum for a string of data, for consistency purposes. The broad idea is that the client can regenerate the checksum based on the payload it receives and thus detect any corruption that took place in transit. I am vaguely aware that there are all kinds of mathematical principles behind this kind of thing, and that it's very easy for subtle errors to make the whole algorithm ineffective if you try to roll it yourself.

So I'm looking for advice on a hashing/checksum algorithm with the following criteria:

  • It will be generated by Javascript, so needs to be relatively light computationally.
  • The validation will be done by Java (though I cannot see this actually being an issue).
  • It will take textual input (URL-encoded Unicode, which I believe is ASCII) of a moderate length; typically around 200-300 characters and in all cases below 2000.
  • The output should be ASCII text as well, and the shorter it can be the better.

I'm primarily interested in something lightweight rather than getting the absolute smallest potential for collisions possible. Would I be naive to imagine that an eight-character hash would be suitable for this? I should also clarify that it's not the end of the world if corruption isn't picked up at the validation stage (and I do realise that this will not be 100% reliable), though the rest of my code is markedly less efficient for every corrupt entry that slips through.

Edit - thanks to all who contributed. I went with the Adler32 option: it is natively supported in Java, extremely easy to implement in Javascript, fast to calculate at both ends, and has an 8-character output (32 bits written as hex), so it was exactly right for my requirements.

(Note that I realise that the network transport is unlikely to be responsible for any corruption errors and won't be folding my arms on this issue just yet; however adding the checksum validation removes one point of failure and means we can focus on other areas should this reoccur.)

+1  A: 

Use a JavaScript implementation of SHA-1. It's not as slow as you think (Firefox 3.0 on a 2.4 GHz Core 2 Duo hashes over 100 KB per second).

porneL
+1  A: 

Google CRC32: fast, and much lighter weight than MD5 et al. There is a Javascript implementation here.

j_random_hacker
+7  A: 

CRC32 is not too hard to implement in any language, it is good enough to detect simple data corruption, and when implemented well it is very fast. However, you can also try Adler32, which is almost as good as CRC32 but even easier to implement (and about equally fast).

Adler32 in the Wikipedia

CRC32 JavaScript implementation sample

Both are available in Java right out of the box (java.util.zip.CRC32 and java.util.zip.Adler32).
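
To give an idea of how little code Adler32 needs on the JavaScript side, here is a minimal sketch. It assumes the input contains only ASCII characters (which holds for URL-encoded text), so each character code can be treated as one byte. The function names `adler32` and `toHex8` are made up for this illustration and do not come from any library.

```javascript
// Adler-32 over a string of ASCII characters (code < 128).
// Assumption: non-ASCII input would first have to be encoded to bytes
// (e.g. UTF-8) to match what a byte-oriented implementation computes.
function adler32(str) {
  var MOD_ADLER = 65521; // largest prime below 2^16
  var a = 1, b = 0;
  for (var i = 0; i < str.length; i++) {
    a = (a + str.charCodeAt(i)) % MOD_ADLER; // running sum of bytes
    b = (b + a) % MOD_ADLER;                 // running sum of the sums
  }
  // b forms the high 16 bits, a the low 16 bits; >>> 0 keeps it unsigned
  return ((b << 16) | a) >>> 0;
}

// Hypothetical helper: fixed-width, eight-character hex form of the checksum
function toHex8(n) {
  return ("00000000" + n.toString(16)).slice(-8);
}

var payload = "Wikipedia";
console.log(toHex8(adler32(payload))); // 11e60398
```

On the Java side, feeding the same ASCII bytes to java.util.zip.Adler32 and comparing getValue() against the transmitted number should give a match for an uncorrupted payload.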

Mecki
CRC32, definitely. It was designed for exactly what you describe.
Die in Sente
A word of caution: the JavaScript in the link implements the algorithm with a table[256] of literal values. If even a single digit of that table gets modified, you will have a nasty bug that is very, very hard to find! I prefer implementations that generate the table on the first call.
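
A rough sketch of the "generate the table on the first call" approach follows. It is not the code from the linked page; it uses the standard reflected CRC-32 polynomial 0xEDB88320 and assumes ASCII-only input so that each character code is a single byte. The names `makeCrcTable` and `crc32` are chosen for this illustration.

```javascript
var crcTable = null; // built lazily on first use instead of being typed in

// Compute the 256-entry lookup table from the polynomial
function makeCrcTable() {
  var table = [];
  for (var n = 0; n < 256; n++) {
    var c = n;
    for (var k = 0; k < 8; k++) {
      c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
    }
    table[n] = c >>> 0; // store as unsigned 32-bit value
  }
  return table;
}

// CRC-32 of an ASCII string (assumption: every char code is below 256)
function crc32(str) {
  if (crcTable === null) {
    crcTable = makeCrcTable();
  }
  var crc = 0xFFFFFFFF;
  for (var i = 0; i < str.length; i++) {
    crc = crcTable[(crc ^ str.charCodeAt(i)) & 0xFF] ^ (crc >>> 8);
  }
  return (crc ^ 0xFFFFFFFF) >>> 0; // final inversion, kept unsigned
}

// Standard check value for CRC-32
console.log(crc32("123456789").toString(16)); // cbf43926
```

Because the table is derived rather than pasted, a single well-known check value such as the one above is enough to test the whole thing.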
Die in Sente
I'll second @D.i.S's comment. Testability is a minus.
Jason S
+1  A: 

Javascript implementation of MD4, MD5 and SHA1. BSD license.

jetxee
+3  A: 

Are you aware that both TCP and UDP (and IP, and Ethernet, and...) already provide checksum protection to data in transit?

Unless you're doing something really weird, if you're seeing corruption, something is very wrong. I suggest starting with a memory tester.

Also, you receive strong data integrity protection if you use SSL/TLS.

derobert
Yes, I was aware of that, though you were right to point it out. Unfortunately this is input coming from the world at large, so we need to be able to cope with this anyway (malicious/mischievous users could mangle it, for example).
Andrzej Doyle
It might be worth pointing out that for any change-detection algorithm, there is always a chance that it won't detect an error. They all can have collisions or false-negatives, though usually the more expensive algorithms reduce this chance to near-astronomically small probabilities.
Alan Krueger
@dtsazza: I wonder about the malicious/mischievous users who can mangle packets going across the network, but can't defeat Javascript. Or Adler32.
derobert
+2  A: 

Other people have mentioned CRC32 already, but here's a link to the W3C implementation of CRC-32 for PNG, as one of the few well-known, reputable sites with a reference CRC implementation.

(A few years back I tried to find a well-known site with a CRC algorithm or at least one that cited the source for its algorithm, & was almost tearing my hair out until I found the PNG page.)

Jason S