views:

2462

answers:

7

I'm trying to choose a hash algorithm for comparing about max 20 different text data.

Which hash is better for these requirements?

  • Less CPU Consumption
  • Small footprint (<=32 bytes)
  • Collision is not a big deal
  • Can be generated from .NET Framework 2 (shouldn't be a 3rd party library)

I'm using hash for less memory footprint and comparison performance

A: 

If you are constrained to algorithms that exist in the framework

Is MD5 small enough (16 bytes)?

Less CPU Consumption and Small footprint are usually mutually exclusive.

http://en.wikipedia.org/wiki/Time-space_tradeoff

Greg Dean
It's small but CPU usage is not that efficient, I'd rather use MD4 (as a non-brainer) or CRC32 alike stuff.
dr. evil
Isn't md5 32 bytes?
Bill the Lizard
MD5 is 128 bits = 16 bytes
Greg Dean
That's right. I was confused because it's usually displayed as a 32 digit hex number.
Bill the Lizard
AFAIK neither MD4 or CRC32 are implemented in the BCL. If don't want to roll your own or use a 3rd party, just pay 15% tax for the convenience of MD5
Greg Dean
A: 

The FNV hash is a well-known fast hashing algorithm. It is not cryptographically secure, but it sounds like you don't need a secure hash.

Greg Hewgill
Is there an FNV implementation in the BCL?
Greg Dean
There is a BCL team blog entry about this very question: http://blogs.msdn.com/bclteam/archive/2003/10/31/49719.aspx
Greg Hewgill
+4  A: 

Paul Hsieh has a decent, simple, fast, 32-bit SuperFastHash that performs better than most existing hash functions, is easier to understand/implement, and sounds like it meets your criteria.

Robert Gamble
I've used this myself for hash maps, and it works very well (little collision) in my cases.
strager
Yeah, I heard this guy is some kind of genius or something. His hash function is awesome-sauce! :)
Paul Hsieh
A: 

Check out the serie Peter Karkowski published on his blog.

arul
Curious as to why down votes... ?
StaxMan
+5  A: 

If collision is not a big deal you can take the first letter of each document. Or you can use the length of the text or the string with the text.

tuinstoel
Or a combination of the first (few) character(s) and length. Of course, you can use the entire string as the hash. =]
strager
this sounds reasonable actually.
dr. evil
+1  A: 

How long does the hash need to hold for? GetHashCode() is pretty accessible, gives a small response (4 bytes), which should be fine (re minimizing collisions) over 20 strings.

However, GetHashCode() should not be persisted to database - it is fine for in-memory comparisons, though. Just be aware that the algorithm may change between frameworks (and did between 1.1 and 2.0).

The other advantage of this is that it is trivial to use - just use a Dictionary<string,Something>, which will deal with all the hashing etc for you.

Marc Gravell
If I use dictionary<string,smt> I assume it'll store the whole string in the memory.
dr. evil
Yes, but hashes *always* have a risk of collision if you don't follow up with an actual equality check.
Marc Gravell
+1  A: 

A very quick check would be to take the length of a text and XOR it with the first 4 bytes of it and use that as a hash. If this is good enough it is extremely fast because independent of the number of bytes of the file.

martinus