views:

645

answers:

3

Hash Tables are said to be the fastest/best way of Storing/Retrieving data.

My understanding of a hash table, hashing is as follows (Please correct me if I am wrong or Please add If there is anything more):

  • A Hash Table is nothing but an array (single or multi-dimensional) to store values.
  • Hashing is the process to find the index/location in the array to insert/retrieve the data. You take a data item(s) and pass it as a key(s) to a hash function and you would get the index/location where to insert/retrieve the data.

I have a question:

Is the hash function used to store/retrieve the data DIFFERENT from a cryptographic hash function used in security applications for authentication like MD5, HMAC, SHA-1 etc...?

In what way(s) are they different?

  • How to write a hash function in C?
  • Is there some standard or guidelines to it?
  • How do we ensure that the output of a hash function i.e, index is not out of range?

It would be great if you could mention some good links to understand these better.

+1  A: 

Bob Jenkins wrote an in-depth description of his good, if slightly outdated, hash function. The article has links to newer, better hash functions, but the writeup addresses the concerns of building a good one.

Also, most hash table implementations actually use an array of linked lists to resolve collisions. If you want to just use an array then the hash function needs to check for collisions and create a new hash index.

The cryptographic hash functions you mention could be used as hash functions for a hash table, but they are much slower than hash functions designed for a hash table. Speed makes brute force attacks easier.

RossFabricant
+4  A: 

A cryptographic hash emphasizes making it difficult for anybody to intentionally create a collision. For a hash table, the emphasis is normally on producing a reasonable spread of results quickly. As such, the two are usually quite different (in particular, a cryptographic hash is normally a lot slower).

For a typical hash function, the result is limited only by the type -- e.g. if it returns a size_t, it's perfectly fine for it to return any possible size_t. It's up to you to reduce that output range to the size of your table (e.g. using the remainder of dividing by the size of your table, which should often be a prime number).

As an example, a fairly typical normal hash function might look something like:

// warning: untested code.
size_t hash(char const *input) { 

    const int ret_size = 32;
    size_t ret = 0x555555;
    const int per_char = 7;

    while (*input) { 
        ret ^= *input++;
        ret = ((ret << per_char) | (ret >> (ret_size - per_char));
   }
   return ret;
}

The basic idea here is to have every bit of the input string affect the result, and to (as quickly as possible) have every bit of the result affected by at least part of the input. Note that I'm not particularly recommending this as a great hash function -- only trying to illustrate some of the basics of what you're trying to accomplish.

Jerry Coffin
Cryptographic hash functions are not necessarily slow. In particular, the MD4 hash function was reported to be faster than CRC32 on some platforms (ARM-based, I think). However, cryptographic hash functions tend to have a large fixed overhead, which means that they will be slow for small input messages. A function such as MD4 achieves its very high processing bandwidth (more than 600 MB/s on my 2.4 GHz Intel CPU) when input size exceeds 1 KB or so. Still, for small inputs (less than 54 bytes), my PC still computes 8 millions of MD4 per second (with a single core).
Thomas Pornin
@Thomas:First, while CRC32 can be reasonably fast, most hash functions are a fair bit faster. Second, while it was certainly intended as a cryptographic hash, MD4 doesn't really qualify anymore. It was comprehensively broken years ago -- generating a collision is about the same speed as generating the original hash. See: http://www.stachliu.com/md4coll.c for an implementation.
Jerry Coffin
@Jerry: I know that MD4 was broken, but for non-cryptographic purposes (the ones we are talking about) MD4 is fairly good; if deliberate collisions are an issue, then every non-cryptographic hash function is ruled out, by definition. When there is no security issue, MD4 can be at least envisioned. Some peer-to-peer systems use MD4 for identifying file elements.As for fast but strong cryptographic functions, there is an ongoing competition for selecting a new one. See http://en.wikipedia.org/wiki/NIST_hash_function_competition for details (I am a coauthor of one of the candidates).
Thomas Pornin
@Thomas:Your claim was that "Cryptographic hash functions are not necessarily slow." My reply is simply saying that MD4 doesn't support that assertion, because it's not a cryptographic hash function.
Jerry Coffin
A: 

The design goals are different.

With cryptographic hash functions you want, for example, that the hash and the hash function cannot be used to determine the original data or any other data that would produce the same hash.

Hash functions used with hash tables & other data structures do not need such security properties. It's often enough if the hash function is fast and it will distribute the input set evenly into the set of possible hashes (to avoid unnecessary clustering / collisions).

Anssi