Does size matter when choosing the right algorithm to use for a session hash?

I recently read this article, and it suggested using Whirlpool to create a hash for the session id. Whirlpool generates a 128-character hash string; is this too large?

The plan is to store the session hash in a db. Is there much of a difference between using a 64-character field (SHA-256), a 96-character field (SHA-384), or a 128-character field (Whirlpool)? One of the initial arguments made for Whirlpool was its speed versus other algorithms, but looking at the speed results, SHA-384 doesn't fare too badly.

There is also the option to truncate the hash to make it smaller than 128 characters.

I did modify the original code snippet to allow changing the algorithm based on my needs.

Update: There was some discussion about the string being hashed, so I've included the code.


function generateUniqueId($maxLength = null, $algo = 'whirlpool') {
    $entropy = '';

    // try OpenSSL first
    if (function_exists('openssl_random_pseudo_bytes')) {
        $entropy = openssl_random_pseudo_bytes(64, $strong);
        // discard the bytes if OpenSSL could not use a cryptographically strong source
        if ($strong !== true) {
            $entropy = '';
        }
    }

    // add a basic mt_rand/uniqid combo
    $entropy .= uniqid(mt_rand(), true);

    // try to read from the Windows RNG
    if (class_exists('COM')) {
        try {
            $com = new COM('CAPICOM.Utilities.1');
            $entropy .= base64_decode($com->GetRandom(64, 0));
        } catch (Exception $ex) {
            // CAPICOM not available; fall back to the other entropy sources
        }
    }

    // try to read from the Unix RNG
    if (is_readable('/dev/urandom')) {
        $h = fopen('/dev/urandom', 'rb');
        $entropy .= fread($h, 64);
        fclose($h);
    }

    // create the hash with the chosen algorithm (whirlpool by default)
    $hash = hash($algo, $entropy);
    // truncate the hash if a max length is imposed
    if ($maxLength) {
        return substr($hash, 0, $maxLength);
    }
    return $hash;
}
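
For illustration, a few example calls to the function above (the second parameter selects the hash algorithm, defaulting to whirlpool):

$id128 = generateUniqueId();               // full Whirlpool hash, 128 hex characters
$id32  = generateUniqueId(32);             // Whirlpool truncated to 32 characters
$id64  = generateUniqueId(null, 'sha256'); // SHA-256, 64 hex characters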
A: 

SHA1 or MD5 is probably enough for your needs. In practice, the probability of a collision is so small that it will likely never happen.

Ultimately, though, it all depends upon your required level of security. Do also keep in mind that longer hashes are both more expensive to compute and require more storage space.

Ben Herila
+3  A: 

The time taken to create the hash is not important, and as long as your database is properly indexed, the storage method should not be a major factor either.

However, the hash has to be transmitted with the client's request every time, frequently as a cookie. Large cookies can add a small amount of additional time to each request. See Yahoo!'s page performance best practices for more information. Smaller cookies, and thus a smaller hash, have benefits.

Overall, large hash functions are probably not justified. For their limited scope, good old md5 and sha1 are probably just fine as the source behind a session token.

Charles
The hashes being stored can last anywhere from hours to weeks. For one-time-use tokens, a smaller hash will be used.
Andre
If your *session* hashes are sticking around for weeks, your GC is broken and/or you have insane users. Edge cases.
Charles
The hash may hang around for weeks because user preferences are stored in a db and linked to the hash stored in a cookie, which may have an expiration of as little as 24 hours -- usually, though, cookies are set to expire 21 days from the last visit. Within that period several thousand users might have visited the site. By the way, when I refer to a session I'm not referring to the session set by PHP, which generally expires when the browser is closed. I hope this clears up any mix-ups.
Andre
Why are you doing that instead of attaching preferences to a user login and prompting users to log in?
Charles
Because not all preferences require that a user log in or for that matter be registered.
Andre
A: 

The article times out when I try to read it, but I can't think of a good reason to use a hash as a session identifier. Session identifiers should be unpredictable; given the title of the article, it sounds like the authors acknowledge that principle. Then, why not use a cryptographic random number generator to produce session identifiers?

A hash takes input, and if that input is predictable, so is the hash, and that's bad.
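
As a rough sketch of that approach (illustrative only; it reuses openssl_random_pseudo_bytes, which the question's code already relies on, and the function name here is made up):

function generateRandomSessionId($bytes = 16) {
    // 16 random bytes -> 32 hex characters; no hashing step is needed,
    // because the bytes themselves are already unpredictable
    $raw = openssl_random_pseudo_bytes($bytes, $strong);
    if ($raw === false || $strong !== true) {
        throw new Exception('No cryptographically strong random source available');
    }
    return bin2hex($raw);
}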

erickson
I've updated my original post to include the code for how the hash input is generated.
Andre
A: 

Well, the answer is, it depends. It depends on how many you need to generate and on the maximum number of ids that could live simultaneously (I suppose session ids are thrown away after a time period).

For instance, if you used 32 bits, then generating 1 + 2^32 ids guarantees a collision. You need to take into consideration how many you will be generating! Just talking about the collision probability of a particular hashing function isn't sufficient unless you also consider the number of hashes you generate with it.

I suggest you read up on Birthday Paradox and also read Birthday Attack.

The birthday paradox basically says that if you uniformly select random numbers from 1 to n, doing this ~sqrt(n) times gives you at least a 50% chance of a collision. So basically, if you take any 23 random people, there is a good chance that two of them have the same birthday (one of 365 days)!

So if the session id you assigned was the birthday of the person, with 23 people you have at least a 50% chance of a collision!

For 128 bit (uniform) hashes, with 8.2 x 10^11 (I guess 820 billion) ids, the collision probability is around 10^-15. There is an informative table in the Birthday Paradox wiki page above with these values.
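
As a quick back-of-the-envelope check (an illustrative sketch, not from the original answer), the small-probability birthday bound p ≈ k^2 / (2n) reproduces that figure:

// k ids drawn uniformly from an n-sized space collide with probability
// roughly k^2 / (2n) when that value is small
$k = 8.2e11;          // number of ids generated
$n = pow(2, 128);     // size of the 128-bit id space
printf("%.1e\n", ($k * $k) / (2 * $n)); // prints about 9.9e-16, i.e. roughly 10^-15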

128 bits would very likely be good enough, but you could probably do it with fewer bits, depending on your needs.

btw, since you are storing it in a db, you could always query the db to check if a newly generated Id has been assigned and regenerate if already in use. Or is your goal to avoid the db query? Better yet, use sequence numbers provided by the db (if users are not malicious, that is :-)).
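
For example, a sketch of the insert-and-retry approach (illustrative only; it assumes a PDO connection $db in exception mode and a sessions table with a primary key on id):

do {
    $id = generateUniqueId(32); // 32-char id from the question's code
    try {
        $stmt = $db->prepare('INSERT INTO sessions (id) VALUES (?)');
        $stmt->execute(array($id));
        $inserted = true;       // the unique key accepted the new id
    } catch (PDOException $e) {
        $inserted = false;      // duplicate key: regenerate and retry
    }
} while (!$inserted);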

Moron
@Andre: Was this of any use to you? If not, I will just delete it.
Moron
Yes it was. I had previously read about the birthday paradox but wasn't very clear in my initial post. I think my question should have spoken less about the hash itself and more about the size of the field in my db with respect to indexing. So if you're able to offer any suggestions on the sensibility of a 64-char field, 96-char field or 128-char field, I would be really grateful.
Andre
@Andre: 128 bits = 16 bytes (32 hex characters), and that is good enough for more than a billion simultaneous IDs according to that wiki page.
Moron
+1  A: 

Yes, size matters.

If it's too short, you run the risk of collisions. You also make it practical for an attacker to find someone else's session by brute-force attack.

Being too long matters less, but every byte of the session ID has to be transferred from the browser to the server with every request, so if you're really optimising things, you may not want an ID that's too long.

You don't have to use all the bits of a hash algorithm, though - there's nothing stopping you from using something like Whirlpool, then only taking the first 128 bits (32 characters in hex). Practically speaking, 128 bits is a good lower bound on length, too.
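
For instance (a sketch reusing the hash() call from the question's code):

// keep only the first 128 bits (32 hex characters) of the Whirlpool output
$sessionId = substr(hash('whirlpool', $entropy), 0, 32);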

As erickson points out, though, using a hash is a bit odd. Unless you have at least as much entropy as input as the length of the ID you're using, you're vulnerable to attacks that guess the input to your hash.

Nick Johnson
If I'm only going to take the first 32 characters, wouldn't it then make sense to just go ahead and use md5? I say this because wouldn't truncating the Whirlpool hash increase the chance of collision, since the likelihood of the first 32 characters being the same across any number of hashes would be higher than with md5? This assumption is based on md5 natively generating a 32-character hex string, as opposed to the 128 characters from a Whirlpool hash. I could be totally off in this regard.
Andre
There are two distinct causes of 'weakness' in a hash. There's the construction of the hash itself, and there's the length of the hash. MD5 has been repeatedly broken on the first count, but 128 bits remains a sufficiently large search space to make finding a collision by brute-force impractical. For any secure hash, you can take any arbitrary subset of it, and it will be as good as any other subset of the same length. If a substring of a large hash were weaker than the complete output of a shorter hash, that would mean the larger hash was broken, because its output was predictable in part.
Nick Johnson
Thanks for clearing that up.
Andre