views:

557

answers:

5

I'm using Python 2.5 on Linux, in multiple parallel FCGI processes. I use

    chars = string.ascii_letters + string.digits
    cookie = ''.join([random.choice(chars) for x in range(32)])

to generate distinct cookies. Assuming that the RNG is seeded from /dev/urandom, and that the sequence of random numbers comes from the Mersenne twister, I would expect that there is practically zero chance of collision.

However, I do see regular collisions, even though only a few (<100) users are logged in at any time.

Why are the random numbers not more random?

A: 

To avoid the problem, you can use a sequence of cookies, that are guaranteed to be different (you can e.g. use a set). Each time you give a cookie to someone, you take it from the sequence and you add another to it. Another option is to generate a UUID and use that as a cookie.

Another way to avoid the problem could be to hold a private key, and use a (e.g. MD5) checksum of the private key, with a counter value joined to it. The probability for collisions will then be very low. To be safer, add a few more variables to the checksum, like the current time, the ip address of the user, ...

Libraries to generate cookies exist. Any WSGI implementation probably contains a cookie generator.

If you're only interested in how random your strings are, you could generate a file with, say, one million cookies and perform randomness checks on that file. This, however, is not what I would recommend.

lbp
This isn't really my question - I don't want a work-around; I want to understand what's happening. My work-around is to use os.urandom. As for using sequences - that would be bad, since the cookie can be guessed. Using uuids: if the UUID generator uses the random module, they might not be unique.
Martin v. Löwis
A UUID *can't* be guaranteed to be unique. For theoretical reasons, because there are only 2**128 of them, and for practical reasons, because perhaps the code generating them is flawed - in particular if it is very similar to the code I posted, which should also generate unique values, but doesn't.
Martin v. Löwis
Better use someone else's "flawed" code that might get better in the future than trying your own stuff, of which you don't know what it actually does.
lbp
With <i>only 2**128</i> UUIDs, you can assign a unique UUID to every single sand grain on the planet. So I guess for at most a thousand users you will not be in trouble with UUIDs.
lbp
Better to use something that's flawed which you don't understand, because it might get better? Righto...
Glenn Maynard
+6  A: 

It shouldn't be generating duplicates.

import random
chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
def gen():
    return ''.join([random.choice(chars) for x in range(32)])

test = [gen() for i in range(100000)]
print len(test), len(set(test)) # 100000 100000

The chances of duplicates is significant with chars = "ab"; 126 duplicates in 1000000 iterations. It's nonexistant with 62.

That said, this isn't a good way to generate cookies, because session cookies need to be unpredictable, to avoid attacks involving stealing other people's session cookies. The Mersenne Twister is not designed for generating secure random numbers. This is what I do:

import os, hashlib
def gen():
    return hashlib.sha1(os.urandom(512)).hexdigest()

test = [gen() for i in range(100000)]
print len(test), len(set(test))

... which should be very secure (which is to say, difficult to take a string of session cookies and guess other existing session cookies from them).

Glenn Maynard
Why is the Mersenne Twister unsuitable for generating secure cookies? It has a period of 2**19937, so it shouldn't be possible to predict the next value even if you know a few subsequent ones.
Martin v. Löwis
From Wikipedia: "The algorithm in its native form is not suitable for cryptography (unlike Blum Blum Shub). Observing a sufficient number of iterates (624 in the case of MT19937) allows one to predict all future iterates." (http://en.wikipedia.org/wiki/Mersenne_twister)
Chris Lutz
Just because a random number generator has a long cycle doesn't imply that it's difficult to take a sequence within the cycle and figure out where it is. If I give you the sequence 0, 1, 2, 3..., it has a very long cycle (infinite), yet it's trivial to figure out what the next value is. You need a sequence that's cryptographically secure--where it's difficult to determine the state of the engine from its output. That's what secure hashes are for. I prefer hashing urandom through SHA-1, but hashing MT through SHA-1 is probably fine, too.
Glenn Maynard
@Chris: thanks, that clarifies it. @Glenn: why do you hash urandom? Isn't *that* unpredictable already?
Martin v. Löwis
@Martin, urandom isn't secure. Hashing it with SHA-1 helps, but isn't bullet proof.
orip
+1 for mentioning the cookies aren't secure
orip
@orip: why is urandom not secure?
Martin v. Löwis
It's not secure if the entropy pool is exhausted.
ilya n.
In practice, I think urandom internally uses SHA-1 to generate data, at least on Linux. The output of urandom *should* be cryptographically strong on its own. It's just very simple and cheap to ensure this explicitly by hashing data directly.
Glenn Maynard
It's secure as far as you trust SHA-1, which for most people is very far.
Glenn Maynard
I just checked: on Linux, /dev/urandom indeed hashes its output.
Martin v. Löwis
Apparently, it's not really possible to resolve the mystery. Perhaps I'm missing a detail that I haven't communicated. So I'm accepting this answer for being most helpful (along with the comments).
Martin v. Löwis
If you want to dig deeper: log session creation; find out what time the clashing sessions are created and by which processes; verify that the state of the PRNG at the start (after first call and seeding) is distinct in each process; see if it goes away if you create a separate PRNG instance just for session creation (should not be needed but would be a data point); see if you can isolate the problem reproducibly.
Glenn Maynard
Considering the amount of progress that has been made against SHA-1 (and MD5), I'd start looking to other hashes. Neither MD5 nor SHA-1 should be used in new designs.
derobert
+3  A: 

This is definitely not a normal collision scenario:

  • 32 characters with 62 options per character is equivalent to 190 bits (log2(62) * 32)
  • According to the birthday paradox, you should be receiving a collision naturally once every 2**95 cookies, which means never

Could this be a concurrency issue?

  • If so, use different random.Random instances for each thread
  • Can save these instances in thread-local storage (threading.local())
  • On linux, Python should seed them using os.urandom() - not system time - so you should get different streams for each thread.
orip
He said multiple FCGI processes, not threads. Was that correct, Martin, or did you mean threads?
Glenn Maynard
Processes, correct.
Martin v. Löwis
A: 

I had to erase my original answer, which suggested that generator is not seeded from /dev/urandom, since its source (for Python 3.x) clearly says that it is:

def seed(self, a=None):
    """Initialize internal state from hashable object.

    None or no argument seeds from current time or from an operating
    system specific randomness source if available.

    If a is not None or an int or long, hash(a) is used instead.
    """

    if a is None:
        try:
            a = int(_hexlify(_urandom(16)), 16)
        except NotImplementedError:
            import time
            a = int(time.time() * 256) # use fractional seconds

    super().seed(a)
    self.gauss_next = None

I therefore humbly accept that there are mysteries in the world that I may not be able to decipher.

ilya n.
Where do you see it seeded from some hash()? In random.py, around line 108, it's seeded from long(_hexlify(_urandom(16)), 16).
Martin v. Löwis
Indeed, I just read that myself.
ilya n.
If you're really looking into that -- perhaps the next step would be checking that the line `a = int(_hexlify(_urandom(16)), 16)` doesn't raise `NotImplementedError` for some strange reason?
ilya n.
Thanks for the idea. Alas, I tried, and it all works as it should (i.e. it does get seeded from urandom).
Martin v. Löwis
+1  A: 
  1. I don't know how your FCGI processes are being spawned, but is it possible that it's using fork() after the Python interpreter has started (and the random module has been imported by something), hence effectively seeding two processes' random._insts from the same source?

  2. Maybe put some debugging in to check that it is correctly seeding from urandom, and not falling back to the less rigorous time-based seed?

eta re comment: man! That's me stumped then; if the RNG always has different state at startup I can't see how you could possibly get collisions. Weird. Would have to put in a lot of state logging to investigate the particular cases which result in collisions, I guess, which sounds like a lot of work trawling through logs. Could it be (1a) the FCGI server usually doesn't fork, but occasionally does (maybe under load, or something)?

Or (3) some higher-level problem such as a broken HTTP proxy passing the same Set-Cookie to multiple clients?

bobince
Thanks for the ideas: 1. I have dumped the state of the RNG at startup, and they are all different. 2. I had it create files good (uses urandom) and bad (uses time); the "good" file was created; the bad file was not.
Martin v. Löwis