ansaurus

Question

Answer 1

A:

To avoid the problem, you can use a sequence of cookies, that are guaranteed to be different (you can e.g. use a set). Each time you give a cookie to someone, you take it from the sequence and you add another to it. Another option is to generate a UUID and use that as a cookie.

Another way to avoid the problem could be to hold a private key, and use a (e.g. MD5) checksum of the private key, with a counter value joined to it. The probability for collisions will then be very low. To be safer, add a few more variables to the checksum, like the current time, the ip address of the user, ...

Libraries to generate cookies exist. Any WSGI implementation probably contains a cookie generator.

If you're only interested in how random your strings are, you could generate a file with, say, one million cookies and perform randomness checks on that file. This, however, is not what I would recommend.

lbp 2009-09-02 06:59:59

This isn't really my question - I don't want a work-around; I want to understand what's happening. My work-around is to use os.urandom. As for using sequences - that would be bad, since the cookie can be guessed. Using uuids: if the UUID generator uses the random module, they might not be unique.

Martin v. Löwis 2009-09-02 07:08:01

A UUID *can't* be guaranteed to be unique. For theoretical reasons, because there are only 2**128 of them, and for practical reasons, because perhaps the code generating them is flawed - in particular if it is very similar to the code I posted, which should also generate unique values, but doesn't.

Martin v. Löwis 2009-09-02 07:26:42

Better use someone else's "flawed" code that might get better in the future than trying your own stuff, of which you don't know what it actually does.

lbp 2009-09-02 07:31:49

With <i>only 2**128</i> UUIDs, you can assign a unique UUID to every single sand grain on the planet. So I guess for at most a thousand users you will not be in trouble with UUIDs.

lbp 2009-09-02 07:40:10

Better to use something that's flawed which you don't understand, because it might get better? Righto...

Glenn Maynard 2009-09-02 07:41:13

Answer 2

+6 A:

It shouldn't be generating duplicates.

import random
chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
def gen():
    return ''.join([random.choice(chars) for x in range(32)])

test = [gen() for i in range(100000)]
print len(test), len(set(test)) # 100000 100000

The chances of duplicates is significant with chars = "ab"; 126 duplicates in 1000000 iterations. It's nonexistant with 62.

That said, this isn't a good way to generate cookies, because session cookies need to be unpredictable, to avoid attacks involving stealing other people's session cookies. The Mersenne Twister is not designed for generating secure random numbers. This is what I do:

import os, hashlib
def gen():
    return hashlib.sha1(os.urandom(512)).hexdigest()

test = [gen() for i in range(100000)]
print len(test), len(set(test))

... which should be very secure (which is to say, difficult to take a string of session cookies and guess other existing session cookies from them).

Glenn Maynard 2009-09-02 07:11:52

Why is the Mersenne Twister unsuitable for generating secure cookies? It has a period of 2**19937, so it shouldn't be possible to predict the next value even if you know a few subsequent ones.

Martin v. Löwis 2009-09-02 07:17:30

From Wikipedia: "The algorithm in its native form is not suitable for cryptography (unlike Blum Blum Shub). Observing a sufficient number of iterates (624 in the case of MT19937) allows one to predict all future iterates." (http://en.wikipedia.org/wiki/Mersenne_twister)

Chris Lutz 2009-09-02 07:19:34

Just because a random number generator has a long cycle doesn't imply that it's difficult to take a sequence within the cycle and figure out where it is. If I give you the sequence 0, 1, 2, 3..., it has a very long cycle (infinite), yet it's trivial to figure out what the next value is. You need a sequence that's cryptographically secure--where it's difficult to determine the state of the engine from its output. That's what secure hashes are for. I prefer hashing urandom through SHA-1, but hashing MT through SHA-1 is probably fine, too.

Glenn Maynard 2009-09-02 07:22:57

@Chris: thanks, that clarifies it. @Glenn: why do you hash urandom? Isn't *that* unpredictable already?

Martin v. Löwis 2009-09-02 07:29:50

@Martin, urandom isn't secure. Hashing it with SHA-1 helps, but isn't bullet proof.

orip 2009-09-02 07:40:09

+1 for mentioning the cookies aren't secure

orip 2009-09-02 07:40:52

@orip: why is urandom not secure?

Martin v. Löwis 2009-09-02 07:45:06

It's not secure if the entropy pool is exhausted.

ilya n. 2009-09-02 07:48:22

In practice, I think urandom internally uses SHA-1 to generate data, at least on Linux. The output of urandom *should* be cryptographically strong on its own. It's just very simple and cheap to ensure this explicitly by hashing data directly.

Glenn Maynard 2009-09-02 07:53:50

It's secure as far as you trust SHA-1, which for most people is very far.

Glenn Maynard 2009-09-02 07:54:40

I just checked: on Linux, /dev/urandom indeed hashes its output.

Martin v. Löwis 2009-09-02 08:14:20

Apparently, it's not really possible to resolve the mystery. Perhaps I'm missing a detail that I haven't communicated. So I'm accepting this answer for being most helpful (along with the comments).

Martin v. Löwis 2009-09-03 04:10:37

If you want to dig deeper: log session creation; find out what time the clashing sessions are created and by which processes; verify that the state of the PRNG at the start (after first call and seeding) is distinct in each process; see if it goes away if you create a separate PRNG instance just for session creation (should not be needed but would be a data point); see if you can isolate the problem reproducibly.

Glenn Maynard 2009-09-03 05:11:08

Considering the amount of progress that has been made against SHA-1 (and MD5), I'd start looking to other hashes. Neither MD5 nor SHA-1 should be used in new designs.

derobert 2009-09-06 05:44:27

Answer 3

+3 A:

This is definitely not a normal collision scenario:

32 characters with 62 options per character is equivalent to 190 bits (log2(62) * 32)
According to the birthday paradox, you should be receiving a collision naturally once every 2**95 cookies, which means never

Could this be a concurrency issue?

If so, use different random.Random instances for each thread
Can save these instances in thread-local storage (threading.local())
On linux, Python should seed them using os.urandom() - not system time - so you should get different streams for each thread.

orip 2009-09-02 07:17:25

He said multiple FCGI processes, not threads. Was that correct, Martin, or did you mean threads?

Glenn Maynard 2009-09-02 07:26:14

Processes, correct.

Martin v. Löwis 2009-09-02 07:42:50

Answer 4

A:

I had to erase my original answer, which suggested that generator is not seeded from /dev/urandom, since its source (for Python 3.x) clearly says that it is:

def seed(self, a=None):
    """Initialize internal state from hashable object.

    None or no argument seeds from current time or from an operating
    system specific randomness source if available.

    If a is not None or an int or long, hash(a) is used instead.
    """

    if a is None:
        try:
            a = int(_hexlify(_urandom(16)), 16)
        except NotImplementedError:
            import time
            a = int(time.time() * 256) # use fractional seconds

    super().seed(a)
    self.gauss_next = None

I therefore humbly accept that there are mysteries in the world that I may not be able to decipher.

ilya n. 2009-09-02 07:30:57

Where do you see it seeded from some hash()? In random.py, around line 108, it's seeded from long(_hexlify(_urandom(16)), 16).

Martin v. Löwis 2009-09-02 07:35:32

Indeed, I just read that myself.

ilya n. 2009-09-02 07:43:26

If you're really looking into that -- perhaps the next step would be checking that the line `a = int(_hexlify(_urandom(16)), 16)` doesn't raise `NotImplementedError` for some strange reason?

ilya n. 2009-09-02 07:51:32

Thanks for the idea. Alas, I tried, and it all works as it should (i.e. it does get seeded from urandom).

Martin v. Löwis 2009-09-02 18:28:08

Answer 5

+1 A:

I don't know how your FCGI processes are being spawned, but is it possible that it's using fork() after the Python interpreter has started (and the random module has been imported by something), hence effectively seeding two processes' random._insts from the same source?
Maybe put some debugging in to check that it is correctly seeding from urandom, and not falling back to the less rigorous time-based seed?

eta re comment: man! That's me stumped then; if the RNG always has different state at startup I can't see how you could possibly get collisions. Weird. Would have to put in a lot of state logging to investigate the particular cases which result in collisions, I guess, which sounds like a lot of work trawling through logs. Could it be (1a) the FCGI server usually doesn't fork, but occasionally does (maybe under load, or something)?

Or (3) some higher-level problem such as a broken HTTP proxy passing the same Set-Cookie to multiple clients?

bobince 2009-09-03 01:05:05

Thanks for the ideas: 1. I have dumped the state of the RNG at startup, and they are all different. 2. I had it create files good (uses urandom) and bad (uses time); the "good" file was created; the bad file was not.

Martin v. Löwis 2009-09-03 04:09:25

ansaurus

tags:

views:

answers:

random.choice not random

related questions