Ok this is one of those trickier-than-it-sounds questions so I'm turning to Stack Overflow because I can't think of a good answer. Here is what I want: I need Python to generate a simple list of numbers from 0 to 1,000,000,000 in random order, to be used for serial numbers (using a random number so that you can't tell how many have been assigned or do timing attacks as easily, i.e. guessing the next one that will come up). These numbers are stored in a database table (indexed) along with the information linked to them. The program generating them doesn't run forever, so it can't rely on internal state.

No big deal right? Just generate a list of numbers, shove them into an array and use Python "random.shuffle(big_number_array)" and we're done. Problem is I'd like to avoid having to store a list of numbers (and thus read the file, pop one off the top, save the file and close it). I'd rather generate them on the fly. Problem is that the solutions I can think of have problems:

1) Generate a random number and then check if it has already been used. If it has been used generate a new number, check, repeat as needed until I find an unused one. Problem here is that I may get unlucky and generate a lot of used numbers before getting one that is unused. Possible fix: use a very large pool of numbers to reduce the chances of this (but then I end up with silly long numbers).

2) Generate a random number and then check if it has already been used. If it has been used add or subtract one from the number and check again, keep repeating until I hit an unused number. Problem is this is no longer a random number as I have introduced bias (eventually I will get clumps of numbers and you'd be able to predict the next number with a better chance of success).

3) Generate a random number and then check if it has already been used. If it has been used add or subtract another randomly generated number and check again; problem is we're back to simply generating random numbers and checking, as in solution 1.

4) Suck it up and generate the random list and save it, have a daemon put them into a Queue so there are numbers available (and avoid constantly opening and closing a file, batching it instead).

5) Generate much larger random numbers and hash them (e.g. using MD5) to get a smaller numeric value; we should rarely get collisions, but I end up with larger-than-needed numbers again.

6) Prepend or append time-based information to the random number (e.g. a Unix timestamp) to reduce chances of a collision; again, I get larger numbers than I need.

Anyone have any clever ideas that will reduce the chances of a "collision" (i.e. generating a random number that is already taken) but will also allow me to keep the number "small" (i.e. less than a billion, or a thousand million for you Europeans =)?

Answer and why I accepted it:

So I will simply go with 1 and hope it's not an issue. However, if it is, I will go with the deterministic solution of generating all the numbers and storing them so that there is a guarantee of getting a new random number, and I can use "small" numbers (i.e. 9 digits instead of an MD5/etc.).
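A minimal sketch of that fallback (purely illustrative; a real version would persist the shuffled pool in the database rather than hold a billion ints in memory):

import random

serials = list(range(10 ** 9))  # caution: ~1e9 ints; in practice generate and persist in batches
random.shuffle(serials)
# persist the shuffled pool (database table, flat file, ...), then hand
# out serials by popping them off the end as they are assigned
next_serial = serials.pop()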

A: 

I'd rethink the problem itself... You don't seem to be doing anything sequential with the numbers... and you've got an index on the column which has them. Do they actually need to be numbers?

Consider a SHA hash... you don't actually need the entire thing. Do what Git or URL-shortening services do and take the first 3/4/5 characters of the hash (base36-encoded, so each character has 36 possible values instead of 10). Six characters then give you 36^6 = 2,176,782,336 combinations instead of 1,000,000. Combine that with a quick check on whether the combination exists (a pure index query) and a seed like a timestamp plus a random number, and it should do for almost any situation.
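A rough sketch of this suggestion, with a taken set standing in for the indexed-column lookup (the base36 alphabet and the seed format here are arbitrary choices, not anything the answer prescribes):

import hashlib, random, time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def short_id(taken, length=6):
    while True:
        seed = "%f-%d" % (time.time(), random.getrandbits(64))
        digest = int(hashlib.sha1(seed.encode()).hexdigest(), 16)
        code = ""
        while len(code) < length:          # base36-encode, keep `length` chars
            code += ALPHABET[digest % 36]
            digest //= 36
        if code not in taken:              # the quick existence check (index query)
            taken.add(code)
            return code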

Sudhir Jonathan
They need to be human readable (so serial numbers are better than something like "68123erghdqwoc", even if that kind of encoding would let me express, say, an MD5 in only 16 characters). If I arbitrarily shorten SHA1 I get collisions just as regularly as I would for random numbers, and I run into the problem in solution 1. Like I said, I'd like to keep the numbers between 1 and a billion.
bigredbob
There is no guarantee that SHA hashes (especially shortened) are unique.
Antoine P.
I think you should watch your tone. Using a long hash is definitely a viable solution. If you are so clever, try breaking a SHA256 hash then.
I'm sorry but suggesting to "take first 3/4/5 characters of the hash" when you want a genuinely unique number *is* stupid. Not to mention you have to take the hash of something anyway, so that something itself still has to be unique...
Antoine P.
Hmmm. Not to get into a flame war here, but when what you want is to be able to assign a different key to a lot of items, you could do it this way. It might also be worth looking into URL shortening services and Git, both of which have been doing this for quite some time now.
Sudhir Jonathan
+3  A: 

With some modular arithmetic and prime numbers, you can create all numbers between 0 and a big prime, out of order. If you choose your numbers carefully, the next number is hard to guess.

modulo = 87178291199 # prime
incrementor = 17180131327 # relatively prime to the modulus

current = 433494437 # some start value
for i in xrange(1, 100):
    print current
    current = (current + incrementor) % modulo
Sjoerd
+1 Neat, don't have to store a list of numbers, and would be very satisfying for anyone who did reverse-engineer it! :)
Autopulated
If someone had three serials that were issued right after each other, wouldn't it be easy to deduce the logic?
Johan
@Johan: Not if the prime numbers are really large. All modern cryptographic algorithms that I know rely on the fact that primes are hard to factorize.
jbochi
@Johan: You are right. If you have three or more serials, the next number would be fairly easy to guess, no matter what your primes are. I was thinking of another algorithm.
Sjoerd
@jbochi: That has nothing to do with this very algorithm. Also: I'll factorize any prime number in O(1), without a computer. But I guess we know what you meant :)
balpha
With a multiplicative ring, the next number would be harder to predict. E.g. use 999998971 as prime, 1 as start, and 43 as primitive root. For a serial, multiply the previous serial by 43 modulo 999998971.
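A sketch of that (assuming, per the numbers above, that 999998971 is prime and 43 is a primitive root modulo it, so the sequence walks through all of 1..999998970 before repeating):

modulus = 999998971  # prime, per the suggestion above
root = 43            # assumed primitive root modulo the prime
serial = 1           # start value
for i in range(10):
    serial = (serial * root) % modulus
    print(serial)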
Sjoerd
A: 

You are stating that you store the numbers in a database.

Wouldn't it then be easier to store all the candidate numbers there, and ask the database for a random unused number? Most databases support such a query.

Examples

MySQL:

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

PostgreSQL:

SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1
Johan
You'd need to populate a table with a billion rows, which is pretty inefficient. Don't forget that you need to remove each number as you use it.
Glenn Maynard
+2  A: 

If they don't have to be random, but just not obviously linear (1, 2, 3, 4, ...), then here's a simple algorithm:

Pick two prime numbers. One of them will be the largest number you can generate, so it should be around one billion. The other should be fairly large.

max_value = 795028841  # prime; serials will range over 0 .. max_value - 1
step = 360287471       # prime, so it is coprime to max_value
previous_serial = 0
for i in xrange(0, max_value):
    previous_serial += step
    previous_serial %= max_value
    print "Serial: %09i" % previous_serial

Just store the previous serial each time so you know where you left off. I can't prove it mathematically (it's been too long since those particular classes), but the key property is that step and max_value are coprime, so repeatedly adding step modulo max_value cycles through every value before repeating; it's demonstrably correct with smaller primes:

s = set()
previous_serial = 0
for i in xrange(0, 2711):
    previous_serial += 1811
    previous_serial %= 2711
    assert previous_serial not in s
    s.add(previous_serial)

You could also prove it empirically with 9-digit primes, it'd just take a bit more work (or a lot more memory).

This does mean that given a few serial numbers, it'd be possible to figure out what your values are--but with only nine digits, it's not likely that you're going for unguessable numbers anyway.

Glenn Maynard
(Ugh. Why is this site making well-established users randomly enter captchas? It's absurd.)
Glenn Maynard
A: 

Do you need this to be cryptographically secure or just hard to guess? How bad are collisions? Because if it needs to be cryptographically strong and have zero collisions, it is, sadly, impossible.

Andrew McGregor
Since he is talking about serial numbers, and has a database table storing them, I assume collisions would be pretty annoying (as in "mixing up two different customers" perhaps).
Antoine P.
Sure, but if they're more like software keys, then that may be a mere annoyance, as opposed to a serious bug.
Andrew McGregor
+13  A: 
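Going by the comments below, the suggestion here is the simplest one: draw random serials with a couple of digits of headroom, say 12 digits, and just retry on collision. A sketch, with a hypothetical is_taken() standing in for the database lookup:

import random

def new_serial(is_taken):
    # with at most ~1e9 serials issued out of 9e11 possibilities, a draw
    # collides with probability ~0.001; the probabilities multiply, so
    # needing even three draws is roughly a one-in-a-billion event
    while True:
        candidate = random.randrange(10 ** 11, 10 ** 12)  # 12 digits
        if not is_taken(candidate):
            return candidate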
balpha
That's what I would advocate too.
Antoine P.
Just add another 3 digits if you want to be EXTRA sure! ;-)
Khelben
The annoyance is that if I create a list, shuffle, and store it, then it's deterministic, with zero chance of using an already-used number, plus I don't have to use overly long numbers.
bigredbob
@bigredbob: I get that. But 1) 12 digits isn't really overly long, and 2) I'm just saying that the drawbacks of the simplest solution aren't nearly as big as you think. Don't forget that the probabilities multiply.
balpha
+2  A: 

My first thought is to use a symmetric encryption algorithm (e.g. DES or AES) in counter mode. Your counter just goes from 0 upwards, and the encryption uses a key of your choice to turn it into a seemingly random value. This is guaranteed never to collide, because a block cipher under a fixed key is a 1:1 mapping of its block values. One benefit is that it's reversible, so you could take the serial number and decrypt it to get back to the simple counter value.

The one problem I see with this is that the encrypted value is 64 bits (for DES), which would give you a 20-digit decimal serial number that is probably impractically large.

Still, since you aren't really after a cryptographic solution, I wonder if you could take this general idea and adapt it for your needs. Interested in making a "32-bit block encryption algorithm"?!

(Actually, you mention timing attacks, so perhaps you do need something cryptographically secure?)
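For what it's worth, a toy sketch of such a "32-bit block encryption algorithm": a four-round Feistel network over two 16-bit halves. A Feistel round is invertible no matter what its round function is, so counters map 1:1 to serials; the round function and keys below are arbitrary placeholders, not anything cryptographically vetted.

MASK16 = 0xFFFF
KEYS = [0x9E37, 0x79B9, 0x7F4A, 0x7C15]  # arbitrary fixed round keys

def round_fn(half, key):
    return ((half * 0x6255) ^ key) & MASK16  # placeholder mixing step

def encrypt32(counter):
    left, right = counter >> 16, counter & MASK16
    for key in KEYS:
        left, right = right, left ^ round_fn(right, key)
    return (left << 16) | right

A 32-bit block also keeps the serials at no more than 10 decimal digits, much closer to the target than DES's 20.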

Craig McQueen
+4  A: 

I think you are overestimating the problems with approach 1). Unless you have hard real-time requirements, the generate-and-check loop terminates quickly, and the probability of needing more than k iterations decays exponentially in k. With 100M numbers issued (10% fill factor) you have a one-in-a-billion chance of requiring more than 9 iterations. Even with 50% of the numbers taken you'll need only 2 iterations on average, with a one-in-a-billion chance of requiring more than 30 checks. Even the extreme case where 99% of the numbers are already taken might still be reasonable: you'll average 100 iterations, with a one-in-a-billion chance of requiring more than 2062.
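A quick back-of-the-envelope check of those figures: with fill factor p, the chance that k consecutive draws all hit used numbers is p**k, and the expected number of draws is 1/(1-p). Exact arithmetic via fractions avoids floating-point edge cases:

from fractions import Fraction

ONE_IN_A_BILLION = Fraction(1, 10 ** 9)
for p in (Fraction(1, 10), Fraction(1, 2), Fraction(99, 100)):
    mean_draws = float(1 / (1 - p))  # expected draws until an unused number
    k, prob = 1, p                   # invariant: prob == p ** k
    while prob > ONE_IN_A_BILLION:
        prob *= p
        k += 1
    print("fill %s: %.1f draws on average, one-in-a-billion tail at %d" % (p, mean_draws, k))

This reproduces the 9, 30, and 2062 iteration figures above.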

Ants Aasma
+2  A: 

With properly chosen parameters, a Linear Congruential random number generator's state sequence CANNOT repeat until the full set of values in its period has been generated. Then it MUST repeat precisely.

The internal state is often large (48 or 64 bits). The generated numbers are smaller (32 bits usually) because the entire set of state bits is not random. If you follow the state values, they form a distinct non-repeating sequence.

The question is essentially one of locating parameters and a seed that generate "enough" numbers. You can pick a seed and generate numbers until you get back to the starting seed. That's the length of the sequence. It may be millions or billions of numbers.

There are some guidelines in Knuth for picking a modulus, multiplier, and increment that will generate very long sequences of unique numbers.
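A concrete sketch of a full-period generator, relying on the Hull-Dobell conditions (the increment must be coprime to the modulus; a - 1 must be divisible by all prime factors of the modulus, and by 4 if the modulus is; the constants below are just one choice satisfying those for a modulus of one billion):

m = 10 ** 9  # modulus; prime factors 2 and 5, divisible by 4
a = 21       # multiplier; a - 1 = 20 is divisible by 2, 5 and 4
c = 7        # increment; coprime to m

def next_serial(x):
    return (a * x + c) % m

# empirical check of the full period, on a small modulus of the same shape
seen, x = set(), 0
for i in range(1000):
    x = (21 * x + 7) % 1000
    assert x not in seen
    seen.add(x)

Note that LCG outputs are easy to predict from a few samples, so this guarantees uniqueness, not unguessability.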

S.Lott
A: 

I started trying to write an explanation of the approach used below, but just implementing it was easier and more accurate. This approach has the odd behavior that it gets faster the more numbers you've generated. But it works, and it doesn't require you to generate all the numbers in advance.

As a simple optimization, you could easily make this class use a probabilistic algorithm (generate a random number, and if it's not in the set of used numbers add it to the set and return it) at first, keep track of the collision rate, and switch over to the deterministic approach used here once the collision rate gets bad.

import random

class NonRepeatingRandom(object):

    def __init__(self, maxvalue):
        self.maxvalue = maxvalue
        self.used = set()

    def next(self):  # Python 2 iterator protocol; alias to __next__ for Python 3
        if len(self.used) >= self.maxvalue:
            raise StopIteration
        # pick the (r+1)-th not-yet-used number, counting up from 1
        r = random.randrange(0, self.maxvalue - len(self.used))
        result = 0
        for i in range(r + 1):
            result += 1
            while result in self.used:
                result += 1
        self.used.add(result)
        return result

    def __iter__(self):
        return self

    def __getitem__(self, index):
        raise NotImplementedError

    def get_all(self):
        return [i for i in self]

>>> n = NonRepeatingRandom(20)
>>> n.get_all()
[12, 14, 13, 2, 20, 4, 15, 16, 19, 1, 8, 6, 7, 9, 5, 11, 10, 3, 18, 17]
Robert Rossney
A: 

If you don't need something cryptographically secure, but just "sufficiently obfuscated"...

Galois Fields

You could try operations in Galois fields, e.g. GF(2^32), to map a simple incrementing counter x to a seemingly random serial number y:

x = counter_value
y = some_galois_function(x)
  • Multiply by a constant
    • Inverse is to multiply by the reciprocal of the constant
  • Raise to a power: x^n
  • Reciprocal: x^(-1)
    • Special case of raising to the power n = -1
    • It is its own inverse
  • Exponentiation of a primitive element: a^x

Many of these operations have an inverse, which means, given your serial number, you can calculate the original counter value from which it was derived.

As for finding a library for Galois field arithmetic in Python... good question. If you don't need speed (which you wouldn't for this), you could roll your own; there are a few libraries around, but I haven't tried them.
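For illustration, rolling your own "multiply by a constant" in a small field, GF(2^8) with the AES polynomial (for serials you'd use a degree-32 irreducible polynomial instead, but the mechanics are identical):

AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b, poly=AES_POLY, width=8):
    # carry-less multiplication with reduction modulo poly
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> width:
            a ^= poly
    return result

# multiplying by a nonzero constant permutes the field, so distinct
# counter values always yield distinct "serials"
serials = [gf_mul(x, 0x53) for x in range(256)]
assert len(set(serials)) == 256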

CRC

A related possibility is to use a CRC calculation, which is based on the remainder of long division by an irreducible polynomial over GF(2). Python code is readily available for CRCs (crcmod, pycrc), although you might want to pick a different irreducible polynomial than is normally used, for your purposes. I'm a little fuzzy on the theory, but I think a 32-bit CRC should generate a unique value for every possible combination of 4-byte inputs. Check this. It's quite easy to check experimentally for e.g. 16-bit CRCs, even 24-bit these days, and 32-bit with the right code and sufficient RAM.
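The 16-bit check really is quick. A bare-bones bit-at-a-time CRC (CCITT polynomial, no reflection, zero initial value), applied to every possible 16-bit input:

def crc16(value, poly=0x1021, width=16):
    reg = value
    for _ in range(width):  # equivalent to dividing value * x^16 by the polynomial
        if reg >> (width - 1):
            reg = ((reg << 1) ^ poly) & ((1 << width) - 1)
        else:
            reg = (reg << 1) & ((1 << width) - 1)
    return reg

assert len(set(crc16(v) for v in range(1 << 16))) == 1 << 16

The mapping is one-to-one here because the polynomial has a nonzero constant term, which makes "multiply by x" invertible in the quotient ring.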

Craig McQueen
A: 

You can use approach 1) without running into the problem of too many wasted draws if you just decrease the random interval by one each time.

For this method to work, you will need to save the numbers already given (which you want to do anyway) and also save the quantity of numbers taken.

It is pretty obvious that, after having collected 10 numbers, your pool of possible random numbers will have shrunk by 10. Therefore, you must not choose a number between 1 and 1,000,000 but between 1 and 999,990. Of course, this number is not the real serial but only an index (unless the 10 numbers collected happened to be 999,991, 999,992, …); you’d have to count up from 1, skipping all the numbers already collected.

Of course, your algorithm should be smarter than literally counting from 1 to 1,000,000, but I hope the method is clear.
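A sketch of that bookkeeping, keeping the issued numbers sorted so a random index into the remaining pool can be mapped to the actual unused value (names are illustrative; the pool size is the million from the example above):

import bisect, random

POOL = 10 ** 6
issued = []  # sorted list of serials already given out

def draw():
    r = random.randrange(POOL - len(issued))  # index into the remaining numbers
    candidate = r + 1                         # serials run from 1..POOL
    for used in issued:                       # shift past already-taken numbers
        if used <= candidate:
            candidate += 1
        else:
            break
    bisect.insort(issued, candidate)
    return candidate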

I don’t like drawing random numbers until I get one which fits either. It just feels wrong.

Debilski