Ok this is one of those trickier-than-it-sounds questions so I'm turning to Stack Overflow because I can't think of a good answer. Here is what I want: I need Python to generate a simple list of numbers from 0 to 1,000,000,000 in random order, to be used for serial numbers (using a random number so that you can't tell how many have been assigned or do timing attacks as easily, i.e. guessing the next one that will come up). These numbers are stored in a database table (indexed) along with the information linked to them. The program generating them doesn't run forever, so it can't rely on internal state.

No big deal right? Just generate a list of numbers, shove them into an array and use Python "random.shuffle(big_number_array)" and we're done. Problem is I'd like to avoid having to store a list of numbers (and thus read the file, pop one off the top, save the file and close it). I'd rather generate them on the fly. Problem is that the solutions I can think of have problems:

1) Generate a random number and then check if it has already been used. If it has been used generate a new number, check, repeat as needed until I find an unused one. Problem here is that I may get unlucky and generate a lot of used numbers before getting one that is unused. Possible fix: use a very large pool of numbers to reduce the chances of this (but then I end up with silly long numbers).

2) Generate a random number and then check if it has already been used. If it has been used add or subtract one from the number and check again, keep repeating until I hit an unused number. Problem is this is no longer a random number as I have introduced bias (eventually I will get clumps of numbers and you'd be able to predict the next number with a better chance of success).

3) Generate a random number and then check if it has already been used. If it has been used add or subtract another randomly generated number and check again; problem is we're back to simply generating random numbers and checking, as in solution 1.

4) Suck it up and generate the random list and save it, have a daemon put them into a Queue so there are numbers available (and avoid constantly opening and closing a file, batching it instead).

5) Generate much larger random numbers and hash them (e.g. using MD5) to get a smaller numeric value; we should rarely get collisions, but I end up with larger-than-needed numbers again.

6) Prepend or append time-based information to the random number (e.g. a Unix timestamp) to reduce chances of a collision; again, I get larger numbers than I need.

Anyone have any clever ideas that will reduce the chances of a "collision" (i.e. generating a random number that is already taken) but will also allow me to keep the number "small" (i.e. less than a billion, or a thousand million for you Europeans =)?

Answer and why I accepted it:

So I will simply go with 1 and hope it's not an issue. However, if it is, I will go with the deterministic solution of generating all the numbers and storing them so that there is a guarantee of getting a new random number, and I can use "small" numbers (i.e. 9 digits instead of an MD5/etc.).
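A minimal sketch of that fallback (purely illustrative; a real version would persist the shuffled pool in the database rather than hold a billion ints in memory):

import random

serials = list(range(10 ** 9))  # caution: ~1e9 ints; in practice generate and persist in batches
random.shuffle(serials)
# persist the shuffled pool (database table, flat file, ...), then hand
# out serials by popping them off the end as they are assigned
next_serial = serials.pop()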

A: 

I'd rethink the problem itself... You don't seem to be doing anything sequential with the numbers... and you've got an index on the column which has them. Do they actually need to be numbers?

Consider a SHA hash... you don't actually need the entire thing. Do what Git or URL-shortening services do and take the first 3/4/5 characters of the hash (base36-encoded, so each character has 36 possible values instead of 10). Six characters then give you 36^6 = 2,176,782,336 combinations instead of 1,000,000. Combine that with a quick check on whether the combination exists (a pure index query) and a seed like a timestamp plus a random number, and it should do for almost any situation.
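A rough sketch of this suggestion, with a taken set standing in for the indexed-column lookup (the base36 alphabet and the seed format here are arbitrary choices, not anything the answer prescribes):

import hashlib, random, time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def short_id(taken, length=6):
    while True:
        seed = "%f-%d" % (time.time(), random.getrandbits(64))
        digest = int(hashlib.sha1(seed.encode()).hexdigest(), 16)
        code = ""
        while len(code) < length:          # base36-encode, keep `length` chars
            code += ALPHABET[digest % 36]
            digest //= 36
        if code not in taken:              # the quick existence check (index query)
            taken.add(code)
            return code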

Sudhir Jonathan
They need to be human readable (so serial numbers are better than something like "68123erghdqwoc", even if that kind of encoding would let me express, say, an MD5 in only 16 characters). If I arbitrarily shorten SHA1 I get collisions just as regularly as I would for random numbers, and I run into the problem in solution 1. Like I said, I'd like to keep the numbers between 1 and a billion.
bigredbob
There is no guarantee that SHA hashes (especially shortened) are unique.
Antoine P.
I think you should watch your tone. Using a long hash is definitely a viable solution. If you are so clever, try breaking a SHA256 hash then.
I'm sorry but suggesting to "take first 3/4/5 characters of the hash" when you want a genuinely unique number *is* stupid. Not to mention you have to take the hash of something anyway, so that something itself still has to be unique...
Antoine P.
Hmmm. Not to get into a flame war here, but when what you want is to be able to assign a different key to a lot of items, you could do it this way. It might also be worth looking into URL shortening services and Git, both of which have been doing this for quite some time now.
Sudhir Jonathan
+3  A: 

With some modular arithmetic and prime numbers, you can create all numbers between 0 and a big prime, out of order. If you choose your numbers carefully, the next number is hard to guess.

modulo = 87178291199 # prime
incrementor = 17180131327 # relatively prime to the modulus

current = 433494437 # some start value
for i in xrange(1, 100):
    print current
    current = (current + incrementor) % modulo
Sjoerd
+1 Neat, don't have to store a list of numbers, and would be very satisfying for anyone who did reverse-engineer it! :)
Autopulated
If someone had three serials that were issued right after each other, wouldn't it be easy to deduce the logic?
Johan
@Johan: Not if the prime numbers are really large. All modern cryptographic algorithms that I know rely on the fact that primes are hard to factorize.
jbochi
@Johan: You are right. If you have three or more serials, the next number would be fairly easy to guess, no matter what your primes are. I was thinking of another algorithm.
Sjoerd
@jbochi: That has nothing to do with this very algorithm. Also: I'll factorize any prime number in O(1), without a computer. But I guess we know what you meant :)
balpha
With a multiplicative ring, the next number would be harder to predict. E.g. use 999998971 as prime, 1 as start, and 43 as primitive root. For a serial, multiply the previous serial by 43 modulo 999998971.
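A sketch of that (assuming, per the numbers above, that 999998971 is prime and 43 is a primitive root modulo it, so the sequence walks through all of 1..999998970 before repeating):

modulus = 999998971  # prime, per the suggestion above
root = 43            # assumed primitive root modulo the prime
serial = 1           # start value
for i in range(10):
    serial = (serial * root) % modulus
    print(serial)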
Sjoerd
A: 

You are stating that you store the numbers in a database.

Wouldn't it then be easier to store all the candidate numbers there, and ask the database for a random unused number? Most databases support such a query.

Examples

MySQL:

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

PostgreSQL:

SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1
Johan
You'd need to populate a table with a billion rows, which is pretty inefficient. Don't forget that you need to remove each number as you use it.
Glenn Maynard
+2  A: 

If they don't have to be random, but just not obviously linear (1, 2, 3, 4, ...), then here's a simple algorithm:

Pick two prime numbers. One of them will be the largest number you can generate, so it should be around one billion. The other should be fairly large.

max_value = 795028841  # prime; serials will range over 0 .. max_value - 1
step = 360287471       # prime, so it is coprime to max_value
previous_serial = 0
for i in xrange(0, max_value):
    previous_serial += step
    previous_serial %= max_value
    print "Serial: %09i" % previous_serial

Just store the previous serial each time so you know where you left off. I can't prove it mathematically (it's been too long since those particular classes), but the key property is that step and max_value are coprime, so repeatedly adding step modulo max_value cycles through every value before repeating; it's demonstrably correct with smaller primes:

s = set()
previous_serial = 0
for i in xrange(0, 2711):
    previous_serial += 1811
    previous_serial %= 2711
    assert previous_serial not in s
    s.add(previous_serial)

You could also prove it empirically with 9-digit primes, it'd just take a bit more work (or a lot more memory).

This does mean that given a few serial numbers, it'd be possible to figure out what your values are--but with only nine digits, it's not likely that you're going for unguessable numbers anyway.

Glenn Maynard
(Ugh. Why is this site making well-established users randomly enter captchas? It's absurd.)
Glenn Maynard
A: 

Do you need this to be cryptographically secure or just hard to guess? How bad are collisions? Because if it needs to be cryptographically strong and have zero collisions, it is, sadly, impossible.

Andrew McGregor
Since he is talking about serial numbers, and has a database table storing them, I assume collisions would be pretty annoying (as in "mixing up two different customers" perhaps).
Antoine P.
Sure, but if they're more like software keys, then that may be a mere annoyance, as opposed to a serious bug.
Andrew McGregor
+13  A: 
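Going by the comments below, the suggestion here is the simplest one: draw random serials with a couple of digits of headroom, say 12 digits, and just retry on collision. A sketch, with a hypothetical is_taken() standing in for the database lookup:

import random

def new_serial(is_taken):
    # with at most ~1e9 serials issued out of 9e11 possibilities, a draw
    # collides with probability ~0.001; the probabilities multiply, so
    # needing even three draws is roughly a one-in-a-billion event
    while True:
        candidate = random.randrange(10 ** 11, 10 ** 12)  # 12 digits
        if not is_taken(candidate):
            return candidate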
balpha
That's what I would advocate too.
Antoine P.
Just add another 3 digits if you want to be EXTRA sure! ;-)
Khelben
The annoyance is that if I create a list, shuffle, and store it, then it's deterministic, with zero chance of using an already-used number, plus I don't have to use overly long numbers.
bigredbob
@bigredbob: I get that. But 1) 12 digits isn't really overly long, and 2) I'm just saying that the drawbacks of the simplest solution aren't nearly as big as you think. Don't forget that the probabilities multiply.
balpha
+2  A: 

My first thought is to use a symmetric encryption algorithm (e.g. DES or AES) in counter mode. Your counter just goes from 0 upwards, and the encryption uses a key of your choice to turn it into a seemingly random value. This is guaranteed never to collide, because a block cipher under a fixed key is a 1:1 mapping of its block values. One benefit is that it's reversible, so you could take the serial number and decrypt it to get back to the simple counter value.

The one problem I see with this is that the encrypted value is 64 bits (for DES), which would give you a 20-digit decimal serial number that is probably impractically large.

Still, since you aren't really after a cryptographic solution, I wonder if you could take this general idea and adapt it for your needs. Interested in making a "32-bit block encryption algorithm"?!

(Actually, you mention timing attacks, so perhaps you do need something cryptographically secure?)
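For what it's worth, a toy sketch of such a "32-bit block encryption algorithm": a four-round Feistel network over two 16-bit halves. A Feistel round is invertible no matter what its round function is, so counters map 1:1 to serials; the round function and keys below are arbitrary placeholders, not anything cryptographically vetted.

MASK16 = 0xFFFF
KEYS = [0x9E37, 0x79B9, 0x7F4A, 0x7C15]  # arbitrary fixed round keys

def round_fn(half, key):
    return ((half * 0x6255) ^ key) & MASK16  # placeholder mixing step

def encrypt32(counter):
    left, right = counter >> 16, counter & MASK16
    for key in KEYS:
        left, right = right, left ^ round_fn(right, key)
    return (left << 16) | right

A 32-bit block also keeps the serials at no more than 10 decimal digits, much closer to the target than DES's 20.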

Craig McQueen
+4  A: 

I think you are overestimating the problems with approach 1). Unless you have hard real-time requirements, the generate-and-check loop terminates quickly, and the probability of needing more than k iterations decays exponentially in k. With 100M numbers issued (10% fill factor) you have a one-in-a-billion chance of requiring more than 9 iterations. Even with 50% of the numbers taken you'll need only 2 iterations on average, with a one-in-a-billion chance of requiring more than 30 checks. Even the extreme case where 99% of the numbers are already taken might still be reasonable: you'll average 100 iterations, with a one-in-a-billion chance of requiring more than 2062.
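A quick back-of-the-envelope check of those figures: with fill factor p, the chance that k consecutive draws all hit used numbers is p**k, and the expected number of draws is 1/(1-p). Exact arithmetic via fractions avoids floating-point edge cases:

from fractions import Fraction

ONE_IN_A_BILLION = Fraction(1, 10 ** 9)
for p in (Fraction(1, 10), Fraction(1, 2), Fraction(99, 100)):
    mean_draws = float(1 / (1 - p))  # expected draws until an unused number
    k, prob = 1, p                   # invariant: prob == p ** k
    while prob > ONE_IN_A_BILLION:
        prob *= p
        k += 1
    print("fill %s: %.1f draws on average, one-in-a-billion tail at %d" % (p, mean_draws, k))

This reproduces the 9, 30, and 2062 iteration figures above.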

Ants Aasma
+2  A: 

With properly chosen parameters, a Linear Congruential random number generator's state sequence CANNOT repeat until the full set of values in its period has been generated. Then it MUST repeat precisely.

The internal state is often large (48 or 64 bits). The generated numbers are smaller (32 bits usually) because the entire set of state bits is not random. If you follow the state values, they form a distinct non-repeating sequence.

The question is essentially one of locating parameters and a seed that generate "enough" numbers. You can pick a seed and generate numbers until you get back to the starting seed. That's the length of the sequence. It may be millions or billions of numbers.

There are some guidelines in Knuth for picking a modulus, multiplier, and increment that will generate very long sequences of unique numbers.
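A concrete sketch of a full-period generator, relying on the Hull-Dobell conditions (the increment must be coprime to the modulus; a - 1 must be divisible by all prime factors of the modulus, and by 4 if the modulus is; the constants below are just one choice satisfying those for a modulus of one billion):

m = 10 ** 9  # modulus; prime factors 2 and 5, divisible by 4
a = 21       # multiplier; a - 1 = 20 is divisible by 2, 5 and 4
c = 7        # increment; coprime to m

def next_serial(x):
    return (a * x + c) % m

# empirical check of the full period, on a small modulus of the same shape
seen, x = set(), 0
for i in range(1000):
    x = (21 * x + 7) % 1000
    assert x not in seen
    seen.add(x)

Note that LCG outputs are easy to predict from a few samples, so this guarantees uniqueness, not unguessability.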

S.Lott
A: 

I started trying to write an explanation of the approach used below, but just implementing it was easier and more accurate. This approach has the odd behavior that it gets faster the more numbers you've generated. But it works, and it doesn't require you to generate all the numbers in advance.

As a simple optimization, you could easily make this class use a probabilistic algorithm (generate a random number, and if it's not in the set of used numbers add it to the set and return it) at first, keep track of the collision rate, and switch over to the deterministic approach used here once the collision rate gets bad.

import random

class NonRepeatingRandom(object):

    def __init__(self, maxvalue):
        self.maxvalue = maxvalue
        self.used = set()

    def next(self):  # Python 2 iterator protocol; alias to __next__ for Python 3
        if len(self.used) >= self.maxvalue:
            raise StopIteration
        # pick the (r+1)-th not-yet-used number, counting up from 1
        r = random.randrange(0, self.maxvalue - len(self.used))
        result = 0
        for i in range(r + 1):
            result += 1
            while result in self.used:
                result += 1
        self.used.add(result)
        return result

    def __iter__(self):
        return self

    def __getitem__(self, index):
        raise NotImplementedError

    def get_all(self):
        return [i for i in self]

>>> n = NonRepeatingRandom(20)
>>> n.get_all()
[12, 14, 13, 2, 20, 4, 15, 16, 19, 1, 8, 6, 7, 9, 5, 11, 10, 3, 18, 17]
Robert Rossney
A: 

If you don't need something cryptographically secure, but just "sufficiently obfuscated"...

Galois Fields

You could try operations in Galois fields, e.g. GF(2^32), to map a simple incrementing counter x to a seemingly random serial number y:

x = counter_value
y = some_galois_function(x)
  • Multiply by a constant
    • Inverse is to multiply by the reciprocal of the constant
  • Raise to a power: x^n
  • Reciprocal: x^(-1)
    • Special case of raising to the power n = -1
    • It is its own inverse
  • Exponentiation of a primitive element: a^x

Many of these operations have an inverse, which means, given your serial number, you can calculate the original counter value from which it was derived.

As for finding a library for Galois field arithmetic in Python... good question. If you don't need speed (which you wouldn't for this), you could roll your own; there are a few libraries around, but I haven't tried them.
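For illustration, rolling your own "multiply by a constant" in a small field, GF(2^8) with the AES polynomial (for serials you'd use a degree-32 irreducible polynomial instead, but the mechanics are identical):

AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b, poly=AES_POLY, width=8):
    # carry-less multiplication with reduction modulo poly
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> width:
            a ^= poly
    return result

# multiplying by a nonzero constant permutes the field, so distinct
# counter values always yield distinct "serials"
serials = [gf_mul(x, 0x53) for x in range(256)]
assert len(set(serials)) == 256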

CRC

A related possibility is to use a CRC calculation, which is based on the remainder of long division by an irreducible polynomial over GF(2). Python code is readily available for CRCs (crcmod, pycrc), although you might want to pick a different irreducible polynomial than is normally used, for your purposes. I'm a little fuzzy on the theory, but I think a 32-bit CRC should generate a unique value for every possible combination of 4-byte inputs. Check this. It's quite easy to check experimentally for e.g. 16-bit CRCs, even 24-bit these days, and 32-bit with the right code and sufficient RAM.
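The 16-bit check really is quick. A bare-bones bit-at-a-time CRC (CCITT polynomial, no reflection, zero initial value), applied to every possible 16-bit input:

def crc16(value, poly=0x1021, width=16):
    reg = value
    for _ in range(width):  # equivalent to dividing value * x^16 by the polynomial
        if reg >> (width - 1):
            reg = ((reg << 1) ^ poly) & ((1 << width) - 1)
        else:
            reg = (reg << 1) & ((1 << width) - 1)
    return reg

assert len(set(crc16(v) for v in range(1 << 16))) == 1 << 16

The mapping is one-to-one here because the polynomial has a nonzero constant term, which makes "multiply by x" invertible in the quotient ring.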

Craig McQueen
A: 

You can use approach 1) without running into the problem of too many wasted draws if you just decrease the random interval by one each time.

For this method to work, you will need to save the numbers already given (which you want to do anyway) and also save the quantity of numbers taken.

It is pretty obvious that, after having collected 10 numbers, your pool of possible random numbers will have shrunk by 10. Therefore, you must not choose a number between 1 and 1,000,000 but between 1 and 999,990. Of course, this number is not the real serial but only an index (unless the 10 numbers collected happened to be 999,991, 999,992, …); you’d have to count up from 1, skipping all the numbers already collected.

Of course, your algorithm should be smarter than literally counting from 1 to 1,000,000, but I hope the method is clear.
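A sketch of that bookkeeping, keeping the issued numbers sorted so a random index into the remaining pool can be mapped to the actual unused value (names are illustrative; the pool size is the million from the example above):

import bisect, random

POOL = 10 ** 6
issued = []  # sorted list of serials already given out

def draw():
    r = random.randrange(POOL - len(issued))  # index into the remaining numbers
    candidate = r + 1                         # serials run from 1..POOL
    for used in issued:                       # shift past already-taken numbers
        if used <= candidate:
            candidate += 1
        else:
            break
    bisect.insort(issued, candidate)
    return candidate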

I don’t like drawing random numbers until I get one which fits either. It just feels wrong.

Debilski