views:

253

answers:

3

This article says:

Every prime number can be expressed as 30k±1, 30k±7, 30k±11, or 30k±13 for some k. That means we can use eight bits per thirty numbers to store all the primes; a million primes can be compressed to 33,334 bytes


"That means we can use eight bits per thirty numbers to store all the primes"

This "eight bits per thirty numbers" would be for k, correct? But each k value will not necessarily take-up just one bit. Shouldn't it be eight k values instead?


"a million primes can be compressed to 33,334 bytes"

I am not sure how this is true.

We need to indicate two things:

  • VALUE of k (can be arbitrarily large)

  • STATE from one of the eight states (-13,-11,-7,-1,1,7,11,13)

I am not following how "33,334 bytes" was arrived at, but I can say one thing: as the prime numbers become larger and larger in value, we will need more space to store the value of k.

How, then can we fix it at "33,334 bytes"?

+9  A: 

You don't need to store each value of k. If you want to store the prime numbers below 1 million, use 33,334 bytes - the first byte corresponds to k=0, the second to k=1 etc. Then, in each byte, use 1 bit to indicate "prime" or "composite" for 30k+1, 30k+7 etc.

Paul Baker
+11  A: 

The article is a bit misleading: we can't store 1 million primes, but we can store all primes below 1 million.

The value of k comes from its position in the list. We only need 1 bit for each of those 8 permutations (-13,-11..,11,13)

In other words, we'll use 8 bits to store for k=0, 8 to store for k=1, 8 to store for k=2, etc. By letting these follow sequentially, we don't need to specify the value of k for each 8 bits - it's simply the value for the previous 8 bits + 1.

Since 1,000,000 / 30 = 33,333 1/3, we can store 33,334 of these 8 bit sequences to represent which values below 1 million are prime, since we cover all of the values k can have without 30k-13 exceeding the limit of 1 million.

Michael Madsen
+1  A: 

It's a bitmask--one bit for each of the 8 values out of 30 that might be prime, so 8 bits per 30 numbers. To tabulate all primes up to 10^6, you thus need 8*10^6/30 = 2666667 bits = 33334 bytes.

To explain why this is a good way to go, you need to look at the obvious alternatives.

A more naive way to go would just be to use a bitmask. You need a million bits, 125000 bytes.

You could also store the values of the primes themselves. Up to 1000000, the values fit in 20 bits, and there are 78498 primes, so this gives a disappointing 1569960 bits (196245 bytes).

Another way to go--though less useful for looking up primes--is to store the differences between each prime and the next. Under a million, this fits in 6 bits (as long as you remember that the primes are all odd at that point, so you only need to store even differences and can thus throw away the lowest bit), for 470998 bits == 58874 bytes. (You could shave off another bit by counting how many mod-30 slots you had to jump.)

Now, there's nothing particularly special about 30 except that 30 = 2*3*5, so this lookup is actually walking you up through a bitmask representation of the Sieve of Eratosthanes pattern just after you've gotten started. You could instead use 2*3*5*7 = 210, and then you'd have to consider +- 1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, for 48 values. If you were doing this with 7 blocks of 30, you'd need 7*8=56 bits, so this is a slight improvement, but ugh...hardly worth the hassle.

So this is one of the better tricks out there for compactly storing reasonably small prime numbers.

(P.S. It's interesting to note that if primes appeared randomly (but the same number appeared up to 1000000 as actually appear) the amount of information stored in the primality of a number between 1 and 10^6 would be ~0.397 bits per number. Thus, under naive information-theoretic assumptions, you'd think that the best you could possibly do to store the first million primes was to use 1000000*0.397 bits, or 49609 bytes.)

Rex Kerr
@Rex Kerr: thanks for the comparisons. that makes things much more clear. one thing though: how did you arrive at `~0.397 bits per number` figure?
Lazer
p(prime) ~= 0.0785 because there are 78.5k primes in the first 1M numbers. The formula for entropy in bits is H = sum(-p*log2(p)); we have p(prime) and p(not prime)=1-p(prime). Plug in: -0.0785*log2(0.0785) - 0.9215*log2(0.9215) = 0.288 + 0.109 = 0.397
Rex Kerr