546

7
+21  Q:

## Why do digits 1, 2 and 3 appear so frequently using C rand() function?

What I am trying to do is to generate some random numbers (not necessarily single digit) like

``````29106
7438
5646
4487
9374
28671
92
13941
25226
10076
``````

and then count the number of digits I get:

``````count[0] =       3  Percentage =  6.82
count[1] =       5  Percentage = 11.36
count[2] =       6  Percentage = 13.64
count[3] =       3  Percentage =  6.82
count[4] =       6  Percentage = 13.64
count[5] =       2  Percentage =  4.55
count[6] =       7  Percentage = 15.91
count[7] =       5  Percentage = 11.36
count[8] =       3  Percentage =  6.82
count[9] =       4  Percentage =  9.09
``````

This is the code I am using:

``````#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main() {

int i;
srand(time(NULL));
FILE* fp = fopen("random.txt", "w");
// for(i = 0; i < 10; i++)
for(i = 0; i < 1000000; i++)
fprintf(fp, "%d\n", rand());
fclose(fp);

int dummy;
long count[10] = {0,0,0,0,0,0,0,0,0,0};
fp = fopen("random.txt", "r");
while(!feof(fp)) {
fscanf(fp, "%1d", &dummy);
count[dummy]++;
}
fclose(fp);

long sum = 0;
for(i = 0; i < 10; i++)
sum += count[i];

for(i = 0; i < 10; i++)
printf("count[%d] = %7ld  Percentage = %5.2f\n",
i, count[i], ((float)(100 * count[i])/sum));

}
``````

If I generate a large number of random numbers (1000000), this is the result I get:

``````count[0] =  387432  Percentage =  8.31
count[1] =  728339  Percentage = 15.63
count[2] =  720880  Percentage = 15.47
count[3] =  475982  Percentage = 10.21
count[4] =  392678  Percentage =  8.43
count[5] =  392683  Percentage =  8.43
count[6] =  392456  Percentage =  8.42
count[7] =  391599  Percentage =  8.40
count[8] =  388795  Percentage =  8.34
count[9] =  389501  Percentage =  8.36
``````

Notice that 1, 2 and 3 have too many hits. I have tried running this several times and each time I get very similar results.

I am trying to understand what could cause 1, 2 and 3 to appear much more frequently than any other digit.

Taking hint from what Matt Joiner and Pascal Cuoq pointed out,

I changed the code to use

``````for(i = 0; i < 1000000; i++)
fprintf(fp, "%04d\n", rand() % 10000);
// pretty prints 0
// generates numbers in range 0000 to 9999
``````

and this is what I get (similar results on multiple runs):

``````count[0] =  422947  Percentage = 10.57
count[1] =  423222  Percentage = 10.58
count[2] =  414699  Percentage = 10.37
count[3] =  391604  Percentage =  9.79
count[4] =  392640  Percentage =  9.82
count[5] =  392928  Percentage =  9.82
count[6] =  392737  Percentage =  9.82
count[7] =  392634  Percentage =  9.82
count[8] =  388238  Percentage =  9.71
count[9] =  388352  Percentage =  9.71
``````

What can be the reason that 0, 1 and 2 are favored?

Thanks everyone. Using

``````int rand2(){
int num = rand();
return (num > 30000? rand2():num);
}

fprintf(fp, "%04d\n", rand2() % 10000);
``````

I get

``````count[0] =  399629  Percentage =  9.99
count[1] =  399897  Percentage = 10.00
count[2] =  400162  Percentage = 10.00
count[3] =  400412  Percentage = 10.01
count[4] =  399863  Percentage = 10.00
count[5] =  400756  Percentage = 10.02
count[6] =  399980  Percentage = 10.00
count[7] =  400055  Percentage = 10.00
count[8] =  399143  Percentage =  9.98
count[9] =  400104  Percentage = 10.00
``````
+4  A:

Looks like Benford's Law - see http://en.wikipedia.org/wiki/Benford%27s_law, or alternatively a not very good RNG.

Benfords law was my first thought as well, but doesnt it hold just for "real-life" data, i.e. empirically retrieved data?
1.23% of statistics will not comply with Benford's law, except for on 3/12/2013. Sorry - couldn't resist. My belief is that this is indeed just for real-life data.
Benford's Law explains the same observation but not under the given circumstances. I assume a pseudo-random uniform distribution. Benford's law applys to distributions which have uniform logarithms.
+35  A:

`rand()` generates a value from `0` to `RAND_MAX`. `RAND_MAX` is set to `INT_MAX` on most platforms, which may be `32767` or `2147483647`.

For your example given above, it appears that `RAND_MAX` is `32767`. This will place an unusually high frequency of `1`, `2` and `3` for the most significant digit for the values from `10000` to `32767`. You can observe that to a lesser degree, values up to `6` and `7` will also be slightly favored.

Beat me to it - good call.
Why should 6 and 7 be slightly favored?
'cause for any number > 32700, the fourth digit can be as high as 6. For any number > 32760, the fourth digit can be as high as 7.
Much more important that the bias for six and seven is the bias against zero. 00012 is pretty-printed "12" but 11112 is pretty-printed "11112". All leading zeroes that would make statistics balanced if the range was a power of ten are omitted by `printf`.
Thanks Will and Pascal, very good observations/points.
A:

rand() implementations vary wildly. IIRC, linear congruential random generators tend to have less randomness in lower-order bits. Quoting from the linux rand() man page:

The versions of rand() and srand() in the Linux C Library use the same random number generator as random(3) and srandom(3), so the lower-order bits should be as random as the higher-order bits. However, on older rand() implementations, and on current implementations on different systems, the lower-order bits are much less random than the higher-order bits. Do not use this function in applications intended to be portable when good randomness is needed.
+1  A:

That's because you generate numbers between `0` and `RAND_MAX`. The generated numbers are evenly distributed (i.e. approx. same probability for each number), however, the digits 1,2,3 occur more often than others in this range. Try generating between `0` and `10`, where each digit occurs with the same probability and you'll get a nice distribution.

+15  A:
A:

When you want to generate random value from range `[0, x)`, instead of doing `rand()%x`, you should apply formula `x*((double)rand()/RAND_MAX)`, which will give you nicely distributed random values.

Say, RAND_MAX is equal to 15, so `rand` will give you integers from 0 to 15. When you use modulo operator to get random numbers from `[0, 10)`, values `[0,5]` will have higher frequency than `[6,9]`, because `3 == 3%10 == 13%10`.

+1  A:

If I understand what the OP (person asking the question) wants, they want to make better random numbers.

rand() and random(), quite frankly, don't make very good random numbers; they both do poorly when tested against diehard and dieharder (two packages for testing the quality of random numbers).

The Mersenne twister is a popular random number generator which is good for pretty much everything except crypto-strong random numbers; it passes all of the diehard(er) tests with flying colors.

If one needs crypto-strong random numbers (numbers that can not be guessed, even if someone knows which particular crypto-strong algorithm is being used), there are a number of stream ciphers out there. The one I like to use is called RadioGatún[32], and here’s a compact C representation of it:

``````/*Placed in the public domain by Sam Trenholme*/
#include <stdint.h>
#include <stdio.h>
#define p uint32_t
#define f(a) for(c=0;c<a;c++)
#define n f(3){b[c*13]^=s[c];a[16+c]^=s[c];}k(a,b
k(p *a,p *b){p A[19],x,y,r,q[3],c,i;f(3){q[c]=b[c
*13+12];}for(i=12;i;i--){f(3){b[c*13+i]=b[c*13+i-
1];}}f(3){b[c*13]=q[c];}f(12){i=c+1+((c%3)*13);b[
i]^=a[c+1];}f(19){y=(c*7)%19;r=((c*c+c)/2)%32;x=a
[y]^(a[(y+1)%19]|(~a[(y+2)%19]));A[c]=(x>>r)|(x<<
(32-r));}f(19){a[c]=A[c]^A[(c+1)%19]^A[(c+4)%19];
}a[0]^=1;f(3){a[c+13]^=q[c];}}l(p *a,p *b,char *v
){p s[3],q,c,r,x,d=0;for(;;){f(3){s[c]=0;}for(r=0
;r<3;r++){for(q=0;q<4;q++){if(!(x=*v&255)){d=x=1;
}v++;s[r]|=x<<(q*8);if(d){n);return;}}}n);}}main(
int j,char **h){p a[39],b[39],c,e,g;if(j==2){f(39
){a[c]=b[c]=0;}l(a,b,h[1]);f(16){k(a,b);}f(4){k(a
,b);for(j=1;j<3;++j){g=a[j];for(e=4;e;e--){printf
("%02x",g&255);g>>=8;}}}printf("\n");}}
``````

There are also a lot of other really good random number generators out there.