views:

546

answers:

7

What I am trying to do is to generate some random numbers (not necessarily single digit) like

29106
7438
5646
4487
9374
28671
92
13941
25226
10076

and then count the number of digits I get:

count[0] =       3  Percentage =  6.82
count[1] =       5  Percentage = 11.36
count[2] =       6  Percentage = 13.64
count[3] =       3  Percentage =  6.82
count[4] =       6  Percentage = 13.64
count[5] =       2  Percentage =  4.55
count[6] =       7  Percentage = 15.91
count[7] =       5  Percentage = 11.36
count[8] =       3  Percentage =  6.82
count[9] =       4  Percentage =  9.09

This is the code I am using:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main() {

    int i;
    srand(time(NULL));
    FILE* fp = fopen("random.txt", "w");    
    // for(i = 0; i < 10; i++)
    for(i = 0; i < 1000000; i++)
        fprintf(fp, "%d\n", rand());
    fclose(fp);

    int dummy;
    long count[10] = {0,0,0,0,0,0,0,0,0,0};
    fp = fopen("random.txt", "r");
    while(!feof(fp)) {
        fscanf(fp, "%1d", &dummy);
        count[dummy]++;                 
    }
    fclose(fp);

    long sum = 0;
    for(i = 0; i < 10; i++)
        sum += count[i];

    for(i = 0; i < 10; i++)
        printf("count[%d] = %7ld  Percentage = %5.2f\n",
            i, count[i], ((float)(100 * count[i])/sum));

}

If I generate a large number of random numbers (1000000), this is the result I get:

count[0] =  387432  Percentage =  8.31
count[1] =  728339  Percentage = 15.63
count[2] =  720880  Percentage = 15.47
count[3] =  475982  Percentage = 10.21
count[4] =  392678  Percentage =  8.43
count[5] =  392683  Percentage =  8.43
count[6] =  392456  Percentage =  8.42
count[7] =  391599  Percentage =  8.40
count[8] =  388795  Percentage =  8.34
count[9] =  389501  Percentage =  8.36

Notice that 1, 2 and 3 have too many hits. I have tried running this several times and each time I get very similar results.

I am trying to understand what could cause 1, 2 and 3 to appear much more frequently than any other digit.


Taking hint from what Matt Joiner and Pascal Cuoq pointed out,

I changed the code to use

for(i = 0; i < 1000000; i++)
    fprintf(fp, "%04d\n", rand() % 10000);
// pretty prints 0
// generates numbers in range 0000 to 9999

and this is what I get (similar results on multiple runs):

count[0] =  422947  Percentage = 10.57
count[1] =  423222  Percentage = 10.58
count[2] =  414699  Percentage = 10.37
count[3] =  391604  Percentage =  9.79
count[4] =  392640  Percentage =  9.82
count[5] =  392928  Percentage =  9.82
count[6] =  392737  Percentage =  9.82
count[7] =  392634  Percentage =  9.82
count[8] =  388238  Percentage =  9.71
count[9] =  388352  Percentage =  9.71

What can be the reason that 0, 1 and 2 are favored?


Thanks everyone. Using

int rand2(){
    int num = rand();
    return (num > 30000? rand2():num);     
}

    fprintf(fp, "%04d\n", rand2() % 10000);

I get

count[0] =  399629  Percentage =  9.99
count[1] =  399897  Percentage = 10.00
count[2] =  400162  Percentage = 10.00
count[3] =  400412  Percentage = 10.01
count[4] =  399863  Percentage = 10.00
count[5] =  400756  Percentage = 10.02
count[6] =  399980  Percentage = 10.00
count[7] =  400055  Percentage = 10.00
count[8] =  399143  Percentage =  9.98
count[9] =  400104  Percentage = 10.00
+4  A: 

Looks like Benford's Law - see http://en.wikipedia.org/wiki/Benford%27s_law, or alternatively a not very good RNG.

anon
Benfords law was my first thought as well, but doesnt it hold just for "real-life" data, i.e. empirically retrieved data?
phimuemue
1.23% of statistics will not comply with Benford's law, except for on 3/12/2013. Sorry - couldn't resist. My belief is that this is indeed just for real-life data.
Will A
Benford's Law explains the same observation but not under the given circumstances. I assume a pseudo-random uniform distribution. Benford's law applys to distributions which have uniform logarithms.
Peter G.
1111222334 Nice link
Matt Joiner
+35  A: 

rand() generates a value from 0 to RAND_MAX. RAND_MAX is set to INT_MAX on most platforms, which may be 32767 or 2147483647.

For your example given above, it appears that RAND_MAX is 32767. This will place an unusually high frequency of 1, 2 and 3 for the most significant digit for the values from 10000 to 32767. You can observe that to a lesser degree, values up to 6 and 7 will also be slightly favored.

Matt Joiner
Beat me to it - good call.
Will A
Why should 6 and 7 be slightly favored?
Hippo
'cause for any number > 32700, the fourth digit can be as high as 6. For any number > 32760, the fourth digit can be as high as 7.
Will A
Much more important that the bias for six and seven is the bias against zero. 00012 is pretty-printed "12" but 11112 is pretty-printed "11112". All leading zeroes that would make statistics balanced if the range was a power of ten are omitted by `printf`.
Pascal Cuoq
Thanks Will and Pascal, very good observations/points.
Matt Joiner
A: 

rand() implementations vary wildly. IIRC, linear congruential random generators tend to have less randomness in lower-order bits. Quoting from the linux rand() man page:

The versions of rand() and srand() in the Linux C Library use the same random number generator as random(3) and srandom(3), so the lower-order bits should be as random as the higher-order bits. However, on older rand() implementations, and on current implementations on different systems, the lower-order bits are much less random than the higher-order bits. Do not use this function in applications intended to be portable when good randomness is needed.
ninjalj
+1  A: 

That's because you generate numbers between 0 and RAND_MAX. The generated numbers are evenly distributed (i.e. approx. same probability for each number), however, the digits 1,2,3 occur more often than others in this range. Try generating between 0 and 10, where each digit occurs with the same probability and you'll get a nice distribution.

phimuemue
+15  A: 
KennyTM
A: 

When you want to generate random value from range [0, x), instead of doing rand()%x, you should apply formula x*((double)rand()/RAND_MAX), which will give you nicely distributed random values.

Say, RAND_MAX is equal to 15, so rand will give you integers from 0 to 15. When you use modulo operator to get random numbers from [0, 10), values [0,5] will have higher frequency than [6,9], because 3 == 3%10 == 13%10.

el.pescado
+1  A: 

If I understand what the OP (person asking the question) wants, they want to make better random numbers.

rand() and random(), quite frankly, don't make very good random numbers; they both do poorly when tested against diehard and dieharder (two packages for testing the quality of random numbers).

The Mersenne twister is a popular random number generator which is good for pretty much everything except crypto-strong random numbers; it passes all of the diehard(er) tests with flying colors.

If one needs crypto-strong random numbers (numbers that can not be guessed, even if someone knows which particular crypto-strong algorithm is being used), there are a number of stream ciphers out there. The one I like to use is called RadioGatún[32], and here’s a compact C representation of it:

/*Placed in the public domain by Sam Trenholme*/
#include <stdint.h>
#include <stdio.h> 
#define p uint32_t
#define f(a) for(c=0;c<a;c++)
#define n f(3){b[c*13]^=s[c];a[16+c]^=s[c];}k(a,b 
k(p *a,p *b){p A[19],x,y,r,q[3],c,i;f(3){q[c]=b[c
*13+12];}for(i=12;i;i--){f(3){b[c*13+i]=b[c*13+i- 
1];}}f(3){b[c*13]=q[c];}f(12){i=c+1+((c%3)*13);b[
i]^=a[c+1];}f(19){y=(c*7)%19;r=((c*c+c)/2)%32;x=a
[y]^(a[(y+1)%19]|(~a[(y+2)%19]));A[c]=(x>>r)|(x<<
(32-r));}f(19){a[c]=A[c]^A[(c+1)%19]^A[(c+4)%19];
}a[0]^=1;f(3){a[c+13]^=q[c];}}l(p *a,p *b,char *v
){p s[3],q,c,r,x,d=0;for(;;){f(3){s[c]=0;}for(r=0
;r<3;r++){for(q=0;q<4;q++){if(!(x=*v&255)){d=x=1;
}v++;s[r]|=x<<(q*8);if(d){n);return;}}}n);}}main(
int j,char **h){p a[39],b[39],c,e,g;if(j==2){f(39
){a[c]=b[c]=0;}l(a,b,h[1]);f(16){k(a,b);}f(4){k(a
,b);for(j=1;j<3;++j){g=a[j];for(e=4;e;e--){printf
("%02x",g&255);g>>=8;}}}printf("\n");}}

There are also a lot of other really good random number generators out there.

samiam