Mark's answer is almost correct except for two issues:
- If a line is longer than
length - 1
characters (including the newline), then the while
loop will increment count
at least twice for the same line: once for the first length - 1
characters, another for the next length - 1
characters, etc.
- The calculation of
rand() * count
can cause an integer overflow.
To solve the first problem, you can call fgets
into a trash buffer until it returns NULL
(indicating an I/O error or EOF with no data read) or the trash buffer contains a newline:
count = 0;
while (fgets(line, length, stream) != NULL)
{
char *p = strchr(line, '\n');
if (p != NULL) {
assert(*p == '\n');
*p = '\0'; // trim the newline
}
else { // haven't reached EOL yet. Read & discard the rest of the line.
#define TRASH_LENGTH 1024
char trash[TRASH_LENGTH];
while((p = fgets(trash, TRASH_LENGTH, stream)) != NULL) {
if ((p = strchr(trash, '\n')) != NULL) // reached EOL
break;
}
}
assert(strchr(line, '\n') == NULL); // `line` does not contain a newline
count++;
// ...
The second problem can be solved with @tvanfosson's suggestion if floating-point arithmetic is not available:
int one_chance_in(size_t n)
{
if (rand() % n == 0) // `rand` returns an integer in [0, `RAND_MAX`]
return 1;
else
return 0;
}
But note that rand() % n
is not a uniform, discrete random variable even if rand()
is assumed to be one because the probability that rand() % n == 0
can be as much as 1/RAND_MAX
higher than the desired probability 1/n
. On my machine, RAND_MAX
is 2147483647, so the difference is 4.66 × 10-10, but the C standard only requires that RAND_MAX
be at least 32767 (3.05 × 10-5 difference).
Also, for anyone left wondering why this scheme works (as I was), it might be helpful to work through the calculation of the probability that the first line remains in keptline
if there are m lines and generalize: In the first iteration of the loop, the probability that the first line is copied to keptline
is 1/1. In the second iteration of the loop, the probability that the second line does not overwrite the first line is 1/2. In the third iteration, the probability that the third line does not overwrite the first line is 2/3. Continuing, the probability that the last line does not overwrite the first line is (m - 1)/m. Thus, the probability that the first line remains in keptline
after iterating over all lines is:
1/1 × 1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m
The probability that the second line remains in keptline
is:
1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m
The probability that the third line remains in keptline
is:
1/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m
Etc. They're all 1/m.