ansaurus

Question

What's the best way to return a random line in a text file using C?

Answer 1

+7 A:

This method is good because:

i) You can keep generating random lines at no big cost

ii) You only have to read the file a total of 1 time + 1 line at a time per random line you want. The excess read data is only equal to the size of the file.

iii) It gives each line a fair chance no matter what its position is in the file.

iv) It gives each line a fair chance no matter what its length is in the file.

The suggestion:

I would suggest a 2 pass algorithm. Well really it's a 1 pass + N lines. Where N is the number of random lines you want.

The first pass you would use to calculate how many lines and the start positions of each line.

You then take a random number from 0 to the number of lines minus 1. Use that random number, which is your line index, get the start position for that line index. Seek to that position.

You then have only 1 more read needed, and you know the exact size. (until the start index of the next line)

How to store the number of lines and the index of each line:

To store the number of lines, you can obviously just use an int.

If you can use a vector then you can add each line index into the vector. If not you can just create an array of ints with the max number of lines you think there will be. Then index into that array.

Other answers:

Another answer mentioned that you can pick a random number from 1 to the size of the file, and then use the closest newline. But this won't work. For example you might have 1 line that is really long and the others that are not that long. In that case you would have an uneven distribution.

Brian R. Bondy 2008-10-24 02:01:11

See Mark Random's solution

1800 INFORMATION 2008-10-24 02:05:51

His solution is not good if you have to keep generating random lines. A lot of extra reading that you won't need.

Brian R. Bondy 2008-10-24 02:12:46

It took a couple of tries but you finally came up with a decent answer so I reversed my downvote. I have solved this same problem before using both your method and Mark's, they are both fine solutions, which is best depends on the situation.

Robert Gamble 2008-10-24 02:18:27

If you want a uniform distribution (even odds of picking any line), you need to do something like this. Mark Random's is heavily biased to the first lines in the file (which may or may not be fine for the application).

Mr Fooz 2008-10-24 02:20:55

Mark Random? I don't think anybody's called me that before, but in this case it fits...

Mark Ransom 2008-10-24 02:22:16

I had random on the brain

1800 INFORMATION 2008-10-24 02:23:01

Your solution is not good it the content of the file is potentially changing - it's a tradeoff

1800 INFORMATION 2008-10-24 02:24:10

Mark's and Adam's solutions each "work": they do sample from a random distribution. It just isn't a uniform distribution (which is probably what's desired). If all lines are of similar length, Adam's could be a good approximation.

Mr Fooz 2008-10-24 02:26:21

@Mr Fooz, Mark's solution (if I interpreted it correctly) is a well-known algorithm for this very problem and appears in TAoCP. At first it might appear to be biased but when you work out the math you can see that this is not the case.

Robert Gamble 2008-10-24 02:26:29

1800 I guess you could make a copy if you knew it would be changing, but still needed a fair random distribution. Or you could use some kind of file locking.

Brian R. Bondy 2008-10-24 02:26:34

+1: Based on the clarification that the file does not change. This is a _much much_ better solution than the accepted answer. Of course, this assumes that nintendo DS supports seeking in a file. If it does not, then this is a _much_ better solution than the accepted solution.

Moron 2010-06-09 21:37:09

Ya a lot of people don't consider that 2 pass is not the end of the world considering sometimes people use O(n^2). O(2n) = O(n). That's a great point about if the file changes. Maybe some locking mechanism could be used or else a copy of the file.

Brian R. Bondy 2010-06-10 00:49:38

Answer 2

+17 A:

Read each line, and use a random number to choose whether to keep that line or ignore it. For the first line, you want odds of 1:1 to keep; for the second, you want odds of 1:2, etc.

count = 0;
while (fgets(line, length, stream) != NULL)
{
    count++;
    if ((rand() * count) / RAND_MAX == 0)
        strcpy(keptline, line);
}

I haven't verified that this has the proper random qualities, but it seems right at first glance.

It has been pointed out that integer overflow would quickly become a problem with the way the comparison is coded, and I had independently reached the same conclusion myself. There are probably many ways to fix it, but this is the first that comes to mind:

if ((rand() / (float)RAND_MAX) <= (1.0 / count))

Mark Ransom 2008-10-24 02:01:54

I like the use of direct probability over choosing randomly from a list of lines.

Bernard 2008-10-24 02:03:56

I upvoted the answer because I think in theory it's right. However, integer overflow does cause the end result to be skewed.

Chris Jester-Young 2008-10-24 02:12:36

1st line: 50% chance. 2nd line: (100-50%)*(1/3)=16.7%, 3rd line: (100-50-16.7%)*(1/4)=8.3%, ... This is a valid distribution, but it isn't all uniform (and the OP didn't say it had to be uniform).

Mr Fooz 2008-10-24 02:18:14

just be careful that rand() * count doesn't cause an overflow for larger files (MAX_INT / RAND_MAX is the limit on the number of lines).

MattSmith 2008-10-24 02:19:15

actually with this method, you will get the first line of the file half the time!

MattSmith 2008-10-24 02:26:50

There's surely something that I don't understand, If you have 1000 lines, the last line has a very low probability of being kept, but the first line still has 50%. What am I missing?

Martin Cote 2008-10-24 02:27:33

What the heck is everybody talking about? This solution **does** pick every line with equal probability, regardless of how many lines there are. First line has a 1/1 chance of being picked, but an (n-1)/n chance of being replaced later on.

ephemient 2008-10-24 02:36:30

tvan and Martin Cote - you are both wrong. See e.g., here for a detailed explanation of the math behind this answer http://www.unix.com.ua/orelly/perl/cookbook/ch08_07.htm

1800 INFORMATION 2008-10-24 02:37:41

tvanoffsen - perhaps your misunderstanding is that it doesn't stop when it has "picked" a line - subsequent lines can replace it (with diminishing probability)

1800 INFORMATION 2008-10-24 02:39:31

Oh I see now, for some reason when I was reading and re-reading the example it didn't occur to me that he didn't exit the loop once he had selected a match (nevermind there was no code there to even do that, I guess I was just seeing what wasn't there).

cfeduke 2008-10-24 02:40:40

Says "Aha" and slaps head. Sorry.

tvanfosson 2008-10-24 02:42:37

Now that I see how it works, what about using rand() % count == 0 as the test?

tvanfosson 2008-10-24 03:20:32

And seeding the pseudorandom number generator...

tvanfosson 2008-10-24 03:22:01

This is clever because the number of lines can remain unknown until the end of the file, when you then have your answer. But, if one could know the number of lines, it would be preferred to pick a number, and fetch that. Alternative... I can't give in 64 characters! -frown- BIGGER COMMENTS, PLEASE!

Richard T 2008-10-24 03:35:36

Even though clever, if the file does not change, this becomes a worse solution than Brian's.

Moron 2010-06-09 21:40:13

@Moron, Brian's solution is only better if you have somewhere to store the number of lines and the starting position of each line, and if you need to select more than one line over the life of that storage. I don't think either condition can be taken as a given.

Mark Ransom 2010-06-09 22:04:47

@Mark: Agreed, I will remove the downvote, as there is really no way to tell. Sorry, you didn't deserve that downvote. Apologies.

Moron 2010-06-09 22:57:10

@Mark: I took the liberty to edit (added a space) to revert my vote.

Moron 2010-06-09 22:58:07

Answer 3

+3 A:

Get the length of the file.
Pick a random position in the file.
Seek to that position.
Iterate forward until you find a newline character.
If you don't find a newline character, seek back to the beginning.
Use gets() to read the line.

Adam Pierce 2008-10-24 02:02:13

There is a tradeoff here because your method will favour lines that are longer over shorter lines. Mark Ransom's solution doesn't have this problem, but it needs to read the entire file (but not all at once).

1800 INFORMATION 2008-10-24 02:05:16

Nice answer but it's not a good solution because you won't know the size of each line.

Brian R. Bondy 2008-10-24 02:05:22

This is a good choice for operating within the limited resources of the DS (or any such device).

cfeduke 2008-10-24 02:05:48

Nice. Why not just seek backwards though?

Draemon 2008-10-24 02:26:16

@Draemon - seeking backwards won't help, consider a file where line 1 is 98 bytes and lines 2 and 3 are both 1 byte. You have a 98% chance of landing in line 1, no matter what you do from there you will have a 98% chance of hitting the same line every time you run the process.

Robert Gamble 2008-10-24 03:06:26

It's sloppy but it's fast and that may be a reasonable tradeoff depending on the kind of data the OP is working with (eg. larger files with not too much variation in line length).

Adam Pierce 2008-10-24 03:15:27

Answer 4

A:

Can you afford to run through the file once to count lines, and then again to get a line? I can't see any way to do it that doesn't involve at least one pass through the file to actually find the line you choose.

2008-10-24 02:12:39

Answer 5

A:

I have an alternative solution. Since the platform is the DS, you'd probably not want to try to hold the file in memory. This reads the file twice. Once to count the lines and the 2nd time to find the line it wants. This would run slower than the other solutions suggested so far but it uses hardly any memory. I even wrote it in C for you (I omitted error handling):

main(int argc, char **argv)
{
    FILE *f;
    int nLines = 0;
    char line[1024];
    int randLine;
    int i;

    srand(time(0));
    f = fopen(argv[1], "r");

/* 1st pass - count the lines. */
    while(!feof(f))
    {
     fgets(line, 1024, f);
     nLines++;
    }

    randLine = rand() % nLines;
    printf("Chose %d of %d lines\n", randLine, nLines);

/* 2nd pass - find the line we want. */
    fseek(f, 0, SEEK_SET);
    for(i = 0; !feof(f) && i <= randLine; i++)
     fgets(line, 1024, f);

    printf("%s", line);
}

UPDATE: Oops, I should have read Brian R. Bondy's answer before I posted this but I was kinda obsessing over writing the code and didn't notice. This is almost the same except it doesn't store the line positions in an array. You could do it either way depending on how big the file is and whether speed is more important than saving memory.

Adam Pierce 2008-10-24 02:21:43

Answer 6

A:

Use a combination of Adam's random offset into the file approach and Mark's probability approach. Adam's method can get you randomly to a section of the file. Then you use Mark's approach to avoid preferring the larger strings. Mark's algorithm will prefer the first few strings from wherever it starts,

MattSmith 2008-10-24 02:29:15

Adam's algorithm is fatally flawed, no matter what you try to do to accommodate for this you cannot overcome the fact that the initial position chosen is going to be skewed based on line length which destroys any possibility of a uniform distribution.

Robert Gamble 2008-10-24 03:03:51

Yeah, I misunderstood Mark's algorithm as having a bias and this was supposed to be a cheap way of getting a uniform-ish distribution

MattSmith 2008-10-27 01:55:23

Answer 7

A:

paperhorse 2008-11-01 02:29:38

Answer 8

A:

Just a quick note about Mark Ransom's way of avoiding integer overflow: the DS has no FPU, so floating point division will be emulated in software and very slow. You'll want to avoid typecasting/promotion to float or double at all costs, if speed is a concern.

Here's a different way of avoiding integer overflow that avoids any floating point math:

if(rand() <= RAND_MAX / count)

The probabilities may be slightly skewed due to integer division, but this should certainly run much faster on a DS.

chazomaticus 2008-11-06 02:42:24

Answer 9

+3 A:

Mark's answer is almost correct except for two issues:

If a line is longer than length - 1 characters (including the newline), then the while loop will increment count at least twice for the same line: once for the first length - 1 characters, another for the next length - 1 characters, etc.
The calculation of rand() * count can cause an integer overflow.

To solve the first problem, you can call fgets into a trash buffer until it returns NULL (indicating an I/O error or EOF with no data read) or the trash buffer contains a newline:

count = 0;
while (fgets(line, length, stream) != NULL)
{
    char *p = strchr(line, '\n');
    if (p != NULL) {
        assert(*p == '\n');
        *p = '\0'; // trim the newline
    }
    else { // haven't reached EOL yet. Read & discard the rest of the line.
#define TRASH_LENGTH 1024
        char trash[TRASH_LENGTH];
        while((p = fgets(trash, TRASH_LENGTH, stream)) != NULL) {
            if ((p = strchr(trash, '\n')) != NULL) // reached EOL
                break;
        }
    }
    assert(strchr(line, '\n') == NULL); // `line` does not contain a newline
    count++;
    // ...

The second problem can be solved with @tvanfosson's suggestion if floating-point arithmetic is not available:

int one_chance_in(size_t n)
{
    if (rand() % n == 0) // `rand` returns an integer in [0, `RAND_MAX`]
        return 1;
    else
        return 0;
}

But note that rand() % n is not a uniform, discrete random variable even if rand() is assumed to be one because the probability that rand() % n == 0 can be as much as 1/RAND_MAX higher than the desired probability 1/n. On my machine, RAND_MAX is 2147483647, so the difference is 4.66 × 10^-10, but the C standard only requires that RAND_MAX be at least 32767 (3.05 × 10^-5 difference).

Also, for anyone left wondering why this scheme works (as I was), it might be helpful to work through the calculation of the probability that the first line remains in keptline if there are m lines and generalize: In the first iteration of the loop, the probability that the first line is copied to keptline is 1/1. In the second iteration of the loop, the probability that the second line does not overwrite the first line is 1/2. In the third iteration, the probability that the third line does not overwrite the first line is 2/3. Continuing, the probability that the last line does not overwrite the first line is (m - 1)/m. Thus, the probability that the first line remains in keptline after iterating over all lines is:

1/1 × 1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

The probability that the second line remains in keptline is:

1/2 × 2/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

The probability that the third line remains in keptline is:

1/3 × 3/4 × ... × (m - 2)/(m - 1) × (m - 1)/m = 1/m

Etc. They're all 1/m.

Daniel Trebbien 2010-05-28 23:44:29

ansaurus

tags:

views:

answers:

What's the best way to return a random line in a text file using C?

related questions