I'm familiar with the algorithm for reading a single random line from a file without reading the whole file into memory. I wonder if this technique can be extended to N random lines?

The use case is for a password generator which concatenates N random words pulled out of a dictionary file, one word per line (like /usr/share/dict/words). You might come up with angela.ham.lewis.pathos. Right now it reads the whole dictionary file into an array and picks N random elements from that array. I would like to eliminate the array, or any other in-memory storage of the file, and only read the file once.

(No, this isn't a practical optimization exercise. I'm interested in the algorithm.)
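
For reference, the array-based approach being replaced boils down to something like this sketch (illustrative, not the actual generator; it reuses the same file path and word count as the scripts below):

#!/usr/bin/perl -lw
# Naive version: slurp the whole dictionary into memory, then pick N random entries.

my $Words     = "/usr/share/dict/words";
my $Num_Words = 4;

open my $fh, "<", $Words or die $!;
chomp(my @dict = <$fh>);                        # the whole file in an array

print join ".", map { $dict[rand @dict] } 1..$Num_Words;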

Update: Thank you all for your answers.

Answers fell into three categories: modifications of the full read algorithm, random seek, or index the lines and seek to them randomly.

The random seek is much faster and constant with respect to file size, but it distributes by position in the file rather than uniformly by word, so longer words are more likely to be picked. It also allows duplicates (which can be avoided, but that makes the worst-case running time unbounded). Here's my reimplementation of my password generator using that algorithm. I realize that by reading forward from the seek point, rather than backwards, it has an off-by-one error should the seek fall in the last line. Correcting it is left as an exercise for the editor.

#!/usr/bin/perl -lw

my $Words       = "/usr/share/dict/words";
my $Max_Length  = 8;
my $Num_Words   = 4;

my $size = -s $Words;

my @words;
open my $fh, "<", $Words or die $!;

for(1..$Num_Words) {
    seek $fh, int rand $size, 0 or die $!;   # jump to a random byte offset
    <$fh>;                                   # throw away the (probably partial) line we landed in
    my $word = <$fh>;                        # the next full line is our candidate
    chomp $word;
    redo if length $word > $Max_Length;      # too long: reseek and try again
    push @words, $word;
}
print join ".", @words;

And then there's Guffa's answer, which was what I was looking for: an extension of the original algorithm. It's slower, since it has to read the whole file, but it distributes uniformly by word, allows filtering without changing the efficiency of the algorithm, and (I think) produces no duplicates.

#!/usr/bin/perl -lw

my $Words       = "/usr/share/dict/words";
my $Max_Length  = 8;
my $Num_Words   = 4;

my @words;
open my $fh, "<", $Words or die $!;
my $count = 0;
while(my $line = <$fh>) {
    chomp $line;
    next if length $line > $Max_Length;   # filter long words; doesn't affect the sampling
    $count++;
    if( $count <= $Num_Words ) {
        # Fill the first N slots directly.
        $words[$count-1] = $line;
    }
    elsif( rand($count) <= $Num_Words ) {
        # Keep this line with probability N/count, overwriting a random slot.
        $words[rand($Num_Words)] = $line;
    }
}

print join ".", @words;

Finally, the index-and-seek algorithm has the advantage of distributing uniformly by word rather than by file position. The disadvantage is that it reads the whole file and its memory usage scales linearly with the number of words in the file. Might as well use Guffa's algorithm.
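
For completeness, here's a minimal sketch of the index-and-seek approach anyway (illustrative only; like the seek version above, it can pick the same word twice):

#!/usr/bin/perl -lw
# Index-and-seek sketch: one pass to record the byte offset of every line,
# then seek directly to N randomly chosen lines.

my $Words     = "/usr/share/dict/words";
my $Num_Words = 4;

open my $fh, "<", $Words or die $!;

my @offsets = (0);                     # offset of line 1
push @offsets, tell $fh while <$fh>;   # offset after each line = start of the next
pop @offsets;                          # the last entry is EOF, not a line start

my @words;
for (1..$Num_Words) {
    seek $fh, $offsets[rand @offsets], 0 or die $!;
    chomp(my $word = <$fh>);
    push @words, $word;
}
print join ".", @words;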

+1  A: 

This is about the first time I've seen Perl code ... it is incredibly unreadable ... ;) But that should not matter. Why don't you just repeat the cryptic line N times?

If I had to write this, I would just seek to a random position in the file, read to the end of that line (the next newline), and then read one full line up to the following newline. Add some error handling in case you seek into the last line, repeat all this N times, and you are done. I guess

srand;
rand($.) < 1 && ($line = $_) while <>;

is the Perl way to do such a single step. You could also read backwards from the initial position up to the previous newline or the beginning of the file and then read a line forward again. But this doesn't really matter.

UPDATE

I have to admit that seeking somewhere into the file will not generate a perfectly uniform distribution because of the different line lengths. Whether this fluctuation matters depends on the usage scenario, of course.

If you need a perfectly uniform distribution, you need to read the whole file at least once to get the number of lines. In this case the algorithm given by Guffa is probably the cleverest solution because it requires reading the file exactly once.

Daniel Brückner
This is not single pass; you are reading the entire file once for every line.
I cannot comment on Perl - I've never written a single line of it. But in C# I would just get the file length, generate a random position somewhere in the file, open the file (for byte-by-byte access, not line by line), seek to that position (nothing read so far), and finally read a few bytes until I get a complete line. So I read at most two lines (the one hit by the seek and the next complete one) from the file to get one line. If repeated N times, I will read at most 2N lines.
Daniel Brückner
A: 

I'd say:

  • Read the file once and count the number of \n characters. That's the number of lines - let's call it L
  • Store their positions in a small array in memory
  • Get two random line numbers below L, fetch their offsets, and you're done.

You'd use just a small array and read the whole file once, plus 2 lines afterwards.

Seb
A: 

You could do a two-pass algorithm. First get the position of each newline, pushing those positions into a vector. Then pick random items from that vector; call a chosen index i.

Read from the file at position v[i] to v[i+1] to get your line.

During the first pass you read the file with a small buffer so as not to read it all into RAM at once.

Brian R. Bondy
+1  A: 

Now my Perl is not what it used to be, but trusting the implicit claim in your reference (that the distribution of line numbers thus selected is uniform), it seems this should work:

srand;
(rand($.) < 1 && ($line1 = $_)) || (rand($.) < 1 && ($line2 = $_)) while <>;

Just like the original algorithm, this is one-pass and constant memory.

Edit: I just realized you need N, and not 2. You can repeat the OR-ed expression N times if you know N in advance.

Could you explain that? I have never written a single line of Perl code, but I would like to know what it does without learning Perl late at night.
Daniel Brückner
This would allow the very same line to be selected twice, which I assume is perfect for random passwords.
Arjan
@arjan: it is possible for the same line to be selected twice, which is ok if we want independent selection (e.g. if we know what one of the lines is, this shouldn't help us figure out what the other one is).
@daniel: the while <> part reads a line from the current file and puts it in $_. $. is the current line number and it gets incremented. The expression within the outer parentheses is repeated for every line in the file. Because of the ||, both expressions within are executed. A line is selected only if rand($.) < 1.
This is what I call operator overloading ... $. $_ $- $% $? $$ $$$ ... :D Thanks for the explanation. It is hard to believe at first glance that this generates a uniform distribution. But since Donald E. Knuth proved it, I trust it. Either way I think it is extremely inefficient to select a random line by scanning through the whole file until the random number generator says "Pick this line!". You will read half the file in the average case while you need only a single line.
Daniel Brückner
also you always read the entire file
@Daniel: Actually, the whole file is always read, and a random line might be selected multiple times. As the line number starts at 1, for the first line rand(1) will always be < 1, so $line1 gets an initial value. This may be overwritten by the second line, but that chance is about 0.5, as rand(2) yields a value in [0, 2), and so on...
Arjan
I understood the algorithm, but I had to calculate the probability myself because it is not that obvious that it yields a uniform distribution. But I calculated it and it is uniform. The probability of picking line i out of n lines is 1/i * ((n-1)! / (i-1)!) / (n!/i!) = 1/i * (n-1)!/n! * i!/(i-1)! = 1/i * 1/n * i = 1/n.
Daniel Brückner
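In other words, line i must be chosen when it is read (probability 1/i) and must then survive every later line; the product telescopes:

\[
P(\text{line } i \text{ is the result})
  = \frac{1}{i}\prod_{j=i+1}^{n}\left(1-\frac{1}{j}\right)
  = \frac{1}{i}\prod_{j=i+1}^{n}\frac{j-1}{j}
  = \frac{1}{i}\cdot\frac{i}{n}
  = \frac{1}{n}
\]
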
@Daniel, careful of proofs by Knuth: http://haacked.com/archive/2007/11/29/awesome-knuth-quote-on-code-proofs.aspx
SPWorley
Fantastic! I implemented it in Perl and it cut runtime in half.
Schwern
+10  A: 

The algorithm is not implemented in a very good and clear way in that example... Some pseudocode that explains it better would be:

cnt = 0
while not end of file {
   read line
   cnt = cnt + 1
   if random(1 to cnt) = 1 {
      result = line
   }
}

As you see, the idea is that you read each line in the file and calculate the probability that the line should be the one chosen. After reading the first line the probability is 100%, after reading the second line the probability is 50%, and so on.

This can be expanded to picking N items by keeping an array of size N instead of a single variable, and calculating the probability for a line to replace one of the current ones in the array:

var result[1..N]
cnt = 0
while not end of file {
   read line
   cnt = cnt + 1
   if cnt <= N {
      result[cnt] = line
   } else if random(1 to cnt) <= N {
      result[random(1 to N)] = line
   }
}

Edit:
Here's the code implemented in C#:

public static List<string> GetRandomLines(string path, int count) {
    List<string> result = new List<string>();
    Random rnd = new Random();
    int cnt = 0;
    string line;
    using (StreamReader reader = new StreamReader(path)) {
        while ((line = reader.ReadLine()) != null) {
            cnt++;
            int pos = rnd.Next(cnt);
            if (cnt <= count) {
                // Fill the list with the first N lines, inserting each at a random position.
                result.Insert(pos, line);
            } else {
                // pos < count happens with probability count/cnt: keep this line,
                // replacing a randomly chosen one (classic reservoir sampling).
                if (pos < count) {
                    result[pos] = line;
                }
            }
        }
    }
    return result;
}

I made a test by running the method 100000 times, picking 5 lines out of 20, and counting the occurrences of each line. This is the result:

25105
24966
24808
24966
25279
24824
25068
24901
25145
24895
25087
25272
24971
24775
25024
25180
25027
25000
24900
24807

As you see, the distribution is as good as you could ever want. :)

(I moved the creation of the Random object out of the method when running the test, to avoid seeding problems as the seed is taken from the system clock.)

Note:
You might want to scramble the order of the resulting array if you want the lines to be randomly ordered. As the first N lines are placed in order in the array, they are not randomly placed if they are still there at the end. For example, if N is three or larger and the third line is picked, it will always be at the third position in the array.

Edit 2:
I changed the code to use a List<string> instead of a string[]. That makes it easy to insert the first N items in a random order. I updated the test data from a new test run, so that you can see that the distribution is still good.

Guffa
Nice algorithm. I am not sure if the distribution is really uniform, but I trust Donald E. Knuth and you. +1
Daniel Brückner
So far this is the only answer out of 8 that is correct and does it in one pass. The method in general is called Reservoir Sampling.
SPWorley
This is not correct: result[k] can never be any of the first k-1 lines, which shows that the distribution is not uniform.
"As you see, the distribution is as good as you could ever want. :)" -- unless you want a uniform distribution.
@slavy: Read the note that I added ten minutes before you wrote the comment...
Guffa
"After reading the first line the probability is 100%" - doesn't that mean line 1 is always chosen? This appears to skew the selections towards the beginning of the file.
paxdiablo
and would you care to instruct us how to scramble?
@Pax: Not at all. The probability is based on the number of lines known so far, and at the first line the number of lines known is one. At the second line the number of lines known is two, and there's a 50% chance that the first line is replaced by the second as the chosen one. The test that I ran shows that the selection is not skewed at all.
Guffa
@slavy: That's not hard. Just randomly swap some of the items in the array. Just like lining up some playing cards and shuffling them by swapping them.
Guffa
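
Concretely, the standard way to do that swap-based scramble is a Fisher-Yates shuffle; in Perl (using the @words array from the question's scripts) a sketch would be:

use List::Util qw(shuffle);

@words = shuffle @words;                 # library version

# Or by hand (Fisher-Yates): walk backwards, swapping each element
# with a randomly chosen earlier (or same) position.
for (my $i = $#words; $i > 0; $i--) {
    my $j = int rand($i + 1);
    @words[$i, $j] = @words[$j, $i];
}
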
@Guffa, I get it now: you're actually loading up the first N lines regardless and THEN starting the random process. You will need the random swaps at the end since the degenerate case of choosing 5 from 5 will always put them in order, but good solution, +1.
paxdiablo
I think the distribution is a little bit skewed. I think all lines have to be inserted randomly, allowing line 2 to override line 1 (gut feeling). But this creates the problem that you cannot guarantee to pick N items - you may always put items in the first place and leave the rest uninitialized. Maybe I will calculate that tomorrow or look it up somewhere.
Daniel Brückner
@Daniel: Keeping the first N items doesn't screw up the distribution. When I have read N items, the probability that those should be kept at that stage is 100%. It's only when reading item N+1 that the probability drops below 100%.
Guffa
A: 

Quick and dirty bash

function randomLine {
  numlines=`wc -l "$1" | awk '{print $1}'`
  t=`date +%s`
  t=`expr $t + $RANDOM`
  a=`expr $t % $numlines + 1`
  RETURN=`head -n $a $1|tail -n 1`
  return 0
}

randomLine test.sh
echo $RETURN
Stefano Borini
What prevents this from selecting the same line twice by accident?
SPWorley
nothing. If it's not in the requirements, it won't be in the implementation ;)
Stefano Borini
A: 

Pick a random point in the file, look backwards for the previous EOL, search forward for the next EOL, and return the line in between.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LENGTH 256

/* Returns a malloc'd copy of a random (length-biased) line from words.txt. */
char *random_line(void) {
    FILE *file = fopen("words.txt", "rb");
    if (!file) return NULL;

    fseek(file, 0, SEEK_END);
    long fs = ftell(file);                /* file size */
    long ptr = rand() % fs;               /* random offset, 0 to fs-1 */

    long start = ptr - MAX_LINE_LENGTH;   /* window around the random point */
    if (start < 0) start = 0;
    long end = ptr + MAX_LINE_LENGTH;
    if (end > fs) end = fs;
    long bufsize = end - start;

    char *buf = malloc(bufsize + 1);
    fseek(file, start, SEEK_SET);
    fread(buf, 1, bufsize, file);
    fclose(file);

    char *startp = buf + (ptr - start);   /* the random point, inside the buffer */
    char *finp = startp + 1;

    while (startp > buf && *startp != '\n')         /* back up to the previous EOL */
        startp--;
    if (*startp == '\n')
        startp++;

    while (finp < buf + bufsize && *finp != '\n')   /* scan forward to the next EOL */
        finp++;
    *finp = '\0';

    char *line = strdup(startp);
    free(buf);
    return line;
}

There are probably still off-by-one errors and sloppy memory management in there (the caller has to free the returned line), and other horrors. If this actually compiles, you get a nickel. (Please send a self-addressed stamped envelope and $5 handling to receive your free nickel.)

But you should get the idea.

Longer lines statistically have a higher chance of being selected than shorter lines, since a random byte offset is more likely to fall inside them (a word ten times as long is roughly ten times as likely to be picked). But the run time of this is effectively constant regardless of file size. If you have a lot of words of mostly similar length, the statisticians won't be happy (they never are anyway), but in practice it will be close enough.

Will Hartung