I'm familiar with the algorithm for reading a single random line from a file without reading the whole file into memory. I wonder if this technique can be extended to N random lines?

The use case is for a password generator which concatenates N random words pulled out of a dictionary file, one word per line (like /usr/share/dict/words). You might come up with angela.ham.lewis.pathos. Right now it reads the whole dictionary file into an array and picks N random elements from that array. I would like to eliminate the array, or any other in-memory storage of the file, and only read the file once.

(No, this isn't a practical optimization exercise. I'm interested in the algorithm.)
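
For reference, the array-based approach being replaced boils down to something like this sketch (illustrative, not the actual generator; it reuses the same file path and word count as the scripts below):

#!/usr/bin/perl -lw
# Naive version: slurp the whole dictionary into memory, then pick N random entries.

my $Words     = "/usr/share/dict/words";
my $Num_Words = 4;

open my $fh, "<", $Words or die $!;
chomp(my @dict = <$fh>);                        # the whole file in an array

print join ".", map { $dict[rand @dict] } 1..$Num_Words;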

Update: Thank you all for your answers.

Answers fell into three categories: modifications of the full read algorithm, random seek, or index the lines and seek to them randomly.

The random seek is much faster and constant with respect to file size, but it distributes by position in the file rather than uniformly by word, so longer words are more likely to be picked. It also allows duplicates (which can be avoided, but that makes the worst-case running time unbounded). Here's my reimplementation of my password generator using that algorithm. I realize that by reading forward from the seek point, rather than backwards, it has an off-by-one error should the seek fall in the last line. Correcting it is left as an exercise for the editor.

#!/usr/bin/perl -lw

my $Words       = "/usr/share/dict/words";
my $Max_Length  = 8;
my $Num_Words   = 4;

my $size = -s $Words;

my @words;
open my $fh, "<", $Words or die $!;

for(1..$Num_Words) {
    seek $fh, int rand $size, 0 or die $!;   # jump to a random byte offset
    <$fh>;                                   # throw away the (probably partial) line we landed in
    my $word = <$fh>;                        # the next full line is our candidate
    chomp $word;
    redo if length $word > $Max_Length;      # too long: reseek and try again
    push @words, $word;
}
print join ".", @words;

And then there's Guffa's answer, which was what I was looking for: an extension of the original algorithm. It's slower, since it has to read the whole file, but it distributes uniformly by word, allows filtering without changing the efficiency of the algorithm, and (I think) produces no duplicates.

#!/usr/bin/perl -lw

my $Words       = "/usr/share/dict/words";
my $Max_Length  = 8;
my $Num_Words   = 4;

my @words;
open my $fh, "<", $Words or die $!;
my $count = 0;
while(my $line = <$fh>) {
    chomp $line;
    next if length $line > $Max_Length;   # filter long words; doesn't affect the sampling
    $count++;
    if( $count <= $Num_Words ) {
        # Fill the first N slots directly.
        $words[$count-1] = $line;
    }
    elsif( rand($count) <= $Num_Words ) {
        # Keep this line with probability N/count, overwriting a random slot.
        $words[rand($Num_Words)] = $line;
    }
}

print join ".", @words;

Finally, the index-and-seek algorithm has the advantage of distributing uniformly by word rather than by file position. The disadvantage is that it reads the whole file and its memory usage scales linearly with the number of words in the file. Might as well use Guffa's algorithm.
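
For completeness, here's a minimal sketch of the index-and-seek approach anyway (illustrative only; like the seek version above, it can pick the same word twice):

#!/usr/bin/perl -lw
# Index-and-seek sketch: one pass to record the byte offset of every line,
# then seek directly to N randomly chosen lines.

my $Words     = "/usr/share/dict/words";
my $Num_Words = 4;

open my $fh, "<", $Words or die $!;

my @offsets = (0);                     # offset of line 1
push @offsets, tell $fh while <$fh>;   # offset after each line = start of the next
pop @offsets;                          # the last entry is EOF, not a line start

my @words;
for (1..$Num_Words) {
    seek $fh, $offsets[rand @offsets], 0 or die $!;
    chomp(my $word = <$fh>);
    push @words, $word;
}
print join ".", @words;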

+1  A: 

This is about the first time I've seen Perl code ... it is incredibly unreadable ... ;) But that should not matter. Why don't you just repeat the cryptic line N times?

If I had to write this, I would just seek to a random position in the file, read to the end of that line (the next newline), and then read one full line up to the following newline. Add some error handling in case you seek into the last line, repeat all this N times, and you are done. I guess

srand;
rand($.) < 1 && ($line = $_) while <>;

is the Perl way to do such a single step. You could also read backwards from the initial position up to the previous newline or the beginning of the file and then read a line forward again. But this doesn't really matter.

UPDATE

I have to admit that seeking somewhere into the file will not generate a perfectly uniform distribution because of the different line lengths. Whether this fluctuation matters depends on the usage scenario, of course.

If you need a perfectly uniform distribution, you need to read the whole file at least once to get the number of lines. In this case the algorithm given by Guffa is probably the cleverest solution because it requires reading the file exactly once.

Daniel Brückner
This is not single pass; you are reading the entire file once for every line.
I cannot comment on Perl - I've never written a single line of it. But in C# I would just get the file length, generate a random position somewhere in the file, open the file (for byte-by-byte access, not line by line), seek to that position (nothing read so far), and finally read a few bytes until I get a complete line. So I read at most two lines (the one hit by the seek and the next complete one) from the file to get one line. If repeated N times, I will read at most 2N lines.
Daniel Brückner
A: 

I'd say:

  • Read the file once and count the number of \n characters. That's the number of lines - let's call it L
  • Store their positions in a small array in memory
  • Get two random line numbers below L, fetch their offsets, and you're done.

You'd use just a small array and read the whole file once, plus 2 lines afterwards.

Seb
A: 

You could do a two-pass algorithm. First get the position of each newline, pushing those positions into a vector. Then pick random items from that vector; call a chosen index i.

Read from the file at position v[i] to v[i+1] to get your line.

During the first pass you read the file with a small buffer so as not to read it all into RAM at once.

Brian R. Bondy
+1  A: 

Now my Perl is not what it used to be, but trusting the implicit claim in your reference (that the distribution of line numbers thus selected is uniform), it seems this should work:

srand;
(rand($.) < 1 && ($line1 = $_)) || (rand($.) < 1 && ($line2 = $_)) while <>;

Just like the original algorithm, this is one-pass and constant memory.

Edit: I just realized you need N, and not 2. You can repeat the OR-ed expression N times if you know N in advance.

Could you explain that? I have never written a single line of Perl code, but I would like to know what it does without learning Perl late at night.
Daniel Brückner
This would allow the very same line to be selected twice, which I assume is perfect for random passwords.
Arjan
@arjan: it is possible for the same line to be selected twice, which is ok if we want independent selection (e.g. if we know what one of the lines is, this shouldn't help us figure out what the other one is).
@daniel: the while <> part reads a line from the current file and puts it in $_. $. is the current line number and it gets incremented. The expression within the outer parentheses is repeated for every line in the file. Because of the ||, both expressions within are executed. A line is selected only if rand($.) < 1.
This is what I call operator overloading ... $. $_ $- $% $? $$ $$$ ... :D Thanks for the explanation. It is hard to believe at first glance that this generates a uniform distribution. But since Donald E. Knuth proved it, I trust it. Either way I think it is extremely inefficient to select a random line by scanning through the whole file until the random number generator says "Pick this line!". You will read half the file in the average case while you need only a single line.
Daniel Brückner
also you always read the entire file
@Daniel: Actually, the whole file is always read, and a random line might be selected multiple times. As the line number starts at 1, for the first line rand(1) will always be < 1, so $line1 gets an initial value. This may be overwritten by the second line, but that chance is about 0.5, as rand(2) yields a value in [0, 2), and so on...
Arjan
I understood the algorithm, but I had to calculate the probability myself because it is not that obvious that it yields a uniform distribution. But I calculated it and it is uniform. The probability of picking line i out of n lines is 1/i * ((n-1)! / (i-1)!) / (n!/i!) = 1/i * (n-1)!/n! * i!/(i-1)! = 1/i * 1/n * i = 1/n.
Daniel Brückner
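In other words, line i must be chosen when it is read (probability 1/i) and must then survive every later line; the product telescopes:

\[
P(\text{line } i \text{ is the result})
  = \frac{1}{i}\prod_{j=i+1}^{n}\left(1-\frac{1}{j}\right)
  = \frac{1}{i}\prod_{j=i+1}^{n}\frac{j-1}{j}
  = \frac{1}{i}\cdot\frac{i}{n}
  = \frac{1}{n}
\]
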
@Daniel, careful of proofs by Knuth: http://haacked.com/archive/2007/11/29/awesome-knuth-quote-on-code-proofs.aspx
SPWorley
Fantastic! I implemented it in Perl and it cut runtime in half.
Schwern
+10  A: 

The algorithm is not implemented in a very good and clear way in that example... Some pseudocode that explains it better would be:

cnt = 0
while not end of file {
   read line
   cnt = cnt + 1
   if random(1 to cnt) = 1 {
      result = line
   }
}

As you see, the idea is that you read each line in the file and calculate the probability that the line should be the one chosen. After reading the first line the probability is 100%, after reading the second line the probability is 50%, and so on.

This can be expanded to picking N items by keeping an array of size N instead of a single variable, and calculating the probability for a line to replace one of the current ones in the array:

var result[1..N]
cnt = 0
while not end of file {
   read line
   cnt = cnt + 1
   if cnt <= N {
      result[cnt] = line
   } else if random(1 to cnt) <= N {
      result[random(1 to N)] = line
   }
}

Edit:
Here's the code implemented in C#:

public static List<string> GetRandomLines(string path, int count) {
    List<string> result = new List<string>();
    Random rnd = new Random();
    int cnt = 0;
    string line;
    using (StreamReader reader = new StreamReader(path)) {
        while ((line = reader.ReadLine()) != null) {
            cnt++;
            int pos = rnd.Next(cnt);
            if (cnt <= count) {
                // Fill the list with the first N lines, inserting each at a random position.
                result.Insert(pos, line);
            } else {
                // pos < count happens with probability count/cnt: keep this line,
                // replacing a randomly chosen one (classic reservoir sampling).
                if (pos < count) {
                    result[pos] = line;
                }
            }
        }
    }
    return result;
}

I made a test by running the method 100000 times, picking 5 lines out of 20, and counting the occurrences of each line. This is the result:

25105
24966
24808
24966
25279
24824
25068
24901
25145
24895
25087
25272
24971
24775
25024
25180
25027
25000
24900
24807

As you see, the distribution is as good as you could ever want. :)

(I moved the creation of the Random object out of the method when running the test, to avoid seeding problems as the seed is taken from the system clock.)

Note:
You might want to scramble the order of the resulting array if you want the lines to be randomly ordered. As the first N lines are placed in order in the array, they are not randomly placed if they are still there at the end. For example, if N is three or larger and the third line is picked, it will always be at the third position in the array.

Edit 2:
I changed the code to use a List<string> instead of a string[]. That makes it easy to insert the first N items in a random order. I updated the test data from a new test run, so that you can see that the distribution is still good.

Guffa
Nice algorithm. I am not sure if the distribution is really uniform, but I trust Donald E. Knuth and you. +1
Daniel Brückner
So far this is the only answer out of 8 that is correct and does it in one pass. The method in general is called Reservoir Sampling.
SPWorley
This is not correct: result[k] can never be any of the first k-1 lines, which shows that the distribution is not uniform.
"As you see, the distribution is as good as you could ever want. :)" -- unless you want a uniform distribution.
@slavy: Read the note that I added ten minutes before you wrote the comment...
Guffa
"After reading the first line the probability is 100%" - doesn't that mean line 1 is always chosen? This appears to skew the selections towards the beginning of the file.
paxdiablo
and would you care to instruct us how to scramble?
@Pax: Not at all. The probability is based on the number of lines known so far, and at the first line the number of lines known is one. At the second line the number of lines known is two, and there's a 50% chance that the first line is replaced by the second as the chosen one. The test that I ran shows that the selection is not skewed at all.
Guffa
@slavy: That's not hard. Just randomly swap some of the items in the array. Just like lining up some playing cards and shuffling them by swapping them.
Guffa
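
Concretely, the standard way to do that swap-based scramble is a Fisher-Yates shuffle; in Perl (using the @words array from the question's scripts) a sketch would be:

use List::Util qw(shuffle);

@words = shuffle @words;                 # library version

# Or by hand (Fisher-Yates): walk backwards, swapping each element
# with a randomly chosen earlier (or same) position.
for (my $i = $#words; $i > 0; $i--) {
    my $j = int rand($i + 1);
    @words[$i, $j] = @words[$j, $i];
}
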
@Guffa, I get it now: you're actually loading up the first N lines regardless and THEN starting the random process. You will need the random swaps at the end since the degenerate case of choosing 5 from 5 will always put them in order, but good solution, +1.
paxdiablo
I think the distribution is a little bit skewed. I think all lines have to be inserted randomly, allowing line 2 to override line 1 (gut feeling). But this creates the problem that you cannot guarantee to pick N items - you may always put items in the first place and leave the rest uninitialized. Maybe I will calculate that tomorrow or look it up somewhere.
Daniel Brückner
@Daniel: Keeping the first N items doesn't screw up the distribution. When I have read N items, the probability that those should be kept at that stage is 100%. It's only when reading item N+1 that the probability drops below 100%.
Guffa
A: 

Quick and dirty bash

function randomLine {
  numlines=`wc -l "$1" | awk '{print $1}'`
  t=`date +%s`
  t=`expr $t + $RANDOM`
  a=`expr $t % $numlines + 1`
  RETURN=`head -n $a $1|tail -n 1`
  return 0
}

randomLine test.sh
echo $RETURN
Stefano Borini
What prevents this from selecting the same line twice by accident?
SPWorley
nothing. If it's not in the requirements, it won't be in the implementation ;)
Stefano Borini
A: 

Pick a random point in the file, look backwards for the previous EOL, search forward for the next EOL, and return the line in between.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LENGTH 256

/* Returns a malloc'd copy of a random (length-biased) line from words.txt. */
char *random_line(void) {
    FILE *file = fopen("words.txt", "rb");
    if (!file) return NULL;

    fseek(file, 0, SEEK_END);
    long fs = ftell(file);                /* file size */
    long ptr = rand() % fs;               /* random offset, 0 to fs-1 */

    long start = ptr - MAX_LINE_LENGTH;   /* window around the random point */
    if (start < 0) start = 0;
    long end = ptr + MAX_LINE_LENGTH;
    if (end > fs) end = fs;
    long bufsize = end - start;

    char *buf = malloc(bufsize + 1);
    fseek(file, start, SEEK_SET);
    fread(buf, 1, bufsize, file);
    fclose(file);

    char *startp = buf + (ptr - start);   /* the random point, inside the buffer */
    char *finp = startp + 1;

    while (startp > buf && *startp != '\n')         /* back up to the previous EOL */
        startp--;
    if (*startp == '\n')
        startp++;

    while (finp < buf + bufsize && *finp != '\n')   /* scan forward to the next EOL */
        finp++;
    *finp = '\0';

    char *line = strdup(startp);
    free(buf);
    return line;
}

There are probably still off-by-one errors and sloppy memory management in there (the caller has to free the returned line), and other horrors. If this actually compiles, you get a nickel. (Please send a self-addressed stamped envelope and $5 handling to receive your free nickel.)

But you should get the idea.

Longer lines statistically have a higher chance of being selected than shorter lines, since a random byte offset is more likely to fall inside them (a word ten times as long is roughly ten times as likely to be picked). But the run time of this is effectively constant regardless of file size. If you have a lot of words of mostly similar length, the statisticians won't be happy (they never are anyway), but in practice it will be close enough.

Will Hartung