Following up on this question, I need to get exactly n lines at random out of a file (or stdin). This would be similar to head or tail, except I want some from the middle.

Now, other than looping over the file with the solutions to the linked question, what's the best way to get exactly n lines in one run?

For reference, I tried this:

#!/usr/bin/perl -w
use strict;
my $ratio = shift;
print $ratio, "\n";
while (<>) {
    # keep each line with probability 1/$ratio
    print if ((int rand $ratio) == 1);
}

where $ratio means I want roughly 1 in $ratio lines. For instance, if I want 1 in 10 lines:

random_select 10 a.list

However, this doesn't give me an exact number of lines:

aaa> foreach i ( 0 1 2 3 4 5 6 7 8 9 )
foreach? random_select 10 a.list | wc -l
foreach? end
4739
4865
4739
4889
4934
4809
4712
4842
4814
4817

The other thought I had was slurping the input file and then choosing n at random from the array, but that's a problem if I have a really big file.

Any ideas?

Edit: This is an exact duplicate of this question.

+1  A: 

Possible solution (a sketch follows the steps):

  1. scan once to count the number of lines
  2. randomly decide which line numbers to pick
  3. scan again, picking those lines
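
In Perl, a minimal sketch of that two-pass approach might look like this (using hypothetical $file and $n arguments, and assuming a seekable file rather than stdin, with $n no larger than the line count):

#!/usr/bin/perl
use strict;
use warnings;

my ( $file, $n ) = @ARGV;

open my $fh, '<', $file or die "Can't open $file: $!\n";

# pass 1: count the lines
my $total = 0;
$total++ while <$fh>;

# pick $n distinct line numbers at random
my %want;
$want{ int rand $total } = 1 while keys %want < $n;

# pass 2: rewind and print the chosen lines
seek $fh, 0, 0 or die "Can't rewind: $!\n";
my $line = 0;
while (<$fh>) {
    print if $want{ $line++ };
}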
kcwu
On stdin, scanning twice may be a problem.
Eyal
A: 

In pseudo-code:

use List::Util qw[shuffle];

my $n = shift;

# read and shuffle the whole file
my @list = shuffle(<>);

# keep only the first $n lines
splice(@list, $n);

print @list;

This is the most trivial implementation, but you do have to read the whole file first, which will require that you have sufficient memory available.

Alnitak
this will not work if the file is really huge
kcwu
This is exactly the issue I had. The file I'm working on is 63MB and it takes forever.
Nathan Fellman
The file is 63MB? How much RAM do you have? I think that size shouldn't be a problem.
kcwu
well, it worked, but it took forever
Nathan Fellman
Determine the number of lines, make an array that long, shuffle it, slice off the number of lines you desire, sort that list, then iterate over the file and output the lines at the positions the list specifies. Should be about as fast as reading the whole file.
jrockway
+4  A: 

Here's a nice one-pass algorithm that I just came up with, having O(N) time complexity and O(M) space complexity, for reading M lines from an N-line file (a Perl sketch follows the steps).

Assume M <= N.

  1. Let S be the set of chosen lines. Initialize S to the first M lines of the file. If the ordering of the final result is important, shuffle S now.
  2. Read in the next line l. So far, we have read n = M + 1 total lines. The probability that we want to choose l as one of our final lines is therefore M/n.
  3. Accept l with probability M/n; use a RNG to decide whether to accept or reject l.
  4. If l has been accepted, randomly choose one of the lines in S and replace it with l.
  5. Repeat steps 2-4 until the file has been exhausted of lines, incrementing n with each new line read.
  6. Return the set S of chosen lines.
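
A direct Perl transcription of these steps (a sketch; kcwu's and Elbin's answers below are implementations of the same idea):

#!/usr/bin/perl
use strict;
use warnings;

my $m = shift;    # M, the number of lines to keep
my @S;            # the set of chosen lines
my $n = 0;        # lines read so far

while (<>) {
    $n++;
    if (@S < $m) {
        push @S, $_;                  # step 1: start with the first M lines
    } elsif (rand() < $m / $n) {      # steps 2-3: accept with probability M/n
        $S[int rand $m] = $_;         # step 4: evict a random kept line
    }
}

print @S;                             # step 6: the chosen set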
kquinn
Nice, but I think you meant M <= N
Alnitak
The flipped sign is the eternal enemy of mathematicians. Fixed, with a sigh.
kquinn
also, isn't there a bias towards the original M lines unless N >> M ?
Alnitak
Not as far as I can tell; think about choosing 5 lines from a 6-line file. One of the lines will be excluded; with probability 5/6 it'll be one of the first 5, and with probability 1/6 it'll be the last one; this is exactly what you want. The tricky bit of this algorithm is that `n`, and with it the rejection probability, changes as more lines are read in.
kquinn
On a stream-based file system (most modern systems, including Windows and Unix), finding a "line" is an expensive proposition (lots of comparisons to find line terminators). My solution below avoids this by seeking to a random spot in the file and then scanning forward to capture the next full line.
rmeden
+1  A: 
my $n = shift;    # number of lines to keep
my @result = ();

my $k = 0;
while (<>) {
    $k++;
    if (@result < $n) {
        push @result, $_;             # keep the first $n lines outright
    } elsif (rand() < $n / $k) {      # then replace with probability $n/$k
        $result[int rand $n] = $_;
    }
}

print for @result;
kcwu
your rand test is wrong - it should be $n / $k, not 1.0 / $k;
Alnitak
thanks. corrected.
kcwu
+1  A: 

This takes a single command-line argument, which is the number of lines you want, N. The first N lines are held, as you might not see any more. Thereafter, you randomly decide whether to take the next line; if you do, you randomly decide which line in the current list-of-N to overwrite.

#!/usr/bin/perl
my $bufsize = shift;
my @list = ();

srand();
while (<>)
{
    # hold the first N lines unconditionally
    push(@list, $_), next if (@list < $bufsize);
    # keep line $. with probability $bufsize/$., overwriting a random slot
    $list[ rand(@list) ] = $_ if (rand($. / $bufsize) < 1);
}
print foreach @list;
Elbin
A: 

Here's some verbose Perl code that should work with large files.

The heart of this code is that it does not store the whole file in memory, but only stores offsets in the file.

Use tell to get the offsets. Then seek to the appropriate places to recover the lines.

Better specification of target file and number of lines to get is left as an exercise for those less lazy than I. Those problems have been well solved.

#!/usr/bin/perl

use strict;
use warnings;

use List::Util qw(shuffle);

my $GET_LINES = 10; 

my @line_starts;
open( my $fh, '<', 'big_text_file' )
    or die "Oh, fudge: $!\n";

do {
    push @line_starts, tell $fh
} while ( <$fh> );
pop @line_starts;    # the final tell is the EOF offset, not a line start

my $count = @line_starts;
print "Got $count lines\n";

my @shuffled_starts = (shuffle @line_starts)[0..$GET_LINES-1];

for my $start ( @shuffled_starts ) {

    seek $fh, $start, 0
        or die "Unable to seek to line - $!\n";

    print scalar <$fh>;
}
daotoad
+1  A: 

There's no need to know the actual line number in the file. Simply seek to a random place and keep the next line. (The current line will most likely be a partial line.)

This approach should be very fast for large files, but it will not work for STDIN. Heck, nothing short of caching the entire file in memory will work for STDIN. So, if you must have STDIN, I don't see how you can be fast/cheap for large files.

You could detect a non-seekable input such as STDIN and switch to a cached approach for it, staying fast otherwise (a detection sketch follows the code).

#!perl
use strict;

my $file='file.txt';
my $count=shift || 10;
my $size=-s $file;

open(FILE,$file) || die "Can't open $file\n";

while ($count--) {
   seek(FILE,int(rand($size)),0);
   $_=readline(FILE);                         # ignore partial line
   redo unless defined ($_ = readline(FILE)); # catch EOF
   print $_;
}
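
One way to do that detection (a sketch, untested): a no-op seek fails on pipes and terminals but succeeds on regular files, so it can choose between the two strategies. The seekable() helper here is hypothetical.

# Hypothetical helper: returns true if the handle supports seek.
sub seekable {
    my $fh = shift;
    return seek( $fh, 0, 1 );    # seek 0 bytes from the current position
}

if ( seekable(\*STDIN) ) {
    # random-seek approach, as above
} else {
    # fall back to one-pass reservoir sampling (see kquinn's answer)
}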
rmeden
Note that this approach will *not* pick lines uniformly from a file. The probability of a line being chosen will be weighted by the length of the preceding line; if all lines have the same length, this is no problem. But if you need a strictly uniform distribution of lines from a file with lines of varying length, you'll need a different approach.
kquinn
grrrr you're right... oh well.. it *is* fast :) but useful if the record length is static.. or pretty close.
rmeden