ansaurus

Question

How can I get exactly n random lines from a file with Perl?

Answer 1

+1 A:

Possible solution:

scan one time to count the number of lines
decide the line number to pick randomly
scan again, pick the line

kcwu 2009-05-13 07:15:35

On stdin, scanning twice may be a problem.

Eyal 2009-05-13 07:21:59

Answer 2

A:

In pseudo-code:

use List::Util qw[shuffle];

# read and shuffle the whole file
@list = shuffle(<>);

# take the first 'n' from the list
splice(@list, ...);

This is the most trivial implementation, but you do have to read the whole file first, which will require that you have sufficient memory available.

Alnitak 2009-05-13 07:21:23

this will not work if the file is really huge

kcwu 2009-05-13 07:23:04

This is exactly the issue I had. The file I'm working on is 63MB and it takes forever.

Nathan Fellman 2009-05-13 07:25:07

file size 63MB ? How many MB ram do you have? I think this size shouldn't be a problem.

kcwu 2009-05-13 07:28:11

well, it worked, but it took forever

Nathan Fellman 2009-05-13 09:55:52

Determine the number of lines, make an array that is that long, shuffle that array, slice off the number of lines you desire, sort that list, then iterate over the file and output the line numbers that the list specifies. should be about as fast as reading the (whole) file.

jrockway 2009-05-13 10:19:20

Answer 3

+4 A:

Here's a nice one-pass algorithm that I just came up with, having O(N) time complexity and O(M) space complexity, for reading M lines from an N-line file.

Assume M <= N.

Let S be the set of chosen lines. Initialize S to the first M lines of the file. If the ordering of the final result is important, shuffle S now.
Read in the next line l. So far, we have read n = M + 1 total lines. The probability that we want to choose l as one of our final lines is therefore M/n.
Accept l with probability M/n; use a RNG to decide whether to accept or reject l.
If l has been accepted, randomly choose one of the lines in S and replace it with l.
Repeat steps 2-4 until the file has been exhausted of lines, incrementing n with each new line read.
Return the set S of chosen lines.

kquinn 2009-05-13 07:33:19

Nice, but I think you meant M <= N

Alnitak 2009-05-13 07:35:21

The flipped sign is the eternal enemy of mathematicians. Fixed, with a sigh.

kquinn 2009-05-13 07:39:36

also, isn't there a bias towards the original M lines unless N >> M ?

Alnitak 2009-05-13 07:57:40

Not as far as I can tell; think about choosing 5 lines from a 6-line file. One of the lines will be excluded; with probability 5/6 it'll be one of the first 5, and with probability 1/6 it'll be the last one; this is exactly what you want. The tricky bit of this algorithm is that `n`, and with it the rejection probability, changes as more lines are read in.

kquinn 2009-05-13 08:04:56

on a stream based file system (most modern stuff including windows and unix), finding a "line" is an expensive proposition. (lots of comparisons to find line terminators) My solution below solves this by using seek to seek randomly in the file then searching forward to capture the next full line.

rmeden 2009-06-22 06:51:22

Answer 4

+1 A:

@result = ();

$k = 0;
while(<>) {
    $k++;
    if (scalar @result < $n) {
     push @result, $_;
    } else {
     if (rand <= $n/$k) {
      $result[int rand $n] = $_;
     }
    }
}

print for @result;

kcwu 2009-05-13 07:39:17

your rand test is wrong - it should be $n / $k, not 1.0 / $k;

Alnitak 2009-05-13 10:21:23

thanks. corrected.

kcwu 2009-05-13 11:49:14

Answer 5

+1 A:

This takes a single command-line argument, which is the number of line you want, N. The first N lines are held, as you might not see any more. Thereafter, you randomly decide whether to take the next line. And if you do, you randomly decide which line in the current list-of-N to overwrite.

#!/usr/bin/perl
my $bufsize = shift;
my @list = ();

srand();
while (<>)
{
    push(@list, $_), next if (@list < $bufsize);
    $list[ rand(@list) ] = $_ if (rand($. / $bufsize) < 1);
}
print foreach @list;

Elbin 2009-05-13 07:47:08

Answer 6

A:

Here's some verbose Perl code that should work with large files.

The heart of this code is that it does not store the whole file in memory, but only stores offsets in the file.

Use tell to get the offsets. Then seek to the appropriate places to recover the lines.

Better specification of target file and number of lines to get is left as an exercise for those less lazy than I. Those problems have been well solved.

#!/usr/bin/perl

use strict;
use warnings;

use List::Util qw(shuffle);

my $GET_LINES = 10; 

my @line_starts;
open( my $fh, '<', 'big_text_file' )
    or die "Oh, fudge: $!\n";

do {
    push @line_starts, tell $fh
} while ( <$fh> );

my $count = @line_starts;
print "Got $count lines\n";

my @shuffled_starts = (shuffle @line_starts)[0..$GET_LINES-1];

for my $start ( @shuffled_starts ) {

    seek $fh, $start, 0
        or die "Unable to seek to line - $!\n";

    print scalar <$fh>;
}

daotoad 2009-05-14 00:33:42

Answer 7

+1 A:

There's no need to know the actual line number in the file. Simply seek to a random place and keep the next line. (The current line will most likely be a partial line.)

This approach should be very fast for large files, but it will not work for STDIN. Heck, nothing sort of caching the entire file in memory will work for STDIN. So, if you must have STDIN, I don't see how you can be fast/cheap for large files.

You could detect STDIN and switch to a cached approach, otherwise be fast.

#!perl
use strict;

my $file='file.txt';
my $count=shift || 10;
my $size=-s $file;

open(FILE,$file) || die "Can't open $file\n";

while ($count--) {
   seek(FILE,int(rand($size)),0);
   $_=readline(FILE);                         # ignore partial line
   redo unless defined ($_ = readline(FILE)); # catch EOF
   print $_;
}

rmeden 2009-05-14 05:40:42

Note that this approach will *not* pick lines uniformly from a file. The probability of a line being chosen will be weighted by the length of the preceding line; if all lines have the same length, this is no problem. But if you need a strictly uniform distribution of lines from a file with lines of varying length, you'll need a different approach.

kquinn 2009-06-22 07:35:18

grrrr you're right... oh well.. it *is* fast :) but useful if the record length is static.. or pretty close.

rmeden 2009-06-24 06:42:53

ansaurus

tags:

views:

answers:

How can I get exactly n random lines from a file with Perl?

related questions