views:

232

answers:

6

I have a file with contents

abc
def
high
lmn
...
...

There are more than 2 million lines in the files. I want to randomly sample lines from the files and output 50K lines. Any thoughts on how to approach this problem? I was thinking along the lines of Perl and its rand function (Or a handy shell command would be neat).

Related (Possibly Duplicate) Questions:

+9  A: 

Assuming you basically want to output about 2.5% of all lines, this would do:

print if 0.025 > rand while <$input>;
Sinan Ünür
pretty neat trick.
cloudhead
If the file size varies, you could calculate the percentage by counting the lines (cf. perlfaq5) and dividing that into the number of lines desired.
Michael Carman
This is a really good solution because it avoids the naive approaches to solving this problem, which involve jumping to random points in the file or (even worse!) sorting the input.
James Thompson
@James Thompson: while it might look like a good solution, it actually is not a correct solution for the question. There is no way to guarantee that it will return 50k rows.
depesz
Mine is a perfectly fine solution if you want to sample roughly 2.5% of all lines. Mine is not the right solution if the requirement is to output exactly 50,000 lines. I explicitly stated the approximate nature of this. For the latter problem, I believe I once read a single-pass algorithm but I cannot remember it now.
Sinan Ünür
The algorithm that Sinan is thinking of is called the Reservoir Sampling Algorithm. It's covered well on this site and elsewhere on the Internet.
James Thompson
A: 

You could do this:

cat file.txt | perl -ne 'print rand(), $_' | sort | head -50000 | cut -d' ' -f2-

This adds a random number to the start of each line, then sorts the file, then prints the first 50k lines of the file, then strips the random number back off again.

This does not preserve the order of lines in the original file, so if you need that then a different approach will be needed.

Greg Hewgill
+3  A: 

Shell way:

sort -R file | head -n 50000
depesz
Which sort is this? Mine (GNU coreutils 5.93) doesn't support -R.
Igor Krivokon
[sinan@kas ~]$ sort --versionsort (GNU coreutils) 7.4Copyright (C) 2009 Free Software Foundation, Inc.
Sinan Ünür
=> sort --versionsort (GNU coreutils) 6.10
depesz
+3  A: 

If you need to extract an exact number of lines:

use strict;
use warnings;

# Number of lines to pick and file to pick from
# Error checking omitted!
my ($pick, $file) = @ARGV;

open(my $fh, '<', $file)
    or die "Can't read file '$file' [$!]\n";

# count lines in file
my ($lines, $buffer);
while (sysread $fh, $buffer, 4096) {
    $lines += ($buffer =~ tr/\n//);
}

# limit number of lines to pick to number of lines in file
$pick = $lines if $pick > $lines;

# build list of N lines to pick, use a hash to prevent picking the
# same line multiple times
my %picked;
for (1 .. $pick) {
    my $n = int(rand($lines));
    redo if $picked{$n}++
}

# loop over file extracting selected lines
seek($fh, 0, 0);
while (<$fh>) {
    print if $picked{$.};
}
close $fh;
Michael Carman
Really nice approach. The only thing missing is check if $pick <= $lines - otherwise it will hang on the for() loop.
depesz
Good catch. I've updated it to correct that.
Michael Carman
+2  A: 

Perl way:

use CPAN. There is module File::RandomLine that does exactly what you need.

depesz
There is also http://search.cpan.org/perldoc?File::Random
Sinan Ünür
+1  A: 

From perlfaq5: "How do I select a random line from a file?"


Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.

Here's a reservoir-sampling algorithm from the Camel Book:

srand;
rand($.) < 1 && ($line = $_) while <>;

This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

You can use the File::Random module which provides a function for that algorithm:

use File::Random qw/random_line/;
my $line = random_line($filename);

Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.

brian d foy