Hi all,

I have a file with 10^7 lines, and I want to choose 1/100 of its lines at random. This is the awk code I have, but it slurps the entire file content beforehand. My PC's memory cannot handle such a slurp. Is there another approach to do it?

awk 'BEGIN { srand() }
!/^$/ { a[c++] = $0 }
END {
  for (i = 1; i <= c; i++) {
    num = int(rand() * c)
    if (a[num]) {
      print a[num]
      delete a[num]
      d++
    }
    if (d == c/100) break
  }
}' file
+3  A: 

You used awk, but I don't know if it's required. If it's not, here's a trivial way to do it with Perl (and without loading the entire file into memory):

cat your_file.txt | perl -n -e 'print if (rand() < .01)'
Bill
Theoretically that could print no values.
Steven Huwig
See my comments on cadrian's answer
David Zaslavsky
@Steven, read the original post. His file had 10^7 lines. The odds of him not getting output are .99^(10^7). Saying that isn't acceptable is ridiculous. If you're worried about that level of error, you shouldn't be using a computer. You're more likely to get incorrect output due to cosmic rays.
Bill
additionally, UUOC (http://partmaps.org/era/unix/award.html#cat): perl -ne 'print if (rand() < .01)' your_file.txt
glenn jackman
Okay, now *that's* legit. :-)
Bill
+1  A: 

Instead of waiting until the end to randomly pick your 1% of lines, do it every 100 lines inside the !/^$/ rule. That way, you only hold 100 lines at a time.
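
A minimal sketch of that idea (hypothetical code, not from the original answer), keeping at most 100 lines in memory and printing one line of each full block at random:

awk 'BEGIN { srand() }
!/^$/ {
    buf[n++] = $0
    if (n == 100) {                 # block full: print one of its lines at random
        print buf[int(rand() * n)]
        n = 0                       # start the next 100-line block
    }
}' file

Note that a trailing partial block is silently dropped in this sketch, and, as the comments below point out, the output distribution differs from true 1% sampling.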

Travis Jensen
This leads to a different distribution of random lines. E.g., you'll never have two from the same 100-line set.
derobert
It would also affect the order. You'd never have a line from the third set before a line from the second set. Important considerations, for sure.
Travis Jensen
+7  A: 

If you have that many lines, are you sure you want exactly 1%, or would a statistical estimate be enough?

In that second case, just randomize at 1% on each line...

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'
cadrian
The original can technically try to choose an element beyond the end of the array, since it's using delete to avoid duplicates. This one doesn't have that problem.
paxdiablo
Theoretically that could print no values.
Steven Huwig
@Steven: true, but statistically with such a huge file I don't think it will. I tried with "man awk", which is 2300 lines long, and I always got an extract of ~20 lines (between 17 and 22).
cadrian
@Steven: Theoretically yes, practically no. The probability of printing no lines, for the OP's 10 million line file (given by the Poisson distribution), is about 10^-43430. Equivalent to cracking a 144 *kilobyte* encryption key by guessing on the first try.
David Zaslavsky
Also, for the OP's case (10^7 lines) there's a 99.8% chance that the chosen number of elements will be within 1% of 100,000.
David Zaslavsky
Good point, but still, a correct _algorithm_ needs to be correct, not "extremely likely to be correct." IMHO anyway.
Steven Huwig
Note that this algorithm will have a bias towards longer lines. If you want to correct against this, you must weight your sampling (e.g. sample lines that are 2x longer 1/2 as often). Precisely how to do this is left as an exercise for the reader (hint: build an empirical distribution of line lengths).
dataspora
+2  A: 

You could do it in two passes:

  • Run through the file once, just to count how many lines there are
  • Randomly select the line numbers of the lines you want to print, storing them in a sorted list (or a set)
  • Run through the file once more and pick out the lines at the selected positions

Example in Python:

fn = '/usr/share/dict/words'

from random import randint
from sys import stdout

# First pass: count the lines
count = 0
with open(fn) as f:
    for line in f:
        count += 1

# Pick 1% of the line numbers, without duplicates
selected = set()
while len(selected) < count // 100:
    selected.add(randint(0, count - 1))

# Second pass: print the lines at the selected positions
index = 0
with open(fn) as f:
    for line in f:
        if index in selected:
            stdout.write(line)
        index += 1
sth
+6  A: 

I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.

This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)

Original thread: Submitted for your review -- random sampling in awk.

# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    t = n
    srand()

}

NR <= n {
    pool[NR] = $0
    places[NR] = NR
    next

}

NR > n {
    t++
    M = int(rand()*t) + 1
    if (M <= n) {
        READ_NEXT_RECORD(M)
    }

}

END {
    if (NR < n) {
        print "sample.awk: Not enough records for sample" \
            > "/dev/stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }
    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        print newpool[ordered[i]]

}

function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR  
}
Steven Huwig
This is "randomly choose n lines" instead of "randomly choose 1/100 lines", so a little precalculation is needed. Still cool, though.
ephemient
Yep, though I commented on the question that "randomly choose n lines" is better -- sample size isn't in direct proportion to population size.
Steven Huwig
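For the question's 1% target, that precalculation might look like this (a sketch assuming the gawk program above is saved as sample.awk):

# n = 1% of the line count, then sample with the script above
n=$(( $(wc -l < file) / 100 ))
gawk -v n="$n" -f sample.awk file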
+4  A: 

This should work on most GNU/Linux machines.

$ shuf -n $(( $(wc -l $file | cut -d' ' -f1) / 100)) $file

I'd be surprised if memory management were handled poorly by the GNU shuf command.

ashawley