Hi all,

I have a file with 10^7 lines, and I want to choose 1/100 of its lines at random. This is the awk code I have, but it slurps the entire file content beforehand. My PC's memory cannot handle such a slurp. Is there another approach to do it?

awk 'BEGIN { srand() }
!/^$/ { a[c++] = $0 }
END {
  for (i = 1; i <= c; i++) {
    num = int(rand() * c)
    if (a[num]) {
      print a[num]
      delete a[num]
      d++
    }
    if (d == c/100) break
  }
}' file
+3  A: 

You used awk, but I don't know if it's required. If it's not, here's a trivial way to do it with Perl (and without loading the entire file into memory):

cat your_file.txt | perl -n -e 'print if (rand() < .01)'
Bill
Theoretically that could print no values.
Steven Huwig
See my comments on cadrian's answer
David Zaslavsky
@Steven, read the original post. His file had 10^7 lines. The odds of him not getting output are .99^(10^7). Saying that isn't acceptable is ridiculous. If you're worried about that level of error, you shouldn't be using a computer. You're more likely to get incorrect output due to cosmic rays.
Bill
additionally, UUOC (http://partmaps.org/era/unix/award.html#cat): perl -ne 'print if (rand() < .01)' your_file.txt
glenn jackman
Okay, now *that's* legit. :-)
Bill
+1  A: 

Instead of waiting until the end to randomly pick your 1% of lines, do it every 100 lines inside the !/^$/ rule. That way, you only hold 100 lines at a time.
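
A minimal sketch of that idea (hypothetical code, not from the original answer), keeping at most 100 lines in memory and printing one line of each full block at random:

awk 'BEGIN { srand() }
!/^$/ {
    buf[n++] = $0
    if (n == 100) {                 # block full: print one of its lines at random
        print buf[int(rand() * n)]
        n = 0                       # start the next 100-line block
    }
}' file

Note that a trailing partial block is silently dropped in this sketch, and, as the comments below point out, the output distribution differs from true 1% sampling.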

Travis Jensen
This leads to a different distribution of random lines. E.g., you'll never have two from the same 100-line set.
derobert
It would also affect the order. You'd never have a line from the third set before a line from the second set. Important considerations, for sure.
Travis Jensen
+7  A: 

If you have that many lines, are you sure you want exactly 1%, or would a statistical estimate be enough?

In that second case, just randomize at 1% on each line...

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'
cadrian
The original can technically try to choose an element beyond the end of the array, since it's using delete to avoid duplicates. This one doesn't have that problem.
paxdiablo
Theoretically that could print no values.
Steven Huwig
@Steven: true, but statistically with such a huge file I don't think it will. I tried with "man awk", which is 2300 lines long, and I always got an extract of ~20 lines (between 17 and 22).
cadrian
@Steven: Theoretically yes, practically no. The probability of printing no lines, for the OP's 10 million line file (given by the Poisson distribution), is about 10^-43430. Equivalent to cracking a 144 *kilobyte* encryption key by guessing on the first try.
David Zaslavsky
Also, for the OP's case (10^7 lines) there's a 99.8% chance that the chosen number of elements will be within 1% of 100,000.
David Zaslavsky
Good point, but still, a correct _algorithm_ needs to be correct, not "extremely likely to be correct." IMHO anyway.
Steven Huwig
Note that this algorithm will have a bias towards longer lines. If you want to correct against this, you must weight your sampling (e.g. sample lines that are 2x longer 1/2 as often). Precisely how to do this is left as an exercise for the reader (hint: build an empirical distribution of line lengths).
dataspora
+2  A: 

You could do it in two passes:

  • Run through the file once, just to count how many lines there are
  • Randomly select the line numbers of the lines you want to print, storing them in a sorted list (or a set)
  • Run through the file once more and pick out the lines at the selected positions

Example in Python:

fn = '/usr/share/dict/words'

from random import randint
from sys import stdout

# First pass: count the lines
count = 0
with open(fn) as f:
    for line in f:
        count += 1

# Pick 1% of the line numbers, without duplicates
selected = set()
while len(selected) < count // 100:
    selected.add(randint(0, count - 1))

# Second pass: print the lines at the selected positions
index = 0
with open(fn) as f:
    for line in f:
        if index in selected:
            stdout.write(line)
        index += 1
sth
+6  A: 

I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.

This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)

Original thread: Submitted for your review -- random sampling in awk.

# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    t = n
    srand()

}

NR <= n {
    pool[NR] = $0
    places[NR] = NR
    next

}

NR > n {
    t++
    M = int(rand()*t) + 1
    if (M <= n) {
        READ_NEXT_RECORD(M)
    }

}

END {
    if (NR < n) {
        print "sample.awk: Not enough records for sample" \
            > "/dev/stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }
    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        print newpool[ordered[i]]

}

function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR  
}
Steven Huwig
This is "randomly choose n lines" instead of "randomly choose 1/100 lines", so a little precalculation is needed. Still cool, though.
ephemient
Yep, though I commented on the question that "randomly choose n lines" is better -- sample size isn't in direct proportion to population size.
Steven Huwig
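For the question's 1% target, that precalculation might look like this (a sketch assuming the gawk program above is saved as sample.awk):

# n = 1% of the line count, then sample with the script above
n=$(( $(wc -l < file) / 100 ))
gawk -v n="$n" -f sample.awk file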
+4  A: 

This should work on most GNU/Linux machines.

$ shuf -n $(( $(wc -l $file | cut -d' ' -f1) / 100)) $file

I'd be surprised if memory management were handled poorly by the GNU shuf command.

ashawley