views:

172

answers:

6

I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files. How can i select random files and random entry(pair of line, excluding header) ?

+6  A: 

You may find perlfaq5 useful.

eugene y
Its worth noting that its a single-pass algorithm which only keeps two lines in memory at any time.
Schwern
I would have posted this as CW, since you're (presumably) not the author of this faq item.
Ether
+1 Clever algorithm. I learned something today. Nit: I guess nowadays, we should drop the preceding "srand".
tsee
+2  A: 
sed "1,5d" < FILENAME | sort -R | head -2
zed_0xff
Interesting. but where is the "-R" switch (presumably for "Randomize" supported? I just checked on Linux (RHEL5.4, coreutils 5.97...), Mac OS X (10.5.8), and FreeBSD (6.4).
Jim Dennis
I have coreutils-8.4
zed_0xff
+1  A: 

Answer is in Python. Assuming you can read a whole file into memory.

#using python 2.6
import sys
import os
import itertools
import random

def main(directory, num_files=5, num_entries=5):
    file_paths = os.listdir(directory)

    # get a random sampling of the available paths
    chosen_paths = random.sample(file_paths, num_files)

    for path in chosen_paths:
        chosen_entries = get_random_entries(path, num_entries)
        for entry in chosen_entries:
            # do something with your chosen entries
            print entry

def get_random_entries(file_path, num_entries):
    with open(file_path, 'r') as file:
        # read the lines and slice off the headers
        lines = file.readlines()[5:]

        # group the lines into pairs (i.e. entries)
        entries = list(itertools.izip_longest(*[iter(lines)]*2))

        # return a random sampling of entries
        return random.sample(entries, num_entries)

if __name__ == '__main__':
    #use optparse here to do fancy things with the command line args
    main(sys.argv[1:])

Links: itertools, random, optparse

tgray
My python version on the server where the log is . 2.4.3
AlgoMan
@EnTerr It's exactly as many lines as your answer if you take out the whitespace and comments; both of which are lacking in yours.
Jon-Eric
ugh, loading all lines in memory is silly - say there a million lines in file of 50MB?
Nas Banov
@EnTerr, I wouldn't worry about 50MB or a million lines. The problems might begin when we get up near 500MB - 1GB. because at any given point there could be up to two copies of the files content in memory (plus a little overhead from the other variables).
tgray
A: 

Another Python option; reading the contents of all files into memory:

import random
import fileinput

def openhook(filename, mode):
    f = open(filename, mode)
    headers = [f.readline() for _ in range(5)]
    return f

num_entries = 3
lines = list(fileinput.input(openhook=openhook))
print random.sample(lines, num_entries)
Matt Anderson
file is huge, will be silly to allocate multiple MB for that
Nas Banov
My python version on the server where the log is . 2.4.3
AlgoMan
+5  A: 

If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.

Here's an intuitive explanation for the algorithm.

Process the file line by line.
pick = line, with probability 1/N, where N = line number

In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.

For an intuitive proof, imagine a file with 3 lines:

        1            Pick line 1.
       / \
     .5  .5
     /     \
    2       1        Switch to line 2?
   / \     / \
 .67 .33 .33 .67
 /     \ /     \
2       3       1    Switch to line 3?

The probability for each outcome:

Line 1: .5 * .67     = 1/3
Line 2: .5 * .67     = 1/3
Line 3: .5 * .33 * 2 = 1/3

From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).

Here's the Perl FAQ answer, adapted to your problem.

use strict;
use warnings;

# Ignore 5 lines.
<> for 1 .. 5;

# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
    my @lines;
    $lines[$_] = <> for 0 .. 1;

    $n ++;
    @picks = @lines if rand($n) < 1;
}

print @picks;
FM
Great explanation of the algorithm. Nit: I think the OP said that two lines make up an entry, so your program would need a minor modification to account for that (i.e. add another readline, divide no. of lines in the rand() call by two). Crazy idea: one might be able to use `File::Stream`, which let's you use a regexp in $/ to read two lines at once. Of course, that would be a bad idea in production because the module's extremely slow.
tsee
+2  A: 

Python solution - reads file only once and requires little memory

Invoke like so getRandomItems(file('myHuge.log'), 5, 2) - will return list of 2 lines

from random import randrange

def getRandomItems(f, skipFirst=0, numItems=1):
    for _ in xrange(skipFirst):
        f.next()
    n = 0; r = []
    while True:
        try:
            nxt = [f.next() for _ in range(numItems)]
        except StopIteration: break
        n += 1
        if not randrange(n):
            r = nxt
    return r

Returns empty list if it could not get the first passable items from f. The code's only requirement is that argument f is an iterator (supports next() method). Hence we can pass something different than file, say we want to see the distribution:

>>> s={}
>>> for i in xrange(5000):
...     r = getRandomItems(iter(xrange(50)))[0]
...     s[r] = 1 + s.get(r,0)
... 
>>> for i in s: 
...     print i, '*' * s[i]
... 
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************
Nas Banov