I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files. How can i select random files and random entry(pair of line, excluding header) ?
Answer is in Python. Assuming you can read a whole file into memory.
#using python 2.6
import sys
import os
import itertools
import random
def main(directory, num_files=5, num_entries=5):
file_paths = os.listdir(directory)
# get a random sampling of the available paths
chosen_paths = random.sample(file_paths, num_files)
for path in chosen_paths:
chosen_entries = get_random_entries(path, num_entries)
for entry in chosen_entries:
# do something with your chosen entries
print entry
def get_random_entries(file_path, num_entries):
with open(file_path, 'r') as file:
# read the lines and slice off the headers
lines = file.readlines()[5:]
# group the lines into pairs (i.e. entries)
entries = list(itertools.izip_longest(*[iter(lines)]*2))
# return a random sampling of entries
return random.sample(entries, num_entries)
if __name__ == '__main__':
#use optparse here to do fancy things with the command line args
main(sys.argv[1:])
Another Python option; reading the contents of all files into memory:
import random
import fileinput
def openhook(filename, mode):
f = open(filename, mode)
headers = [f.readline() for _ in range(5)]
return f
num_entries = 3
lines = list(fileinput.input(openhook=openhook))
print random.sample(lines, num_entries)
If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.
Here's an intuitive explanation for the algorithm.
Process the file line by line.
pick = line, with probability 1/N, where N = line number
In other words, on line 1, we will pick line 1 with 1/1
probability. On line 2, we will change the pick to line 2, with 1/2
probability. On line 3, we will change the pick to line 3, with 1/3
probability. Etc.
For an intuitive proof, imagine a file with 3 lines:
1 Pick line 1.
/ \
.5 .5
/ \
2 1 Switch to line 2?
/ \ / \
.67 .33 .33 .67
/ \ / \
2 3 1 Switch to line 3?
The probability for each outcome:
Line 1: .5 * .67 = 1/3
Line 2: .5 * .67 = 1/3
Line 3: .5 * .33 * 2 = 1/3
From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4
chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4
).
Here's the Perl FAQ answer, adapted to your problem.
use strict;
use warnings;
# Ignore 5 lines.
<> for 1 .. 5;
# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
my @lines;
$lines[$_] = <> for 0 .. 1;
$n ++;
@picks = @lines if rand($n) < 1;
}
print @picks;
Python solution - reads file only once and requires little memory
Invoke like so getRandomItems(file('myHuge.log'), 5, 2)
- will return list of 2 lines
from random import randrange
def getRandomItems(f, skipFirst=0, numItems=1):
for _ in xrange(skipFirst):
f.next()
n = 0; r = []
while True:
try:
nxt = [f.next() for _ in range(numItems)]
except StopIteration: break
n += 1
if not randrange(n):
r = nxt
return r
Returns empty list if it could not get the first passable items from f. The code's only requirement is that argument f
is an iterator (supports next()
method). Hence we can pass something different than file, say we want to see the distribution:
>>> s={}
>>> for i in xrange(5000):
... r = getRandomItems(iter(xrange(50)))[0]
... s[r] = 1 + s.get(r,0)
...
>>> for i in s:
... print i, '*' * s[i]
...
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************