Is there a built-in method to do it? If not, how can I do this without incurring too much overhead?

A: 

Seek to a random position, read a line and discard it, then read another line. The distribution of lines won't be normal, but that doesn't always matter.

Ignacio Vazquez-Abrams
In particular, this makes it impossible to ever select the first line (as well as picking other lines with a probability proportional to the length of each previous line). My A doesn't produce a normal distribution either (that would be weird -- what mean, what variance?!), but a uniform one, which seems somewhat more likely to meet the OP's meaning for "random".
Alex Martelli
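For illustration only, a minimal sketch of that seek-and-discard idea, with exactly the bias described in the comment: the first line can never be returned, and a line's chance of being picked grows with the length of the line before it:

```python
import os
import random
import tempfile

def seek_random_line(path):
    # Jump to a random byte offset, throw away the (likely partial)
    # line we landed in, and return the next full line.  May return ""
    # when the offset lands inside the last line of the file.
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(random.randrange(size))
        f.readline()            # discard the rest of the current line
        return f.readline().decode()

# Demo on a hypothetical three-line file: "aaa" is never returned.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tf:
    tf.write("aaa\nbbb\nccc\n")
picks = {seek_random_line(tf.name) for _ in range(200)}
os.unlink(tf.name)
```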
+8  A: 

Not built-in, but algorithm R (3.4.2) (Waterman's "Reservoir Algorithm") from Knuth's "The Art of Computer Programming" is good (in a very simplified version):

import random

def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile):
        if random.randrange(num + 2): continue
        line = aline
    return line

The num + 2 produces the sequence 2, 3, 4..., so the randrange will be 0 with a probability of 1.0/(num + 2) -- and that's exactly the probability with which we must replace the currently selected line (the special case of sample size 1 of the referenced algorithm -- see Knuth's book for the proof of correctness; of course we're also in the case of a small-enough "reservoir" to fit in memory;-).
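As a quick sanity check of the uniformity claim, the function can be run repeatedly over an in-memory file (a sketch; `io.StringIO` stands in for a real file object, and the counts are only approximately equal):

```python
import io
import random
from collections import Counter

def random_line(afile):
    # Reservoir sampling with a reservoir of size 1 (Algorithm R):
    # replace the current choice with the k-th line with probability 1/k.
    line = next(afile)
    for num, aline in enumerate(afile):
        if random.randrange(num + 2): continue
        line = aline
    return line

# Each of the four lines should be chosen about 25% of the time.
text = "a\nb\nc\nd\n"
counts = Counter(random_line(io.StringIO(text)) for _ in range(10_000))
print(counts)
```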

Alex Martelli
+1 for translating from MIX to python
aaronasterling
This is reservoir sampling, right?
HenryR
I've always thought that the `random.choice()` function should work on arbitrary iterators as well as sequences, implementing exactly the above algorithm.
Greg Hewgill
@Greg Hewgill, that would be nice but every tenth question would then be "where did my iterator go"
aaronasterling
@aaron, right -- same reason, e.g., there is no `len` for iterators... the "algorithm" is not hard to see, but consuming the iterator is considered a too-often-surprising effect. It's a series of hard design decisions, of course (e.g., `sum` _does_ consume the iterator -- the decision there is that the summation may well be all the user requires while the length or one random item is less likely to be so... always iffy decisions either way -- if we had a way to clearly mark a name as "having side effects", like Ruby's trailing bang, the design choices might be different).
Alex Martelli
@Henry, right - edited the A to attribute it properly, tx for the reminder.
Alex Martelli
+1  A: 

It depends what you mean by "too much" overhead. If storing the whole file in memory is possible, then something like

import random

random_line = random.choice(open("file").readlines())

would do the trick.

cji
+1  A: 
import random

lines = open('file.txt').read().splitlines()
myline = random.choice(lines)
print(myline)

For a very long file: seek to a random position based on the file's length, find the two newline characters after that position (or a newline and the end of file), and take the line between them. If we ended up inside the last line, do it again 100 characters earlier, or from the beginning of the file if the original seek position was <100.

Tony Veijalainen
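A rough sketch of that recipe (an illustration only, under the same caveat as the other seek-based answer: selection is still biased by line lengths, and a pathologically long last line could require more than one retry):

```python
import os
import random
import tempfile

def random_line_long_file(path):
    # Seek to a random offset based on the file's length and take the
    # text between the next two newlines (or a newline and end-of-file).
    size = os.path.getsize(path)
    pos = random.randrange(size)
    with open(path, "rb") as f:
        f.seek(pos)
        f.readline()                 # skip ahead to the next newline
        line = f.readline()          # candidate line
        if not line:                 # we ended up inside the last line:
            # retry 100 characters earlier, or from the start if pos < 100
            f.seek(max(pos - 100, 0))
            f.readline()
            line = f.readline()
        return line.decode()

# Demo on a hypothetical three-line file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tf:
    tf.write("aaa\nbbb\nccc\n")
picks = {random_line_long_file(tf.name) for _ in range(200)}
os.unlink(tf.name)
```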