Is there a built-in method to do it? If not, how can I do this without incurring too much overhead?

A: 

Seek to a random position, read a line and discard it, then read another line. The distribution of lines won't be normal, but that doesn't always matter.

Ignacio Vazquez-Abrams
In particular, this makes it impossible to ever select the first line (as well as picking other lines with a probability proportional to the length of each previous line). My A doesn't produce a normal distribution either (that would be weird -- what mean, what variance?!), but a uniform one, which seems somewhat more likely to meet the OP's meaning for "random".
Alex Martelli
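For illustration only, a minimal sketch of that seek-and-discard idea, with exactly the bias described in the comment: the first line can never be returned, and a line's chance of being picked grows with the length of the line before it:

```python
import os
import random
import tempfile

def seek_random_line(path):
    # Jump to a random byte offset, throw away the (likely partial)
    # line we landed in, and return the next full line.  May return ""
    # when the offset lands inside the last line of the file.
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(random.randrange(size))
        f.readline()            # discard the rest of the current line
        return f.readline().decode()

# Demo on a hypothetical three-line file: "aaa" is never returned.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tf:
    tf.write("aaa\nbbb\nccc\n")
picks = {seek_random_line(tf.name) for _ in range(200)}
os.unlink(tf.name)
```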
+8  A: 

Not built-in, but algorithm R (3.4.2) (Waterman's "Reservoir Algorithm") from Knuth's "The Art of Computer Programming" is good (in a very simplified version):

import random

def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile):
        if random.randrange(num + 2): continue
        line = aline
    return line

The num + 2 produces the sequence 2, 3, 4..., so the randrange will be 0 with a probability of 1.0/(num + 2) -- and that's exactly the probability with which we must replace the currently selected line (the special case of sample size 1 of the referenced algorithm -- see Knuth's book for the proof of correctness; of course we're also in the case of a small-enough "reservoir" to fit in memory;-).
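As a quick sanity check of the uniformity claim, the function can be run repeatedly over an in-memory file (a sketch; `io.StringIO` stands in for a real file object, and the counts are only approximately equal):

```python
import io
import random
from collections import Counter

def random_line(afile):
    # Reservoir sampling with a reservoir of size 1 (Algorithm R):
    # replace the current choice with the k-th line with probability 1/k.
    line = next(afile)
    for num, aline in enumerate(afile):
        if random.randrange(num + 2): continue
        line = aline
    return line

# Each of the four lines should be chosen about 25% of the time.
text = "a\nb\nc\nd\n"
counts = Counter(random_line(io.StringIO(text)) for _ in range(10_000))
print(counts)
```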

Alex Martelli
+1 for translating from MIX to python
aaronasterling
This is reservoir sampling, right?
HenryR
I've always thought that the `random.choice()` function should work on arbitrary iterators as well as sequences, implementing exactly the above algorithm.
Greg Hewgill
@Greg Hewgill, that would be nice but every tenth question would then be "where did my iterator go"
aaronasterling
@aaron, right -- same reason, e.g., there is no `len` for iterators... the "algorithm" is not hard to see, but consuming the iterator is considered a too-often-surprising effect. It's a series of hard design decisions, of course (e.g., `sum` _does_ consume the iterator -- the decision there is that the summation may well be all the user requires while the length or one random item is less likely to be so... always iffy decisions either way -- if we had a way to clearly mark a name as "having side effects", like Ruby's trailing bang, the design choices might be different).
Alex Martelli
@Henry, right - edited the A to attribute it properly, tx for the reminder.
Alex Martelli
+1  A: 

It depends what you mean by "too much" overhead. If storing the whole file in memory is possible, then something like

import random

random_line = random.choice(open("file").readlines())

would do the trick.

cji
+1  A: 
import random

lines = open('file.txt').read().splitlines()
myline = random.choice(lines)
print(myline)

For a very long file: seek to a random position based on the file's length, find the two newline characters after that position (or a newline and the end of file), and take the line between them. If we ended up inside the last line, do it again 100 characters earlier, or from the beginning of the file if the original seek position was <100.

Tony Veijalainen
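A rough sketch of that recipe (an illustration only, under the same caveat as the other seek-based answer: selection is still biased by line lengths, and a pathologically long last line could require more than one retry):

```python
import os
import random
import tempfile

def random_line_long_file(path):
    # Seek to a random offset based on the file's length and take the
    # text between the next two newlines (or a newline and end-of-file).
    size = os.path.getsize(path)
    pos = random.randrange(size)
    with open(path, "rb") as f:
        f.seek(pos)
        f.readline()                 # skip ahead to the next newline
        line = f.readline()          # candidate line
        if not line:                 # we ended up inside the last line:
            # retry 100 characters earlier, or from the start if pos < 100
            f.seek(max(pos - 100, 0))
            f.readline()
            line = f.readline()
        return line.decode()

# Demo on a hypothetical three-line file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tf:
    tf.write("aaa\nbbb\nccc\n")
picks = {random_line_long_file(tf.name) for _ in range(200)}
os.unlink(tf.name)
```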