Python: What is the quickest way to split a file into two files, each containing half of the lines of the original file, with the lines distributed between the two files at random?

For example, if the file is 1 2 3 4 5 6 7 8 9 10

it could be split into:

3 2 10 9 1

4 6 8 5 7

A: 
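import random

with open("file") as f:
    data = f.readlines()
random.shuffle(data)
# integer division, so the split point is a valid index even for odd lengths
half = len(data) // 2
# test.1 gets the first half of the shuffled lines, test.2 gets the rest
for c, chunk in enumerate((data[:half], data[half:]), start=1):
    with open("test." + str(c), "w") as out:
        out.writelines(chunk)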
ghostdog74
+2  A: 

This sort of operation is often called "partition". Although there isn't a built-in partition function, I found this article: Partition in Python.

Given that definition, you can do this:

import random

def partition(l, pred):
    yes, no = [], []
    for e in l:
        if pred(e):
            yes.append(e)
        else:
            no.append(e)
    return yes, no

with open("file.txt") as f:
    lines = f.readlines()
lines1, lines2 = partition(lines, lambda x: random.random() < 0.5)

Note that this won't necessarily split the file exactly in two: each line goes to either list with probability 0.5, so the two halves are equal in size only on average.
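
If an exact split matters, one option is to choose exactly half of the line indices up front and then shuffle each half. A minimal sketch, assuming the lines are already in a list; random_halves is a hypothetical helper name, not something from the linked article:

```python
import random

def random_halves(lines):
    """Split lines into two randomly chosen, randomly ordered lists.

    The first list always has exactly len(lines) // 2 items.
    """
    # pick exactly half of the positions, without repetition
    picked = set(random.sample(range(len(lines)), len(lines) // 2))
    first = [line for i, line in enumerate(lines) if i in picked]
    second = [line for i, line in enumerate(lines) if i not in picked]
    # the comprehensions keep the original order, so shuffle each half
    random.shuffle(first)
    random.shuffle(second)
    return first, second
```

With an odd number of lines, the second list ends up one line longer than the first.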

Greg Hewgill
Now shuffle lines1 and lines2, and write them to new files, and you're done.
Ned Batchelder
This doesn't guarantee that the file will be split evenly. It will only be split evenly on average, over a large number of runs.
Justin Peel
+4  A: 

You can just load the file, call random.shuffle on the resulting list, and then split it into two files (untested code):

def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # make sure the last line ends with a newline (skipped for an empty file)
    if lines:
        lines[-1] = lines[-1].rstrip('\n') + '\n'

    shuffle(lines)

    with open(outfilename1, 'w') as f:
        f.writelines(lines[:len(lines) // 2])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[len(lines) // 2:])

random.shuffle shuffles lines in place, and pretty much does all the work here. Python's slice notation (e.g. lines[len(lines) // 2:]) makes things really convenient.
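
To make the in-place behaviour and the slicing concrete, here is a small illustrative snippet. The sample list is made up for the demonstration:

```python
import random

lines = ['1\n', '2\n', '3\n', '4\n', '5\n']

# shuffle reorders the list itself and returns None,
# so do not write "lines = random.shuffle(lines)"
result = random.shuffle(lines)

half = len(lines) // 2   # 5 // 2 == 2
first = lines[:half]     # the first 2 shuffled lines
second = lines[half:]    # the remaining 3 shuffled lines
```

Between them, the two slices cover every element exactly once, whatever the length of the list.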

I'm assuming that the file isn't huge, i.e. that it will fit comfortably in memory. If that's not the case, you'll need to do something a bit more fancy, probably using the linecache module to read random line numbers from your input file. I think probably you would want to generate two lists of line numbers, using a similar technique to what's shown above.
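
For the large-file case, here is a rough sketch of the "two lists of line numbers" idea. It rests on a few assumptions and has not been tried on big inputs. It records byte offsets and uses seek rather than linecache, because linecache keeps all of a file's lines in its cache. The name shuffle_split_large is made up for this example:

```python
import random

def shuffle_split_large(infilename, outfilename1, outfilename2):
    # First pass: remember the byte offset at which each line starts.
    # Binary mode keeps the offsets exact.
    offsets = []
    pos = 0
    with open(infilename, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)

    # Only the offsets are shuffled and held in memory, not the lines.
    random.shuffle(offsets)
    half = len(offsets) // 2
    targets = ((outfilename1, offsets[:half]), (outfilename2, offsets[half:]))

    # Second pass: jump to each chosen offset and copy that line out.
    with open(infilename, 'rb') as src:
        for name, chunk in targets:
            with open(name, 'wb') as out:
                for off in chunk:
                    src.seek(off)
                    line = src.readline()
                    # the last line of the input may lack a newline
                    if not line.endswith(b'\n'):
                        line += b'\n'
                    out.write(line)
```

The list of offsets still has to fit in memory, and the random seeks will be slow on very large files, so treat this as a starting point rather than a tuned solution.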

Update: changed / to // so the slice indices stay integers when true division is in effect (from __future__ import division, or Python 3).

intuited
The code does not run: shuffle returns None because it shuffles the list in place. I wrote a corrected version.
Tony Veijalainen
Thanks for pointing that out, it's corrected now.
intuited
A: 

Another version:

from random import shuffle

def shuffle_split(infilename, outfilename1, outfilename2):
    with open(infilename, 'r') as f:
        lines = f.read().splitlines()

    shuffle(lines)
    half_lines = len(lines) // 2

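    # pop half of the shuffled lines into the first file, one per line
    with open(outfilename1, 'w') as f:
        f.writelines(lines.pop() + '\n' for _ in range(half_lines))
    # whatever is left goes into the second file
    with open(outfilename2, 'w') as f:
        f.writelines(line + '\n' for line in lines)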
Tony Veijalainen