ansaurus

Question

python: quickest way to split a file into two files randomly

Answer 1

A:

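import random

with open("file") as src:
    data = src.readlines()
random.shuffle(data)

c = 1
f = open("test." + str(c), "w")
for n, line in enumerate(data):
    # Integer division: with "/" on Python 3 an odd line count never matches,
    # so every line would land in the first file.
    if n == len(data) // 2:
        c += 1
        f.close()
        f = open("test." + str(c), "w")
    f.write(line)
f.close()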

ghostdog74 2010-10-09 02:47:33

Answer 2

+2 A:

This sort of operation is often called "partition". Although there isn't a built-in partition function, I found this article: Partition in Python.

Given that definition, you can do this:

import random

def partition(l, pred):
    yes, no = [], []
    for e in l:
        if pred(e):
            yes.append(e)
        else:
            no.append(e)
    return yes, no

with open("file.txt") as f:
    lines = f.readlines()
lines1, lines2 = partition(lines, lambda x: random.random() < 0.5)

Note that this won't necessarily exactly split the file in two, but it will on average.
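To turn the partition above into two output files, the lists still have to be shuffled and written out. The following is only a minimal sketch of that step: the helper name partition_to_files and its three path parameters are hypothetical, not part of the original answer, and it assumes the file fits in memory.

```python
import random


def partition(items, pred):
    """Split items into (matching, non_matching) according to pred."""
    yes, no = [], []
    for item in items:
        (yes if pred(item) else no).append(item)
    return yes, no


def partition_to_files(in_path, out_path1, out_path2):
    # Hypothetical helper; the three path parameters are illustrative names.
    with open(in_path) as f:
        lines = f.readlines()
    # Make sure the last line ends with a newline so every output row stays on its own line.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"

    # Each line goes to the first list with probability 0.5,
    # so the two sizes are equal only on average.
    lines1, lines2 = partition(lines, lambda _line: random.random() < 0.5)

    for path, chunk in ((out_path1, lines1), (out_path2, lines2)):
        random.shuffle(chunk)  # shuffles in place and returns None
        with open(path, "w") as out:
            out.writelines(chunk)
```

A call such as partition_to_files("file.txt", "part1.txt", "part2.txt") would then produce the two files; the file names here are placeholders.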

Greg Hewgill 2010-10-09 02:50:51

Now shuffle lines1 and lines2, and write them to new files, and you're done.

Ned Batchelder 2010-10-09 03:08:13

This doesn't guarantee that the file will be split evenly. It will only be split evenly on average (for doing this a large number of times).

Justin Peel 2010-10-09 06:23:06

Answer 3

+4 A:

You can just load the file, call random.shuffle on the resulting list, and then split it into two files (untested code):

def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # append a newline in case the last line didn't end with one
    # (the guard avoids an IndexError on an empty file)
    if lines:
        lines[-1] = lines[-1].rstrip('\n') + '\n'

    shuffle(lines)

    with open(outfilename1, 'w') as f:
        f.writelines(lines[:len(lines) // 2])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[len(lines) // 2:])

random.shuffle shuffles lines in-place, and pretty much does all the work here. Python's sequence indexing system (e.g. lines[len(lines) // 2:]) makes things really convenient.

I'm assuming that the file isn't huge, i.e. that it will fit comfortably in memory. If that's not the case, you'll need to do something a bit more fancy, probably using the linecache module to read random line numbers from your input file. I think probably you would want to generate two lists of line numbers, using a similar technique to what's shown above.
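For the large-file case, the idea of picking line numbers first can be sketched roughly as below. This is an assumption-laden illustration, not the answer's own method: it swaps linecache for two plain passes over the file (linecache, as far as I know, caches the whole file in memory anyway), the function name stream_split is hypothetical, and each output keeps its lines in their original relative order rather than shuffled. Only the chosen line numbers are held in memory, not the lines themselves.

```python
import random


def stream_split(in_path, out_path1, out_path2):
    # Hypothetical helper: the name and parameters are illustrative.
    # First pass: count the lines without keeping them in memory.
    with open(in_path) as src:
        total = sum(1 for _ in src)

    # Randomly choose which line numbers go to the first file (exactly half, rounded down).
    first = set(random.sample(range(total), total // 2))

    # Second pass: route each line to one of the two outputs.
    with open(in_path) as src, \
            open(out_path1, "w") as out1, \
            open(out_path2, "w") as out2:
        for number, line in enumerate(src):
            if not line.endswith("\n"):
                line += "\n"  # the last line may lack a trailing newline
            (out1 if number in first else out2).write(line)
```

If the order within each output also has to be random, the outputs would need a further shuffling step, which this sketch leaves out.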

Update: changed / to // to avoid problems when from __future__ import division is in effect.

intuited 2010-10-09 03:34:28

The code does not run: shuffle returns None because it shuffles lines in place. I wrote a corrected version.

Tony Veijalainen 2010-10-09 18:25:51

Thanks for pointing that out, it's corrected now.

intuited 2010-10-09 20:09:54

Answer 4

A:

Another version:

from random import shuffle

def shuffle_split(infilename, outfilename1, outfilename2):
    with open(infilename, 'r') as f:
        lines = f.read().splitlines()

    shuffle(lines)
    half_lines = len(lines) // 2

    with open(outfilename1, 'w') as f:
        f.write('\n'.join(lines.pop() for count in range(half_lines)))
    with open(outfilename2, 'w') as f:
        # write(), not writelines(): the joined result is a single string
        f.write('\n'.join(lines))

Tony Veijalainen 2010-10-09 18:21:59
