I have two files, prefix.txt and terms.txt, each with about 100 lines. I'd like to write out a third file containing their Cartesian product (a cross join, see http://en.wikipedia.org/wiki/Join_(SQL)#Cross_join), which comes to about 10,000 lines.

What is the best way to approach this in Python?

Secondly, is there a way to write the 10,000 lines to the third file in a random order?

+1  A: 

A Cartesian product enumerates all combinations. The easiest way to enumerate all combinations is to use nested loops.
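
As a minimal sketch of the nested-loop approach (the function name cross_join is made up for illustration, and the lines are assumed to have been read into lists already):

```python
def cross_join(prefixes, terms):
    """Return every prefix joined with every term, in nested-loop order."""
    combined = []
    for prefix in prefixes:        # outer loop: each prefix once
        for term in terms:         # inner loop: every term for that prefix
            # strip() drops the trailing newline that file lines carry
            combined.append(prefix.strip() + term.strip())
    return combined
```

With 100 prefixes and 100 terms this returns 100 * 100 = 10,000 strings.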

You cannot easily write a file in random order. To write to a "random" position, you must use file.seek(). But how will you know which position to seek to? How do you know how long each part (prefix+term) will be?

You can, however, read both files entirely into memory (100 lines is nothing) and process the in-memory collections in "random" order. This will ensure that the output is randomized.

S.Lott
If I try to process the 100 line files in random order... I need to prevent duplicates in the 10000 line output. How would I do that?
tyndall
Read the random module documentation, look for methods like shuffle. http://docs.python.org/library/random.html
S.Lott
+1  A: 

from random import shuffle
a = list(open('prefix.txt'))
b = list(open('terms.txt'))
c = [x.strip() + y.strip() for x in a for y in b]
shuffle(c)
open('result.txt', 'w').write('\n'.join(c))

Certainly not the best way in terms of speed and memory, but 10,000 lines is not big enough to sacrifice brevity anyway. You should normally close your file objects, and you could loop through at least one of the files without storing its contents in RAM. The .strip() calls remove the trailing newline from each element of a and b.

Edit: now using s.strip() instead of s[:-1] to get rid of the newlines; it's more portable.
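
For comparison, a longer sketch that closes its files and streams one of the inputs line by line. The function name write_shuffled_product and its parameters are assumptions for illustration, not part of the snippet above:

```python
import random

def write_shuffled_product(prefix_path, terms_path, out_path):
    """Write every prefix+term combination to out_path in random order."""
    # The terms are reused for every prefix, so they are kept in a list.
    with open(terms_path) as f:
        terms = [line.strip() for line in f]

    combined = []
    # The prefixes are only needed once each, so they are streamed.
    with open(prefix_path) as f:
        for line in f:
            prefix = line.strip()
            combined.extend(prefix + term for term in terms)

    random.shuffle(combined)       # shuffles the list in place
    with open(out_path, 'w') as f:
        f.write('\n'.join(combined))
    return len(combined)           # number of lines written
```

The full output list still has to sit in memory, because shuffling needs all of it at once; only the prefix file avoids being stored.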

ngn
I would consider at the very least shuffling the cross product of the indices, rather than the actual lines... that way you can pick lines from a and b, and only have to store the contents of the two files plus the shuffled list of integer pairs, which may be cheaper than storing all 10,000 lines of the file to be output. Depends on line length, I suppose.
Blair Conrad
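The index-shuffling idea could look roughly like this. It is a sketch only: shuffled_pairs is a hypothetical name, and a and b are assumed to be lists of already-stripped lines.

```python
import random

def shuffled_pairs(a, b):
    """Yield a[i] + b[j] once for every (i, j) pair, in random order."""
    # Only the integer pairs are stored and shuffled, not the joined strings.
    pairs = [(i, j) for i in range(len(a)) for j in range(len(b))]
    random.shuffle(pairs)
    for i, j in pairs:
        yield a[i] + b[j]          # build each output line on demand
```

Each pair appears exactly once in the list, so the output contains no duplicates unless the input files themselves contain repeated lines.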
As I said, this is nowhere near optimal, but quite concise. You know, with an input as small as this, "best way to approach" could well mean "write readable code".
ngn
Blair, shuffling the lines will be exactly as fast as shuffling the integers (a list is "an array of pointers" after all!) and your approach would incur the cost of an indirection which would no doubt make it slower than ngn's solution. Storing a megabyte or two of text in memory (if the average line in input files is 50-100 bytes) is NOT a problem worth worrying about these days.
Alex Martelli
+4  A: 

You need itertools.product.

import itertools

for prefix, term in itertools.product(open('prefix.txt'), open('terms.txt')):
    print(prefix.strip() + term.strip())

Print them, or accumulate them, or write them directly. You need the .strip() because of the newline that comes with each of them.

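Afterwards, you can shuffle them by reading the lines into a list, e.g. lines = list(open('thirdfile.txt')), and then calling random.shuffle(lines). Note that shuffle works in place and returns None, so the list has to be bound to a name first. I don't know how fast that will be on a file of the sizes you are using.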
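
A sketch of doing the product and the shuffle in memory, before anything is written. The function name shuffled_product is made up for illustration:

```python
import itertools
import random

def shuffled_product(prefixes, terms):
    """Return all prefix+term combinations as a list in random order."""
    combined = [p.strip() + t.strip()
                for p, t in itertools.product(prefixes, terms)]
    # random.shuffle reorders the list in place and returns None,
    # so the list must be bound to a name before shuffling.
    random.shuffle(combined)
    return combined
```

Open file objects can be passed directly as prefixes and terms, since itertools.product reads each input completely before it yields the first pair.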

sykora
+1 wow. very cool. I like this itertools library.
tyndall
Do I need to do anything special to "import" itertools? It breaks in IronPython.
tyndall
I'm sorry, I have no experience with IronPython. itertools is part of the CPython standard library, so it should be available. Maybe someone can help me (and him) out here?
sykora
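One possible explanation, offered as a guess since the actual error isn't shown: itertools.product was only added in Python 2.6, so an interpreter that implements an older version of the language has the itertools module but not product. In that case a small stand-in can be written by hand. The name simple_product is hypothetical, and this version handles positional iterables only:

```python
def simple_product(*iterables):
    """Yield tuples for every combination, with the last input varying fastest."""
    pools = [tuple(it) for it in iterables]    # read each input fully, once
    results = [[]]
    for pool in pools:
        # extend every partial combination with every item of this pool
        results = [combo + [item] for combo in results for item in pool]
    for combo in results:
        yield tuple(combo)
```

It can be dropped into the loop above in place of itertools.product, e.g. for prefix, term in simple_product(open('prefix.txt'), open('terms.txt')).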