ansaurus

Question

Python - lines from files - all combinations

Answer 1

+1 A:

A Cartesian product enumerates all combinations. The easiest way to enumerate all combinations is to use nested loops.

It is not easy to write a file in random order. To write to a "random" position, you must use file.seek(). How would you know which position to seek to? How would you know how long each part (prefix+term) will be?

You can, however, read entire files into memory (100 lines is nothing) and process the in-memory collections in "random" orders. This ensures that the output is randomized.
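
A minimal sketch of that approach, assuming both inputs fit in memory. The function names and the sample lists are hypothetical stand-ins for the real files:

```python
import random

def all_combinations(prefixes, terms):
    """Build every prefix+term string with plain nested loops."""
    combined = []
    for prefix in prefixes:
        for term in terms:
            # strip() drops the newline that each line read from a file carries
            combined.append(prefix.strip() + term.strip())
    return combined

def shuffled_combinations(prefixes, terms):
    """Return all combinations in random order."""
    combined = all_combinations(prefixes, terms)
    random.shuffle(combined)  # shuffles the list in place
    return combined

# Small in-memory lists standing in for the lines of the two files.
result = shuffled_combinations(["pre\n", "un\n"], ["fix\n", "do\n"])
```

The whole output is held in one list here, which is what makes shuffling trivial.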

S.Lott 2009-04-26 13:41:50

If I try to process the 100 line files in random order... I need to prevent duplicates in the 10000 line output. How would I do that?

tyndall 2009-04-26 16:03:27

Read the random module documentation, look for methods like shuffle. http://docs.python.org/library/random.html
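
Shuffling only reorders a list, so it cannot introduce duplicates by itself. Duplicates can still appear if an input file repeats a line, or if two different pairs happen to concatenate to the same string. A rough sketch that drops those with a set, assuming that is the behaviour wanted (the function name is hypothetical):

```python
import random

def unique_shuffled(prefixes, terms):
    """Return each distinct prefix+term string once, in random order."""
    # The set comprehension collapses any repeated results.
    combined = {p + t for p in prefixes for t in terms}
    result = list(combined)
    random.shuffle(result)  # a permutation: nothing is added or lost
    return result

# "a"+"bc" and "ab"+"c" both give "abc", and "a" is listed twice.
out = unique_shuffled(["a", "ab", "a"], ["bc", "c"])
```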

S.Lott 2009-04-26 17:14:45

Answer 2

+1 A:

from random import shuffle
a = list(open('prefix.txt'))
b = list(open('terms.txt'))
c = [x.strip() + y.strip() for x in a for y in b]
shuffle(c)
open('result.txt', 'w').write('\n'.join(c))

Certainly, not the best way in terms of speed and memory, but 10000 is not big enough to sacrifice brevity anyway. You should normally close your file objects and you can loop through at least one of the files without storing its content in RAM. ~~This: [:-1] removes the trailing newline from each element of a and b.~~

Edit: now using s.strip() instead of s[:-1] to get rid of the newlines, since it's more portable.
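
For comparison, a longer sketch that closes its files and reads the prefix file one line at a time. The function name write_shuffled and its parameters are made up for illustration:

```python
from random import shuffle

def write_shuffled(prefix_path, terms_path, result_path):
    # The terms stay in memory because they are reused for every prefix.
    with open(terms_path) as terms_file:
        terms = [line.strip() for line in terms_file]

    combined = []
    # The prefix file is iterated line by line and never stored whole.
    with open(prefix_path) as prefix_file:
        for line in prefix_file:
            prefix = line.strip()
            combined.extend(prefix + term for term in terms)

    shuffle(combined)
    # Each "with" block closes its file even if an error occurs.
    with open(result_path, "w") as result_file:
        result_file.write("\n".join(combined))
```

The combined list still has to sit in memory so that it can be shuffled; only the prefix file is streamed.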

ngn 2009-04-26 13:52:35

I would consider at the very least shuffling the cross product of the indices, rather than the actual lines... that way you can pick lines from a and b, and only have to store the contents of the two files plus the shuffled list of integer pairs, which may be cheaper than storing all 10,000 lines of the file to be output. Depends on line length, I suppose.
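
Roughly, the index-based idea could look like this. It is a sketch with hypothetical names, taking the stripped lines of the two files as lists:

```python
import random

def shuffled_by_index(prefixes, terms):
    """Yield every prefix+term string in random order."""
    # Shuffle the cross product of the indices, not the strings themselves.
    index_pairs = [(i, j)
                   for i in range(len(prefixes))
                   for j in range(len(terms))]
    random.shuffle(index_pairs)
    for i, j in index_pairs:
        # Each output line is built only when it is needed.
        yield prefixes[i] + terms[j]

lines = list(shuffled_by_index(["a", "b"], ["x", "y"]))
```

Because the function is a generator, the output lines can be written one at a time without ever holding all of them.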

Blair Conrad 2009-04-26 14:23:26

As I said, this is nowhere near optimal, but quite concise. You know, with an input as small as this, "best way to approach" could well mean "write readable code".

ngn 2009-04-26 14:56:19

Blair, shuffling the lines will be exactly as fast as shuffling the integers (a list is "an array of pointers" after all!) and your approach would incur the cost of an indirection which would no doubt make it slower than ngn's solution. Storing a megabyte or two of text in memory (if the average line in input files is 50-100 bytes) is NOT a problem worth worrying about these days.
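
The megabyte-or-two figure follows from quick arithmetic: 100 x 100 output lines, each roughly two input lines long. A back-of-the-envelope check, counting text bytes only and ignoring Python's per-object overhead (the function name is hypothetical):

```python
def estimated_text_bytes(n_prefixes, n_terms, avg_line_bytes):
    """Approximate size of all combined lines, counting text only."""
    output_lines = n_prefixes * n_terms
    # Each output line is one prefix plus one term.
    return output_lines * 2 * avg_line_bytes

low = estimated_text_bytes(100, 100, 50)    # 1,000,000 bytes
high = estimated_text_bytes(100, 100, 100)  # 2,000,000 bytes
```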

Alex Martelli 2009-04-26 16:00:46

Answer 3

+4 A:

You need itertools.product.

import itertools

for prefix, term in itertools.product(open('prefix.txt'), open('terms.txt')):
    print(prefix.strip() + term.strip())

Print them, or accumulate them, or write them directly. You need the .strip() because of the newline that comes with each of them.

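Afterwards, you can shuffle them by reading the output back into a list, e.g. lines = list(open('thirdfile.txt')), and calling random.shuffle(lines). Note that random.shuffle works in place and returns None, so the list has to be bound to a name first. I don't know how fast that will be on a file of the sizes you are using.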
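
A sketch of that second pass, assuming the combinations were already written to a file with one per line. The helper name shuffle_file_lines is made up:

```python
import random

def shuffle_file_lines(path):
    """Rewrite the file at `path` with its lines in random order."""
    with open(path) as f:
        lines = f.read().splitlines()
    # shuffle() works in place and returns None, so keep the list in a variable.
    random.shuffle(lines)
    with open(path, "w") as f:
        f.writelines(line + "\n" for line in lines)
```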

sykora 2009-04-26 14:00:12

+1 wow. very cool. I like this itertools library.

tyndall 2009-04-26 16:08:00

Do I need to do anything special to "import" itertools? It breaks in IronPython.

tyndall 2009-04-27 01:58:36

I'm sorry, I have no experience with IronPython. itertools is part of the CPython standard library, so it should be available. Maybe someone can help me (and him) out here?
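
One possible cause, and this is only a guess: itertools.product was added in Python 2.6, so an interpreter that tracks an older version would have itertools but not product. In that case a hand-written two-argument replacement is easy to sketch (simple_product is a hypothetical name):

```python
def simple_product(first, second):
    """Yield (x, y) for every x in first and every y in second."""
    # The second iterable is walked repeatedly, so store it once.
    second = list(second)
    for x in first:
        for y in second:
            yield (x, y)

try:
    # Present in CPython 2.6 and later.
    from itertools import product
except ImportError:
    # Fall back to the hand-written version on older interpreters.
    product = simple_product

pairs = list(product("ab", "xy"))
```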

sykora 2009-04-27 10:37:55
