ansaurus

Question

Python class to merge sorted files, how can this be improved?

Answer 1

+7 A:

Note that in python2.6, heapq has a new merge function which will do this for you.

To handle the custom key function, you can just wrap the file iterator with something that decorates it so that it compares based on the key, and strip it out afterwards:

def decorated_file(f, key):
    for line in f: 
        yield (key(line), line)

filenames = ['file1.txt','file2.txt','file3.txt']
files = map(open, filenames)
outfile = open('merged.txt')

for line in heapq.merge(*[decorated_file(f, keyfunc) for f in files]):
    outfile.write(line[1])

[Edit] Even in earlier versions of python, it's probably worthwhile simply to take the implementation of merge from the later heapq module. It's pure python, and runs unmodified in python2.5, and since it uses a heap to get the next minimum should be very efficient when merging large numbers of files.

You should be able to simply copy the heapq.py from a python2.6 installation, copy it to your source as "heapq26.py" and use "from heapq26 import merge" - there are no 2.6 specific features used in it. Alternatively, you could just copy the merge function (rewriting the heappop etc calls to reference the python2.5 heapq module).

Brian 2009-06-16 13:50:56

Actually, I'm still using python 2.5.

tgray 2009-06-16 14:02:11

This is a great answer though, I searched Google for weeks and couldn't find this.

tgray 2009-06-16 14:06:37

Answer 2

+2 A:

<< This "answer" is a comment on the original questioner's resultant code >>

Suggestion: using eval() is ummmm and what you are doing restricts the caller to using lambda -- key extraction may require more than a one-liner, and in any case don't you need the same function for the preliminary sort step?

So replace this:

def mergeSortedFiles(paths, output_path, dedup=True, key="line.split('\t')[1].lower().replace(' ', '')"):
    keyfunc = eval("lambda line: %s" % key)

with this:

def my_keyfunc(line):
    return line.split('\t', 2)[1].replace(' ', '').lower()
    # minor tweaks may speed it up a little

def mergeSortedFiles(paths, output_path, keyfunc, dedup=True):

John Machin 2009-06-17 00:33:01

Thanks, the eval() felt wierd to me too, but I didn't know the alternative. I had gotten the method from this recipe: http://code.activestate.com/recipes/576755/

tgray 2009-06-17 20:09:09

That recipe provides the eval() gimmick only as an optional feature for those who are brave enough to type their key extraction function's source into the command-line when they're running a multi-GB sort :-) You'll notice that this was cleanly separated; both the merge and sort functions take a function for the key arg, not a string.

John Machin 2009-06-17 22:48:14

ansaurus

tags:

views:

answers:

Python class to merge sorted files, how can this be improved?

related questions