ansaurus

Question

merge sort in python

Answer 1

+4 A:

If your files are not very large, then simply read them all into memory (as S. Lott suggests). That would definitely be simplest.

However, you mention that collation creates one "massive" file. If it's too massive to fit in memory, then perhaps use heapq.merge. It may be a little harder to set up, but it has the advantage of not requiring that all the iterables be pulled into memory at once.

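# Note: written for Python 2 (the file builtin and contextlib.nested are gone in Python 3)
import heapq
import contextlib

class Domain(object):
    def __init__(self, domain):
        self.domain = domain
    @property
    def tld(self):
        # Put your function for calculating TLD here.
        # This placeholder returns the text before the first dot.
        return self.domain.split('.', 1)[0]
    def __lt__(self, other):
        # Must be a strict less-than; <= would report equal items as smaller
        return self.tld < other.tld
    def __str__(self):
        return self.domain

class DomFile(file):
    # Wrap each line read from the file in a Domain so it compares by tld
    def next(self):
        return Domain(file.next(self).strip())

filenames = ('data1.txt', 'data2.txt')
with contextlib.nested(*(DomFile(filename, 'r') for filename in filenames)) as fhs:
    for elt in heapq.merge(*fhs):
        print(elt)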

with data1.txt:

google.com
stackoverflow.com
yahoo.com

and data2.txt:

standards.freedesktop.org
www.imagemagick.org

yields:

google.com
stackoverflow.com
standards.freedesktop.org
www.imagemagick.org
yahoo.com
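
On Python 3 the built-in `file` type and `contextlib.nested` no longer exist, and since Python 3.5 `heapq.merge` accepts a `key` argument, so the wrapper classes are not strictly needed there. Here is a minimal sketch under those assumptions. The names `first_label` and `merge_sorted_files` are hypothetical, the key mirrors the placeholder above (the text before the first dot, not a real TLD), and each input file is assumed to be already sorted by that key.

```python
import contextlib
import heapq


def first_label(line):
    # Placeholder key, same idea as Domain.tld above: the text before the
    # first dot. Swap in your own TLD calculation here.
    return line.strip().split('.', 1)[0]


def merge_sorted_files(filenames, key=first_label):
    """Lazily yield stripped lines from several pre-sorted files in key order."""
    with contextlib.ExitStack() as stack:
        # ExitStack closes every opened file once the generator is exhausted.
        files = [stack.enter_context(open(name, 'r')) for name in filenames]
        # heapq.merge holds only one pending line per file in memory.
        for line in heapq.merge(*files, key=key):
            yield line.strip()


# Example use (file names are illustrative):
# for domain in merge_sorted_files(('data1.txt', 'data2.txt')):
#     print(domain)
```

With the two sample files above, this should produce the same ordering as the class-based version.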

unutbu 2010-08-24 18:48:35

I wouldn't know how to make this work in my case. I need to use the 'key' function of .sort() because I'm sorting based on TLD rather than the first character in the line.

d-c 2010-08-24 18:53:07

I've edited my answer to show how you could sort things other than numbers.

unutbu 2010-08-24 19:03:07

@~unutbu: You can use `sorted( lst, key=Domain)` instead of explicitly mapping.

THC4k 2010-08-24 19:41:06

@THC4k: Thanks for the suggestion, but I'm not sure I follow. `sorted` will return strings. I need `dom1` to be an iterable of `Domain` objects. (Otherwise, `heapq.merge` will brainlessly merge them as strings instead of according to TLD.)

unutbu 2010-08-24 19:44:16

@~unutbu: Sorry, you're right and I was too hasty.

THC4k 2010-08-24 19:49:05

Answer 2

A:

Unless your file is incomprehensibly huge, it will fit into memory.

Your pseudo-code is hard to read. Please indent your pseudo-code correctly. The final "loop by reading next line" makes no sense.

Basically, it's this.

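def sort_key(line):
    # Placeholder: return whatever your keys are
    return line

all_data = []
for f in list_of_files:  # list_of_files: the paths of your input files
    with open(f, 'r') as source:
        all_data.extend(source.readlines())
all_data.sort(key=sort_key)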

You're done. You can write all_data to a file, or process it further or whatever you want to do with it.
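
As a concrete illustration of the sort key, here is a minimal sketch that orders lines by their last dot-separated label. The helper names `last_label` and `sort_files_in_memory` are hypothetical, and treating the last label as the TLD is a simplifying assumption (it does not handle suffixes such as `co.uk`).

```python
def last_label(line):
    # Assumption: the TLD is whatever follows the final dot, e.g. 'com' or 'org'.
    return line.strip().rsplit('.', 1)[-1]


def sort_files_in_memory(list_of_files, key=last_label):
    """Read every line of every file into one list and sort it by key."""
    all_data = []
    for name in list_of_files:
        with open(name, 'r') as source:
            all_data.extend(source.readlines())
    # list.sort is stable, so lines with equal keys keep their read order.
    all_data.sort(key=key)
    return all_data
```

The returned lines keep the newlines they were read with, so they can generally be passed straight to `writelines` on an output file.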

S.Lott 2010-08-24 18:53:57

Is a few gigs too huge?

d-c 2010-08-24 21:06:33

@d-c: Nope. A few gigs is fine. It only begins to matter once you have more gigs of data than the virtual memory configured in your OS swap space.

S.Lott 2010-08-24 22:47:30

Answer 3

A:

Another option (again, only if all your data won't fit into memory) is to create a SQLite3 database, do the sorting there, and then write the result to a file.
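
One way this could look, using Python's standard `sqlite3` module, is sketched below. The function name `sort_lines_with_sqlite`, the table layout, and the choice of the last dot-separated label as the sort key are all assumptions for illustration. `db_path` is assumed to be a fresh on-disk file, so the data does not have to live in memory.

```python
import sqlite3


def sort_lines_with_sqlite(filenames, db_path, out_path):
    """Load lines into a SQLite table, then write them out in sorted order."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute('CREATE TABLE lines (sort_key TEXT, line TEXT)')
        for name in filenames:
            with open(name, 'r') as source:
                # Assumed key: the text after the final dot of each line.
                rows = ((line.strip().rsplit('.', 1)[-1], line.strip())
                        for line in source if line.strip())
                conn.executemany('INSERT INTO lines VALUES (?, ?)', rows)
        conn.commit()
        with open(out_path, 'w') as out:
            # Let SQLite do the sorting; rows are streamed one at a time.
            query = 'SELECT line FROM lines ORDER BY sort_key, line'
            for (line,) in conn.execute(query):
                out.write(line + '\n')
    finally:
        conn.close()
```

Unlike the merge approach, this does not require the input files to be sorted beforehand, at the cost of writing every line to the database first.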

JonC 2010-08-24 19:34:42
