views: 109
answers: 5

This is similar to the question at http://stackoverflow.com/questions/3559807/merge-sort-in-python; I'm restating it here because I don't think I explained the problem very well over there.

Basically, I have a series of about 1000 files, all containing domain names. Altogether the data is > 1 GB, so I'm trying to avoid loading it all into RAM. Each individual file has been sorted using .sort(get_tld), which orders the data by TLD rather than by domain name (all the .coms together, all the .orgs together, etc.).

A typical file might look like this:

something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org

but obviously much longer.

I now want to merge all the files into one, maintaining the sort so that in the end the one large file will still have all the .coms together, .orgs together, etc.

What I want to do basically is

open all the files
loop:
    read 1 line from each open file
    put them all in a list and sort with .sort(get_tld)
    write each item from the list to a new file

The problem I'm having is that I can't figure out how to loop over the files. I can't use with open() as ... because I don't have one file open to loop over, I have many. Also, they're all of variable length, so I have to make sure I get all the way through the longest one.

Any advice is much appreciated.

+6  A: 

Whether you're able to keep 1000 files open at once is a separate issue and depends on your OS and its configuration; if not, you'll have to proceed in two steps -- merge groups of N files into temporary ones, then merge the temporary ones into the final-result file (two steps should suffice, as they let you merge a total of N squared files; as long as N is at least 32, merging 1000 files should therefore be possible). In any case, this is a separate issue from the "merge N input files into one output file" task (it's only a question of whether you call that function once or repeatedly).

The general idea for the function is to keep a priority queue (module heapq is good at that;-) with small lists containing the "sorting key" (the current TLD, in your case) followed by the last line read from the file, and finally the open file ready for reading the next line (and something distinct in between to ensure that the normal lexicographical order won't accidentally end up trying to compare two open files, which would fail). I think some code is probably the best way to explain the general idea, so next I'll edit this answer to supply the code (however I have no time to test it, so take it as pseudocode intended to communicate the idea;-).

import heapq

def merge(inputfiles, outputfile, key):
  """inputfiles: list of input, sorted files open for reading.
     outputfile: output file open for writing.
     key: callable supplying the "key" to use for each line.
  """
  # prepare the heap: items are lists with [thekey, k, theline, thefile]
  # where k is an arbitrary int guaranteed to be different for all items,
  # theline is the last line read from thefile and not yet written out,
  # (guaranteed to be a non-empty string), thekey is key(theline), and
  # thefile is the open file
  h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
  h = [[key(s), k, s, i] for k, s, i in h if s]
  heapq.heapify(h)

  while h:
    # get and output the lowest available item (==available item w/lowest key)
    item = heapq.heappop(h)
    outputfile.write(item[2])

    # replenish the item with the _next_ line from its file (if any)
    item[2] = item[3].readline()
    if not item[2]: continue  # don't reinsert finished files

    # compute the key, and re-insert the item appropriately
    item[0] = key(item[2])
    heapq.heappush(h, item)

Of course, in your case, as the key function you'll want one that extracts the top-level domain given a line that's a domain name (with trailing newline) -- in a previous question you were already pointed to the urlparse module as preferable to string manipulation for this purpose. If you do insist on string manipulation,

def tld(domain):
  return domain.rsplit('.', 1)[-1].strip()

or something along these lines is probably a reasonable approach under this constraint.
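
A minimal sketch of how the merge function above might be driven (untested; the file names and the decision to open everything up front are assumptions made for the example, not part of the answer):

# Hypothetical driver for merge(); the input/output names are made up.
input_names = ["domains-%04d.txt" % i for i in range(1000)]
inputs = [open(name) for name in input_names]
try:
    with open("merged.txt", "w") as out:
        merge(inputs, out, tld)
finally:
    for f in inputs:
        f.close()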

If you use Python 2.6 or better, heapq.merge is the obvious alternative, but in that case you need to prepare the iterators yourself (including ensuring that open file objects never end up being compared by accident...) with a "decorate / undecorate" approach similar to the one I use in the more portable code above.
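
As a rough, untested sketch of that decorate/undecorate idea with heapq.merge (the wrapper generator below is an illustration, not code from the answer):

import heapq

def merge26(inputfiles, outputfile, key):
    # Decorate each line as (key(line), file_index, line): ties on the key fall
    # back to the integer index, so two open file objects are never compared.
    def decorated(f, index):
        for line in f:
            yield key(line), index, line
    streams = [decorated(f, i) for i, f in enumerate(inputfiles)]
    for _, _, line in heapq.merge(*streams):
        outputfile.write(line)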

Alex Martelli
+3  A: 

You want to use merge sort, e.g. heapq.merge. I'm not sure if your OS allows you to open 1000 files simultaneously. If not, you may have to do it in 2 or more passes.
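
A rough sketch of such a multi-pass merge, reusing the merge() function from Alex's answer above (the batch size and temporary-file names are assumptions made for the example):

import os

def merge_in_batches(filenames, outputname, key, batch=32):
    # Pass 1: merge each group of `batch` input files into a temporary file.
    temps = []
    for n, start in enumerate(range(0, len(filenames), batch)):
        group = [open(name) for name in filenames[start:start + batch]]
        tempname = "merge-pass1-%d.tmp" % n
        with open(tempname, "w") as out:
            merge(group, out, key)
        for f in group:
            f.close()
        temps.append(tempname)
    # Pass 2: merge the temporaries (at most ~32 of them) into the final output.
    group = [open(name) for name in temps]
    with open(outputname, "w") as out:
        merge(group, out, key)
    for f in group:
        f.close()
    for name in temps:
        os.remove(name)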

Wai Yip Tung
+2  A: 

Why don't you divide the domains by first letter? That way you would just split the source files into 26 or more files, which could be named something like domains-a.dat, domains-b.dat. Then you can load each of these entirely into RAM, sort it, and write it out to a common file.

So:
    the ~1000 input files are split into 26+ source files
    the 26+ source files are loaded individually, sorted in RAM, and then written to the combined file

If 26 files are not enough, I'm sure you could split into even more files... domains-ab.dat. The point is that files are cheap and easy to work with (in Python and many other languages), and you should use them to your advantage.
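
A rough sketch of that bucketing idea, reading "first letter" as the first letter of the TLD (an assumption) so that the sorted buckets concatenate into a TLD-grouped result; the key argument is expected to behave like the tld() helper shown earlier:

def bucket_and_sort(filenames, outputname, key):
    # Pass 1: spool every line into a per-letter bucket file (domains-a.dat, ...).
    buckets = {}
    for name in filenames:
        for line in open(name):
            letter = key(line)[:1]
            if letter not in buckets:
                buckets[letter] = open("domains-%s.dat" % letter, "w+")
            buckets[letter].write(line)
    # Pass 2: each bucket should now fit in RAM; sort it and append it to the output.
    with open(outputname, "w") as out:
        for letter in sorted(buckets):
            spool = buckets[letter]
            spool.seek(0)
            out.writelines(sorted(spool, key=key))
            spool.close()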

orvado
+1  A: 

Your algorithm for merging sorted files is incorrect. What you should do is read one line from each file, find the lowest-ranked item among all the lines read, write it to the output file, and then read the next line only from the file that item came from. Repeat this process (ignoring any files that are at EOF) until the end of all the files has been reached.
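
A minimal sketch of that loop (untested; the function name and key argument are made up, and it is essentially Alex's approach with a linear scan in place of the heap):

def naive_merge(inputfiles, outputfile, key):
    # current[k] holds the latest unwritten line from inputfiles[k] ("" at EOF).
    current = [f.readline() for f in inputfiles]
    while any(current):
        # Index of the lowest-keyed pending line, skipping exhausted files.
        i = min((k for k, line in enumerate(current) if line),
                key=lambda k: key(current[k]))
        outputfile.write(current[i])
        current[i] = inputfiles[i].readline()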

kindall
+1, but that's basically what Alex's solution does.
EOL
A: 
#! /usr/bin/env python

"""Usage: unconfuse.py file1 file2 ... fileN

Reads a list of domain names from each file, and writes them to standard output grouped by TLD.
"""

import sys, os

spools = {}

# First pass: spool each line into a temporary per-TLD file (e.g. "com.spool").
for name in sys.argv[1:]:
    for line in file(name):
        if (line == "\n"): continue
        # The TLD is everything after the last dot, minus the trailing newline.
        tld = line[line.rindex(".")+1:-1]
        spool = spools.get(tld, None)
        if (spool == None):
            spool = file(tld + ".spool", "w+")
            spools[tld] = spool
        spool.write(line)

# Second pass: replay the spools in TLD order, then delete them.
for tld in sorted(spools.iterkeys()):
    spool = spools[tld]
    spool.seek(0)
    for line in spool:
        sys.stdout.write(line)
    spool.close()
    os.remove(spool.name)
Tom Anderson
Either everyone else has misinterpreted the question, and is trying to solve a harder problem that is not the one @d-c is trying to solve, or I have, and this is useless. To be precise, he is *not* actually interested in sorting at all, just in merging some already-sorted lists, for a small value of 'sorted'.
Tom Anderson
Disregard that, I suck Blochs. We've all understood the problem correctly, and mine is merely a simple but rather inefficient solution.
Tom Anderson