I have two groups of files that contain data in CSV format with a common key (Timestamp) - I need to walk through all the records chronologically.
Group A: 'Environmental Data'
- Filenames are in format A_0001.csv, A_0002.csv, etc.
- Pre-sorted ascending
- Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
- Contains environmental data in CSV/column format
- Very large, several GBs worth of data
Group B: 'Event Data'
- Filenames are in format B_0001.csv, B_0002.csv, etc.
- Pre-sorted ascending
- Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
- Contains event-based data in CSV/column format
- Relatively small compared to Group A files, < 100 MB
What is the best approach?
- Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
- Real-time merge: Implement code to 'merge' the files in real-time
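To make the real-time merge option concrete, here is a minimal sketch of how it could look using `heapq.merge`, which lazily merges pre-sorted streams without loading them into memory. The column name `Timestamp` and the glob patterns are assumptions based on the description above; adjust them to the real layout.

```python
import csv
import glob
import heapq

def read_group(pattern):
    """Yield (timestamp, row) pairs from a group of pre-sorted CSV files.

    Assumes each file has a header row containing a 'Timestamp' column.
    Files are visited in filename order, which matches chronological
    order given the A_0001, A_0002, ... naming scheme.
    """
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                # YYYY-MM-DD HH:MM:SS timestamps sort correctly as strings
                yield row["Timestamp"], row

# Lazily merge both streams in timestamp order; memory use stays
# bounded no matter how large the files are.
merged = heapq.merge(read_group("A_*.csv"), read_group("B_*.csv"),
                     key=lambda pair: pair[0])

for ts, row in merged:
    pass  # process each record chronologically here
```

Because this streams both groups, each post-processing iteration re-reads the raw files; whether that beats a one-off pre-merge depends on how many passes you make over the data.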
I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.