I have two groups of files that contain data in CSV format with a common key (Timestamp) - I need to walk through all the records chronologically.

  • Group A: 'Environmental Data'

    • Filenames are in format A_0001.csv, A_0002.csv, etc.
    • Pre-sorted ascending
    • Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
    • Contains environmental data in CSV/column format
    • Very large, several GBs' worth of data
  • Group B: 'Event Data'

    • Filenames are in format B_0001.csv, B_0002.csv, etc.
    • Pre-sorted ascending
    • Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
    • Contains event-based data in CSV/column format
    • Relatively small compared to Group A files, < 100 MB

What is the best approach?

  • Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
  • Real-time merge: Implement code to 'merge' the files in real time

I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.

A: 

I would suggest pre-merge.

Reading a file takes a lot of processor time; reading two files takes twice as much. Since your program will be dealing with a large input (lots of files, especially in Group A), I think it would be better to get it over with in one file read and have all your relevant data in that one file. It would also reduce the number of variables and read statements you will need.

This will improve the runtime of your algorithm, and I think that's a good enough reason in this scenario to use this approach.

Hope this helps.

inspectorG4dget
+2  A: 

I'm thinking that importing it into a DB (MySQL, SQLite, etc.) will give better performance than merging it in a script. The DB typically has optimized routines for loading CSV, and the join will probably be as fast as, or much faster than, merging two dicts (one of them very large) in Python.
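
A sketch of that route using Python's built-in sqlite3 module (the table names, column layout and file patterns below are assumptions for illustration; the timestamp is taken to be the first CSV field, with the rest kept as a single text blob):

import csv
import glob
import sqlite3

conn = sqlite3.connect("merged.db")
conn.execute("CREATE TABLE IF NOT EXISTS env (ts TEXT, data TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS event (ts TEXT, data TEXT)")

def load(table, pattern):
    # Bulk-load one group of CSV files into its table.
    for name in sorted(glob.glob(pattern)):
        with open(name) as f:
            rows = ((r[0], ",".join(r[1:])) for r in csv.reader(f))
            conn.executemany("INSERT INTO %s VALUES (?, ?)" % table, rows)
    conn.commit()

load("env", "A_*.csv")
load("event", "B_*.csv")
conn.execute("CREATE INDEX IF NOT EXISTS env_ts ON env (ts)")

# Walk both tables chronologically; an actual join on a timestamp range is
# also possible once the data is loaded.
query = ("SELECT ts, data, 'env' AS src FROM env "
         "UNION ALL SELECT ts, data, 'event' FROM event ORDER BY ts")
for ts, data, src in conn.execute(query):
    pass  # process the record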

jspcal
I had thought of that, but we are talking about data-warehouse levels of records. In total I have well over 100 GB of data (millions and millions of records). Maybe I was too quick to dismiss it, but I thought that, given how the files/records are already set up, the merge/sort could be done more quickly/elegantly within Python than by going through a DB.
belvoir
A: 

You could read from the files in chunks of, say, 10,000 records (or whatever number further profiling tells you is optimal) and merge on the fly, possibly using a custom class to encapsulate the I/O; the actual records could then be accessed through the iterator protocol (__iter__ + next).

This would be memory-friendly, probably very good in terms of total time to complete the operation, and would enable you to produce output incrementally.

A sketch:

class Foo(object):

    def __init__(self, env_filenames=(), event_filenames=()):
        self._cache = []
        # open the files etc.

    def __iter__(self):
        return self

    def next(self):
        if not self._cache:
            # take care of reading more records into self._cache;
            # raise StopIteration once the files are exhausted
            pass
        # return the first record and pop it from the cache
        return self._cache.pop(0)

    # ... other stuff you need ...
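
Usage could then look something like this (the file lists are placeholders, and assume Foo refills its cache as sketched above):

merged = Foo(env_filenames=["A_0001.csv", "A_0002.csv"],
             event_filenames=["B_0001.csv"])
for record in merged:
    pass  # records arrive in timestamp order, a chunk at a time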
Michał Marczyk
+2  A: 

"YYYY-MM-DD HH:MM:SS" can be sorted with a simple ascii compare. How about reusing external merge logic? If the first field is the key then:

import os

# sort -m merge-joins inputs that are already sorted; -t, -k1,1 keys on the
# first comma-separated field
for entry in os.popen("sort -m -t, -k1,1 file1 file2"):
    process(entry)
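
To feed in every file from both groups at once (GNU sort merges any number of pre-sorted inputs), the invocation might look something like this; the glob patterns follow the naming scheme in the question and process() is the same placeholder as above:

import glob
import subprocess

files = sorted(glob.glob("A_*.csv")) + sorted(glob.glob("B_*.csv"))
merge = subprocess.Popen(["sort", "-m", "-t,", "-k1,1"] + files,
                         stdout=subprocess.PIPE)
for entry in merge.stdout:
    process(entry)  # placeholder per-record handler
merge.stdout.close()
merge.wait()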
pixelbeat
+1  A: 

This is similar to a relational join. Since your timestamps don't have to match, it's called a non-equijoin.

Sort-merge is one of several popular algorithms, and it works well for non-equijoins. I think this is what you're calling "pre-merge". I don't know what you mean by "merge in real time", but I suspect it's still a simple sort-merge, which is a fine technique, heavily used by real databases.
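
For what it's worth, a "merge in real time" in Python could be as small as this sketch (file patterns and record layout are assumptions; heapq.merge does the streaming k-way merge and relies on the ASCII-sortable timestamps):

import csv
import glob
import heapq

def records(pattern, tag):
    # Stream (timestamp, tag, row) tuples from one pre-sorted group of files.
    for name in sorted(glob.glob(pattern)):
        with open(name) as f:
            for row in csv.reader(f):
                yield (row[0], tag, row)

# Only one pending record per input stream is held in memory at a time.
for ts, tag, row in heapq.merge(records("A_*.csv", "env"),
                                records("B_*.csv", "event")):
    pass  # process records in chronological order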

Nested Loops can also work. In this case, you read the smaller table in the outer loop. In the inner loop you find all of the "matching" rows from the larger table. This is effectively a sort-merge, but with an assumption that there will be multiple rows from the big table that will match the small table.
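
One way that pattern could look in Python, assuming the event rows fit in memory and taking "matching" to mean the environmental rows that fall between one event's timestamp and the next (just one possible reading of the non-equijoin):

def nested_loop_join(event_rows, env_rows):
    # event_rows / env_rows: iterables of CSV rows, each pre-sorted by the
    # timestamp string in column 0 (e.g. csv.reader objects chained per group)
    events = list(event_rows)              # small table: held in memory
    env_rows = iter(env_rows)
    pending = next(env_rows, None)
    for i, event in enumerate(events):     # outer loop over the small table
        upper = events[i + 1][0] if i + 1 < len(events) else None
        while pending is not None and (upper is None or pending[0] < upper):
            yield event, pending           # inner loop over the big table
            pending = next(env_rows, None)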

This, BTW, will allow you to more properly assign meaning to the relationship between Event Data and Environmental Data. Rather than reading the result of a massive sort merge and trying to determine which kind of record you've got, the nested loops handle that well.

Also, you can do "lookups" into the smaller table while reading the larger table.

This is hard when you're doing non-equal comparisons because you don't have a proper key to do a simple retrieval from a simple dict. However, you can easily extend dict (override __contains__ and __getitem__) to do range comparisons on a key instead of simple equality tests.
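
A sketch of that dict extension, using bisect for the range search; treating a lookup as "the most recent event at or before this timestamp" is an assumption made for illustration:

import bisect

class EventLookup(dict):
    # Keys are event timestamps; __contains__/__getitem__ accept any timestamp
    # and resolve it to the nearest preceding event instead of an exact match.

    def __init__(self, events):
        # events: iterable of (timestamp, event_row) pairs
        dict.__init__(self, events)
        self._keys = sorted(self)

    def _index(self, ts):
        return bisect.bisect_right(self._keys, ts) - 1

    def __contains__(self, ts):
        return self._index(ts) >= 0

    def __getitem__(self, ts):
        i = self._index(ts)
        if i < 0:
            raise KeyError(ts)
        return dict.__getitem__(self, self._keys[i])

While streaming the large table, lookup[row[0]] (with lookup built once over the event data) then resolves each environmental row to its event without needing an exact key.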

S.Lott