What would anyone consider the most efficient way to merge two datasets using Python?

A little background - this code will take 100K+ records in the following format:

{user: aUser, transaction: UsersTransactionNumber}, ...

and using the following data

{transaction: aTransactionNumber, activationNumber: associatedActivationNumber}, ...

to create

{user: aUser, activationNumber: associatedActivationNumber}, ...

N.B. These are not Python dictionaries, just the closest notation for portraying the record format cleanly.

So in theory, all I am trying to do is create a view of two lists (or tables) joined on a common key - at first this points me towards sets (unions, etc.), but before I start learning these in depth, are they the way to go? So far I feel this could be implemented as:

  1. Create a list of dictionaries and iterate over the list, comparing the key each time; however, in the worst case this could run up to len(inputDict)*len(outputDict) comparisons <- not sure?

  2. Manipulate the data as an in-memory SQLite table (see the sketch after this list)? Preferably not: although there is no strict requirement to support Python 2.4, staying compatible with it would make life easier.

  3. Some kind of set-based magic?
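
For what it's worth, a minimal sketch of option 2, assuming the rows have already been read into lists of 2-tuples with the hypothetical names user_rows and activation_rows (note the sqlite3 module only ships with Python 2.5+, and the column is named txn because TRANSACTION is an SQL keyword):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (user TEXT, txn TEXT)')
conn.execute('CREATE TABLE activations (txn TEXT, activation TEXT)')
# user_rows and activation_rows are hypothetical lists of (field, field) tuples.
conn.executemany('INSERT INTO users VALUES (?, ?)', user_rows)
conn.executemany('INSERT INTO activations VALUES (?, ?)', activation_rows)
pairs = conn.execute('SELECT u.user, a.activation '
                     'FROM users u JOIN activations a ON u.txn = a.txn').fetchall()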

Clarification

To summarise the whole purpose of this script: the actual data sets come from two different sources. The user and transaction numbers arrive as a CSV output by a performance test that measures email activation code throughput. The second data set comes from parsing the test mailboxes, which contain the transaction id and activation code. The output of this script is then a CSV that gets pumped back into stage 2 of the performance test, activating user accounts using the activation codes that were paired up.

Apologies if my notation for the records was misleading; I have updated it accordingly.

Thanks for the replies; I am going to give two ideas a try:

  • Sorting the lists first (I don't know how expensive this is)
  • Creating a dictionary with the transaction codes as the key, then storing the user and activation code in a list as the value (a sketch of this follows below)
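
A minimal sketch of that second idea, assuming the two datasets have already been parsed into lists of dicts with the hypothetical names transaction_records and mailbox_records:

# Key on the transaction code; the value holds [user, activationNumber].
by_txn = {}
for rec in transaction_records:
    by_txn[rec['transaction']] = [rec['user'], None]
for rec in mailbox_records:
    if rec['transaction'] in by_txn:
        by_txn[rec['transaction']][1] = rec['activationNumber']

# by_txn.values() now yields the [user, activationNumber] pairs;
# any pair still holding None had no matching mailbox record.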

Performance isn't paramount for me; I just want to get into good habits with my Python programming.

+6  A: 

Here's a radical approach.

Don't.

You have two CSV files; one (users) is clearly the driver. Leave this alone. The other -- transaction codes for a user -- can be turned into a simple dictionary.

Don't "combine" or "join" anything except when absolutely necessary. Certainly don't "merge" or "pre-join".

Write your application to simply do lookups in the other collection.

Create a list of dictionaries and iterate over the list, comparing the key each time,

Close. It looks like this. Note: No Sort.

import csv

# Build the lookup dictionary once, keyed by the column you join on.
with open('activations.csv', 'rb') as act_data:
    rdr = csv.DictReader(act_data)
    activations = dict((row['user'], row) for row in rdr)

# Drive off the users file: one pass, one O(1) lookup per row.
with open('users.csv', 'rb') as user_data:
    rdr = csv.DictReader(user_data)
    with open('users_2.csv', 'wb') as updated_data:
        wtr = csv.DictWriter(updated_data, ['some', 'list', 'of', 'columns'])
        for user in rdr:
            user['some_field'] = activations[user['user_id_column']]['some_field']
            wtr.writerow(user)

This is fast and simple. Save the dictionaries (use shelve or pickle).
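
If the lookup needs to survive between runs, here is a minimal shelve sketch, assuming the activations dict built above ('activations.db' is a hypothetical filename):

import shelve

# Persist the lookup so later runs can skip re-parsing the CSV.
# Shelve keys must be strings, which CSV fields already are.
db = shelve.open('activations.db')
for key, row in activations.items():
    db[key] = row
db.close()

# Later: reopen and use it like a read-mostly dict.
db = shelve.open('activations.db')
# some_row = db['some_user_id']
db.close()

Plain pickle works just as well if you always reload the whole dict at once; shelve only pays off when you want dict-like access without loading everything.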

however, in the worst case this could run up to len(inputDict)*len(outputDict) comparisons <- not sure?

False.

One list is the "driving" list. The other is the lookup list. You'll drive by iterating through users and look up the appropriate transaction values. This is O(n) on the list of users. The lookup is O(1) because dictionaries are hashes.

S.Lott
The current dictionaries seem to be merely database rows with named fields, and that's not a good structure for a lookup. Why do you say the current keys are "proper"?
kriss
@kriss: I don't believe that they are database rows. What evidence do you have for your opinion? Keys are "proper" because they're keys in a Python dictionary.
S.Lott
@S.Lott: I believe the syntax used ` {user: myUser, ... }` should be ` {'user': myUser, ...}`. I believe so because of the `my` prefix before variables. I understand this as a use of a dictionary as a named tuple, but it's not necessarily actual database rows; that's just an analogy.
kriss
@S.Lott: another hint that the structure shown is not an actual dictionary (or not actual Python code) is that a single dictionary whose keys are not all of the same type, nor its values (transactions and users), would be quite strange.
kriss
+1  A: 

Sort the two data sets by transaction number. That way, you only ever need to keep one row of each in memory.
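
A minimal sketch of that merge, assuming both datasets are already in memory as lists of dicts (hypothetical names) and each transaction number appears at most once in activations:

# Sort both lists by the join key: the O(n log n) sorts dominate.
users.sort(key=lambda rec: rec['transaction'])
activations.sort(key=lambda rec: rec['transaction'])

merged = []
i = 0
for u in users:
    # Advance the activation cursor until it reaches this user's key.
    while i < len(activations) and activations[i]['transaction'] < u['transaction']:
        i += 1
    if i < len(activations) and activations[i]['transaction'] == u['transaction']:
        merged.append({'user': u['user'],
                       'activationNumber': activations[i]['activationNumber']})

The same two-cursor walk works over two sorted files read line by line, which is where keeping only one row of each in memory pays off.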

Aaron Digulla
-1: Sorts are slow. Dictionaries -- because they're hashed -- are fast.
S.Lott
Sorting becomes the faster option once you can no longer keep the dictionary in RAM.
Aaron Digulla
A: 

I'd create a map myTransactionNumber -> {transaction: myTransactionNumber, activationNumber: myActivationNumber}, then iterate over the {user: myUser, transaction: myTransactionNumber} entries and look each myTransactionNumber up in the map. Each lookup is O(log N) with a tree-based map, where N is the number of entries (with a Python dict it is O(1) on average), so the overall complexity would be O(M*log N), or O(M) with a dict, where M is the number of user entries.

Drakosha
+1  A: 

This looks like a typical use for dictionaries with the transaction number as the key. But you don't have to create the combined structure; just build the lookup dictionaries and use them as needed.

kriss