What would anyone consider the most efficient way to merge two datasets using Python?

A little background - this code will take 100K+ records in the following format:

{user: aUser, transaction: UsersTransactionNumber}, ...

and using the following data

{transaction: aTransactionNumber, activationNumber: associatedActivationNumber}, ...

to create

{user: aUser, activationNumber: associatedActivationNumber}, ...

N.B. These are not Python dictionaries, just the closest notation for portraying the record format cleanly.

So in theory, all I am trying to do is create a view of two lists (or tables) joined on a common key - at first this points me towards sets (unions, etc.), but before I start learning these in depth, are they the way to go? So far I feel this could be implemented as:

  1. Create a list of dictionaries and iterate over the list, comparing the key each time; however, in the worst case this could run up to len(inputDict)*len(outputDict) comparisons <- not sure?

  2. Manipulate the data as an in-memory SQLite table (see the sketch after this list)? Preferably not: although there is no strict requirement to support Python 2.4, staying compatible with it would make life easier.

  3. Some kind of set-based magic?
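
For what it's worth, a minimal sketch of option 2, assuming the rows have already been read into lists of 2-tuples with the hypothetical names user_rows and activation_rows (note the sqlite3 module only ships with Python 2.5+, and the column is named txn because TRANSACTION is an SQL keyword):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (user TEXT, txn TEXT)')
conn.execute('CREATE TABLE activations (txn TEXT, activation TEXT)')
# user_rows and activation_rows are hypothetical lists of (field, field) tuples.
conn.executemany('INSERT INTO users VALUES (?, ?)', user_rows)
conn.executemany('INSERT INTO activations VALUES (?, ?)', activation_rows)
pairs = conn.execute('SELECT u.user, a.activation '
                     'FROM users u JOIN activations a ON u.txn = a.txn').fetchall()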

Clarification

To summarise the whole purpose of this script: the actual data sets come from two different sources. The user and transaction numbers arrive as a CSV output by a performance test that measures email activation code throughput. The second data set comes from parsing the test mailboxes, which contain the transaction id and activation code. The output of this script is then a CSV that gets pumped back into stage 2 of the performance test, activating user accounts using the activation codes that were paired up.

Apologies if my notation for the records was misleading; I have updated it accordingly.

Thanks for the replies; I am going to give two ideas a try:

  • Sorting the lists first (I don't know how expensive this is)
  • Creating a dictionary with the transaction codes as the key, then storing the user and activation code in a list as the value (a sketch of this follows below)
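
A minimal sketch of that second idea, assuming the two datasets have already been parsed into lists of dicts with the hypothetical names transaction_records and mailbox_records:

# Key on the transaction code; the value holds [user, activationNumber].
by_txn = {}
for rec in transaction_records:
    by_txn[rec['transaction']] = [rec['user'], None]
for rec in mailbox_records:
    if rec['transaction'] in by_txn:
        by_txn[rec['transaction']][1] = rec['activationNumber']

# by_txn.values() now yields the [user, activationNumber] pairs;
# any pair still holding None had no matching mailbox record.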

Performance isn't paramount for me; I just want to get into good habits with my Python programming.

+6  A: 

Here's a radical approach.

Don't.

You have two CSV files; one (users) is clearly the driver. Leave this alone. The other -- transaction codes for a user -- can be turned into a simple dictionary.

Don't "combine" or "join" anything except when absolutely necessary. Certainly don't "merge" or "pre-join".

Write your application to simply do lookups in the other collection.

Create a list of dictionaries and iterate over the list, comparing the key each time,

Close. It looks like this. Note: No Sort.

import csv

# Build the lookup dictionary once, keyed by the column you join on.
with open('activations.csv', 'rb') as act_data:
    rdr = csv.DictReader(act_data)
    activations = dict((row['user'], row) for row in rdr)

# Drive off the users file: one pass, one O(1) lookup per row.
with open('users.csv', 'rb') as user_data:
    rdr = csv.DictReader(user_data)
    with open('users_2.csv', 'wb') as updated_data:
        wtr = csv.DictWriter(updated_data, ['some', 'list', 'of', 'columns'])
        for user in rdr:
            user['some_field'] = activations[user['user_id_column']]['some_field']
            wtr.writerow(user)

This is fast and simple. Save the dictionaries (use shelve or pickle).
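
If the lookup needs to survive between runs, here is a minimal shelve sketch, assuming the activations dict built above ('activations.db' is a hypothetical filename):

import shelve

# Persist the lookup so later runs can skip re-parsing the CSV.
# Shelve keys must be strings, which CSV fields already are.
db = shelve.open('activations.db')
for key, row in activations.items():
    db[key] = row
db.close()

# Later: reopen and use it like a read-mostly dict.
db = shelve.open('activations.db')
# some_row = db['some_user_id']
db.close()

Plain pickle works just as well if you always reload the whole dict at once; shelve only pays off when you want dict-like access without loading everything.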

however, in the worst case this could run up to len(inputDict)*len(outputDict) comparisons <- not sure?

False.

One list is the "driving" list. The other is the lookup list. You'll drive by iterating through users and look up the appropriate transaction values. This is O(n) on the list of users. The lookup is O(1) because dictionaries are hashes.

S.Lott
The current dictionaries seem to be merely database rows with named fields, and that's not a good structure for a lookup. Why do you say the current keys are "proper"?
kriss
@kriss: I don't believe that they are database rows. What evidence do you have for your opinion? Keys are "proper" because they're keys in a Python dictionary.
S.Lott
@S.Lott: I believe the syntax used ` {user: myUser, ... }` should be ` {'user': myUser, ...}`. I believe so because of the `my` prefix before variables. I understand this as a use of a dictionary as a named tuple, but it's not necessarily actual database rows; that's just an analogy.
kriss
@S.Lott: another hint that the structure shown is not an actual dictionary (or not actual Python code) is that a single dictionary whose keys are not all of the same type, nor its values (transactions and users), would be quite strange.
kriss
+1  A: 

Sort the two data sets by transaction number. That way, you only ever need to keep one row of each in memory.
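
A minimal sketch of that merge, assuming both datasets are already in memory as lists of dicts (hypothetical names) and each transaction number appears at most once in activations:

# Sort both lists by the join key: the O(n log n) sorts dominate.
users.sort(key=lambda rec: rec['transaction'])
activations.sort(key=lambda rec: rec['transaction'])

merged = []
i = 0
for u in users:
    # Advance the activation cursor until it reaches this user's key.
    while i < len(activations) and activations[i]['transaction'] < u['transaction']:
        i += 1
    if i < len(activations) and activations[i]['transaction'] == u['transaction']:
        merged.append({'user': u['user'],
                       'activationNumber': activations[i]['activationNumber']})

The same two-cursor walk works over two sorted files read line by line, which is where keeping only one row of each in memory pays off.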

Aaron Digulla
-1: Sorts are slow. Dictionaries -- because they're hashed -- are fast.
S.Lott
Sorting becomes the faster option once you can no longer keep the dictionary in RAM.
Aaron Digulla
A: 

I'd create a map myTransactionNumber -> {transaction: myTransactionNumber, activationNumber: myActivationNumber}, then iterate over the {user: myUser, transaction: myTransactionNumber} entries and look each myTransactionNumber up in the map. Each lookup is O(log N) with a tree-based map, where N is the number of entries (with a Python dict it is O(1) on average), so the overall complexity would be O(M*log N), or O(M) with a dict, where M is the number of user entries.

Drakosha
+1  A: 

This looks like a typical use for dictionaries with the transaction number as the key. But you don't have to create the combined structure; just build the lookup dictionaries and use them as needed.

kriss