
I never actually thought I'd run into speed issues with Python, but I have. I'm trying to compare really big lists of dictionaries to each other based on the dictionary values. I compare two lists, with the first looking like this:

biglist1 = [{"transaction": "somevalue", "id": "somevalue", "date": "somevalue" ...}, {"transaction": "somevalue", "id": "somevalue", "date": "somevalue" ...}, ...]

With "somevalue" standing for a user-generated string, int or decimal. Now, the second list is pretty similar, except the id-values are always empty, as they have not been assigned yet.

biglist2 = [{"transaction": "somevalue", "id": "", "date": "somevalue" ...}, {"transaction": "somevalue", "id": "", "date": "somevalue" ...}, ...]

So I want to get a list of the dictionaries in biglist2 that match the dictionaries in biglist1 on every key except id.

I've been doing this:

list_transactionnamematches = []
for item in biglist2:
    for transaction in biglist1:
        if item['transaction'] == transaction['transaction']:
            list_transactionnamematches.append(transaction)

list_datematches = []
for item in biglist2:
    for transaction in list_transactionnamematches:
        if item['date'] == transaction['date']:
            list_datematches.append(transaction)

... and so on, filtering on each key except id, until I get a final list of matches. Since the lists can be really big (around 3000+ items each), this takes quite some time for Python to loop through.

I'm guessing this isn't really how this kind of comparison should be done. Any ideas?

+4  A: 

What you want to do is use the correct data structures:

  1. Build a dictionary that maps, for each record in the first list, a tuple of its non-id values to its id.

  2. Create two sets of those tuples, one per list. Then use set operations to get the tuple set you want.

  3. Use the dictionary from step 1 to assign ids to those tuples, as sketched below.
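
A minimal sketch of those three steps, assuming every record has the same fields and that transaction and date are the only non-id values (the helper key_of and the variable names are mine, for illustration):

def key_of(record):
    # Tuple of every value except id; tuples are hashable, so they can
    # serve as dict keys and set members.
    return (record['transaction'], record['date'])

# Step 1: map each tuple of non-id values to its id.
id_by_key = dict((key_of(r), r['id']) for r in biglist1)

# Step 2: intersect the two tuple sets to find records present in both lists.
common = set(key_of(r) for r in biglist2) & set(id_by_key)

# Step 3: use the step 1 mapping to assign ids.
for r in biglist2:
    if key_of(r) in common:
        r['id'] = id_by_key[key_of(r)]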

J S
+1: use tuples as keys to a mapping.
S.Lott
I've written a code example that does this, but I think the set operations in step 2 are unnecessary, since you can cheaply check whether your target tuple is in your step 1 dict key list.
recursive
+1  A: 

In O(m*n)...

list_transactionnamematches = []
for item in biglist2:
    for transaction in biglist1:
        if (item['transaction'] == transaction['transaction'] and
                item['date'] == transaction['date'] and
                item['foo'] == transaction['foo']):
            list_transactionnamematches.append(transaction)
Triptych
This will loop through biglist1 a total of len(biglist2) times.
Sparr
Right you are. Changed the intro text.
Triptych
Hooray for community interaction in action.
Sparr
Haha. And thanks. That's what I get for SOing at work.
Triptych
+1  A: 

Forgive my rusty Python syntax, it's been a while, so consider this partially pseudocode:

import operator

# Sort both lists by (date, transaction) so they can be walked in lockstep.
biglist1.sort(key=operator.itemgetter('date', 'transaction'))
biglist2.sort(key=operator.itemgetter('date', 'transaction'))

biglist3 = []
i1 = 0
i2 = 0
while i1 < len(biglist1) and i2 < len(biglist2):
    key1 = (biglist1[i1]['date'], biglist1[i1]['transaction'])
    key2 = (biglist2[i2]['date'], biglist2[i2]['transaction'])
    if key1 == key2:
        biglist3.append(biglist1[i1])
        i1 += 1
        i2 += 1
    elif key1 < key2:
        i1 += 1
    elif key1 > key2:
        i2 += 1
    else:
        print "this won't happen if I did the tuple comparison correctly"

This sorts both lists into the same order, by (date, transaction). Then it walks through them side by side, advancing whichever index lags and collecting adjacent matches. It assumes that (date, transaction) is unique, and that I am not completely off my rocker with regard to tuple sorting and comparison.

Sparr
+10  A: 

Index on the fields you want to use for lookup. This runs in O(n+m):

matches = []
biglist1_indexed = {}

for item in biglist1:
    biglist1_indexed[(item["transaction"], item["date"])] = item

for item in biglist2:
    if (item["transaction"], item["date"]) in biglist1_indexed:
        matches.append(item)

This is probably thousands of times faster than what you're doing now.

recursive
"if a in b:" is a search operation, which isn't constant time. In effect, this is still O(m*n) assuming a tuple search is linear.
codelogic
That's a bad assumption, because it's not. It's a hashtable lookup.
recursive
Cool, didn't know that +1 :)
codelogic
More info: Python's dictionary implementation reduces the average complexity of dictionary lookups to O(1) http://wiki.python.org/moin/DictionaryKeys
recursive
Ok, but in the statement "if a in b.keys()" it's doing a list search, not a dictionary lookup, correct?
codelogic
Ah never mind, you edited it.
codelogic
Well, in python 3, it would be a hash lookup, but I wanted to make sure it was backward compatible too :)
recursive
Info about changes to dict.keys() in python 3 and others: http://www.python.org/dev/peps/pep-3106/
recursive
+1: dictionary keys based on tuples.
S.Lott
Awesome, thanks for the reference.
codelogic
Very interesting.
ayaz
Yes, seriously, this *was* thousands of times faster than what I did. Thanks a lot. Could someone also explain why this is so much faster? I'm a newbie, so I really didn't understand everything.
Tuomas
This might be made easier to understand (for the newbies) with biglist1_indexed.has_key(tuple). Just a thought.
Triptych
@Tuomas: The key is that checking for the existence of a dict key is much faster than a brute-force iteration over all the items. With n items, your original version has to loop n times, but theoretically a dict key lookup takes the same time regardless of n.
recursive
A: 

The approach I would probably take to this is to make a very, very lightweight class with one instance variable and one method. The instance variable is a pointer to a dictionary; the method overrides the built-in special method __hash__(self), returning a value calculated from all the values in the dictionary except id.

From there the solution seems fairly obvious: create two initially empty dictionaries, N and M (for no-matches and matches). Loop over each list exactly once, and for each dictionary representing a transaction (call it a Tx_dict), create an instance of the new class (a Tx_ptr). Then test for an item matching this Tx_ptr in N and M: if there is no matching item in N, insert the current Tx_ptr into N; if there is a matching item in N but not in M, insert the current Tx_ptr into M, with the Tx_ptr itself as the key and a list containing the Tx_ptr as the value; if there is a matching item in both N and M, append the current Tx_ptr to the list stored under that key in M.

After you've gone through every item once, your dictionary M will contain pointers to all the transactions which match other transactions, all neatly grouped together into lists for you.

Edit: Oops! Obviously, the correct action if there is a matching Tx_ptr in N but not in M is to insert a key-value pair into M with the current Tx_ptr as the key and as the value, a list of the current Tx_ptr and the Tx_ptr that was already in N.
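
Here is a rough sketch of that idea (my own illustration, not code from the answer; note that for the dictionary membership tests to work, the class also needs __eq__ defined consistently with __hash__, which the answer doesn't mention):

class TxPtr(object):
    # Lightweight wrapper holding a pointer to one transaction dict.
    def __init__(self, tx_dict):
        self.tx = tx_dict

    def _key(self):
        # Stable tuple of (key, value) pairs, excluding id.
        return tuple(sorted((k, v) for k, v in self.tx.items() if k != 'id'))

    def __hash__(self):
        # Hash calculated from all the values in the dictionary except id.
        return hash(self._key())

    def __eq__(self, other):
        # Required alongside __hash__ so dict lookups compare by content.
        return self._key() == other._key()

N = {}  # no-matches: the first Tx_ptr seen for each key
M = {}  # matches: key -> list of all Tx_ptrs sharing that key

for tx in biglist1 + biglist2:
    ptr = TxPtr(tx)
    if ptr not in N:
        N[ptr] = ptr
    elif ptr not in M:
        # First repeat: group it with the Tx_ptr already in N (per the edit above).
        M[ptr] = [N[ptr], ptr]
    else:
        M[ptr].append(ptr)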

A: 

Have a look at Psyco. It's a Python compiler that can create very fast, optimized machine code from your source.

http://sourceforge.net/projects/psyco/

While this isn't a direct solution to your code's efficiency issues, it could still help speed things up without needing to write any new code. That said, I'd still highly recommend optimizing your code as much as possible AND using Psyco to squeeze as much speed out of it as possible.

Part of their guide specifically talks about using it to speed up list, string, and numeric computation heavy functions.

http://psyco.sourceforge.net/psycoguide/node8.html
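
For reference, typical usage is just a couple of lines at the top of your program (this assumes Psyco is installed; note it only supports 32-bit x86 builds of Python 2):

import psyco
psyco.full()  # JIT-compile every function as it gets executed

# ... the rest of the program, e.g. the matching loops, runs unchanged ...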

Soviut