
I never actually thought I'd run into speed issues with Python, but I have. I'm trying to compare really big lists of dictionaries to each other based on the dictionary values. I compare two lists, with the first looking like this:

biglist1 = [{"transaction": "somevalue", "id": "somevalue", "date": "somevalue" ...}, {"transaction": "somevalue", "id": "somevalue", "date": "somevalue" ...}, ...]

With "somevalue" standing for a user-generated string, int or decimal. Now, the second list is pretty similar, except the id-values are always empty, as they have not been assigned yet.

biglist2 = [{"transaction": "somevalue", "id": "", "date": "somevalue" ...}, {"transaction": "somevalue", "id": "", "date": "somevalue" ...}, ...]

So I want to get a list of the dictionaries in biglist2 that match the dictionaries in biglist1 on every key except id.

I've been doing this:

list_transactionnamematches = []
for item in biglist2:
    for transaction in biglist1:
        if item['transaction'] == transaction['transaction']:
            list_transactionnamematches.append(transaction)

list_datematches = []
for item in biglist2:
    for transaction in list_transactionnamematches:
        if item['date'] == transaction['date']:
            list_datematches.append(transaction)

... and so on, filtering on each key except id, until I get a final list of matches. Since the lists can be really big (around 3000+ items each), this takes quite some time for Python to loop through.

I'm guessing this isn't really how this kind of comparison should be done. Any ideas?

+4  A: 

What you want to do is use the correct data structures:

  1. Build a dictionary that maps, for each record in the first list, a tuple of its non-id values to its id.

  2. Create two sets of those tuples, one per list. Then use set operations to get the tuple set you want.

  3. Use the dictionary from step 1 to assign ids to those tuples, as sketched below.
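
A minimal sketch of those three steps, assuming every record has the same fields and that transaction and date are the only non-id values (the helper key_of and the variable names are mine, for illustration):

def key_of(record):
    # Tuple of every value except id; tuples are hashable, so they can
    # serve as dict keys and set members.
    return (record['transaction'], record['date'])

# Step 1: map each tuple of non-id values to its id.
id_by_key = dict((key_of(r), r['id']) for r in biglist1)

# Step 2: intersect the two tuple sets to find records present in both lists.
common = set(key_of(r) for r in biglist2) & set(id_by_key)

# Step 3: use the step 1 mapping to assign ids.
for r in biglist2:
    if key_of(r) in common:
        r['id'] = id_by_key[key_of(r)]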

J S
+1: use tuples as keys to a mapping.
S.Lott
I've written a code example that does this, but I think the set operations in step 2 are unnecessary, since you can cheaply check whether your target tuple is in your step 1 dict key list.
recursive
+1  A: 

In O(m*n)...

list_transactionnamematches = []
for item in biglist2:
    for transaction in biglist1:
        if (item['transaction'] == transaction['transaction'] and
                item['date'] == transaction['date'] and
                item['foo'] == transaction['foo']):
            list_transactionnamematches.append(transaction)
Triptych
This will loop through biglist1 a total of len(biglist2) times.
Sparr
Right you are. Changed the intro text.
Triptych
Hooray for community interaction in action.
Sparr
Haha. And thanks. That's what I get for SOing at work.
Triptych
+1  A: 

Forgive my rusty Python syntax, it's been a while, so consider this partially pseudocode:

import operator

# Sort both lists by (date, transaction) so they can be walked in lockstep.
biglist1.sort(key=operator.itemgetter('date', 'transaction'))
biglist2.sort(key=operator.itemgetter('date', 'transaction'))

biglist3 = []
i1 = 0
i2 = 0
while i1 < len(biglist1) and i2 < len(biglist2):
    key1 = (biglist1[i1]['date'], biglist1[i1]['transaction'])
    key2 = (biglist2[i2]['date'], biglist2[i2]['transaction'])
    if key1 == key2:
        biglist3.append(biglist1[i1])
        i1 += 1
        i2 += 1
    elif key1 < key2:
        i1 += 1
    elif key1 > key2:
        i2 += 1
    else:
        print "this won't happen if I did the tuple comparison correctly"

This sorts both lists into the same order, by (date, transaction). Then it walks through them side by side, advancing whichever index lags and collecting adjacent matches. It assumes that (date, transaction) is unique, and that I am not completely off my rocker with regard to tuple sorting and comparison.

Sparr
+10  A: 

Index on the fields you want to use for lookup. This runs in O(n+m):

matches = []
biglist1_indexed = {}

for item in biglist1:
    biglist1_indexed[(item["transaction"], item["date"])] = item

for item in biglist2:
    if (item["transaction"], item["date"]) in biglist1_indexed:
        matches.append(item)

This is probably thousands of times faster than what you're doing now.

recursive
"if a in b:" is a search operation, which isn't constant time. In effect, this is still O(m*n) assuming a tuple search is linear.
codelogic
That's a bad assumption, because it's not. It's a hashtable lookup.
recursive
Cool, didn't know that +1 :)
codelogic
More info: Python's dictionary implementation reduces the average complexity of dictionary lookups to O(1) http://wiki.python.org/moin/DictionaryKeys
recursive
Ok, but in the statement "if a in b.keys()" it's doing a list search, not a dictionary lookup, correct?
codelogic
Ah never mind, you edited it.
codelogic
Well, in python 3, it would be a hash lookup, but I wanted to make sure it was backward compatible too :)
recursive
Info about changes to dict.keys() in python 3 and others: http://www.python.org/dev/peps/pep-3106/
recursive
+1: dictionary keys based on tuples.
S.Lott
Awesome, thanks for the reference.
codelogic
Very interesting.
ayaz
Yes, seriously, this *was* thousands of times faster than what I did. Thanks a lot. Could someone also explain why this is so much faster? I'm a newbie, so I really didn't understand everything.
Tuomas
This might be made easier to understand (for the newbies) with biglist1_indexed.has_key(tuple). Just a thought.
Triptych
@Tuomas: The key is that checking for the existence of a dict key is much faster than a brute-force iteration over all the items. With n items, your original version has to loop n times, but theoretically a dict key lookup takes the same time regardless of n.
recursive
A: 

The approach I would probably take to this is to make a very, very lightweight class with one instance variable and one method. The instance variable is a pointer to a dictionary; the method overrides the built-in special method __hash__(self), returning a value calculated from all the values in the dictionary except id.

From there the solution seems fairly obvious: create two initially empty dictionaries, N and M (for no-matches and matches). Loop over each list exactly once, and for each dictionary representing a transaction (call it a Tx_dict), create an instance of the new class (a Tx_ptr). Then test for an item matching this Tx_ptr in N and M: if there is no matching item in N, insert the current Tx_ptr into N; if there is a matching item in N but not in M, insert the current Tx_ptr into M, with the Tx_ptr itself as the key and a list containing the Tx_ptr as the value; if there is a matching item in both N and M, append the current Tx_ptr to the list stored under that key in M.

After you've gone through every item once, your dictionary M will contain pointers to all the transactions which match other transactions, all neatly grouped together into lists for you.

Edit: Oops! Obviously, the correct action if there is a matching Tx_ptr in N but not in M is to insert a key-value pair into M with the current Tx_ptr as the key and as the value, a list of the current Tx_ptr and the Tx_ptr that was already in N.
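
Here is a rough sketch of that idea (my own illustration, not code from the answer; note that for the dictionary membership tests to work, the class also needs __eq__ defined consistently with __hash__, which the answer doesn't mention):

class TxPtr(object):
    # Lightweight wrapper holding a pointer to one transaction dict.
    def __init__(self, tx_dict):
        self.tx = tx_dict

    def _key(self):
        # Stable tuple of (key, value) pairs, excluding id.
        return tuple(sorted((k, v) for k, v in self.tx.items() if k != 'id'))

    def __hash__(self):
        # Hash calculated from all the values in the dictionary except id.
        return hash(self._key())

    def __eq__(self, other):
        # Required alongside __hash__ so dict lookups compare by content.
        return self._key() == other._key()

N = {}  # no-matches: the first Tx_ptr seen for each key
M = {}  # matches: key -> list of all Tx_ptrs sharing that key

for tx in biglist1 + biglist2:
    ptr = TxPtr(tx)
    if ptr not in N:
        N[ptr] = ptr
    elif ptr not in M:
        # First repeat: group it with the Tx_ptr already in N (per the edit above).
        M[ptr] = [N[ptr], ptr]
    else:
        M[ptr].append(ptr)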

A: 

Have a look at Psyco. It's a Python compiler that can create very fast, optimized machine code from your source.

http://sourceforge.net/projects/psyco/

While this isn't a direct solution to your code's efficiency issues, it could still help speed things up without needing to write any new code. That said, I'd still highly recommend optimizing your code as much as possible AND using Psyco to squeeze as much speed out of it as possible.

Part of their guide specifically talks about using it to speed up list, string, and numeric computation heavy functions.

http://psyco.sourceforge.net/psycoguide/node8.html
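
For reference, typical usage is just a couple of lines at the top of your program (this assumes Psyco is installed; note it only supports 32-bit x86 builds of Python 2):

import psyco
psyco.full()  # JIT-compile every function as it gets executed

# ... the rest of the program, e.g. the matching loops, runs unchanged ...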

Soviut