views:

256

answers:

8

I am trying to find corresponding keys in two different dictionaries. Each has about 600k entries.

Say for example:

    myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' }
    myNames = { 'Actinobacter': '8924342' }

I want to print out the value for Actinobacter (8924342) since it matches a value in myRDP.

The following code works, but is very slow:

    for key in myRDP:
        for jey in myNames:
            if key == jey:
                print key, myNames[key]

I've tried the following but it always results in a KeyError:

    for key in myRDP:
        print myNames[key]

Is there perhaps a function implemented in C for doing this? I've googled around but nothing seems to work.

Thanks.

+15  A: 

You could do this:

for key in myRDP:
    if key in myNames:
        print key, myNames[key]

Your first attempt was slow because you were comparing every key in myRDP with every key in myNames. In algorithmic jargon, if myRDP has n elements and myNames has m elements, then that algorithm would take O(n×m) operations. For 600k elements each, this is 360,000,000,000 comparisons!

But testing whether a particular element is a key of a dictionary is fast -- in fact, this is one of the defining characteristics of dictionaries. In algorithmic terms, the key in dict test is O(1), or constant-time. So my algorithm will take O(n) time, which is one 600,000th of the time.

John Fouhy
By the way, if that "O(n×m)" stuff is confusing you, this question might help: http://stackoverflow.com/questions/107165/big-o-for-eight-year-olds
John Fouhy
I'm not sure what your code did. I think it was printing out every key from myRDP and myNames, but kudos for the explanation
Austin
Inspect the second line carefully; it's not `for key in myNames`, it's `if key in myNames`.
Andrew Keeton
Also, +1 for explaining _why_ this method is faster.
Andrew Keeton
whoops! Your real code was very fast too. It took about a second. My original code would have taken ~166 hours!
Austin
Yep this is the right answer. Audism was doing a linear search on a dictionary. I don't get why people are suggesting using sets and intersections below...
Unknown
+3  A: 
for key in myRDP:
    name = myNames.get(key, None)
    if name:
        print key, name

dict.get returns the default value you give it (in this case, None) if the key doesn't exist.

Andrew Keeton
This one took about 1 second :)
Austin
+2  A: 

Use the get method instead:

 for key in myRDP:
    value = myNames.get(key)
    if value != None:
      print key, "=", value
JG
A: 
Alex
There isn't a 1:1 relation. There should be slightly more values in myNames. Would this still work?
Austin
Yes this would still work. You just instantiate a new array and copy these values horizontally together on match. See example above.
Alex
+8  A: 

Use sets, because they have a built-in intersection method which ought to be quick:

myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' }
myNames = { 'Actinobacter': '8924342' }

rdpSet = set(myRDP)
namesSet = set(myNames)

for name in rdpSet.intersection(namesSet):
    print name, myNames[name]

# Prints: Actinobacter 8924342
RichieHindle
Of course, you have the time hit of creating the sets in the first place. I wonder if `[x for x in myRDP if x in myNames]` is quicker or slower than creating the two sets and taking the intersection?
John Fouhy
Without @Audism's data set we can't be sure, but my money would be on the sets being faster, even though they'll take some time to create.
RichieHindle
This one also took about 1 second. I didn't really notice any difference from creating the sets.
Austin
Yes, uses sets.
hughdbrown
I did some tests using fake data: keys were strings generated as `repr(math.log(i))[-8:]` for i from 1..600000 (first dict) and for i from 300000..900000 (second dict). Dict values were None. Method using sets took about .4s, versus .3s for the list comprehension in my earlier comment.
John Fouhy
@John: OK, interesting. Python never ceases to impress me!
RichieHindle
+3  A: 

You could start by finding the common keys and then iterating over them. Set operations should be fast because they are implemented in C, at least in modern versions of Python.

common_keys = set(myRDP).intersection(myNames)
for key in common_keys:
    print key, myNames[key]
Roberto Bonvallet
A: 

If the entries are sorted, you can try a binary search algorithm. Its the fastest search method you could do here.

Havenard
You don't use binary search to search a hash table!
John Fouhy
I don't see any hash table.
Havenard
Hash table is the underlying implementation of a dictionary, in most cases. That's what gives it constant-time lookup.
hughdbrown
A: 

Here is my code for doing intersections, unions, differences, and other set operations on dictionaries:

class DictDiffer(object):
    """
    Calculate the difference between two dictionaries as:
    (1) items added
    (2) items removed
    (3) keys same in both but changed values
    (4) keys same in both and unchanged values
    """
    def __init__(self, current_dict, past_dict):
     self.current_dict, self.past_dict = current_dict, past_dict
     self.set_current, self.set_past = set(current_dict.keys()), set(past_dict.keys())
     self.intersect = self.set_current.intersection(self.set_past)
    def added(self):
     return self.set_current - self.intersect 
    def removed(self):
     return self.set_past - self.intersect 
    def changed(self):
     return set(o for o in self.intersect if self.past_dict[o] != self.current_dict[o])
    def unchanged(self):
     return set(o for o in self.intersect if self.past_dict[o] == self.current_dict[o])

if __name__ == '__main__':
    import unittest
    class TestDictDifferNoChanged(unittest.TestCase):
     def setUp(self):
      self.past = dict((k, 2*k) for k in range(5))
      self.current = dict((k, 2*k) for k in range(3,8))
      self.d = DictDiffer(self.current, self.past)
     def testAdded(self):
      self.assertEqual(self.d.added(), set((5,6,7)))
     def testRemoved(self):  
      self.assertEqual(self.d.removed(), set((0,1,2)))
     def testChanged(self):
      self.assertEqual(self.d.changed(), set())
     def testUnchanged(self):
      self.assertEqual(self.d.unchanged(), set((3,4)))
    class TestDictDifferNoCUnchanged(unittest.TestCase):
     def setUp(self):
      self.past = dict((k, 2*k) for k in range(5))
      self.current = dict((k, 2*k+1) for k in range(3,8))
      self.d = DictDiffer(self.current, self.past)
     def testAdded(self):
      self.assertEqual(self.d.added(), set((5,6,7)))
     def testRemoved(self):  
      self.assertEqual(self.d.removed(), set((0,1,2)))
     def testChanged(self):
      self.assertEqual(self.d.changed(), set((3,4)))
     def testUnchanged(self):
      self.assertEqual(self.d.unchanged(), set())
    unittest.main()
hughdbrown