views: 139
answers: 5

I'm working with an app that is CPU-bound rather than memory-bound, and I'm trying to merge two collections, which could be either lists or dicts.

Now the thing is, I can choose either one, but I'm wondering whether merging dicts would be faster since it's all in memory. Or is it always going to be O(n), with n being the size of the smaller of the two?

The reason I asked about dicts rather than sets is that I can't convert a set to JSON: that results in {key1, key2, key3}, and JSON needs key/value pairs, so I am using a dict so that json.dumps returns {key1: 1, key2: 1, key3: 1}. Yes, this is wasteful, but if it proves to be faster then I'm okay with it.

Edit: My question is about the difference between using a dict and a list for merging; I originally and mistakenly listed dict and set.

dict1 = {"the" : {"1":1, "3":1, "10":1}

dict2 = {"the" : {"11":1, "13":1}}

after merging

dict3 = {"the" : {"1":1, "3":1, "10":1, "11":1, "13":1}

+1  A: 

You can use the timeit module to measure the speed of your code, but I'm going to guess that they'll be practically the same (since a set is probably implemented using a dictionary).

Michael Aaron Safyan
+1  A: 

Dicts and sets will be just as fast (and O(N), as you surmise). Lists, which you only mention in your question's title and never in its text, might be slower, depending on what you mean by "merging".

Given the downstream JSON requirements, dicts with values all set to 1 will be fastest overall -- not for the merging, but for the JSON serialization.
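
For illustration, a rough sketch of that point, assuming the {key: 1} layout from the question; a dict can go straight into json.dumps, while a set would need an extra conversion first:

import json

d1 = {"1": 1, "3": 1, "10": 1}
d2 = {"11": 1, "13": 1}

merged = dict(d1)        # copy, then one linear pass over the smaller dict
merged.update(d2)
json.dumps(merged)       # e.g. '{"1": 1, "3": 1, "10": 1, "11": 1, "13": 1}'

# with a set, an extra conversion step is needed before serializing:
s = set(d1) | set(d2)
json.dumps(list(s))      # produces a JSON array, not an object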

Alex Martelli
+1  A: 

If you are looking for duplicate elimination, sets are very, very fast.

>>> x = set(range(1000000,2000000))
>>> y = set(range(1900000,2900000))

The following ran in ~0.020 s:
>>> z = set.intersection(x,y)
>>> len(z)
100000
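
If the goal is merging (a union) rather than finding common elements, the same kind of one-liner applies; I'd expect comparable speed, but measure to be sure:

>>> z = x | y          # set union, i.e. the "merge" case from the question
>>> len(z)
1900000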

Regarding output to JSON, just convert to a list:

json.dumps(list(z))
gahooa
@gahooa, but what about the time to convert to a list?
Michael Aaron Safyan
A: 

I'd be more worried about correctness. If you have duplicate keys, a list will keep the duplicated keys and values, while a dictionary will only keep one of the values. Also, a list will keep the order consistent. Which do you prefer?

My gut reaction is that if you are searching for keys, the dictionary will be faster. But how will you deal with duplication?
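
As a made-up illustration of that difference:

pairs1 = [("the", 1), ("a", 1)]
pairs2 = [("the", 2)]

as_list = pairs1 + pairs2    # [("the", 1), ("a", 1), ("the", 2)] -- "the" appears twice
as_dict = dict(pairs1)
as_dict.update(pairs2)       # {"the": 2, "a": 1} -- only the later value for "the" survives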

wisty
A: 

As Michael said, it's probably easiest to use the timeit module and see for yourself. It's very easy to do:

import timeit
def test():
    # do your thing here
    # including conversion to json
    pass

result = timeit.repeat(test, repeat=10, number=10000)
print '{0:.2}s per 10000 test runs.'.format(min(result))
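
For example, a hypothetical filled-in version that compares the two approaches from the question (the sample data, sizes, and run counts are made up):

import json
import timeit

dict1 = {str(i): 1 for i in range(1000)}
dict2 = {str(i): 1 for i in range(900, 1900)}
list1 = list(dict1)
list2 = list(dict2)

def merge_dicts():
    merged = dict(dict1)      # copy, then one linear update
    merged.update(dict2)
    return json.dumps(merged)

def merge_lists():
    # deduplicate by going through a dict, then serialize as a JSON array
    merged = dict.fromkeys(list1 + list2, 1)
    return json.dumps(list(merged))

for fn in (merge_dicts, merge_lists):
    best = min(timeit.repeat(fn, repeat=5, number=1000))
    print('{0}: {1:.3f}s per 1000 runs'.format(fn.__name__, best))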

Hope that helps.

Tripzilch