I'm doing some indexing; memory is plentiful but CPU time isn't. I have one huge dictionary and a smaller dictionary that I'm merging into the bigger one:

big_dict = {"the" : {"1" : 1, "2" : 1, "3" : 1, "4" : 1, "5" : 1}}
smaller_dict = {"the" : {"6" : 1, "7" : 1}}
#after merging
resulting_dict = {"the" : {"1" : 1, "2" : 1, "3" : 1, "4" : 1, "5" : 1, "6" : 1, "7" : 1}}

My question is: for the values in both dicts, should I use a dict (as displayed above) or a list (as displayed below), given that my priority is to spend memory freely in order to get the most out of my CPU?

For clarification, using a list would look like:

big_dict = {"the" : [1, 2, 3, 4, 5]}
smaller_dict = {"the" : [6, 7]}
#after merging
resulting_dict = {"the" : [1, 2, 3, 4, 5, 6, 7]}

Side note: the reason I'm using a dict nested in a dict rather than a set nested in a dict is that json.dumps chokes on a set: a set isn't key/value pairs, it's (as far as the json library is concerned) just {"a", "series", "of", "keys"}.
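
To illustrate, here's roughly the failure I'm working around (just a sketch; the exact error message varies by Python version):

import json

json.dumps({"the" : {"6" : 1, "7" : 1}})  # fine: '{"the": {"6": 1, "7": 1}}'
json.dumps({"the" : {"6", "7"}})          # raises TypeError: not JSON serializable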

Also, once I've chosen between a dict and a list, how would I go about implementing the most CPU-efficient way of merging the two?

I appreciate the help.

+2  A: 

Hmmm. I would first go for the dict-of-dicts approach, as Python has one of the most finely tuned dict implementations around, so I highly doubt you can do any better with a dict-of-lists.

As for merging the dicts, this should be enough:

for key, value in smaller_dict.iteritems():  # .items() on Python 3
    try:
        # term already in the big index: merge the postings in
        big_dict[key].update(value)
    except KeyError:
        # new term: store a copy of the smaller dict's postings
        big_dict[key] = dict(value)

I would probably also experiment with subclassing json.JSONEncoder to be able to serialize set types:

class SetEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, set):
            # serialize a set as a dict whose keys are the set's members
            return dict.fromkeys(obj)
        return json.JSONEncoder.default(self, obj)
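
If you go that route, you would pass the encoder to json.dumps via its cls argument; something like this sketch (the index layout is just the one from your question, with sets as the inner values, and the values come out as null because dict.fromkeys fills in None):

# assumes `import json` and the SetEncoder class above
index = {"the" : set(["1", "2", "3"])}
json.dumps(index, cls=SetEncoder)
# e.g. '{"the": {"1": null, "2": null, "3": null}}'  (key order may differ)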

This latter method might add some overhead on the serialization side, however, and you will also need to convert these dicts back to sets upon deserialization, either by subclassing json.JSONDecoder or doing it yourself in an extra step.
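
The extra-step variant is trivial, though; a sketch (serialized_text is just a stand-in name for the JSON string, and it assumes every inner dict should become a set of its keys):

index = json.loads(serialized_text)
for term, inner in index.items():
    index[term] = set(inner)   # set() of a dict keeps just its keys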

Tamás
Tamas - sorry, I crossed over with your post while writing. I'd normally avoid posting over an already good answer!
Ian
No problem, I guess we posted our solutions at pretty much the same time -- and it's always good if someone reinforces my answer :)
Tamás
+2  A: 

It really depends on what you want to do with the values in your inner lists/dictionaries. If, when you add a new entry, you want the inner collection to contain only unique values, then the list implementation will be much slower for large lists: checking whether a value is already present scales roughly as O(n) for a list, versus O(1) [average case] for a dictionary.
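
A rough way to see that difference (just a sketch; absolute numbers will vary with your machine and data sizes):

import timeit

setup = "data = list(range(100000)); d = dict.fromkeys(data)"
print(timeit.timeit("99999 in data", setup=setup, number=1000))  # linear scan of the list
print(timeit.timeit("99999 in d", setup=setup, number=1000))     # hash lookup in the dict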

If you don't care about duplicates in those inner lists, then it's a closer call.

I would use dictionaries, as you have. Python's dictionaries are very efficient (speaking as someone who's tried to implement dictionary data structures in C for real-time applications).

As for not using sets: since memory isn't an issue (as you say), it would be better to tweak the serialization and keep the speed-critical part of your code as simple as possible. After deserialization, just go through and convert the lists to sets:

big_dict = {"the" : [1, 2, 3, 4, 5]} # imagine we got this from JSON

for key, value in big_dict.items():
    big_dict[key] = set(value)

That should do it. Unless you are serializing/deserializing the whole index all the time, this added pre-processing cost should be amortized over enough requests not to matter.

Alternatively you can register encoders and decoders with JSON so you can do this conversion automatically. I usually don't bother when the issue is something so small and contained, however.
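
If you did want to go that way, with the standard json module the "registration" is really just passing hooks per call. A sketch along these lines (it stores each inner set as a JSON list, and decode_lists assumes every list value in the index should come back as a set; both function names are made up here):

import json

def encode_sets(obj):
    if isinstance(obj, set):
        return sorted(obj)            # store each inner set as a JSON list
    raise TypeError("%r is not JSON serializable" % (obj,))

def decode_lists(obj):
    # turn every list value back into a set
    return dict((k, set(v) if isinstance(v, list) else v) for k, v in obj.items())

text = json.dumps({"the" : set([1, 2, 3])}, default=encode_sets)
restored = json.loads(text, object_hook=decode_lists)   # {'the': set([1, 2, 3])}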

So in your dictionary based approach you could do:

for key, value in smaller_dict.items():
    if key in big_dict: 
        big_dict[key].update(value)
    else:
        big_dict[key] = value

If you want big_dict to store its own copy of the inner dictionary rather than a reference to the one in smaller_dict, use dict(value) instead of value on the last line. You could also use try: and except KeyError in that loop, but if...else is a fraction faster (on my machine, YMMV).

Ian
Dictionaries are average-case O(1), not O(log n).
Daniel Stutzbach
Yes, Daniel, you're right. I'll edit. Depending on the implementation they have a worst case of O(log n) or O(n).
Ian
+1  A: 

Any hash-based container will be better than a list for this kind of thing.

I'd still use a set instead of a dict; if you're having trouble with json.dumps you can get around it by converting the set to a dict when you go to serialize: dict.fromkeys(the_set, 1). And for pulling them back out: set(the_dict.keys()).
It's easier than mucking about with registering JSON providers.

And as for merging: merged_set = bigger_set.union(smaller_set)
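
If you're merging the whole smaller index into the bigger one in place, that per-term union would look something like this (a sketch; bigger_index and smaller_index are just stand-in names for your two dict-of-sets structures):

for term, postings in smaller_index.items():
    if term in bigger_index:
        bigger_index[term] |= postings        # in-place union, no new set created
    else:
        bigger_index[term] = set(postings)    # copy so the two indexes don't share members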

tzaman
I'm worried that dict(((item, 1) for item in the_set)) takes cycles that would be unnecessary given my current implementation.
tipu
Wait, just realized `dict` has a method for that already - see my updated answer. `fromkeys` should be pretty fast; worrying about cycles beyond that seems premature to me. Also, `set.union` should be faster than `dict.update`, so there's that.
tzaman
I'd suggest that, if you're using a conversion for serialization, you convert to and from lists rather than dicts, since at that point they will be more efficient in memory, time, and JSON storage. It's only when you manipulate them that lists become a bad idea. So go set -> list -> serialize, and deserialize -> list -> set before working with the data structure. @tzaman - yes, sets will be a little faster than dictionaries for `union`/`update`, and for various other manipulations. The big-O scale is the same, but they need fewer writes so they should be a little faster.
Ian