I'm doing some indexing; memory is plentiful but CPU time isn't. I have one huge dictionary and a smaller dictionary that I'm merging into the bigger one:

big_dict = {"the" : {"1" : 1, "2" : 1, "3" : 1, "4" : 1, "5" : 1}}
smaller_dict = {"the" : {"6" : 1, "7" : 1}}
#after merging
resulting_dict = {"the" : {"1" : 1, "2" : 1, "3" : 1, "4" : 1, "5" : 1, "6" : 1, "7" : 1}}

My question is: for the values in both dicts, should I use a dict (as displayed above) or a list (as displayed below), given that my priority is to spend memory freely in order to get the most out of my CPU?

For clarification, using a list would look like:

big_dict = {"the" : [1, 2, 3, 4, 5]}
smaller_dict = {"the" : [6, 7]}
#after merging
resulting_dict = {"the" : [1, 2, 3, 4, 5, 6, 7]}

Side note: the reason I'm using a dict nested in a dict rather than a set nested in a dict is that json.dumps chokes on a set: a set isn't key/value pairs, it's (as far as the json library is concerned) just {"a", "series", "of", "keys"}.
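
To illustrate, here's roughly the failure I'm working around (just a sketch; the exact error message varies by Python version):

import json

json.dumps({"the" : {"6" : 1, "7" : 1}})  # fine: '{"the": {"6": 1, "7": 1}}'
json.dumps({"the" : {"6", "7"}})          # raises TypeError: not JSON serializable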

Also, once I've chosen between a dict and a list, how would I go about implementing the most CPU-efficient way of merging the two?

I appreciate the help.

+2  A: 

Hmmm. I would first go for the dict-of-dicts approach, as Python has one of the most finely tuned dict implementations around, so I highly doubt you can do any better with a dict-of-lists.

As for merging the dicts, this should be enough:

for key, value in smaller_dict.iteritems():  # .items() on Python 3
    try:
        # term already in the big index: merge the postings in
        big_dict[key].update(value)
    except KeyError:
        # new term: store a copy of the smaller dict's postings
        big_dict[key] = dict(value)

I would probably also experiment with subclassing json.JSONEncoder to be able to serialize set types:

class SetEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, set):
            # serialize a set as a dict whose keys are the set's members
            return dict.fromkeys(obj)
        return json.JSONEncoder.default(self, obj)
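
If you go that route, you would pass the encoder to json.dumps via its cls argument; something like this sketch (the index layout is just the one from your question, with sets as the inner values, and the values come out as null because dict.fromkeys fills in None):

# assumes `import json` and the SetEncoder class above
index = {"the" : set(["1", "2", "3"])}
json.dumps(index, cls=SetEncoder)
# e.g. '{"the": {"1": null, "2": null, "3": null}}'  (key order may differ)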

This latter method might add some overhead on the serialization side, however, and you will also need to convert these dicts back to sets upon deserialization, either by subclassing json.JSONDecoder or doing it yourself in an extra step.
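
The extra-step variant is trivial, though; a sketch (serialized_text is just a stand-in name for the JSON string, and it assumes every inner dict should become a set of its keys):

index = json.loads(serialized_text)
for term, inner in index.items():
    index[term] = set(inner)   # set() of a dict keeps just its keys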

Tamás
Tamas - sorry, I crossed over with your post while writing. I'd normally avoid posting over an already good answer!
Ian
No problem, I guess we posted our solutions at pretty much the same time -- and it's always good if someone reinforces my answer :)
Tamás
+2  A: 

It really depends on what you want to do with the values in your inner lists/dictionaries. If, when you add a new entry, you want the inner collection to contain only unique values, then the list implementation will be much slower for large lists: checking whether a value is already present scales roughly as O(n) for a list, versus O(1) [average case] for a dictionary.
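
A rough way to see that difference (just a sketch; absolute numbers will vary with your machine and data sizes):

import timeit

setup = "data = list(range(100000)); d = dict.fromkeys(data)"
print(timeit.timeit("99999 in data", setup=setup, number=1000))  # linear scan of the list
print(timeit.timeit("99999 in d", setup=setup, number=1000))     # hash lookup in the dict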

If you don't care about duplicates in those inner lists, then it's a closer call.

I would use dictionaries, as you have. Python's dictionaries are very efficient (speaking as someone who's tried to implement dictionary data structures in C for real-time applications).

As for not using sets: since memory isn't an issue (as you say), it would be better to tweak the serialization and keep the speed-critical part of your code as simple as possible. After deserialization, just go through and convert the lists to sets:

big_dict = {"the" : [1, 2, 3, 4, 5]} # imagine we got this from JSON

for key, value in big_dict.items():
    big_dict[key] = set(value)

That should do it. Unless you are serializing/deserializing the whole index all the time, this added pre-processing cost should be amortized over enough requests not to matter.

Alternatively you can register encoders and decoders with JSON so you can do this conversion automatically. I usually don't bother when the issue is something so small and contained, however.
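
If you did want to go that way, with the standard json module the "registration" is really just passing hooks per call. A sketch along these lines (it stores each inner set as a JSON list, and decode_lists assumes every list value in the index should come back as a set; both function names are made up here):

import json

def encode_sets(obj):
    if isinstance(obj, set):
        return sorted(obj)            # store each inner set as a JSON list
    raise TypeError("%r is not JSON serializable" % (obj,))

def decode_lists(obj):
    # turn every list value back into a set
    return dict((k, set(v) if isinstance(v, list) else v) for k, v in obj.items())

text = json.dumps({"the" : set([1, 2, 3])}, default=encode_sets)
restored = json.loads(text, object_hook=decode_lists)   # {'the': set([1, 2, 3])}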

So in your dictionary based approach you could do:

for key, value in smaller_dict.items():
    if key in big_dict: 
        big_dict[key].update(value)
    else:
        big_dict[key] = value

If you want big_dict to store its own copy of the inner dictionary rather than a reference to the one in smaller_dict, use dict(value) instead of value on the last line. You could also use try: and except KeyError in that loop, but if...else is a fraction faster (on my machine, YMMV).

Ian
Dictionaries are average-case O(1), not O(log n).
Daniel Stutzbach
Yes, Daniel, you're right. I'll edit. Depending on the implementation they have a worst case of O(log n) or O(n).
Ian
+1  A: 

Any hash-based container will be better than a list for this kind of thing.

I'd still use a set instead of a dict; if you're having trouble with json.dumps you can get around it by converting the set to a dict when you go to serialize: dict.fromkeys(the_set, 1). And for pulling them back out: set(the_dict.keys()).
It's easier than mucking about with registering JSON providers.

And as for merging: merged_set = bigger_set.union(smaller_set)
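
If you're merging the whole smaller index into the bigger one in place, that per-term union would look something like this (a sketch; bigger_index and smaller_index are just stand-in names for your two dict-of-sets structures):

for term, postings in smaller_index.items():
    if term in bigger_index:
        bigger_index[term] |= postings        # in-place union, no new set created
    else:
        bigger_index[term] = set(postings)    # copy so the two indexes don't share members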

tzaman
I'm worried that dict(((item, 1) for item in the_set)) takes cycles that would be unnecessary given my current implementation.
tipu
Wait, just realized `dict` has a method for that already - see my updated answer. `fromkeys` should be pretty fast; worrying about cycles beyond that seems premature to me. Also, `set.union` should be faster than `dict.update`, so there's that.
tzaman
I'd suggest that, if you're using a conversion for serialization, you convert to and from lists rather than dicts, since at that point they will be more efficient in memory, time, and JSON storage. It's only when you manipulate them that lists become a bad idea. So go set -> list -> serialize, and deserialize -> list -> set before working with the data structure. @tzaman - yes, sets will be a little faster than dictionaries for `union`/`update`, and for various other manipulations. The big-O scale is the same, but they need fewer writes so they should be a little faster.
Ian