I have two Python lists of dictionaries, entries9 and entries10. I want to compare the items and write joint items to a new list called joint_items. I also want to save the unmatched items to two new lists, unmatched_items_9 and unmatched_items_10.

This is my code. Getting joint_items and unmatched_items_9 (from the outer loop) is quite easy, but how do I get unmatched_items_10 (from the inner loop)?

joint_items = []
unmatched_items_9 = []
for entry1 in entries9:
    match_found = False
    for entry2 in entries10:
        if entry1[a]==entry2[a] and entry1[b]==entry2[b]: # the dictionaries only have some keys in common, but we care about a and b
            match_found = True
            joint_items.append(entry1)
            #entries10.remove(entry2) # Tried this originally, but realised it messes with the original list object!
            break # stop at the first match
    if not match_found:
        unmatched_items_9.append(entry1)

Performance is not really an issue, since it's a one-off script.

A: 

The Python stdlib has a class, difflib.SequenceMatcher, that looks like it can do what you want, though I don't know how to use it!

Ned Batchelder
+5  A: 

The equivalent of what you're currently doing, but the other way around, is:

unmatched_items_10 = [d for d in entries10 if d not in entries9]

While more concise than your way of coding it, this has the same performance problem: it will take time proportional to the product of the two lists' lengths. If the lengths you're interested in are about 9 or 10 (as those numbers seem to indicate), no problem.

But for lists of substantial length you can get much better performance by sorting the lists and "stepping through" them "in parallel" so to speak (time proportional to N log N where N is the length of the longer list). There are other possibilities, too (of growing complication;-) if even this more advanced approach is not sufficient to get you the performance you need. I'll refrain from suggesting very complicated stuff unless you indicate that you do require it to get good performance (in which case, please mention the typical lengths of each list and the typical contents of the dicts that are their items, since of course such "details" are the crucial consideration for picking algorithms that are a good compromise between speed and simplicity).
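As a rough illustration of the "sort, then step through in parallel" idea, here is a minimal sketch. It assumes the keys of interest are the literal strings "a" and "b", that their values can be ordered with <, and that each (a, b) pair appears at most once per list; the function names are made up for this example.

```python
def split_by_sorted_merge(entries9, entries10, key):
    """Return (joint, only_in_9, only_in_10) using a merge over sorted copies."""
    s9 = sorted(entries9, key=key)    # sorted copies; the inputs are untouched
    s10 = sorted(entries10, key=key)
    joint, only9, only10 = [], [], []
    i = j = 0
    while i < len(s9) and j < len(s10):
        k9, k10 = key(s9[i]), key(s10[j])
        if k9 < k10:
            # s9[i] sorts before everything left in s10, so it has no match
            only9.append(s9[i])
            i += 1
        elif k10 < k9:
            only10.append(s10[j])
            j += 1
        else:
            # same key on both sides: keep the entries9 dict, as the question does
            joint.append(s9[i])
            i += 1
            j += 1
    # whatever remains on either side is unmatched
    only9.extend(s9[i:])
    only10.extend(s10[j:])
    return joint, only9, only10


def ab_key(d):
    # Assumption: the fields to compare are stored under "a" and "b"
    return (d["a"], d["b"])
```

Calling split_by_sorted_merge(entries9, entries10, ab_key) returns the three lists in one pass after sorting. If a key pair can repeat within a list, the one-to-one pairing here differs from the nested-loop version, so that case would need extra handling.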

Edit: the OP edited his Q to show that what he cares about, for any two dicts d1 and d2, one from each of the two lists, is not whether d1 == d2 (which is what the in operator checks), but rather whether d1[a]==d2[a] and d1[b]==d2[b]. In this case the in operator cannot be used (well, not without some funky wrapping, but that's a complication that's best avoided when feasible;-), but the all builtin replaces it handily:

unmatched_items_10 = [d for d in entries10
                      if all(d[a]!=d2[a] or d[b]!=d2[b] for d2 in entries9)]

I have switched the logic around (to != and or, per De Morgan's laws) since we want the dicts that are not matched. However, if you prefer:

unmatched_items_10 = [d for d in entries10
                      if not any(d[a]==d2[a] and d[b]==d2[b] for d2 in entries9)]

Personally, I don't like if not any and if not all, for stylistic reasons, but the maths are impeccable (by what the Wikipedia page calls the Extensions to De Morgan's laws, since any is an existential quantifier and all a universal quantifier, so to speak;-). Performance should be just about equivalent (but then, the OP did clarify in a comment that performance is not very important for them on this task).
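To make the any/all approach concrete, here is a small self-contained sketch that builds all three lists. The sample data, the string keys "a" and "b", and the helper same_ab are assumptions made up for illustration; substitute whatever fields you actually compare.

```python
# Made-up sample data: the dicts share only the "a" and "b" keys
entries9 = [{"a": 1, "b": 2, "x": "p"}, {"a": 3, "b": 4}]
entries10 = [{"a": 1, "b": 2, "y": "q"}, {"a": 5, "b": 6}]


def same_ab(d1, d2):
    # Hypothetical helper: two dicts match when both fields agree
    return d1["a"] == d2["a"] and d1["b"] == d2["b"]


# Matched items, taken from entries9 as in the question's loop
joint_items = [d for d in entries9
               if any(same_ab(d, d2) for d2 in entries10)]

# Items in one list with no partner in the other
unmatched_items_9 = [d for d in entries9
                     if all(not same_ab(d, d2) for d2 in entries10)]
unmatched_items_10 = [d for d in entries10
                      if all(not same_ab(d, d2) for d2 in entries9)]
```

Neither input list is modified, which avoids the problem of removing items from entries10 while iterating over it.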

Alex Martelli
Thank you for this detailed answer. Performance isn't an issue - it's a one-off script to clean up some data and it doesn't matter how long it takes. Unfortunately though, 'd not in entries9' doesn't work, because the match condition is more complicated - I have to compare certain fields. It's more like "if d[a]==entries9_item[a] and d[b]==entries9_item[b]". I'll update the question to make this clearer.
AP257
@AP257, it _would_ have been nice of you to mention that in the first place, you know -- equality checks are obviously special cases, and that's what you were using;-). Anyway, editing my answer to show how the code changes.
Alex Martelli
Sorry. Thank you for this - very neat use of all() and any(). To get the joint_items list, do you think I should simply do "joint_items = [d for d in entries10 if all(d[a]==d2[a] or d[b]==d2[b] for d2 in entries9)]"? That seems repetitive, but probably safer than messing around with the original objects.
AP257
@AP257, safer, yes, though you want `any`, not `all`. Though _three_ loops are starting to stretch it, there's no good clean way to do it with a single loop, so the potential performance gains are small. If the `d[a]` and `d[b]` for every `d` are hashable, there are much faster ways of course (but you did say you don't care much about performance here, so I'd just do three loops).
Alex Martelli
great. thanks again.
AP257
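For readers who do care about speed: one possible reading of the "much faster ways" hinted at in the comments above, when the compared values are hashable, is to collect the (a, b) pairs of each list into a set first. This is only a sketch; the string keys "a" and "b" and the function name are assumptions.

```python
def split_by_key_sets(entries9, entries10):
    """Return (joint, only_in_9, only_in_10) using set lookups on (a, b) pairs."""
    # Building each set is linear; membership tests are O(1) on average
    keys9 = {(d["a"], d["b"]) for d in entries9}
    keys10 = {(d["a"], d["b"]) for d in entries10}

    joint = [d for d in entries9 if (d["a"], d["b"]) in keys10]
    only9 = [d for d in entries9 if (d["a"], d["b"]) not in keys10]
    only10 = [d for d in entries10 if (d["a"], d["b"]) not in keys9]
    return joint, only9, only10
```

The results keep the original order of each list and leave both inputs unmodified, at the cost of still making three passes over the data.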
A: 

You may consider using sets and their associated methods, like intersection. You will, however, need to turn your dictionaries into immutable data (e.g. strings) so that you can store them in a set. Would something like this work?

a = set(str(x) for x in entries9)
b = set(str(x) for x in entries10)  

# You'll have to change the above lines if you only care about _some_ of the keys

joint_items = a.intersection(b)
unmatched_items = a - b  # items only in entries9; use b - a for those only in entries10

# Now you can turn them back into dicts:
joint_items     = [eval(i) for i in joint_items]
unmatched_items = [eval(i) for i in unmatched_items]
scrible
I would use `dict.items` and `dict` rather than `str` and `eval`, if possible.
DiggyF
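A minimal sketch of that suggestion, assuming every value in the dicts is hashable and that whole-dict equality is the matching rule: freeze each dict with frozenset(d.items()) and rebuild it with dict afterwards. The function names are made up for this example.

```python
def freeze(d):
    # A frozenset of (key, value) pairs is hashable if every value is hashable
    return frozenset(d.items())


def compare_whole_dicts(entries9, entries10):
    """Return (joint, only_in_9, only_in_10) by whole-dict equality."""
    set9 = {freeze(d) for d in entries9}
    set10 = {freeze(d) for d in entries10}
    # dict() rebuilds a dictionary from the frozen (key, value) pairs
    joint = [dict(f) for f in set9 & set10]
    only9 = [dict(f) for f in set9 - set10]
    only10 = [dict(f) for f in set10 - set9]
    return joint, only9, only10
```

Unlike str and eval, this does not depend on key order or on evaluating strings as code. Note that sets drop duplicates and do not preserve the original list order.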