Given the following list that contains some duplicate and some unique dictionaries, what is the best method to remove unique dictionaries first, then reduce the duplicate dictionaries to single instances? I gotta say I only recently started getting into Python but it's making this project so much easier. I'm just a bit stumped on this kind of problem.

So my list looks like this:

[{  'file': u'/file.txt',
    'line': u'line 666',
    'rule': u'A DUPLICATE RULE'},

 {  'file': u'/file.txt',
    'line': u'line 666',
    'rule': u'A DUPLICATE RULE'},

 {  'file': u'/uniquefile.txt',
    'line': u'line 999',
    'rule': u'A UNIQUE RULE'}]

What I'm going for is in the end, the list should look like:

[{  'file': u'/file.txt',
    'line': u'line 666',
    'rule': u'A DUPLICATE RULE'}]
+1  A: 

I'd make another dictionary, using the existing dictionaries as keys and the count of occurrences as values. (Python doesn't allow dictionaries to be used as dictionary keys out of the box, but there are a couple of ways of doing that mentioned in this answer.) Then it's just a matter of iterating over it and selecting the keys where the value is greater than 1.

Of course, using dictionaries as keys relies on their contents not changing over time - at least over the time that you need to use the resulting dictionary. (This is why Python doesn't support it natively.)
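A minimal sketch of that idea (not part of the original answer), using a frozenset of each dict's items as a hashable stand-in for the dict itself, and assuming inputdata is the list from the question:

# Sketch only: dicts aren't hashable, so use frozenset(d.items()) as a surrogate key
# and count how often each one occurs, remembering one original dict per key.
counts = {}
originals = {}
for d in inputdata:
    key = frozenset(d.items())
    counts[key] = counts.get(key, 0) + 1
    originals[key] = d

duplicates = [originals[key] for key, n in counts.items() if n > 1]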

Evgeny
+3  A: 

One idea is to sort the data. Assume inputdata is your list from above:

from itertools import groupby
from operator import itemgetter

inputdata.sort(key=itemgetter(*inputdata[0])) # sort so equal dicts end up adjacent, which groupby needs
print [k for k, g in groupby(inputdata) if len(list(g)) > 1]

prints:

[{'line': u'line 666', 'file': u'/file.txt', 'rule': u'A DUPLICATE RULE'}]
nosklo
+1 I should have read this *before* writing the exact same code.
THC4k
+1 - didn't know about groupby when I posted my solution.
Evgeny
+1  A: 

Another way is to build a counter for each dict, keyed by a frozenset of its items:

from collections import defaultdict

counter = defaultdict(int)
for d in inputdata:
    counter[frozenset(d.iteritems())] += 1

result = [dict(item) for item, count in counter.iteritems() if count > 1]
print result

I think that is the best answer so far, because it is very simple to understand and runs in linear time.

nosklo
+1  A: 
>>> import itertools
>>> list(a[0] for a in itertools.groupby(sorted(data)) if len(list(a[1])) > 1)
[{'file': u'/file.txt', 'line': u'line 666', 'rule': u'A DUPLICATE RULE'}]

There's probably a more optimal way to check this than len(list(a[1])).
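For instance, a lazier check (a sketch, not part of the original answer) is to pull at most two items from each group with itertools.islice, so the whole group is never materialized:

import itertools

# A group that yields at least two items is a duplicate; islice stops after two.
dupes = [key for key, group in itertools.groupby(sorted(data))
         if len(list(itertools.islice(group, 2))) > 1]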

Edit: I added a call to sorted.

Steven Huwig
I don't know. I just tried this verbatim on my test data set and just got back an empty list.
Geuis
It is exactly what I posted on my first sort solution, except it won't work if the list is not sorted.
nosklo
So, sort the list! You can take your choice: sort the list in place, once, or use `sorted()` around the list in the call to groupby.
steveha
A: 

Another option is to create your own data structure instead of using a dict. If you do this, you can override __cmp__, __eq__ and __hash__, which lets you use the 'set' data type in all its glory.

Here's one possible implementation, though I make no promises about the quality of the hash routine I've provided:

class Thing(object):
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    def __cmp__(self, other):
        result = cmp(self.file, other.file)
        if result == 0:
            result = cmp(self.line, other.line)
        if result == 0:
            result = cmp(self.rule, other.rule)
        return result

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __hash__(self):
        return hash(self.file) * hash(self.line) * hash(self.rule)

    def __str__(self):
        return ', '.join([self.file, self.line, self.rule])

things = [ Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
  Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
  Thing(u'/uniquefile.txt', u'line 999', u'A UNIQUE RULE')]

duplicate_things = set()
unique_things = set()
for t in things:
    if t in unique_things:
        duplicate_things.add(t)
    else:
        unique_things.add(t)

If you need to get back to a list, just construct one from the resulting set:

unique_things = list(unique_things)
duplicate_things = list(duplicate_things)

It's a bit more code to create your own class like this, but may give you other options down the road if your program grows in complexity.

Edit

OK, my hands are faster than my eyes tonight, but I think this edit solves the problem pointed out by @nosklo

Joe Holloway
unfortunately doing all that won't get what was asked - a list of the *duplicate* items.
nosklo
Yeah, thanks I missed that. It can be solved with another set and one additional hash lookup.
Joe Holloway
A: 

I'm pretty sure this works...

# x is the input list from the question
dupe = [i for n, i in enumerate(x) if x.index(i) == n and x.count(i) > 1]
# dupe == [{'line': u'line 666', 'rule': u'A DUPLICATE RULE', 'file': u'/file.txt'}]
chris
yes, it works - but it calls x.count() *and* x.index() for each element of x, leading to quadratic O(n**2) behaviour. It will suck really fast for more than a few items.
nosklo
A: 

This answer is based on Steven Huwig's answer. It's similar to his, but I use sorted() on the list so that groupby() works correctly.

Also, since he said "There's probably a more optimal way to check this than len(list(a[1])).", I decided to use a different way to check for non-unique items. Instead of forcing the whole group into a list, I call the .next() method on the group iterator, twice. If both calls succeed, there are at least two items in the iterator, and we are done with it; if we get a StopIteration exception on the first or second call to .next(), there were zero or one items in the iterator. (Actually, since we got this iterator from itertools.groupby, we know it will have at least one item in it.)

Also, instead of using explicit tuple indexing like a[0] and a[1], I used tuple unpacking, since that's what the cool kids seem to be doing these days.

Finally, instead of using a generator expression to compute the list, and using list() to force it to expand out into a list, I simply used a list comprehension.

data = [
    {'file': u'/file.txt',
     'line': u'line 666',
     'rule': u'A DUPLICATE RULE'},

    {'file': u'/uniquefile.txt',
     'line': u'line 999',
     'rule': u'A UNIQUE RULE'},

    {'file': u'/file.txt',
     'line': u'line 666',
     'rule': u'A DUPLICATE RULE'},
]

from itertools import groupby

def notunique(itr):
    try:
        itr.next()
        itr.next()
        return True
    except StopIteration:
        return False

def unique_list(lst):
    return [key for key, itr in groupby(sorted(lst)) if notunique(itr)]

print(unique_list(data))
steveha
+1  A: 

I always prefer to work with objects instead of dicts, if the fields are the same for every item.

So, I define a class:

class rule(object):
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    # Not a "magic" method, just a helper for all the methods below :)
    def _tuple_(self):
        return (self.file, self.line, self.rule)

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __cmp__(self, other):
        return cmp(self._tuple_(), rule._tuple_(other))

    def __hash__(self):
        return hash(self._tuple_())

    def __repr__(self):
        return repr(self._tuple_())

Now, create a list of these objects, and sort it. ruledict_list can be the example data in your question.

rules = [rule(**r) for r in ruledict_list]
rules.sort()

Loop through the (sorted) list, removing unique objects as we go. Finally, create a set, to remove duplicates. The loop will also remove one of each duplicate object, but that doesn't really matter.

pos = 0
while pos < len(rules):
    while pos < len(rules) - 1 and rules[pos] == rules[pos+1]:
        print "Skipping rule %s" % rules[pos]
        pos += 1
    rules.pop(pos)
rule_set = set(rules)
gnud
FWIW, this is similar to my answer, except mine doesn't require the list to be pre-sorted.
Joe Holloway