Given the following list that contains some duplicate and some unique dictionaries, what is the best method to remove unique dictionaries first, then reduce the duplicate dictionaries to single instances? I gotta say I only recently started getting into Python but it's making this project so much easier. I'm just a bit stumped on this kind of problem.

So my list looks like this:

[{  'file': u'/file.txt',
    'line': u'line 666',
    'rule': u'A DUPLICATE RULE'},

 {  'file': u'/file.txt',
    'line': u'line 666',
    'rule': u'A DUPLICATE RULE'},

 {  'file': u'/uniquefile.txt',
    'line': u'line 999',
    'rule': u'A UNIQUE RULE'}]

What I'm going for is in the end, the list should look like:

[{  'file': u'/file.txt',
    'line': u'line 666',
    'rule': u'A DUPLICATE RULE'}]
+1  A: 

I'd make another dictionary, using the existing dictionaries as keys and the count of occurrences as values. (Python doesn't allow dictionaries to be used as dictionary keys out of the box, but there are a couple of ways of doing that mentioned in this answer.) Then it's just a matter of iterating over it and selecting the keys where the value is greater than 1.

Of course, using dictionaries as keys relies on their contents not changing over time - at least over the time that you need to use the resulting dictionary. (This is why Python doesn't support it natively.)
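A minimal sketch of that idea (not part of the original answer), using a frozenset of each dict's items as a hashable stand-in for the dict itself, and assuming inputdata is the list from the question:

# Sketch only: dicts aren't hashable, so use frozenset(d.items()) as a surrogate key
# and count how often each one occurs, remembering one original dict per key.
counts = {}
originals = {}
for d in inputdata:
    key = frozenset(d.items())
    counts[key] = counts.get(key, 0) + 1
    originals[key] = d

duplicates = [originals[key] for key, n in counts.items() if n > 1]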

Evgeny
+3  A: 

One idea is to sort the data. Assume inputdata is your list from above:

from itertools import groupby
from operator import itemgetter

inputdata.sort(key=itemgetter(*inputdata[0])) # sort so equal dicts end up adjacent, which groupby needs
print [k for k, g in groupby(inputdata) if len(list(g)) > 1]

prints:

[{'line': u'line 666', 'file': u'/file.txt', 'rule': u'A DUPLICATE RULE'}]
nosklo
+1 I should have read this *before* writing the exact same code.
THC4k
+1 - didn't know about groupby when I posted my solution.
Evgeny
+1  A: 

Another way is to build a counter for each dict, keyed by a frozenset of its items:

from collections import defaultdict

counter = defaultdict(int)
for d in inputdata:
    counter[frozenset(d.iteritems())] += 1

result = [dict(item) for item, count in counter.iteritems() if count > 1]
print result

I think that is the best answer so far, because it is very simple to understand and runs in linear time.

nosklo
+1  A: 
>>> import itertools
>>> list(a[0] for a in itertools.groupby(sorted(data)) if len(list(a[1])) > 1)
[{'file': u'/file.txt', 'line': u'line 666', 'rule': u'A DUPLICATE RULE'}]

There's probably a more optimal way to check this than len(list(a[1])).
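For instance, a lazier check (a sketch, not part of the original answer) is to pull at most two items from each group with itertools.islice, so the whole group is never materialized:

import itertools

# A group that yields at least two items is a duplicate; islice stops after two.
dupes = [key for key, group in itertools.groupby(sorted(data))
         if len(list(itertools.islice(group, 2))) > 1]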

Edit: I added a call to sorted.

Steven Huwig
I don't know. I just tried this verbatim on my test data set and just got back an empty list.
Geuis
It is exactly what I posted on my first sort solution, except it won't work if the list is not sorted.
nosklo
So, sort the list! You can take your choice: sort the list in place, once, or use `sorted()` around the list in the call to groupby.
steveha
A: 

Another option is to create your own data structure instead of using a dict. If you do this, you can override __cmp__, __eq__ and __hash__, which lets you use the 'set' data type in all its glory.

Here's one possible implementation, though I make no promises about the quality of the hash routine I've provided:

class Thing(object):
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    def __cmp__(self, other):
        result = cmp(self.file, other.file)
        if result == 0:
            result = cmp(self.line, other.line)
        if result == 0:
            result = cmp(self.rule, other.rule)
        return result

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __hash__(self):
        return hash(self.file) * hash(self.line) * hash(self.rule)

    def __str__(self):
        return ', '.join([self.file, self.line, self.rule])

things = [ Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
  Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
  Thing(u'/uniquefile.txt', u'line 999', u'A UNIQUE RULE')]

duplicate_things = set()
unique_things = set()
for t in things:
    if t in unique_things:
        duplicate_things.add(t)
    else:
        unique_things.add(t)

If you need to get back to a list, just construct one from the resulting set:

unique_things = list(unique_things)
duplicate_things = list(duplicate_things)

It's a bit more code to create your own class like this, but may give you other options down the road if your program grows in complexity.

Edit

OK, my hands are faster than my eyes tonight, but I think this edit solves the problem pointed out by @nosklo

Joe Holloway
unfortunately doing all that won't get what was asked - a list of the *duplicate* items.
nosklo
Yeah, thanks I missed that. It can be solved with another set and one additional hash lookup.
Joe Holloway
A: 

I'm pretty sure this works...

# x is the input list from the question
dupe = [i for n, i in enumerate(x) if x.index(i) == n and x.count(i) > 1]
# dupe == [{'line': u'line 666', 'rule': u'A DUPLICATE RULE', 'file': u'/file.txt'}]
chris
yes, it works - but it calls x.count() *and* x.index() for each element of x, leading to quadratic O(n**2) behaviour. It will suck really fast for more than a few items.
nosklo
A: 

This answer is based on Steven Huwig's answer. It's similar to his, but I use sorted() on the list so that groupby() works correctly.

Also, since he said "There's probably a more optimal way to check this than len(list(a[1])).", I decided to use a different way to check for non-unique items. Instead of forcing the whole group into a list, I call the .next() method on the group iterator, twice. If both calls succeed, there are at least two items in the iterator, and we are done with it; if we get a StopIteration exception on the first or second call to .next(), there were zero or one items in the iterator. (Actually, since we got this iterator from itertools.groupby, we know it will have at least one item in it.)

Also, instead of using explicit tuple indexing like a[0] and a[1], I used tuple unpacking, since that's what the cool kids seem to be doing these days.

Finally, instead of using a generator expression to compute the list, and using list() to force it to expand out into a list, I simply used a list comprehension.

data = [
    {'file': u'/file.txt',
     'line': u'line 666',
     'rule': u'A DUPLICATE RULE'},

    {'file': u'/uniquefile.txt',
     'line': u'line 999',
     'rule': u'A UNIQUE RULE'},

    {'file': u'/file.txt',
     'line': u'line 666',
     'rule': u'A DUPLICATE RULE'},
]

from itertools import groupby

def notunique(itr):
    try:
        itr.next()
        itr.next()
        return True
    except StopIteration:
        return False

def unique_list(lst):
    return [key for key, itr in groupby(sorted(lst)) if notunique(itr)]

print(unique_list(data))
steveha
+1  A: 

I always prefer to work with objects instead of dicts, if the fields are the same for every item.

So, I define a class:

class rule(object):
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    # Not a "magic" method, just a helper for all the methods below :)
    def _tuple_(self):
        return (self.file, self.line, self.rule)

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __cmp__(self, other):
        return cmp(self._tuple_(), rule._tuple_(other))

    def __hash__(self):
        return hash(self._tuple_())

    def __repr__(self):
        return repr(self._tuple_())

Now, create a list of these objects, and sort it. ruledict_list can be the example data in your question.

rules = [rule(**r) for r in ruledict_list]
rules.sort()

Loop through the (sorted) list, removing unique objects as we go. Finally, create a set, to remove duplicates. The loop will also remove one of each duplicate object, but that doesn't really matter.

pos = 0
while pos < len(rules):
    while pos < len(rules) - 1 and rules[pos] == rules[pos+1]:
        print "Skipping rule %s" % rules[pos]
        pos += 1
    rules.pop(pos)
rule_set = set(rules)
gnud
FWIW, this is similar to my answer, except mine doesn't require the list to be pre-sorted.
Joe Holloway