tags:

views:

987

answers:

11

What is an efficient way to find the most common element in a Python list?

My list items may not be hashable so can't use a dictionary. Also in case of draws the item with the lowest index should be returned. Example:

>>> most_common(['duck', 'duck', 'goose'])
'duck'
>>> most_common(['goose', 'duck', 'duck', 'goose'])
'goose'
+7  A: 

If they are not hashable, you can sort them and do a single loop over the result counting the items (identical items will be next to each other). But it might be faster to make them hashable and use a dict.

def most_common(lst):
    cur_length = 0
    max_length = 0
    cur_i = 0
    max_i = 0
    cur_item = None
    max_item = None
    for i, item in sorted(enumerate(lst), key=lambda x: x[1]):
        if cur_item is None or cur_item != item:
            if cur_length > max_length or (cur_length == max_length and cur_i < max_i):
                max_length = cur_length
                max_i = cur_i
                max_item = cur_item
            cur_length = 1
            cur_i = i
            cur_item = item
        else:
            cur_length += 1
    if cur_length > max_length or (cur_length == max_length and cur_i < max_i):
        return cur_item
    return max_item
Lukáš Lalinský
+4  A: 

Sort a copy of the list and find the longest run. You can decorate the list before sorting it with the index of each element, and then choose the run that starts with the lowest index in the case of a tie.

Boojum
+1  A: 

This is the obvious slow solution (O(n^2)) if neither sorting nor hashing is feasible, but equality comparison (==) is available:

def most_common(items):
  if not items:
    raise ValueError
  fitems = [] 
  best_idx = 0
  for item in items:   
    item_missing = True
    i = 0
    for fitem in fitems:  
      if fitem[0] == item:
        fitem[1] += 1
        d = fitem[1] - fitems[best_idx][1]
        if d > 0 or (d == 0 and fitems[best_idx][2] > fitem[2]):
          best_idx = i
        item_missing = False
        break
      i += 1
    if item_missing:
      fitems.append([item, 1, i])
  return items[best_idx]

But making your items hashable or sortable (as recommended by other answers) would almost always make finding the most common element faster if the length of your list (n) is large. O(n) on average with hashing, and O(n*log(n)) at worst for sorting.

pts
A: 

Here:

def most_common(l):
    max = 0
    maxitem = None
    for x in set(l):
        count =  l.count(x)
        if count > max:
            max = count
            maxitem = x
    return maxitem

I have a vague feeling there is a method somewhere in the standard library that will give you the count of each element, but I can't find it.

Lennart Regebro
'max' is a method. Would you change the name of the variable?
TheMachineCharmer
Note that set() also requires hashable items, to the solution wouldn't work in this case.
Lukáš Lalinský
Wait, I missed that part of not being hashable. But if the objects have equality it should be easy to make them hashable.
Lennart Regebro
A: 
>>> li  = ['goose', 'duck', 'duck']

>>> def foo(li):
         st = set(li)
         mx = -1
         for each in st:
             temp = li.count(each):
             if mx < temp:
                 mx = temp 
                 h = each 
         return h

>>> foo(li)
'duck'
TheMachineCharmer
This has terrible performance characteristic when n is big and the number of unique elements is large as well: O(n) for the conversion to a set and O(m*n)=O(n^2) for the count (where m is the number of uniques). Sort and walk is O(n log n) for the sort and 0(n) for the walk.
jmucchiello
Yeah you are right. Now I know this is a terrible solution and why. Thanks for comment!! :-)
TheMachineCharmer
+1  A: 

A one-liner:

def most_common (lst):
    return max(((item, lst.count(item)) for item in set(lst)), key=lambda a: a[1])[0]
wbowers
+6  A: 

A simpler one-liner:

def most_common(lst):
    return max(set(lst), key=lst.count)
newacct
You win. Gorgeous.
wbowers
The OP stated that *[..] in case of draws the item with the lowest index should be returned.* This code does not, in general, meet that requirement.
Stephan202
Plus, the OP stated that the elements must be hashable: sets must contains hashable objects.
EOL
Plus, this approach is algorithmically slow (for each elements in `set(lst)`, the whole list must be checked again)… Probably fast enough for most uses, though…
EOL
@EOL: s/must be/may not be/. Other than that, I agree.
Stephan202
You can replace `set(lst)` with `lst` and it will work with non-hashable elements too; albeit slower.
newacct
+2  A: 
# use Decorate, Sort, Undecorate to solve the problem

def most_common(iterable):
    # Make a list with tuples: (item, index)
    # The index will be used later to break ties for most common item.
    lst = [(x, i) for i, x in enumerate(iterable)]
    lst.sort()

    # lst_final will also be a list of tuples: (count, index, item)
    # Sorting on this list will find us the most common item, and the index
    # will break ties so the one listed first wins.  Count is negative so
    # largest count will have lowest value and sort first.
    lst_final = []

    # Get an iterator for our new list...
    itr = iter(lst)

    # ...and pop the first tuple off.  Setup current state vars for loop.
    count = 1
    tup = next(itr)
    x_cur, i_cur = tup

    # Loop over sorted list of tuples, counting occurrences of item.
    for tup in itr:
        # Same item again?
        if x_cur == tup[0]:
            # Yes, same item; increment count
            count += 1
        else:
            # No, new item, so write previous current item to lst_final...
            t = (-count, i_cur, x_cur)
            lst_final.append(t)
            # ...and reset current state vars for loop.
            x_cur, i_cur = tup
            count = 1

    # Write final item after loop ends
    t = (-count, i_cur, x_cur)
    lst_final.append(t)

    lst_final.sort()
    answer = lst_final[0][2]

    return answer

print most_common(['x', 'e', 'a', 'e', 'a', 'e', 'e']) # prints 'e'
print most_common(['goose', 'duck', 'duck', 'goose']) # prints 'goose'
steveha
A: 

This is an O(n) solution.

mydict   = {}
cnt, itm = 0, ''
for item in reversed(lst):
     mydict[item] = mydict.get(item, 0) + 1
     if mydict[item] >= cnt :
         cnt, itm = mydict[item], item

print itm

(reversed is used to make sure that it returns the lowest index item)

ThisIsMeMoony
+6  A: 

With so many solutions proposed, I'm amazed nobody's proposed what I'd consider an obvious one (for non-hashable but comparable elements) -- [itertools.groupby][1]. itertools offers fast, reusable functionality, and lets you delegate some tricky logic to well-tested standard library components. Consider for example:

import itertools
import operator

def most_common(L):
  # get an iterable of (item, iterable) pairs
  SL = sorted((x, i) for i, x in enumerate(L))
  # print 'SL:', SL
  groups = itertools.groupby(SL, key=operator.itemgetter(0))
  # auxiliary function to get "quality" for an item
  def _auxfun(g):
    item, iterable = g
    count = 0
    min_index = len(L)
    for _, where in iterable:
      count += 1
      min_index = min(min_index, where)
    # print 'item %r, count %r, minind %r' % (item, count, min_index)
    return count, -min_index
  # pick the highest-count/earliest item
  return max(groups, key=_auxfun)[0]

This could be written more concisely, of course, but I'm aiming for maximal clarity. The two print statements can be uncommented to better see the machinery in action; for example, with prints uncommented:

print most_common(['goose', 'duck', 'duck', 'goose'])

emits:

SL: [('duck', 1), ('duck', 2), ('goose', 0), ('goose', 3)]
item 'duck', count 2, minind 1
item 'goose', count 2, minind 0
goose

As you see, SL is a list of pairs, each pair an item followed by the item's index in the original list (to implement the key condition that, if the "most common" items with the same highest count are > 1, the result must be the earliest-occurring one).

groupby groups by the item only (via operator.itemgetter). The auxiliary function, called once per grouping during the max computation, receives and internally unpacks a group - a tuple with two items (item, iterable) where the iterable's items are also two-item tuples, (item, original index) [[the items of SL]].

Then the auxiliary function uses a loop to determine both the count of entries in the group's iterable, and the minimum original index; it returns those as combined "quality key", with the min index sign-changed so the max operation will consider "better" those items that occurred earlier in the original list.

This code could be much simpler if it worried a little less about big-O issues in time and space, e.g....:

def most_common(L):
  groups = itertools.groupby(sorted(L))
  def _auxfun((item, iterable)):
    return len(list(iterable)), -L.index(item)
  return max(groups, key=_auxfun)[0]

same basic idea, just expressed more simply and compactly... but, alas, an extra O(N) auxiliary space (to embody the groups' iterables to lists) and O(N squared) time (to get the L.index of every item). While premature optimization is the root of all evil in programming, deliberately picking an O(N squared) approach when an O(N log N) one is available just goes too much against the grain of scalability!-)

Finally, for those who prefer "oneliners" to clarity and performance, a bonus 1-liner version with suitably mangled names:-).

from itertools import groupby as g
def most_common_oneliner(L):
  return max(g(sorted(L)), key=lambda(x, v):(len(list(v)),-L.index(x)))[0]
Alex Martelli
Alex, is providing detailed solutions on Stackoverflow your full time job?!
Plumo
@Richard, heh, no, just an amusing diversion (today I'm indulging a bit as it's my birthday so I'm offering myself a treat;-).
Alex Martelli
@Alex, Happy birthday :)
Satoru.Logic
+1  A: 

You probably don't need this anymore, but this is what I did for a similar problem. (It looks longer than it is because of the comments.)

itemList = ['hi', 'hi', 'hello', 'bye']

counter = {}
maxItemCount = 0
for item in itemList:
    try:
        # Referencing this will cause a KeyError exception
        # if it doesn't already exist
        counter[item]
        # ... meaning if we get this far it didn't happen so
        # we'll increment
        counter[item] += 1
    except KeyError:
        # If we got a KeyError we need to create the
        # dictionary key
        counter[item] = 1

    # Keep overwriting maxItemCount with the latest number,
    # if it's higher than the existing itemCount
    if counter[item] > maxItemCount:
        maxItemCount = counter[item]
        mostPopularItem = item

print mostPopularItem
Ed Holden