views:

261

answers:

9

Hi,

I have an array with some objects, and there are several objects that are alike. E.g: fruit = [apple, orange, apple, banana, banana, orange, apple, apple]

What is the most efficient way to get the most represented object from this array? In this case it would be "apple" but how would you go out and calculate that in an efficient way?

+4  A: 

If the objects are hashable then you can use a dict to store the counts:

results = {}
for item in somelist:
  if item not in results:
    results[item] = 1
  else
    results[item] += 1

print max(results.iteritems(), key=operator.itemgetter(1))
Ignacio Vazquez-Abrams
+1  A: 

The best time you can hope to achieve here is O(n) - you'll always need to walk the entire array at least once. The easiest way will certainly be to build a histogram. If your dictionary structure (map of some kind) offers O(1) insert and retrieve then this is as easy as (groovy-ish pseudocode):

def histogram = new HashMap()
def maxObj = null
def maxObjCount = 0
objectList.each {
    if(histogram.contains(it)) histogram.put(it, histogram.get(it)+1)
    else histogram.put(it, 1)

    if(histogram.get(it) > maxObjCount) {
        maxObj = it
        maxObjCount = histogram.get(it)
    }
}
jharlap
+3  A: 

Keep a dictionary of how often each object appears.

Walk through the list once, building this table. As you go, keep track of which object has appeared the most often so far.

This code is untested.

from collections import defaultdict

def mode(objects):
    h = defaultdict(int)
    max_f = 0
    max_obj = None
    for o in objects:
        f = h[o] = h[o] + 1
        if f > max_f:
            max_f = f
            max_obj = o
    return max_obj

If the objects are not hashable, you can instead hash some unique feature of them, such as id(o).

Jason Orendorff
If you modify this to keep the second-most-common object as well as the most common, then you can stop when the most common object gets so far ahead that it's impossible for any other object to catch up. But I doubt it would be worth it for real-world data. Better to speed through the loop, is my guess.
Jason Orendorff
+2  A: 

You want an efficient method. Clearly it's possible in O(n) time, so any method that requires sorting the list is out as that would be O(n log(n)). It's not possible to do it faster than O(n) because even if you check the first n/2-1 elements, and they are all "apple", you don't know that the rest of the elements won't be bananas.

So given that we're looking for O(n), you must iterate over the list and keep a count of how many items of each type you've seen.

A defaultdict would be a simple way to implement this in practice.

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> for i in ['apple', 'banana', 'apple']:
...    d[i] += 1
...
>>> d
defaultdict(<type 'int'>, {'apple': 2, 'banana': 1})
Mark Byers
A: 
def count_reps(item, agg):
  k = hash(item)
  try:
    agg[k] += 1
  except KeyError:
    agg[k] = 1
  return agg

item_dict = reduce(your_array, {})

item_dict will contain the counts, then you can rate the popularity of each object.

Kris Walker
A: 

Heres a different approach which essentially sorts the list and then processes it in a sorted order.

fruits = ['apple', 'orange', 'apple', 'banana', 'banana', 'orange', 'apple', 'apple']

max_fruit_count = 0
max_fruit = ''
current_fruit_count = 0
current_fruit = ''
for fruit in sorted(fruits) :
    if fruit != current_fruit :
        if current_fruit != max_fruit :
            if current_fruit_count > max_fruit_count :
                max_fruit = current_fruit
                max_fruit_count = current_fruit_count
        current_fruit = fruit
        current_fruit_count = 1
    else :
        current_fruit_count += 1

if current_fruit_count > max_fruit_count :
    max_fruit = current_fruit
    max_fruit_count = current_fruit_count

print max_fruit, max_fruit_count
Dhananjay Nene
A: 

This is not O(n), but O(n^2), so it not may fit your bill as "most efficient way", but it's compact and avoids for loops, which are rather slow in Python. It will be faster than the O(n) option up to 11 unique items.

def most_common(items):
    s = set(items)
    return max([(items.count(i), i) for i in s])[1]
Heim
Mmh... Forgot to tell that it's up to 11 unique items... compared to the first solution. With some improvements to that (no real need to calculate the biggest one *each time you loop*, the advantage still holds, but only up 7 items.
Heim
+8  A: 

Don't reinvent the wheel. In Python 2.7+ you can use the Counter class:

import collections
fruit=['apple', 'orange', 'apple', 'banana', 'banana', 'orange', 'apple', 'apple']
c=collections.Counter(fruit)
print(c.most_common(1))
# [('apple', 4)]

If you are using an older version of Python, then you can download Counter here.

While it's good to know how to implement something like this yourself, it's also a good idea to get used to using Counter, since it is (or going to be) part of the standard library.

unutbu
This is awesome given one is running 2.7...
jathanism
A: 

As ~unutbu says: use collections.Counter Failing that, time your code. Here is my (likely inefficient) approach:

python -m timeit -s "fruit = ['apple']*4 + ['banana'] + ['orange']*2" \
"kL = set(fruit);  L = [fruit.count(f) for f in kL];  D = dict(zip(kL,L)); \
sorted(D,key = lambda k: D[k],reverse=True)" 
100000 loops, best of 3: 10.1 usec per loop
telliott99