Q:

import random

def maxVote(nLabels):
    count = {}
    maxList = []
    maxCount = 0
    for nLabel in nLabels:
        if nLabel in count:
            count[nLabel] += 1
        else:
            count[nLabel] = 1
        # Check if this label's count is the new maximum
        if count[nLabel] > maxCount:
            maxCount = count[nLabel]
            maxList = [nLabel]
        elif count[nLabel] == maxCount:
            maxList.append(nLabel)
    return random.choice(maxList)

`nLabels` is a list of integers.

The function above returns the integer with the highest frequency; if more than one integer shares the highest frequency, a randomly selected integer from among them is returned.

E.g. maxVote([1,3,4,5,5,5,3,12,11]) is 5.

How can this function be made more efficient?

A: 

Idea 1

Does the return really need to be uniformly random, or can you just return a maximum? If any max-frequency label will do, you could store a single label and remove the list logic, including

    elif count[nLabel] == maxCount:
        maxList.append(nLabel)
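
For illustration, a minimal sketch of this idea (the name maxVoteAny is made up here, not from the question); it tracks a single best label instead of a list, returning whichever label reaches the maximum first:

def maxVoteAny(nLabels):
    # Hypothetical single-label variant of maxVote: no list, no random
    # choice; returns whichever label reaches the maximum count first.
    count = {}
    best = None
    maxCount = 0
    for nLabel in nLabels:
        count[nLabel] = count.get(nLabel, 0) + 1
        if count[nLabel] > maxCount:
            maxCount = count[nLabel]
            best = nLabel
    return best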

Idea 2

If this method is called frequently, would it be possible to only work on new data, as opposed to the entire data set? You could cache your count map and then only process new data. Assuming your data set is large and the calculations are done online, this could net huge improvements.
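
A sketch of what that caching could look like (the class and method names here are illustrative, not from the original code):

import random

class VoteCounter(object):
    def __init__(self):
        self.count = {}  # cached label -> frequency map, kept between calls

    def update(self, new_labels):
        # Only the new data is processed; earlier counts are reused.
        for label in new_labels:
            self.count[label] = self.count.get(label, 0) + 1

    def max_vote(self):
        max_count = max(self.count.values())
        return random.choice(
            [k for k, v in self.count.items() if v == max_count])

For example, after vc = VoteCounter(); vc.update([1, 3, 4, 5, 5]); vc.update([5, 3, 12, 11]), calling vc.max_vote() returns 5 without recounting the first batch.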

Stefan Kendall
No, it has to be random; it is a requirement of the algorithm in which this is the most-used part.
RandomVector
I am actually using a Map-Reduce-type formalism and multiple cores via the Python multiprocessing module. Full code here: http://github.com/AKSHAYUBHAT/Label-Propagation/blob/master/LP.py
RandomVector
Finding a maximum in parallel can approach O(log n) time in some cases.
Jweede
+6  A: 
import random
import collections

def maxvote(nlabels):
  cnt = collections.defaultdict(int)  # missing labels default to 0
  for i in nlabels:
    cnt[i] += 1
  maxv = max(cnt.itervalues())        # highest frequency
  return random.choice([k for k, v in cnt.iteritems() if v == maxv])

print maxvote([1,3,4,5,5,5,3,3,11])
Ignacio Vazquez-Abrams
Sorting is `O(n log n)`; you just need the maximum, which is `O(n)` :)
badp
Good call. Fixed.
Ignacio Vazquez-Abrams
Thanks, it seems this is the best solution.
RandomVector
Dictionaries implemented with hash tables tend to run in constant time per lookup. This looks like O(n) for the first loop, then O(n) for finding the max. The random choice should remain trivially cheap as long as many labels don't share the same frequency. O(n) + O(n) = O(2n) ~ O(n). The original appeared to run in O(n^2). So this is an improvement.
Jweede
Yeah, I didn't know about defaultdict, and thus I was checking the dict in order to initialize entries to zero.
RandomVector
@RandomVector: So the count (in your question) is a dict and the existence check is O(1)? Then perhaps the solution I gave might perform better. After all, defaultdict or not, there _are_ going to be initialization costs. IMO, by ignoring elements of frequency 1, you would save a lot: I presume the majority of them will have frequency 1. Of course, I don't know Python :-)
Moron
`cnt = collections.Counter(nlabels)` (py2.7)
Paul Hankin
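
For reference, a sketch of what the full function looks like with Counter (requires Python 2.7 or 3.1; maxvote_counter is an illustrative name), keeping the random tie-break:

import random
from collections import Counter

def maxvote_counter(nlabels):
    cnt = Counter(nlabels)    # one-line frequency count
    maxv = max(cnt.values())  # highest frequency
    return random.choice([k for k, v in cnt.items() if v == maxv])
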
@RandomVector: you could also have used default dicts with the `dict.get(key, default_value)` method. :)
badp
+2  A: 

It appears to run in O(n) time. However, there may be a bottleneck in the `nLabel in count` check, since this operation could potentially take O(n) time as well, making the total efficiency O(n^2).

Using a dictionary instead of a list in this case is the only major efficiency boost I can spot.
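
One way to check the membership-check cost on your own machine (the names and sizes here are illustrative only):

from timeit import Timer

setup = ("data_list = list(range(10000)); "
         "data_dict = dict.fromkeys(data_list)")
print(Timer("9999 in data_list", setup).timeit(10000))  # linear scan
print(Timer("9999 in data_dict", setup).timeit(10000))  # hash lookup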

Jweede
`count` is a `dict`, therefore, membership check is `O(1)`
SilentGhost
Ah, I missed that. Yep, O(n) time then.
Jweede
A: 

Complete example:

#!/usr/bin/env python

def max_vote(l):
    """
    Return the element with the (or a) maximum frequency in ``l``.
    """
    unsorted = [(a, l.count(a)) for a in set(l)]
    return sorted(unsorted, key=lambda x: x[1]).pop()[0]

if __name__ == '__main__':
    votes = [1, 3, 4, 5, 5, 5, 3, 12, 11]
    print max_vote(votes)
    # => 5
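
Since only the maximum is needed (as the comments on the accepted answer point out), the sort can be dropped; a possible variant, still deterministic among ties:

def max_vote_nosort(l):
    # max() scans the distinct labels once instead of sorting them,
    # but l.count(a) still makes this quadratic in the worst case.
    return max(set(l), key=l.count)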

Benchmarks:

#!/usr/bin/env python

import random
import collections

def max_vote_2(l):
    """
    Return the element with the (or a) maximum frequency in ``l``.
    """
    unsorted = [(a, l.count(a)) for a in set(l)]
    return sorted(unsorted, key=lambda x: x[1]).pop()[0]

def max_vote_1(nlabels):
    cnt = collections.defaultdict(int)
    for i in nlabels:
        cnt[i] += 1
    maxv = max(cnt.itervalues())
    return random.choice([k for k,v in cnt.iteritems() if v == maxv])

if __name__ == '__main__':
    from timeit import Timer
    votes = [1, 3, 4, 5, 5, 5, 3, 12, 11]
    print max_vote_1(votes)
    print max_vote_2(votes)

    t = Timer("votes = [1, 3, 4, 5, 5, 5, 3, 12, 11]; max_vote_2(votes)", \
        "from __main__ import max_vote_2")
    print 'max_vote_2', t.timeit(number=100000)

    t = Timer("votes = [1, 3, 4, 5, 5, 5, 3, 12, 11]; max_vote_1(votes)", \
        "from __main__ import max_vote_1")
    print 'max_vote_1', t.timeit(number=100000)

Yields:

5
5
max_vote_2 1.79455208778
max_vote_1 2.31705093384
The MYYN
Please let me know if this is wrong, but the set creation is O(n), looping through the set takes ~n/2 iterations depending on the number of duplicates, and each l.count() call is O(n). Building the sorted list adds O(n log n). O(n) + O(n/2) * O(n) + O(n log n) ≈ O(n^2/2)
Jweede
Keep in mind that `O(n²/2) === O(n²)`
badp
Mmh, and it seems to be faster than the currently upvoted answer - please correct me if I'm wrong ...
The MYYN
right, but there is a demonstrable improvement for large data sets from O(n^2) to O(n^2/2)
Jweede
Ok, sorry for coming back - but on a list with 1,000,000 entries these improvements still don't show up. I mean, not that 1,000,000 would be that much, but ...
The MYYN
The random selection is an important step, since it's a mandatory requirement. However, even a few percent improvement is also critical, since I am planning to use it with the Wikipedia page-link graph. Thus the distribution of len(nLabels) will be the distribution of Wikipedia out-links.
RandomVector
Interesting ... not that I have a data center at hand, but do you have some estimates on the numbers? Just curious ...
The MYYN
To give you an idea, I am currently using the Stack Overflow question-answer bipartite network. It has 700k nodes (questions + users). A single iteration, in which max vote is applied to the labels of each node's neighbors, takes around 20 secs on an Intel 2.6 GHz with two cores (I am using multiprocessing).
RandomVector
thanks, if I happen to accumulate more benchmarking data, I'll post it ...
The MYYN
+2  A: 

I'm not sure what exactly you want to optimize, but this should work:

import random
from collections import defaultdict

def maxVote(nLabels):
    count = defaultdict(int)
    for nLabel in nLabels:
        count[nLabel] += 1
    maxCount = max(count.itervalues())
    maxList = [k for k in count if count[k] == maxCount]
    return random.choice(maxList)
sth
+4  A: 

In Python 3.1 or the upcoming 2.7 you'd be able to use Counter:

>>> from collections import Counter
>>> Counter([1,3,4,5,5,5,3,12,11]).most_common(1)
[(5, 3)]

If you don't have access to those versions of Python you could do:

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> for i in nLabels:
    d[i] += 1


>>> max(d, key=lambda x: d[x])
5
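
Note that both `most_common(1)` and `max` pick a tie deterministically; if the random tie-break from the question is needed, one possible final step is:

>>> import random
>>> maxv = max(d.values())
>>> random.choice([k for k in d if d[k] == maxv])
5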
SilentGhost