Q:

import random

def maxVote(nLabels):
    count = {}
    maxList = []
    maxCount = 0
    for nLabel in nLabels:
        if nLabel in count:
            count[nLabel] += 1
        else:
            count[nLabel] = 1
        # Check if this label's count is the new maximum
        if count[nLabel] > maxCount:
            maxCount = count[nLabel]
            maxList = [nLabel]
        elif count[nLabel] == maxCount:
            maxList.append(nLabel)
    return random.choice(maxList)

`nLabels` is a list of integers.

The function above returns the integer with the highest frequency; if more than one integer shares the highest frequency, a randomly selected integer from among them is returned.

E.g. maxVote([1,3,4,5,5,5,3,12,11]) is 5.

How can this function be made more efficient?

A: 

Idea 1

Does the return really need to be uniformly random, or can you just return a maximum? If any max-frequency label will do, you could store a single label and remove the list logic, including

    elif count[nLabel] == maxCount:
        maxList.append(nLabel)
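
For illustration, a minimal sketch of this idea (the name maxVoteAny is made up here, not from the question); it tracks a single best label instead of a list, returning whichever label reaches the maximum first:

def maxVoteAny(nLabels):
    # Hypothetical single-label variant of maxVote: no list, no random
    # choice; returns whichever label reaches the maximum count first.
    count = {}
    best = None
    maxCount = 0
    for nLabel in nLabels:
        count[nLabel] = count.get(nLabel, 0) + 1
        if count[nLabel] > maxCount:
            maxCount = count[nLabel]
            best = nLabel
    return best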

Idea 2

If this method is called frequently, would it be possible to only work on new data, as opposed to the entire data set? You could cache your count map and then only process new data. Assuming your data set is large and the calculations are done online, this could net huge improvements.
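
A sketch of what that caching could look like (the class and method names here are illustrative, not from the original code):

import random

class VoteCounter(object):
    def __init__(self):
        self.count = {}  # cached label -> frequency map, kept between calls

    def update(self, new_labels):
        # Only the new data is processed; earlier counts are reused.
        for label in new_labels:
            self.count[label] = self.count.get(label, 0) + 1

    def max_vote(self):
        max_count = max(self.count.values())
        return random.choice(
            [k for k, v in self.count.items() if v == max_count])

For example, after vc = VoteCounter(); vc.update([1, 3, 4, 5, 5]); vc.update([5, 3, 12, 11]), calling vc.max_vote() returns 5 without recounting the first batch.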

Stefan Kendall
No, it has to be random; it is a requirement of the algorithm in which this is the most-used part.
RandomVector
I am actually using a Map-Reduce-type formalism and multiple cores via the Python multiprocessing module. Full code here: http://github.com/AKSHAYUBHAT/Label-Propagation/blob/master/LP.py
RandomVector
Finding a maximum in parallel can approach O(log n) time in some cases.
Jweede
+6  A: 
import random
import collections

def maxvote(nlabels):
  cnt = collections.defaultdict(int)  # missing labels default to 0
  for i in nlabels:
    cnt[i] += 1
  maxv = max(cnt.itervalues())        # highest frequency
  return random.choice([k for k, v in cnt.iteritems() if v == maxv])

print maxvote([1,3,4,5,5,5,3,3,11])
Ignacio Vazquez-Abrams
Sorting is `O(n log n)`; you just need the maximum, which is `O(n)` :)
badp
Good call. Fixed.
Ignacio Vazquez-Abrams
Thanks, it seems this is the best solution.
RandomVector
Dictionaries implemented with hash tables tend to run in constant time per lookup. This looks like O(n) for the first loop, then O(n) for finding the max. The random choice should remain trivially cheap as long as many labels don't share the same frequency. O(n) + O(n) = O(2n) ~ O(n). The original appeared to run in O(n^2). So this is an improvement.
Jweede
Yeah, I didn't know about defaultdict, and thus I was checking the dict in order to initialize entries to zero.
RandomVector
@RandomVector: So the count (in your question) is a dict and the existence check is O(1)? Then perhaps the solution I gave might perform better. After all, defaultdict or not, there _are_ going to be initialization costs. IMO, by ignoring elements of frequency 1, you would save a lot: I presume the majority of them will have frequency 1. Of course, I don't know Python :-)
Moron
`cnt = collections.Counter(nlabels)` (py2.7)
Paul Hankin
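
For reference, a sketch of what the full function looks like with Counter (requires Python 2.7 or 3.1; maxvote_counter is an illustrative name), keeping the random tie-break:

import random
from collections import Counter

def maxvote_counter(nlabels):
    cnt = Counter(nlabels)    # one-line frequency count
    maxv = max(cnt.values())  # highest frequency
    return random.choice([k for k, v in cnt.items() if v == maxv])
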
@RandomVector: you could also have used default dicts with the `dict.get(key, default_value)` method. :)
badp
+2  A: 

It appears to run in O(n) time. However, there may be a bottleneck in the `nLabel in count` check, since this operation could potentially take O(n) time as well, making the total efficiency O(n^2).

Using a dictionary instead of a list in this case is the only major efficiency boost I can spot.
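
One way to check the membership-check cost on your own machine (the names and sizes here are illustrative only):

from timeit import Timer

setup = ("data_list = list(range(10000)); "
         "data_dict = dict.fromkeys(data_list)")
print(Timer("9999 in data_list", setup).timeit(10000))  # linear scan
print(Timer("9999 in data_dict", setup).timeit(10000))  # hash lookup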

Jweede
`count` is a `dict`, therefore, membership check is `O(1)`
SilentGhost
Ah, I missed that. Yep, O(n) time then.
Jweede
A: 

Complete example:

#!/usr/bin/env python

def max_vote(l):
    """
    Return the element with the (or a) maximum frequency in ``l``.
    """
    unsorted = [(a, l.count(a)) for a in set(l)]
    return sorted(unsorted, key=lambda x: x[1]).pop()[0]

if __name__ == '__main__':
    votes = [1, 3, 4, 5, 5, 5, 3, 12, 11]
    print max_vote(votes)
    # => 5
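
Since only the maximum is needed (as the comments on the accepted answer point out), the sort can be dropped; a possible variant, still deterministic among ties:

def max_vote_nosort(l):
    # max() scans the distinct labels once instead of sorting them,
    # but l.count(a) still makes this quadratic in the worst case.
    return max(set(l), key=l.count)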

Benchmarks:

#!/usr/bin/env python

import random
import collections

def max_vote_2(l):
    """
    Return the element with the (or a) maximum frequency in ``l``.
    """
    unsorted = [(a, l.count(a)) for a in set(l)]
    return sorted(unsorted, key=lambda x: x[1]).pop()[0]

def max_vote_1(nlabels):
    cnt = collections.defaultdict(int)
    for i in nlabels:
        cnt[i] += 1
    maxv = max(cnt.itervalues())
    return random.choice([k for k,v in cnt.iteritems() if v == maxv])

if __name__ == '__main__':
    from timeit import Timer
    votes = [1, 3, 4, 5, 5, 5, 3, 12, 11]
    print max_vote_1(votes)
    print max_vote_2(votes)

    t = Timer("votes = [1, 3, 4, 5, 5, 5, 3, 12, 11]; max_vote_2(votes)", \
        "from __main__ import max_vote_2")
    print 'max_vote_2', t.timeit(number=100000)

    t = Timer("votes = [1, 3, 4, 5, 5, 5, 3, 12, 11]; max_vote_1(votes)", \
        "from __main__ import max_vote_1")
    print 'max_vote_1', t.timeit(number=100000)

Yields:

5
5
max_vote_2 1.79455208778
max_vote_1 2.31705093384
The MYYN
Please let me know if this is wrong, but the set creation is O(n), looping through the set takes ~n/2 iterations depending on the number of duplicates, and each l.count() call is O(n). Building the sorted list adds O(n log n). O(n) + O(n/2) * O(n) + O(n log n) ≈ O(n^2/2)
Jweede
Keep in mind that `O(n²/2) === O(n²)`
badp
Mmh, and it seems to be faster than the currently upvoted answer - please correct me if I'm wrong ...
The MYYN
right, but there is a demonstrable improvement for large data sets from O(n^2) to O(n^2/2)
Jweede
Ok, sorry for coming back - but on a list with 1,000,000 entries these improvements still don't show up. I mean, not that 1,000,000 would be that much, but ...
The MYYN
The random selection is an important step, since it's a mandatory requirement. However, even a few percent improvement is also critical, since I am planning to use it with the Wikipedia page-link graph. Thus the distribution of len(nLabels) will be the distribution of Wikipedia out-links.
RandomVector
Interesting ... not that I have a data center at hand, but do you have some estimates on the numbers? Just curious ...
The MYYN
To give you an idea, I am currently using the Stack Overflow question-answer bipartite network. It has 700k nodes (questions + users). A single iteration, in which max vote is applied to the labels of each node's neighbors, takes around 20 secs on an Intel 2.6 GHz with two cores (I am using multiprocessing).
RandomVector
thanks, if I happen to accumulate more benchmarking data, I'll post it ...
The MYYN
+2  A: 

I'm not sure what exactly you want to optimize, but this should work:

import random
from collections import defaultdict

def maxVote(nLabels):
    count = defaultdict(int)
    for nLabel in nLabels:
        count[nLabel] += 1
    maxCount = max(count.itervalues())
    maxList = [k for k in count if count[k] == maxCount]
    return random.choice(maxList)
sth
+4  A: 

In Python 3.1 or the upcoming 2.7 you'd be able to use Counter:

>>> from collections import Counter
>>> Counter([1,3,4,5,5,5,3,12,11]).most_common(1)
[(5, 3)]

If you don't have access to those versions of Python you could do:

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> for i in nLabels:
    d[i] += 1


>>> max(d, key=lambda x: d[x])
5
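
Note that both `most_common(1)` and `max` pick a tie deterministically; if the random tie-break from the question is needed, one possible final step is:

>>> import random
>>> maxv = max(d.values())
>>> random.choice([k for k in d if d[k] == maxv])
5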
SilentGhost