views:

423

answers:

9

This question asks how to determine if every element in a list is the same. How would I go about determining if 95% of the elements in a list are the same in a reasonably efficient way? For example:

>>> ninety_five_same([1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1])
True
>>> ninety_five_same([1,1,1,1,1,1,2,1]) # only 80% the same
False

This would need to be somewhat efficient because the lists may be very large.

+1  A: 

This is even less efficient than checking if every element is the same.

The algorithm is roughly the same: go through every element in the list and count those that do not match the expected one (with the extra difficulty of knowing which one is the expected one). However, this time you cannot just return false when you meet the first mismatch; you have to continue until you have enough mismatches to make up a 5% error rate.

Come to think of it, figuring out which element is the "right" one is probably not so easy, and involves counting every value up to the point where you can be sure that 5% are misplaced.

Consider a list with 10,000 elements, of which 99% are 42:

  (1,2,3,4,5,6,7,8,9,10, ... , 100, 42,42, 42, 42 .... 42)

So I think you would have to start out building a frequency table for at least the first 5% of the table.
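
A minimal sketch of this early-exit idea (the function name and exact stop condition are my own interpretation, not from the original answer): keep a running frequency table, and bail out as soon as the number of elements that cannot belong to the best candidate exceeds 5% of the list.

```python
import collections

def ninety_five_same_early_exit(lst):
    # Sketch of the early-exit idea: track the best candidate's count
    # while building the frequency table. Once more than 5% of the whole
    # list cannot possibly belong to any single value, stop early.
    n = len(lst)
    allowed_mismatches = n * 0.05
    freq = collections.defaultdict(int)
    best = 0
    for i, x in enumerate(lst, 1):
        freq[x] += 1
        best = max(best, freq[x])
        # everything seen so far that is not the best candidate is a mismatch;
        # even if all remaining elements matched, 95% would be unreachable
        if i - best > allowed_mismatches:
            return False
    return best >= n * 0.95
```

For nearly-uniform lists this exits only at the end, but for lists like the 10,000-element example above it can return False long before the scan completes.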

Thilo
I like this idea. It is easy to understand and should be quite fast. The tricky part will be finding out the stop condition, but I think that is fairly easy.
kigurai
Forget about my answer, use Boyer-Moore's majority vote algorithm outlined by Nikita.
Thilo
A: 

Here's a brute-force method: convert the list to a set, then for every element in the set check how often it occurs in the original list, and return the maximum frequency found:

def yourProblem(yourList):
    s = set(yourList)
    m = 0
    l = float(len(yourList))  # float so the division below is not truncated
    for i in s:
        x = len([e for e in yourList if e == i]) / l
        if x > m:
            m = x
    return m
xor_eq
This is horribly inefficient.
kigurai
+5  A: 
import collections

def ninety_five_same(lst):
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    freqsort = sorted(freq.itervalues())
    return freqsort[-1] >= .95 * sum(freqsort)

Assuming perfect hash table performance and a good sorting algorithm, this runs in O(n + m lg m), where m is the number of distinct items. O(n lg n) worst case.

Edit: here's an O(n + m), single-pass version (assuming m << n):

def ninety_five_same(lst):
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    freq = freq.values()
    return max(freq) >= .95 * sum(freq)

Memory use is O(m). max and sum can be replaced by a single loop.
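
A sketch of that single-loop replacement (my own variant, under a different name to keep it apart from the version above):

```python
import collections

def ninety_five_same_one_loop(lst):
    # Same frequency table as above, but the final max and sum are
    # computed in one pass over the m distinct counts instead of two.
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    biggest = total = 0
    for count in freq.values():
        total += count
        if count > biggest:
            biggest = count
    return biggest >= 0.95 * total
```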

larsmans
You can replace `lambda: 0` by `int`, it is guaranteed to be initialized to 0.
Space_C0wb0y
Thanks for the tip, but I find `collections.defaultdict(int)` slightly confusing.
larsmans
The Boyer-Moore algorithm proposed by @Nikita Rybak has O(N)
Thilo
While this is a correct solution, I think Thilos solution of breaking early is better.
kigurai
+1  A: 
def ninety_five_same(l):
  return max([l.count(i) for i in set(l)])*20 >= 19*len(l)

This also eliminates accuracy problems with float division.

Rok
Otherwise good, but you do a complete count of the whole list for every distinct value, plus the set construction. That is very heavy for a big list with many different values, each making up only a small portion of the total length.
Tony Veijalainen
+14  A: 
>>> from collections import Counter
>>> lst = [1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
>>> _, freq = Counter(lst).most_common(1)[0]
>>> len(lst)*.95 <= freq
True
SilentGhost
Python really does have some neat tricks hidden in its back pocket.
Tim McNamara
Should note that this requires Python version 2.7, which is when the `Counter` class was added to the `collections` module.
martineau
@martineau: it got added to py3.1 and then backported to 2.7, that is to say it's been around for a while. Also, Python 2.7 is a current stable version of Python.
SilentGhost
@SilentGhost: I know that and only brought the dependency up because, for various reasons, a number of people using this website aren't or can't use the latest release of Python 2 -- and sometimes don't understand that is why a particular answer won't work for them.
martineau
@martineau: You could use http://code.activestate.com/recipes/576611/ in Python ≥2.5.
KennyTM
It should be noted that this solution requires all elements to be hashable.
KennyTM
@KennyTM: Comment 1: Nice! I wasn't aware of that one -- now others are, too. Comment 2: Good point, a fact often forgotten in the answers posted here. Fortunately, in practical use, they often are.
martineau
+12  A: 

Actually, there's an easy linear solution for a similar problem, only with a 50% constraint instead of 95%. Check this question, it's just a few lines of code.

It will work for you as well, only in the end you check that selected element satisfies 95% threshold, not 50%. (Although, as Thilo notes, it's not necessary if currentCount >= n*0.95 already.)

I'll also post Python code from st0le's answer, to show everybody how difficult it is.

currentCount = 0
currentValue = lst[0]
for val in lst:
   if val == currentValue:
      currentCount += 1
   else:
      currentCount -= 1

   if currentCount == 0:
      currentValue = val
      currentCount = 1
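
As the answer notes, the surviving candidate then has to be checked against the 95% threshold. A sketch combining the vote above with that verification pass (the wrapper function and naming are mine, not from the original answer):

```python
def ninety_five_same_vote(lst):
    # Boyer-Moore majority vote: a 95%-majority element, if it exists,
    # is guaranteed to survive as the final candidate.
    current_count = 0
    current_value = None
    for val in lst:
        if current_count == 0:
            current_value = val
            current_count = 1
        elif val == current_value:
            current_count += 1
        else:
            current_count -= 1
    # Second pass: verify the candidate actually reaches the threshold.
    return lst.count(current_value) >= 0.95 * len(lst)
```

This keeps O(n) time and O(1) extra memory, at the cost of reading the input twice.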

If you're looking for explanation, I think Nabb has got the best one.

Nikita Rybak
+1. O(N). Looking at all the other answers should settle the argument whether this problem was "trivial".
Thilo
Excellent animated demo here: http://www.cs.utexas.edu/~moore/best-ideas/mjrty/index.html
Thilo
After this, you need to do another pass to ensure that majority indeed is 95% (except for the cases where this can already be deduced from the final value of currentCount).
Thilo
O(*n*), but a two-pass algorithm with Thilo's addition. Adding a frequency table as in my algorithm makes it O(*n*) in a single pass again :)
larsmans
@larsmans Yep, no solution is perfect. (And I've already mentioned in the post that it needs to be checked whether the given frequency rate is satisfied.) Your algorithm is nice too, even though it uses a considerable amount of additional memory and will probably be slower.
Nikita Rybak
@Nikita Rybak: your last claim depends on the nature of the input, esp. the number of distinct items and the memory locality of the original list. I gave you a +1 and updated my answer.
larsmans
@larsmans I agree that your solution is a good one (the best you can get without fancy theorems and textbook algorithms). Still, I do think that a second pass over the input list will be faster than invoking the hash n times. (That's assuming the input list is available at all.)
Nikita Rybak
In most cases, you would not need that second pass. currentCount should be significant enough.
Thilo
A: 

Think about your list as a bucket of red and black balls.

If you have one red ball in a bucket of ten balls, and you pick a ball at random and put it back in the bucket, and then repeat that sample-and-replace step a thousand times, how many times out of a thousand do you expect to observe a red ball, on average?

Check out the Binomial distribution and check out confidence intervals. If you have a very long list and want to do things relatively efficiently, sampling is the way to go.
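
A Monte Carlo sketch of the sampling idea (function name and parameters are hypothetical; a rigorous version would derive the sample size from a binomial confidence interval rather than hard-coding it):

```python
import random

def probably_ninety_five_same(lst, samples=1000, seed=None):
    # Draw elements with replacement and estimate the most common
    # value's frequency. The answer is probabilistic, not exact: for
    # very long lists this can be far cheaper than a full O(n) pass.
    rng = random.Random(seed)
    counts = {}
    for _ in range(samples):
        x = rng.choice(lst)
        counts[x] = counts.get(x, 0) + 1
    return max(counts.values()) >= 0.95 * samples
```

Note that near the 95% boundary this sketch can misclassify; the confidence-interval machinery mentioned above is what controls that error rate.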

Alex Reynolds
Problem is that you have not only red and black balls (but potentially hundreds of different colors). And sampling seems very unreliable, considering that there is an O(N) exact solution.
Thilo
If you know the number of colors, you can always extend to a multinomial. If your list is billions of elements or longer, for example, sampling a few thousand "balls" can be much more appealing than an O(n) approach that requires passing through every element in the list.
Alex Reynolds
How do you know the number of colors?
Thilo
If you know a priori what kinds of balls ("objects") go into the bucket ("list") then you can model appropriately.
Alex Reynolds
A: 
lst = [1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
#lst = [1, 2, 1, 4, 1]
#lst = [1, 2, 1, 4]

length = len(lst)
currentValue = lst[0]
lst.pop(0)
currentCount = 1

for val in lst:
   if currentCount == 0:
      currentValue = val

   if val == currentValue:
      currentCount += 1
   else:
      currentCount -= 1

percent = (currentCount * 50.0 / length + 50)
epsilon = 0.1
if (percent - 50 > epsilon):
    print "Percent %g%%" % percent
else:
    print "No majority"

Note: epsilon has a "random" value; choose something depending on the length of the array, etc. Nikita Rybak's solution with currentCount >= n*0.95 won't work, because the value of currentCount depends on the order of the elements, but the above does work.

C:\Temp>a.py
[2, 1, 1, 4, 1]
currentCount = 1

C:\Temp>a.py
[1, 2, 1, 4, 1]
currentCount = 2
tehnicaorg
A: 

Sorting as a general solution is probably heavy, but consider the exceptionally well-balanced nature of Timsort in Python, which exploits existing order in the list. I would suggest sorting the list (or a copy of it with sorted, but that copy will hurt performance). Then scan from the end and from the front to find the same element, or stop once a scan length > 5% is reached; otherwise the list is 95% similar to the element found.

Taking random elements as candidates and counting them in decreasing order of frequency would probably also not be bad, until a count > 95% is found or the total of the counts goes over 5%.
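
One way to sketch the sort-based idea (my own rendering, using bisect to count the run rather than the front/back scan described above): after sorting, any value covering 95% of the list must occupy the middle index, so binary search gives its run length directly.

```python
import bisect

def ninety_five_same_sorted(lst):
    # After sorting, a value that makes up >= 95% of the list is
    # guaranteed to sit at the middle index; count its contiguous run
    # with two binary searches.
    if not lst:
        return True
    s = sorted(lst)
    candidate = s[len(s) // 2]
    run = bisect.bisect_right(s, candidate) - bisect.bisect_left(s, candidate)
    return run >= 0.95 * len(s)
```

This is O(n log n) and requires orderable (not just hashable) elements, so it mostly pays off when the list is already nearly sorted, as noted above.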

Tony Veijalainen