views:

423

answers:

9

This question asks how to determine if every element in a list is the same. How would I go about determining if 95% of the elements in a list are the same in a reasonably efficient way? For example:

>>> ninety_five_same([1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1])
True
>>> ninety_five_same([1,1,1,1,1,1,2,1]) # only 80% the same
False

This would need to be somewhat efficient because the lists may be very large.

+1  A: 

This is even less efficient than checking if every element is the same.

The algorithm is roughly the same: go through every element in the list and count those that do not match the expected one (with the extra difficulty of knowing which one is the expected one). However, this time you cannot just return false when you meet the first mismatch; you have to continue until you have enough mismatches to make up a 5% error rate.

Come to think of it, figuring out which element is the "right" one is probably not so easy, and involves counting every value up to the point where you can be sure that 5% are misplaced.

Consider a list with 10,000 elements, of which 99% are 42:

  (1,2,3,4,5,6,7,8,9,10, ... , 100, 42,42, 42, 42 .... 42)

So I think you would have to start out building a frequency table for at least the first 5% of the table.
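
A minimal sketch of this early-exit idea (the function name and exact stop condition are my own interpretation, not from the original answer): keep a running frequency table, and bail out as soon as the number of elements that cannot belong to the best candidate exceeds 5% of the list.

```python
import collections

def ninety_five_same_early_exit(lst):
    # Sketch of the early-exit idea: track the best candidate's count
    # while building the frequency table. Once more than 5% of the whole
    # list cannot possibly belong to any single value, stop early.
    n = len(lst)
    allowed_mismatches = n * 0.05
    freq = collections.defaultdict(int)
    best = 0
    for i, x in enumerate(lst, 1):
        freq[x] += 1
        best = max(best, freq[x])
        # everything seen so far that is not the best candidate is a mismatch;
        # even if all remaining elements matched, 95% would be unreachable
        if i - best > allowed_mismatches:
            return False
    return best >= n * 0.95
```

For nearly-uniform lists this exits only at the end, but for lists like the 10,000-element example above it can return False long before the scan completes.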

Thilo
I like this idea. It is easy to understand and should be quite fast. The tricky part will be finding out the stop condition, but I think that is fairly easy.
kigurai
Forget about my answer, use Boyer-Moore's majority vote algorithm outlined by Nikita.
Thilo
A: 

Here's a brute-force method: convert the list to a set, then for every element in the set check how often it occurs in the original list, and return the maximum frequency found:

def yourProblem(yourList):
    s = set(yourList)
    m = 0
    l = float(len(yourList))  # float so the division below is not truncated
    for i in s:
        x = len([e for e in yourList if e == i]) / l
        if x > m:
            m = x
    return m
xor_eq
This is horribly inefficient.
kigurai
+5  A: 
import collections

def ninety_five_same(lst):
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    freqsort = sorted(freq.itervalues())
    return freqsort[-1] >= .95 * sum(freqsort)

Assuming perfect hash table performance and a good sorting algorithm, this runs in O(n + m lg m), where m is the number of distinct items. O(n lg n) worst case.

Edit: here's an O(n + m), single-pass version (assuming m << n):

def ninety_five_same(lst):
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    freq = freq.values()
    return max(freq) >= .95 * sum(freq)

Memory use is O(m). max and sum can be replaced by a single loop.
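
A sketch of that single-loop replacement (my own variant, under a different name to keep it apart from the version above):

```python
import collections

def ninety_five_same_one_loop(lst):
    # Same frequency table as above, but the final max and sum are
    # computed in one pass over the m distinct counts instead of two.
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    biggest = total = 0
    for count in freq.values():
        total += count
        if count > biggest:
            biggest = count
    return biggest >= 0.95 * total
```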

larsmans
You can replace `lambda: 0` by `int`, it is guaranteed to be initialized to 0.
Space_C0wb0y
Thanks for the tip, but I find `collections.defaultdict(int)` slightly confusing.
larsmans
The Boyer-Moore algorithm proposed by @Nikita Rybak has O(N)
Thilo
While this is a correct solution, I think Thilos solution of breaking early is better.
kigurai
+1  A: 
def ninety_five_same(l):
  return max([l.count(i) for i in set(l)])*20 >= 19*len(l)

This also eliminates accuracy problems with float division.

Rok
Otherwise good, but you do a complete count of the whole list for every distinct value, plus the set construction. That is very heavy for a big list with many different values, each making up only a small portion of the total length.
Tony Veijalainen
+14  A: 
>>> from collections import Counter
>>> lst = [1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
>>> _, freq = Counter(lst).most_common(1)[0]
>>> len(lst)*.95 <= freq
True
SilentGhost
Python really does have some neat tricks hidden in its back pocket.
Tim McNamara
Should note that this requires Python version 2.7, which is when the `Counter` class was added to the `collections` module.
martineau
@martineau: it got added to py3.1 and then backported to 2.7, that is to say it's been around for a while. Also, Python 2.7 is a current stable version of Python.
SilentGhost
@SilentGhost: I know that and only brought the dependency up because, for various reasons, a number of people using this website aren't or can't use the latest release of Python 2 -- and sometimes don't understand that is why a particular answer won't work for them.
martineau
@martineau: You could use http://code.activestate.com/recipes/576611/ in Python ≥2.5.
KennyTM
It should be noted that this solution requires all elements to be hashable.
KennyTM
@KennyTM: Comment 1: Nice! I wasn't aware of that one -- now others are, too. Comment 2: Good point, a fact often forgotten in the answers posted here. Fortunately, in practical use, they often are.
martineau
+12  A: 

Actually, there's an easy linear solution for a similar problem, only with a 50% constraint instead of 95%. Check this question, it's just a few lines of code.

It will work for you as well, only in the end you check that selected element satisfies 95% threshold, not 50%. (Although, as Thilo notes, it's not necessary if currentCount >= n*0.95 already.)

I'll also post Python code from st0le's answer, to show everybody how difficult it is.

currentCount = 0
currentValue = lst[0]
for val in lst:
   if val == currentValue:
      currentCount += 1
   else:
      currentCount -= 1

   if currentCount == 0:
      currentValue = val
      currentCount = 1
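
As the answer notes, the surviving candidate then has to be checked against the 95% threshold. A sketch combining the vote above with that verification pass (the wrapper function and naming are mine, not from the original answer):

```python
def ninety_five_same_vote(lst):
    # Boyer-Moore majority vote: a 95%-majority element, if it exists,
    # is guaranteed to survive as the final candidate.
    current_count = 0
    current_value = None
    for val in lst:
        if current_count == 0:
            current_value = val
            current_count = 1
        elif val == current_value:
            current_count += 1
        else:
            current_count -= 1
    # Second pass: verify the candidate actually reaches the threshold.
    return lst.count(current_value) >= 0.95 * len(lst)
```

This keeps O(n) time and O(1) extra memory, at the cost of reading the input twice.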

If you're looking for explanation, I think Nabb has got the best one.

Nikita Rybak
+1. O(N). Looking at all the other answers should settle the argument whether this problem was "trivial".
Thilo
Excellent animated demo here: http://www.cs.utexas.edu/~moore/best-ideas/mjrty/index.html
Thilo
After this, you need to do another pass to ensure that majority indeed is 95% (except for the cases where this can already be deduced from the final value of currentCount).
Thilo
O(*n*), but a two-pass algorithm with Thilo's addition. Adding a frequency table as in my algorithm makes it O(*n*) in a single pass again :)
larsmans
@larsmans Yep, no solution is perfect. (And I've already mentioned in the post that it needs to be checked whether the given frequency rate is satisfied.) Your algorithm is nice too, even though it uses a considerable amount of additional memory and will probably be slower.
Nikita Rybak
@Nikita Rybak: your last claim depends on the nature of the input, esp. the number of distinct items and the memory locality of the original list. I gave you a +1 and updated my answer.
larsmans
@larsmans I agree that your solution is a good one (the best you can get without fancy theorems and textbook algorithms). Still, I do think that a second pass over the input list will be faster than invoking the hash n times. (That's assuming the input list is available at all.)
Nikita Rybak
In most cases, you would not need that second pass. currentCount should be significant enough.
Thilo
A: 

Think about your list as a bucket of red and black balls.

If you have one red ball in a bucket of ten balls, and you pick a ball at random and put it back in the bucket, and then repeat that sample-and-replace step a thousand times, how many times out of a thousand do you expect to observe a red ball, on average?

Check out the Binomial distribution and check out confidence intervals. If you have a very long list and want to do things relatively efficiently, sampling is the way to go.
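
A Monte Carlo sketch of the sampling idea (function name and parameters are hypothetical; a rigorous version would derive the sample size from a binomial confidence interval rather than hard-coding it):

```python
import random

def probably_ninety_five_same(lst, samples=1000, seed=None):
    # Draw elements with replacement and estimate the most common
    # value's frequency. The answer is probabilistic, not exact: for
    # very long lists this can be far cheaper than a full O(n) pass.
    rng = random.Random(seed)
    counts = {}
    for _ in range(samples):
        x = rng.choice(lst)
        counts[x] = counts.get(x, 0) + 1
    return max(counts.values()) >= 0.95 * samples
```

Note that near the 95% boundary this sketch can misclassify; the confidence-interval machinery mentioned above is what controls that error rate.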

Alex Reynolds
Problem is that you have not only red and black balls (but potentially hundreds of different colors). And sampling seems very unreliable, considering that there is an O(N) exact solution.
Thilo
If you know the number of colors, you can always extend to a multinomial. If your list is billions of elements or longer, for example, sampling a few thousand "balls" can be much more appealing than an O(n) approach that requires passing through every element in the list.
Alex Reynolds
How do you know the number of colors?
Thilo
If you know a priori what kinds of balls ("objects") go into the bucket ("list") then you can model appropriately.
Alex Reynolds
A: 
lst = [1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
#lst = [1, 2, 1, 4, 1]
#lst = [1, 2, 1, 4]

length = len(lst)
currentValue = lst[0]
lst.pop(0)
currentCount = 1

for val in lst:
   if currentCount == 0:
      currentValue = val

   if val == currentValue:
      currentCount += 1
   else:
      currentCount -= 1

percent = (currentCount * 50.0 / length + 50)
epsilon = 0.1
if (percent - 50 > epsilon):
    print "Percent %g%%" % percent
else:
    print "No majority"

Note: epsilon has a "random" value; choose something depending on the length of the array, etc. Nikita Rybak's solution with currentCount >= n*0.95 won't work, because the value of currentCount depends on the order of the elements, but the above does work.

C:\Temp>a.py
[2, 1, 1, 4, 1]
currentCount = 1

C:\Temp>a.py
[1, 2, 1, 4, 1]
currentCount = 2
tehnicaorg
A: 

Sorting as a general solution is probably heavy, but consider the exceptionally well-balanced nature of Timsort in Python, which exploits existing order in the list. I would suggest sorting the list (or a copy of it with sorted, but that copy will hurt performance). Then scan from the end and from the front to find the same element, or stop once a scan length > 5% is reached; otherwise the list is 95% similar to the element found.

Taking random elements as candidates and counting them in decreasing order of frequency would probably also not be bad, until a count > 95% is found or the total of the counts goes over 5%.
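
One way to sketch the sort-based idea (my own rendering, using bisect to count the run rather than the front/back scan described above): after sorting, any value covering 95% of the list must occupy the middle index, so binary search gives its run length directly.

```python
import bisect

def ninety_five_same_sorted(lst):
    # After sorting, a value that makes up >= 95% of the list is
    # guaranteed to sit at the middle index; count its contiguous run
    # with two binary searches.
    if not lst:
        return True
    s = sorted(lst)
    candidate = s[len(s) // 2]
    run = bisect.bisect_right(s, candidate) - bisect.bisect_left(s, candidate)
    return run >= 0.95 * len(s)
```

This is O(n log n) and requires orderable (not just hashable) elements, so it mostly pays off when the list is already nearly sorted, as noted above.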

Tony Veijalainen