views:

327

answers:

11

Hi everyone,

How can retrieve the 2 highest item from a list containing 100 000 integers.

You do understand without having to sort the entire list.

+5  A: 

You iterate over the list, maintaining variables that contain the value of the highest and the second highest item encountered thus far. Each new item that is encountered will replace whichever of the two the new item is higher than (if any).

JacobM
This is the way to go. O(n) time and O(1) space.
Justin Peel
A: 

Iterating through the entire list is the only way to do it without sorting.

Alex DeLarge
A: 

Without sorting the list the only way to really do it is to iterate through the whole list and save the highest two numbers. I think you'd be better off sorting the list.

jaltiere
Sorting is O(n log n)..
BlueRaja - Danny Pflughoeft
Better off .. why exactly?
kaloyan
I think he just means it saves human time :)
Po
I'm making the assumption that this list of numbers is going to change at some point. If your list is sorted you don't have to iterate every time to find the maximum numbers. If you take it a step further and store your data structure in some sort of binary tree, you will be able to pull the maximum numbers extremely fast. (as well as any other operations you want, such as checking for duplicates, etc)
jaltiere
+1  A: 

This will work, but I don't know if you want to retain the items in the list:

max1 = max(myList)
myList.remove(max1)
max2 = max(myList)

If you do, you can do this:

max1 = max(myList)
idx1 = myList.index(max1)
myList.pop(idx1)

max2 = max(myList)
myList.insert(idx1,max1)
Jeff B
Both of these iterate over `myList` three times, where only one iteration is really necessary.
Thomas Wouters
I understand. This requires less algorithm however, and sometimes that's what people want. If it were me, I would sort or traverse the list. Those methods have already been covered several times in the answers.
Jeff B
@Thomas, Who cares?
Mike Graham
@Jeff, Sorting would scale worse than this.
Mike Graham
Yeah, I guess O(3n) is better than O(n log n). :) Traversing the array, retaining maximum values along the way is O(n), which == O(3n). In real time, they may differ by a factor of 3 worst case, but in terms of scalability, they are equivalent.
Jeff B
Why did I get a down vote on this? It is a valid answer. May not be the best answer, but it is valid.
Jeff B
A: 

The second highest item is a fairly simple case, but for the kth highest item what you want is a selection algorithm. That page is pretty thorough so it's probably best just to read that.

aspo
+5  A: 

A really slick way is to use heapq. Heapify the array (O(n)), then just pop an many elements that you need (log(n)). (Saw this question in an interview once, good question to keep in mind.)

zdav
+1. This is a good approach. Only thing is that the heap serves as a priority queue. It's not sorted and the only reliable thing is the top of the queue will be the largest. So, after popping that out, you'll have to heapify the queue once again(log(n) like zdav's mentioned). It doesn't change the asymptotic time but one should be aware of this.
Noufal Ibrahim
Agreed @Noufal, if you just use heappop, the re-heapifying is taken care of.
zdav
I like the answer, but saying that this is not sorted is a bit of a stretch. Certainly, this is not linearly sorted, but there is a degree of sorting going on.
Jeff B
If you knew about heapq, you should've known about: http://docs.python.org/library/heapq.html#heapq.nlargest
FogleBird
@Jeff B: finding the first (or equivalently the last) M elements of the sort of a list necessarily involves some sorting. This way is more up-front about it than the "simple" hand-coded loop, but unfortunately it also does more *of* it, and takes more space also...
SamB
Very right @FogleBird, I forgot all about `nlargest`, that function makes the solution a one-liner!
zdav
@FogleBird. I haven't used heapq myself. I was making a statement about heaps as a data structure. Your point is very valid.
Noufal Ibrahim
+10  A: 

In Python, use heapq.nlargest. This is the most flexible approach in case you ever want to handle more than just the top two elements.

Here's an example.

>>> import heapq
>>> import random
>>> x = range(100000)
>>> random.shuffle(x)
>>> heapq.nlargest(2, x)
[99999, 99998]

Documentation: http://docs.python.org/library/heapq.html#heapq.nlargest

FogleBird
A: 

The best time you can expect is linear, since you have to at least look through all the elements.

Here is my pseudocode to solve the problem:

//assume list has at least 2 elements
(max, nextMax) = if (list[0] > list[1])
                 then (list[0], list[1])
                 else (list[1], list[0])

for (2 <= i < length) {
    (max, nextMax) = if       (max < list[i])     => (list[i], max)
                     elseif   (nextMax < list[i]) => (max, list[i])
                     else     (no change)         => (max, nextMax)
}

return (max, nextMax)
Po
+2  A: 

JacobM's answer is absolutely the way to go. However, there are a few things to keep in mind while implementing what he described. Here's a little play-along-at-home tutorial to guide you through the trickier parts of solving this problem.

If this code is meant for production use, please use one of the more efficient/concise answers listed. This answer is targetted at someone new to programming.

The idea

The idea is simple.

  • Keep two variables: largest and second_largest.
  • Go through the list.
    • If an item is greater than largest, assign it to largest.
    • If an item is greater than second_largest, but less than largest, assign it to second_largest.

Getting started

Let's start.

def two_largest(inlist):
    """Return the two largest items in the sequence. The sequence must
    contain at least two items."""
    for item in inlist:
        if item > largest:
            largest = item
        elif largest > item > second_largest:
            second_largest = item
    # Return the results as a tuple
    return largest, second_largest

# If we run this script, it will should find the two largest items and
# print those
if __name__ == "__main__":
    inlist = [3, 2, 1]
    print two_largest(inlist)

Okay, we now have JacobM's answer as a Python function. What happens when we try to run it?

Traceback (most recent call last):
  File "twol.py", line 10, in <module>
    print two_largest(inlist)
  File "twol.py", line 3, in two_largest
    if item > largest:
UnboundLocalError: local variable 'largest' referenced before assignment

Apparently, we need to set largest before we start the loop. This probably means we should set second_largest too.

Initializing variables

Let's set largest and second_largest to 0.

def two_largest(inlist):
    """Return the two largest items in the sequence. The sequence must
    contain at least two items."""
    largest = 0 # NEW!
    second_largest = 0 # NEW!
    for item in inlist:
        if item > largest:
            largest = item
        elif largest > item > second_largest:
            second_largest = item
    # Return the results as a tuple
    return largest, second_largest

# If we run this script, it will should find the two largest items and
# print those
if __name__ == "__main__":
    inlist = [3, 2, 1]
    print two_largest(inlist)

Good. Let's run it.

(3, 2)

Great! Now let's test with inlist being [1, 2, 3]

    inlist = [1, 2, 3] # CHANGED!

Let's try it.

(3, 0)

...Uh oh.

Fixing the logic

The largest value (3) seems correct. The second-largest value is completely wrong though. What's going on?

Let's work through what the function is doing.

  • When we start, largest is 0 and second_largest is also 0.
  • The first item in the list we look at is 1, so largest becomes 1.
  • The next item is 2, so largest becomes 2.

But what about second_largest?

When we assign a new value to largest, the largest value actually becomes second-largest. We need to show that in the code.

def two_largest(inlist):
    """Return the two largest items in the sequence. The sequence must
    contain at least two items."""
    largest = 0
    second_largest = 0
    for item in inlist:
        if item > largest:
            second_largest = largest # NEW!
            largest = item
        elif largest > item > second_largest:
            second_largest = item
    # Return the results as a tuple
    return largest, second_largest

# If we run this script, it will should find the two largest items and
# print those
if __name__ == "__main__":
    inlist = [1, 2, 3]
    print two_largest(inlist)

Let's run it.

(3, 2)

Fantastic.

Initializing variables, part 2

Now let's try it with a list of negative numbers.

    inlist = [-1, -2, -3] # CHANGED!

Let's run it.

(0, 0)

That's not right at all. Where did these zeroes come from?

It turns out that the starting values for largest and second_largest were actually larger than all the items in the list. The first thing you might consider is setting largest and second_largest to the lowest values possible in Python. Unfortunately, Python doesn't have a smallest possible value. That means that, even if you set both of them to -1,000,000,000,000,000,000, you can have a list of values smaller than that.

So what's the best thing to do? Let's try setting largest and second_largest to the first and second items in the list. Then, to avoid double-counting any items in the list, we only look at the part of the list after the second item.

def two_largest(inlist):
    """Return the two largest items in the sequence. The sequence must
    contain at least two items."""
    largest = inlist[0] # CHANGED!
    second_largest = inlist[1] # CHANGED!
    # Only look at the part of inlist starting with item 2
    for item in inlist[2:]: # CHANGED!
        if item > largest:
            second_largest = largest
            largest = item
        elif largest > item > second_largest:
            second_largest = item
    # Return the results as a tuple
    return largest, second_largest

# If we run this script, it will should find the two largest items and
# print those
if __name__ == "__main__":
    inlist = [-1, -2, -3]
    print two_largest(inlist)

Let's run it.

(-1, -2)

Great! Let's try with another list of negative numbers.

    inlist = [-3, -2, -1] # CHANGED!

Let's run it.

(-1, -3)

Wait, what?

Initializing variables, part 3

Let's step through our logic again.

  • largest is set to -3
  • second_largest is set to -2

Wait right there. Already, this seems wrong. -2 is larger than -3. Is this what caused the problem? Let's continue.

  • largest is set to -1; second_largest is set to the old value of largest, which is -3

Yes, that looks to be the problem. We need to ensure that largest and second_largest are set correctly.

def two_largest(inlist):
    """Return the two largest items in the sequence. The sequence must
    contain at least two items."""
    if inlist[0] > inlist[1]: # NEW
        largest = inlist[0]
        second_largest = inlist[1]
    else: # NEW
        largest = inlist[1] # NEW
        second_largest = inlist[0] # NEW
    # Only look at the part of inlist starting with item 2
    for item in inlist[2:]:
        if item > largest:
            second_largest = largest
            largest = item
        elif largest > item > second_largest:
            second_largest = item
    # Return the results as a tuple
    return largest, second_largest

# If we run this script, it will should find the two largest items and
# print those
if __name__ == "__main__":
    inlist = [-3, -2, -1]
    print two_largest(inlist)

Let's run it.

(-1, -2)

Excellent.

Conclusion

So here's the code, nicely commented and formatted. It's also had all the bugs I could find beaten from it. Enjoy.

However, assuming this really is a homework question, I hope you get some useful experience from seeing an imperfect piece of code slowly improved. I hope some of these techniques will be useful in future programming assignments.


Efficiency

Not very efficient. But for most purposes, it should be okay: on my computer (Core 2 Duo), a list of 100 000 items can be processed in 0.27 seconds (using timeit, averaged over 100 runs).

Wesley
It's a shame you wrote a diatribe while there is `heapq.nlargest` :)
ΤΖΩΤΖΙΟΥ
heapq sure might be the way to go in production code, but it's much easier to explain a for loop to a new programmer than a heap data structure. Hopefully some new programmer finds this useful someday. Maybe.
Wesley
OK, now what if I want the top 3? Or 4? This approach is naive and bug-prone (as you've shown).
FogleBird
If your goal is to teach, we should be explaining heaps and why they are good for this task.
FogleBird
I agree; that would be best.
Wesley
A: 

i don't think that Wesley solution is good. I think, instead, that is horrible.(maybe is useful to teach how to fix a code)

Simple is better than complex.

we have a lots of solutions, and we have to take the simpler. the use of 'for' or 'while' on a list of 100.000 integers is a folly

ant
You're absolutely right. I wrote that answer with a very educational tone. A more concise answer is definitely possible. A _much_ more efficient answer is also possible, given more context for the problem.
Wesley
A: 

"2 highest" is impossible; only one item can be "highest". Perhaps you mean "highest 2". In any case, you need to say what to do when the list contains duplicates. What do you want from [8, 9, 10, 10]: (10, 9) or (10, 10)? If your response is (10, 10), please consider input of [8, 9, 10, 10, 10]. What are you going to do with the "highest two" when you've got them? Please edit your question to give this guidance.

In the meantime, here's an answer that takes the first approach (two unique values):

largest = max(inlist)
second_largest = max(item for item in inlist if item < largest)

You should add guards against fewer than 2 unique values in the list.

John Machin