So, let's say I have 100,000 float arrays with 100 elements each. I need the X highest values, but only if they are greater than Y. Every other element should be set to 0. What would be the fastest way to do this in Python? Order must be maintained. Most of the elements are already 0.

sample variables:

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

expected result:

array = [0, .25, 0, .15, .5, 0, 0, 0, 0, 0]
+3  A: 

The simplest way would be:

topX = sorted([x for x in array if x > lowValY], reverse=True)[highCountX-1]
print([x if x >= topX else 0 for x in array])

In pieces, this selects all the elements greater than lowValY:

[x for x in array if x > lowValY]

This list contains only the elements greater than the threshold. Next, sort it so the largest values are at the start:

sorted(..., reverse=True)

Then a list index takes the threshold for the top highCountX elements:

sorted(...)[highCountX-1]

Finally, the original array is filled out using another list comprehension:

[x if x >= topX else 0 for x in array]

There is a boundary condition when two or more equal elements tie for the highCountX-th highest value (the 3rd highest in your example). The resulting array will then contain that value more than once, so it can hold more than highCountX non-zero elements.

There are other boundary conditions as well, such as if len(array) < highCountX. Handling such conditions is left to the implementor.
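
A minimal sketch of one way to guard the second boundary condition, assuming that "fewer candidates than highCountX" should simply keep every candidate. The function name top_threshold_filter is hypothetical and not part of the original answer. It keeps the same threshold approach, so ties at the cutoff are still all kept.

```python
def top_threshold_filter(values, high_count, low_val):
    """Zero every element below the high_count-th largest value above low_val.

    Hypothetical helper: ties at the cutoff value are all kept.
    """
    if high_count <= 0:
        return [0] * len(values)
    # Candidates above the lower bound, largest first.
    candidates = sorted((x for x in values if x > low_val), reverse=True)
    if not candidates:
        # Nothing exceeds low_val, so everything is zeroed.
        return [0] * len(values)
    # With fewer than high_count candidates, fall back to the smallest
    # candidate instead of raising IndexError.
    cutoff = candidates[min(high_count, len(candidates)) - 1]
    return [x if x >= cutoff else 0 for x in values]
```

Called as top_threshold_filter(array, highCountX, lowValY) with the sample variables from the question, this returns a new list and leaves the original untouched.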

Greg Hewgill
You can pass a generator expression, (x for x in array if x > lowValY), instead of the list [x for x in array if x > lowValY], to iterate over the original array without copying it (worthwhile if the original data is quite large).
Abgan
That's true. `sorted()` will probably need the whole list anyway, though.
Greg Hewgill
Heh, 3x faster than my noob code, but I would need the equal elements to maintain the highCountX limit. The arrays should have anywhere from 20-200 elements... they are actually segments of a larger array that I process in chunks. Thanks for the help so far.
David
I can't see how you are zeroing elements in the original array.
J.F. Sebastian
If `highCountX > len([x for x in array if x > lowValY])` then you'll get IndexError.
J.F. Sebastian
This wouldn't work (IndexError) if the number of elements larger than lowValY is smaller than highCountX
ThisIsMeMoony
Sebastian was quick :P
ThisIsMeMoony
Yes, there are other boundary conditions. Error handling is left to the implementor, I have provided an outline of a possible solution.
Greg Hewgill
+1. Elegantly solved. N.B.: the last list comprehension only works with Python 2.5+ because of the ternary operation.
e-satis
+7  A: 

This is a typical job for NumPy, which is very fast for these kinds of operations:

import numpy

array_np = numpy.array(array)
low_values_indices = array_np < lowValY  # Where values are low
array_np[low_values_indices] = 0  # All low values set to 0

Now, if you only need the highCountX largest elements, you can even "forget" the small elements (instead of setting them to 0) and only sort the list of large elements:

array_np = numpy.array(array)
print(numpy.sort(array_np[array_np >= lowValY])[-highCountX:])

Of course, sorting the whole array if you only need a few elements might not be optimal. Depending on your needs, you might want to consider the standard heapq module.
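
Since the question involves many arrays of equal length, the two steps above can also be combined and applied to all rows at once. The following is only a sketch under assumptions: the data is stacked into one 2-D array, the function name keep_top_per_row is hypothetical, and np.put_along_axis requires a reasonably recent NumPy (1.15 or later).

```python
import numpy as np

def keep_top_per_row(data, high_count, low_val):
    """Zero all but the high_count largest values above low_val in each row."""
    data = np.asarray(data, dtype=float)
    if high_count <= 0:
        return np.zeros_like(data)
    # Column indices of the high_count largest entries in each row.
    top_idx = np.argsort(data, axis=1)[:, -high_count:]
    # Boolean mask marking those positions.
    keep = np.zeros(data.shape, dtype=bool)
    np.put_along_axis(keep, top_idx, True, axis=1)
    # Selected entries must also exceed the lower bound.
    keep &= data > low_val
    # Positions are untouched, so the original order is maintained.
    return np.where(keep, data, 0.0)
```

Because exactly high_count positions are selected per row, ties cannot push the number of non-zero elements above high_count. A full argsort is used for simplicity; np.argpartition may be faster when only a few elements are needed.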

EOL
Nice... using proper libraries can take you really far :-)
Abgan
I keep running into this numPy, guess I'll have to check it out :) Thanks for the help (everyone).
David
@David NumPy really fills a need. I would suggest that you start with the tutorial I linked to: it's probably the fastest way of getting up to speed with NumPy and learning its most important concepts.
EOL
A: 

Setting elements below some threshold to zero is easy:

array = [ x if x > threshold else 0.0 for x in array ]

(plus the occasional abs() if needed.)

The requirement of the N highest numbers is a bit vague, however. What if there are e.g. N+1 equal numbers above the threshold? Which one to truncate?

You could sort the array first, then set the threshold to the value of the Nth largest element (index N-1):

threshold = sorted(array, reverse=True)[N-1]
array = [ x if x >= threshold else 0.0 for x in array ]

Note: this solution is optimized for readability, not performance.

digitalarbeiter
In this case, it doesn't matter which one is truncated... what matters more is that highCountX is respected.
David
+1  A: 

Using numpy:

# assign zero to all elements less than or equal to `lowValY`
a[a<=lowValY] = 0 
# find n-th largest element in the array (where n=highCountX)
x = partial_sort(a, highCountX, reverse=True)[:highCountX][-1]
# zero everything below the n-th largest value
# NOTE: this might leave more than highCountX non-zero elements
#       if there are duplicates
a[a<x] = 0

Where partial_sort could be:

def partial_sort(a, n, reverse=False):
    #NOTE: in general it should return full list but in your case this will do
    return sorted(a, reverse=reverse)[:n]

The expression a[a<value] = 0 can be written without numpy as follows:

for i, x in enumerate(a):
    if x < value:
        a[i] = 0
J.F. Sebastian
+1  A: 

You can use map and lambda, it should be fast enough.

new_array = list(map(lambda x: x if x > lowValY else 0, array))
nnrcschmdt
A: 

Use a heap.

This runs in O(n*log(highCountX)) time.

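import heapq

array =  [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

# seed a min-heap with highCountX copies of the lower bound
heap = [lowValY] * highCountX

for x in array:
    # heap[0] is the smallest of the current top highCountX candidates
    if x > heap[0]:
        heapq.heapreplace(heap, x)  # pop the smallest, then push x

cutoff = heap[0]  # avoid shadowing the built-in min

# x > lowValY excludes values equal to the seed when the heap was never filled
array = [x if x >= cutoff and x > lowValY else 0 for x in array]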

With a heap of k elements, delete-min takes O(log(k)) and insertion takes O(log(k)) or O(1), depending on which heap type you use.

egon
didn't test the code syntax...
egon
+1  A: 

There's a special MaskedArray class in NumPy that does exactly that. You can "mask" elements based on any condition. This represents your need better than assigning zeroes: NumPy operations will ignore masked values when appropriate (for example, when computing the mean).

>>> from numpy import ma
>>> x = ma.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
>>> x1 = ma.masked_inside(x, 0, 0.1) # mask everything in 0..0.1 range
>>> x1
masked_array(data = [-- 0.25 -- 0.15 0.5 -- -- -- -- --],
         mask = [ True False True False False True True True True True],
   fill_value = 1e+20)
>>> print(x1.filled(0)) # Fill with zeroes
[ 0 0.25 0 0.15 0.5 0 0 0 0 0 ]

As an added benefit, masked arrays are well supported in the matplotlib visualisation library if you need this.

Docs on masked arrays in numpy

Alex Lebedev
A: 

Using a heap is a good idea, as egon says. But you can use the heapq.nlargest function to cut down on some effort:

import heapq 

array =  [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

threshold = max(heapq.nlargest(highCountX, array)[-1], lowValY)
array = [x if x >= threshold else 0 for x in array]
Matt Anderson
I like this homemade solution that only uses standard modules. However, it should be upgraded so as to really return the largest highCountX elements (if many elements in the array have value `threshold`, the final array has too many non-zero elements).
EOL
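
The upgrade suggested in the comment above could look like the following sketch, which selects positions rather than a threshold value so that ties cannot exceed the limit. The function name keep_exactly_top is hypothetical, and which of several tied elements survives is left to heapq.nlargest, which the asker said does not matter.

```python
import heapq

def keep_exactly_top(values, high_count, low_val):
    """Keep at most high_count of the largest values above low_val; zero the rest."""
    # Positions of the elements that pass the lower bound.
    candidates = [i for i, x in enumerate(values) if x > low_val]
    # Positions of the high_count largest candidates; with ties, only
    # high_count positions are selected.
    keep = set(heapq.nlargest(high_count, candidates, key=values.__getitem__))
    # Rebuild the list in the original order.
    return [x if i in keep else 0 for i, x in enumerate(values)]
```

This uses only the standard library and also copes with fewer than high_count candidates, since nlargest then returns all of them.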