views: 83

answers: 3
I am trying to create a 'mask' of a numpy.array by specifying certain criteria. Python even has nice syntax for something like this:

>>> import numpy
>>> A = numpy.array([1,2,3,4,5])
>>> A > 3
array([False, False, False,  True,  True])

But if I have a list of acceptable values instead of a single comparison:

>>> A = numpy.array([1,2,3,4,5])
>>> crit = [1,3,5]

I can't do this:

>>> A in crit

I have to do something based on list comprehensions, like this:

>>> numpy.array([a in crit for a in A])
array([ True, False,  True, False,  True])

Which is correct.

Now, the problem is that I am working with large arrays and the above code is very slow. Is there a more natural way of doing this operation that might speed it up?

EDIT: I was able to get a small speedup by making crit into a set.
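
A minimal sketch of that set-based variant (crit_set is just a name introduced here): the membership test becomes O(1), but the comprehension is still a Python-level loop, so the gain is modest.

import numpy

A = numpy.array([1,2,3,4,5])
crit_set = set([1,3,5])               # set membership tests are O(1)
mask = numpy.array([a in crit_set for a in A])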

EDIT2: For those who are interested:

Jouni's approach: 1000 loops, best of 3: 102 µs per loop

numpy.in1d: 1000 loops, best of 3: 1.33 ms per loop

EDIT3: Just tested again with B = randint(10,size=100)

Jouni's approach: 1000 loops, best of 3: 2.96 ms per loop

numpy.in1d: 1000 loops, best of 3: 1.34 ms per loop

Conclusion: Use numpy.in1d() unless B is very small.
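
For reference, a rough sketch of how timings like these could be reproduced with the standard timeit module (the size of A is an assumption; only B's size is stated above):

import timeit

setup = """
import numpy
from numpy.random import randint
A = randint(10, size=10000)   # size of A assumed; only B's size is given above
B = randint(10, size=100)
"""

jouni = """
mask = numpy.zeros(A.shape, dtype=bool)
for t in B:
    mask = mask | (A == t)
"""

# seconds per loop, 1000 loops each
print(min(timeit.repeat(jouni, setup, number=1000)) / 1000)
print(min(timeit.repeat("numpy.in1d(A, B)", setup, number=1000)) / 1000)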

A: 

Create a mask and use the compress method of the numpy array. It should be much faster. If you have more complex criteria, remember to build the mask with element-wise operations on the array.

a = numpy.array([3,1,2,4,5])
mask = a > 3
b = a.compress(mask)

or

a = numpy.random.random_integers(1,5,100000)
c = a.compress((a<=4)*(a>=2))    ## numbers n with 2 <= n <= 4
d = a.compress(~((a<=4)*(a>=2))) ## numbers n with n > 4 or n < 2
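
For what it's worth, the same selection can also be written with plain boolean indexing, which is the more common idiom; a quick sketch:

import numpy

a = numpy.random.random_integers(1, 5, 100000)
mask = (a >= 2) & (a <= 4)
c = a[mask]     ## numbers n with 2 <= n <= 4
d = a[~mask]    ## numbers n with n < 2 or n > 4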

OK, if you want a mask marking which elements of a are in [1,3,5], you can do something like

a = numpy.random.random_integers(1,5,100000)
mask = (a==1) + (a==3) + (a==5)

or

a = numpy.random.random_integers(1,5,100000)
mask = numpy.zeros(len(a), dtype=bool)
for num in [1,3,5]:
    mask += (a==num)
jimbob
I don't think that this is what I'm looking for. I don't want to get the actual contents of the array back, I just want to get a boolean mask that has the same length as the original array.
aduric
OK, edited it now that I know what you want. I guess Jouni's solution, which he came up with while I was editing mine, is equivalent: True + True = True, True + False = True, False + False = False, exactly the same as using |.
jimbob
+3  A: 

Combine several comparisons with element-wise "or" (the | operator):

A = randint(10,size=10000)
mask = (A == 1) | (A == 3) | (A == 5)

Or if you have a list B and want to create the mask dynamically:

B = [1, 3, 5]
mask = zeros((10000,),dtype=bool)
for t in B: mask = mask | (A == t)
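
As an aside, the same dynamic construction can be written as a single expression with functools.reduce and operator.or_; a minimal sketch:

import operator
from functools import reduce
import numpy
from numpy.random import randint

A = randint(10, size=10000)
B = [1, 3, 5]
# fold element-wise | over all comparisons, starting from an all-False mask
mask = reduce(operator.or_, (A == t for t in B), numpy.zeros(A.shape, dtype=bool))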
Jouni K. Seppänen
@Jouni - just wondering why, or how to anticipate when, `numpy` will naturally do this `ufunc`-enabled element-wise logical operation? When doing logical operations `numpy` sometimes throws an exception: `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().`
dtlussier
@Jouni this is certainly the fastest approach, albeit not the cleanest one.
aduric
+3  A: 

I think that the numpy function in1d is what you are looking for:

>>> A = numpy.array([1,2,3,4,5])
>>> B = [1,3,5]
>>> numpy.in1d(A,B)
array([ True, False,  True, False,  True], dtype=bool)

as stated in its docstring, "in1d(a, b) is roughly equivalent to np.array([item in b for item in a])"

Admittedly, I haven't done any speed tests, but it sounds like what you are looking for.
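
As a side note, newer numpy releases (1.13 and later, if memory serves) also provide numpy.isin(), which covers the same use case and preserves the shape of A; a quick sketch:

import numpy

A = numpy.array([1,2,3,4,5])
B = [1,3,5]
mask = numpy.isin(A, B)   # array([ True, False,  True, False,  True])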

Another faster way

Here's another way to do it which is faster. Turn B (the array containing the elements you are looking for in A) into a numpy array, sort it, and then do:

B[B.searchsorted(A)] == A

though if you have elements in A that are larger than the largest in B, you will need to do:

inds = B.searchsorted(A)
inds[inds == len(B)] = 0
mask = B[inds] == A

It may not be faster for small arrays (especially when B is small), but before long it will definitely be faster. Why? Because this is an O(N log M) algorithm, where N is the number of elements in A and M is the number of elements in B, while putting together a bunch of individual masks is O(N * M). I tested it with N = 10000 and M = 14 and it was already faster. Anyway, just thought that you might like to know, especially if you are truly planning on using this on very large arrays.
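
Putting those steps together, a self-contained sketch of the approach (the helper name is made up here):

import numpy

def in_sorted(A, B):
    # boolean mask of which elements of A appear in B, via binary search
    B = numpy.sort(numpy.asarray(B))   # B must be a sorted numpy array
    inds = B.searchsorted(A)           # O(N log M) lookups
    inds[inds == len(B)] = 0           # guard elements of A larger than B.max()
    return B[inds] == A

A = numpy.random.randint(0, 10, 10000)
mask = in_sorted(A, [1, 3, 5])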

Justin Peel
looks like a recent addition to numpy (wasn't in version 1.3)
bpowah
You are right. I only tested on B having a length of 3. If B is also large, numpy.in1d() definitely scales a lot better.
aduric
@aduric and my second method is even faster than in1d.
Justin Peel