Hi,

I have a numpy array:

A = array([['id1', '1', '2', 'NaN'],
           ['id2', '2', '0', 'NaN']])

I also have a list:

li = ['id1', 'id3', 'id6']

I want to go through the array and delete every row whose first element is not in the list.

My code to date:

from numpy import *

for row in A:
    if row[0] not in li:
        delete(A, row, axis = 0)

This returns the following error:

ValueError: invalid literal for int() with base 10: 'NaN'

Every element in each row is a str, so I do not understand why the error mentions int().

Any suggestions?

Thanks, S ;-)

+5  A: 

Is simply generating a new array not an option?

numpy.array([x for x in A if x[0] in li])
atomocopter
Yes, much simpler than my solution!
eumiro
I think the original poster wanted to retain the rows where `row[0]` was in `li`, so you need to remove `not` from the condition in your list comprehension.
dtlussier
@dtlussier: thanks for pointing out my mistake. :)
atomocopter
+2  A: 

It appears you want to delete a row of your array in place. However, this is not possible with the np.delete function, because such an operation goes against the way Python and NumPy manage memory.

I found an interesting post on the Numpy mailing list (Travis Oliphant, [Numpy-discussion] Deleting a row from a matrix) where the np.delete function is first discussed:

So, "in-place" deletion of array objects would not be particularly useful, because it would only work for arrays with no additional reference counts (i.e. simple b=a assignment would increase the reference count and make it impossible to say del a[obj]).

....

But, the problem with both of those approaches is that once you start removing arbitrary rows (or n-1 dimensional sub-spaces) from an array you very likely will no longer have a chunk of memory that can be described using the n-dimensional array memory model.
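As a rough illustration of the memory-model point in that quote (this sketch is not from the mailing-list post, and the array `a` is made up for the example): regularly spaced rows can be described by strides over the same memory, while an arbitrary selection of rows cannot, so NumPy has to copy.

```python
import numpy as np

a = np.arange(12).reshape(4, 3)

# Regularly spaced rows fit the strided memory model, so this is a view.
every_other = a[::2]
print(np.shares_memory(a, every_other))  # True

# An arbitrary set of rows has no regular stride, so NumPy makes a copy.
arbitrary = a[[0, 1, 3]]
print(np.shares_memory(a, arbitrary))  # False
```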

If you take a look at the documentation for np.delete (http://docs.scipy.org/doc/numpy/reference/generated/numpy.delete.html), you can see that the function returns a new array with the desired parts (not necessarily rows) deleted.

Definition:       np.delete(arr, obj, axis=None)
Docstring:
Return a new array with sub-arrays along an axis deleted.

Parameters
----------
arr : array_like
  Input array.
obj : slice, int or array of ints
  Indicate which sub-arrays to remove.
axis : int, optional
  The axis along which to delete the subarray defined by `obj`.
  If `axis` is None, `obj` is applied to the flattened array.

Returns
-------
out : ndarray
    A copy of `arr` with the elements specified by `obj` removed. Note
    that `delete` does not occur in-place. If `axis` is None, `out` is
    a flattened array.
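
To make the `obj` parameter concrete: it takes integer positions, not the contents of the row. That is presumably why passing a row of strings made NumPy attempt an int() conversion. A minimal sketch, reusing the array from the question (the name `B` is only for illustration):

```python
import numpy as np

A = np.array([['id1', '1', '2', 'NaN'],
              ['id2', '2', '0', 'NaN']])

# obj is the integer position of the row to drop, not the row itself.
B = np.delete(A, 1, axis=0)

print(B.shape)  # (1, 4): one row left in the returned copy
print(A.shape)  # (2, 4): the original array is untouched
```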

So, in your case I think you'll want to do something like:

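import numpy as np

A = np.array([['id1', '1', '2', 'NaN'],
              ['id2', '2', '0', 'NaN']])

li = ['id1', 'id3', 'id6']

# Collect the indices of the rows whose ID is not in li, then delete them
# in a single call. Deleting inside the loop would shift the indices of
# the rows that come after each deleted row.
rows_to_drop = [i for i, row in enumerate(A) if row[0] not in li]
A = np.delete(A, rows_to_drop, axis=0)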

A is now cut down as you wanted, but remember that it is a new array in a new piece of memory: every call to np.delete allocates a new array, and the name A is rebound to it.

I'm sure there is a better, vectorized way (maybe using masked arrays?) to find out which rows to delete, but I couldn't put one together. If anyone has one, please comment!
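
One possible vectorized approach, as a sketch rather than a definitive answer: build a boolean mask with np.isin and index with it. This assumes NumPy 1.13 or later, where np.isin is available; the names `keep` and `filtered` are only for illustration.

```python
import numpy as np

A = np.array([['id1', '1', '2', 'NaN'],
              ['id2', '2', '0', 'NaN']])
li = ['id1', 'id3', 'id6']

# True for each row whose first column appears in li.
keep = np.isin(A[:, 0], li)

# Boolean indexing returns a new array holding only the kept rows.
filtered = A[keep]
print(filtered)
```

Like np.delete, this produces a new array instead of modifying A in place.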

dtlussier