Suppose that I have an array defined by:

data = np.array([('a1v1', 'a2v1', 'a3v1', 'a4v1', 'a5v1'),
       ('a1v1', 'a2v1', 'a3v1', 'a4v2', 'a5v1'),
       ('a1v3', 'a2v1', 'a3v1', 'a4v1', 'a5v2'),
       ('a1v2', 'a2v2', 'a3v1', 'a4v1', 'a5v2'),
       ('a1v2', 'a2v3', 'a3v2', 'a4v1', 'a5v2'),
       ('a1v2', 'a2v3', 'a3v2', 'a4v2', 'a5v1'),
       ('a1v3', 'a2v3', 'a3v2', 'a4v2', 'a5v2'),
       ('a1v1', 'a2v2', 'a3v1', 'a4v1', 'a5v1'),
       ('a1v1', 'a2v3', 'a3v2', 'a4v1', 'a5v2'),
       ('a1v2', 'a2v2', 'a3v2', 'a4v1', 'a5v2'),
       ('a1v1', 'a2v2', 'a3v2', 'a4v2', 'a5v2'),
       ('a1v3', 'a2v2', 'a3v1', 'a4v2', 'a5v2'),
       ('a1v3', 'a2v1', 'a3v2', 'a4v1', 'a5v2'),
       ('a1v2', 'a2v2', 'a3v1', 'a4v2', 'a5v1')],
      dtype=[('a1', '|S4'), ('a2', '|S4'), ('a3', '|S4'),
             ('a4', '|S4'), ('a5', '|S4')])

How can I create a function that lists the rows of data matching the conditions given in a list of tuples, r?

r = [('a1', 'a1v1'), ('a4', 'a4v1')]

I know that it can be done manually like this:

data[(data['a1']=='a1v1') & (data['a4']=='a4v1')]

What about removing the rows of data that match r?

data[(data['a1']!='a1v1') | (data['a4']!='a4v1')]

Thanks.

+1  A: 

If I'm understanding you correctly, you want to list each entire row where a given set of columns is equal to a given tuple of values. In that case, this should be what you want, though it's a bit verbose and obscure:

test_cols = data[['a1', 'a4']]
test_vals = np.array(('a1v1', 'a4v1'), test_cols.dtype)
data[test_cols == test_vals]

Note the "nested list" style indexing... That's the easiest way to select multiple columns of a structured array. E.g.

data[['a1', 'a4']] 

will yield

array([('a1v1', 'a4v1'), ('a1v1', 'a4v2'), ('a1v3', 'a4v1'),
       ('a1v2', 'a4v1'), ('a1v2', 'a4v1'), ('a1v2', 'a4v2'),
       ('a1v3', 'a4v2'), ('a1v1', 'a4v1'), ('a1v1', 'a4v1'),
       ('a1v2', 'a4v1'), ('a1v1', 'a4v2'), ('a1v3', 'a4v2'),
       ('a1v3', 'a4v1'), ('a1v2', 'a4v2')], 
      dtype=[('a1', '|S4'), ('a4', '|S4')])

You can then test this against a tuple of the values that you're checking for and get a one-dimensional boolean array that is True where those columns are equal to those values.

However, with structured arrays, the dtype has to be an exact match. E.g. data[['a1', 'a4']] == ('a1v1', 'a4v1') just yields False, so we have to make an array of the values we want to test using the same dtype as the columns we're testing against. Thus, we have to do something like:

test_cols = data[['a1', 'a4']]
test_vals = np.array(('a1v1', 'a4v1'), test_cols.dtype)

before we can do this:

data[test_cols == test_vals]

Which yields what we were originally after:

array([('a1v1', 'a2v1', 'a3v1', 'a4v1', 'a5v1'),
       ('a1v1', 'a2v2', 'a3v1', 'a4v1', 'a5v1'),
       ('a1v1', 'a2v3', 'a3v2', 'a4v1', 'a5v2')], 
      dtype=[('a1', '|S4'), ('a2', '|S4'), ('a3', '|S4'), ('a4', '|S4'), ('a5', '|S4')])
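To turn this into a function that takes the list r directly, and to cover the "remove the matching rows" case as well, something like the following sketch might work. Note that it swaps in a different technique from the structured comparison above: it compares one field at a time and ANDs the results together. The names condition_mask, select_rows and drop_rows are made up for illustration, and the small demo array uses a unicode ('U4') dtype instead of '|S4' as an assumption, so that plain string literals compare cleanly.

```python
import numpy as np

def condition_mask(data, r):
    """Boolean mask that is True where every (field, value) pair in r matches."""
    mask = np.ones(data.shape, dtype=bool)
    for name, value in r:
        # Compare a single field at a time and AND it into the running mask.
        mask &= (data[name] == value)
    return mask

def select_rows(data, r):
    """Rows of data that satisfy all conditions in r."""
    return data[condition_mask(data, r)]

def drop_rows(data, r):
    """Rows of data that do NOT satisfy all conditions in r."""
    return data[~condition_mask(data, r)]

# Small stand-in for the array in the question (assumed 'U4' dtype).
demo = np.array([('a1v1', 'a4v1'), ('a1v1', 'a4v2'), ('a1v2', 'a4v1')],
                dtype=[('a1', 'U4'), ('a4', 'U4')])
r = [('a4', 'a4v1'), ('a1', 'a1v1')]

kept = select_rows(demo, r)   # rows matching both conditions
rest = drop_rows(demo, r)     # everything else
```

Because each condition is looked up by field name, the order of the pairs in r should not matter here.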

Hope that makes some sense, anyway...

Joe Kington
Thanks Joe Kington.
Selinap
What if r is not in the order of the data dtype? For example, r = [('a4', 'a4v1'), ('a1', 'a1v1')].
Selinap
Yeah, that's one of the gotchas with this method. The columns have to be listed in the same order as the dtype. (Or, rather, they will be returned in the order of the dtype, regardless of the order in which they're listed.) I think this is just a design limitation of structured arrays... There was a patch for it posted to the mailing list ( http://www.mail-archive.com/[email protected]/msg24453.html ) but it apparently never made it into the trunk version of numpy...
Joe Kington
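One way to sidestep the ordering issue is to sort the conditions into dtype order before building test_cols and test_vals. This is only a sketch: order_conditions is a hypothetical helper, not part of numpy, and it assumes every field name in r actually exists in the dtype.

```python
import numpy as np

def order_conditions(r, names):
    """Return (fields, values) from r, sorted into the order given by names."""
    # names is expected to be data.dtype.names, a tuple of field names.
    ordered = sorted(r, key=lambda pair: names.index(pair[0]))
    fields = [name for name, _ in ordered]
    values = tuple(value for _, value in ordered)
    return fields, values

# Same field layout as the array in the question.
dtype = np.dtype([('a1', 'S4'), ('a2', 'S4'), ('a3', 'S4'),
                  ('a4', 'S4'), ('a5', 'S4')])
r = [('a4', 'a4v1'), ('a1', 'a1v1')]

fields, values = order_conditions(r, dtype.names)
```

The resulting fields list and values tuple can then be used in place of ['a1', 'a4'] and ('a1v1', 'a4v1') in the answer's code.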