If I create a recarray in this way:

In [29]: np.rec.fromrecords([(1,'hello'),(2,'world')],names=['a','b'])

The result looks fine:

Out[29]: 
rec.array([(1, 'hello'), (2, 'world')], 
      dtype=[('a', '<i8'), ('b', '|S5')])

But if I want to specify the data types:

In [32]: np.rec.fromrecords([(1,'hello'),(2,'world')],dtype=[('a',np.int8),('b',np.str)])

The string is set to a length of zero:

Out[32]: 
rec.array([(1, ''), (2, '')], 
      dtype=[('a', '|i1'), ('b', '|S0')])

I need to specify data types for the numerical columns since I care about int8/16/32, etc., but I would like to keep the automatic string length detection that works when I don't specify data types. I tried replacing np.str with None, but no luck. I know I can specify '|S5', for example, but I don't know in advance what the string length should be.

A: 

I don't know how to ask numpy to determine some aspects of a dtype for you but not others, but couldn't you do, e.g.:

data = [(1,'hello'),(2,'world')]
# The widest string in the second column sets the field width
dlen = max(len(s) for i, s in data)
st = '|S%d' % dlen
np.rec.fromrecords(data, dtype=[('a',np.int8), ('b',st)])
Alex Martelli
Since I'm working with converting arbitrary lists of tuples to recarrays, this isn't an ideal solution (as in, I don't know in advance which columns are going to be strings). Of course, I can search for the string length manually, but I was hoping to be able to avoid that.
astrofrog
If you don't know in advance which columns are strings, how come you DO know which ones are int8 vs int16 vs int32, since you say you need to control _that_ "manually"?! The point is, you can either do your own discovery of types and sizes, just let numpy do it all, or let numpy do it (on part or all of the data) and then overrule its opinions by reparsing the data with a different dtype -- I'm not sure what further option you're yearning for (you say you want to control the types of some columns but not others, BUT you don't know which ones in advance?!)
Alex Martelli
I'm sorry, you are correct -- I meant that the solution you suggested is too simple, as it only works for two-element tuples with the string column in second place. Of course, I can just code a loop over the columns and, for the ones I know contain strings, find the maximum length. I was just hoping to avoid duplicating code which might already be in Numpy.
astrofrog
@Morgoth, then let numpy.recarray do its work, alter the dtype to be as you wish, and run numpy.recarray again with your newly determined dtype -- what better way to repurpose "the code that is already in numpy"?
Alex Martelli
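
A minimal sketch of the two-pass approach Alex describes above, assuming the example data from the question (the names auto and fields are made up for illustration):

import numpy as np

data = [(1,'hello'),(2,'world')]

# First pass: let numpy auto-detect everything, including string widths.
auto = np.rec.fromrecords(data, names=['a','b'])

# Second pass: keep the detected string dtype, overrule the integer widths.
fields = []
for name in auto.dtype.names:
    T = auto.dtype[name]
    fields.append((name, np.int8 if np.issubdtype(T, np.integer) else T))

result = np.rec.fromrecords(data, dtype=fields)
# result.dtype should now be [('a', 'i1'), ('b', 'S5')] (or 'U5' on Python 3)
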
A: 

If you don't need to manipulate the strings as bytes, you may use the object data-type to represent them. This essentially stores a pointer instead of the actual bytes:

In [38]: np.array(data, dtype=[('a', np.uint8), ('b', np.object)])
Out[38]: 
array([(1, 'hello'), (2, 'world')], 
      dtype=[('a', '|u1'), ('b', '|O8')])
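
The same idea should carry over to the rec.fromrecords call from the question, e.g. (a sketch; on newer numpy, use the builtin object, since np.object is deprecated):

import numpy as np

data = [(1,'hello'),(2,'world')]

# An object field stores a reference per row, so no string width is needed.
r = np.rec.fromrecords(data, dtype=[('a', np.int8), ('b', object)])

The trade-off is that the string field is no longer a fixed-width block of bytes, so anything that relies on a fixed itemsize won't work on it.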

Alternatively, Alex's idea would work well:

data = [(1,'hello'),(2,'world')]

# First pass: let numpy auto-detect the dtype, including the string widths.
dt = np.rec.fromrecords(data).dtype

new_dt = []

# For each field, determine whether it is an integer type.
# If so, represent it as a byte; otherwise keep the detected type.
for f in dt.names:
    T = dt[f]
    if np.issubdtype(T, np.integer):
        new_dt.append((f, np.uint8))
    else:
        new_dt.append((f, T))

new_dt = np.dtype(new_dt)
np.array(data, dtype=new_dt)

which should yield

array([(1, 'hello'), (2, 'world')], 
      dtype=[('f0', '|u1'), ('f1', '|S5')])
Stefan van der Walt