If I create a recarray in this way:

In [29]: np.rec.fromrecords([(1,'hello'),(2,'world')],names=['a','b'])

The result looks fine:

Out[29]: 
rec.array([(1, 'hello'), (2, 'world')], 
      dtype=[('a', '<i8'), ('b', '|S5')])

But if I want to specify the data types:

In [32]: np.rec.fromrecords([(1,'hello'),(2,'world')],dtype=[('a',np.int8),('b',np.str)])

The string is set to a length of zero:

Out[32]: 
rec.array([(1, ''), (2, '')], 
      dtype=[('a', '|i1'), ('b', '|S0')])

I need to specify data types for the numerical columns since I care about int8/16/32, etc., but I would like to keep the automatic string length detection that works when I don't specify data types. I tried replacing np.str with None, but no luck. I know I can specify '|S5', for example, but I don't know in advance what the string length should be.

A: 

I don't know how to ask numpy to determine some aspects of a dtype for you but not others, but couldn't you do, e.g.:

data = [(1,'hello'),(2,'world')]
# The widest string in the second column sets the field width
dlen = max(len(s) for i, s in data)
st = '|S%d' % dlen
np.rec.fromrecords(data, dtype=[('a',np.int8), ('b',st)])
Alex Martelli
Since I'm working with converting arbitrary lists of tuples to recarrays, this isn't an ideal solution (as in, I don't know in advance which columns are going to be strings). Of course, I can search for the string length manually, but I was hoping to be able to avoid that.
astrofrog
If you don't know in advance which columns are strings, how come you DO know which ones are int8 vs int16 vs int32, since you say you need to control _that_ "manually"?! The point is, you can either do your own discovery of types and sizes, just let numpy do it all, or let numpy do it (on part or all of the data) and then overrule its opinions by reparsing the data with a different dtype -- I'm not sure what further option you're yearning for (you say you want to control the types of some columns but not others, BUT you don't know which ones in advance?!)
Alex Martelli
I'm sorry, you are correct -- I meant that the solution you suggested is too simple, as it only works for two-element tuples with the string column in second place. Of course, I can just code a loop over the columns and, for the ones I know contain strings, find the maximum length. I was just hoping to avoid duplicating code which might already be in Numpy.
astrofrog
@Morgoth, then let numpy.recarray do its work, alter the dtype to be as you wish, and run numpy.recarray again with your newly determined dtype -- what better way to repurpose "the code that is already in numpy"?
Alex Martelli
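
A minimal sketch of the two-pass approach Alex describes above, assuming the example data from the question (the names auto and fields are made up for illustration):

import numpy as np

data = [(1,'hello'),(2,'world')]

# First pass: let numpy auto-detect everything, including string widths.
auto = np.rec.fromrecords(data, names=['a','b'])

# Second pass: keep the detected string dtype, overrule the integer widths.
fields = []
for name in auto.dtype.names:
    T = auto.dtype[name]
    fields.append((name, np.int8 if np.issubdtype(T, np.integer) else T))

result = np.rec.fromrecords(data, dtype=fields)
# result.dtype should now be [('a', 'i1'), ('b', 'S5')] (or 'U5' on Python 3)
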
A: 

If you don't need to manipulate the strings as bytes, you may use the object data-type to represent them. This essentially stores a pointer instead of the actual bytes:

In [38]: np.array(data, dtype=[('a', np.uint8), ('b', np.object)])
Out[38]: 
array([(1, 'hello'), (2, 'world')], 
      dtype=[('a', '|u1'), ('b', '|O8')])
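
The same idea should carry over to the rec.fromrecords call from the question, e.g. (a sketch; on newer numpy, use the builtin object, since np.object is deprecated):

import numpy as np

data = [(1,'hello'),(2,'world')]

# An object field stores a reference per row, so no string width is needed.
r = np.rec.fromrecords(data, dtype=[('a', np.int8), ('b', object)])

The trade-off is that the string field is no longer a fixed-width block of bytes, so anything that relies on a fixed itemsize won't work on it.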

Alternatively, Alex's idea would work well:

data = [(1,'hello'),(2,'world')]

# First pass: let numpy auto-detect the dtype, including the string widths.
dt = np.rec.fromrecords(data).dtype

new_dt = []

# For each field, determine whether it is an integer type.
# If so, represent it as a byte; otherwise keep the detected type.
for f in dt.names:
    T = dt[f]
    if np.issubdtype(T, np.integer):
        new_dt.append((f, np.uint8))
    else:
        new_dt.append((f, T))

new_dt = np.dtype(new_dt)
np.array(data, dtype=new_dt)

which should yield

array([(1, 'hello'), (2, 'world')], 
      dtype=[('f0', '|u1'), ('f1', '|S5')])
Stefan van der Walt