I'm trying to convert a two-dimensional array into a structured array with named fields. I want each row in the 2D array to be a new record in the structured array. Unfortunately, nothing I've tried is working the way I expect.

I'm starting with:

>>> myarray = numpy.array([("Hello",2.5,3),("World",3.6,2)])
>>> print myarray
[['Hello' '2.5' '3']
 ['World' '3.6' '2']]

I want to convert to something that looks like this:

>>> newarray = numpy.array([("Hello",2.5,3),("World",3.6,2)], dtype=[("Col1","S8"),("Col2","f8"),("Col3","i8")])
>>> print newarray
[('Hello', 2.5, 3L) ('World', 3.6000000000000001, 2L)]

What I've tried:

>>> newarray = myarray.astype([("Col1","S8"),("Col2","f8"),("Col3","i8")])
>>> print newarray
[[('Hello', 0.0, 0L) ('2.5', 0.0, 0L) ('3', 0.0, 0L)]
 [('World', 0.0, 0L) ('3.6', 0.0, 0L) ('2', 0.0, 0L)]]

>>> newarray = numpy.array(myarray, dtype=[("Col1","S8"),("Col2","f8"),("Col3","i8")])
>>> print newarray
[[('Hello', 0.0, 0L) ('2.5', 0.0, 0L) ('3', 0.0, 0L)]
 [('World', 0.0, 0L) ('3.6', 0.0, 0L) ('2', 0.0, 0L)]]

Both of these approaches convert each entry in myarray into its own record with the given dtype, which is why the extra zeros appear. I can't figure out how to get it to convert each row into a single record.

Another attempt:

>>> newarray = myarray.copy()
>>> newarray.dtype = [("Col1","S8"),("Col2","f8"),("Col3","i8")]
>>> print newarray
[[('Hello', 1.7219343871178711e-317, 51L)]
 [('World', 1.7543139673493688e-317, 50L)]]

This time no actual conversion is performed. The existing data in memory is just re-interpreted as the new data type.
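To illustrate the difference between converting and re-interpreting on a smaller case, here is a minimal sketch (written for Python 3 and a recent NumPy, and not taken from the session above). The array `values` is a made-up example; `astype` converts the numbers, while `view` only relabels the existing bytes, which is the same thing that assigning to `.dtype` does.

```python
import numpy as np

# Hypothetical example data, not from the question above.
values = np.array([1.0, 2.0])            # dtype float64

# astype performs a real conversion: each float becomes the matching integer.
converted = values.astype(np.int64)

# view keeps the same bytes and only changes how they are interpreted,
# so the integers are the raw IEEE-754 bit patterns of the floats.
reinterpreted = values.view(np.int64)

# Viewing the bytes as float64 again recovers the original numbers.
round_trip = reinterpreted.view(np.float64)

print(converted)
print(reinterpreted)
print(round_trip)
```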

The array that I'm starting with is being read in from a text file. The data types are not known ahead of time, so I can't set the dtype at the time of creation. I need a high-performance and elegant solution that will work well for general cases since I will be doing this type of conversion many, many times for a large variety of applications.

Thanks!

+1  A: 
>>> import numpy
>>> myarray = numpy.array([("Hello",2.5,3),("World",3.6,2)], dtype=tuple)
>>> print myarray
[[Hello 2.5 3]
 [World 3.6 2]]
>>> myarray.tolist()
[['Hello', 2.5, 3], ['World', 3.6000000000000001, 2]]
gnibbler
Adding tuple as the dtype in the definition of myarray doesn't seem to have changed anything. Also, I need the output to be a structured array with dtype=[("Col1","S8"),("Col2","f8"),("Col3","i8")]. I'm looking for a solution that does not involve converting to a list (for performance reasons).
Emma
@Emma, adding the dtype of tuple prevents all the items from being converted to strings, i.e., the numeric entries are still numbers. If that is not what you want, can you please clarify?
gnibbler
+1  A: 

The best solution is to determine the correct dtype earlier on in your script and load the data from the file once, at the beginning, with the correct dtype.

That might mean reading in one line of the file, using it to construct the right dtype, calling seek(0) to return to the beginning of the file, and then calling np.genfromtxt(...,dtype=dt) to load the array with the correct dtype.
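The read-one-line approach could look roughly like the sketch below, written for Python 3 and a recent NumPy. The helper names `guess_kind` and `load_with_inferred_dtype`, the generated field names `Col1`, `Col2`, ..., and the string width `U16` are all assumptions made for illustration, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import numpy as np

def guess_kind(token):
    """Guess a dtype code for one whitespace-separated token (hypothetical helper)."""
    for cast, code in ((int, "i8"), (float, "f8")):
        try:
            cast(token)
            return code
        except ValueError:
            pass
    return "U16"  # assumed maximum string width; adjust to the data

def load_with_inferred_dtype(fh):
    """Build a dtype from the first line, rewind, then load the whole file."""
    first = fh.readline().split()
    dt = np.dtype([("Col%d" % (i + 1), guess_kind(tok))
                   for i, tok in enumerate(first)])
    fh.seek(0)  # return to the beginning so the first line is loaded too
    return np.genfromtxt(fh, dtype=dt, encoding=None)

# Stand-in for the text file from the question.
fh = io.StringIO("Hello 2.5 3\nWorld 3.6 2\n")
records = load_with_inferred_dtype(fh)
print(records.dtype)
print(records)
```

Note that guessing from a single line can be wrong, for example when a column holds 3 in the first row and 3.5 later, so sampling more lines may be needed for real files.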

If for some reason the above is not an option, you could use one of the following methods:

import numpy as np
import cStringIO

arr = np.array([("Hello",2.5,3),("World",3.6,2)])
dt=np.dtype([("Col1","S8"),("Col2","f8"),("Col3","i8")])

def convert_to_dtype(arr,dt):    
    arr2 = np.empty(len(arr),dtype=dt)
    converter={'S':str,'f':float,'i':int}
    kinds=[dt[i].kind for i in range(len(dt))]
    for i,(kind,name) in enumerate(zip(kinds,dt.names)):
        arr2[name]=map(converter[kind],arr[:,i])
    return arr2

def reload_with_dtype(arr,dt):
    fh=cStringIO.StringIO()
    np.savetxt(fh,arr,fmt='%s')
    fh.seek(0)
    arr2=np.genfromtxt(fh,dtype=dt)
    return arr2

if __name__=='__main__':
    # Both functions require the source array and the target dtype.
    arr2=convert_to_dtype(arr,dt)
    print(arr2)
    # [('Hello', 2.5, 3L) ('World', 3.6000000000000001, 2L)]
    arr3=reload_with_dtype(arr,dt)
    print(arr3)
    # [('Hello', 2.5, 3L) ('World', 3.6000000000000001, 2L)]

Of the two methods, convert_to_dtype is roughly three times faster on this small example:

% python -mtimeit -s'import test' 'test.convert_to_dtype(test.arr,test.dt)'
10000 loops, best of 3: 78.9 usec per loop
% python -mtimeit -s'import test' 'test.reload_with_dtype(test.arr,test.dt)'
1000 loops, best of 3: 244 usec per loop
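For reference, the per-column conversion in convert_to_dtype can also be written with astype instead of map. This is only a sketch for Python 3 and a recent NumPy, not the code timed above, and it has not been benchmarked here; the name `rows_to_records` is hypothetical, and the input is assumed to be a 2D array of strings like the one printed in the question.

```python
import numpy as np

def rows_to_records(arr, dt):
    """Copy each column of a 2D string array into the matching named field."""
    dt = np.dtype(dt)
    out = np.empty(arr.shape[0], dtype=dt)
    for i, name in enumerate(dt.names):
        # astype parses the strings of one column into the field's type.
        out[name] = arr[:, i].astype(dt[name])
    return out

# Same data as in the question, already stored as strings.
arr = np.array([["Hello", "2.5", "3"], ["World", "3.6", "2"]])
dt = np.dtype([("Col1", "S8"), ("Col2", "f8"), ("Col3", "i8")])

records = rows_to_records(arr, dt)
print(records)
```

The loop runs once per column rather than once per element, so the string parsing itself happens inside NumPy.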
unutbu