views:

113

answers:

4

Hello all,

I need to take a CSV file and import its data into a multi-dimensional array in Python, but I am not sure how to strip the 'None' values out of the array after appending my data to the empty array.

I first created a structure like this:

storecoeffs = numpy.empty((5,11), dtype='object')

This returns a 5-row by 11-column array populated with 'None'.

Next, I opened my csv file and converted it to an array:

coeffsarray = list(csv.reader(open("file.csv")))

coeffsarray = numpy.array(coeffsarray, dtype='object')

Then, I appended the two arrays:

newmatrix = numpy.append(storecoeffs, coeffsarray, axis=1)

The result is an array populated by 'None' values followed by the data that I want (first two rows shown to give you an idea as to the nature of my data):

array([[None, None, None, None, None, None, None, None, None, None, None,
    workers, constant, hhsize, inc1, inc2, inc3, inc4, age1, age2,
    age3, age4],
   [None, None, None, None, None, None, None, None, None, None, None,
    w0, 7.334, -1.406, 2.823, 2.025, 0.5145, 0, -4.936, -5.054, -2.8, 0],
   ...], dtype=object)

How do I remove those 'None' objects from each row, so that what I am left with is the 5 x 11 multidimensional array containing my data?

Thanks in advance!

+1  A: 

Why are you allocating an entire array of Nones and appending to that? Is coeffsarray not the array you want?

Edit

Oh. Use numpy.reshape.

import numpy
coeffsarray = numpy.reshape( coeffsarray, ( 5, 11 ) )
katrielalex
Yes, but it's not structured as a multidimensional array at the outset. I need it structured so that there are 11 columns by 5 rows.
myClone
Oh. Then use `reshape`.
katrielalex
+1  A: 

Start with an empty array?

storecoeffs = numpy.empty((5,0), dtype='object')
gnibbler
Umm... did I not do that when I created storecoeffs = numpy.empty((5,11), dtype = 'object')?
myClone
@myClone - No, you created a 5x11 object array populated with whatever was in memory (actually, for an object array like the one you created above, it just fills it with `None`s). You don't need to initialize the array at all. Just convert what you read from the file into an array.
Joe Kington
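To illustrate the point above, here is a minimal sketch (the CSV contents are made up, standing in for the question's "file.csv"): parsing the file and converting the rows directly gives the 5x11 array, with no empty array, no appending, and no `None`s to strip.

```python
import csv
import io

import numpy as np

# Hypothetical CSV text standing in for "file.csv" in the question:
# 5 rows of 11 comma-separated values.
csv_text = '\n'.join(','.join(str(r * 11 + c) for c in range(11))
                     for r in range(5))

# Parse it and convert straight to an array -- the 5x11 shape comes
# from the data itself, with no empty array or append step.
coeffsarray = np.array(list(csv.reader(io.StringIO(csv_text))), dtype=object)
print(coeffsarray.shape)  # (5, 11)
```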
Ok, thanks. I guess I was getting confused with how python was handling the first array in terms of structure, which is why I took that additional step. Just being a typical n00b making life more difficult. :)
myClone
+1  A: 

Why not simply use numpy.loadtxt()?

newmatrix = numpy.loadtxt("file.csv", dtype='object', delimiter=',')

That should do the job, if I understood your question correctly.

Mermoz
Could you be a bit more specific as to why this is better?
myClone
+2  A: 

@Gnibbler's answer is technically correct, but there's no reason to create the initial storecoeffs array in the first place. Just load in your values and then create an array from them. As @Mermoz noted, though, your use case looks simple enough for numpy.loadtxt().

Beyond that, why are you using an object array? It's probably not what you want... Right now, you're storing the numerical values as strings, not floats!

You have essentially two ways to handle your data in numpy. If you want easy access to named columns, use a structured array (or a record array). If you want to have a "normal" multidimensional array, just use an array of floats, ints, etc. Object arrays have a specific purpose, but it's probably not what you're doing.

For example: To just load in the data as a normal 2D numpy array (assuming all your data can be represented easily as a float):

import numpy as np
# Note that this ignores your column names, and attempts to 
# convert all values to a float...
data = np.loadtxt('input_filename.txt', delimiter=',', skiprows=1)

# Access the first column 
workers = data[:,0]

To load your data in as a structured array, you might do something like this:

import numpy as np
infile = open('input_filename.txt')

# Read in the names of the columns from the first row...
names = [name.strip() for name in next(infile).split(',')]

# Make a dtype from these names...
dtype = {'names': names, 'formats': len(names) * [float]}

# Read the data in...
data = np.loadtxt(infile, dtype=dtype, delimiter=',')

# Note that data is now effectively 1-dimensional. To access a column,
# index it by name
workers = data['workers']

# Note that this is now one-dimensional... You can't treat it like a 2D array
data[1:10, 3:5] # <-- Raises an error!

data[1:10][['inc1', 'inc2']] # <-- Effectively the same thing, but works..

If you have non-numerical values in your data and want to handle them as strings, you'll need to use a structured array, specify which fields you want to be strings, and set a max length for the strings in the field.

From your sample data, it looks like the first column, "workers", is a non-numerical value that you might want to store as a string, and all the rest look like floats. In that case, you'd do something like this:

import numpy as np
infile = open('input_filename.txt')
names = [name.strip() for name in next(infile).split(',')]

# Create the dtype... The 'S10' indicates a string field with a length of 10
dtype = {'names': names, 'formats': ['S10'] + (len(names) - 1) * [float]}
data = np.loadtxt(infile, dtype=dtype, delimiter=',')

# The "workers" field is now a string array
print(data['workers'])

# Compare this to the other fields
print(data['constant'])

If there are cases where you really need the flexibility of the csv module (e.g. text fields with commas), you can use it to read the data, and then convert it to a structured array with the appropriate dtype.
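A minimal sketch of that csv-then-convert route (the CSV text, field names, and the 'S10' string length are made up for illustration): csv.reader handles the quoted, comma-containing field, and the parsed rows are converted to tuples to build the structured array.

```python
import csv
import io

import numpy as np

# Hypothetical CSV text with a quoted, comma-containing text field,
# standing in for a real file opened with open(...).
csv_text = 'workers,constant,hhsize\n"w0, a",7.334,-1.406\n"w1, b",2.025,0.5145\n'

reader = csv.reader(io.StringIO(csv_text))
names = next(reader)

# One string field for the labels, floats for everything else.
dtype = {'names': names, 'formats': ['S10'] + (len(names) - 1) * [float]}

# np.array needs a list of tuples to build a structured array.
records = [tuple([row[0]] + [float(v) for v in row[1:]]) for row in reader]
data = np.array(records, dtype=dtype)
```

Note that the comma inside "w0, a" survives, which plain loadtxt with delimiter=',' would have split incorrectly.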

Hope that makes things a bit clearer...

Joe Kington
Joe, you've covered my dilemma perfectly. My problem has been that this array contains a mixture of floats and non-nums but I need to use the non-numerical (words) to reference the numerical (float). Ideally, I want to store the words in a dictionary to reference the associated numerical data where each word = a column header. This has been an ongoing frustration for me as I am very new to python.
myClone
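One way to get that word-to-column mapping, besides a structured array, is a plain dict keyed by header (a sketch; the column names and values here are made up):

```python
import numpy as np

# Made-up column headers and values standing in for the parsed CSV data.
names = ['constant', 'hhsize', 'inc1']
values = np.array([[7.334, -1.406, 2.823],
                   [2.025, 0.5145, 0.0]])

# Map each word (column header) to its column of floats.
columns = {name: values[:, i] for i, name in enumerate(names)}
print(columns['hhsize'])  # the 'hhsize' column
```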