I've been trying to find a good and flexible way to parse CSV files in Python but none of the standard options seem to fit the bill. I am tempted to write my own but I think that some combination of what exists in numpy/scipy and the csv module can do what I need, and so I don't want to reinvent the wheel.

I'd like the standard features of being able to specify delimiters, specify whether or not there's a header, how many rows to skip, the comment delimiter, which columns to ignore, etc. The central feature I am missing is being able to parse CSV files in a way that gracefully handles both string data and numeric data. Many of my CSV files have columns that contain strings (not necessarily of the same length) and numeric data. I'd like to be able to have numpy array functionality for this numeric data, but also be able to access the strings. For example, suppose my file looks like this (imagine columns are tab-separated):

# my file
name  favorite_integer  favorite_float1  favorite_float2  short_description
johnny  5  60.2  0.52  johnny likes fruitflies
bob  1  17.52  0.001  bob, bobby, robert

data = loadcsv('myfile.csv', delimiter='\t', parse_header=True, comment='#')

I'd like to be able to access data in two ways:

  1. As a matrix of values: it's important for me to get a numpy.array so that I can easily transpose and access the columns that are numeric. In this case, I want to be able to do something like:

    floats_and_ints = data.matrix
    floats_and_ints[:, 0]       # access the integers
    floats_and_ints[:, 1:3]     # access some of the floats
    transpose(floats_and_ints)  # etc.

  2. As a dictionary-like object where I don't have to know the order of the headers: I'd also like to access the data by header name. For example, I'd like to do:

    data['favorite_float1']  # get all the values of the column with header "favorite_float1"
    data['name']             # get all the names of the rows

I don't want to have to know that favorite_float1 is the third column in the table, since this might change.

It's also important for me to be able to iterate through the rows and access the fields by name. For example:

for row in data:
  # print names and favorite integers of all rows
  print("Name: ", row["name"], row["favorite_integer"])
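For the row-by-row access alone, a minimal sketch with the standard csv module is below. It assumes comment lines start with `#` in the first column and that the first non-comment line is the header; `iter_rows` and `SAMPLE` are hypothetical names used only for illustration. Note that every field comes back as a string here, which is exactly the limitation for the numeric columns.

```python
import csv
import io

# The example file from above, with real tab separators.
SAMPLE = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)


def iter_rows(lines, delimiter="\t", comment="#"):
    """Yield one dict per data row, keyed by header name (hypothetical helper)."""
    # Drop comment lines before the csv module sees them, so that the
    # first remaining line is treated as the header by DictReader.
    kept = (line for line in lines if not line.startswith(comment))
    for row in csv.DictReader(kept, delimiter=delimiter):
        yield row


rows = list(iter_rows(io.StringIO(SAMPLE)))
for row in rows:
    # Values are plain strings at this point, e.g. "5" rather than 5.
    print("Name:", row["name"], row["favorite_integer"])
```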

The representation in (1) suggests a numpy.array, but as far as I can tell, this does not deal well with strings and requires me to specify the data type ahead of time as well as the header labels.
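One thing that may be worth checking on this point is `numpy.genfromtxt`, which accepts `dtype=None` (infer a type per column) and `names=True` (take field names from the header row). The sketch below assumes a NumPy version recent enough to have the `encoding` argument, and uses `skip_header=1` because the leading `# my file` line would otherwise be read as the header. The result is a 1-D structured array rather than a 2-D matrix, so it covers access by name but not the `[:, 0]` style of slicing.

```python
import io

import numpy as np

# The example file from above, with real tab separators.
SAMPLE = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

data = np.genfromtxt(
    io.StringIO(SAMPLE),
    delimiter="\t",
    comments="#",
    skip_header=1,     # skip the "# my file" line so the next line is the header
    names=True,        # read field names from the header row
    dtype=None,        # infer int / float / string per column
    encoding="utf-8",  # get str values instead of bytes
)

print(data.dtype.names)          # field names taken from the header
print(data["favorite_float1"])   # one column, selected by name
for row in data:
    print(row["name"], row["favorite_integer"])
```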

The representation in (2) suggests a list of dictionaries, and this is what I have been using. However, this is really bad for CSV files that have two string fields while the rest of the columns are numeric. For the numeric values, you really do want to be able to sometimes get access to the matrix representation and manipulate it as a numpy.array.

Is there a combination of csv/numpy/scipy features that allows the flexibility of both worlds? Any advice on this would be greatly appreciated.

In summary, the main features are:

  1. Standard ability to specify delimiters, number of rows to skip, columns to ignore, etc.
  2. The ability to get a numpy.array/matrix representation of the data so that numeric values can be manipulated
  3. The ability to extract columns and rows by header name (as in the above example)
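To make the three requirements concrete, here is a rough sketch of what such a `loadcsv` could look like when built from the csv module plus numpy. Everything in it is an assumption for illustration rather than an existing API: the `Table` class, the `skiprows` parameter, and the rule that a column is numeric only if every value parses as a float (so integers end up as floats in the matrix).

```python
import csv
import io

import numpy as np


class Table:
    """Hypothetical container: columns by name, rows as dicts, numeric matrix."""

    def __init__(self, header, rows):
        self.header = list(header)
        self.nrows = len(rows)
        self.columns = {}
        for i, name in enumerate(self.header):
            values = [row[i] for row in rows]
            try:
                # Numeric only if every value in the column parses as a float.
                self.columns[name] = np.array([float(v) for v in values])
            except ValueError:
                self.columns[name] = np.array(values, dtype=object)
        self.numeric_names = [n for n in self.header
                              if self.columns[n].dtype != object]
        if self.numeric_names:
            self.matrix = np.column_stack(
                [self.columns[n] for n in self.numeric_names])
        else:
            self.matrix = np.empty((self.nrows, 0))

    def __getitem__(self, name):
        return self.columns[name]

    def __len__(self):
        return self.nrows

    def __iter__(self):
        for i in range(self.nrows):
            yield {n: self.columns[n][i] for n in self.header}


def loadcsv(source, delimiter=",", parse_header=True, comment="#", skiprows=0):
    """Load from a path or an iterable of lines (sketch, not a real library call)."""
    if isinstance(source, str):
        with open(source, newline="") as handle:
            lines = handle.readlines()
    else:
        lines = list(source)
    # Skip leading rows, blank lines, and lines starting with the comment marker.
    kept = [ln for ln in lines[skiprows:]
            if ln.strip() and not (comment and ln.startswith(comment))]
    records = list(csv.reader(kept, delimiter=delimiter))
    if parse_header:
        header, rows = records[0], records[1:]
    else:
        header, rows = ["col%d" % i for i in range(len(records[0]))], records
    return Table(header, rows)


SAMPLE = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

data = loadcsv(io.StringIO(SAMPLE), delimiter="\t", parse_header=True, comment="#")
integers = data.matrix[:, 0]       # first numeric column
some_floats = data.matrix[:, 1:3]  # the two float columns
names = data["name"]               # string column, selected by header
```

The string columns are kept as object arrays so that fields of different lengths are not truncated; the trade-off is that `data.matrix` only ever contains the numeric columns, in header order.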

Thanks very much.