Hi, I'd like to convert a tab-delimited file that looks like this:

Species Date Data
1       Dec   3 
2       Jan   4
2       Dec   6
2       Dec   3

into a matrix like this (species form the header row, and values that share a species and date are summed):

    1  2
Dec 3  9
Jan    4

I'm guessing part of the solution is to create a dictionary keyed on two values and use defaultdict to accumulate new values for each key pair. I'd like to write the result out in tab-delimited form, but also get it into a format that I can use with the cluster part of scipy.

A: 

The DataFrame object in the pandas library makes this quite simple.

import csv
from collections import defaultdict
from pandas import DataFrame

# the sample is shown space-aligned; for a truly tab-delimited file use delimiter='\t'
rdr = csv.reader(open('mat.txt'), delimiter=' ', skipinitialspace=True)
datacols = defaultdict(list)

# skip header
next(rdr)
for spec, dat, num in rdr:
    datacols['species'].append(int(spec))
    datacols['dates'].append(dat)
    datacols['data'].append(int(num))

df = DataFrame(datacols)
df2 = df.pivot(index='dates', columns='species', values='data')

First we read the data from a file in the format you provided. Then we construct a dictionary of columns (datacols), since this is what pandas' DataFrame wants. Once the DataFrame is constructed (df), we call its pivot method to get it into the desired format. Here's what df and df2 look like in the console:

In [205]: df
Out[205]:
     data           dates          species
0    3              Dec            1
1    4              Jan            2
2    6              Dec            2
3    3              Dec            2


In [206]: df2
Out[206]:
       1              2
Dec    3              3
Jan    NaN            4

You can then use the to_csv method (spelled toCSV in early pandas releases) to save it to a file; see the DataFrame docs.
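
One caveat with the pivot call above: the sample has two rows for species 2 in Dec, and the df2 output shows only one of them (3) rather than the sum (9) the question asks for. Recent pandas releases instead raise a ValueError from pivot when an index/column pair repeats. A minimal sketch of one way around this, assuming a reasonably current pandas, is pivot_table with a sum aggregation. The inline sample string and the names raw, table and tsv_text are illustrative only, not part of the original answer.

```python
import io

import pandas as pd

# Sample data from the question, tab-delimited as described there (illustrative).
raw = "Species\tDate\tData\n1\tDec\t3\n2\tJan\t4\n2\tDec\t6\n2\tDec\t3\n"

df = pd.read_csv(io.StringIO(raw), sep="\t")

# pivot_table aggregates repeated (Date, Species) pairs instead of rejecting them.
table = df.pivot_table(index="Date", columns="Species", values="Data", aggfunc="sum")

# Write tab-delimited text; combinations with no data become empty cells.
out = io.StringIO()
table.to_csv(out, sep="\t")
tsv_text = out.getvalue()
print(tsv_text)
```

From here, table.values should give a plain 2-D array that could be handed to scipy's clustering routines, once you have decided how to treat the missing cells (for example table.fillna(0)).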

ars
A: 

I don't know numpy, so I can only be of partial help, but I found writing this little snippet entertaining, so here it is with defaultdict:

# we'll pretend *f* is a file below
f = '''Species Date Data
1       Dec   3 
2       Jan   4
2       Dec   6
2       Dec   3'''.split('\n')[1:]

from collections import defaultdict

d = defaultdict(int)
for ln in f:
    x,y,n = ln.split()
    d[x,y] += int(n)

# transpose the list of tuples (keys) to get the two dimensions, remove the duplicates
x,y = map(set, zip(*d))

print(list(x))
for yy in y:
    print(yy, [d[xx,yy] for xx in x])

Running this prints the following (row and column order may vary, since sets are unordered):

['1', '2']
Jan [0, 4]
Dec [3, 9]

Cute, isn't it?
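
To get from the defaultdict above to the tab-delimited output the question asks for, something like the following might work. It is only a sketch using the standard library: the names totals, species_ids, dates, matrix and tsv_text are made up for illustration, both axes are sorted alphabetically (so dates are not in calendar order), and missing combinations are filled with 0, which is an assumption you may want to change.

```python
from collections import defaultdict

# Sample rows from the question, tab-delimited as described there (illustrative).
lines = [
    "Species\tDate\tData",
    "1\tDec\t3",
    "2\tJan\t4",
    "2\tDec\t6",
    "2\tDec\t3",
]

totals = defaultdict(int)
for line in lines[1:]:  # skip the header row
    species, date, value = line.split()
    totals[species, date] += int(value)

# Sort both axes so the layout is reproducible (sets have no fixed order).
species_ids = sorted({s for s, _ in totals})
dates = sorted({d for _, d in totals})

# Dense matrix: one row per date, one column per species, 0 where there is no data.
matrix = [[totals.get((s, d), 0) for s in species_ids] for d in dates]

# Tab-delimited text with a header row and a leading label column.
rows = ["\t".join([""] + species_ids)]
for d, values in zip(dates, matrix):
    rows.append("\t".join([d] + [str(v) for v in values]))
tsv_text = "\n".join(rows)
print(tsv_text)
```

The matrix list of equal-length numeric rows is also the shape that numpy.array accepts, which is a reasonable starting point for the scipy clustering step.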

Nas Banov