I have a data.frame in R. It contains a lot of data: gene expression levels from many (125) arrays. I'd like the data in Python, mostly due to my incompetence in R and the fact that this was supposed to be a 30-minute job.

I would like the following code to work. To understand this code, know that the variable path contains the full path to my data set which, when loaded, gives me a variable called immgen. Know that immgen is an object (a Bioconductor ExpressionSet object) and that exprs(immgen) returns a data frame with 125 columns (experiments) and tens of thousands of rows (named genes). (Just in case it's not clear, this is Python code, using robjects.r to call R code)

import numpy as np
import rpy2.robjects as robjects
# ... some code to build path
robjects.r("load('%s')"%path) # loads immgen
e = robjects.r['data.frame']("exprs(immgen)")
expression_data = np.array(e)

This code runs, but expression_data is simply array([[1]]).

I'm pretty sure that e doesn't represent the data frame generated by exprs() due to things like:

In [40]: e._get_ncol()
Out[40]: 1

In [41]: e._get_nrow()
Out[41]: 1

But then again who knows? Even if e did represent my data.frame, that it doesn't convert straight to an array would be fair enough - a data frame has more in it than an array (rownames and colnames) and so maybe life shouldn't be this easy. However I still can't work out how to perform the conversion. The documentation is a bit too terse for me, though my limited understanding of the headings in the docs implies that this should be possible.

Anyone any thoughts?

+2  A: 

This is the most straightforward and reliable way I've found to transfer a data frame from R to Python.

To begin with, I think exchanging the data through the R bindings is an unnecessary complication. R provides a simple method to export data and, likewise, NumPy has methods for data import. The file format is the only common interface required here.

data(iris)
# replace the factor column with its underlying integer codes
iris$Species = unclass(iris$Species)

write.table(iris, file="/path/to/my/file/np_iris.txt", row.names=F, sep=",")

# now start a python session
import numpy as NP

fpath = "/path/to/my/file/np_iris.txt"

A = NP.loadtxt(fpath, comments="#", delimiter=",", skiprows=1)

# print(type(A))
# returns: <type 'numpy.ndarray'>

print(A.shape)
# returns: (150, 5)

print(A[1:5,])
# returns: 
 [[ 4.9  3.   1.4  0.2  1. ]
  [ 4.7  3.2  1.3  0.2  1. ]
  [ 4.6  3.1  1.5  0.2  1. ]
  [ 5.   3.6  1.4  0.2  1. ]]

According to the documentation (and my own experience, for what it's worth), 'loadtxt' is the preferred method for conventional data import.

You can also give loadtxt one data type per column by passing a structured dtype (the argument is 'dtype'). Notice 'skiprows=1', which steps over the row of column headers.
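For instance, a quick sketch with a structured dtype (the field names here are illustrative, not taken from the exported file):

dt = NP.dtype([("sepal_len", "f4"), ("sepal_wid", "f4"),
               ("petal_len", "f4"), ("petal_wid", "f4"),
               ("species", "i4")])
B = NP.loadtxt(fpath, delimiter=",", skiprows=1, dtype=dt)

# B is a 1-D structured array; each column is addressable by name
print(B["species"][:5])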

Finally, I converted the data frame's factor column to integer (which is actually the underlying data type for a factor) prior to exporting; 'unclass' is probably the easiest way to do this.

If you have big data (i.e., you don't want to load the entire data file into memory but still need to access it), NumPy's memory-mapped data structure ('memmap') is a good choice:

from tempfile import mkdtemp
import os.path as path

filename = path.join(mkdtemp(), 'tempfile.dat')

# now create a memory-mapped file with shape and data type
# based on the original R data frame:
A = NP.memmap(filename, dtype="float32", mode="w+", shape=(150, 5))

# write data to the memmap array (actually an array-like memory map to
# the data stored on disk); 'somedata' stands in for whatever in-memory
# array holds the values
A[:] = somedata[:]

# 'flush' writes any changes you make to the array out to disk
A.flush()
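To pull the data back in a later session, reopen the same file read-only; the shape and dtype must be supplied again, since the memmap stores only raw bytes (a sketch, reusing 'filename' from above):

B = NP.memmap(filename, dtype="float32", mode="r", shape=(150, 5))
print(B[0])  # reads from disk on access, without loading the whole array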
doug
Thanks Doug! This is the solution I had settled on too - the only problem being that the resulting files are 50+ MB, which is sort of OK but seems a touch clunky! I kind of want the rpy2 bindings to let me write a function that says `array, colnames, rownames = from_df("data.frame()")`.
Mike Dewar
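For what it's worth, here is a minimal sketch of such a helper. It uses only calls that appear elsewhere in this thread (robjects.r[...] lookups of the R functions as.matrix, rownames, and colnames, plus np.array over an R matrix); the name from_df comes from the comment above and is hypothetical, not part of rpy2:

import numpy as np
import rpy2.robjects as robjects

def from_df(r_expr):
    # evaluate the R expression, coerce the result to an R matrix,
    # then pull the values and dimension names across to Python
    df = robjects.r(r_expr)
    values = np.array(robjects.r['as.matrix'](df))
    rownames = list(robjects.r['rownames'](df))
    colnames = list(robjects.r['colnames'](df))
    return values, colnames, rownames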
In that case (big data) I would just use NumPy's memory-mapped data structure, to avoid loading the entire thing into RAM. Editing my answer with an example.
doug
+1  A: 

Why go through a data.frame when 'exprs(immgen)' returns a /matrix/ and your end goal is to have your data in a matrix?

Passing the matrix to numpy is straightforward (and can even be done without making a copy): http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
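For concreteness, a minimal sketch of that route, reusing the variable names from the question (np.asarray giving a copy-free view is per the linked docs):

import numpy as np
import rpy2.robjects as robjects

robjects.r("load('%s')" % path)   # loads immgen, as in the question
m = robjects.r('exprs(immgen)')   # exprs() returns an R matrix
expression_data = np.array(m)     # copies the values into a NumPy array
# np.asarray(m) would instead give a view, without copying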

This should beat the suggestion of going through a text representation of numerical data in flat files in both simplicity and efficiency.

You seem to be working with bioconductor classes, and might be interested in the following: http://pypi.python.org/pypi/rpy2-bioconductor-extensions/

lgautier
Argh, you're right. It is a matrix. That's brilliant, thanks. Just so the solution is clear, I can do: `e = np.array(robjects.r('exprs(immgen)'))` and now e is a numpy array with all my floating point numbers in it. Thanks Laurent! I am interested in the bioC rpy2 stuff, but can't get it to install. A question for the support list, maybe...
Mike Dewar