ansaurus

Question

rpy2: Converting a data.frame to a numpy array

Answer 1

+2 A:

This is the most straightforward and reliable way I've found to transfer a data frame from R to Python.

To begin with, I think exchanging the data through the R bindings is an unnecessary complication. R provides a simple method to export data, and NumPy provides methods to import it. The file format is the only common interface required here.

data(iris)
iris$Species = unclass(iris$Species)

write.table(iris, file="/path/to/my/file/np_iris.txt", row.names=F, sep=",")

# now start a python session
import numpy as NP

fpath = "/path/to/my/file/np_iris.txt"

A = NP.loadtxt(fpath, comments="#", delimiter=",", skiprows=1)

# print(type(A))
# returns: <type 'numpy.ndarray'>

print(A.shape)
# returns: (150, 5)

print(A[1:5,])
# returns: 
 [[ 4.9  3.   1.4  0.2  1. ]
  [ 4.7  3.2  1.3  0.2  1. ]
  [ 4.6  3.1  1.5  0.2  1. ]
  [ 5.   3.6  1.4  0.2  1. ]]

According to the documentation (and my own experience, for what it's worth), 'loadtxt' is the preferred method for conventional data import.

You can also tell loadtxt which data types to use through its 'dtype' argument. To give each column its own type, pass a structured dtype with one field per column. Note that 'skiprows=1' steps over the column headers.
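As a minimal sketch of the per-column typing described above: the in-memory text below stands in for the exported file, and the field names are illustrative choices, not anything R or NumPy prescribes.

```python
import io
import numpy as np

# Stand-in for the exported file: a header row plus three data rows.
text = io.StringIO(
    '"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"\n'
    "5.1,3.5,1.4,0.2,1\n"
    "4.9,3,1.4,0.2,1\n"
    "7,3.2,4.7,1.4,2\n"
)

# One (name, type) pair per column; the field names are hypothetical.
iris_dtype = np.dtype([
    ("sepal_length", "f8"),
    ("sepal_width", "f8"),
    ("petal_length", "f8"),
    ("petal_width", "f8"),
    ("species", "i4"),
])

# The result is a 1-D structured array with one record per row.
records = np.loadtxt(text, dtype=iris_dtype, delimiter=",", skiprows=1)

print(records["species"])       # integer column
print(records["sepal_length"])  # float column
```

With a structured dtype the result is a one-dimensional array of records rather than a 2-D float array, so columns are selected by field name instead of by position.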

Finally, I converted the data frame's factor column to integer (which is actually the underlying data type of a factor) prior to exporting; 'unclass' is probably the easiest way to do this.
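If the labels are needed again on the Python side, the integer codes can be mapped back by indexing an array of level names. This is only a sketch: the 'codes' array is made-up example data standing in for the fifth column, and the level names are those of iris$Species in R's level order.

```python
import numpy as np

# Level names of iris$Species, in R's (alphabetical) level order.
levels = np.array(["setosa", "versicolor", "virginica"])

# Example values of the Species column as loadtxt returns them (floats).
codes = np.array([1.0, 1.0, 2.0, 3.0])

# R factor codes start at 1, NumPy indices start at 0.
labels = levels[codes.astype(int) - 1]

print(labels)
```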

If you have big data (i.e., you don't want to load the entire data file into memory but still need to access it), NumPy's memory-mapped data structure ('memmap') is a good choice:

from tempfile import mkdtemp
import os.path as path

filename = path.join(mkdtemp(), 'tempfile.dat')

# now create a memory-mapped file with shape and data type
# based on the original R data frame
# (pass 'filename' here, not 'fpath': mode="w+" would overwrite the text file)
A = NP.memmap(filename, dtype="float32", mode="w+", shape=(150, 5))

# write data to the memmap array (actually an array-like memory map onto
# the data stored on disk); 'somedata' is any array of shape (150, 5),
# e.g. the one returned by loadtxt above
A[:] = somedata[:]

# 'flush' writes to disk any changes you make to the array
A.flush()
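For a complete round trip, here is a hedged sketch that writes an array through a memmap and then reopens the file read-only. The 'data' array and the file name 'iris.dat' are stand-ins chosen for illustration.

```python
import os
import tempfile
import numpy as np

# Stand-in for the array loaded from the text file (hypothetical data).
data = np.arange(750, dtype="float32").reshape(150, 5)

filename = os.path.join(tempfile.mkdtemp(), "iris.dat")

# mode="w+" creates (or overwrites) the file on disk.
writer = np.memmap(filename, dtype="float32", mode="w+", shape=data.shape)
writer[:] = data[:]
writer.flush()   # write pending changes to disk
del writer       # drop the reference to the mapping

# Reopen read-only. dtype and shape must be given again because
# the raw file stores no metadata about its contents.
reader = np.memmap(filename, dtype="float32", mode="r", shape=(150, 5))

# Copy a small slice into an ordinary in-memory array.
rows = np.array(reader[1:5])
print(rows.shape)
```

Because the file is just raw bytes, the shape and dtype have to be recorded somewhere else (for example, taken from the original R data frame) so the file can be reopened correctly.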

doug 2010-04-19 18:32:01

Thanks Doug! This is the solution I had settled on too. The only problem is that the resulting files are over 50MB, which is sort of OK but seems a touch clunky! I kind of want the rpy2 bindings to let me write a function that says `array,colnames,rownames = from_df("data.frame()")`.

Mike Dewar 2010-04-19 20:04:37

In that case (big data) I would just use NumPy's memory-mapped data structure, to avoid loading the entire thing into RAM. Editing my answer with an example.

doug 2010-04-19 20:16:28

Answer 2

+1 A:

Why go through a data.frame when 'exprs(immgen)' returns a /matrix/ and your end goal is to have your data in a matrix?

Passing the matrix to numpy is straightforward (and can even be done without making a copy): http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy

This should beat, in both simplicity and efficiency, the suggestion of exchanging numerical data through a text representation in flat files.

You seem to be working with bioconductor classes, and might be interested in the following: http://pypi.python.org/pypi/rpy2-bioconductor-extensions/

lgautier 2010-04-20 08:19:46

Argh, you're right. It is a matrix. That's brilliant, thanks. Just so the solution is clear, I can do: e = np.array(robjects.r('exprs(immgen)')) and now e is a numpy array with all my floating point numbers in it. Thanks Laurent! I am interested in the bioC rpy2 stuff, but can't get it to install. A question for the support list though maybe...

Mike Dewar 2010-04-20 15:39:38
