I am writing a data processing program in Python and R, bridged with Rpy2.

The input data are binary, so I use Python to read them out and pass them to R, then collect the results for output.

The data are organized into pieces, each around 100 bytes (1 byte per value * 100 values).

It all works now, but the speed is very low. Here are some of my tests on 1 GB (that is, 10^7 pieces) of data:

If I disable the Rpy2 calls to make a dry run, it takes about 90 minutes for Python to loop through everything on an Intel(R) Xeon(TM) 3.06 GHz CPU using a single thread.

If I enable full functionality with multithreading on that dual-core Xeon, I estimate it will take ~200 hours for the program to finish.

I killed the Python program several times, and the call stack almost always points to the Rpy2 function interface. I also did profiling, which gives similar results.

All these observations indicate that the R part called via Rpy2 is the bottleneck. So I profiled a standalone version of my R program, but the profiling summary points to "Anonymous". I am still working out which part of my R script is the most time-consuming one. *updated, see my edit below*

There are two suspicious candidates, though: one is a continuous wavelet transform (CWT) plus wavelet transform modulus maxima (WTMM) using wmtsa from CRAN[1], the other is a non-linear fit of an ex-Gaussian curve.

What comes to my mind is:

  1. for the fitting, I could substitute the R routine with inline C code? There are many fitting libraries available in C and Fortran... (an idea from the net; I have never done that; unsure)

  2. for the wavelet algorithms... I would have to analyze the wmtsa package to rewrite the hotspots in C? Reimplementing the entire wmtsa package in C or Fortran would be very non-trivial for me; I do not have much programming experience.

  3. each data piece in the file is organized as 20 consecutive bytes, which I could map directly to a C-like char* array? At present my Python program just reads one byte at a time and appends it to a list, which is slow. This part of the code takes 1.5 hours vs. ~200 hours for R, so it is not that urgent (a sketch of what I mean follows this list).
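To illustrate idea 3, this is roughly what I have in mind, assuming the file is just 20-byte records back to back and its size is an exact multiple of 20 (the file name is made up):

import struct

PIECE_SIZE = 20                    # bytes per data piece
with open('data.bin', 'rb') as f:  # 'data.bin' stands in for my real file
    raw = f.read()                 # one bulk read instead of byte-by-byte

# map each 20-byte record onto a tuple of unsigned byte values,
# much like viewing it through a C char* array
pieces = [struct.unpack('20B', raw[i:i + PIECE_SIZE])
          for i in range(0, len(raw), PIECE_SIZE)]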

This is the first time I have had to deal with program efficiency in solving real problems. I STFW and felt overwhelmed by the information. Please give me some advice on what to do next and how.

Cheers!

footnotes:

  1. http://cran.r-project.org/web/packages/wmtsa/index.html

* Update *

Thanks to proftools from CRAN, I managed to create a call stack graph, and I can see that ~56% of the time is spent in wmtsa; the code snippet looks like:

W <- wavCWT(s, wavelet="gaussian1", variance=1/p)  # ~1/4 of the wmtsa time
W.tree <- wavCWTTree(W)                            # ~1/2 of the wmtsa time
holderSpectrum(W.tree)                             # ~1/4 of the wmtsa time

~28% of the time is spent on nls:

nls(y ~ Q * dexGAUS(x, m, abs(s), abs(n)) + B,
    start = list(Q = 1000, m = h$time[i], s = 3, n = 8, B = 0),
    algorithm = "default", trace = FALSE)

where the evaluation of dexGAUS from the gamlss.dist package takes the majority of the time.

The remaining ~10% of the R time is spent on data passing/splitting/aggregation/subsetting.

+1  A: 

For option 3 (getting your data in efficiently): read it all in as one long str in Python with a single read from the file. Let's assume it is called myStr.

import array
myNums = array.array('B', myStr)

Now myNums is an array of bytes that is easily converted... see help(array.array)... in fact, looking at that, it looks like you can fill it directly from a file that way through the array module.
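For instance, something like this should do a single bulk read (the file name is only a placeholder):

import array
import os

myNums = array.array('B')
with open('data.bin', 'rb') as f:                    # placeholder file name
    myNums.fromfile(f, os.path.getsize('data.bin'))  # read every byte in one call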

That should get rid of 1.4 hours of your data reading.

John
And then, as I read your question more carefully, it looks like you aren't asking Python to do anything that could not be done entirely in R. So why the Python code at all?
John
I just don't know how to process highly customized data in R. And I use Python for lexical analysis of the description region of the data file.
heroxbd
+1  A: 

My understanding is that you have:

  • Python code that uses rpy2 in places

  • performance issues that can be traced to calls to rpy2

  • the performance issues do not currently appear to have much to do with rpy2 itself, as the underlying R is largely responsible for the running time

  • a part of your R code was reading bytes one at a time and appending them to a list, which you improved by moving that part to Python

It is somewhat hard to try to help without seeing the actual code, but you may want to consider:

  • a buffering strategy for reading bytes (as already answered by John)

  • work on optimizing your R code

  • trivial parallelization (and eventually renting compute space on the cloud); a sketch follows this list
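As an illustration of the last point, a minimal sketch with Python's multiprocessing (the names here are only placeholders; the per-piece rpy2 work would go inside the worker function):

import multiprocessing

def process_piece(piece):
    # placeholder for the real per-piece work: import rpy2 here and run
    # the wavelet / nls steps on this piece, returning the fitted result
    return sum(piece)

if __name__ == '__main__':
    pieces = [[0] * 20 for _ in range(1000)]  # stand-in for the real data pieces
    pool = multiprocessing.Pool()             # one worker process per core by default
    results = pool.map(process_piece, pieces)
    pool.close()
    pool.join()

Each worker is a separate process, so each one should get its own embedded R rather than having threads contend for a single R interpreter.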

lgautier
Hi lgautier, your understanding is right, except for the 4th point: I read the data from the binary file with Python and then pass them to R via Rpy2. That reading strategy is byte-by-byte and not optimized. I have used trivial parallelization, and the time consumed is still not acceptable. I have some updates on profiling the R code in my post. Please take a look :)
heroxbd