I am writing a data processing program in Python and R, bridged with Rpy2.
Since the input data are binary, I use Python to read them in and pass them to R, then collect the results for output.
The data are organized into pieces, each around 100 bytes (1 byte per value * 100 values).
It just works now, but the speed is very low. Here are some of my tests on 1 GB (that is, 10^7 pieces) of data:
If I disable the Rpy2 calls to make a dry run, it takes about 90 min for Python to loop through everything on an Intel(R) Xeon(TM) 3.06 GHz CPU using a single thread.
If I enable full functionality with multithreading on that dual-core Xeon, the program will (by estimation) take ~200 hrs to finish.
I killed the Python program several times, and the call stack almost always points to the Rpy2 function interface. I also did profiling, which gives similar results.
All these observations indicate that the R part called through Rpy2 is the bottleneck. So I profiled a standalone version of my R program, but the profiling summary points to "Anonymous". I am still working out which part of my R script is the most time-consuming one. *Updated, see my edit below.*
There are two suspicious candidates, though: one is a continuous wavelet transform (CWT) plus wavelet transform modulus maxima (WTMM) using wmtsa from CRAN [1]; the other is a non-linear fit of an ex-Gaussian curve.
What comes to my mind:
For the fitting, I could substitute the R routine with inline C code? There are many fitting libraries available in C and Fortran... (idea from the net; I have never done that; unsure)
For the wavelet algorithms... I would have to analyze the wmtsa package to rewrite the hotspots in C? Reimplementing the entire wmtsa package in C or Fortran would be very non-trivial for me; I do not have much programming experience.
Each data piece in the file is organized as 20 consecutive bytes, which I could map directly to a C-like char* array? At present my Python program just reads one byte at a time and appends it to a list, which is slow (see the sketch after this list). This part of the code takes 1.5 hrs vs. ~200 hrs for R, so it is not that urgent.
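For the byte-reading idea, here is a minimal sketch of reading whole pieces at once with the standard struct module instead of byte by byte (the read_pieces name, the path argument and the assumption that the values are unsigned bytes are mine, not from my actual code):

import struct

PIECE_SIZE = 20  # bytes per piece, as described above

def read_pieces(path):
    # Read the binary file one whole piece at a time instead of one byte at a time.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(PIECE_SIZE)
            if len(chunk) < PIECE_SIZE:
                break  # end of file (or a trailing partial piece)
            # unpack the 20 bytes into a tuple of ints in a single call
            # ("B" assumes unsigned bytes; "b" would be needed for signed values)
            yield struct.unpack("20B", chunk)

Each piece could then be handed to R in one go instead of value by value.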
This is the first time I have had to deal with program efficiency in solving a real problem. I STFW and felt overwhelmed by the information. Please give me some advice on what to do next and how.
Cheers!
footnotes:
* Update *
Thanks to proftools from CRAN, I managed to create a call stack graph, and I can see that ~56% of the time is spent in wmtsa; the code snippet is like:
W <- wavCWT(s, wavelet="gaussian1", variance=1/p)  # ~1/4 of the wmtsa time
W.tree <- wavCWTTree(W)                            # ~1/2
holderSpectrum(W.tree)                             # ~1/4
~28% of the time is spent in nls:
nls(y ~ Q * dexGAUS(x, m, abs(s), abs(n)) + B,
    start = list(Q = 1000, m = h$time[i], s = 3, n = 8, B = 0),
    algorithm = "default", trace = FALSE)
where the evaluation of dexGAUS from the gamlss.dist package takes the majority of the time.
The remaining ~10% of the R time is spent on data passing/splitting/aggregation/subsetting.
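If dexGAUS stays the hotspot, one variant of my first idea above would be to move this particular fit out of R entirely, e.g. into Python with SciPy. This is only a sketch under assumptions: scipy.stats.exponnorm is SciPy's ex-Gaussian, its parameterization differs from gamlss.dist::dexGAUS (its shape K should correspond to nu/sigma, loc to mu, scale to sigma), and the x, y arrays and start values here are placeholders mirroring the R call, so the mapping and the results would need to be checked against nls before trusting it:

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import exponnorm

def model(x, Q, m, s, n, B):
    # Q * exGaussian(x; mu=m, sigma=s, tau=n) + B
    # exponnorm's shape parameter is taken as K = tau/sigma (to be verified against dexGAUS)
    s, n = abs(s), abs(n)
    return Q * exponnorm.pdf(x, n / s, loc=m, scale=s) + B

# hypothetical usage, with x and y as NumPy arrays and a rough start value for m:
# popt, pcov = curve_fit(model, x, y, p0=[1000.0, float(x[np.argmax(y)]), 3.0, 8.0, 0.0])

Whether this is actually faster than nls + dexGAUS would of course have to be measured.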