+2  Q: 

Numpy for R user?

Hi, long-time R and Python user here. I use R for my daily data analysis and Python for tasks heavier on text processing and shell scripting. I am working with increasingly large data sets, which usually arrive as binary or text files. Mostly I apply statistical/machine learning algorithms and create statistical graphics. I sometimes use R with SQLite and write C for iteration-intensive tasks. Before looking into Hadoop, I am considering investing some time in NumPy/SciPy because I've heard it has better memory management (and the transition to NumPy/SciPy for someone with my background seems not that big). I wonder if anyone has experience using the two and could comment on the improvements in this area, and whether there are idioms in NumPy that deal with this issue. (I'm also aware of rpy2, but I'm wondering if NumPy/SciPy can handle most of my needs.) Thanks.

+5  A: 

I use NumPy daily and R nearly so.

For heavy number crunching, I prefer NumPy to R by a large margin (including R packages such as 'Matrix'). I find the syntax cleaner, the function set larger, and computation quicker (although I don't find R slow by any means). NumPy's broadcasting functionality, for instance, has no analog in R that I know of.

For instance, to read in a data set from a csv file and 'normalize' it for input to an ML algorithm (e.g., mean center then re-scale each dimension) requires just this:

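import numpy as NP

data = NP.loadtxt(data1, delimiter=",")    # 'data' is a NumPy array; 'data1' is the csv file
data -= NP.mean(data, axis=0)              # mean-center each column (dimension)
data /= NP.max(data, axis=0)               # re-scale each column by its maximum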
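For anyone who wants to run those three lines without a file on disk, here is a minimal sketch. The in-memory `csv_text` and its values are made up for illustration, and it assumes no column is constant (otherwise the column maximum after centering would be zero). The two in-place operations rely on broadcasting: a length-2 vector of column statistics is applied across every row of the 3x2 array.

```python
import io
import numpy as NP

# Hypothetical stand-in for the csv file: three rows, two columns.
csv_text = io.StringIO("1.0,10.0\n2.0,20.0\n3.0,60.0")

data = NP.loadtxt(csv_text, delimiter=",")   # shape (3, 2), float64
col_means = NP.mean(data, axis=0)            # shape (2,), one mean per column

data -= col_means                # (3, 2) minus (2,): broadcast across the rows
data /= NP.max(data, axis=0)     # largest value in each column becomes 1.0

print(data)
```

After this, each column has mean zero and a maximum of one.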

Also, I find that when coding ML algorithms, I need data structures that I can operate on element-wise and that also understand linear algebra (e.g., matrix multiplication, transpose). NumPy gets this and lets you create these hybrid structures easily (no operator overloading or subclassing required).
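A small sketch of what that looks like in practice. The matrices `A` and `B` are arbitrary illustrative values; the same ndarray supports both element-wise arithmetic and linear algebra.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

elementwise = A * B       # multiplies matching entries: [[0, 2], [3, 0]]
matrix_prod = A.dot(B)    # matrix multiplication:       [[2, 1], [4, 3]]
transposed = A.T          # transpose:                   [[1, 3], [2, 4]]

print(elementwise, matrix_prod, transposed, sep="\n")
```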

You won't be disappointed by NumPy/SciPy; more likely you'll be amazed.

So, a few recommendations--in general and in particular, given the facts in your question:

  • Install both NumPy and SciPy. As a rough guide, NumPy provides the core data structures (in particular the ndarray) and SciPy (which is actually several times larger than NumPy) provides the domain-specific functions (e.g., statistics, signal processing, integration).

  • Install the repository versions, particularly of NumPy, because the dev version is 2.0. Matplotlib and NumPy are tightly integrated; you can use one without the other, of course, but both are the best in their respective class among Python libraries. You can get all three via *easy_install*, which I assume you already have.

  • NumPy/SciPy have several modules specifically directed to machine learning/statistics, including the clustering package (scipy.cluster) and the statistics package (scipy.stats).

  • There are also packages directed to general computation that make coding ML algorithms a lot faster, in particular optimization (scipy.optimize) and linear algebra (scipy.linalg).

  • There are also the SciKits, not included in the base NumPy or SciPy libraries; you need to install them separately. Generally speaking, each SciKit is a set of convenience wrappers to streamline coding in a given domain. The SciKits you are likely to find most relevant are: ann (approximate Nearest Neighbor), and learn (a set of ML/Statistics regression and classification algorithms, e.g., Logistic Regression, Multi-Layer Perceptron, Support Vector Machine).
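To give a feel for the SciPy packages named in the list above, here is a rough sketch that touches each one once. The data are synthetic (two made-up clusters), and the particular functions chosen are only one possible entry point into each package, not a recommended workflow.

```python
import numpy as np
from scipy import linalg, optimize, stats
from scipy.cluster.vq import kmeans2

rng = np.random.RandomState(0)
# Synthetic data: two well-separated 2-D blobs of 50 points each.
points = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(5.0, 0.5, size=(50, 2))])

# Statistics: summary of the first column (count, min/max, mean, variance, ...).
summary = stats.describe(points[:, 0])

# Clustering: k-means with k=2, initial centroids drawn from the data.
centroids, labels = kmeans2(points, 2, minit="points")

# Optimization: minimize a one-dimensional quadratic with its minimum at x = 3.
result = optimize.minimize(lambda x: (x[0] - 3.0) ** 2, x0=[0.0])

# Linear algebra: solve the 2x2 system M x = b.
M = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = linalg.solve(M, b)

print(summary.nobs, centroids.shape, result.x, x)
```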

doug
shouldn't /= np.max() be /= np.std() ?
Denis
No, it should not. I gave an example above, not an exhaustive recitation of ways to pre-process data for ML input. Sometimes rescaling so each dimension has unit variance is what I want, other times not. In any event, my example describes a very common technique for preparing data as ML input, but there are quite a few others (see, e.g., "Machine Learning: An Algorithmic Perspective", Stephen Marsland, Ch. 3, 2009, which uses the method in my answer).
doug
Thanks for all the references - I will give them a try...
Stephen
+4  A: 

R's strength as an environment for machine learning and statistics is most certainly the diversity of its libraries. To my knowledge, SciPy + SciKits cannot be a replacement for CRAN.

Regarding memory usage, R uses a pass-by-value paradigm while Python uses pass-by-reference. Pass-by-value can lead to more "intuitive" code; pass-by-reference can help optimize memory usage. NumPy also lets you take "views" on arrays (a kind of subarray made without copying the data).
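A minimal sketch of both points, using a small made-up array. Basic slicing returns a view that shares memory with the original, index-list ("fancy") indexing returns a copy, and a function that modifies its argument in place changes the caller's array without copying it. The helper `scale_in_place` is a hypothetical name used only for illustration.

```python
import numpy as np

big = np.arange(12, dtype=np.float64).reshape(3, 4)

view = big[:, 1:3]        # basic slice: a view, no data copied
copy = big[:, [1, 2]]     # index list: a new array with its own data

view[0, 0] = -1.0         # writes through to big[0, 1]
copy[0, 1] = 99.0         # leaves big[0, 2] untouched


def scale_in_place(arr, factor):
    """Multiply arr by factor in place; the caller's array is modified."""
    arr *= factor


scale_in_place(big, 2.0)  # no copy of big is made for the call

print(np.shares_memory(big, view), np.shares_memory(big, copy))
print(big[0])
```

The cost of this convenience is that a view can be changed "behind your back" by code holding the original array, so an explicit `.copy()` is needed when an independent array is wanted.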

Regarding speed, pure Python is faster than pure R for accessing individual elements in an array, but this advantage disappears when dealing with numpy arrays (benchmark). Fortunately, Cython lets one get serious speed improvements easily.

If working with Big Data, I find the support for storage-based arrays better with Python (HDF5).
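HDF5 access from Python needs a separate package, so as a stand-in here is a sketch of the same storage-based idea using `numpy.memmap`, which ships with NumPy. This is a plain binary file, not HDF5; the file name, shape, and values are made up for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "big_array.dat")

# Create a disk-backed array and fill it.
on_disk = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000, 100))
on_disk[:] = 1.0
on_disk[10, :] = 5.0
on_disk.flush()           # push pending changes to the file
del on_disk               # drop the mapping

# Reopen read-only; slices are read from the file on demand.
reopened = np.memmap(path, dtype=np.float64, mode="r", shape=(1000, 100))
row_sum = float(reopened[10].sum())        # one row of 100 values
col_mean = float(reopened[:, 0].mean())    # one column of 1000 values

print(row_sum, col_mean)
```

Unlike HDF5, a memmap file stores no metadata, so the dtype and shape have to be supplied again when reopening.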

I am not sure you should ditch one for the other, but rpy2 can help you explore your options for a possible transition (arrays can be shuttled between R and NumPy without a copy being made).

lgautier
Yes, the copy-by-value semantics is a killer, though I understand it's more like copy-on-modification... I've been meaning to look into Big Data also. But as you say, the way forward may be easing my way into SciPy.
Stephen
Yes. The functional world has quickly moved to optimization tricks such as copy-on-modification (in the case of R, called "promises" (to copy if modified)). However, this only helps when objects are not modified (doesn't help if one changes the rownames of a very large matrix while passed as a parameter to the function).
lgautier
Actually, machine learning is an area that is covered pretty well in Python, for example http://mdp-toolkit.sourceforge.net/ and a more comprehensive listing at http://mloss.org/software/language/python/
+1  A: 

I can't comment on R, but here are a couple of links on Numpy/Scipy and ML:

And a book (I've only looked at some of its code): Marsland, Machine Learning (with numpy), 2009 406p isbn 1420067184

If you could collect a few notes on your experience up the Numpy/Scipy learning curve, that might be useful to others.

Denis
Thx for the references!
Stephen