views:

920

answers:

5

What are some advantages and disadvantages to doing statistical analyses in SciPy vs. R? They seem to have been designed with opposite philosophies (plain old library vs. DSL). What are some rules of thumb about which is the right tool for the job?

+6  A: 

If you are a bioinformatician, R has bioconductor, which is basically a must. You cannot escape.

Stefano Borini
I don't know how they compare, but Python has an active bioinformatics community--e.g., several dedicated libraries ('biskit' is the one i'm familiar with), and at least a couple of textbooks on 'python and bioinformatics' (one by Ruediger-Marcus Flaig, another by Sebastian Bassi)
doug
@doug python and bioinformatics is a book for bioinfo who don't know python to learn python. It is about python not about bioinfo. There are a lot of this kinds of books.
Yin Zhu
+6  A: 

See the related question on stackoverflow: What can be done in R that can’t be done with Python/Numpy/SciPy.

I would argue the R has considerably more packages, so you should use it if you're trying to do something that's not available in SciPy. You might want to use SciPy where there is overlap, you expect to have very large datasets, and you're primarily a Python programmer. For statistical analysis and visualization generally, you won't be disappointed if you use R in the long run. The primary weakness of R is around memory management and performance, but there are a sufficient number of ways to address this in the high performance computing area.

Lastly, this isn't an either/or proposition: you can use both by using RPy.

Shane
+1: for RPy (or RPy2) -- I find programming in R a royal pain in the neck, but really love all the libraries that are available.
Noah
See also here: http://rpy.sourceforge.net/rpy2/doc/html/numpy.html
Noah
+11  A: 

I use both for this purpose and though i am far from systematic in selecting one for a given task, here's my list:

Task attributes that (for me) tend to favor R:

  • unless you have an extraordinarily well-organized snippet/custom function library, there's really nothing in scipy/numpy to match the convenience of R functions like: fivenum, IQR, stem, table, sample, etc.; likewise R can generate a number of complex plots with just a single command (multi-step in numpy/scipy), for instance, 'qqplot', 'boxplot', and 'mosaicplot (in the vcd library).

  • sometimes the task just requires a pool of datasets to experiment with; once again, R excels here. R has approx. 50 datasets in the 'datasets' library alone, included in the R base install; Packages like DAAG and Ecdat are goldmines of clean, well-documented data. Even better, nearly all of them are the same object type ('dataframe')--nothing comparable in numpy/scipy; and

  • factors (discrete variables) are not an organic concept in numpy/scipy, even in the SciPy's stats library.

For Scipy/Numpy:

  • pure crunching tasks, for manipulation/computation of arrays of any size, i prefer scipy/numpy. It's faster, the indexing syntax is cleaner (in my opinion) and features like broadcasting (via functions like 'newaxis', etc.) give you a dense syntax for dealing with operations on arrays of different shapes ('dims')--i.e., you can code a vector quantizer, or a kNN algorithm in 2-3 lines w/ broadcasting.

(A few things which often seem to dictate the choice between a DSL and a general purpose language, but which are surprisingly balanced w/r/t R and python + numpy/scipy: (i) database integration--i have found both R & python do this very well (my experience is limited to RMySQL and RSQLite--both install w/o a hitch and 'just work'); and (ii) I/O, R is the only statistics platform i've worked in which getting data in and getting data, graphics, tables, etc. out is as straightforward as using a general-purpose language. Both points might be obvious to everyone else, but they surprised me.)

doug
+2  A: 

Both are very good choices, and I think that they have slightly overlapping spheres of influence in my work.

I prefer to use Python to do generation, processing and cleanup of data. The rationale for this is that Python has a very well-defined object-oriented syntax, lots of libraries fo processing common data formats, and if I'm interested in speeding things up it's easy for me to mix Python with C++. My co-workers also like Python very much, which is a very important factor when picking languages.

Once I have my data in a form that I like, I use R for analysis. It's concise, has good support for most statistical techniques, and can make beautiful graphics very concisely. Concise graphics are very useful when you're doing exploratory analysis, and examining the same data in several different ways is very important when you're exploring data your data.

As far as rules of thumb go, I'd look at the following factors:

  • What do people around you know and use at work?
  • Can those people help you, and will they find your work helpful if they can read it?
  • Can leverage an existing code-base more effectively in either language?
  • Which language do you know better already?
  • Is only one language installed/supported on your systems?
James Thompson
+4  A: 

In my opinion, choosing what kind of software usually depends on

1. what are your collaborators using? Usually don't use a tool that your collaborators are not using or not familiar. In our group, we use Pyhon/Matlab for most of the data analysis and modeling tasks. I personally know R and like the language, but I found it is hard to communicate with other students, so I go back to Matlab/Python now.

2. what kind of tasks are you doing?

There are two kinds of data analysis: 1) applying data models 2) programming an analysis/numerical computing model. If you are doing the first one, R is superior to Python with Numpy/Scipy. Because R core is very stable, R also has a lot of 3rd party statistical packages contributed by researchers in academic. If you have heard a statistical procedure, you 98% could find it in R. You can build a decision tree/linear regression using only a few lines including loading your data, plotting the results. While in Python, there are not many packages, and some of the packages are more buggy.

If you are mainly doing the second task, Python is better than R, because Python is a better programming language than R. Scipy package has a good support for matrices, linear algebra, optimization, which are the basics for scientific computing. Python is also a very expressive language for small-to-middle sized projects. So if you want to implement a decision tree in Python, it would be easier than R.

Last point, maybe biased. If you want to deploy your statistical model into a production product or a framework. Python is definitely better than R as R is very weak in interinvoke and there is a lot of license problems.

Yin Zhu