views:

1325

answers:

7

In the past week I've been following a discussion where Ross Ihaka wrote:

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

He then continued explaining. This discussion started from Xi'an's Og, and was then followed by comments at reddit, statalgo, DecisionStats, columbia.edu, Hacker News, r-help mailing list, and maybe other places.

As someone who isn't a computer scientist, I am trying to understand what to make of this.

  • Is R so flawed that it is better to rewrite it than to fix it? Searching on Stack Overflow, I came across When to rewrite a code base from scratch and Under what situation should code be rewritten from scratch? (based on Joel's article Things You Should Never Do). Both threads argue that a very(!) extreme case is needed to justify a rewrite of the code. But is this the case with R?
  • Can R be patched in a way that fixes these problems, so that it becomes "the stat language of the future"?
  • What about the social aspect of this? R already has a large user base. If R were to "die", is it possible to imagine all the users willing to move to a new language?

I think this question is not subjective, but since it has so many uncertainties, I decided to mark it as a community wiki.

+12  A: 

Somehow I believe some knowledgeable people are being rather unfair towards R. For one, R started as a modified, free version of S, and is first and foremost a statistical package. I hear nobody complain about the hideous SAS and SPSS coding, or the slow speed at which they do some of their calculations. For what it's most often used for, i.e. statistical analysis, R is without doubt the best thing around.

Secondly : although R is indeed not the best programming language around, it is rather powerful if you know how to use it. The vectorization of R is simply the best feature ever. As far as I'm concerned, I use a for-loop three times a year and an apply twice a week. All the rest is vectorized calculations. But in the benchmarks, this feature of R is often ignored (e.g. here), leading to the (in my eyes wrong) accusation of R being slow. It ain't the fastest around, but if you know what you're doing, you can still race nicely.

Thirdly : R is in the first place a scripting language, allowing me to try out different analyses on the fly. This is impossible in C++, Fortran, Ruby, Python, Perl or any other "faster alternative". They're not alternatives, simple as that. They are superior for a number of tasks, but not for efficient statistical analysis. I need the R command line for that.

Yes, I would appreciate the quirks of R to disappear. Yes, it would be wonderful to have some more programming power in R that allows us to use R even more as a programming language as well. A complete rewrite that is compatible with old code and has an order of magnitude speed-up, cleaner codes etc... would be great.

But until somebody wants to do all that effort completely for free, I'll stay with R.


EDIT : as some people - rightfully - pointed out, Python has a command line as well. This command line however is NOT directed towards statistical analysis, and even with the scipy and stat.py installed, IPython/DreamPie/whatever other command line you have doesn't come even close to R in ease of use, completeness and availability of techniques.

R is far too often mistaken for yet-another-programming-language. It isn't. It is a statistical package that allows for full-blown programming as well. Python is a scripting language that has some libraries for statistics as well. There's a whole world of difference between those two.

Joris Meys
That’s a nice rant but it doesn’t address the problems raised by Ross Ihaka. **If** performance is going to become a forbidding problem (and this is more than likely) then all R’s niceties won’t save it. Nobody doubts that for “statistical analysis, R is without doubt the best thing around,” so that argument is addressing a straw man.
Konrad Rudolph
Ross Ihaka is completely right. But R is not C++. R is a scripting language, allowing me to test out analyses on the fly. I can't do that in Python, Perl, Java, C++, Ruby, Fortran or any other so-called "alternative". They're simply not alternatives.
Joris Meys
@Joris Meys Python and Perl are also scripting languages that allow you to write and test analysis on the fly (with the proper math and statistical packages, of course).
fortran
The R core is based on Fortran code, which works flawlessly but is often harder to read than C++. Still, all the numeric work is done efficiently and effectively.
wok
@fortran I do program in Python and Perl, and in neither of them do I have a command line where I can easily run different models and compare them with one command. If you can show me how to do that, I'll be happy to switch.
Joris Meys
@Joris: they do have that command line, in fact. I do all my statistical work in a Python shell (with numpy/scipy/matplotlib). To be honest, the only reason I don’t use R is that I’m too lazy to learn it (and I wouldn’t advise a switch). **But**, and this was my point, R is not only used for cheap on-the-fly work but also for complex analyses on *huge* data sets (cf. bioconductor package). And let me tell you, doing such analyses in a slow environment *sucks*. They may well take days instead of hours.
Konrad Rudolph
@Konrad Funny, I never considered DreamPie, IPython or any other Python shell as a full replacement of the R console. There are numerous attempts at proving you can do with Python everything R does. In fact, you can. You can construct a structured array which resembles an R data-frame. You can define all the functions R has to work on that structured array, but frankly, you're not going to do e.g. ave(DF,list(DF$varA,DF$varB),sd) in one line of Python code, let alone run a Generalized Mixed Model with a continuous AR1 autocorrelation structure.
Joris Meys
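For readers who don't know R, the ave(...) call in the comment above is, roughly, a per-group standard deviation copied back onto every row of its group. Here is a minimal sketch of that idea in plain Python, using only the standard library. The function name group_sd and the sample data are hypothetical, and the sketch assumes one numeric column grouped by two key columns; it is not meant to mirror R's ave() exactly.

```python
import math
from collections import defaultdict
from statistics import stdev


def group_sd(values, *keys):
    """For each row, return the sample standard deviation of its group.

    `values` is one numeric column; each entry of `keys` is a grouping
    column of the same length (hypothetical stand-ins for DF$varA, DF$varB).
    """
    row_keys = list(zip(*keys))
    groups = defaultdict(list)
    for key, value in zip(row_keys, values):
        groups[key].append(value)
    # statistics.stdev needs at least two points, so singleton groups get NaN.
    sd_by_group = {
        key: stdev(vals) if len(vals) > 1 else math.nan
        for key, vals in groups.items()
    }
    # Broadcast each group's result back to the rows that belong to it.
    return [sd_by_group[key] for key in row_keys]


values = [1.0, 3.0, 10.0, 14.0, 7.0]
var_a = ["a", "a", "b", "b", "c"]
var_b = ["x", "x", "x", "x", "y"]
per_row_sd = group_sd(values, var_a, var_b)
```

It can be done, but it takes a dozen lines instead of one expression, which is the point the comment is making.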
@Konrad -- it is not so bad; indeed in practice most time/space-demanding algorithms in R are just C or Fortran chunks, so it is not the problem at all. The problem is that writing demanding programs in pure R sux usually (definitely not always) a bit more than in Python or MATLAB -- but usually even using them you will start thinking about making some C chunks to speed those up.
mbq
Never looked at that Matlab v. R page before but man that R code is inefficient. Some of it can be orders of magnitude faster with small changes.
John
John - do you mean the R code used in the example could be optimized, or that the base functions used can be optimized ?
Tal Galili
**Aside:** [cint](http://root.cern.ch/drupal/content/cint) will allow you to do script-dinking in C++. Not that I recommend that as a replacement for R, but you can do it. The transition from PAW to ROOT in the particle physics community could be instructive for the future of R, though I don't know enough to draw particular conclusions beyond noting that many physicists preferred to continue with the outdated PAW through four major versions of ROOT before they felt the replacement was ready to take over...
dmckee
Please forgive this dumb comment: why is everybody stating that there is no command line in Python? Doesn't IPython qualify?
ran2
+10  A: 

Some of the accusations that have been levelled at R over the years are:

  1. It is slow.

  2. It doesn't play well with really big datasets.

For many people, code-writing-time is a much more important factor than execution-time, so 1 isn't a big problem. Similarly, the value of really big has been increasing with faster computers to the point where it isn't an issue for many researchers.

What Ross Ihaka is talking about is a language to deal with problems like "go analyse this genome", or "go find trends in Facebook's social graph" that R can't easily scale to. Such a language could be a niche big-data-processing language, or may have more widespread usage; it's way too early to say.

Some things to bear in mind are that Ross Ihaka presumably enjoys creating new languages (at the very least he's done it before), so it's natural for him to want to have another go. Secondly, he could write something that would play nicely with R code so that the work on CRAN wouldn't be wasted. And thirdly, any new language is a good decade away from mainstream use, so reports of R's death are greatly exaggerated.

EDIT: To answer the question of "is it better to fix the bugs or rewrite from scratch?", I don't see why it has to be either/or. In the short term, there are many improvements that can and will be made to R. Over the longer term, new languages will inevitably be created, and will complement or supersede existing ones. One intermediate stage that hasn't really been discussed for R is an equivalent of Python 3: a reworking of the language that drops compatibility in favour of removing some of the warts.

Richie Cotton
+1 for the reference to Python 3. R would benefit from starting a 3.0 series as well. Although I hope they don't port the S4 system in its current state; for now it gives more headaches than solutions.
Joris Meys
Google's V8 is a perfect example of how a totally screwed language can be made pretty fast without a single modification of the language itself.
mbq
And +1 from me for pointing out that not all of us are dealing with web- or genome-scale data. I think it's great that we're developing the things we need to address those problems, but it's not as though we've solved all of our small-data problems. It's vastly more important to me to have access to the incredible array of statistical methods on CRAN.
Matt Parker
A: 

I believe a design review should be considered, carried out by people who know enough about Fortran and C++ (and C and Python). People from Google and Canonical (and Red Hat and Sun) could also be asked to help with code review.

Ajay Ohri
People from Canonical would at most change the default plot background to dirty bronze and allow proprietary packages on CRAN ;-)
mbq
I'm voting you up because surely some review would be nice, though probably of limited value. Contrary to what people seem to believe, the R Core group actually does know pretty well what they're doing, computer-sciencewise. But more eyes can't be bad.
Ken Williams
For R, Python is not needed. Writing a scripting language in another scripting language ain't too smart. And it is already possible today to link R with Java GUIs and Python GUIs. Fairly easily by now, actually.
Joris Meys
It's a great idea! After all, the people from the Google and Canonical (and Red Hat and Sun and ...) don't have anything better to do than to review the source code for every programming language that they use.
Shane
A: 

Richie Cotton comments that "any new language is a good decade away from mainstream use", surely an important point.

If the new system leaves the R syntax largely intact (albeit with some rationalisation), migration from R to the new R may not be too traumatic, might happen quite quickly, and my further comments do not apply.

If there is a substantial rewrite that affects syntax as well as internals, then uptake may be slow, limited for a long time to those users (large datasets, lengthy computations?) who really do need such benefits as the new R has to offer. The carrying of packages across to the new R will be a severe initial challenge. Once that is resolved, there will be the further challenge of revising or rewriting or replacing of what is now a very substantial R literature.

Before any of this can happen, there has to be a substantial research/development momentum behind one or more directions for change. Perhaps Ross Ihaka has the standing that will enable him to marshal that kind of momentum behind his ideas. We shall see. In any case, I think it very unlikely that any new initiatives will fork in more than 2 or 3 different directions. Getting momentum in support of any new initiative is just too difficult. This is fortunate, because most users will stay with "current" R until such time as there is a clear winner among claimants to the "new R" throne. Even then, most existing users are likely to stay with "current" R until change is pretty much forced on them.

john Maindonald
+4  A: 

Obviously this is just my understanding of the situation, but I haven't read what Ross Ihaka has said as 'R is broken and must be thrown away' but as 'some of the stats jobs of the future aren't really suitable for R, and we may need some other solution', and on that he's probably right.

R started as a teaching tool, then grew organically, picking up its link to the S language, the user base, the R-Core team and CRAN. Frankly, R is amazing at what it does (if you haven't guessed, I love R), but it can't do everything, and that's what Ross is saying. I think it will continue to grow and expand into new areas all the same.

So in the future there may be a new language (maybe 'Q' ?) that is better suited to those types of problem (like massive, massive datasets, real time problems etc), and it may be related to R to make it easier for the people who might use both to learn, but it won't kill R.

So, to answer the original question: no, R isn't so broken that it needs rewriting, but there are jobs that we might need a new language for; yes, R can be patched to fix existing limitations or to expand into new areas; and no, R won't die, it just might get some new friends....

PaulHurleyuk
There is already a Q language. :)
Shane
A: 

Should R be rewritten from scratch? Yes. Sooner the better. Get a jump on multi-core, parallel processing, etc. In the process fix all the legacy issues Ross Ihaka summarized so well.

Those who think R is not broken either have not noticed the problems or, having noticed them, do not consider them problems. So Ross is really addressing those who have encountered these problems and foresee the issues he raises. All the points he raises are good ones. Yes, R should be rewritten, perhaps with stronger type safety.

Dave F
Ken Williams
+1  A: 

data.table goes a long way toward making R performant, compared with the grouping and lookup performance of vanilla data frames. Syntax issues are less of a problem because you can simply route around them. Any rewrite of R would need to take the form of a Python 3-style spin-off, not backwards compatible but very close to the original, or it is simply not going to be adopted.

frankc
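To illustrate why keyed lookups can be so much faster than scanning a plain table, here is a small Python sketch. It assumes the gain comes from sorting the rows on a key once and then binary-searching that key. This is only an illustration of the general idea, not data.table's actual implementation, and the function names and sample rows are hypothetical.

```python
from bisect import bisect_left, bisect_right


def linear_lookup(rows, key):
    """Scan every row and keep the matches: O(n) work per lookup."""
    return [row for row in rows if row[0] == key]


def build_sorted_index(rows):
    """Sort the rows by key once so that later lookups can binary-search."""
    sorted_rows = sorted(rows, key=lambda row: row[0])
    keys = [row[0] for row in sorted_rows]
    return keys, sorted_rows


def keyed_lookup(index, key):
    """Binary-search the sorted keys: O(log n) work per lookup."""
    keys, sorted_rows = index
    lo = bisect_left(keys, key)
    hi = bisect_right(keys, key)
    return sorted_rows[lo:hi]


rows = [("b", 2), ("a", 1), ("c", 3), ("a", 4)]
index = build_sorted_index(rows)
matches = keyed_lookup(index, "a")
```

The sort is paid for once; after that, each lookup touches only a handful of keys instead of the whole table, which matters when the same table is queried many times.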