views: 3057
answers: 17

I use mostly R and C for statistics-related tasks. Recently I have been dealing with large datasets, typically 1e7-1e8 observations and 100 features. They seem too big for R to handle, and the packages I typically use are also more prone to crashing. I could develop tools directly in C or C++, but this would slow down the development cycle. I am searching the web for alternatives to R for large-scale analysis, or for R extensions in this direction. I would like to poll the Stack Overflow community for specific suggestions on what to use. Ideally, a good candidate should have stable and multiplatform implementations, a robust user community (or at least a committed and growing small user base), and of course be faster than R (by passing references to functions, compiling, and facilitating parallelization of embarrassingly parallel jobs).

I have been looking at functional languages. Lush (by Bottou and LeCun) and Clojure/Incanter are specifically geared for numerical computation, but seem to have very few users. Haskell, Common Lisp, and Scheme have a solid user base, but I am not sure that people use them for numerical work.

Apologies if this question seems too generic. I am not asking for philosophical statements regarding the merit of this or that language. I just would like to know what you use for custom analysis of very large data sets.

+8  A: 

I've had good experiences using Python and its package Numpy to work with large data sets. It's not a strictly functional language though. Your question doesn't make it clear why you're searching for a functional language.

If you do want a functional language, I recommend Haskell. Its performance is excellent and it has a phenomenal user community.

Greg
I found that with NumPy I had to twist my head around to figure out how to vectorize my calculations, and sometimes I still had to resort to writing some Pyrex or C code. I think Haskell makes a better candidate for number-crunching problems.
yairchu
Thanks. I know Numpy. It doesn't address the basic issue, since large data sets are handled and processed not unlike R (in memory, and LAPACK and BLAS/ATLAS do the heavy lifting). Python is neat, yet slow.
gappy
For Python you can also use Cython: Python code compiled directly into a C-source Python extension. Very fast.
Jiri
+19  A: 

I am content to stick with R. I also have large data sets on similar dimensions, though you could call them 'sparse' (in a slight abuse of the term). I find that filtering / condensing the data first, possibly using some fairly quickly written C++ subroutines, and then modeling and analyzing in R is still the best bet. R is mature, well tested, extensible and has 1900+ packages on CRAN --- not sure how much of this I'd find in other languages.
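
A minimal sketch of that filter-then-model pattern, assuming the Rcpp package is available (the cutoff, the toy data, and the function name are purely illustrative, not part of the original answer):

    library(Rcpp)

    # Quickly written C++ subroutine: flag rows whose feature exceeds a cutoff.
    cppFunction('
      LogicalVector keepRows(NumericVector x, double cutoff) {
        int n = x.size();
        LogicalVector keep(n);
        for (int i = 0; i < n; ++i) keep[i] = x[i] > cutoff;
        return keep;
      }
    ')

    dat  <- data.frame(x = rnorm(1e6), y = rnorm(1e6))  # stand-in for the real data
    keep <- keepRows(dat$x, 1.5)                         # condense with the C++ filter
    fit  <- lm(y ~ x, data = dat[keep, ])                # model the reduced set in R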

Notes on how to do 'more' with R are in my intro to high-performance computing with R tutorial notes if you permit the blatant self-reference. Maybe you will find something useful in there -- it covers profiling, C/C++ extension building and parallel computing with R.
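
On the parallel computing side, a minimal sketch of an embarrassingly parallel job in R, assuming the multicore package (or its successor, parallel) is installed; the body of run_one is just a placeholder:

    library(parallel)   # ships mclapply (formerly in the 'multicore' package)

    run_one <- function(i) {
      set.seed(i)
      mean(rnorm(1e6))  # stand-in for one independent chunk of work
    }

    # Fan 100 independent tasks out across 8 cores and collect the results.
    results <- mclapply(1:100, run_one, mc.cores = 8)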

That said, the best bet may well be with the old Unix philosophy of combining several well-designed tools in a larger chain. Maybe some of the newer / functional languages can help you in a processing step before or after you do other work with R or C.

Dirk Eddelbuettel
Dirk, I have read your presentation(s), which are excellent. It is possible to combine tools, but it's a second best. I would like to do everything in one language. My concern is more fundamental. What happens when a) I have to process even bigger data sets, possibly in a streaming fashion; b) I want to take advantage of a 128-core machine? I think I am 2-3 years away from this scenario. Some colleagues of mine are already wrestling with nVidia Tesla cards, with hundreds of cores.
gappy
Hi Gappy -- Thanks for the compliments! People are happy doing 'large data', streams etc. with R. John Chambers had a paper on R and streaming data a few years ago; Simon Urbanek did work there too. I guess it all depends -- I think there may never be 'one language for all needs' and we will always need to balance goals that may well be conflicting. At some point the non-multithreaded nature of the interpreter may bite us, yet on the other hand interesting new packages like 'multicore' keep appearing. And yes, Tesla and GPUs may well be next.
Dirk Eddelbuettel
Dirk, I use R on Ubuntu, so I owe you for automagically converting the R package universe to Debian. I visit your site regularly. I even used the Quantian distro a few years back!!
gappy
Dirk - thank you so much for that presentation link!
Vince
+10  A: 

You are not the only one thinking about this.

Back To The Future: Lisp as a Base for a Statistical Computing System Presentation, Paper, Code.

I would recommend one of the 64bit Common Lisp implementations: Overview

Rainer Joswig
Yes, but elsewhere on his site Ross Ihaka remarks that only 3 people are working on this project. It is evolving slowly, and doesn't seem to be backed by the core team.
gappy
why not contact him and ask HIM how it is going?
Rainer Joswig
Good point. I will shoot him an email.
gappy
The presentation link is 404.
Brad Ackerman
I have updated the link...
Rainer Joswig
+5  A: 

F#, maybe? The good thing about F# is that you can use the rich OCaml codebase, and at the same time have full access to the .NET Framework with all the neat IDEs, tools, 3rd-party libraries, and the fancy interactive F# with on-the-fly code compilation/injection.

Scala (JRE-based) also seems to be a very powerful multiparadigm language. Additionally, it provides some very nice metaprogramming features, so you can extend the language however you like.

Nemerle (CLR-based) is similar to Scala but with even more powerful metaprogramming.

The great advantage of these three languages is that they are based on highly optimized and polished virtual machines, so performance will be nearly as good as native C/C++, if that concerns you.

Ray
I should have mentioned F# as a candidate. It seems promising, but I haven't heard of anyone using it for numerical/statistical applications.
gappy
Actually, so far most of the code examples / screencasts / blog articles that I've seen feature exactly this kind of numerical/statistical problem solving in F#. F#, like any functional language, seems to be very well suited to these kinds of problems.
Ray
Ray, thanks. Would you mind adding one or two links to scientific applications?
gappy
Check VSLab http://cvslab.di.unipi.it/vslab/blog/page/Visual-Tutorial.aspx. It's a scientific environment somewhat similar to MATLAB, built on top of F# Interactive and DirectX (for visualizations). Also take a look at the F# Math Tools project at CodePlex: http://www.codeplex.com/fsmathtools
Ray
You can search for "fsharp" on channel 9.
Robert Harvey
From what I've heard, F# is primarily used for, and suited to, numerical-type computation. It also has the best potential for scaling that I've personally seen. (Erlang and maybe Scala are also quite good scaling-wise.)
Paul Nathan
Note that the fsmathtools quoted above got merged into dnAnalytics, which got merged in turn into Math.NET numerics. Not sure how much F# is actually left in there...
Benjol
+18  A: 

I share your desire for fast prototyping AND fast runtimes, and am similarly dissatisfied with kludging together {R or Python} + {Fortran or C or C++}. I don't really want a full-out Scheme-style (or worse, Haskell-style) functional language though. (Been there, done that; they sound pretty, but aren't practical.) I just want an imperative language that doesn't suck, with a few functional and maybe OOP features.

That said, I'm now looking into Scala, with the Scalala MATLAB-like library. It does linear algebra and, I think, LAPACK bindings (through some Java/JNI library), and also some plotting. Since Scala has an interactive interpreter (albeit a bit immature), you can use it in an R/MATLAB interactive style, which is nice.

This is fairly new, so way less library support than R.

Scala is like Java but less sucky and more functional-y and type-inference-y. Supports lightly typed, fast prototyping better than either C++ or Java. In terms of language features, I think it's similar to OCaml or F#. I find it easier to grok, personally, though it's a somewhat complex language.

I've found Scala to be faster than Python (supposedly it's as fast as Java) which I avoid in the same situations you stated -- sometimes you need to write your own inner loops and it would be nice to not have to bounce in and out of C. (FWIW, Python's ctypes is a lot nicer than R's C API, but you do still get the basic annoyances of C/C++ world like segfaults.) I know some folks who write L-BFGS and other performance-crucial sort of numeric code directly in Scala and they seem happy. But I have not tried this myself yet.

The Scalala author tells me he does all his algorithms and analysis in Scala now. Of course, he'd be the first one to do that :) but since you mentioned complete integration in a single language as a goal, that is one success story.

For the longer term hardware innovation problems you mentioned, like GPU usage, the JVM (Scala, Clojure, Java) seems like a bad bet because it has such insufficient C/C++ integration. I wonder if any functional language is very good here. I suspect hardware innovations will always require close-to-the-metal C++ coding, at least for several years after they come out.

JVM is good at multiple cores though. Scala has some nice parallelism libraries, though I suspect any functional JVM language should have good ones.

Positive sides about the JVM: good JIT compilation, garbage collection and runtime type safety (no segmentation faults), yet only slightly slower than C++. (Supposedly at least.) Very widespread in corporate programming, so it'll be around for years to come on open source platforms.

Negative sides:

  • insufficient C/C++ integration (JNI, though it gets the job done)

  • clunky integration into Unix-land. Just using Java's I/O library is such a pain. (ScalaNLP's Pipes.scala makes this substantially easier; I need to bother those guys again about putting it on github so I could link to that file, it's great.) And every time you want a commandline script you have to write a shell loop that adds dozens of jars to the CLASSPATH. Argh. This is NOT acceptable.

  • insufficient package management compared to Python's or Ruby's. I always thought RubyGems and eggs/easy_install were clunky, but watching Maven 2.x crash and burn is a disheartening experience. This is a general problem with the Java ecosystem. On the other hand, people don't seem to have as much of a problem with it as I do.

  • any JVM language will always be slower than well-tuned C++ or Fortran

On the other languages people are talking about...

  • F#: If I had Windows for free, I'd try C#/F# because supposedly they have very good C++ integration. But I don't, and Mono doesn't look widely enough used to bet on as a platform. Despite the annoyance the JVM is, it's definitely here to stay.

  • Clojure: I don't understand what the big deal is. It's cool because it has irritating parentheses?

Brendan OConnor
JNA as a direct FFI is to be preferred to the JNI these days.
Steve Gilham
oh, very good to know. thanks.
Brendan OConnor
Thanks Brendan. Very informative and funny answer. My bet is that Scala has the best chance to succeed among the new batch of promising languages (Scala, Clojure, Haskell). I have heard good things about F#, but I am a Linux/EC2 user, and have never used Mono. I would like to implement a few streaming algorithms that have appeared in conferences and JMLR. To do this I need easy I/O, fast execution, no memory management, decent math libraries, easy parallelization. Do you think Scala can deliver?
gappy
I still consider myself a Scala newbie, so I decline to answer for sure :) But my guess would be that Java, Scala, or Clojure could all do it... I bet Haskell would die at some point, or you'd be forced into writing C.
Brendan OConnor
+1 Many good points made. I do wish you hadn't made the flippant remark about Clojure though, as it detracts from your otherwise well-informed answer.
alanlcode
-1. Great answer but the killer problems with the JVM are generics and value types, both of which destroy performance and scalability, e.g. filling a hash table with primitive types is 32x slower in Java than F#.
Jon Harrop
Jon: my bad for dissing Clojure without enough information. I did write this answer a while ago; and now it seems there's tons of numeric and data processing code coming out for Clojure. Maybe it will take over the world.
Brendan OConnor
@specialized in Scala 2.8 promises the best of both worlds with regard to generic code that doesn't need to box primitives. Scala bindings exist for OpenCL, which can exploit the advanced processing capabilities of GPUs or CPUs. (http://code.google.com/p/nativelibs4java/wiki/OpenCL)
retronym
+5  A: 

Have you looked at http://www.revolution-computing.com/ and their tools for large/parallel data processing? That's probably where I'd start - first with their free stuff and then if you have the cash, their pay stuff.

Personally I think it's a losing battle to try to find a new tool just to work with large data. I think the existing good tools have to be outfitted with good support for addressing data structures too big to fit comfortably into RAM. I don't see another way forward for the community. I'm also pretty confident R will have good solutions for large data before some of these other, newer packages have deep benches of community-created modules.
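
As one hedged sketch of what that support can look like in R today, the bigmemory package keeps a matrix file-backed on disk so only the slices you touch are pulled into RAM (the dimensions and file names below are illustrative):

    library(bigmemory)

    # File-backed matrix: the 1e7 x 100 data lives on disk, not in RAM.
    x <- filebacked.big.matrix(nrow = 1e7, ncol = 100, type = "double",
                               backingfile = "x.bin", descriptorfile = "x.desc")

    # Only one column is brought into memory at a time.
    col_means <- vapply(seq_len(ncol(x)), function(j) mean(x[, j]), numeric(1))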

Ken Williams
+7  A: 

Incanter is excellent. It's pretty new, but it's very capable: most of its features are inspired by R, but with a more performant runtime.

technomancy
A: 

I'm also looking at doing some large-scale processing in R. I'm slowly ramping up to trying out the Amazon Web Services / EC2 / Hadoop route. I think it depends somewhat on what you need to do. If you are doing permutations and sorting of huge data sets, then performing those tasks in parallel can be very efficient. If you're inverting a huge matrix, I think a parallel approach is much less applicable. This might be a bit of a kludge as well, but for the right problems it could be highly scalable.
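
As a rough sketch of how R plugs into that route, a Hadoop Streaming mapper can be a plain Rscript that reads records from stdin and emits tab-separated key/value pairs (the two-field record layout here is made up for illustration); the reducer is just another such script reading the sorted pairs:

    #!/usr/bin/env Rscript
    # Hypothetical Streaming mapper: key = first field, value = second field.
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
      cat(fields[1], "\t", fields[2], "\n", sep = "")
    }
    close(con)

Hadoop then handles distributing the mappers and sorting the pairs before the reducer sees them, which is what makes the permutation/sorting style of job scale well.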

kpierce8
+3  A: 

Seconding Clojure / JVM / Incanter.

I haven't tested it on a very large dataset, though, but the combination of Clojure's lazy processing and Incanter's use of the parallel Colt Java libraries might make things work for you.

What I'm doing right now with it is mapcatting through hashtables with values that contain vectors, and doing some statistics on that. My data size is less than a million entries though, so I can't tell you if it will hold up.

bOR_
+3  A: 

I am not sure if you are going to like this suggestion or not, but have you ever heard of a software package called Root? It is developed by CERN (the European Organization for Nuclear Research) and is used to analyse data from the Large Hadron Collider. It is based on a C/C++ interpreter called CINT. Root is designed to handle the types of workloads you are describing and supports OOP. I use it in my research, which involves, in a given run, over a billion events, each with between 1 and 6 data points to be analysed. Root can be a pain to learn how to use, but it is very versatile once you do. Even though it is designed for high energy nuclear physics, it can be used for other things like statistical analysis. Root is run on parallel facilities so I know the capability is there (personally, I am not sure how easy it is to get this feature working on a cluster, though, as I have never tried myself).

You can find Root at CERN's webpage:

http://root.cern.ch

It is available for direct download or checkout via Subversion. It is designed to run on Linux, but will run on Mac OS X or Windows if you use Cygwin.

Chicken Fried Steak
I knew CINT but not Root. I'll check it out. Not sure it works for my needs though.
gappy
+1  A: 

I got led here via my blog somehow. I'm actually using Lush for that at present, but I'm swapping to disk a lot: it barfs at around the same time R does, and for the same reasons. While I haven't tried this in OCaml or Common Lisp, the same thing is likely to happen unless you have a 64 bit version and a big computer. Numerics facilities in all the other Lisps and ML's I've looked at have been very much beneath those in Lush. Sad, really. There is a lot of potential there.

If you want to hold more data in memory on a big machine, you can compile a 64 bit version of R, or buy a 64 bit version of MATLAB. I'm guessing that MATLAB is going to work better, but you can try the 64 bit R for free first to see if it works. I have also seen a lot of 64 bit SAS used for this type of thing.

Seems like the last Lush release is dated 2006... Is it still actively maintained?
nimrodm
A: 

If Mathematica were free, it would be really good for your needs. But it's not.

I have the same questions you do ... tell us if you find something useful.

Dan
And Mathematica really sucks for memory consumption and performance.
Jon Harrop
Mathematica also sucks for stability and maintainability. I've worked with it a lot over the last decade, and I flat out don't recommend it (unless you are a Mathematica guru and know exactly what you are doing).
Leo Alekseyev
+1  A: 

I just mention (since it looks to have been overlooked) that there is an open source version of the APL language called J. Here are the relevant links:

I've only toyed with it a little, but it shows some promise.

Shane
+4  A: 

Without knowing what exactly you want to accomplish: we use SAS (not functional at all) for jobs of this size on a regular basis on a good-sized PC (3 GHz, 4 GB RAM). It's not instantaneous but it does get the job done, and you don't have to code any differently.

SAS is kind of like an abusive relationship: most of the time it's OK (using most of the procs), sometimes it's really good (the data step and options, decent-sized data), and other times it's so horrible you have to change your zip code (inconsistent syntax; macro debugging, ye gods; nearly anything macro-related).

For tests, we generally sample down to 1e6-sized sets and go from there. SAS is great at that size on a personal PC, with procs running particularly quickly.

Pros:

  • Lots of built-in statistical tests and procedures, if you pay for them. Change one word and you get tons of output. The built-in stuff can be very powerful programmatically.
  • Data transformation is incredibly easy and powerful. The data step is intuitive and has a ton of functions and options. Also lots of built-ins for statistical transformations.
  • Huge community of examples. You can almost always google through a problem, but the examples often come from non-programming-savvy statisticians, so you should usually try to find a few solutions and pick the best one.

Cons:

  • It's very much not free; in fact, it's really expensive. Think $3k+ per year per box.
  • The IDE is horrible: no code completion, unstable, bad help, and weird code highlighting.
  • The macro language is even worse. Bad characters in comments (this is horrible) can cause macros to break, and macro debugging is nearly impossible. It requires serious ju-jitsu skills.
  • The language was developed by committee, and you can really tell. Development teams worked on separate procs, and the syntax of various elements actually changes from proc to proc. The user guides become indispensable when writing anything in SAS.
mcpeterson
Thanks for your nice answer. I used SAS only briefly in the past, but your points ring true to me. One more thing I would add: as disk access time and data size become the bottleneck for computation, languages that were developed with those constraints in mind, like SAS, may have a new edge.
gappy
+1  A: 

F# is the only thing that comes close but it is not multiplatform: it only works reliably on .NET under Windows.

Outside Windows, you have lots of choices (OCaml, Scala, Clojure, Common Lisp, Scheme, Standard ML, Haskell, Felix) but they each have at least one serious deficiency.

I have been working on a project called HLVM that will provide a foundation for high-performance high-level F#-like programming languages in the future but it is not quite ready yet.

Jon Harrop
A: 

If you want to stick with R, supposedly Hadoop and Map/Reduce is the way to go. AFAIK R is terrible at memory management, unfortunately.

Axl
I agree that Hadoop/M-R is a good choice with R (see e.g. http://bit.ly/bNHvP7), but I disagree that R is terrible at memory management. Most of the memory management problems I ever encounter are just me forgetting to clear out my interactive workspace.
Ken Williams
+1  A: 

I've had surprisingly good results with OCaml. The syntax is ugly to my taste, but as with other functional languages it's easy to translate abstractions into code, and it lends itself to problems where you're working at multiple levels of abstraction; e.g., linear algebra routines -> overarching mathematical algorithms -> data management -> interface with the external world. It just works, and it's relatively fast and memory-efficient.

While I haven't tried it out, I'm intrigued by Single Assignment C. SAC looks to be still in the experimental stage, but embodies a lot of thoughtful work and experimentation on linear algebra in a functional context. Implicit parallelization is a major focus.

Jeff