I'm doing a bit more statistical analysis on some things lately, and I'm curious if there are any programming languages that are particularly good for this purpose. I know about R, but I'd kind of prefer something a bit more general-purpose (or is R pretty general-purpose?).

What suggestions do you guys have? Are there any languages out there whose syntax/semantics are particularly oriented towards this? Or are there any languages that have exceptionally good libraries?

+2  A: 

The pystats library (for Python) is well-suited for statistical analysis.

AJ
It seems that the project files haven't been updated since 2005. That is usually a very bad sign.
signalseeker
I have a 2005 Jeep that still runs great!
AJ
I have a bit of cheese from 2005!
Thomas
@AJ - it's too bad that pystats isn't a jeep... :-)
Jason Baker
@Jason - indeed. my point was merely that just because code itself is old...does that mean it's no longer valid? how much has statistical analysis changed since 2005?
AJ
@AJ - Probably not enough to make a difference. However, it may mean that nobody's maintained the library since 2005 and thus it may be difficult to get help/fix bugs. Granted, that may not always be the case, but I'll tend to avoid such projects if I can.
Jason Baker
+1  A: 

Matlab is good at statistics too. It's not exactly free, though.

Octave is a free clone that might also do what you need.

Thomas
+2  A: 

A friend of mine who focuses on market statistics uses SAS. I don't know much about it, and it doesn't seem like a "real" language, but it might be worth checking out.

I'm all for Python with R bindings.

Matt Luongo
SAS is VERY expensive. If you want paid statistical software, there are more choices (including cheaper ones) such as SPSS, JMP, MATLAB and so on. Personally, I would prefer R :)
Tal Galili
+1, Python and R together is a dream come true. Check out rpy2: http://rpy.sourceforge.net/rpy2.html
Mark
+2  A: 

Have you considered using something like MATLAB? It has many advanced capabilities for data analysis, and you can do some programming in the environment.

Vincent Ramdhanie
+3  A: 

Hi

I would say R as most of the Statistics courses in my University use R and most of my friends who have taken such courses are quite content with its range and reach.

I have even tried MATLAB and found it pretty handy.

cheers

Andriyev
+1  A: 

APL is apparently one of the best languages around for statistics work. It is not general purpose though...

It does require a special keyboard and font, as it does not use ASCII.

See Conway's Game of Life in one line of APL for a bit of an overview of what can be done with it.

Oded
APL is as general-purpose as anything else, just a lot harder to learn. +1 for nostalgia
Norman Ramsey
If you're thinking of APL, then why not go with J or K instead, which are a little more practical but use the same basic approach?
Shane
+1  A: 

What about Stata? I have a friend who is a PhD Economics student and he raves about Stata all the time. And I have a personal affinity for Mathematica.

Andrew Noyes
A: 

Have a look into the RooFit package for ROOT. It is used by e.g. particle physicists for data analysis.

ROOT is a C++ framework and also comes with Python and Ruby bindings. It also includes a limited interactive C++ interpreter.

honk
+30  A: 

No contest -- R as the main implementation of S (and one that happens to be proper Open Source and a GNU project as well).

Not only was the S language designed precisely for this purpose (see the books by John Chambers), but the rich support of domain-specific packages at CRAN is second to none: over 2000 packages with proper quality control, often authored by experts in the field.

The ACM saw it the same way when it gave the ACM Software Systems Award to John Chambers in 1998 with the following citation:

John M. Chambers

For The S system, which has forever altered how people analyze, visualize, and manipulate data.

For reference, other winners of this award were TeX, Smalltalk, Postscript, RPC, 'the web', Mosaic, Tcl/Tk, Java, Make, ... Not a bad company to be in.

Now, if you 'only' want to collect and summarize some data just about any procedural or functional language will do. But if you want something that was designed for programming with data then R as the main S implementation it is.

Dirk Eddelbuettel
I fully understand R's power as a statistical language. However, I need to do some things aside from just statistics (parsing logfiles and accessing a sqlite database). Can R do that?
Jason Baker
Yup! There is e.g. the RSQLite package, which has everything you need to read/write to/from SQLite files. Plus, it uses the DBI interface, so you can re-use your code on different backends. As for parsing, R contains several regex engines, including basic, extended and Perl (see help(regex)), so it does this very well too. You can use R for scripting via the 'Rscript' executable on Windows, OS X and Linux, as well as 'r' ("littler") on OS X and Linux. [ I co-wrote / maintain littler ].
Dirk Eddelbuettel
You *can* do anything you want in R, but you probably don't want to. My suggestion would be to learn R and some other language that plays well with R. If you're building heavy-duty applications, maybe Java or Scala. If you're building medium-sized systems that are mostly wrappers around R, maybe Python or Ruby. Then use the various libraries that people have written to call R from your other language when you have a need for sophisticated work with data and statistics.
Harlan
+5  A: 

Have a look at Incanter, based on Clojure. "Incanter is a Clojure-based, R-like platform for statistical computing and graphics." Clojure is a Lisp-based language implemented on top of the JVM. It has easy access to Java libraries. Can't get more general purpose than that.

Eduardo Leoni
I was just looking at that, and it seems pretty interesting!
Jason Baker
+2  A: 

R is great if all you're doing is statistics. It's got a nice interactive interface and visualization tools. However, it's pretty hard to use as a general purpose language because its syntax and semantics are very highly optimized for doing statistics. If you want a more general-purpose language, Python with SciPy would be a decent choice, though I've used it and found the statistical routines in it to be somewhat immature. They often are inefficient or fail in corner cases.

If you're doing data mining on large datasets, making performance important, and/or you don't mind using alpha-ish tools, the D programming language and the dstats library can be pretty good. D is about as general-purpose a language as you get, but IMHO dstats is very easy to use because template metaprogramming makes it easy to design a nice API even in a statically compiled, close-to-the-metal language. (Full disclosure: I wrote most of dstats, so of course I think it's good.)

dsimcha
R works quite well for general purpose programming -- eg the code behind the CRANberries html and rss summaries of changes at CRAN is less than 200 lines of ... R. Likewise, more and more of the behind the scenes scripts used by R for building R, running tests, updating documentation from a latex-alike meta format are now in R. And no other language comes even close to CRAN and its 2000+ packages.
Dirk Eddelbuettel
@Dirk: I guess it's pretty subjective, but I find most math-oriented languages (R, Matlab, etc.) very awkward and strange for general purpose programming, not just R.
dsimcha
Many comparisons are subjective. Also, R != Matlab and this comparison is generally not a good one. Second, I gave you concrete examples of R as a general programming environment. It is quite possible thanks to numerous POSIX calls, wrapping of filesystem / OS level calls, regexp libraries etc pp. So with that I still rebut your 'if *all* you are doing is statistics'.
Dirk Eddelbuettel
I disagree with this. R quickly replaced Perl as my weapon of choice for most general-purpose programming tasks.
Sharpie
+9  A: 

No question that R is the best language for statistics, as Dirk says. I just want to add a few points to this:

First, I think that the primary reason that you should use R is because of the community. It is used so heavily by experts in academia and industry at this stage, that no other language even comes close to rivalling the wealth on CRAN.

Second, it should be acknowledged that R the language is a joy to work with. It is my primary language, and having tried alternatives, I have no intention of abandoning it any time soon. But it also doesn't have a monopoly on its strengths for programming with data, and this claim can be taken too far. All the Lisp and functional languages are strong at data programming. Lisp, after all, takes its name from "list processing", and it is Lisp's influence on R that makes the language what it is.

There are members of the R community (eg. Ross Ihaka) who are actually viewing Lisp as the statistical language of the future (see the "back to the future" paper for a reference) due to some deep design problems in the R language (eg. no multithreading).

So while R is undoubtedly the best language for statistical computing, I see some value in being familiar with another language like OCaml, Haskell, or (possibly) Clojure/Incanter.

Shane
+1  A: 

You can have a look at Sage, which is built on top of Python and lets you call different statistics environments (R, MATLAB, Octave, etc.) using Python syntax.

One of the major issues when writing programs to do statistics is that you may end up with many different small scripts, each one doing a separate task, which leaves you with messy folders and confusion about your results.

So, apart from choosing a programming language (I think other people have answered your question already), you also need a way to define pipelines of scripts: you can do this with GNU make (e.g. read this) or with Sage, and there are other solutions too.
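To make the pipeline idea concrete, here is a minimal Python sketch of the rule make applies: rebuild a target only if it is missing or older than its inputs. This is an illustration, not how make or Sage is implemented, and every name in it (`is_stale`, `run_pipeline`, the file names, the two example steps) is hypothetical.

```python
import os
import tempfile


def is_stale(target, sources):
    """True if target is missing or older than any of its sources (make's rule)."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def run_pipeline(steps):
    """Run each (target, sources, action) step in order, skipping up-to-date targets."""
    ran = []
    for target, sources, action in steps:
        if is_stale(target, sources):
            action(target, sources)
            ran.append(target)
    return ran


# Hypothetical two-step analysis: drop blank lines, then summarise the numbers.
def clean(target, sources):
    with open(sources[0]) as src, open(target, "w") as out:
        out.writelines(line for line in src if line.strip())


def summarise(target, sources):
    with open(sources[0]) as src:
        values = [float(line) for line in src]
    with open(target, "w") as out:
        out.write(f"n={len(values)} mean={sum(values) / len(values)}\n")


# Made-up input data in a temporary directory, for illustration only.
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "raw.txt")
cleaned = os.path.join(workdir, "clean.txt")
summary = os.path.join(workdir, "summary.txt")
with open(raw, "w") as f:
    f.write("1.0\n\n2.0\n3.0\n")

steps = [(cleaned, [raw], clean), (summary, [cleaned], summarise)]
first_run = run_pipeline(steps)   # both targets are missing, so both steps run
second_run = run_pipeline(steps)  # nothing changed, so nothing is rebuilt
```

Keeping each step as an explicit (target, sources, action) entry means the order and dependencies of the small scripts live in one place instead of in your head.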

dalloliogm
spellcheck: mayor -> major
Tshepang
fixed, thanks!!
dalloliogm
+2  A: 

From my experience, R is an exceptionally powerful language in these areas:

  1. Manipulation and transformation of data.

  2. Statistical analysis.

  3. Graphics.

But R is by no means a three-trick pony. I have also applied the language to tasks that do not fit entirely into the above categories. Some examples are:

  • A script to assist in the creation of OS X universal binaries by identifying and matching static and dynamic libraries of different architectures and then running the resulting groups through lipo.

  • Scripts to scrape information from web pages.

  • A set of scripts to create georeferenced imagery, cut the images into tilesets using GDAL, form a JSON manifest that describes the output and upload the result to a website for immediate display by OpenLayers.

My favorite part of using R is the frequency with which I get to say:

WHOA! There's a package that does THAT?!

Sharpie
+1  A: 

I'd also like to +1 for R. It might not be as easy to handle as STATA or even SPSS, in particular for non-programmers. Though I guess the average stackoverflower is way more of a programmer than I am.

That being said, I'd like to give a short overview, because I have seen a couple of statistical packages from a user's (economist's) point of view.

STATA is still the choice for the majority of economists, and indeed it has some pluses. STATA's GUI helps you stay on top of a load of options and statistical functions. Besides, STATA appears to be the only package with a mailing list that comes at least somewhat near the benchmark: the one-of-a-kind R mailing list. You can still write sophisticated .do files or download some from the web. STATA might not be as close to a programming language as R, but it still offers a nice programming language for statistical purposes. Depending on the size of your datasets, you should check which license you need.

You could also use SPSS which is even more of a GUI Tool than STATA and is a little less comprehensive for example for econometric work such as TOBIT models or Panel regressions, particularly discrete choice models.

There's also Eviews. Unfortunately I have forgotten most of it and only used it for a couple of easy regressions in my studies, so I just name it here. The same goes for GAUSS, which appears more mathematical than the rest of the pack. Recently I have heard about Octave, which is also more mathematical.

For my personal usage R is head and shoulders above anything else. Occasionally I pair it with Python or connect it to MySQL or PostgreSQL databases, which also works well. R really helps you to learn statistics because you need to understand more in order to do something than you would need when clicking your way through the likes of SPSS. Though if you need a GUI, you could try RKWard or consider installing Komodo / Sciviews-R or Tinn-R on Windows. The latter ones aren't GUIs, but more or less editors that support syntax highlighting and code suggestions, which also help to get things done. Farnsworth's Econometrics in R is a good read. Ah, and I can't forget to mention the plotting. The ggplot2 package from Hadley Wickham is just off the hook: the best way to create graphics as long as you do not need them to be interactive. At the end of the day R is really the most flexible package: you can even install it on a webserver and build a nice web interface. The sky is the limit.

ran2
Use Python for parsing, write your stuff to a local SQL database, create some nice views, and then use RMySQL, for example, to read them from R. It's worth the hassle!
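A rough sketch of the Python half of that workflow, under some loud assumptions: the log format, the table and view names (`requests`, `daily_summary`) and the sample lines are all made up, and SQLite from the standard library stands in for MySQL so the example runs without a server. On the R side you would then read the view with a DBI backend such as RSQLite or RMySQL.

```python
import re
import sqlite3

# Hypothetical log format: "2009-11-02 12:00:01 GET /index.html 200 0.25"
LOG_LINE = re.compile(
    r"(?P<day>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<seconds>\d+\.\d+)"
)


def parse_lines(lines):
    """Yield one tuple per line matching the assumed format; skip everything else."""
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:
            yield (m["day"], m["path"], int(m["status"]), float(m["seconds"]))


def load(conn, rows):
    """Store parsed rows and expose a per-day summary as a view."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(day TEXT, path TEXT, status INTEGER, seconds REAL)"
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", rows)
    conn.execute(
        "CREATE VIEW IF NOT EXISTS daily_summary AS "
        "SELECT day, COUNT(*) AS hits, AVG(seconds) AS mean_seconds "
        "FROM requests GROUP BY day"
    )
    conn.commit()


# Made-up sample input; a real script would iterate over an open log file.
sample = [
    "2009-11-02 12:00:01 GET /index.html 200 0.25",
    "2009-11-02 12:00:05 GET /about.html 200 0.75",
    "this line does not match and is skipped",
    "2009-11-03 08:30:00 POST /login 302 0.50",
]

conn = sqlite3.connect(":memory:")  # use a file path to keep the data around
load(conn, parse_lines(sample))
summary = conn.execute(
    "SELECT day, hits, mean_seconds FROM daily_summary ORDER BY day"
).fetchall()
```

The view keeps the aggregation in the database, so the statistics side only has to select from `daily_summary` rather than re-derive it.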
ran2