I do machine learning on fairly large datasets (they still fit in memory), and some of the calculations I have written in R are too slow. I would therefore like to replace the "critical parts" of the program with compiled code that I can call from R. An example problem I have at hand is implementing the forward-backward algorithm.

My question is: should I learn Fortran or C++ to do this? I only need to work with numeric vectors or matrices. I'm mainly interested in which language is easier to learn and to interface with from R; I don't really care which one looks better on my CV.

I have read the R extensions manual and played a bit with the inline package with some simple Fortran and C++ code. My current impression is that Fortran95 would be simpler to learn, although the Rcpp package also looks very interesting. I currently know R, Python and Matlab.
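For readers unfamiliar with the example problem, here is a rough sketch of a scaled forward-backward pass for a discrete hidden Markov model in plain C++. It is only an illustration, not code from the question: the function name `forward_backward`, the `Matrix` alias and the argument layout are all assumptions, and it uses `std::vector` rather than any R or Rcpp types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Scaled forward-backward pass for a discrete hidden Markov model.
// (Hypothetical helper; names and layout are assumptions.)
// init[i]     : probability of starting in state i
// trans[i][j] : probability of moving from state i to state j
// emit[i][k]  : probability of state i emitting symbol k
// obs[t]      : observed symbol index at time t
// Returns gamma[t][i] = P(state at time t is i | all observations).
Matrix forward_backward(const std::vector<double>& init,
                        const Matrix& trans,
                        const Matrix& emit,
                        const std::vector<int>& obs) {
    const std::size_t S = init.size();
    const std::size_t T = obs.size();
    assert(T > 0 && trans.size() == S && emit.size() == S);

    Matrix alpha(T, std::vector<double>(S));
    Matrix beta(T, std::vector<double>(S, 1.0));
    std::vector<double> scale(T, 0.0);

    // Forward pass; each row is normalised to avoid underflow.
    for (std::size_t t = 0; t < T; ++t) {
        for (std::size_t j = 0; j < S; ++j) {
            double p = 0.0;
            if (t == 0) {
                p = init[j];
            } else {
                for (std::size_t i = 0; i < S; ++i)
                    p += alpha[t - 1][i] * trans[i][j];
            }
            alpha[t][j] = p * emit[j][obs[t]];
            scale[t] += alpha[t][j];
        }
        for (std::size_t j = 0; j < S; ++j) alpha[t][j] /= scale[t];
    }

    // Backward pass, reusing the forward scaling factors.
    for (std::size_t t = T - 1; t-- > 0;) {
        for (std::size_t i = 0; i < S; ++i) {
            double p = 0.0;
            for (std::size_t j = 0; j < S; ++j)
                p += trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j];
            beta[t][i] = p / scale[t + 1];
        }
    }

    // With this scaling, alpha * beta is already the posterior.
    Matrix gamma(T, std::vector<double>(S));
    for (std::size_t t = 0; t < T; ++t)
        for (std::size_t i = 0; i < S; ++i)
            gamma[t][i] = alpha[t][i] * beta[t][i];
    return gamma;
}
```

The triple loop over time steps and state pairs is the part that tends to be slow when written as interpreted R loops, which is why this kind of routine is a natural candidate for compiled code.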

+1  A: 

If you are in academia, a lot of people still use Fortran, so this could be a good plus. And Fortran is really good at chewing through numbers.

Alexandre C.
I work at a university and my field is Agricultural Engineering. I know the animal breeders in my department use Fortran.
Matti Pastell
So I would suggest going with Fortran then.
Alexandre C.
I don't actually share any code with them, so there is no real synergy... I guess I could get some advice. I'll definitely need to tell them that interfacing Fortran subroutines from R is relatively easy.
Matti Pastell
+14  A: 

I write a fair bit of Fortran, lots of Matlab, and recently started seriously learning C++. I think that you will be productive in your new language sooner if you go with Fortran rather than C++. I suggest this bearing in mind:

  • I guess that most of the number crunching you want to do is processing large arrays of numbers. Fortran is very good at this and has fundamental language constructs and intrinsic functions for whole-array operations (not always faster than loops, mind you). C++ lacks these features; you either have to program them yourself or use a library such as Boost (highly recommended by people far more knowledgeable than me).
  • A lot of the features which make C++ an attractive language for a large range of application types (templates, all the OO stuff, pointers, references, and more) are not terribly useful within your domain. I suspect that if you need to do any 'clever' programming you'll do it in R, leaving Fortran for simple heavy lifting. Fortran has most of those features too, but they're not so widely used in the Fortran community.
  • The Fortran mindset is not far from the Matlab mindset, so the leap from the latter to the former is not huge. Right now, too, my view is that learning enough Fortran to be productive in your domain is going to be quicker than learning enough C++.
  • As for the relative performance of Fortran and C++: believe nothing unless you have measurements in front of you. But I think that you have to work hard and smart to get C++ to match Fortran performance. It can certainly be done, but I think it's more demanding of the programmer's skills. Fortran compilers have had over 50 years of work on them, and optimisation for execution speed is very important to us Fortran programmers.

I can't comment at all on the ease of integrating R and Fortran or C++.

High Performance Mark
As a note on performance: I believe Fortran programs are usually better than C/C++ ones when processing arrays because Fortran has no pointers. A Fortran compiler can then perform optimizations based on the fact that an array is only accessed through its name (google "aliasing problem" for more info).
Alexandre C.
@Alexandre: well, actually, Fortran does have pointers. But I think, as you do, that they are not required for array processing.
High Performance Mark
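To make the "aliasing problem" mentioned in these comments concrete, here is a small C++ sketch. The function names are made up for illustration. The point is only that a C++ compiler must allow for two pointers referring to the same memory, whereas, as far as I understand the Fortran rules, a subroutine's dummy arguments are not allowed to overlap when one of them is modified.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds *shift to every element of x. Because shift is a pointer, the
// compiler must assume it may point into x, so it may have to reload
// *shift after each store instead of keeping it in a register.
void add_shift(double* x, std::size_t n, const double* shift) {
    for (std::size_t i = 0; i < n; ++i) x[i] += *shift;
}

// Same operation with the shift passed by value: no aliasing is
// possible, so the value is fixed for the whole loop.
void add_shift_by_value(double* x, std::size_t n, double shift) {
    for (std::size_t i = 0; i < n; ++i) x[i] += shift;
}
```

Calling `add_shift(a.data(), 3, &a[0])` on `{1, 2, 3}` gives `{2, 4, 5}`, because the shift itself changes after the first store, while `add_shift_by_value` with a shift of 1 gives `{2, 3, 4}`. Both results are legal C++, and that is exactly what limits how freely the compiler can reorder or vectorise the first loop.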
Thanks for the insight. I agree that Fortran syntax is very easy to read when you know Matlab. I found that the Armadillo library and RcppArmadillo seem to provide the array operations that I need for C++, and there is even a syntax conversion table http://arma.sourceforge.net/docs.html#syntax for Matlab users. I think I'll try to implement some fairly small projects in both languages and go with what seems more natural.
Matti Pastell
I strongly disagree on point 2 above -- it is because of templates that modern C++ libraries such as Armadillo allow you to write C++ code that looks and feels the same as the corresponding Matlab expression, yet runs circles around it.
Dirk Eddelbuettel
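As a rough illustration of the template technique this comment alludes to (often called expression templates), here is a deliberately tiny sketch. It is not Armadillo's actual implementation; `Vec` and `Sum` are hypothetical types invented for the example, and only addition is supported.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lazy node representing the element-wise sum of two operands.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double fill = 0.0) : data(n, fill) {}

    // Assigning from an expression evaluates it in one fused loop,
    // without allocating a temporary vector per '+'.
    template <typename L, typename R>
    Vec& operator=(const Sum<L, R>& e) {
        assert(e.size() == data.size());
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// operator+ only builds the node; no arithmetic happens here.
Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) {
    return {a, b};
}
```

With these definitions, `d = a + b + c;` reads like the Matlab expression but compiles down to a single loop over the elements. The expression must be assigned in the same statement, since the nodes only hold references to their operands.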
+6  A: 

If you will be writing all the code yourself, then it may depend on which language you like better, or can learn better / faster. Rcpp may give you an edge, though, in getting R objects to C++ and back more easily. Also, the most recent additions in 0.8.3 give you R-alike vector expressions in compiled code.

On the other hand, if you plan to use / re-use / adapt existing libraries, then I would take a good look at e.g. mloss.org, see which language provides you with the most useful machine learning libraries, and have that guide your decision too.

To me, C++ provides rather useful abstractions plus access to an enormous code base of generally good quality. But others are content with Fortran. It really depends on you, and to some extent on the people around you who can give support.

Dirk Eddelbuettel
The new features in Rcpp seem very appealing indeed. So many thanks for the hard work on the package! Is it possible to use the new vectorised statements with RcppArmadillo?
Matti Pastell
Yes -- Armadillo has its expression template magic, and Rcpp now brings its variants alongside. Mind you, 'Rcpp sugar' works on our vectors, not Armadillo's. That could possibly be bridged. Oh, and feel free to accept this answer if this is what you end up doing :)
Dirk Eddelbuettel
I'll have a look at it and reconsider which answer to accept :) The problem is that I know very little C++, so getting started with Fortran seems easier, but I see more possibilities with C++. I don't have too much time to learn right now, but I'd really like to speed up some calculations.
Matti Pastell
I decided that this is the accepted answer, as calling Fortran from R turned out to be too complex and limited for my taste. See my own reply to the question for some details. Many thanks for the numerous Rcpp vignettes, they are extremely helpful!
Matti Pastell
+2  A: 

Fortran is the Java of HPC. You can write very efficient programs in C++, but it is easier to write the same program in Fortran, as long as it is suited to number crunching. Nobody would seriously write a GUI application in Fortran, but in HPC it is unbeatable in speed and conciseness.

f.jamitzky
+6  A: 

Fortran was the first programming language I learned; since then I have also picked up C and some C++. My two cents is that if you need to quickly speed up some matrix processing, definitely go with Fortran. The reasons are:

  • Fortran is really good at efficiently processing numerical data, especially when it is stored in matrices or arrays. This sort of work is the 'sweet spot' of the language.

  • Because Fortran has a narrow focus on numerical operations, it has a gentler learning curve than C and C++. There are fewer language features and quirks to learn, and you don't have to deal with pointers. This is a big win if all you want to do is speed up some calculations as quickly as possible and move on with your work.

  • Multidimensional arrays and array operations are first-class citizens in the Fortran language. With C or C++ you need to worry about using external libraries or writing functions/macros to provide the same functionality.
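To give a feel for what "writing functions to provide the same functionality" can mean in C++, here is a minimal sketch of a two-dimensional array type with one whole-array operation. `ColMatrix` is a hypothetical helper written for this illustration, not part of any library; it stores its data column by column, which is the layout both Fortran and R use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal dense matrix stored column by column, the layout that both
// Fortran and R use. Hypothetical helper for illustration only.
class ColMatrix {
public:
    ColMatrix(std::size_t rows, std::size_t cols, double fill = 0.0)
        : rows_(rows), cols_(cols), data_(rows * cols, fill) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Zero-based element access: (i, j) lives at offset i + j * rows.
    double& operator()(std::size_t i, std::size_t j) {
        assert(i < rows_ && j < cols_);
        return data_[i + j * rows_];
    }
    double operator()(std::size_t i, std::size_t j) const {
        assert(i < rows_ && j < cols_);
        return data_[i + j * rows_];
    }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};

// Element-wise sum, the kind of whole-array operation Fortran has built in.
ColMatrix operator+(const ColMatrix& a, const ColMatrix& b) {
    assert(a.rows() == b.rows() && a.cols() == b.cols());
    ColMatrix out(a.rows(), a.cols());
    for (std::size_t j = 0; j < a.cols(); ++j)
        for (std::size_t i = 0; i < a.rows(); ++i)
            out(i, j) = a(i, j) + b(i, j);
    return out;
}
```

In Fortran the equivalent of `c = a + b` needs no supporting code at all; in C++ something like the above has to exist somewhere, whether you write it yourself or pull it in from a library.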

On the other hand, C and C++ are decidedly better suited to general-purpose programming tasks outside the realm of numerical computation. If you see the possibility of something like lots of string manipulation in your future, then you probably want to invest your time in a language other than Fortran.

Update

One other important consideration is how your data is stored and processed on the R side. If you use Fortran, then you will have to pass your data into the compiled routines in a very basic form: scalars, vectors, etc. No lists or fancy objects.

Since R is implemented in C, there is a richer interface available that allows you to directly pass arbitrary R objects to C and C++ routines and then return arbitrary R objects. You can also execute callbacks that allow you to execute R functions from within the compiled C code.

Sharpie
+3  A: 

I have now done some experiments in using Fortran, C++ and R and I think I'm at least half ready to answer my own question now. I ended up writing the diff function (and some other small tests) in both Fortran and C++ and calling it from R.

For starters, I think anyone faced with this problem should read Writing R Extensions, the Rcpp introduction and the Rcpp FAQ.

I have now discovered some important points about interfacing the code from R that haven't yet been covered in the answers:

  • Rcpp with the inline package makes calling C++ from R extremely easy and even takes care of compiling the extension (see the Rcpp FAQ); you can specify everything that you want to go into the function and what you want to get out.
  • Using Rcpp and RcppArmadillo makes it possible to write efficient computations and call them from R very easily with very basic knowledge of C++.
  • The R interface to Fortran, ".Fortran", is much more limited: the code has to be a subroutine, and everything you want to get out has to be passed in as an argument. That is (as I understand it), you need to preallocate the result vector(s) or array(s) and pass them to the subroutine as well, and the subroutine then returns all of its arguments. It's not that difficult, but it is much more error-prone, tedious and limited.
  • If you want to write a portable package, you need to use F77; see here.
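The difference between the two calling styles described in the list above can be sketched in plain C++ using the diff function as the example. This is only an approximation of what happens at the R boundary: the function names are made up, `std::vector` stands in for an Rcpp vector, and no R headers are involved. The first function mimics the convention of R's ".C" and ".Fortran" interfaces, where every argument arrives as a pointer and results are written into memory the caller allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Style 1: the calling convention used by R's .C and .Fortran interfaces.
// Every argument is a pointer, nothing is returned, and the caller must
// allocate `out` with room for *n - 1 values before the call.
extern "C" void diff_inplace(const double* x, const int* n, double* out) {
    for (int i = 0; i + 1 < *n; ++i) out[i] = x[i + 1] - x[i];
}

// Style 2: allocate and return the result, which is closer to how an
// Rcpp function hands a vector back to R.
std::vector<double> diff_value(const std::vector<double>& x) {
    std::vector<double> out;
    for (std::size_t i = 0; i + 1 < x.size(); ++i)
        out.push_back(x[i + 1] - x[i]);
    return out;
}
```

With the first style, getting the length of `out` wrong on the calling side silently reads or writes out of bounds, which is the error-prone part; with the second style the function itself decides how large the result is.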

So, in conclusion: for what I need, writing Fortran and writing C++ (with Armadillo) seem about equally easy (or difficult), but interfacing the C++ code from R is a whole lot easier with Rcpp.

Matti Pastell
Cool! That is pretty much what we aim for in writing it, so it is quite gratifying to see that it actually works for you that way! ;-)
Dirk Eddelbuettel
I just finished doing exactly what I wanted and was able to make the slowest part of the code 70 times faster! It only took me a couple of hours, which I don't think is bad for a first time using Rcpp, Armadillo and C++ in a real project! Thanks again for a great package!
Matti Pastell