views: 413

answers: 3

I'm working on a possible machine learning project that is expected to do high-speed computations using SVMs (support vector machines) and possibly some ANNs.

I'm reasonably comfortable working with these in MATLAB, but primarily on small datasets, just for experimentation. I'm wondering whether this MATLAB-based approach will scale, or whether I should be looking into something else: C++ / GPU-based computing? Wrapping the MATLAB code in Java and pushing it onto App Engine?

Incidentally, there seems to be a lot of literature on GPUs, but not much on how useful they are for machine learning applications using MATLAB and the cheapest CUDA-enabled GPU money can buy. Is it even worth the trouble?

+4  A: 

Both libsvm and SVMlight have MATLAB interfaces. Besides, most learning tasks are trivially parallelizable, so take a look at MATLAB commands like parfor and the rest of the Parallel Computing Toolbox.

AVB
+2  A: 

I work on pattern recognition problems. Let me give you some advice if you plan to work effectively on SVM/ANN problems and really don't have access to a computer cluster:

1) Don't use Matlab. Use Python and its large number of numerical libraries instead for visualisation/analysis of your computations.
2) Critical sections are better implemented in C. You can then integrate them with your Python scripts very easily (see the sketch below).
3) CUDA/GPU is not a solution if you mostly deal with non-polynomial time complexity problems, which is typical in machine learning, so it brings no great speed-up; dot/matrix products are only a tiny part of SVM calculations. You will still have to deal with feature extraction and list/object processing, so try instead to optimize your algorithms and devise effective algorithmic methods. If you need parallelism (e.g. for ANNs), use threads or processes.
4) Use the GCC compiler to compile your C program - it will build very fast executable code. To speed up numerical computations you can try GCC optimization flags (e.g. for Streaming SIMD Extensions).
5) Run your program on any modern CPU under Linux.

For really good performance, use Linux clusters.
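
A minimal sketch of points 2 and 4, assuming the critical section lives in a C file compiled into a shared library; the file name, function name and signature here are illustrative, not part of the advice above:

    # Sketch: a C "critical section" called from Python via ctypes.
    # Assumes kernels.c exports
    #     double rbf_kernel(const double *x, const double *y, int n, double gamma);
    # and was built with GCC optimization flags, e.g.:
    #     gcc -O3 -msse2 -shared -fPIC kernels.c -o libkernels.so
    import ctypes
    import numpy as np

    lib = ctypes.CDLL("./libkernels.so")
    lib.rbf_kernel.restype = ctypes.c_double
    lib.rbf_kernel.argtypes = [
        ctypes.POINTER(ctypes.c_double),  # x
        ctypes.POINTER(ctypes.c_double),  # y
        ctypes.c_int,                     # vector length
        ctypes.c_double,                  # gamma
    ]

    def rbf(x, y, gamma=0.5):
        """Evaluate the C kernel on two 1-D float64 arrays."""
        x = np.ascontiguousarray(x, dtype=np.float64)
        y = np.ascontiguousarray(y, dtype=np.float64)
        return lib.rbf_kernel(
            x.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
            y.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
            x.size,
            gamma,
        )

    if __name__ == "__main__":
        a, b = np.random.rand(1000), np.random.rand(1000)
        print(rbf(a, b))

For the parallelism in point 3, the same Python glue can be fanned out over worker processes with the standard multiprocessing module.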

psihodelia
Is Matlab that bad for SVM, or do you just think Python is generally preferable?
Jonas
@Jonas: yes, Matlab is not recommended because: 1) It is a proprietary, non-open-source product --> it can run only in a very limited set of environments (e.g. OS={Windows, Mac}, CPU={x86}, etc.). 2) Matlab uses parentheses for both indexing into an array and calling a function --> you will have problems reading a large enough program. 3) Matlab is extremely slow, because the input arguments to a function are copied rather than passed by reference as in Python.
psihodelia
Actually, (1) Matlab runs in all common environments. I use it on 64-bit Windows, OS X and Linux, for example. (2) Since indexing into an array is basically calling a function (subsref, which you can overload if you want), I don't see why this should be a problem - and at least for me, readability comes from structuring and commenting of code, not from parentheses. And (3) Matlab does copy-on-write, which would anyway be much more a problem of memory than of speed. In other words, there does not seem to be a problem with the SVM implementation in Matlab; you just don't like the program.
Jonas
Interesting - I'd be cool with learning Python, but will there be a substantial performance loss compared to Matlab in using Python? I also foresee doing a substantial amount of text mining (it's an academic project, so getting a Matlab license isn't a problem).
flyingcrab
Ignoring the obvious Linux/open-source bias in this answer... The OS doesn't affect the performance of CPU-bound applications. GCC is not the best optimizing compiler available on Linux (by far). SIMD optimizations without using intrinsics are likely to give only a few percent performance gain. Whether problems are polynomial or not, two times faster is always two times faster; algorithms don't replace hardware. Writing in C together with Python requires knowledge of two niche languages, neither of which really suits the job.
ima
There should be no performance loss, likely a gain - if you use C libraries for all the calculations... Still, numpy/scipy is a compromise for people who already find Matlab limited but are still wary of doing full-scale programming. If you can use Java or C++ (C#, Scala, etc. - doesn't matter), you might as well go the whole way.
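
As a small sketch of the "C libraries for all calculations" point (toy data, illustrative function names): numpy hands the whole Gram-matrix product to compiled BLAS in one call, whereas an explicit Python loop pays interpreter overhead on every entry.

    # Keep the heavy numerics inside compiled code (numpy delegates matrix
    # products to BLAS); use Python only as glue.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 100))  # 500 samples, 100 features (toy data)

    def gram_loop(X):
        # Slow: one interpreted Python iteration per matrix entry.
        n = X.shape[0]
        K = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                K[i, j] = X[i] @ X[j]
        return K

    def gram_blas(X):
        # Fast: the entire product runs inside compiled BLAS code.
        return X @ X.T

    assert np.allclose(gram_loop(X), gram_blas(X))

Both functions return the same matrix; the second is typically orders of magnitude faster.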
ima
A: 

I would advise against using Matlab for anything beyond prototyping. When the project becomes more complex and extensive, the proportion of your own code will grow versus the functionality provided by Matlab and its toolboxes. The more developed the project becomes, the less you benefit from Matlab and the more you need the features, libraries and - more importantly - the practices, processes and tools of general-purpose languages.

Scaling a Matlab solution is achieved by interfacing with non-Matlab code, and I've seen Matlab projects turn into nothing more than glue calling modules written in general-purpose languages, causing everyday pain for everyone involved.

If you are comfortable with Java, I'd recommend using it together with some good math library (at the very least, you can always interface with MKL). Even with recent Matlab optimisations, MKL + JVM is much faster - scaling and maintainability are beyond comparison.

C++ with processor-specific intrinsics can provide better performance, but at the price of development time and maintainability. Adding CUDA improves performance further, but the amount of work and specific knowledge required is hardly worth it - certainly not if you don't have prior experience with GPU calculations. As soon as you go beyond a single processor, it's much more effective to add another CPU or two to the system than to struggle with GPU calculations.

ima
ima, thanks for that - I've done a reasonable amount of coding with Java, so I suppose that is something I need to look into more for this project. The vast majority of stuff I seem to find on the net is quite evangelical, one way or another - it's hard to find balanced opinions - so thanks for your comments :)
flyingcrab