Datamining models in FORTRAN or C (or managed code)?

F# compiles to the CLR, which has a just-in-time compiler. It's a dialect of ML, which is strongly typed, allowing all of the nice optimisations that go with that type of architecture; this means you will probably get reasonable performance from F#. For comparison, you could also try porting your code to OCaml (IIRC this compiles to native code) and see if that makes a material difference.

If it really is too slow then see how far that scaling hardware will get you. With the performance available through a modern PC or server it seems unlikely that you would need to go to anything exotic unless you are working with truly brobdinagian data sets. Users with smaller data sets may well be OK on an ordinary PC.

Workstations give you perhaps an order of magnitude more capacity than a standard dekstop PC. A high-end workstation like a HP Z800 or XW9400 (similar kit is available from several other manufacturers) can take two 4 or 6 core CPU chips, tens of gigabytes of RAM (up to 192GB in some cases) and has various options for high-speed I/O like SAS disks, external disk arrays or SSDs. This type of hardware is expensive but may be cheaper than a large body of programmer time. Your existing desktop support infrastructure shouldn be able to this sort of kit. The most likely problem is compatibility issues running 32 bit software on a 64-bit O/S. In this case you have various options like VMs or KVM switches to work around the compatibility issues.

The next step up is a 4 or 8 socket server. Fairly ordinary wintel servers go up to 8 sockets (32-48 cores) and perhaps 512GB of RAM - without having to move off the Wintel platform. This gives you fairly wide range of options within your platform of choice before you have to go to anything exotic[1].

Finally, if you can't make it run quickly in F#, validate the F# prototype and build a C implementation using the F# prototype as a control. If that's still not fast enough you've got problems.

If your application can be structured in a way that suits the platform then you could look at a more exotic platform. Depending on what will work with your application, you might be able to host it on a cluster, cloud provider or build the core engine on a GPU, Cell processor or FPGA. However, in doing this you're getting into (quite substantial) additional costs and exotic dependencies that might cause support issues. You will probably also have to bring a third-party consultant who knows how to program the platform.

After all that, the best advice is: suck it and see. If you're comfortable with F# you should be able to prototype your application fairly quickly. See how fast it runs and don't worry too much about performance until you have some clear indication that it really will be an issue. Remember, Knuth said that premature optimisation is the root of all evil about 97% of the time. Keep a weather eye out for issues and re-evaluate your strategy if you think performance really will cause trouble.

Edit: If you want to make a packaged application then you will probably be more performance-sensitive than otherwise. In this case performance will probably become an issue sooner than it would with a bespoke system. However, this doesn't affect the basic 'suck it and see' principle.

For example, at the risk of starting a game of buzzword bingo, if your application can be parallelized and made to work on a shared-nothing architecture you might see if one of the cloud server providers [ducks] could be induced to host it. An appropriate front-end could be built to run locally or through a browser.

However, on this type of architecture the internet connection to the data source becomes a bottleneck. If you have large data sets then uploading these to the service provider becomes a problem. It may be quicker to process a large dataset locally than to upload it through an internet connection.

Or even throw more hardware at the bottlenecks. It might be less expensive, than throwing developers' time at them. Anyway: make it work, measure performance, then optimize.

Tadeusz A. Kadłubowski 2009-10-19 07:52:11

Thank you for your answer. This is of course a valid point. We will probably prototype in F# anyhow since it is such a nice language. Let's see how much of an issue this really is. Although I have a gut feeling that the optimizations algorithms needed (for support vector machine-training etc) will be a great bottleneck for large datasets.

Rickard 2009-10-19 08:04:09

@Rickard: There are many good open source SVM libraries in C++ (and other languages). You could probably interface with these from F#.

csl 2009-10-22 08:42:51

Thank you. We are aiming for regular user PCs rather than dedicated workstations (we are basically aiming for an easy-to-use application similar to excel but with strong linear algebra-support and easy-to-use data mining-models integrated). We'll try F# to begin with.

Rickard 2009-10-19 09:21:12

ansaurus

tags:

views:

answers:

Datamining models in FORTRAN or C (or managed code)?

related questions