I am looking for hardware that must run about 256 computationally intensive, real-time, concurrent tasks around the clock (one multi-threaded C application). Each task takes about 40-50 MFLOPS, so all tasks together require about 10 GFLOPS. CPU-RAM speed is insignificant. All tasks must be managed by a Linux kernel (32-bit, with SMP).
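
As a rough check of the arithmetic above, the aggregate load can be worked out from the per-task figures. This is only a minimal sketch, assuming the per-task numbers are sustained rates in MFLOPS and that the tasks add up linearly; the function name is illustrative and not taken from any library.

```python
def aggregate_gflops(num_tasks, mflops_per_task):
    """Total sustained load in GFLOPS, assuming the tasks add up linearly."""
    return num_tasks * mflops_per_task / 1000.0


low = aggregate_gflops(256, 40)   # lower bound of the per-task estimate
high = aggregate_gflops(256, 50)  # upper bound of the per-task estimate
print(f"required: {low:.2f} to {high:.2f} GFLOPS")
```

The upper end of the estimate lands nearer 13 GFLOPS than 10, which is worth keeping in mind when comparing against a chip's quoted peak rating.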

I am looking for a single-mainboard solution with one multi-core CPU (if such a CPU exists). If no such CPU exists, then I need a single multi-socket mainboard (with multiple CPUs).

Can you please recommend a professional CPU/mainboard solution that satisfies these requirements? It is also very important that there are no issues with the Linux kernel (2.6.25). There is no virtualization and no need for large amounts of RAM or CPU cache. I would also prefer Intel architecture and well-proven stability. I still have doubts that this is feasible at all.

Thank you in advance.

UPDATE: I think I have found a right answer here and here.

+1  A: 
  1. Rent some Amazon EC2 nodes.

  2. Updated: How about PS3s then? NASA uses them for its simulation engines.

  3. Maybe use CPUs plus GPUs in commercial servers?

  4. Build it around FPGAs: nowadays, some variants include processors that can run Linux.

jldupont
Impossible. This is a very specific real-time solution.
psihodelia
I would use an FPGA solution, but developing the specific cores requires more time and money than I can afford.
psihodelia
A: 

Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.

As you mentioned, 10 GFLOPS isn't exactly to be sneezed at, so in a single machine it'll be expensive. There's also the problem of what you do when the machine breaks: you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.

Timo Geusch
I am not sure that it is feasible to achieve 10GFLOPs. Do you have some particular models in mind?
psihodelia
No particular models, but if you're building a grid already you can add more computers if you don't get the necessary floating-point throughput. One of the reasons why I'd be tempted to stay away from a single machine to achieve that sort of performance is that if it goes pop, you're not in a position to quickly replace it with a similar-spec machine. If you're using a compute grid or cluster, you'll end up with diminished throughput but you'll still be able to process data.
Timo Geusch
@Timo: almost no real-time solution that operates on I/O can be run across a network, because of network delays (millisecond range!)
psihodelia
A: 

Not Intel architecture, but these run Linux and have 64 cores on a single die.

TILEPro64

Aaron
Can these do Floating Point? I see no mention of it.
jsbueno
@jsbueno - good point - I'm not sure.
Aaron
There is no info on FLOPS or MIPS, so it is not serious to buy something if you cannot estimate its performance.
psihodelia
+1  A: 

Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.

There may be a better way to split the work up or deal with it rather than your current solution.

Adam Davis
It is for signal processing needs. This application deals with convolution in time domain and different digital filters.
psihodelia
Then you are much better off considering a DSP-based architecture, where you should be able to reduce the number or speed of CPUs because DSPs have built-in instructions that handle these computations much more quickly and efficiently than general-purpose processors. There are a number of processors that combine a general-purpose CPU (such as ARM) with a DSP for the heavy-duty processing. You may find that you need very few DSPs to meet your computational needs.
Adam Davis
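
To make the time-domain convolution mentioned in the comments above concrete, here is a minimal sketch of a direct-form FIR filter together with a rough cost estimate. The cost model (one multiply and one add per tap per output sample) is the usual back-of-the-envelope figure; the tap count and sample rate used below are hypothetical values chosen for illustration and do not come from the question.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter: y[n] = sum over k of taps[k] * samples[n - k].

    Samples before the start of the input are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def fir_flops_per_second(num_taps, sample_rate_hz):
    """Approximate cost: one multiply and one add per tap per output sample."""
    return 2 * num_taps * sample_rate_hz


# Hypothetical example values, not from the question:
# a 2-tap averaging filter, and a 500-tap filter running at 48 kHz.
print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))
print(fir_flops_per_second(500, 48_000) / 1e6, "MFLOPS")
```

Under these assumed values a single filter would cost about 48 MFLOPS, which is the same order as the per-task figure in the question. On a DSP, the inner multiply-accumulate typically maps onto a single instruction, which is the efficiency argument made above.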
+3  A: 

UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.

The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.sun.com/deniss/entry/floating%5Fpoint%5Fperformance%5Fon%5Fthe)

Joe Koberg
Thank you. This is already interesting because they have info on GFLOPS! But I fear it is very hard to find where to buy one.
psihodelia
"T5140" in the first paragraph above links to Sun's product page. $23k.
Joe Koberg
http://blogs.sun.com/deniss/entry/overview_of_t2_systems
Joe Koberg
Actually, there is no such unit as GFlops/s; GFLOPS already means floating-point operations per second, divided by 10^9.
psihodelia
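
Taking the figures quoted in the UltraSPARC T2 answer above at face value, the headroom against the question's requirement can be sketched as follows. The linear scaling across two chips is an assumption made here for illustration; the quoted blog post only reports the single-chip result.

```python
PEAK_GFLOPS_PER_CHIP = 11.0  # theoretical peak from the quoted Sun blog post
ACHIEVED_FRACTION = 0.90     # "nearly 90%" per the same quote
CHIPS_IN_T5140 = 2           # the T5140 carries two T2 chips


def sustained_gflops(chips):
    # Assumption: throughput scales linearly with the number of chips.
    return chips * PEAK_GFLOPS_PER_CHIP * ACHIEVED_FRACTION


# Requirement from the question: 256 tasks at 40-50 MFLOPS each.
required_low = 256 * 40 / 1000.0
required_high = 256 * 50 / 1000.0

for chips in (1, CHIPS_IN_T5140):
    s = sustained_gflops(chips)
    print(f"{chips} chip(s): {s:.1f} GFLOPS, "
          f"covers low estimate: {s >= required_low}, "
          f"covers high estimate: {s >= required_high}")
```

On these numbers a single chip falls slightly short of even the low estimate, while the two-chip machine would have headroom, provided the workload really does scale across both chips.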
A: 

I see you'd prefer Intel, but if you need one chip, I will again suggest the Cell processor - its theoretical peak performance is around 25GFlops - kernel 2.6.25 had support for it already.

You could try a pre-slim PlayStation 3 for experimenting with (that would cost you little) or get yourself a server-based solution at around US$8K - you will have to rewrite and fine-tune your threads to take advantage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single Cell (1 PPC core + 8 SPUs).

NB: with a PlayStation 3, you'd have only 6 available co-processors - but you don't seem to be on a budget with this project - so you could at least try IBM's Cell developer kit, which offers an emulator, to see if you can code your solution to run on it.

There are commercially available Cell products, both as stand-alone servers in blade form factor and as PCI Express add-on boards for PC workstations, from Mercury Computer Systems: http://www.mc.com/microsites/cell/products.aspx?id=6986

Mercury does not list any prices on the site, but the pricing seems to be around the previously mentioned US$8000.00 for these PCI Express cards.

A PlayStation 3 console can be purchased for about US$300.00 and would allow you to prototype your application and check whether it is up to the needed performance. (I got one myself and have Fedora 9 running on it, although I did that as a hobbyist and have not, so far, used it for any calculations. I had also put together a 12-machine PlayStation 3 cluster for molecular simulations at the local university. The application they ran did not take advantage of the multimedia SPUs while I was in touch with them. Even so, clocked at 3.5GHz, they performed better than standard, similarly priced PCs, even considering that PS3s are priced 5x higher around here.)

jsbueno
But where can I buy this processor, a proper mainboard, etc.? I will need many of them in the future.
psihodelia
A: 

MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.

There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.

Eric Seppanen
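
The "give it a spin" advice above can be sketched as a tiny timing harness that reports the rate a workload actually achieves rather than a chip's rated peak. This is an illustration only: the helper names are made up here, the workload is a plain multiply-accumulate loop rather than the real application, and numbers measured in an interpreter say nothing about what the compiled C code would reach on the same CPU.

```python
import time


def multiply_accumulate(a, b):
    """Dot product: one multiply and one add per element pair."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def measured_mflops(workload, flop_count, repeats=5):
    """Run the workload several times and report the best achieved MFLOPS."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return flop_count / best / 1e6


n = 100_000
a = [1.0] * n
b = [2.0] * n
rate = measured_mflops(lambda: multiply_accumulate(a, b), flop_count=2 * n)
print(f"achieved: {rate:.1f} MFLOPS (interpreter overhead included)")
```

The same pattern applies to the real application: time the actual filter code, built with the production compiler flags, on each candidate machine, and compare the achieved rates instead of the datasheet figures.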