views:

477

answers:

10

So I did not look at the right place before posting this.

I was looking at the result of the computer language benchmark game:

<http://shootout.alioth.debian.org/u32q/index.php>

And it seems that most of the fastest solutions are still C/C++, using just 
a single core of the 4-core machine that runs the tests.

I was wondering if multi-core is worth it at all for single tasks, or if, when you 
really need more speed, you should just tune up your code or rewrite it in C/C++ instead.

When you click through to the full benchmark results, e.g. http://shootout.alioth.debian.org/u32q/benchmark.php?test=knucleotide&lang=all, it is obvious that quite a few solutions use multiple cores.

It would still be interesting to hear of your personal experiences:

Have you had success using 4 or 8 cores in order to actually improve performance on a single task?

What tools/language did you use?

How big was the improvement?

Was it worth the effort?

+2  A: 

In order to increase performance for a single task on a multicore system, you'd have to design your task to split up into different parts (ala mapreduce) and hand each part to a different core. Lots of programs do things like that, and it does increase performance.

A few compression algorithms currently support more than one processor, such as 7zip. It's not terribly difficult to do something like that, but if your task can't break into cooperative parts, you're not going to get any help from more than one core.
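As a rough illustration of the split-and-combine idea described above (a minimal sketch, not taken from any particular program mentioned here), here is an embarrassingly parallel sum in Java, where each thread handles one slice of the data and the results are combined at the end:

```java
// Minimal sketch: summing an array by handing each core its own slice.
// The thread count and slicing scheme are illustrative assumptions.
public class ParallelSum {
    public static long sum(long[] data, int nThreads) throws InterruptedException {
        long[] partial = new long[nThreads];
        Thread[] workers = new Thread[nThreads];
        int chunk = (data.length + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = Math.min(data.length, lo + chunk);
            final int id = t;
            workers[t] = new Thread(() -> {
                long s = 0;
                for (int i = lo; i < hi; i++) s += data[i];
                partial[id] = s;  // each thread writes its own slot: no locking needed
            });
            workers[t].start();
        }
        long total = 0;
        for (int t = 0; t < nThreads; t++) {
            workers[t].join();    // "reduce" step: combine the per-core results
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data, 4));  // 500000500000
    }
}
```

The key property is that the slices are independent, so the only coordination cost is the final combine.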

Alex Fort
Notice that the Google MapReduce infrastructure (or even the more general map/reduce schema) is just one of many possibilities.
Konrad Rudolph
7zip is limited to how much it can do with multiple cores (there was a discussion about this on Jeff's blog: http://www.codinghorror.com/blog/archives/001231.html ) Apparently, 7zip doesn't split a file into blocks...
R. Bemrose
+3  A: 

And it seems that the fastest solutions are still C/C++ using just a single core of the 4 core machine that runs the tests.

No, that's not true for all entries. Of the codes I've looked at, all use multiple parallel threads, and thus multiple cores. Some (e.g. k-nucleotide) even use techniques like OpenMP (or, also interesting, SSE parallelization) to help with parallelization.

EDIT In fact, the fastest C++ solution for every problem uses parallel threads, with three exceptions:

  1. the fasta benchmark, which is hard (but altogether possible) to parallelize due to its use of a random number generator.
  2. pidigits, which uses the GMP library.
  3. n-body, which could be parallelized.

… and most other solutions also use SSE2 support.

Konrad Rudolph
Uh, why was this downvoted?
Konrad Rudolph
Neat, I missed that k-nucleotide solution.
James Dean
See my update. In fact, almost all solutions use parallel threads (either as pthreads or via OpenMP).
Konrad Rudolph
three exceptions: 3. n-body
igouy
igouy: n-body at least uses SSE parallelization. But true, this could also be parallelized across all four cores rather easily.
Konrad Rudolph
the OP asked about multi-core, not SSE. "parallelized across all four cores rather easily" - it's always easier to say than to do :-) meteor-contest also (but that's a contest and not usually included)
igouy
igouy: yes, I wasn't questioning the truth of your correction. I should have amended my answer.
Konrad Rudolph
+1  A: 

Whether an algorithm scales well across multiple cores really depends on how it works and on the size of the dataset you're processing. Staying on the same core gives you an awful lot of advantages, including taking advantage of processor pipelining, registers, and cache - all of which are super-quick.

As multiple cores become more important in the future, we'll probably see some interesting cross-core optimizations becoming available.

James L
A: 

How do you define a "single task"? Many single conceptual tasks can nevertheless be split up into many independent subtasks. That's where multiple cores may provide a performance boost.

Of course, this requires you to structure your program so that these subtasks can actually be processed independently.

jalf
A: 

I use multicore quite regularly when performing Monte Carlo simulations. In this case it can be an absolute godsend, because sometimes these simulations take forever and each run is independent of every other run. In fact, right now I'm waiting for a Monte Carlo simulation to run on my quad core.

Another use case is when testing a machine learning algorithm using cross-validation. The dataset can be loaded once and stored in an immutable object. Then, each cross-validation iteration can be performed independently. For things like this, the key is to be careful about memory allocations and avoid the implicit lock acquisition this involves. If you allocate and free/garbage collect infrequently enough, the speedup can be near linear in cores used.
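The Monte Carlo pattern described above can be sketched in a few lines of Java. This is a hedged illustration, not the poster's actual code: the example task (estimating pi by random sampling), the task count, and the per-task seeds are all assumptions. The essential point is that each run owns its own RNG and allocates no shared state, so the runs are fully independent:

```java
// Minimal sketch of parallel Monte Carlo: each task is an independent run
// (here, estimating pi by random sampling), so there is no shared mutable state.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.*;

public class MonteCarloPi {
    public static double estimate(int nTasks, int samplesPerTask)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nTasks);
        List<Future<Double>> results = new ArrayList<>();
        for (int t = 0; t < nTasks; t++) {
            final long seed = 42 + t;  // per-task RNG keeps the runs independent
            results.add(pool.submit(() -> {
                Random rng = new Random(seed);
                int inside = 0;
                for (int i = 0; i < samplesPerTask; i++) {
                    double x = rng.nextDouble(), y = rng.nextDouble();
                    if (x * x + y * y <= 1.0) inside++;
                }
                return 4.0 * inside / samplesPerTask;
            }));
        }
        double sum = 0;
        for (Future<Double> f : results) sum += f.get();  // combine independent runs
        pool.shutdown();
        return sum / nTasks;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(estimate(4, 250_000));  // close to 3.14159
    }
}
```

Because the runs never touch shared data, the speedup from adding cores is close to linear, which matches the near-linear scaling mentioned above.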

dsimcha
A: 

I've seen 4x improvement easily gained on a processing pipeline.

Joshua
A: 

There are certain types of tasks which can be trivially multithreaded, and thus, allow a performance increase in systems which have multiple cores.

Image processing is one arena which can benefit from multicore. For example, applying an image filter is a process that is independent of results from other parts of the image. Therefore, as mentioned in Alex Fort's answer, by splitting the problem (in this case, the image filtering) into multiple parts and running the processing in multiple threads, I was able to see a decrease in processing time.

In fact, multithreading increased the performance not only on multicore processors, but also on my Intel Atom N270-based system, which only has a single core but offers two logical cores through simultaneous multithreading (hyper-threading).

I performed a few tests of applying an image filter using multiple threads (by splitting the processing into four threads) and a single thread.

For multithreading, the ExecutorService from the java.util.concurrent package was used to coordinate the multithreaded processing. Implementing this functionality was fairly trivial.

Although these are not exact numbers, nor a perfect benchmark, on a dual-core Core 2 Duo, the processing time for multithreaded code decreased by 30-50% compared to single-threaded code, and on a hyper-threaded Atom, by 20-30%.

As with other problems which involve splitting the problem into parts, the scalability of this method of processing is going to depend on the time spent by the steps where the problem is split up and combined.
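A minimal sketch of the row-splitting approach described above (the filter itself - a simple invert - and the 4-way split are illustrative assumptions, not coobird's actual code):

```java
// Hedged sketch: apply a per-pixel filter by splitting the image's rows
// into bands and filtering each band on its own thread via ExecutorService.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelFilter {
    // Inverts `pixels` in place, dividing the rows among nThreads tasks.
    public static void invert(int[][] pixels, int nThreads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Callable<Void>> tasks = new ArrayList<>();
        int rowsPerTask = (pixels.length + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * rowsPerTask;
            final int hi = Math.min(pixels.length, lo + rowsPerTask);
            tasks.add(() -> {
                for (int r = lo; r < hi; r++)                   // each task owns its rows,
                    for (int c = 0; c < pixels[r].length; c++)  // so no locking is needed
                        pixels[r][c] = 255 - pixels[r][c];
                return null;
            });
        }
        pool.invokeAll(tasks);  // blocks until every band has been filtered
        pool.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        int[][] img = { { 0, 10 }, { 20, 30 }, { 40, 50 }, { 60, 70 } };
        invert(img, 4);
        System.out.println(java.util.Arrays.deepToString(img));
        // [[255, 245], [235, 225], [215, 205], [195, 185]]
    }
}
```

Since each pixel's new value depends only on its old value, the bands never interact, which is exactly what makes this filter trivially parallel.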

coobird
A: 

I wrote a CD label editing program in REALbasic which was cross-platform (hence unable to rely on just GDI+ or Cocoa). It allows layering of multiple masked images with clipping to label shapes.

I switched from the image blit and zooming routines built into the language to using a plugin which was able to use up to 4 cores and achieved a significant speedup of key user operations, especially when zoomed.

This was a nice domain separation for a drop-in solution - I passed in a single image to the binary plugin and it internally partitioned the work across the processors. As a library solution, it required no multi-core awareness on the part of my program.

Andy Dent
A: 

make -j 6

Took 6 minutes out of a 7-minute build. :)

dicroce
+1  A: 

I would like to remind everyone of Amdahl's Law, which describes the diminishing returns from increased parallelism and models how much speedup can be expected for a given algorithm: S = 1 / ((1 - p) + p/n), where p is the fraction of the work that can be parallelized and n is the number of cores.
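A tiny sketch of what Amdahl's Law predicts (the sample values of p and n here are just illustrative):

```java
// Amdahl's Law: speedup on n cores when a fraction p of the work parallelizes.
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 90% of the work parallelized, 4 cores give only ~3.08x,
        // and the limit as n grows without bound is 1 / (1 - p) = 10x.
        System.out.printf("%.2f%n", speedup(0.9, 4));  // 3.08
    }
}
```

The serial fraction, not the core count, quickly becomes the bottleneck, which is why a task that can't be split well sees little benefit from extra cores.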

Paul Nathan