views:

706

answers:

+5 Q:

What language/platform would you recommend for CPU-bound application?

I'm developing non-interactive cpu-bound application which does only computations, almost no IO. Currently it works too long and while I'm working on improving the algorithm, I also think if it can give any benefit to change language or platform. Currently it is C++ (no OOP so it is almost C) on windows compiled with Intel C++ compiler. Can switching to ASM help and how much? Can switching to Linux and GCC help?

+6 A:

Switching to ASM is not going to help much, unless you're very good at it and/or have a specific critical path routine which you know you can do better. As several people have remarked, modern compilers are just better in most cases at taking advantages of caching/etc. than anyone can do by hand.

I'd suggest:

Try a different compiler, and/or different optimization options
Run a code coverage/analysis utility, and figure out where the critical paths are, and work on optimizing those in the code

C++ should be able to give you very near the best possible performance from the code, so I wouldn't recommend switching the language. Depending on the app, you may be able to get better performance on multi code/processor systems using multiple thread, as another suggestion.

Nick 2008-10-11 21:32:12

+4 A:

While just switching to asm won't give any benefits, since the Intel C++ Compiler is likely better at optimizing than you, you can try one of the following options:

Try a compiler that will parallelize your code, like the VectorC compiler.
Try to switch to asm with heavy use of MMX, 3DNow!, SSE or whatever fits your needs (and your CPU). This will give more of a benefit than pure asm.
You can also try GPGPU, i.e. execute large parts of your algorithm on a GPU instead of a CPU. Depending on your algorithm, it can be dramatically faster.

Edit: I also second the profile approach. I recommend AQTime, which supports the Intel C++ compiler.

OregonGhost 2008-10-11 21:34:38

+3 A:

Personally I'd look at languages which allow you to take advantage of parallelism most easily, unless it's a thoroughly non-parallelisable situation. Being able to bolt on some extra cores and get (if possible!) near-linear improvement may well be a lot more cost-effective than squeezing the extra few percent of efficiency out.

When it comes to parallelisation, I believe functional languages are often regarded as the best way to go, or you could look at OpenMP for C/C++. (Personally, as a managed language guy, I'd be looking at libraries for Java/.NET, but I quite understand that not everyone has the same preferences!)

Jon Skeet 2008-10-11 21:35:05

Yes, and specifically, I'd look at Clojure. It's a JVM language built by somebody who seems obsessive about parallelism, so it takes a lot from functional languages (like immutable data structures).

Ken 2009-01-30 18:20:58

+13 A:

It's always algorithm, rarely language. Here's my clue: "while I'm working on improving the algorithm".

Tweaking may not be enough.

Consider radical changes to the algorithm. You've got to eliminate processing, not make the processing go faster. The culprit is often "search" -- looping through data looking for something. Find ways to eliminate search. If you can't eliminate it, replace linear search with some kind of tree search or a hash map of some kind.

S.Lott 2008-10-11 21:37:05

I disagree. It's algorithm and hardware. Even after you ground down the algorithm to to the bare minimum, you may be paying huge CPU overheads if you don't play with the way things are organized (e.g., caching and paging)

Uri 2008-10-12 04:29:02

Yes. It's always algorithm. And some languages can express things others cannot. Some can express things easier than other languages. Some tasks are easier in C++, some are harder.

Aaron 2008-10-12 04:37:34

@Aaron: on a technical level, almost all languages are the same -- they're all Turing Complete. However, some things are more convenient in one language. Algorithms are always independent of language.

S.Lott 2008-10-12 11:29:03

@Uri: Most algorithm issues trade off memory vs. time. If you want less time, you often have to use more memory. Cache is never an issue. Paging problems, however, mean an unwise balance between time and memory. Not a hardware problem. An algorithm problem.

S.Lott 2008-10-12 12:07:44

Completely wrong. It's indeed algorithm plus hardware. Processors have caching, pipeline and branching issues which mean the same algorithm may perform slower than another one on a chip, and yet you may get opposite results on another microprocessor.

Joe Pineda 2008-11-05 00:21:32

Samples: Jeff Atwood once noted gzip runs way slower on Athlons than on Core Duos, even though "GZip is simply Deflate plus a checksum and header/footer". On the Mersenne Prime Search mail-list, it's been noted Prime95's Lucas-Lehmer runs better on Pentium than on AMD due to branch penalty.

Joe Pineda 2008-11-05 00:25:46

@Joe Pineda: Sorry, a bad algorithm will never run well on amy processor. The right algorithm may run better on Intel than Sparc, but the wrong algorithm will be horrible everywhere.

S.Lott 2008-11-05 00:53:48

What I mean is you can have 2 functionaly equivalent *good* algorithms yet one may run faster on a processor than the other, the inverse on other proc. If you're CPU-bound you need to **profile** - algorithm analysis alone is no replacement for this, nor sufficient, never.

Joe Pineda 2008-11-05 01:13:31

An anecdote: I once found a *terrible* query in a stored procedure. Refactored it until the execution plan was considerably shorter. Tested and found it, in production, *slower* than previous version. Why? Turned out SQL S. was running the older in parallel, couldn't do so with the new.

Joe Pineda 2008-11-05 01:17:35

@Joe Pineda: Profiling is good. Tweaking is a waste of time. Discarding a bad algorithm and replacing it with a good algorithm is essential. Hardware isn't irrelevant. It's not as important as fundamental algorithm-dsta structure design.

S.Lott 2008-11-05 01:23:14

For an alternative approach, you could look into Distributed Computing which sounds like it could suit your needs.

QAZ 2008-10-11 22:03:30

+14 A:

Just to be thorough: the first thing to do is to gather profile data and the second thing to do is consider your algorithms. I'm sure you know that, but they've got to be #included into any performance-programming discussion.

To be direct about your question "Can switching to ASM help?" the answer is "If you don't know the answer to that, then probably not." Unless you're very familiar with the CPU architecture and its ins and outs, it's unlikely that you'll do a significantly better job than a good optimizing C/C++ compiler on your code.

The next point to make is that significant speed-ups in your code (aside from algorithmic improvements) will almost certainly come from parallelism, not linear increases. Desktop machines can now throw 4 or 8 cores at a task, which has much more performance potential than a slightly better code generator. Since you're comfortable with C/C++, OpenMP is pretty much a no-brainer; it's very easy to use to parallelize your loops (obviously, you have to watch loop-carried dependencies, but it's definitely "the simplest parallelism that could possibly work").

Having said all that, code generation quality does vary between C/C++ compilers. The Intel C++ compiler is well-regarded for its optimization quality and has full support not just for OpenMP but for other technologies such as the Threading Building Blocks.

Moving into the question of what programming languages might be even better than C++, the answer would be "programming languages that actively promote / facilitate concepts of parallelism and concurrent programming." Erlang is the belle of the ball in that regard, and is a "hot" language right now and most people interested in performance programming are paying at least some attention to it, so if you want to improve your skills in that area, you might want to check it out.

Larry OBrien 2008-10-11 22:23:19

+3 A:

Try Fortran 77 - when it comes to computations still nothing beats the granddaddy of programming languages. Also, try it with OpenMP to take advantage of multiple cores.

Nemanja Trifunovic 2008-10-11 23:04:34

It's hard to believe but it's true. (I'm a hardcore C++ programmer by the way.) The main reason is the tighter aliasing rules in Fortran -- in C/C++, the compiler has no guarantee that a pointer is the only pointer pointing at a given piece of data, which makes certain optimisations impossible.

j_random_hacker 2009-01-29 10:29:06

Sometimes you can find libraries that have optimized implementations of the algorithms you care about. Often times they will have done the multithreading for you.

For example switching from LINPACK to LAPACK got us a 10x speed increase in LU factorization/solve with a good BLAS library.

Greg Rogers 2008-10-12 01:20:47

+1 A:

First, figure out if you can change the algorithm, as S.Lott suggested.

Assuming the algorithm choice is correct, you might look a the memory access patterns, if you have a lot of data you are processing. For a lot of number crunching applications these days, they're bound by the memory bus, not by the ALU(s). I recently optimized some code that was of the form:

// Assume N is a big number
for (int i=0; i<N; i++) {
  myArray[i] = dosomething(i);
}
for (int i=0; i<N; i++) {
  myArray[i] = somethingElse(myArray[i]);
}
...

and converted it to look like:

for (int i=0; i<N; i++) {
  double tmp = dosomething(i);
  tmp = somethingElse(tmp);
  ...
  myArray[i] = tmp;
}
...

In this particular case, this yielded about a 2x speedup.

Mr Fooz 2008-10-12 01:42:43

+2 A:

As lobrien said, you haven't given us any information to tell you if hand-optimized ASM code would help... which means the answer is probably, "not yet."

Have you run your code with a profiler?

Do you know if the code is slow because of memory constraints or processor constraints?

Are you using all your available cores?

Have you identified any algorithms you're using that aren't O(1)? Can you get them to O(1)? If not, why not?

If you've done all that, how much control do you have over the environment your program is running in? (presumably a lot if you're thinking of switching operating systems) Can you disable other processes, give your process highest priority, etc? What about just finding a machine with a faster processor, more cores, or more memory (depending on what you're constrained on)

And on and on.

If you've already done all that and more, it's certainly possible you'll get to a point where you think, "I wonder if these few lines of code right here could be optimized better than the assembly that I'm looking at in the debugger right now?" And at that point you can ask specifically.

Good luck! You're solving a problem that's fun to solve.

Moishe 2008-10-12 02:09:24

If you're sticking with C++ on the intel compiler, take a look at the compiler intrinsics (full reference here). I know that VC++ has similar functionality, and I'm sure you can do the same thing with gcc. These can let you take full advantage of the parallelism built into your CPU. You can use the MMX, SSE and SSE2 instructions to improve performance to a degree. Like others have said, you're probably best looking at the algorithm first.

Eclipse 2008-10-12 04:14:30

I suggest you rethink your algorithm, or maybe even better, your approach. On the other hand maybe what you are trying to calculate just takes a lot of computing time. Have you considered to make it distributed so it can run in a cluster of some sort? If you want to focus on pure code optimization by introducing Assembler for your inner loops then often that can be very beneficial (if you know what you're doing).

Luke 2008-10-12 04:27:16

+3 A:

Hand optimizing your ASM code compared to what C++ can do for you is rarely cost effective.

If you've done anything you can to the algorithm from a traditional algorithmic view, and you've also eliminated excesses, then you may either be SOL, or you can consider optimizing your program from a hardware point of view.

For example, any time you follow a pointer around the heap you are paying a huge cost due to cache misses, possibly paging, etc., which all affect branching predictions. Most programmers (even C gurus) tend to look at the CPU from the functional standpoint rather than what happens behind the scenes. Sometimes reorganizing memory, for example by "flattening" or manually allocating memory to fit on the same page can obtain ENORMOUS speedups. I managed to get 2X speedups on graph traversals just by flattening my structures.

These are not things that your compiler will do for you since they are based on your high-level understanding of the program.

Uri 2008-10-12 04:28:03

For modern processors, learning ASM will take you a long time. Further, with all the different versions of SSE around, your code will end up very processor dependant.

I do quite a lot of CPU-bound work, and have found that the difference between intel's C++ compiler and g++ usually isn't that big (at most 15% or so), and there is no measurable difference between Mac OS X, Windows and Linux.

You are going to have to optimise your code and improve your algorithm by hand. There is no "magic fairy dust" which can make existing code that much faster I'm afraid.

If you haven't yet, and you care about performance, you MUST run your code through a good profiler (personally, I like kcachegrind & valgrind on Linux, or Shark on Mac OS X. I don't know what is good for windows I'm afraid).

Based on my past experience, there is a very good chance you'll find some method is taking 95% of your CPU time, and some simple change or addition of caching will make a massive improvement to your performance. On a similar note, if some method is only taking 1% of your CPU time, no amount of optimising is going to gain you anything.

Chris Jefferson 2008-10-12 15:48:52

The 2 obvious answers to "CPU-bound" are: 1. Use more CPU (core)s 2. Use something else.

Using 2 threads instead of 1 will cut the time spent by up to 50%. In comparision, C++ to ASM rarely gives you 5% (and for novice ASM programmers, it's often -5%!). Some problems scale well, and may benefit from 8 or 16 cores. That kind of hardware is still pretty mainstream, so see if your problems fall in that category.

The other solution is to throw more specialized hardware at the task. This could be the vector unit of your CPU - considering Windows=x86/x64, that's going to be a flavor of SSE. Another kind of vector hardware is the modern GPU. The GPU also has its own memory bus, which is quite speedy.

MSalters 2008-10-13 10:36:32

First get the lead out. Then if it's as fast as it can possibly be without going to ASM, so be it. But thinking you have to go to ASM assumes you know what's making it slow, and I'll bet a donut that you're guessing.

Mike Dunlavey 2008-11-04 23:22:39

If you feel you have optimized your code to a point there is no improvement, increase your CPU's. This can be done on different platforms. One I develop with is Appistry. A few links:

http://www.appistry.com/resource-library/index.html

and you can download the product free from here:

http://www.appistry.com/developers/

I work for Appistry and we have done many installations for tasks that were cpu bound by spreading work out over 10's or 100's of machines.

Hope this helps, -Brett

Brett McCann 2008-12-10 21:00:13

+1 A:

As Oregonghost already hinted - The VectorC compiler might help. It does not really parallelize the code though, instead you can use it to leverage on extended command sets like mmx or sse. I used it for the most time-critical parts in a software rendering engine and it resulted in a speedup of about 150%-200% on most processors.

Adrian Grigore 2009-01-29 10:06:15

Probable small help:

Optimization of 64-bit programs

AMD64 (EM64T) architecture

Debugging and optimization of multi-thread OpenMP-programs

Introduction into the problems of developing parallel programs

Development of Resource-intensive Applications in Visual C++

2009-01-30 18:11:20