views:

659

answers:

13

The way to see how fast your code is going is performance profiling. There are tools for it and such, but I wonder what the factors for code speed actually are.

For instance, I've been told that image-editing software will use bitwise operations instead of integer variables to do its calculations, simply because it's faster.

So that must mean working with integers and other primitive types takes a few more steps to calculate compared to binary.

There must be other factors, but I don't have enough experience with how an OS connects to your hardware, or with the internal workings of many programming languages, to know what they are.

So I'm asking here: do you know what influences the speed of code?

Not necessarily the speed of programs.

+9  A: 

I think you're mistaken. Integers are binary numbers. Image editing software will do anything it can to avoid floating point calculations, because they are incredibly slow compared to integer or bit-shift operations.
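To make the shift/arithmetic equivalence concrete, here is a quick sketch (Python, purely for illustration; a modern compiler performs this substitution automatically, so writing it by hand rarely helps):

```python
# Shifting left by n multiplies by 2**n; shifting right by n
# divides by 2**n (for non-negative integers). This identity is
# what "bit-twiddling" optimizations rely on.
def times8(x):
    return x << 3      # same as x * 8

def half(x):
    return x >> 1      # same as x // 2 for non-negative x

assert times8(13) == 13 * 8 == 104
assert half(26) == 26 // 2 == 13
```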

But normally, first and foremost you optimize by choosing the right algorithm, not by fiddly little things like worrying about whether to post-increment or pre-increment.

For instance: I just spent the last two days speeding up the way we recalculated a particular set of values. I yanked some stuff out of a loop and pre-calculated it, so it was only done M times instead of M x N times, and stored a value in a variable instead of looking it up each time from somewhere else because that value was used in a Comparator so it would be called a lot during the Collections.sort phase. I got the total execution time from around 45 seconds to 20 seconds. And then one of my coworkers who'd been here a lot longer pointed out that I didn't need to recalculate those values, I could pull them out of a different object. And suddenly it executes in 2 seconds. Now that is optimization I can believe in.
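The hoisting described above looks roughly like this (an illustrative Python sketch with made-up names, not the actual code from the answer):

```python
calls = [0]  # counts invocations of the "expensive" step

def expensive(i):
    calls[0] += 1
    return i * i          # stand-in for a costly calculation

def slow(m, n):
    # recomputes expensive(i) inside the inner loop: M x N calls
    return [expensive(i) + j for i in range(m) for j in range(n)]

def fast(m, n):
    # hoists the loop-invariant value out: only M calls
    out = []
    for i in range(m):
        e = expensive(i)
        out.extend(e + j for j in range(n))
    return out

calls[0] = 0
assert slow(3, 4) == fast(3, 4)   # same result...
# ...but slow made 3 * 4 = 12 expensive calls and fast only 3
assert calls[0] == 12 + 3
```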

Paul Tomblin
Floating point ops can be just as fast as integer, depending on context. With SIMD instructions, they may be faster as well. While you're obviously right about preferring macro-optimizations, that doesn't really answer his question which is obviously curiosity about the low-level side of things.
jalf
It's great to hear experience talking. The only nit I would pick is, both you and your coworker found problems by thinking about them. As you know, I suggest letting the problem itself tell you what it is, by interrupting it.
Mike Dunlavey
Mike, the problem is that I knew what was taking the time. I just didn't know enough about the problem domain to know that the data I needed was already pre-calculated and available in another object.
Paul Tomblin
+1  A: 

The speed of code is determined, basically, by the number of operations it has to perform.

The more operations it has to do, the slower the code will be. The operations performed by a piece of code are directly related to the algorithm it uses.

So in the end, it comes down to the algorithm.

Additionally (just to mark the difference between code speed and program speed):

Today's computers have become so fast that the speed of an application is not necessarily related to the speed of the code it runs, but to the amount of data it has to transport to and from other devices.

For instance (and I know you are marking a clear distinction here), the speed of a web app is much more affected by the amount of data sent through the network (from DB to app server, and from app server to client) than by the time it spends processing that data inside any one of the nodes.

OscarRyz
You're right, Oscar. I only have a problem with the word "algorithm", which to most of us means things like hash-coding vs. bubble-sort. In my experience, that's not usually the problem. The problem is galloping generality and over-design in data structure.
Mike Dunlavey
A: 

I think you're thinking of bitwise operations, not "binary numbers".

overslacked
+20  A: 

Integers are binary. Or rather, integers are simply integers, and can be represented in any number base. In base 10 you'd write 13, in base 16 you'd write d (or 0xd), in binary you'd write 1101 (or 0b1101). The Romans would've written XIII. They all represent the same concept of the number 'thirteen'. Among humans, we tend to use the base 10 representation, but when you ask a computer to process integers, it uses the binary representation. And it doesn't make a difference. Thirteen plus forty-five yields the same result no matter how I write it. XIII + 0x2D = 13 + 45 = 0xd + 0b101101. It doesn't matter which representation you use; the result of an arithmetic operation is the same. Which is why we allow the CPU to use the binary representation for all integer processing.
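A quick illustration (Python, purely for demonstration) that the representation doesn't change the arithmetic:

```python
# The same value written in different bases -- arithmetic doesn't care
assert 13 == 0xd == 0b1101
assert 13 + 45 == 0xd + 0b101101 == 58

# Converting between representations is just formatting
assert format(13, "b") == "1101"
assert format(13, "x") == "d"
assert int("1101", 2) == int("d", 16) == 13
```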

Some programming languages also give you a "decimal" datatype, but that's generally related to floating-point arithmetic, where not all values can be represented in all bases (1/3 can be represented easily in base 3, but not in base 2 or 10, for example; 1/10 can be represented in base 10, but not in base 2).

However, it is surprisingly difficult to single out any particular operations as "slow", because it depends. A modern CPU employs a lot of tricks and optimizations to speed up most operations most of the time. So really, what you need to do to get efficient code is avoid all the special cases. And there are a lot of them, and they're generally more to do with the combination (and order) of instructions, than which instructions are used.

Just to give you an idea of the kind of subtleties we're talking about: floating-point arithmetic can execute as fast as (or sometimes faster than) integer arithmetic under ideal circumstances, but the latency is longer, meaning that the ideal performance is harder to achieve. Branches, which are otherwise almost free, become painful because they inhibit instruction reordering and scheduling, both in the compiler and on the fly in the CPU, which makes it harder to hide this latency. Latency defines how long it takes from when an instruction is initiated until the result is ready; most instructions occupy the CPU for only one clock cycle. Even if the result isn't ready the following cycle, the CPU is able to start another instruction then. Which means that if the result is not immediately needed, high-latency instructions are almost free. But if you need to feed the result to the next instruction, that instruction will have to wait until the result is finished.

Certain instructions are just slow no matter what you do, and will typically stall the relevant parts of the CPU until they have completed (square root is a common example, and integer division may be another; on some CPUs, doubles in general suffer from the same problem). On the other hand, while a floating-point square root will block the FP pipeline, it won't prevent you from executing integer instructions simultaneously.

Sometimes, storing values in variables which could instead be recomputed as needed will be faster, because they can be kept in a register, saving a few cycles. Other times it'll be slower, because you run out of registers and the value has to be pushed to the cache, or even to RAM, making recomputation on every use preferable.

The order in which you traverse memory makes a huge difference. Random (scattered) accesses can take hundreds of cycles to complete, but sequential ones are almost instant. Performing your reads/writes in the right pattern allows the CPU to keep the required data in cache almost all the time, and usually "the right pattern" means reading data sequentially, and working on chunks of ~64kb at a time. But sometimes not.

On an x86 CPU, some instructions take up 1 byte, others up to 15. If your code contains a lot of the former, instruction fetch and decoding won't be a bottleneck, but if it's full of the longer instructions, that'll probably limit how many instructions can be loaded by the CPU each cycle, and then the number it'd be able to execute doesn't matter.
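A sketch of the traversal-order point (Python, for illustration only; Python's list-of-lists layout differs from a real array, so the cache effect itself only shows up in compiled code):

```python
# A 2-D grid. In a row-major language (C, C++, C#), row-by-row
# traversal touches memory sequentially and is cache-friendly;
# column-by-column traversal jumps N elements ahead on every
# access. Both orders compute the same result.
N = 100
grid = [[i * N + j for j in range(N)] for i in range(N)]

row_major = sum(grid[i][j] for i in range(N) for j in range(N))
col_major = sum(grid[i][j] for j in range(N) for i in range(N))

# Same answer; in compiled code, very different cache behavior
assert row_major == col_major
```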

There are very few universal rules for performance in a modern CPU.

jalf
+2  A: 

Supposing that code has "speed" is the wrong way to think about the problem, IMO.

Executing any given opcode takes a constant amount of time. The more opcodes a function has, the more time it takes to execute. Knowing how to minimize the number of opcodes executed (and thus, to create the fastest code) is why we learn ASM, study algorithms, etc.
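One way to make "counting operations" concrete without dropping to ASM is Python's dis module, which counts bytecode instructions rather than machine opcodes (a rough proxy only; real opcodes don't all cost the same):

```python
import dis

def sum_loop(n):
    # O(n) executed operations
    total = 0
    for i in range(n):
        total += i
    return total

def sum_formula(n):
    # Closed form: O(1) operations regardless of n
    return n * (n - 1) // 2

assert sum_loop(1000) == sum_formula(1000) == 499500

# The formula compiles to fewer bytecode instructions as well
ops_loop = len(list(dis.get_instructions(sum_loop)))
ops_formula = len(list(dis.get_instructions(sum_formula)))
assert ops_formula < ops_loop
```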

Duke Navarre
Although this is generally correct, it is worth pointing out that at the lowest level, some opcodes take much longer than others. Each type of CPU has a different set of 'expensive' ones.
AShelly
Some take far longer than others. And even one specific opcode can take variable time depending on the arguments, as well as the instructions that came before. And several instructions can be executed in parallel, and the CPU happily reorders them on the fly, so assigning a fixed cost is impossible
jalf
+4  A: 

The "speed of programs" usually comes down to algorithm choice. The wrong algorithm can turn a 2 second task into a 2 minute one or worse. You will see the best performance gains when you concentrate on getting this right.

Once you have a good algorithm, you may still need a more efficient implementation. Attaining this often relies on "speed of code" type choices. There are several things to consider, which tend to be very hardware dependent. Optimizations for one processor may actually slow down the code on others.

Some "code speed" factors:

  • integer vs floating point numbers
  • costs of branching, function calls, context switches
  • CPU instruction pipeline disruptions
  • maintaining locality of memory access, cache coherence
AShelly
A: 

Conventional wisdom says that better algorithms are going to give you faster code, but in reality it's not so simple. Jalf is right, it just depends. For compiled languages, the particular compiler can make a bigger difference than algorithm changes. Also, a faster processor should make things run faster. Usually. Just read Code Complete. It will answer all your questions about optimizing code for speed.

Chris
+3  A: 

The biggest single thing I've run into, on the rare occasions where I've actually cared, is locality. Modern CPUs run very fast, but they have a limited amount of cache memory, which is easily available. When they can't find what they need on cache, they have to read from memory, which is comparatively slow. (When what they're looking for isn't in physical memory, it's really slow.)

Therefore, when it matters, try to keep your code and data compact. Roll up loops as much as possible.

In a multithreaded environment, there are different constraints, and you'll be better off if all your threads are working on data from different cache "lines".

David Thornley
A: 

What influences the speed of code?

Everything.

Usually the biggest factor is something you never even thought of.

Michael Borgwardt
+4  A: 

To expand a little on what Oscar Reyes said, data transport is actually a crucial factor in performance even at lowest levels, so that not just the number but the types of operations being performed by the CPU are crucial to overall performance.

This is particularly true of CISC processors like x86, where instructions themselves may have varying cycle counts, although modern designs mostly mitigate this. However, it is true of all CPUs in the case of memory load and store operations, which can be many, many, many times more expensive than any other operation. With the clock rates of modern CPUs vs. memory latency, you can be looking at perhaps dozens of instructions wasted in the case of a cache miss, to millions if you get a page fault and have to go out to disk.

In many cases, when cache behavior and load and store behavior are taken into account, it may actually be much, much faster to perform a significant number of arithmetic operations to recompute a value instead of caching it and re-reading it from memory, even though the load is one assembly instruction vs. many instructions for performing the computation. In some cases it may actually make sense to consider your performance to be bounded only by loads and stores and treat other operations as "free", although this naturally depends on the circumstances.

Whatever
+4  A: 

Code speed is mostly influenced by low-level characteristics of the computer architecture, both in the CPU itself and in the rest of the system.

There are a lot of factors in code speed, and they're usually low-level concerns which are automatically handled by the compiler, but they can make your code faster if you know what you're doing.

First of all, there is obviously word size. 64-bit machines have a bigger word size (and yes, bigger usually means better here), so that many operations can be carried out faster, for example double-precision operations (where double usually means 2 * 32 bits). A 64-bit architecture also benefits from a bigger data bus, which provides faster data transfer rates.

Second, the pipeline is also important. Instruction execution can be broken into different stages, or phases; for example, instructions are usually divided into:

  • Fetch: the instruction is read from the instruction cache.
  • Decode: the instruction is decoded and interpreted to see what we have to do.
  • Execute: the instruction is executed (usually that means carrying out operations in the ALU).
  • Memory access: if the instruction has to access memory (for example, to load a register value from the data cache), that happens here.
  • Writeback: the values are written back to the destination register.

Now, the pipeline allows the processor to divide instructions into those phases and perform them simultaneously, so that while it's executing one instruction, it is also decoding the next one and fetching the one after that.

Some instructions have dependencies. If I'm adding two registers together, the execute phase of the add instruction will need both values before they're actually recovered from memory. By knowing the pipeline structure, the compiler can reorder the assembly instructions to provide enough "distance" between the loads and the add, so that the CPU doesn't need to wait.

Another CPU optimization is superscalar execution, which makes use of multiple ALUs (for example) so that two add instructions can be performed simultaneously. Again, by knowing the architecture exactly, you can optimize the ordering of instructions to take advantage of this. For example, if the compiler detects that no dependencies exist in the code, it can rearrange loads and arithmetic so that the arithmetic is delayed to a later point where all the data is available, and then perform 4 operations at the same time.

This is mostly used by compilers though.

What can be of use when designing your application, and can really improve code speed, is knowing the cache policies and organization. The most typical example is incorrectly ordered access to a two-dimensional byte array in a loop:

// Make two arrays; in memory each is represented as 1,000,000 contiguous bytes
byte[,] array1 = new byte[1000, 1000];
byte[,] array2 = new byte[1000, 1000];

// Add the array items; note the first index varies fastest,
// so consecutive accesses are 1000 bytes apart
for (int j = 0; j < 1000; j++)
  for (int i = 0; i < 1000; i++)
     array1[i, j] = (byte)(array1[i, j] + array2[i, j]);

Let's see what's happening here.

array1[0,0] is brought into the cache. Since the cache works in blocks, you get the first 1000 bytes into the cache, so that it holds array1[0,0] to array1[0,999].

array2[0,0] is brought into the cache. Again in blocks, so you have array2[0,0] to array2[0,999].

In the next step we access array1[1,0], which is not in the cache, and neither is array2[1,0], so we bring them from memory into the cache. Now, if we suppose we have a very small cache, this will cause array2[0,0...999] to be evicted from the cache... and so on. So when we later access array2[0,1], it will no longer be in the cache. The cache won't be useful for array2 or array1.

If we reorder the memory accesses:

for (int i = 0; i < 1000; i++)
  for (int j = 0; j < 1000; j++)
     array1[i, j] = (byte)(array1[i, j] + array2[i, j]);

Then no cache block has to be evicted before it has been fully used, and the program will run considerably faster.

These are all naive, academic examples; if you really want or need to learn computer architecture, you need a very deep knowledge of the specifics of each architecture, but again, that will mostly be useful when programming compilers. Nonetheless, a basic knowledge of the cache and of low-level CPU behavior can help you improve your speed.

For example, such knowledge can be of extreme value in cryptographic programming, where you have to handle very big numbers (as in 1024 bits), so that the correct representation can improve the underlying math that needs to be carried out...
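As a sketch of what "the correct representation" might mean: a big number stored as base-2^32 "limbs" with explicit carry propagation, roughly what a bignum library does internally in C (illustrative Python only; Python's own ints are already arbitrary precision):

```python
BASE = 2 ** 32  # each limb is one 32-bit machine word

def to_limbs(n):
    # Least-significant limb first
    limbs = []
    while n:
        limbs.append(n % BASE)
        n //= BASE
    return limbs or [0]

def add_limbs(a, b):
    # Schoolbook addition with carry, one word at a time
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s % BASE)
        carry = s // BASE
    if carry:
        out.append(carry)
    return out

def from_limbs(limbs):
    return sum(l * BASE ** i for i, l in enumerate(limbs))

x, y = 2 ** 1024 - 1, 12345678901234567890
assert from_limbs(add_limbs(to_limbs(x), to_limbs(y))) == x + y
```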

Jorge Córdoba
A: 

On 2009 hardware, as in real estate, there are just three things to keep in mind when thinking about the speed of code: memory, memory, memory. (I include cache and locality.)

Reduce allocations. Reduce writes. Reduce reads.
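A miniature of "reduce allocations" (illustrative Python; the function names are made up):

```python
def normalize_fresh(values):
    # Allocates a brand-new list on every call
    return [v / 255.0 for v in values]

def normalize_into(values, out):
    # Reuses a caller-supplied buffer instead of allocating
    for i, v in enumerate(values):
        out[i] = v / 255.0
    return out

data = [0, 51, 255]
buf = [0.0] * len(data)   # allocated once, reused across calls
assert normalize_into(data, buf) == normalize_fresh(data)
```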

After that you're usually in the noise, and it's harder to state categorical results. (For example, it used to be true that computations done in a floating-point unit were almost always slower than similar computations done in an integer unit. That's not true any more.)

If you can keep multiple integer units and a floating-point unit busy at once, that's a plus.

Norman Ramsey
A: 

Yes, profiling is a way to tell what the speed is.

And yes, various things can cause slow execution.

However, the fundamental approach to performance problems is to assume that you don't know, and can't even guess, what the problem is, even if you have a bag of speedup tricks.

Just today I was working on an application, intending to make it much faster by using closed-form gradient computations in an optimization problem. Guess what? That wasn't what the problem was, it was something else.

If X, Y, or Z is causing time-wastage, it will be on the call stack while it is doing so, where you can easily see it. If it is not typically on the call stack, then it probably is not the cause. Look there.

Mike Dunlavey