I was benchmarking code performance on Windows Mobile devices and noticed that some algorithms performed significantly better on some hosts and significantly worse on others, even after accounting for differences in clock speed.
For reference, here are the timings (all results were generated from the same binary, compiled with Visual Studio 2005 targeting ARMv4):
Intel XScale PXA270
- Algorithm A: 22642 ms
- Algorithm B: 29271 ms
ARM1136EJ-S core (embedded in a MSM7201A chip)
- Algorithm A: 24874 ms
- Algorithm B: 29504 ms
ARM926EJ-S core (embedded in an OMAP 850 chip)
- Algorithm A: 70215 ms
- Algorithm B: 31652 ms (!)
I looked into floating point as a possible cause: while algorithm B does use floating-point code, it does not use it inside the inner loop, and none of the cores appear to have an FPU.
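One way to sanity-check whether software floating-point emulation could even account for gaps of this size would be a small microbenchmark run on each device. This is only a sketch, not part of the original test code; the iteration count and the operations in the loops are arbitrary placeholders, and it assumes the same toolchain (VS2005, ARMv4) as the real binary:

```c
/* Hypothetical microbenchmark (not from the original post): compare the cost
 * of emulated floating-point work against native integer work on each device.
 * If the float/int ratio is similar on all three cores, FP emulation is
 * unlikely to explain the difference between them. */
#include <windows.h>
#include <stdio.h>

#define ITERATIONS 10000000

int main(void)
{
    volatile double fd = 1.000001;  /* volatile so the loops are not optimized away */
    volatile int    fi = 3;
    DWORD start, fpTime, intTime;
    int i;

    start = GetTickCount();
    for (i = 0; i < ITERATIONS; i++)
        fd = fd * 1.000001 + 0.5;   /* emulated in software on FPU-less cores */
    fpTime = GetTickCount() - start;

    start = GetTickCount();
    for (i = 0; i < ITERATIONS; i++)
        fi = fi * 3 + 1;            /* native integer work for comparison */
    intTime = GetTickCount() - start;

    printf("float: %lu ms, int: %lu ms\n",
           (unsigned long)fpTime, (unsigned long)intTime);
    return 0;
}
```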
So my question is: what mechanism could be causing this difference? Suggestions on how to fix or avoid the bottleneck in question would also be appreciated.
Thanks in advance.