I was benchmarking code performance on Windows Mobile devices and noticed that some algorithms performed significantly better on some hosts and significantly worse on others, even after accounting for differences in clock speed.
For reference, here are the timings (all results were generated from the same binary, compiled with Visual Studio 2005 targeting ARMv4):
Intel XScale PXA270
- Algorithm A: 22642 ms
- Algorithm B: 29271 ms
ARM1136EJ-S core (embedded in a MSM7201A chip)
- Algorithm A: 24874 ms
- Algorithm B: 29504 ms
ARM926EJ-S core (embedded in an OMAP 850 chip)
- Algorithm A: 70215 ms
- Algorithm B: 31652 ms (!)
I looked into floating point as a possible cause: while algorithm B does use floating-point code, it does not use it inside the inner loop, and none of the cores appear to have an FPU.
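One way to sanity-check whether software floating-point emulation could even account for gaps of this size would be a small microbenchmark run on each device. This is only a sketch, not part of the original test code; the iteration count and the operations in the loops are arbitrary placeholders, and it assumes the same toolchain (VS2005, ARMv4) as the real binary:

```c
/* Hypothetical microbenchmark (not from the original post): compare the cost
 * of emulated floating-point work against native integer work on each device.
 * If the float/int ratio is similar on all three cores, FP emulation is
 * unlikely to explain the difference between them. */
#include <windows.h>
#include <stdio.h>

#define ITERATIONS 10000000

int main(void)
{
    volatile double fd = 1.000001;  /* volatile so the loops are not optimized away */
    volatile int    fi = 3;
    DWORD start, fpTime, intTime;
    int i;

    start = GetTickCount();
    for (i = 0; i < ITERATIONS; i++)
        fd = fd * 1.000001 + 0.5;   /* emulated in software on FPU-less cores */
    fpTime = GetTickCount() - start;

    start = GetTickCount();
    for (i = 0; i < ITERATIONS; i++)
        fi = fi * 3 + 1;            /* native integer work for comparison */
    intTime = GetTickCount() - start;

    printf("float: %lu ms, int: %lu ms\n",
           (unsigned long)fpTime, (unsigned long)intTime);
    return 0;
}
```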
So my question is: what mechanism could be causing this difference? Suggestions on how to fix or avoid the bottleneck in question would also be appreciated.
Thanks in advance.