Hi guys,
I want to convert this code into assembly, working on a Mac. How do I do this?
while (a--)
{
    *pDest++ += *pSrc++;
}
The actual assembler instructions will differ, but here's pseudocode that can be translated into assembler pretty easily.
Note that the *4 is because I'm assuming you're transferring ints. It will vary depending on the size of the data being transferred.
incrementor = 0 ;really easy
top:
jump to bottom if a equals 0 ;jump-if-zero is the Intel instruction here
memoryDest[incrementor*4] += memorySrc[incrementor*4] ;note +=, not =, to match the original loop. This will be a bit messy, you'll probably need some temp registers
incrementor += 1 ;dead easy
a -= 1 ;the original loop decrements a each pass
jump to top ;goto. PLEASE DON'T CITE 'CONSIDERED HARMFUL', THIS IS ASM!!!!11ONEONE
bottom:
You say that you're developing for iPhone and are trying to improve speed. Note that your loop accumulates (*pDest++ += *pSrc++) rather than copies; if a plain copy is all you actually need, just use memcpy(dest, src, size) instead of hand-rolled assembly.
It's an Intel Mac, and the app runs on the iPhone. I'm working on a program that uses this code in a thread; the thread is constantly doing this kind of work, and sometimes it gets stuck, so I'm wondering whether the calculation is too heavy for the iPhone.
No, your problem has nothing to do with this code. Let the compiler do its job and optimize this. Your problem is elsewhere. It sounds like you have a race condition or deadlock between threads somehow. I can't psychically debug your problem without more information, but I can say for sure you're barking up the wrong tree.
Assuming that the arrays in question are of reasonable length and depending on what the types of pDest and pSrc are, you may be able to get a reasonable speedup on this by using the NEON instructions on ARMv7 (iPhone 3GS and the new iPod touch), and by using SSE on Intel.
The specific code, and how much of a speedup you can get, will depend on the type of data in the source and destination arrays, what alignment guarantees you have on the array addresses, and what the distribution of typical lengths in the arrays is like.
As always, none of this is worth doing unless you have a Shark trace showing that this loop is an appreciable portion of your execution time. If you're doing application-level performance tuning on the Mac or iPhone and you aren't using Shark or Instruments, you're doing it wrong.
If the arrays are floating-point, you can get well-tuned vector code on the Intel Mac by including Accelerate.framework and using the vDSP_vadd() function. No assembly coding necessary.
If you have access to the 2008 WWDC talks, Eric Postpischil gave a nice talk on basic vectorization techniques in which he walked through writing vector code to handle exactly this loop (in the case where pSrc and pDest are single-precision arrays) on Intel, though for simplicity he used C with vector intrinsics instead of ASM.
A few stackshots will show if this is actually where you're spending time.
If it is, unrolling the loop could help, as in:
while (a >= 8) {
    pDest[0] += pSrc[0];
    pDest[1] += pSrc[1];
    pDest[2] += pSrc[2];
    pDest[3] += pSrc[3];
    pDest[4] += pSrc[4];
    pDest[5] += pSrc[5];
    pDest[6] += pSrc[6];
    pDest[7] += pSrc[7];
    pDest += 8;
    pSrc += 8;
    a -= 8;
}
// followed by your loop
You could code it in assembler, but it probably would not be much better.
So this is for an ARM (iPhone)? What is the element size behind these pointers (bytes, halfwords, words)? Are you having alignment problems (copying words on a non-word boundary)? If these are bytes, then yes, the generated code is likely painfully slow; the optimizer can't do much with it. Where does that leave you? You get what you get.
Here is an example:
mov ip, #0          @ ip = index; r0 = pSrc, r1 = pDest, r4 = a
.L3:
ldrb r3, [r0, ip]   @ zero_extendqisi2
ldrb r2, [r1, ip]   @ zero_extendqisi2
add r3, r3, r2
strb r3, [r1, ip]
add ip, ip, #1
cmp ip, r4
bne .L3
Because your code had the pointers counting up, the compiler added an instruction that it didn't need.
sub ip, rx, #1
.L3:
ldrb r3, [r0, ip]   @ zero_extendqisi2
ldrb r2, [r1, ip]   @ zero_extendqisi2
add r3, r3, r2
strb r3, [r1, ip]
subs ip, ip, #1
bne .L3
Since the carry bit is not used, I wonder if there is a way to load a word and do word-based adds, one word at a time.
load 0xnnmmoopp, load 0xqqrrsstt
mask one of them to guarantee no carry problems:
0xnnmmoopp -> 0x00mm00pp
add:
0xgghhiikk = 0x00mm00pp + 0xqqrrsstt
then store hh and kk as bytes
then you have to go back to the original, mask off the mm and pp bytes instead, re-do the add, and store the gg and ii bytes.
The two word reads should be significantly faster than four byte reads, if you keep all of the above in registers and do a word store instead of four byte stores that will save quite a bit more time.
You will have to save a lot of registers to the stack, so it will cost you there; you don't want to do this for small values of a (less than 10, let's say).
Anyway, something to think about. Just the removal of the one line of code in the asm above should be noticeable for long runs.
EDIT:
Actually that modification I did to the compiler output was broken. This is more like it:
mov ip, ra
.L3:
subs ip, ip, #1
ldrb r3, [r0, ip]
ldrb r2, [r1, ip]
add r3, r3, r2
strb r3, [r1, ip]
bne .L3