Hi!

In this document: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301g/DDI0301G_arm1176jzfs_r0p7_trm.pdf

on page 21-25 (PDF page 875), the throughput and latency timings are given for the assembly instructions of the VFP unit.

Are those numbers independent of vector size?

1: Take FMULS, which has a throughput of 1 and a latency of 8. Does that mean I can start a new FMULS operation every cycle, as long as it doesn't use a register that is still being computed by a previous instruction? For example:

FMULS s8, s16, s20
FMULS s12, s21, s25

Will those execute right after each other?

2: What happens if I have two FMULS instructions in a row, where one argument of the second depends on the result of the first?

FMULS s8, s16, s20
FMULS s12, s21, s8

Will the VFP wait for 8 cycles before starting to process the second instruction?

3: What if we are in vector mode with 4 elements, and all input registers but one are available for the second FMULS instruction? What will happen?

4: Sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?

Thanks!

+2  A: 

Your questions are all answered in the document that you linked. You should read it carefully.

Are those numbers independent of vector size?

No. See, for example, Table 21-15 in the document you linked. Note the latency of the short vector FADDS.

does it mean that I can start a new FMULS operation every cycle if it doesn't depend on an earlier result that isn't available yet?

Yes, that's the definition of throughput.

What happens if I have two FMULS instructions in a row, where one argument of the second depends on the result of the first?

Execution will stall until the result of the first FMULS is available. See 21.6 "Operation of the scoreboards" for more detail.
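
As a rough sketch of what this means for scheduling, assuming scalar mode (FPSCR LEN = 1), the pre-UAL mnemonics used in the question, and GNU assembler `@` comments. The registers s9, s10, s17, s18, s22 and s23 are arbitrary choices for illustration, and this is a fragment rather than a complete routine:

```asm
@ Back-to-back dependent multiplies: the second FMULS reads s8,
@ so it cannot issue until the first result is available.
        FMULS   s8,  s16, s20
        FMULS   s12, s21, s8        @ stalls waiting for s8

@ The same two multiplies with independent work placed between them.
@ The extra FMULS instructions neither read nor write s8 or s12, so
@ they can issue while the first result is still in flight.
        FMULS   s8,  s16, s20
        FMULS   s9,  s17, s22       @ independent of s8
        FMULS   s10, s18, s23       @ independent of s8
        FMULS   s12, s21, s8        @ still waits for s8, but for fewer cycles
```

With the latency of 8 and throughput of 1 quoted in the question, the stall should only disappear entirely if there is enough independent work to cover the whole latency.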

What if we are in vector mode with 4 elements, and all input registers but one are available for the second FMULS instruction? What will happen?

It will stall. Same reference.
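
To make the short-vector case concrete, here is a hedged sketch that sets the vector length to 4 and issues two dependent vector multiplies. It assumes the FPSCR layout with LEN in bits [18:16] (encoded as length minus one) and STRIDE in bits [21:20], that r0 and r1 are free to use as scratch, and that the code was in scalar mode beforehand. The register choices are illustrative only:

```asm
        FMRX    r0, FPSCR
        BIC     r0, r0, #0x00370000     @ clear LEN [18:16] and STRIDE [21:20]
        ORR     r1, r0, #0x00030000     @ LEN field = 3, i.e. vector length 4
        FMXR    FPSCR, r1

        FMULS   s8,  s16, s24           @ s8..s11  = s16..s19 * s24..s27
        FMULS   s12, s20, s8            @ s12..s15 = s20..s23 * s8..s11,
                                        @ so it needs every result of the
                                        @ first FMULS and has to wait for them

        FMXR    FPSCR, r0               @ back to scalar mode (LEN = 1, stride 1)
```

The destination register selects the mode here: with LEN greater than 1, an operation whose destination lies in s8 to s31 is treated as a short vector, while a destination in s0 to s7 stays scalar.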

sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?

No. See section 21.10 "Parallel Execution". An example is given in Table 21-15, in which a non-dependent FADDS executes immediately following FDIVS.
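
A minimal sketch of that pattern, again assuming scalar mode and arbitrary register choices (this is not the exact sequence from Table 21-15):

```asm
        FDIVS   s8,  s16, s20       @ long-latency divide, uses the DS pipeline
        FADDS   s9,  s17, s21       @ neither reads nor writes s8, so it does
                                    @ not have to wait for the divide
        FMULS   s10, s8,  s22       @ reads s8, so it stalls until the
                                    @ divide result is available
```

The practical consequence is to place as much independent work as possible between a divide or square root and the first instruction that consumes its result.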

Note that it can be a bit of a challenge (though not impossible) to write short-vector VFP code that performs substantially faster than scalar code for many types of computation. Even if you learn how to do it, it will be of questionable value since the NEON unit seems to be the new model for vector computation on ARM. You may be better served in the long run by ignoring the short-vector operation for now and focusing on learning NEON for the future.

Stephen Canon
Thanks a lot for that info! Since I code for the iPhone and want to get some code running fast on the iPhone 3G, I need to use the VFP, because the 3G doesn't have NEON. Yes, I've read this example, but I don't really understand why it is possible in this case. Page 804 suggests avoiding DIV and SQRT because they stall both the DS and the FMAC pipelines. What exactly does "If the short vector DS operation can be separated..." (on that page) mean?
genesys