
Hey everyone,

I came from this thread: http://stackoverflow.com/questions/1536867/flops-intel-core-and-testing-it-with-c-innerproduct

As I began writing simple test scripts, a few questions came into my mind.

  1. Why floating point? What is so significant about floating point that we have to consider it specifically? Why not a simple int?

  2. If I want to measure FLOPS, let's say I am doing the inner product of two vectors. Must the two vectors be float[]? How will the measurement be different if I use int[]?

  3. I am not familiar with Intel architectures. Let's say I have the following operations:

    float a = 3.14159; float b = 3.14158;
    for (int i = 0; i < 100; ++i) {
        a + b;
    }

    How many "floating point operations" is this?

  4. I am a bit confused because I studied a simplified 32-bit MIPS architecture, where every instruction is 32 bits, with fields like 5 bits for operand 1, 5 bits for operand 2, and so on. For Intel architectures (specifically the same architecture from the previous thread), I was told that a register can hold 128 bits. With single-precision floating point at 32 bits per number, does that mean each instruction fed to the processor can take 4 floating point numbers? Don't we also have to account for the bits used by the operands and other parts of the instruction? How can we just feed 4 floating point numbers to a CPU without any specific meaning attached to them?

I don't know whether my approach of thinking about everything in bits and pieces makes sense. If not, what "height" of perspective should I be looking at?

+1  A: 

Floating Point Operations per Second.

http://www.webopedia.com/TERM/F/FLOPS.html

Your example is 100 floating point operations (adding the two floating point numbers together is one floating point operation). Allocating floating point numbers may or may not count.

The term is apparently not an exact measurement, as it is clear that a double-precision floating-point operation is going to take longer than a single-precision one, and multiplication and division are going to take longer than addition and subtraction. As the Wikipedia article attests, there are ultimately better ways to measure performance.
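
Not part of the original point, just a minimal sketch: one crude way to arrive at a FLOPS figure is to time a known number of floating point additions and divide by the elapsed time. The volatile qualifiers are only there to discourage the compiler from optimizing the loop away; a serious benchmark needs far more care.

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        volatile float a = 3.14159f, b = 3.14158f, sum = 0.0f;
        const long n = 100000000L;        /* 100 million additions */

        clock_t start = clock();
        for (long i = 0; i < n; ++i)
            sum = a + b;                  /* one floating point add per iteration */
        clock_t end = clock();

        double seconds = (double)(end - start) / CLOCKS_PER_SEC;
        printf("%ld adds in %.3f s = %.3g FLOPS\n", n, seconds, n / seconds);
        return 0;
    }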

Robert Harvey
If the CPU has a floating point unit that natively uses double-precision format, doing calculations on doubles will be quicker than on singles, since no format conversion will need to be done. But as you remark, not all floating point operations are equally fast. In terms of the number of CPU clock cycles, typically (addition == subtraction) < multiplication < division. Some FPUs even have various transcendental functions like sin, cos, e^x, ln, etc., and those are WAY slower than division. You actually have to look at the CPU documentation to get the details.
Bob Murphy
A: 

1) Because many real-world applications spend their time crunching a lot of floating point numbers; for example, all vector-based apps (games, CAD, etc.) rely almost entirely on floating point operations.

2) FLOPS specifically counts floating point operations, so the vectors should be float.

3) 100. The flow control uses integer operations.

4) That fixed instruction layout is best suited to ALU (integer) operations. Floating point representations can use 96-128 bits.

Rodrigo
A: 

Floating point operations are the limiting factor in certain computing problems. If your problem isn't one of them, you can safely ignore flops ratings.

The Intel architecture started out with simple 80-bit floating point instructions (the x87 stack), which can load from or store to 64-bit memory locations with rounding. Later the SSE instructions were added, which use 128-bit registers and can do multiple floating point operations with a single instruction.
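
For a feel of what "multiple floating point operations with a single instruction" looks like from C, here is a minimal sketch using the SSE intrinsics from <xmmintrin.h> (the intrinsic maps more or less directly onto the ADDPS instruction):

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        /* Two packed vectors of four single-precision floats each. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

        /* One ADDPS instruction performs four floating point additions at once. */
        __m128 c = _mm_add_ps(a, b);

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }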

Mark Ransom
A: 
  1. Floating point speed mattered a lot for scientific computing and computer graphics.
  2. By definition, no. You're testing integer performance at that point.
  3. 302, see below.
  4. x86 and x64 are very different from MIPS. MIPS, being a RISC (reduced instruction set computer) architecture, has very few instructions in comparison to the CISC (complex instruction set computer) architectures of Intel's and AMD's offerings. As for instruction decoding, x86 uses variable-width instructions, so an instruction can be anywhere from one to 16 bytes in length (including prefixes, it might be even larger).

The 128 bit thing is about the internal representation of floats in the processor. It uses really big floats internally to try to avoid rounding errors, and then truncates them when you put the numbers back into memory.

fld  A      //st=[A]
fld  B      //st=[B, A]
Loop:
fld st(1)   //st=[A, B, A]
fadd st(1)  //st=[A + B, B, A]
fstp memory //st=[B, A]
Patrick
Most compilers would probably only push a and b into registers once. That leaves the addition operation, which stores its result into another register, so the exact total is probably 102 FLOPs. Then again, compilers might optimize away this entire loop and just leave you with 2 floating-point stores.
Igor
Based on what I know about the floating point stack of x86, I don't think that's correct. I've amended my answer with a possible rendering of what it could be. But we both know that any compiler worth its salt would remove the whole set of statements for lacking any side effects! :)
Patrick
A: 

Yuck, simplified MIPS. Typically, that's fine for intro courses. I'm going to assume a Hennessy/Patterson book?

Read up on the MMX instructions for the Pentium architecture (586) for the Intel approach. Or, more generally, study SIMD architectures, which are also known as vector processor architectures. They were first popularized by the Cray supercomputers (although I think there were a few forerunners). For a modern SIMD approach, see the CUDA approach produced by NVIDIA or the different DSP processors on the market.

Paul Nathan
+2  A: 

1.) Floating point operations simply represent a wider range of math than fixed-width integers. Additionally, heavily numerical or scientific applications (which are typically the ones that actually test a CPU's pure computational power) probably rely on floating point ops more than anything.

2.) They would both have to be float. The CPU won't add an integer and a float; one or the other would be implicitly converted (most likely the integer would be converted to a float), so it would still just be floating point operations.

3.) That would be 100 floating point operations, as well as 100 integer operations, as well as some (100?) control-flow/branch/comparison operations. There'd generally also be loads and stores but you don't seem to be storing the value :)

4.) I'm not sure how to begin with this one, you seem to have a general perspective on the material, but you have confused some of the details. Yes an individual instruction may be partitioned into sections similar to:

|OP CODE | Operand 1 | Operand 2 | (among many, many others)

However, operand 1 and operand 2 don't have to contain the actual values to be added. They could just contain the registers to be added. For example take this SSE instruction:

mulps      %%xmm3, %%xmm1

It's telling the execution unit to multiply the contents of register xmm3 by the contents of register xmm1 and store the result in the destination register, which in this AT&T-syntax form is xmm1 (the last operand). Since the registers hold 128-bit values, the operation is done on 128-bit values, and this is independent of the size of the instruction. Unfortunately x86 does not have a tidy instruction breakdown like MIPS, because it is a CISC architecture: an x86 instruction can be anywhere between 1 and 16(!) bytes. A small inline-asm sketch of driving this from C follows.
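
The %% prefixes above are GCC inline-assembly (AT&T) notation. Purely as a hypothetical sketch (the array names are made up, and this assumes GCC on an SSE-capable x86 target), that instruction might be driven from C like this:

    #include <stdio.h>

    int main(void)
    {
        float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        float b[4] = { 10.0f, 10.0f, 10.0f, 10.0f };

        /* Load both vectors into XMM registers, do one packed multiply
           (four single-precision multiplications), and store the result
           back into a[].  In AT&T syntax the destination is the last
           operand, so the product ends up in xmm1. */
        __asm__ volatile (
            "movups (%0), %%xmm1\n\t"
            "movups (%1), %%xmm3\n\t"
            "mulps  %%xmm3, %%xmm1\n\t"
            "movups %%xmm1, (%0)\n\t"
            :
            : "r"(a), "r"(b)
            : "xmm1", "xmm3", "memory");

        printf("%g %g %g %g\n", a[0], a[1], a[2], a[3]);
        return 0;
    }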

As for your question, I think this is all very fun stuff to know, and it helps you build intuition about the speed of math-intensive programs, as well as giving you a sense of upper limits to be achieved when optimizing. I'd never try and directly correlate this to the actual run time of a program though, as too many other factors contribute to the actual end performance.

Falaina
OK, let's say theoretically you can feed in 16 bytes per instruction. Then 16 bytes is just enough for 4 floating point numbers. Regardless of how many floating point numbers the instruction holds, it's still counted as 1 floating point operation, right? If I have an instruction that contains 3 floating point numbers, is that still 1 floating point operation?
confused
A: 
  1. There are lots of things floating point math does far better than integer math. Most university computer science curricula have a course on it called "numerical analysis".

  2. The vector elements must be float, double, or long double. The inner product calculation will be slower than if the elements were ints. (A short sketch of such an inner product follows after this list.)

  3. That would be 100 floating point adds. (That is, unless the compiler realized nothing is ever done with the result and optimizes the whole thing away.)

  4. Computers use a variety of internal formats to represent floating point numbers. In the example you mention, the CPU would convert the 32-bit float into its internal 128-bit format before doing operations on the number.
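
Here is a minimal illustrative sketch of the float inner product mentioned in #2 (the function name is made up); each iteration is one multiply and one add, so a call performs 2*n floating point operations:

    #include <stddef.h>

    /* Inner product of two float vectors: 2*n floating point operations. */
    float dot(const float *x, const float *y, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            sum += x[i] * y[i];   /* one multiply + one add per element */
        return sum;
    }

    /* Swapping float for int here turns it into an integer benchmark instead. */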

In addition to uses other answers have mentioned, people called "quants" use floating point math for finance these days. A guy named David E. Shaw started applying floating point math to modeling Wall Street in 1988, and as of Sept. 30, 2009, is worth $2.5 billion and ranks #123 on the Forbes list of the 400 richest Americans.

So it's worth learning a bit about floating point math!

Bob Murphy
+2  A: 
  1. Floating point and integer operations use different pipelines on the chip, so they run at different speeds (on simple or old enough architectures there may be no native floating point support at all, making floating point operations very slow). So if you are trying to estimate real-world performance for problems that use floating point math, you need to know how fast these operations are.

  2. Yes, you must use floating point data. See #1.

  3. A FLOP is typically defined as an average over a particular mixture of operations that is intended to be representative of the real world problem you want to model. For your loop, you would just count each addition as 1 operation making a total of 100 operations. BUT: this is not representative of most real world jobs and you may have to take steps to prevent the compiler from optimizing all the work out.

  4. Vector or SIMD (Single Instruction Multiple Data) hardware can do exactly that. Examples of SIMD systems in use right now include AltiVec (on PowerPC series chips) and MMX/SSE/... on Intel x86 and compatibles. Such improvements in chips should get credit for doing more work, so your trivial loop above would still be counted as 100 operations even if there are only 25 fetch-and-work cycles. Compilers either need to be very smart, or receive hints from the programmer, to make use of SIMD units (but most front-line compilers are very smart these days); a small example of a vectorizable loop follows below.
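
As a rough sketch (the function and names are just illustrative), this is the kind of loop a compiler can turn into packed SSE operations, with something like gcc -O3 (which enables the tree vectorizer), plus -msse or similar if the target does not already assume SSE:

    /* A loop the vectorizer likes: independent element-wise work,
       no aliasing thanks to restrict, and a simple trip count. */
    void add_arrays(float *restrict dst,
                    const float *restrict a,
                    const float *restrict b,
                    int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = a[i] + b[i];   /* may become one ADDPS per 4 elements */
    }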

dmckee
+1 for noting that an optimizer might optimize that loop out.
Jonathan Leffler
A: 

1) Floating point is important because sometimes we want to represent really big or really small numbers, and integers aren't really so good at that. Read up on the IEEE-754 standard, but briefly: the mantissa is like the integer portion, and we trade some bits to work as an exponent, which allows a much wider range of numbers to be represented.

2) If the two vectors are ints, you won't measure FLOPS. If one vector is int and the other is float, you'll be doing lots of int->float conversions, and we should probably consider such a conversion to be a FLOP.

3/4) Floating point operations on Intel architectures are really quite exotic. It's actually a stack-based, single-operand instruction set (usually). For instance, in your example, you would use one instruction whose opcode loads a memory operand onto the top of the FPU stack, then another whose opcode adds a memory operand to the top of the FPU stack, and finally another whose opcode pops the top of the FPU stack to a memory operand.
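
As a hypothetical illustration of that three-instruction pattern (a, b, and c are just made-up memory operands, written in the same style as the listing in an earlier answer):

    fld   a     // push a onto the FPU stack        st=[a]
    fadd  b     // st(0) = st(0) + b                st=[a+b]
    fstp  c     // pop st(0) into memory operand c  st=[]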

This website lists a lot of the operations.

http://www.website.masmforum.com/tutorials/fptute/appen1.htm

I'm sure Intel publishes the actual opcodes somewhere, if you're really that interested.

ajs410
Compilers really shouldn't be using the stack-based x87 FPU instructions any more; they have been deprecated by SSE, which has a new, much faster set of *scalar* floating point instructions. Of course, despite SSE's introduction ten years ago, GCC still pointedly ignores it by default.
Crashworks
A: 

FLOPS: what you ask Silicon Graphics technical experts and they don't know

LarsOn