This is perhaps more of a discussion question, but I thought stackoverflow could be the right place to ask it. I am studying the concept of instruction pipelining. I have been taught that a pipeline's instruction throughput increases as the number of pipeline stages increases, but in some cases throughput might not change. Under what conditions does this happen? I am thinking stalling and branching could be the answer to the question, but I wonder if I am missing something crucial.
I would also think that increasing the pipeline depth beyond the execution time of the longest instruction in a series would not cause an increase in performance. I do think that stalling and branching are the fundamental issues, though.
Definitely stalls/bubbles in long pipelines cause a huge loss in throughput. And of course, the longer the pipeline, the more clock cycles are wasted.
I tried for a long time to think of other scenarios where longer pipelines could cause a loss in performance, but it all comes back to stalls. (And number of execution units and issue schemes, but those don't have much to do with pipeline length.)
Agreed. The biggest problems are stalls (waiting for results from previous instructions), and incorrect branch prediction. If your pipeline is 20 stages deep, and you stall waiting for the results of a condition or operation, you're going to wait longer than if your pipeline was only 5 stages. If you predict the wrong branch, you have to flush 20 instructions out of the pipeline, as opposed to 5.
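To make the branch-misprediction cost concrete, here is a minimal back-of-the-envelope sketch in C. It is my own illustration, not from the answer above: it assumes a flush costs roughly the pipeline depth in cycles, and the branch fraction and mispredict rate are made-up numbers.

#include <stdio.h>

/* Rough model (assumption, not measured data): effective CPI =
 * base CPI + branch_fraction * mispredict_rate * flush_penalty,
 * where the flush penalty is roughly the pipeline depth in cycles. */
static double effective_cpi(int depth, double branch_frac,
                            double mispredict_rate, double base_cpi)
{
    double penalty = (double)depth;   /* cycles lost per mispredicted branch */
    return base_cpi + branch_frac * mispredict_rate * penalty;
}

int main(void)
{
    int depths[] = {5, 20};
    for (int i = 0; i < 2; i++)
        printf("%2d-stage pipeline: effective CPI ~ %.2f\n",
               depths[i], effective_cpi(depths[i], 0.2, 0.05, 1.0));
    return 0;
}

With these (illustrative) numbers, the 20-stage pipeline pays four times the mispredict penalty of the 5-stage one, which is the whole point: deeper pipelines only win if the work between flushes keeps them full.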
I guess presumably you could have a deep pipeline where multiple stages are attempting to access the same hardware (ALU, etc), which would cause a performance hit, though hopefully you throw in enough additional units to support each stage.
The throughput can be stalled by other instructions when waiting for a result, or on cache misses. Pipelining doesn't itself guarantee that the operations are totally independent. Here is a great presentation about the intricacies of the x86 Intel/AMD architecture: http://www.infoq.com/presentations/click-crash-course-modern-hardware
It explains stuff like this in great detail, and covers some solutions on how to further improve throughput and hide latency. JustJeff mentioned out-of-order execution for one, and you have shadow registers not exposed in the programmer model (more than 8 registers on x86), and you also have branch prediction.
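As a small C sketch of the cache-miss point (my own example, not from the talk): a linked-list walk is a chain of dependent loads, so the pipeline stalls on every miss, whereas an array sum has independent, predictable accesses that the hardware can overlap.

#include <stddef.h>

/* Each iteration must wait for the previous load of n->next to complete
 * before the next address is even known, so out-of-order execution and
 * prefetching can hide very little of the miss latency. */
struct node { struct node *next; long payload; };

long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->payload;        /* dependent load chain */
    return sum;
}

/* Independent, sequential accesses: many loads can be in flight at once,
 * keeping the pipeline fed. */
long sum_array(const long *a, size_t len)
{
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}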
Instruction level parallelism has diminishing returns. In particular, data dependencies between instructions determine the possible parallelism.
Consider the case of Read after Write (known as RAW in textbooks).
In the syntax where the first operand receives the result, consider this example:
10: add r1, r2, r3
20: add r1, r1, r1
The result of line 10 must be known by the time the computation of line 20 begins. Data forwarding mitigates this problem, but only to the point where the result becomes available.
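The same RAW limit shows up in higher-level code. Here is a hedged C sketch (my own illustration, not part of the answer): the dependent chain mirrors the add r1, r1, r1 example, so each add must wait for the previous result even with forwarding, while the second loop exposes several independent chains the hardware can run in parallel.

/* Dependent chain: N adds take roughly N * add_latency cycles,
 * because each add reads the result of the one before it (RAW). */
long dependent_chain(long x, int n)
{
    for (int i = 0; i < n; i++)
        x = x + x;                /* next add needs this result */
    return x;
}

/* Four independent accumulators: the CPU can issue these adds in
 * parallel, so throughput is no longer bound by the add latency.
 * (For brevity this sketch only handles n that is a multiple of 4.) */
long independent_adds(const long *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}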