Visualizing CPU Pipelining

2024/11/30

I want to share what I’ve learned about CPU pipelining. Thanks to Dan Luu’s branch prediction write-up, I was vaguely aware of how this worked conceptually, but I was motivated to dive into the details after reading Rodrigo Copetti’s PlayStation MIPS write-up, where he talked about branch delay slots and how they evolved into branch prediction. I quickly found so many subtle and fascinating details on CPU pipelining that I had to share them. For more details, I recommend chapter 4 of Computer Organization and Design.

This post will assume you are familiar with the laundry or “assembly-line” model of CPU pipelining, but are hazy on some of the lower-level details. It will also help to have a vague idea of the 5-stage MIPS pipeline (this is a nice, brief high-level summary), since our model CPU will be the 32-bit MIPS.

Visualizing pipelines

Let’s start by visualizing a basic CPU model that does not have pipelining (aka a single-cycle CPU design):

One bottleneck is that only one component of the CPU is active at a time, while all the others sit idle.

Pipelined CPUs fill in these vacancies by running instructions through the stages one after the other, rather than waiting for a single instruction to completely finish.

This seems pretty natural: it’s like a multi-person assembly line. But there are already some subtle problems that need to be solved.

Instruction Decoding

Instruction decoding orchestrates the entire pipeline by providing fields used by the other stages.

Let’s work through a simplified example. Say the instruction is

add $t1, $s2, $s3

The IF stage fetches the instruction and puts it into the ID stage register. Now the ID register contains fields that will be used by all the remaining stages.

instruction: 0x02534820
-> 
Field   op      rs      rt      rd      shamt   funct
Binary  000000  10010   10011   01001   00000   100000
#       add     $s2     $s3     $t1
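To make the field slicing concrete, here is a minimal Python sketch of the decode step (the helper name and dictionary layout are my own, not part of this post’s hardware model):

def decode_rtype(word):
    # Slice a 32-bit MIPS R-type instruction word into its fields.
    return {
        "op":    (word >> 26) & 0x3F,
        "rs":    (word >> 21) & 0x1F,
        "rt":    (word >> 16) & 0x1F,
        "rd":    (word >> 11) & 0x1F,
        "shamt": (word >> 6)  & 0x1F,
        "funct": word & 0x3F,
    }

print(decode_rtype(0x02534820))
# {'op': 0, 'rs': 18, 'rt': 19, 'rd': 9, 'shamt': 0, 'funct': 32}
# $s2 is register 18, $s3 is 19, $t1 is 9; funct 32 (0x20) means add.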

With pipelining, we have a problem. For example, in cycle 3 in the previous visualization, add moves into EX and sub moves into ID. But WB needs to know the rd of the add instruction! That field used to be safely stored in the ID register, but sub overwrote it.

The solution is to have registers between each pipeline stage that carry fields from the ID stage (as well as other stages).

Now when add reaches WB, it has the rd field so it can write to the correct register. The actual value from EX or MEM is also stored in the MEM/WB register.
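As a rough Python sketch of these inter-stage registers (the field names are my own simplification, not a real hardware description):

from dataclasses import dataclass

@dataclass
class StageReg:
    # One inter-stage register (ID/EX, EX/MEM, or MEM/WB): it carries the
    # decoded fields forward along with whatever result exists so far.
    rs: int = 0
    rt: int = 0
    rd: int = 0       # destination register number, needed again at WB
    result: int = 0   # ALU or memory result produced so far

# Each instruction in flight keeps its own copy of its fields, so add's rd
# survives even after sub's fields replace it in the ID stage.
ex_mem = StageReg(rs=18, rt=19, rd=9, result=42)   # add $t1, $s2, $s3, one stage ahead
id_ex  = StageReg(rs=9,  rt=8,  rd=10)             # sub, just decoded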

Hazard detection

The field metadata from ID needs to be available at each stage to enable basic operations like writing back to the registers, but the fields also turn out to be crucial to solve data hazards.

To start, let’s look at this example:

sub has a dependency on the output of add: this is called a data hazard. Structurally, the CPU could proceed, but the result would be incorrect. We need a way to check whether any of the inputs of the instruction in ID match the output register of any downstream instruction.

Since the field metadata is propagated to each stage register, we have all the data we need! What’s missing is a unit to calculate this:

ID/EX.rs == EX/MEM.rd || ID/EX.rt == EX/MEM.rd ||
ID/EX.rs == MEM/WB.rd || ID/EX.rt == MEM/WB.rd

That’s the Hazard Detection Unit (HDU). This unit will both detect hazards and prevent progress until the hazard is resolved.
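Here is a minimal Python sketch of that comparison (the dictionary layout and register numbers are my own, continuing the add/sub example from above; real hardware also special-cases register 0):

def hazard(id_ex, ex_mem, mem_wb):
    # True if the instruction whose fields sit in ID/EX reads a register
    # that a downstream instruction (EX/MEM or MEM/WB) has yet to write back.
    return (id_ex["rs"] == ex_mem["rd"] or id_ex["rt"] == ex_mem["rd"] or
            id_ex["rs"] == mem_wb["rd"] or id_ex["rt"] == mem_wb["rd"])

id_ex  = {"rs": 9,  "rt": 8,  "rd": 10}  # sub: reads $t1 (register 9)
ex_mem = {"rs": 18, "rt": 19, "rd": 9}   # add $t1, $s2, $s3, one stage ahead
mem_wb = {"rs": 31, "rt": 31, "rd": 31}  # an unrelated earlier instruction
print(hazard(id_ex, ex_mem, mem_wb))     # True -> stall until it clears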

This hazard detection logic works for “R-type” instructions that write to registers (like add and and), but not every instruction writes to registers (like sw). We’d technically need to check a few other conditions for a complete implementation, especially around data hazards involving lw and sw.

The solid lines show the inputs to the HDU, detecting the hazard. The dotted lines are the control signals, stopping the PC and IF and writing a nop instruction to the ID/EX register.

The nop will proceed through each stage, eventually causing the hazard comparison to evaluate to false. This lifts the stall on the PC and IF stages and removes the nop-injection signal, allowing the sub instruction to continue.

Why are stalls also called bubbles? Because once they appear they flow through the pipeline until they pop out of the end. Like a diver who lets out air underwater, the bubble will rise to the surface. Stretching the analogy further, the diver, like the HDU, can let out more than one bubble depending on the circumstance. The previous example let out 2 bubbles.

Forwarding

There’s another way we can use the hazard detection logic on the register metadata to resolve certain types of hazards.

Instead of stalling, we can “forward” an intermediate result to the next instruction.

Aside: Even though the data is moving back in the pipeline, it’s called forwarding because in a multi-clock pipelining diagram the dependencies can be written “forward” in time. See page 364 of Computer Organization and Design 4th Ed. for more discussion.

Let’s continue with our add and sub example. The between-stage registers not only hold metadata, but they also hold the results of the prior stage. If we detect a hazard, we can immediately use that result instead of waiting for the instruction to write back to the register and exit the pipeline.

The forwarding unit (FU) uses the same comparison logic, and if it detects a hazard, the control sub-unit ensures the EX result is used instead of the input register data (r3 in this case). For this particular case, the hazard resolves with zero stalls! (See figure 4.60 in Computer Organization and Design 4th Ed. for the full picture; this is a bit of a simplification.)
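A rough Python sketch of that selection, following the case where the register being forwarded is r3 (the function and its arguments are my own simplification of the mux the FU controls):

def ex_operand(regfile_value, id_ex_src, ex_mem):
    # Pick one EX input: forward the EX/MEM result when the instruction one
    # stage ahead is about to write the register we want to read.
    if ex_mem["rd"] == id_ex_src and ex_mem["rd"] != 0:
        return ex_mem["result"]      # fresh value straight from EX
    return regfile_value             # no hazard: use the value read from the registers

ex_mem = {"rd": 3, "result": 42}     # add just computed r3 = 42
print(ex_operand(regfile_value=7, id_ex_src=3, ex_mem=ex_mem))  # 42, not the stale 7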

HDU and FU

In the FU diagram, I did not add the comparison between the ID/EX input registers and the MEM/WB output registers that I had for the HDU.

The reason is that forwarding can’t handle this hazard alone. Let’s look at this example:

lw  r1, 0(r2)  
add r3, r1, r4

Our HDU handles this situation just fine: in fact, it’s identical to our previous example where we must stall until WB writes back to register r1, creating 2 bubbles.

But there’s an opportunity here: the intermediate result is available in the MEM/WB register before it’s written to r1, so we could forward it to add’s EX stage. To do this, the FU and HDU have to work in conjunction to save 1 bubble.
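As a rough sketch of that combined decision (the structure and names are my own simplification; the real units are wired somewhat differently):

def resolve_dependency(producer_is_load, dest, consumer_rs, consumer_rt):
    # Return (bubbles, forwarding source) for a consumer right behind its producer.
    if dest not in (consumer_rs, consumer_rt):
        return 0, None               # no dependency at all
    if producer_is_load:
        return 1, "MEM/WB"           # data only exists after MEM: one bubble, then forward
    return 0, "EX/MEM"               # ALU result: forward immediately, zero bubbles

# lw r1, 0(r2) followed immediately by add r3, r1, r4
print(resolve_dependency(True, dest=1, consumer_rs=1, consumer_rt=4))  # (1, 'MEM/WB')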

I’m not going to go into all the details on how the HDU and FU work together, because it’s a little bit different than how I’ve laid it out here. And as I hinted at with this example, there are nuances around lw and sw instructions you have to be careful with.

Branching

The tools we’ve built up so far (register metadata, forwarding, and stalls) will help us tackle branching aka control hazards.

We’ll start out with the simplest technique.

Predict branch not taken

The logic and implementation behind branch not taken is very similar to forwarding to resolve data hazards. With forwarding, the EX (or MEM) result of one instruction is used as input to the EX stage of an upcoming instruction.

For branch prediction, the EX stage calculates the branch condition and saves the result to the EX/MEM register. If the branch is taken, we “forward” that fact to all previous stages. This information is not used as an input; rather, it converts those instructions into nops that bubble through the pipeline.

Additionally, the branch target address is saved in the inter-stage registers and used to update the PC. It is calculated during the ID stage by the Branch Target Address Calculation (BTAC) unit from the branch’s PC and the branch offset argument (there is a diagram below showing the flow).
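A minimal Python sketch of that resolution step (the stage model is heavily simplified and the names are my own; “flush” here just means the younger instructions become nops):

def resolve_branch(taken, target, pc_plus_4, younger):
    # younger = instructions fetched after the branch, assuming not taken.
    if not taken:
        return pc_plus_4, younger            # prediction was right: keep going
    return target, ["nop"] * len(younger)    # wrong: squash them and redirect the PC

in_flight = ["add", "lw", "sub"]             # fetched while the branch was resolving
pc, in_flight = resolve_branch(taken=True, target=0x400100,
                               pc_plus_4=0x400008, younger=in_flight)
print(hex(pc), in_flight)                    # 0x400100 ['nop', 'nop', 'nop']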

Branch delay slot

A simple improvement to this scheme actually uses less hardware. Instead of converting the instructions in IF, ID, and EX into nops, we assume that the instruction immediately after the branch will execute no matter what. The position immediately after the branch is called the branch delay slot, and the compiler or programmer is responsible for filling it.

The advantage is that instead of 3 bubbles, we only have 2.

Suppose we add an instruction prior to the branch in our last example. The compiler may reorganize the instructions so add happens after the branch in the branch delay slot.

add t0, t1, t2  \   beq r1, r2, br
beq r1, r2, br  /   add t0, t1, t2
add r7, r1, r4      add r7, r1, r4
lw  r8, 0(t6)       lw  r8, 0(t6)

Now the pipeline can still “predict” branch not taken, but add t0 will still execute and do useful work.

Unfortunately, sometimes the compiler can’t find an instruction to fill the branch delay slot and will have to insert a nop anyway.

Take a look at this example, where we iterate over an array until we find zero.

loop:
    lw   t0, 0(s0)    
    beq  t0, 0, done 
    nop                 
    add  s0, s0, 4      
    j    loop           
done:

add cannot go into the branch delay slot, because if the branch were taken, we would have incorrectly modified s0: it would no longer point to the first zero in the array. And we can’t move lw down into the slot, because the branch depends on the value of t0, so lw must execute first.

Now maybe a sufficiently intelligent compiler could figure out a solution (perhaps it’s okay to put add in the slot as long as we later subtract 4 to recover the pointer to the zero), but these are the situations the compiler has to consider.

Dynamic branch prediction

In this section, I only want to briefly cover dynamic branch prediction resolution, because Dan Luu’s write-up covers many details about the prediction itself (I recommend the section on 2-bit saturating counters. His explanation really made things click for me).

Luckily, the resolution logic for dynamic branch prediction is almost identical to the resolution logic when we predict branch not taken. In that simpler case, we compare the branch result from the EX stage against our assumption that the branch was not taken.

For dynamic branch prediction resolution, we need to store the prediction and compare it to the actual branch result in the EX/MEM register. This means that when we fetch the branch prediction in the IF stage, the prediction must be stored in the IF/ID register and propagated to the downstream inter-stage registers.
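As a rough Python sketch of that compare-and-update loop, using the classic 2-bit saturating counter from Dan Luu’s write-up (the function names and structure are my own):

def predict(counter):
    # 2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken.
    return counter >= 2

def resolve(counter, prediction, taken):
    # Compare the prediction carried down the pipeline with the EX result,
    # nudge the counter toward the actual outcome, and report a misprediction.
    counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return counter, prediction != taken

counter = 0                                  # start strongly not-taken
for taken in [True, True, True, False]:      # a branch that is usually taken
    guess = predict(counter)
    counter, flush = resolve(counter, guess, taken)
    print(guess, taken, flush)               # flush=True means a misprediction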

I’m not including an interactive visualization here, because it’s identical to the last one. I do have a rough schematic of branch prediction and resolution to show the high-level flow:

  • BRU (Branch Resolution Unit)
    • compares the actual branch result to the prediction
    • flushes the pipeline if the prediction was incorrect
    • signals the BPU to update its prediction (depending on the prediction scheme, it tells the unit whether the prediction was right or wrong)
    • on a misprediction, sets the PC either to the next address after the branch or to the branch target address calculated by the BTAC (depending on whether the branch was taken)
  • BTAC (Branch Target Address Calculation)
    • calculates the branch target address from the branch’s PC and the branch offset
    • saves the branch target address to the ID/EX register, to be used if the prediction was “not taken” but the branch should have been taken
  • BPU (Branch Prediction Unit, aka Branch Target Buffer (BTB))
    • looks up the branch prediction based on the last bits of the branch’s PC
    • writes the prediction to the IF/ID register
    • stores whether the branch was last taken or not

Conclusion

I first got interested in CPU pipelining through high-level articles like Dan Luu’s branch prediction write-up and Rodrigo Copetti’s PlayStation architecture analysis. Conceptually, the ideas were fascinating, and they made me want to dive into the details to see how everything worked.

I was most impressed by how the simple core mechanisms of register metadata, stalls, and forwarding are continuously combined and remixed to solve increasingly complex problems.