Processor Design · 8 min read

Register Scoreboards: Beyond Simple Forwarding in Pipelined Processors

In this article, we will explore the limitations of operand forwarding and how to effectively use register scoreboards

Introduction

Data forwarding (also known as bypassing) is a fundamental technique in pipelined processor design that allows results to be used before they’re written back to the register file. For single-cycle operations in a classic RISC pipeline, forwarding elegantly solves most Read-After-Write (RAW) data hazards. However, when we introduce multi-cycle operations like multiplication, division, or memory loads, simple forwarding becomes insufficient. This article explores why register scoreboards are essential for handling these complex scenarios, using a practical RISC-V implementation as a case study.

👉🏻 Tiny Vedas Codebase Here 👈🏻

The Promise and Limitations of Forwarding

How Basic Forwarding Works

In a standard 5-stage RISC pipeline (Fetch, Decode, Execute, Memory, Writeback), forwarding allows the Execute stage to receive operands directly from later pipeline stages rather than waiting for writeback to the register file:

Time:     t0    t1    t2    t3    t4    t5
Instr1:   IF    ID    EX    MEM   WB
Instr2:         IF    ID    EX    MEM   WB

                      Forwarding from EX/MEM

When instruction 2 depends on instruction 1’s result, we can forward from the EX/MEM pipeline register directly to the Execute stage, eliminating the need to stall.
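
Concretely, for single-cycle results the bypass network reduces to a mux in front of each ALU input. The sketch below is illustrative only (the signal names are hypothetical, not taken from Tiny Vedas), assuming the usual EX/MEM and MEM/WB pipeline registers:

// Illustrative EX-stage bypass mux for the first ALU operand;
// all signal names here are hypothetical.
logic [31:0] alu_op_a;

always_comb begin
  // Default: the value read from the register file during decode
  alu_op_a = id_ex_rs1_data;

  // Newest value wins: forward from EX/MEM if the previous instruction
  // writes the register we are about to read (x0 is never forwarded)
  if (ex_mem_rd_wr_en && (ex_mem_rd_addr == id_ex_rs1_addr) && (id_ex_rs1_addr != '0))
    alu_op_a = ex_mem_alu_result;
  // Otherwise forward from MEM/WB for the instruction before that
  else if (mem_wb_rd_wr_en && (mem_wb_rd_addr == id_ex_rs1_addr) && (id_ex_rs1_addr != '0))
    alu_op_a = mem_wb_result;
end
// The same mux is duplicated for the second ALU operand.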

Where Forwarding Falls Short

Consider a multiply operation that takes 3 cycles to complete:

Time:          t0    t1    t2    t3    t4    t5    t6
MUL r3,r1,r2:  IF    ID    EX1   EX2   EX3   MEM   WB
ADD r4,r3,r5:        IF    ID    EX    MEM   WB

                           Where is r3?

At time t2, when the ADD instruction reaches the decode stage and needs to read r3, the multiply is still in its first execution cycle (EX1). The result won’t be available for forwarding until time t5, three cycles later. Simply forwarding from the execute stage is no longer sufficient.

The Multi-Cycle Operation Challenge

Different Operation Latencies

Modern processors must handle operations with varying execution times:

  • ALU operations: 1 cycle (ADD, SUB, AND, OR, XOR, shifts)
  • Multiply operations: 1-4 cycles (depending on implementation)
  • Divide operations: 8-32+ cycles (often iterative)
  • Load operations: 1+ cycles (cache hit) to 100+ cycles (cache miss)

Each of these creates a different forwarding challenge:

  1. Pipelined multi-cycle ops (multiply): Results available at different stages
  2. Blocking multi-cycle ops (divide): No intermediate results to forward
  3. Variable latency ops (loads): Unpredictable completion time

The Information Gap

The core problem is that forwarding logic needs to know:

  1. Which registers are currently unavailable? (RAW hazard detection)
  2. When will each register’s value be ready? (Stall duration calculation)
  3. Where should the value be forwarded from? (Bypass path selection)

For single-cycle operations, all three questions have trivial answers. For multi-cycle operations, we need a mechanism to track in-flight register writes.

Enter the Register Scoreboard

What is a Register Scoreboard?

A register scoreboard is a hardware structure that tracks which registers have pending writes. At its simplest, it’s a bit vector where each bit indicates whether a register has an in-flight operation:

logic [31:0] scoreboard;  // One pending-write bit per register

// Set a bit when issuing a long-latency operation,
// clear it when the result writes back.
always_ff @(posedge clk) begin
  if (!rstn)
    scoreboard <= '0;
  else begin
    if (clear_rd_wr_en) scoreboard[clear_rd_addr] <= 1'b0;
    if (set_rd_wr_en)   scoreboard[set_rd_addr]   <= 1'b1;
  end
end

// Check source operands for RAW hazards against in-flight writes
assign raw_hazard = scoreboard[rs1_addr] | scoreboard[rs2_addr];

Implementation in the Tiny Vedas Core

Looking at a real implementation from the Tiny Vedas RISC-V core, the register scoreboard module is instantiated in IDU1:

rsb #(
    .N_REG(32)
) rsb_idu0_i (
    .clk           (clk),
    .rstn          (rstn),
    .pipe_flush    (pipe_flush),
    .rs1_addr      (idu0_out.rs1_addr),
    .rs2_addr      (idu0_out.rs2_addr),
    .rs1_rd_en     (idu0_out.rs1 & idu0_out.legal),
    .rs2_rd_en     (idu0_out.rs2 & idu0_out.legal),
    .rs1_hit       (rs1_rsb_hit_idu0),
    .rs2_hit       (rs2_rsb_hit_idu0),
    .set_rd_addr   (idu0_out.rd_addr),
    .set_rd_wr_en  (idu0_out.legal & (idu0_out.mul | idu0_out.load) & ~(pipe_stall | idu0_rsb_hit_stall)),
    .clear_rd_addr (exu_wb_rd_addr),
    .clear_rd_wr_en(exu_wb_rd_wr_en)
);

Key observations:

  1. Selective tracking: Only multiply and load operations set the scoreboard bit (idu0_out.mul | idu0_out.load)
  2. Early detection: Scoreboard is checked in IDU0, before register file read in IDU1
  3. Writeback clearing: Scoreboard bits are cleared when results actually write back
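
The module body is not shown above, but the port list already suggests what it must contain. Here is a minimal sketch of an rsb module with that interface; the real Tiny Vedas implementation may differ in details such as flush behavior and the handling of x0:

module rsb #(
    parameter N_REG = 32
) (
    input  logic                     clk,
    input  logic                     rstn,
    input  logic                     pipe_flush,
    input  logic [$clog2(N_REG)-1:0] rs1_addr,
    input  logic [$clog2(N_REG)-1:0] rs2_addr,
    input  logic                     rs1_rd_en,
    input  logic                     rs2_rd_en,
    output logic                     rs1_hit,
    output logic                     rs2_hit,
    input  logic [$clog2(N_REG)-1:0] set_rd_addr,
    input  logic                     set_rd_wr_en,
    input  logic [$clog2(N_REG)-1:0] clear_rd_addr,
    input  logic                     clear_rd_wr_en
);

  logic [N_REG-1:0] pending;  // one bit per architectural register

  always_ff @(posedge clk) begin
    if (!rstn || pipe_flush)
      pending <= '0;                                        // drop all tracking on reset/flush
    else begin
      if (clear_rd_wr_en) pending[clear_rd_addr] <= 1'b0;   // result wrote back
      if (set_rd_wr_en)   pending[set_rd_addr]   <= 1'b1;   // long-latency op issued
    end
  end

  // A "hit" means the operand is still being produced by an in-flight instruction
  assign rs1_hit = rs1_rd_en & pending[rs1_addr];
  assign rs2_hit = rs2_rd_en & pending[rs2_addr];

endmodule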

Scoreboard + Forwarding: A Hybrid Approach

The scoreboard doesn’t replace forwarding; it complements it. The implementation shows this hybrid approach:

// WB to IDU1 forwarding
assign idu1_out_i.rs1_data = ((exu_wb_rd_addr == idu0_out.rs1_addr) & idu0_out.rs1 & exu_wb_rd_wr_en) 
                              ? exu_wb_data : rs1_data;
assign rs1_fwd_idu0 = ((exu_wb_rd_addr == idu0_out.rs1_addr) & idu0_out.rs1 & exu_wb_rd_wr_en);

// Only stall if scoreboard hit AND no forwarding available
assign idu0_rsb_hit_stall = (rs1_rsb_hit_idu0 & ~rs1_fwd_idu0) | (rs2_rsb_hit_idu0 & ~rs2_fwd_idu0);

The processor only stalls when there’s both a scoreboard hit (register in-flight) AND no forwarding path available. If the value happens to be at the writeback stage, forwarding rescues us from the stall.

Intelligent Pipeline Stall Management

Beyond Binary Stall/No-Stall

The Tiny Vedas implementation demonstrates sophisticated stall management that goes beyond simple hazard detection:

always_comb begin : pipe_stall_management
  pipe_stall = 1'b0;
  
  // Stall after MUL if next instruction is not MUL (allows pipelined multiplies)
  if (last_issued_instr.mul & (idu1_out_gated.legal & ~idu1_out_gated.mul) & ~idu1_out_gated.nop) begin
    pipe_stall = exu_mul_busy;
  end 
  
  // Always stall during division (blocking operation)
  else if (last_issued_instr.div) begin
    pipe_stall = exu_div_busy;
  end 
  
  // Pipeline loads when possible
  else if (last_issued_instr.load & (idu1_out_gated.legal & ~idu1_out_gated.load)) begin
    pipe_stall = exu_lsu_busy;
  end 
  
  // Pipeline stores when possible
  else if (last_issued_instr.store & (idu1_out_gated.legal & ~idu1_out_gated.store)) begin
    pipe_stall = exu_lsu_busy;
  end
  
  pipe_stall |= exu_lsu_stall;
end

Multiply Operation Optimization

The multiply stall condition is written to exploit a pipelined multiplier. Compare two scenarios:

Scenario 1: MUL followed by non-MUL
  MUL r3,r1,r2
  ADD r4,r5,r6    <- Independent, but must wait
  
Scenario 2: Back-to-back MULs (no dependency)
  MUL r3,r1,r2
  MUL r4,r5,r6    <- Can pipeline!

This optimization exploits pipelined multiply units. If you have a 3-stage pipelined multiplier, you can issue a new multiply every cycle even though each takes 3 cycles to complete. The stall logic recognizes this and only stalls when transitioning from MUL to non-MUL operations.
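
To see why back-to-back multiplies can issue every cycle, consider a generic 3-stage pipelined multiplier. This is an illustrative sketch, not the Tiny Vedas multiplier:

// Illustrative 3-stage pipelined multiplier: a new operation can enter
// every cycle, and each result emerges three cycles after issue.
logic [31:0] a_q, b_q;          // stage 1: registered operands
logic [63:0] pp_lo_q, pp_hi_q;  // stage 2: registered partial products
logic [63:0] prod_q;            // stage 3: full 64-bit product

always_ff @(posedge clk) begin
  // Stage 1: capture the operands
  a_q <= rs1_data;
  b_q <= rs2_data;
  // Stage 2: split the multiply into two narrower partial products
  pp_lo_q <= a_q * b_q[15:0];
  pp_hi_q <= a_q * b_q[31:16];
  // Stage 3: combine the partial products into the final result
  prod_q  <= pp_lo_q + (pp_hi_q << 16);
end

Each stage register holds a different in-flight multiply, so a new MUL can enter every cycle; the stall logic above only has to drain the multiplier when a non-MUL instruction follows.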

Division: The Blocking Operation

Division uses a different strategy:

else if (last_issued_instr.div) begin
  pipe_stall = exu_div_busy;
end

Because division is typically implemented as an iterative algorithm (non-restoring, SRT, or Goldschmidt), there are no intermediate results to forward. The pipeline simply blocks until completion.
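
One simple way to picture the blocking behavior is a countdown-based busy flag around the iterative divider. Again, this is an illustrative sketch (div_start is a hypothetical signal), not the Tiny Vedas divider:

// Illustrative busy tracking for an iterative divider that resolves
// one quotient bit per cycle over 32 cycles.
logic [5:0] div_cnt;
logic       exu_div_busy;

always_ff @(posedge clk) begin
  if (!rstn)
    div_cnt <= '0;
  else if (div_start)            // hypothetical: a DIV/REM enters execute
    div_cnt <= 6'd32;
  else if (div_cnt != '0)
    div_cnt <= div_cnt - 6'd1;
end

assign exu_div_busy = (div_cnt != '0);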

Design Tradeoffs and Considerations

Scoreboard Granularity

The Tiny Vedas implementation uses a simple bit-vector scoreboard. More sophisticated designs might track:

  • Completion time: How many cycles until ready?
  • Forwarding points: Which pipeline stages have the value?
  • Operation type: Different handling for different ops

The tradeoff is area/complexity versus performance. For an in-order core like Tiny Vedas, a simple bit vector is sufficient.
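
As a purely hypothetical illustration of the richer variant, each register could carry a small record instead of a single bit; nothing like this exists in Tiny Vedas:

// Hypothetical per-register scoreboard entry for a more aggressive design
typedef struct packed {
  logic       pending;      // a write to this register is in flight
  logic [2:0] cycles_left;  // countdown until the value is ready
  logic [1:0] fwd_stage;    // which pipeline stage will hold the value
  logic [1:0] op_type;      // e.g. MUL, DIV, LOAD, for type-specific stalls
} sb_entry_t;

sb_entry_t scoreboard [32];

// The hazard check can then choose between forwarding, a short stall,
// or a long stall instead of returning a single hit/miss answer.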

What Operations to Track

Notice that the scoreboard only tracks multiply and load operations:

.set_rd_wr_en (idu0_out.legal & (idu0_out.mul | idu0_out.load) & ...)

Why not track all operations? Because:

  1. Single-cycle ALU ops are fully handled by forwarding
  2. Division ops use the blocking stall mechanism
  3. Tracking overhead should be proportional to benefit

This selective tracking reduces hardware cost while still catching the critical cases.

Performance Impact

The scoreboard introduces a small amount of overhead:

  • Area: ~32 bits of state (one per register)
  • Timing: Comparison logic in the critical path
  • Stall cycles: When forwarding can’t resolve the hazard

But it enables much higher performance than the alternative (stalling for all multi-cycle operations or disallowing them entirely).

Comparison with Out-of-Order Techniques

Tomasulo’s Algorithm and Reservation Stations

Out-of-order processors use more sophisticated techniques like Tomasulo’s algorithm, which includes:

  • Reservation stations: Buffer instructions waiting for operands
  • Register renaming: Eliminate false dependencies (WAR, WAW)
  • Common Data Bus: Broadcast results to all waiting instructions

A register scoreboard is simpler but less powerful:

Feature                   Scoreboard         Tomasulo
Out-of-order execution    No                 Yes
Register renaming         No                 Yes
Area cost                 Low                High
Performance               Good (in-order)    Excellent
Complexity                Low                High

For an in-order core, a scoreboard provides an excellent balance of performance and complexity.

Speculative Execution

Modern high-performance cores also employ:

  • Branch prediction + speculative execution
  • Memory disambiguation
  • Load/store queues

The scoreboard is just one piece of the hazard detection puzzle. In Tiny Vedas, it works alongside:

  • Forwarding logic (immediate resolution when possible)
  • Stall management (intelligent blocking)
  • Pipeline flush on misprediction

Practical Implications

For Hardware Designers

When implementing a pipelined processor with multi-cycle operations:

  1. Start with forwarding for single-cycle operations
  2. Add a scoreboard for multi-cycle ops (multiply, load)
  3. Implement intelligent stall logic that understands operation types
  4. Optimize common cases (e.g., pipelined multiplies)
  5. Consider the critical path when adding hazard detection

For Compiler Writers

Understanding scoreboard behavior can inform optimization:

  • Instruction scheduling: Separate dependent instructions when one is multi-cycle
  • Loop unrolling: Can help hide multiply latency
  • Load/store reordering: Minimize stalls from memory operations

Example transformation:

# Before optimization
MUL r3, r1, r2
ADD r4, r3, r5    # Stall! Depends on MUL result
SUB r6, r7, r8

# After optimization  
MUL r3, r1, r2
SUB r6, r7, r8    # No dependency, execute while MUL completes
ADD r4, r3, r5    # Still depends on MUL, but fewer wasted cycles

For Verification Engineers

Scoreboard correctness is critical. Key test scenarios:

  1. Back-to-back dependent multiplies
  2. MUL followed by dependent ALU operation
  3. Forwarding race conditions (scoreboard clear vs. read)
  4. Pipeline flush during multi-cycle operation
  5. Load-use hazards with varying cache latencies
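
For instance, the forwarding race in item 3 above can be pinned down with concurrent assertions over the existing IDU0 signals. This is a sketch of the kind of check one might write, not a property taken from the Tiny Vedas testbench:

// If a source register is flagged in-flight and no forwarding path covers it,
// the decode stage must stall rather than let a stale value reach execute.
assert property (@(posedge clk) disable iff (!rstn)
  (rs1_rsb_hit_idu0 && !rs1_fwd_idu0) |-> idu0_rsb_hit_stall)
  else $error("RAW hazard on rs1 reached execute without stall or forward");

// Liveness: any scoreboard hit eventually clears, so the pipeline
// cannot deadlock waiting on a writeback that never arrives.
assert property (@(posedge clk) disable iff (!rstn || pipe_flush)
  rs1_rsb_hit_idu0 |-> s_eventually !rs1_rsb_hit_idu0);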

Conclusion

Register scoreboards solve a fundamental problem in pipelined processor design. While simple forwarding works well for single-cycle operations, multi-cycle operations require a mechanism to track in-flight register writes. The scoreboard provides this tracking with minimal hardware overhead.

The Tiny Vedas RISC-V core demonstrates a practical implementation that:

  • Uses a simple bit-vector scoreboard for multiply and load operations
  • Combines scoreboard checking with forwarding
  • Implements stall logic that understands operation characteristics
  • Balances performance with design complexity

For engineers designing processors, the lesson is straightforward: forwarding alone is not enough. Multi-cycle operations require tracking mechanisms like scoreboards, and the stall management strategy should match the characteristics of each operation type.

References

  • Patterson, D. A., & Hennessy, J. L. (2017). Computer Organization and Design RISC-V Edition: The Hardware Software Interface
  • Hennessy, J. L., & Patterson, D. A. (2011). Computer Architecture: A Quantitative Approach (5th ed.)
  • RISC-V Instruction Set Manual, Volume I: User-Level ISA
  • Thornton, J. E. (1964). Design of a Computer: The Control Data 6600

About Tiny Vedas

Tiny Vedas is an educational RISC-V processor implementation designed to demonstrate fundamental processor architecture concepts in real, synthesizable SystemVerilog. The design emphasizes clarity and educational value while maintaining practical efficiency.
