Chapter 4: The Processor
§4.1 Introduction
Introduction
› CPU performance factors
– Instruction count
› Determined by ISA and compiler
– CPI and Cycle time
› Determined by CPU hardware
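These factors combine multiplicatively (CPU Time = Instruction Count x CPI x Clock Cycle Time); a minimal C sketch, with illustrative function name and numbers:

#include <stdio.h>

// CPU Time = Instruction Count x CPI x Clock Cycle Time
double cpu_time(double instructions, double cpi, double cycle_time_s) {
    return instructions * cpi * cycle_time_s;
}

int main(void) {
    // illustrative numbers: 1e9 instructions, CPI of 1.5, 500 ps cycle time
    printf("%g s\n", cpu_time(1e9, 1.5, 500e-12));  // prints 0.75 s
    return 0;
}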
Combinational Elements
◼ AND gate: Y = A & B
◼ Adder: Y = A + B
◼ Multiplexer: Y = S ? I1 : I0
◼ Arithmetic/Logic Unit (ALU): Y = F(A, B)
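In software terms these elements are pure functions of their inputs; a minimal C sketch (the names are illustrative, and the F encoding is chosen to match the ALU control values used later in §4.4):

#include <stdint.h>

uint64_t and_gate(uint64_t a, uint64_t b) { return a & b; }             // Y = A & B
uint64_t adder(uint64_t a, uint64_t b)    { return a + b; }             // Y = A + B
uint64_t mux2(uint64_t i0, uint64_t i1, int s) { return s ? i1 : i0; }  // Y = S ? I1 : I0

// Y = F(A, B): F selects the operation (0010 = add, 0110 = subtract, as in §4.4)
uint64_t alu(uint64_t a, uint64_t b, unsigned f) {
    switch (f) {
        case 0x0: return a & b;  // 0000: AND
        case 0x1: return a | b;  // 0001: OR
        case 0x2: return a + b;  // 0010: add
        case 0x6: return a - b;  // 0110: subtract
        default:  return 0;      // other encodings not modeled here
    }
}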
Sequential Elements
› Register: stores data in a circuit
– Output Q updates to input D on the clock (Clk) edge
[Figure: D flip-flop with input D, output Q, and clock Clk]
Sequential Elements
› Register with write control
– Only updates on clock edge when write control input is 1
– Used when stored value is required later
[Figure: register with inputs D and Write, output Q, and clock Clk]
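A minimal C model of this element (the struct and function names are illustrative; clock_edge stands in for the rising edge of Clk):

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t q; } reg64;  // stored value, visible at output Q

void clock_edge(reg64 *r, uint64_t d, bool write) {
    if (write)     // update only when Write = 1 at the clock edge
        r->q = d;  // otherwise the old value is retained
}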
Clocking Methodology
› Combinational logic transforms data during clock cycles
– Between clock edges
– Input from state elements, output to state elements
– Longest delay determines clock period
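– Example (illustrative numbers): if the longest path between state elements takes 800 ps, the clock period must be at least 800 ps, i.e. a clock rate of at most 1/800 ps = 1.25 GHz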
§4.3 Building a Datapath
Building a Datapath
› Datapath
– Elements that process data and addresses in the CPU
› Registers, ALUs, muxes, memories, …
Instruction Fetch
[Figure: the PC (a 64-bit register) addresses instruction memory; an adder increments the PC by 4 for the next instruction]
R-Format Instructions
› Example: add x9,x20,x21
[Figure: the instruction fields just re-route wires to the register file and ALU]
[Figure: immediate generation: the sign-bit wire is replicated to fill the upper bits]
Composing the Elements
› First-cut datapath does an instruction in one clock cycle
– Each datapath element can only do one function at a time
– Hence, we need separate instruction and data memories
› Use multiplexers where alternate data sources are used for different instructions
R-Type/Load/Store Datapath
Full Datapath
§4.4 A Simple Implementation Scheme
ALU Control
opcode   ALUOp   Operation         Opcode field   ALU function   ALU control
ld       00      load register     XXXXXXXXXXX    add            0010
sd       00      store register    XXXXXXXXXXX    add            0010
beq      01      branch on equal   XXXXXXXXXXX    subtract       0110
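A C sketch of this mapping, covering only the rows shown above (R-type, which also depends on the funct fields, is omitted):

// ALUOp -> 4-bit ALU control, per the table above
unsigned alu_control(unsigned aluop) {
    switch (aluop) {
        case 0x0: return 0x2;  // ALUOp 00 (ld/sd): add      -> 0010
        case 0x1: return 0x6;  // ALUOp 01 (beq):   subtract -> 0110
        default:  return 0xF;  // R-type needs the funct fields; illustrative sentinel
    }
}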
Pipelining Analogy
◼ Four loads:
– Speedup = 8/3.5 = 2.3
◼ Non-stop:
– Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
RISC-V Pipeline
› Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
Pipeline Performance
› Assume time for stages is
– 100ps for register read or write
– 200ps for other stages
› Compare pipelined datapath with single-cycle datapath
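– With these numbers, the slowest instruction (ld) takes 200 + 100 + 200 + 200 + 100 = 800 ps, so the single-cycle clock period is 800 ps, while the pipelined clock period is set by the slowest stage: 200 ps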
Control Hazards
› In RISC-V pipeline
– Need to compare registers and compute target early in the pipeline
– Add hardware to do it in ID stage
Stall on Branch
› Wait until branch outcome determined before fetching next instruction
Branch Prediction
› Longer pipelines can’t readily determine branch outcome early
– Stall penalty becomes unacceptable
› Predict outcome of branch
– Only stall if prediction is wrong
› In RISC-V pipeline
– Can predict branches not taken
– Fetch instruction after branch, with no delay
More-Realistic Branch Prediction
› Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
› Predict backward branches taken
› Predict forward branches not taken
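In C, the static rule reduces to a sign test on the PC-relative offset (a minimal sketch; the signature is illustrative):

#include <stdbool.h>
#include <stdint.h>

// backward branch (negative offset, typically a loop bottom): predict taken
// forward branch (positive offset, typically an if):           predict not taken
bool static_predict_taken(int64_t pc_relative_offset) {
    return pc_relative_offset < 0;
}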
[Figure: pipelined datapath; right-to-left flow into MEM and WB leads to hazards]
Pipeline Registers
› Need registers between stages
– To hold information produced in previous cycle
Pipeline Operation
› Cycle-by-cycle flow of instructions through the pipelined datapath
– “Single-clock-cycle” pipeline diagram
› Shows pipeline usage in a single cycle
› Highlights resources used
– cf. “multi-clock-cycle” diagram
› Graph of operation over time
WB for Load
[Figure: the write-register number arriving in WB belongs to a later instruction (wrong register number)]
Corrected Datapath for Load
EX for Store
MEM for Store
WB for Store
Multi-Cycle Pipeline Diagram
› Form showing resource usage
Multi-Cycle Pipeline Diagram
› Traditional form
Single-Cycle Pipeline Diagram
› State of pipeline in a given cycle
Pipelined Control (Simplified)
Pipelined Control
› Control signals derived from instruction
– As in single-cycle implementation
Pipelined Control
§4.7 Data Hazards: Forwarding vs. Stalling
Data Hazards in ALU Instructions
› Consider this sequence:
sub x2, x1, x3
and x12, x2, x5
or x13, x6, x2
add x14, x2, x2
sd x15, 100(x2)
[Figure: load-use hazard; stall (bubble) inserted here]
Datapath with Hazard Detection
Stalls and Performance
[Figure: branch taken: flush these instructions (set control values to 0) and update the PC]
Reducing Branch Delay
› Move hardware to determine outcome to ID stage
– Target address adder
– Register comparator
› Example: branch taken
36: sub x10, x4, x8
40: beq x1, x3, 16 // PC-relative branch to 40+16*2 = 72
44: and x12, x2, x5
48: or x13, x2, x6
52: add x14, x4, x2
56: sub x15, x6, x7
...
72: ld x4, 50(x7)
Example: Branch Taken
Example: Branch Taken
Dynamic Branch Prediction
› In deeper and superscalar pipelines, branch penalty is more significant
› Use dynamic prediction
– Branch prediction buffer (aka branch history table)
– Indexed by recent branch instruction addresses
– Stores outcome (taken/not taken)
– To execute a branch
› Check table, expect the same outcome
› Start fetching from fall-through or target
› If wrong, flush pipeline and flip prediction
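A minimal C sketch of a 1-bit prediction buffer (the table size and PC indexing are illustrative assumptions):

#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024

static bool bht[BHT_ENTRIES];  // one bit per entry: true = predict taken

static unsigned bht_index(uint64_t pc) {
    return (pc >> 2) % BHT_ENTRIES;  // index by recent branch address bits
}

bool bht_predict(uint64_t pc) {
    return bht[bht_index(pc)];       // expect the same outcome as last time
}

void bht_update(uint64_t pc, bool taken) {
    bht[bht_index(pc)] = taken;      // on a mispredict this flips the prediction
}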
1-Bit Predictor: Shortcoming
› Inner loop branches mispredicted twice!
outer: …
…
inner: …
…
beq …, …, inner
…
beq …, …, outer
– Mispredict as taken on last iteration of inner loop
– Then mispredict as not taken on first iteration of inner loop next time around
§4.9 Exceptions
Exceptions and Interrupts
› Exception
– Arises within the CPU (e.g., undefined opcode, syscall, …)
› Interrupt
– From an external I/O controller
› Dealing with them without sacrificing performance is hard
Handling Exceptions
› Save PC of offending (or interrupted) instruction
– In RISC-V: Supervisor Exception Program Counter (SEPC)
› Save indication of the problem
– In RISC-V: Supervisor Exception Cause Register (SCAUSE)
› Jump to handler
– Assume at 0000 0000 1C09 0000hex
An Alternate Mechanism
› Vectored Interrupts
– Handler address determined by the cause
› Exception vector address to be added to a vector table base register:
– Undefined opcode: 00 0100 0000two
– Hardware malfunction: 01 1000 0000two
– …: …
› Instructions either
– Deal with the interrupt, or
– Jump to real handler
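A C sketch of the dispatch computation (the names are illustrative; the offsets are the ones listed above, 0x040 for undefined opcode and 0x180 for hardware malfunction):

#include <stdint.h>

typedef void (*handler_t)(void);

// handler address = vector table base + cause-specific offset
handler_t vector_dispatch(uintptr_t vector_base, uintptr_t cause_offset) {
    return (handler_t)(vector_base + cause_offset);  // e.g., base + 0x040 or base + 0x180
}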
Handler Actions
› Read cause, and transfer to relevant handler
› Determine action required
› If restartable
– Take corrective action
– Use SEPC to return to program
› Otherwise
– Terminate program
– Report error using SEPC, SCAUSE, …
Exceptions in a Pipeline
› Another form of control hazard
› Consider malfunction on add in EX stage
add x1, x2, x1
– Prevent x1 from being clobbered
– Complete previous instructions
– Flush add and subsequent instructions
– Set SEPC and SCAUSE register values
– Transfer control to handler
› Similar to mispredicted branch
– Use much of the same hardware
Pipeline with Exceptions
Exception Properties
› Restartable exceptions
– Pipeline can flush the instruction
– Handler executes, then returns to the instruction
› Refetched and executed from scratch
§4.10 Parallelism via Instructions
Hazards in the Dual-Issue RISC-V
› Load-use hazard
– Still one cycle use latency, but now two instructions
› More aggressive scheduling required
Scheduling Example
› Schedule this for dual-issue RISC-V
Loop: ld   x31,0(x20)   // x31 = array element
      add  x31,x31,x21  // add scalar in x21
      sd   x31,0(x20)   // store result
      addi x20,x20,-8   // decrement pointer
      blt  x22,x20,Loop // branch if x22 < x20
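One possible schedule, assuming the usual dual-issue pairing of one ALU/branch slot with one load/store slot per cycle (the sd offset becomes 8 once the addi is moved ahead of it):

      ALU/branch slot       Load/store slot      Cycle
Loop: nop                   ld   x31,0(x20)      1
      addi x20,x20,-8       nop                  2
      add  x31,x31,x21      nop                  3
      blt  x22,x20,Loop     sd   x31,8(x20)      4

– 5 instructions in 4 cycles: IPC = 1.25 (vs. peak IPC of 2)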
[Figure: dynamically scheduled CPU; reservation stations hold pending operands]
                           ARM Cortex-A53             Intel Core i7
Pipeline stages            8                          14
Pipeline schedule          Static in-order            Dynamic out-of-order with speculation
Branch prediction          Hybrid                     2-level
1st level caches/core      16-64 KiB I, 16-64 KiB D   32 KiB I, 32 KiB D
2nd level caches/core      128-2048 KiB               256 KiB (per core)
3rd level caches (shared)  (platform dependent)       2-8 MB
ARM Cortex-A53 Pipeline
ARM Cortex-A53 Performance
Core i7 Pipeline
Core i7 Performance
§4.12 Instruction-Level Parallelism and Matrix Multiply
Matrix Multiply
› Unrolled C code
#include <x86intrin.h>
#define UNROLL (4)

void dgemm (int n, double* A, double* B, double* C)
{
  for ( int i = 0; i < n; i+=UNROLL*4 )
    for ( int j = 0; j < n; j++ ) {
      __m256d c[4];
      for ( int x = 0; x < UNROLL; x++ )
        c[x] = _mm256_load_pd(C+i+x*4+j*n);

      for ( int k = 0; k < n; k++ )
      {
        __m256d b = _mm256_broadcast_sd(B+k+j*n);
        for ( int x = 0; x < UNROLL; x++ )
          c[x] = _mm256_add_pd(c[x],
                   _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
      }

      for ( int x = 0; x < UNROLL; x++ )
        _mm256_store_pd(C+i+x*4+j*n, c[x]);
    }
}
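Note that _mm256_load_pd and _mm256_store_pd require 32-byte-aligned addresses, and the i loop assumes n is a multiple of UNROLL*4 = 16; handling other sizes would need unaligned loads or cleanup code.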
Matrix Multiply
› Assembly code:
vmovapd (%r11),%ymm4             # Load 4 elements of C into %ymm4
mov %rbx,%rax                    # register %rax = %rbx
xor %ecx,%ecx                    # register %ecx = 0
vmovapd 0x20(%r11),%ymm3         # Load 4 elements of C into %ymm3
vmovapd 0x40(%r11),%ymm2         # Load 4 elements of C into %ymm2
vmovapd 0x60(%r11),%ymm1         # Load 4 elements of C into %ymm1
vbroadcastsd (%rcx,%r9,1),%ymm0  # Make 4 copies of B element
add $0x8,%rcx                    # register %rcx = %rcx + 8
vmulpd (%rax),%ymm0,%ymm5        # Parallel mul %ymm0, 4 A elements
vaddpd %ymm5,%ymm4,%ymm4         # Parallel add %ymm5, %ymm4
vmulpd 0x20(%rax),%ymm0,%ymm5    # Parallel mul %ymm0, 4 A elements
vaddpd %ymm5,%ymm3,%ymm3         # Parallel add %ymm5, %ymm3
vmulpd 0x40(%rax),%ymm0,%ymm5    # Parallel mul %ymm0, 4 A elements
vmulpd 0x60(%rax),%ymm0,%ymm0    # Parallel mul %ymm0, 4 A elements
add %r8,%rax                     # register %rax = %rax + %r8
cmp %r10,%rcx                    # compare %rcx to %r10
vaddpd %ymm5,%ymm2,%ymm2         # Parallel add %ymm5, %ymm2
vaddpd %ymm0,%ymm1,%ymm1         # Parallel add %ymm0, %ymm1
jne 68 <dgemm+0x68>              # jump if %rcx != %r10
add $0x1,%esi                    # register %esi = %esi + 1
vmovapd %ymm4,(%r11)             # Store %ymm4 into 4 C elements
vmovapd %ymm3,0x20(%r11)         # Store %ymm3 into 4 C elements
vmovapd %ymm2,0x40(%r11)         # Store %ymm2 into 4 C elements
vmovapd %ymm1,0x60(%r11)         # Store %ymm1 into 4 C elements
Performance Impact
§4.14 Fallacies and Pitfalls
Fallacies
› Pipelining is easy (!)
– The basic idea is easy
– The devil is in the details
› e.g., detecting data hazards