15IF11 Multicore A PDF
Session-1
Syllabus:
Introduction to instruction pipeline & superscalar
processing. Memory hierarchy, On-Chip Interconnects,
Tiled Chip Multicore Processors, Scheduling
Instruction Execution Principles
CISC vs RISC processors
Instruction Pipeline, Hazards
Static and Dynamic Scheduling
Superscalar Processors
Cache Memory, Main Memory
Network On Chip
Multicore Processors
TCMP scheduling
Reference Books
Computer Architecture: A Quantitative Approach (5th edition),
John L. Hennessy and David A. Patterson, Morgan Kaufmann.
Instruction Fetch → Instruction Decode → Operand Fetch → Execute → Result Store → Next Instruction
How memory works?
Introduction to RISC MIPS
[Figure: pipelined RISC (MIPS) datapath: PC with +4 adder, instruction memory, register file (RS1, RS2, RD), sign-extended immediate, ALU with Zero? test, data memory, and write-back mux, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
RISC Instruction Pipeline
Each instruction can take at most 5 clock cycles
Instruction fetch cycle (IF)
Based on PC, fetch the instruction from memory
Increment PC
[Figure: IF stage: PC, +4 adder, and instruction memory]
RISC Instruction Pipeline
Instruction decode/register fetch cycle (ID)
Decode the instruction + register read operation
Fixed field decoding: [ADD R1,R2,R3] or [LW R1,8(R2)]
Ex: A3.01.02.03 : 10100011 00000001 00000010 00000011
Ex: 86.01.08.02 : 10000110 00000001 00001000 00000010
[Figure: ID stage: register file read ports RS1 and RS2, sign-extended immediate, destination register RD]
RISC Instruction Pipeline
Execution/Effective address cycle (EX)
Memory reference: Calculate the effective address
[LW R1,8(R2)]: effective address = [R2] + 8
Register-register ALU instruction [ADD R1,R2,R3]
[Figure: EX stage: ALU, Zero? test, effective-address calculation]
RISC Instruction Pipeline
Memory access cycle (MEM)
Load: read the data from memory [LW R1,8(R2)]
Store: write the data from the register to memory [SW R3,16(R4)]
[Figure: MEM stage: data memory access]
RISC Instruction Pipeline
Write-back cycle (WB)
Register-register ALU instruction or load instruction
Write to the register file: [LW R1,8(R2)], [ADD R1,R2,R3]
[Figure: WB stage: result written back to the register file (RD)]
Pipelined RISC Data path
[Figure: pipelined datapath with stages Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
RISC Instruction Pipeline
IF ID EX MEM WB
Visualizing Pipelining
[Figure: pipeline timing diagram over clock cycles CC1 through CC9: successive instructions overlap, each flowing through IM, REG, ALU, DM, REG]
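A minimal sketch (not taken from the slides) of this overlap, assuming an ideal 5-stage pipeline with one instruction issued per cycle and no hazards; it prints the stage each instruction occupies in every clock cycle:

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(num_instructions):
    # Instruction i enters IF in cycle i and advances one stage per cycle.
    total_cycles = num_instructions + len(STAGES) - 1
    for i in range(num_instructions):
        row = []
        for cc in range(total_cycles):
            stage = cc - i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else ".")
        print(f"Instr {i + 1}: " + " ".join(f"{s:>3}" for s in row))

pipeline_chart(4)   # four overlapped instructions finish in 8 cycles, not 20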
Pipelining Issues
[Figure: pipeline diagram: a Load followed by Instr 1, 2, 3, 4; in one cycle the Load's data-memory access (DMem) and Instr 3's instruction fetch (Ifetch) both need the single memory]
Structural Hazard
Resolving Structural Hazard
Eliminate the use of the same hardware for two different things at the same time.
Solution 1: Wait (stall the later of the two conflicting instructions for a cycle)
[Figure: pipeline diagram: Load followed by Instr 1 to 3, with the conflicting instruction fetch delayed by one cycle]
Eliminating Structural Hazards at Design Time
[Figure: pipelined datapath with a separate instruction cache and data cache, removing the single-memory structural hazard]
Data Hazard
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11: the following instructions read r1 before the add has written it back]
Three Generic Data Hazards
I: add r1,r2,r3
J: sub r4,r1,r3
J reads r1 after I writes it: a read-after-write (RAW) dependence.
[Figure: same pipeline diagram as above, showing when r1 is written by the add and read by each following instruction]
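A minimal sketch (illustrative only, not the slides' notation) of how the three generic hazards between an earlier instruction I and a later instruction J can be classified from their destination and source registers:

def hazards(i_dest, i_srcs, j_dest, j_srcs):
    found = []
    if i_dest in j_srcs:
        found.append("RAW")   # J reads a register that I writes
    if j_dest in i_srcs:
        found.append("WAR")   # J writes a register that I reads
    if j_dest == i_dest:
        found.append("WAW")   # J writes the same register that I writes
    return found

# I: add r1,r2,r3   J: sub r4,r1,r3  ->  ['RAW']
print(hazards("r1", {"r2", "r3"}, "r4", {"r1", "r3"}))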
Hardware Change for Forwarding
[Figure: datapath with forwarding hardware: muxes at the ALU inputs select between the register file output, the EX/MEM register, and the MEM/WB register]
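A minimal sketch of the forwarding decision (the function and field names are assumptions, not the actual control equations): pick where an ALU source register value should come from, preferring the most recent producer still in the pipeline:

def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    if src_reg is None or src_reg == "r0":
        return "REGFILE"      # r0 / unused operand: nothing to forward
    if src_reg == ex_mem_dest:
        return "EX/MEM"       # result computed last cycle, not yet written back
    if src_reg == mem_wb_dest:
        return "MEM/WB"       # result being written back this cycle
    return "REGFILE"

# add r1,r2,r3 sits in EX/MEM when sub r4,r1,r3 reaches EX: r1 is forwarded from EX/MEM
print(forward_source("r1", ex_mem_dest="r1", mem_wb_dest=None))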
Data Hazard even with Operand Forwarding
[Figure: pipeline diagram for lw r1,0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9: the loaded value of r1 is not ready in time for the sub's EX stage, even with forwarding]
Resolving the Load-ALU Hazard
[Figure: pipeline diagram: a one-cycle bubble is inserted after lw r1,0(r2) so that sub r4,r1,r6 and the following instructions can receive r1 via forwarding]
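A minimal sketch of the load-use interlock (an assumed formulation of the check, not the actual control logic): stall the instruction in ID for one cycle when the instruction in EX is a load whose destination is one of ID's source registers:

def must_stall(ex_is_load, ex_dest, id_srcs):
    # Insert a bubble if the value being loaded is needed by the very next instruction.
    return ex_is_load and ex_dest is not None and ex_dest in id_srcs

# lw r1,0(r2) is in EX while sub r4,r1,r6 is in ID: stall one cycle
print(must_stall(ex_is_load=True, ex_dest="r1", id_srcs={"r1", "r6"}))   # True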
Software Scheduling for Load Hazards
Assume a, b, c, d, e, and f are in memory.

Source code:
a = b + c;
d = e - f;

Unscheduled code (a load-use stall occurs before each ADD/SUB):
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd

Scheduled code (loads are moved up so no loaded value is used in the very next cycle):
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
Control Hazard on Branches
[Figure: pipeline diagram for 10: beq r1,r3,36; 14: and r2,r3,r5; 18: or r6,r1,r7; 22: add r8,r1,r9; 36: xor r10,r1,r11: instructions after the branch enter the pipeline before the branch outcome is known]
Conventional MIPS Pipeline
Branching at 4th stage
[Figure: conventional MIPS pipelined datapath: the branch condition (Zero? test) and target are resolved in the MEM (4th) stage]
Branch Optimized MIPS Pipeline
Branching at 2nd stage
[Figure: branch-optimized MIPS datapath: the branch condition and target are computed in the ID (2nd) stage, so the branch is resolved earlier]
Dynamic branch prediction
Use a Branch Prediction Buffer (BPB)
Also called a Branch Prediction Table (BPT) or Branch History Table (BHT)
Records previous outcomes of the branch instruction
How to index into the table is an issue (typically the low-order bits of the branch address are used)
A prediction using the BPB is attempted when the branch instruction is fetched (IF stage or equivalent)
It is acted upon during the ID stage (once we know we have a branch)
Two-bit Prediction Scheme
A two-bit saturating counter is more reliable than a single bit recording whether the branch was recently taken: the prediction changes only after two consecutive mispredictions.
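A minimal sketch of such a predictor (the table size and PC indexing are assumptions): counter values 0 and 1 predict not-taken, 2 and 3 predict taken, and each outcome nudges the counter by one:

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries          # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries        # low-order PC bits index the table

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2    # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
for outcome in [True, True, False, True]:      # a loop branch that is mostly taken
    print(bp.predict(0x400100), outcome)
    bp.update(0x400100, outcome)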
Dynamic scheduling (hardware scheduling, in contrast to compiler scheduling)
Advantages:
The compiler does not need knowledge of the micro-architecture
Handles cases where dependences are unknown at compile time
Disadvantage:
Substantial increase in hardware complexity
How dynamic scheduling works?
Limitation of simple pipelining: in-order instruction issue and execution.
Instructions are issued in program order.
If an instruction is stalled in the pipeline, no later instruction can proceed.
If instruction j depends on a long-running instruction i currently executing in the pipeline, then all instructions after j must be stalled until i finishes and j can execute.
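For example (a standard illustration, not taken from these slides): a long-latency DIV.D F0,F2,F4 followed by ADD.D F10,F0,F8 and SUB.D F12,F8,F14. The SUB.D does not depend on the DIV.D, yet with in-order issue it must wait behind the stalled ADD.D; dynamic scheduling would let it execute early.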
How dynamic scheduling works?
[Figure: execution timeline of Instructions 1 through 6]
Advanced Pipelining
Amdahl's Law:

ET_new = ET_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Speedup_overall = ET_old / ET_new = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
Amdahl's Law - Illustration
Solution:
Fraction_enhanced = 0.4
Speedup_enhanced = 10
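Substituting these values into the speedup formula above:

Speedup_overall = 1 / ((1 − 0.4) + 0.4 / 10) = 1 / 0.64 ≈ 1.56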
Amdahl's Law for Parallel Processing