15IF11: Multicore Technology @ PSG Tech, Coimbatore

Session-1

Dr. John Jose
Assistant Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati, Assam
9th & 10th March 2019

Syllabus:
Introduction to instruction pipelines and superscalar
processing; memory hierarchy; on-chip interconnects;
tiled chip multicore processors; scheduling.
Instruction Execution Principles
CISC vs RISC Processors
Instruction Pipeline, Hazards
Static and Dynamic Scheduling
Superscalar Processors
Cache Memory, Main Memory
Network On Chip
Multicore Processors
TCMP Scheduling
Reference Books
 Computer Architecture: A Quantitative Approach (5th edition), John L. Hennessy, David A. Patterson, Morgan Kaufmann.
 Computer Architecture: Pipelined and Parallel Processor Design, Michael J. Flynn, Narosa Publishing House.
 Memory Systems: Cache, DRAM, and Disk, Bruce Jacob, Spencer W. Ng, David T. Wang, Morgan Kaufmann.
 Principles and Practices of Interconnection Networks, William James Dally, Brian Towles, Morgan Kaufmann.
Role of Computer Architects
Applications and handheld devices are part and parcel of our day-to-day life.
Processor Memory Interaction

Instruction Fetch → Instruction Decode → Operand Fetch → Execute → Result Store → Next Instruction

How does memory work?
Introduction to RISC MIPS
 MIPS: Microprocessor without Interlocked Pipelined Stages
 32 registers (32 bits each)
 Uniform-length instructions
 RISC load-store architecture
Pipelined RISC Data path
[Datapath figure: five stages, Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc, Memory Access, and Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. IF holds the PC, a +4 adder, and instruction memory; ID reads RS1/RS2 from the register file and sign-extends the immediate; EX holds the ALU and the Zero? test; MEM accesses data memory; WB multiplexes the result back to destination register RD.]
RISC Instruction Pipeline
 Each instruction takes at most 5 clock cycles
 Instruction fetch cycle (IF)
 Based on the PC, fetch the instruction from memory
 Increment the PC
[IF-stage figure: PC, +4 adder, and the instruction-memory address port]
RISC Instruction Pipeline
 Instruction decode/register fetch cycle (ID)
 Decode the instruction + read the registers
 Fixed-field decoding: [ADD R1,R2,R3] or [LW R1,8(R2)]
 Ex: A3.01.02.03 : 10100011 00000001 00000010 00000011
 Ex: 86.01.02.03 : 10000110 00000001 00000010 00000011
[ID-stage figure: register-file read ports RS1/RS2, sign extension of the immediate, destination register RD]
RISC Instruction Pipeline
 Execution/effective address cycle (EX)
 Memory reference: calculate the effective address
 [LW R1,8(R2)]: EFF ADDR = [R2] + 8
 Register-register ALU instruction: [ADD R1,R2,R3]
[EX-stage figure: ALU and Zero? test]
RISC Instruction Pipeline
 Memory access cycle (MEM)
 Load: read from memory into the register [LW R1,8(R2)]
 Store: write the data from the register to memory [SW R3,16(R4)]
[MEM-stage figure: data-memory access]
RISC Instruction Pipeline
 Write-back cycle (WB)
 Register-register ALU instruction or load instruction
 Write to the register file: [LW R1,8(R2)], [ADD R1,R2,R3]
[WB-stage figure: result written back to RD in the register file]
Pipelined RISC Data path
[Datapath figure repeated: the same five-stage pipeline with the IF/ID, ID/EX, EX/MEM, and MEM/WB registers.]
RISC Instruction Pipeline
 Instruction fetch cycle (IF)
 Based on the PC, fetch the instruction from memory
 Instruction decode/register fetch cycle (ID)
 Decode the instruction + read the registers
 Fixed-field decoding [ADD R1,R2,R3] [A3.01.02.03]
 Execution/effective address cycle (EX)
 Memory reference: calculate the effective address
 [LW R1,8(R2)]: EFF ADDR = [R2] + 8
 Register-register ALU instruction [ADD R1,R2,R3]
RISC Instruction Pipeline
 Memory access cycle (MEM)
 Load instruction: read from memory using the effective address [LW R1,8(R2)]
 Store instruction: write the data from the register to memory using the effective address [SW R1,8(R2)]
 Write-back cycle (WB)
 Register-register ALU instruction or load instruction
 Write to the register file: [LW R1,8(R2)], [ADD R1,R2,R3]

IF → ID → EX → MEM → WB
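The ideal five-stage timing can be sketched in a few lines of Python (my own illustration, not from the slides): instruction i, counting from 0, occupies stage s during cycle i + s + 1, so with no hazards one instruction completes every cycle once the pipeline is full.

```python
# Ideal 5-stage pipeline timing: instruction i (0-based) is in stage s
# during cycle i + s + 1, so it completes (WB) in cycle i + 5.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(n_instructions):
    """Return {instruction index: {stage: cycle}} for an ideal pipeline."""
    return {
        i: {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        for i in range(n_instructions)
    }

sched = pipeline_schedule(4)
print(sched[0]["WB"])   # first instruction finishes in cycle 5
print(sched[3]["WB"])   # fourth finishes in cycle 8: one per cycle after fill
```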
Visualizing Pipelining
[Pipeline diagram, clock cycles CC1-CC9: each instruction flows through IM (instruction memory), REG (register read), ALU, DM (data memory), and REG (write back), with successive instructions offset by one cycle.]
Pipelining Issues
 Ideal case: uniform sub-computations
 The computation to be performed can be evenly partitioned into uniform-latency sub-computations
 Reality: internal fragmentation
 Not all pipeline stages have uniform latencies
Pipelining Issues
 Ideal case: identical computations
 The same computation is performed repeatedly on a large number of input data sets
 Reality: external fragmentation
 Some pipeline stages may go unused
Pipelining Issues
 Ideal case: independent computations
 All instructions are mutually independent
 Reality: pipeline stalls; execution cannot proceed
 A later computation may require the result of an earlier computation
Limits to Pipelining
 Hazards: circumstances that would cause incorrect execution if the next instruction were fetched and executed
 Structural hazards: attempting to use the same hardware to do two different things at the same time
 Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
 Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches)
Structural Hazard
Eg: Uniport memory
[Pipeline diagram, cycles 1-7: a Load followed by Instr 1-4. In cycle 4 the Load's DMem access and Instr 3's Ifetch both need the single memory port: a structural hazard.]
Resolving Structural Hazard
 Eliminate using the same hardware for two different things at the same time
 Solution 1: Wait
 Must detect the hazard
 Must have a mechanism to stall
 Solution 2: Duplicate hardware
 Multiple such units let both instructions progress

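As a toy illustration of detecting such a conflict (not from the slides; it assumes the ideal timing above, where instruction i fetches in cycle i + 1 and a load/store reaches MEM in cycle i + 4, with a single memory port):

```python
# With one memory port, a load/store at index i does its MEM access in
# cycle i + 4, which collides with the IF of instruction i + 3 (fetched
# in cycle i + 4). Those fetches must stall.

def uniport_conflicts(ops):
    """ops: list of mnemonics; return indices of fetches that must stall."""
    return [i + 3 for i, op in enumerate(ops)
            if op in ("lw", "sw") and i + 3 < len(ops)]

print(uniport_conflicts(["lw", "add", "add", "add", "add"]))   # [3]
```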
Detecting & Resolving Structural Hazard
[Pipeline diagram, cycles 1-7: Load, Instr 1, and Instr 2 proceed normally; Instr 3's fetch is stalled for one cycle (a bubble travels down the pipeline) so that its Ifetch does not collide with the Load's DMem access.]
Eliminating Structural Hazards at Design
[Datapath figure: the unified memory is split into a separate instruction cache (IF stage) and data cache (MEM stage), removing the memory-port conflict by design.]
Data Hazard
[Pipeline diagram, IF ID/RF EX MEM WB:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
[add writes r1 back only in cycle 5, but sub, and, and or try to read r1 in cycles 3, 4, and 5.]
Three Generic Data Hazards

 Read After Write (RAW)


InstrJ tries to read operand before InstrI writes it

I : add r1,r2,r3
J: sub r4,r1,r3

 Caused by a data dependence


 This hazard results from an actual need for communication.
Three Generic Data Hazards
 Write After Read (WAR)
 InstrJ writes an operand before InstrI reads it
 I: sub r4,r1,r3
 J: add r1,r2,r3
 K: mul r6,r1,r7
 Called an anti-dependence by compiler writers
 This results from reuse of the name r1
 Can't happen in the MIPS 5-stage pipeline because:
 All instructions take 5 stages, and
 Reads are always in stage 2, and
 Writes are always in stage 5
Three Generic Data Hazards
 Write After Write (WAW)
 InstrJ writes an operand before InstrI writes it
 I: sub r1,r4,r3
 J: add r1,r2,r3
 K: mul r6,r1,r7
 Called an output dependence
 This also results from reuse of the name r1
 Can't happen in the MIPS 5-stage pipeline because:
 All instructions take 5 stages, and
 Writes are always in stage 5
 WAR and WAW happen in out-of-order pipelines
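The three classes can be captured in a small checker (an illustrative sketch; representing an instruction as a (destination, sources) tuple is my assumption, not the slides' notation):

```python
def classify_hazards(instr_i, instr_j):
    """Hazards created when instr_j follows instr_i in program order.
    Each instruction is (dest_register, [source_registers])."""
    di, si = instr_i
    dj, sj = instr_j
    hazards = []
    if di is not None and di in sj:
        hazards.append("RAW")   # j reads what i writes
    if dj is not None and dj in si:
        hazards.append("WAR")   # j writes what i reads
    if di is not None and di == dj:
        hazards.append("WAW")   # both write the same register
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r3 -> RAW on r1
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))
```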
Operand Forwarding to Avoid Data Hazard
[Pipeline diagram: the same add/sub/and/or/xor sequence; the ALU result of add is forwarded from the EX/MEM and MEM/WB registers directly to the ALU inputs of the following instructions, so no stalls are needed.]
Hardware Change for Forwarding
[Datapath figure: multiplexers at both ALU inputs select among the ID/EX register values, the EX/MEM ALU result, and the MEM/WB result, implementing the forwarding paths.]
Data Hazard even with Operand Forwarding
[Pipeline diagram:]
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
[The load produces r1 only at the end of MEM, but sub needs it at the start of EX in that same cycle; forwarding alone cannot resolve this load-use hazard.]
Resolving the Load-ALU Hazard
[Pipeline diagram: a one-cycle bubble is inserted after lw r1,0(r2); sub r4,r1,r6 and the following instructions are delayed by one cycle, and the loaded value is then forwarded from MEM/WB to the ALU.]
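A minimal detector for this load-use case (my own illustration; the (op, dest, sources) representation is an assumption):

```python
def load_use_stalls(instructions):
    """Count one-cycle stalls for load results consumed by the very next
    instruction, assuming full forwarding otherwise.
    Each instruction: (op, dest, [sources])."""
    stalls = 0
    for prev, cur in zip(instructions, instructions[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

seq = [("lw", "r1", ["r2"]),
       ("sub", "r4", ["r1", "r6"]),   # uses r1 right after the load -> stall
       ("and", "r6", ["r1", "r7"])]
print(load_use_stalls(seq))   # 1
```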
Software Scheduling for Load Hazards
Assume a, b, c, d, e, and f are in memory.
  a = b + c;
  d = e - f;
Unscheduled code:
LW  Rb,b
LW  Rc,c
ADD Ra,Rb,Rc
SW  a,Ra
LW  Re,e
LW  Rf,f
SUB Rd,Re,Rf
SW  d,Rd
Scheduled code:
LW  Rb,b
LW  Rc,c
LW  Re,e
ADD Ra,Rb,Rc
LW  Rf,f
SW  a,Ra
SUB Rd,Re,Rf
SW  d,Rd
Control Hazard on Branches => Three-Stage Stall
[Pipeline diagram:]
10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[The branch outcome is not known until late in the pipeline, so the three instructions fetched after beq may have to be squashed if the branch is taken.]
Conventional MIPS Pipeline
Branching at the 4th stage
[Datapath figure: the branch target and condition (Zero?) are resolved only in the MEM stage, so a taken branch costs three lost fetch slots.]
Branch Optimized MIPS Pipeline
Branching at the 2nd stage
[Datapath figure: an extra adder and the Zero? test are moved into the ID stage, so the branch target and outcome are known one cycle after fetch, reducing the branch penalty to one slot.]
Dynamic Branch Prediction
 Use a Branch Prediction Buffer (BPB)
 Also called a Branch Prediction Table (BPT) or Branch History Table (BHT)
 Records previous outcomes of the branch instruction
 How to index into the table is an issue
 A prediction using the BPB is attempted when the branch instruction is fetched (IF stage or equivalent)
 It is acted upon during the ID stage (when we know we have a branch)
Two-bit Prediction Scheme
 More reliable than using a single bit to record whether the branch was recently taken or not
 A 2-bit predictor lets branches that strongly favor taken (or not taken) be mispredicted less often than in the one-bit case (a form of reinforcement learning)
Multi-cycle Operations
 Some operations require more than one clock cycle to complete:
 Floating-point multiply
 Floating-point divide
 Floating-point add
 Hardware is available on the processor for performing these operations
Issues in Longer Latency Pipelines
 The divide unit is not pipelined: a structural hazard
 Instructions have varying run times: more than one register write per cycle may be needed
 Resolve write-port conflicts in the ID stage
 WAW hazards become possible
 Out-of-order completion
 Stalls for RAW hazards are more frequent
Compiler Techniques for Exposing ILP
 Find and overlap sequences of unrelated instructions
 Compiler scheduling
 Separate a dependent instruction from its source instruction by the pipeline latency of the source instruction
Pipeline Stalls
for (i = 999; i >= 0; i = i - 1)
    x[i] = x[i] + s;

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall            ; (assume integer load latency is 1)
      BNE   R1,R2,Loop
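The per-iteration cost can be tallied mechanically (a sketch of my own; the stall counts are the ones marked in the listing above):

```python
# One iteration of the unscheduled loop: each instruction issues in one
# cycle, plus the stall cycles marked on the slide.
instructions = ["L.D", "ADD.D", "S.D", "DADDUI", "BNE"]
stalls = {"L.D": 1,      # result needed by ADD.D in the next cycle
          "ADD.D": 2,    # result needed by S.D
          "DADDUI": 1}   # result needed by BNE
cycles = len(instructions) + sum(stalls.values())
print(cycles)   # 9 cycles per iteration before scheduling
```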
Pipeline Scheduling
Unscheduled code:
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall
      BNE    R1,R2,Loop
Scheduled code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop
Loop Unrolling
Original loop:
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall
      BNE    R1,R2,Loop
Unrolled by a factor of 4, eliminating unnecessary instructions:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)      ; drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)     ; drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)   ; drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
Loop Unrolling/Pipeline Scheduling
 Unrolled loop:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
 Scheduled, unrolled loop:
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
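Putting the three versions side by side (a back-of-the-envelope tally of my own, under the stall assumptions used in the listings above):

```python
# Cycles per array element for the three loop versions discussed above.
unscheduled = 9 / 1          # 9 cycles per iteration, 1 element
scheduled = 7 / 1            # reordering hides two of the stalls
unrolled_scheduled = 14 / 4  # 14 instructions, no stalls, 4 elements
print(unscheduled, scheduled, unrolled_scheduled)   # 9.0 7.0 3.5
```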
Dynamic Scheduling
 Rearrange the execution order of instructions to reduce stalls while maintaining data flow
 Advantages:
 The compiler doesn't need knowledge of the microarchitecture
 Handles cases where dependences are unknown at compile time
 Disadvantage:
 Substantial increase in hardware complexity
How dynamic scheduling works?
 Limitation of simple pipelining:
 In-order instruction issue and execution
 Instructions are issued in program order
 If an instruction is stalled in the pipeline, no later instruction can proceed
 If instruction j depends on a long-running instruction i currently executing in the pipeline, then all instructions after j must be stalled until i finishes and j can execute
How dynamic scheduling works?
 Separate the issue process into two parts:
 checking for structural hazards
 waiting for the absence of a data hazard
 Use in-order instruction issue, but let an instruction begin execution as soon as its data operands are available
 Out-of-order execution implies out-of-order completion
 OOO execution introduces the possibility of WAR and WAW hazards
How dynamic scheduling works?
 To allow out-of-order execution, split the ID stage in two:
 Issue: decode instructions, check for structural hazards
 Read operands: wait until no data hazards remain, then read operands
 In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order
Simple in-order pipelining
 Pipelining: the goal was to complete one instruction per clock cycle
[Timing figure: Instructions 1-5, each starting one cycle after the previous one]
Superpipelining
 Superpipelining: increase the depth of the pipeline to increase the clock rate
[Timing figure: Instructions 1-5 flowing through shorter, more numerous stages]
Superscalar – multiple-issue
 Fetch (and execute) more than one instruction at a time
 Expand every stage to accommodate multiple instructions
[Timing figure: Instructions 1-6 issued two per cycle]
Advanced Pipelining
 Increase the depth of the pipeline to increase the clock rate: superpipelining
 Fetch (and execute) more than one instruction at a time, expanding every pipeline stage to accommodate multiple instructions: multiple issue (superscalar)
 Launching multiple instructions per stage allows the instruction execution rate, CPI, to be less than 1
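As a quick illustration of CPI falling below 1 (the function and the example numbers are mine, not from the slides):

```python
# CPI = cycles per instruction; its reciprocal is IPC.
def cpi(cycles, instructions):
    return cycles / instructions

# e.g. a 2-issue pipeline retiring 2 instructions per cycle on average:
print(cpi(cycles=100, instructions=200))   # 0.5
```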
Superscalar Processor
Multithreading vs Hyperthreading
Amdahl's Law
 Amdahl’s Law defines the speedup that can be gained by improving
some portion of a computer.
 The performance improvement to be gained from using some faster
mode of execution is limited by the fraction of the time the faster mode
can be used.

ET_new = ET_old × ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )

Speedup_overall = ET_old / ET_new
                = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
Amdahl's Law: Illustration
 Example: Suppose we want to enhance the floating-point operations of a processor by introducing a new advanced FPU. Let the new FPU be 10 times faster on floating-point computations than the original processor. Assuming a program has 40% floating-point operations, what is the overall speedup gained by incorporating the enhancement?
 Solution:
Fraction_enhanced = 0.4
Speedup_enhanced = 10
Speedup_overall = 1 / ((1 − 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56
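The same calculation written as a small Python helper (the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup from Amdahl's Law."""
    return 1 / ((1 - fraction_enhanced)
                + fraction_enhanced / speedup_enhanced)

# The FPU example above: 40% of the work sped up 10x
print(amdahl_speedup(0.4, 10))   # about 1.5625
```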
Amdahl's Law for Parallel Processing
[Figure: a program of five 100-unit phases. As more phases are parallelized, even with infinite processors (parallel time ≈ 0) the remaining serial phases dominate:
Work 500, Time 500: Sp = 1X
Work 500, Time 400: Sp = 1.25X
Work 500, Time 350: Sp = 1.4X
Work 500, Time 300: Sp = 1.7X]
How much speedup can you achieve?
[email protected]
http://www.iitg.ac.in/johnjose/