15IF11: Multicore Technology @ PSG Tech, Coimbatore

Session-1

Dr. John Jose
Assistant Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati, Assam
9th & 10th March 2019

Syllabus:
Introduction to instruction pipelines and superscalar
processing; memory hierarchy; on-chip interconnects;
tiled chip multicore processors; scheduling.
Instruction Execution Principles
CISC vs RISC Processors
Instruction Pipeline, Hazards
Static and Dynamic Scheduling
Superscalar Processors
Cache Memory, Main Memory
Network On Chip
Multicore Processors
TCMP Scheduling
Reference Books
 Computer Architecture: A Quantitative Approach (5th edition), John L. Hennessy, David A. Patterson, Morgan Kaufmann.
 Computer Architecture: Pipelined and Parallel Processor Design, Michael J. Flynn, Narosa Publishing House.
 Memory Systems: Cache, DRAM, and Disk, Bruce Jacob, Spencer W. Ng, David T. Wang, Morgan Kaufmann.
 Principles and Practices of Interconnection Networks, William James Dally, Brian Towles, Morgan Kaufmann.
Role of Computer Architects
Applications and handheld devices are part and parcel of our day-to-day life.
Processor Memory Interaction

Instruction Fetch → Instruction Decode → Operand Fetch → Execute → Result Store → Next Instruction

How does memory work?
Introduction to RISC MIPS
 MIPS: Microprocessor without Interlocked Pipelined Stages
 32 registers (32 bits each)
 Uniform-length instructions
 RISC load-store architecture
Pipelined RISC Data path
[Datapath figure: five stages, Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc, Memory Access, and Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. IF holds the PC, a +4 adder, and instruction memory; ID reads RS1/RS2 from the register file and sign-extends the immediate; EX holds the ALU and the Zero? test; MEM accesses data memory; WB multiplexes the result back to destination register RD.]
RISC Instruction Pipeline
 Each instruction takes at most 5 clock cycles
 Instruction fetch cycle (IF)
 Based on the PC, fetch the instruction from memory
 Increment the PC
[IF-stage figure: PC, +4 adder, and the instruction-memory address port]
RISC Instruction Pipeline
 Instruction decode/register fetch cycle (ID)
 Decode the instruction + read the registers
 Fixed-field decoding: [ADD R1,R2,R3] or [LW R1,8(R2)]
 Ex: A3.01.02.03 : 10100011 00000001 00000010 00000011
 Ex: 86.01.02.03 : 10000110 00000001 00000010 00000011
[ID-stage figure: register-file read ports RS1/RS2, sign extension of the immediate, destination register RD]
RISC Instruction Pipeline
 Execution/effective address cycle (EX)
 Memory reference: calculate the effective address
 [LW R1,8(R2)]: EFF ADDR = [R2] + 8
 Register-register ALU instruction: [ADD R1,R2,R3]
[EX-stage figure: ALU and Zero? test]
RISC Instruction Pipeline
 Memory access cycle (MEM)
 Load: read from memory into the register [LW R1,8(R2)]
 Store: write the data from the register to memory [SW R3,16(R4)]
[MEM-stage figure: data-memory access]
RISC Instruction Pipeline
 Write-back cycle (WB)
 Register-register ALU instruction or load instruction
 Write to the register file: [LW R1,8(R2)], [ADD R1,R2,R3]
[WB-stage figure: result written back to RD in the register file]
Pipelined RISC Data path
[Datapath figure repeated: the same five-stage pipeline with the IF/ID, ID/EX, EX/MEM, and MEM/WB registers.]
RISC Instruction Pipeline
 Instruction fetch cycle (IF)
 Based on the PC, fetch the instruction from memory
 Instruction decode/register fetch cycle (ID)
 Decode the instruction + read the registers
 Fixed-field decoding [ADD R1,R2,R3] [A3.01.02.03]
 Execution/effective address cycle (EX)
 Memory reference: calculate the effective address
 [LW R1,8(R2)]: EFF ADDR = [R2] + 8
 Register-register ALU instruction [ADD R1,R2,R3]
RISC Instruction Pipeline
 Memory access cycle (MEM)
 Load instruction: read from memory using the effective address [LW R1,8(R2)]
 Store instruction: write the data from the register to memory using the effective address [SW R1,8(R2)]
 Write-back cycle (WB)
 Register-register ALU instruction or load instruction
 Write to the register file: [LW R1,8(R2)], [ADD R1,R2,R3]

IF → ID → EX → MEM → WB
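The ideal five-stage timing can be sketched in a few lines of Python (my own illustration, not from the slides): instruction i, counting from 0, occupies stage s during cycle i + s + 1, so with no hazards one instruction completes every cycle once the pipeline is full.

```python
# Ideal 5-stage pipeline timing: instruction i (0-based) is in stage s
# during cycle i + s + 1, so it completes (WB) in cycle i + 5.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(n_instructions):
    """Return {instruction index: {stage: cycle}} for an ideal pipeline."""
    return {
        i: {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        for i in range(n_instructions)
    }

sched = pipeline_schedule(4)
print(sched[0]["WB"])   # first instruction finishes in cycle 5
print(sched[3]["WB"])   # fourth finishes in cycle 8: one per cycle after fill
```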
Visualizing Pipelining
[Pipeline diagram, clock cycles CC1-CC9: each instruction flows through IM (instruction memory), REG (register read), ALU, DM (data memory), and REG (write back), with successive instructions offset by one cycle.]
Pipelining Issues
 Ideal case: uniform sub-computations
 The computation to be performed can be evenly partitioned into uniform-latency sub-computations
 Reality: internal fragmentation
 Not all pipeline stages have uniform latencies
Pipelining Issues
 Ideal case: identical computations
 The same computation is performed repeatedly on a large number of input data sets
 Reality: external fragmentation
 Some pipeline stages may go unused
Pipelining Issues
 Ideal case: independent computations
 All instructions are mutually independent
 Reality: pipeline stalls; execution cannot proceed
 A later computation may require the result of an earlier computation
Limits to Pipelining
 Hazards: circumstances that would cause incorrect execution if the next instruction were fetched and executed
 Structural hazards: attempting to use the same hardware to do two different things at the same time
 Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
 Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches)
Structural Hazard
Eg: Uniport memory
[Pipeline diagram, cycles 1-7: a Load followed by Instr 1-4. In cycle 4 the Load's DMem access and Instr 3's Ifetch both need the single memory port: a structural hazard.]
Resolving Structural Hazard
 Eliminate using the same hardware for two different things at the same time
 Solution 1: Wait
 Must detect the hazard
 Must have a mechanism to stall
 Solution 2: Duplicate hardware
 Multiple such units let both instructions progress

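As a toy illustration of detecting such a conflict (not from the slides; it assumes the ideal timing above, where instruction i fetches in cycle i + 1 and a load/store reaches MEM in cycle i + 4, with a single memory port):

```python
# With one memory port, a load/store at index i does its MEM access in
# cycle i + 4, which collides with the IF of instruction i + 3 (fetched
# in cycle i + 4). Those fetches must stall.

def uniport_conflicts(ops):
    """ops: list of mnemonics; return indices of fetches that must stall."""
    return [i + 3 for i, op in enumerate(ops)
            if op in ("lw", "sw") and i + 3 < len(ops)]

print(uniport_conflicts(["lw", "add", "add", "add", "add"]))   # [3]
```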
Detecting & Resolving Structural Hazard
[Pipeline diagram, cycles 1-7: Load, Instr 1, and Instr 2 proceed normally; Instr 3's fetch is stalled for one cycle (a bubble travels down the pipeline) so that its Ifetch does not collide with the Load's DMem access.]
Eliminating Structural Hazards at Design
[Datapath figure: the unified memory is split into a separate instruction cache (IF stage) and data cache (MEM stage), removing the memory-port conflict by design.]
Data Hazard
[Pipeline diagram, IF ID/RF EX MEM WB:]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
[add writes r1 back only in cycle 5, but sub, and, and or try to read r1 in cycles 3, 4, and 5.]
Three Generic Data Hazards

 Read After Write (RAW)


InstrJ tries to read operand before InstrI writes it

I : add r1,r2,r3
J: sub r4,r1,r3

 Caused by a data dependence


 This hazard results from an actual need for communication.
Three Generic Data Hazards
 Write After Read (WAR)
 InstrJ writes an operand before InstrI reads it
 I: sub r4,r1,r3
 J: add r1,r2,r3
 K: mul r6,r1,r7
 Called an anti-dependence by compiler writers
 This results from reuse of the name r1
 Can't happen in the MIPS 5-stage pipeline because:
 All instructions take 5 stages, and
 Reads are always in stage 2, and
 Writes are always in stage 5
Three Generic Data Hazards
 Write After Write (WAW)
 InstrJ writes an operand before InstrI writes it
 I: sub r1,r4,r3
 J: add r1,r2,r3
 K: mul r6,r1,r7
 Called an output dependence
 This also results from reuse of the name r1
 Can't happen in the MIPS 5-stage pipeline because:
 All instructions take 5 stages, and
 Writes are always in stage 5
 WAR and WAW happen in out-of-order pipelines
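The three classes can be captured in a small checker (an illustrative sketch; representing an instruction as a (destination, sources) tuple is my assumption, not the slides' notation):

```python
def classify_hazards(instr_i, instr_j):
    """Hazards created when instr_j follows instr_i in program order.
    Each instruction is (dest_register, [source_registers])."""
    di, si = instr_i
    dj, sj = instr_j
    hazards = []
    if di is not None and di in sj:
        hazards.append("RAW")   # j reads what i writes
    if dj is not None and dj in si:
        hazards.append("WAR")   # j writes what i reads
    if di is not None and di == dj:
        hazards.append("WAW")   # both write the same register
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r3 -> RAW on r1
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))
```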
Operand Forwarding to Avoid Data Hazard
[Pipeline diagram: the same add/sub/and/or/xor sequence; the ALU result of add is forwarded from the EX/MEM and MEM/WB registers directly to the ALU inputs of the following instructions, so no stalls are needed.]
Hardware Change for Forwarding
[Datapath figure: multiplexers at both ALU inputs select among the ID/EX register values, the EX/MEM ALU result, and the MEM/WB result, implementing the forwarding paths.]
Data Hazard even with Operand Forwarding
[Pipeline diagram:]
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
[The load produces r1 only at the end of MEM, but sub needs it at the start of EX in that same cycle; forwarding alone cannot resolve this load-use hazard.]
Resolving the Load-ALU Hazard
[Pipeline diagram: a one-cycle bubble is inserted after lw r1,0(r2); sub r4,r1,r6 and the following instructions are delayed by one cycle, and the loaded value is then forwarded from MEM/WB to the ALU.]
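A minimal detector for this load-use case (my own illustration; the (op, dest, sources) representation is an assumption):

```python
def load_use_stalls(instructions):
    """Count one-cycle stalls for load results consumed by the very next
    instruction, assuming full forwarding otherwise.
    Each instruction: (op, dest, [sources])."""
    stalls = 0
    for prev, cur in zip(instructions, instructions[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

seq = [("lw", "r1", ["r2"]),
       ("sub", "r4", ["r1", "r6"]),   # uses r1 right after the load -> stall
       ("and", "r6", ["r1", "r7"])]
print(load_use_stalls(seq))   # 1
```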
Software Scheduling for Load Hazards
Assume a, b, c, d, e, and f are in memory.
  a = b + c;
  d = e - f;
Unscheduled code:
LW  Rb,b
LW  Rc,c
ADD Ra,Rb,Rc
SW  a,Ra
LW  Re,e
LW  Rf,f
SUB Rd,Re,Rf
SW  d,Rd
Scheduled code:
LW  Rb,b
LW  Rc,c
LW  Re,e
ADD Ra,Rb,Rc
LW  Rf,f
SW  a,Ra
SUB Rd,Re,Rf
SW  d,Rd
Control Hazard on Branches => Three-Stage Stall
[Pipeline diagram:]
10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[The branch outcome is not known until late in the pipeline, so the three instructions fetched after beq may have to be squashed if the branch is taken.]
Conventional MIPS Pipeline
Branching at the 4th stage
[Datapath figure: the branch target and condition (Zero?) are resolved only in the MEM stage, so a taken branch costs three lost fetch slots.]
Branch Optimized MIPS Pipeline
Branching at the 2nd stage
[Datapath figure: an extra adder and the Zero? test are moved into the ID stage, so the branch target and outcome are known one cycle after fetch, reducing the branch penalty to one slot.]
Dynamic Branch Prediction
 Use a Branch Prediction Buffer (BPB)
 Also called a Branch Prediction Table (BPT) or Branch History Table (BHT)
 Records previous outcomes of the branch instruction
 How to index into the table is an issue
 A prediction using the BPB is attempted when the branch instruction is fetched (IF stage or equivalent)
 It is acted upon during the ID stage (when we know we have a branch)
Two-bit Prediction Scheme
 More reliable than using a single bit to record whether the branch was recently taken or not
 A 2-bit predictor lets branches that strongly favor taken (or not taken) be mispredicted less often than in the one-bit case (a form of reinforcement learning)
Multi-cycle Operations
 Some operations require more than one clock cycle to complete:
 Floating-point multiply
 Floating-point divide
 Floating-point add
 Hardware is available on the processor for performing these operations
Issues in Longer Latency Pipelines
 The divide unit is not pipelined: a structural hazard
 Instructions have varying run times: more than one register write per cycle may be needed
 Resolve write-port conflicts in the ID stage
 WAW hazards become possible
 Out-of-order completion
 Stalls for RAW hazards are more frequent
Compiler Techniques for Exposing ILP
 Find and overlap sequences of unrelated instructions
 Compiler scheduling
 Separate a dependent instruction from its source instruction by the pipeline latency of the source instruction
Pipeline Stalls
for (i = 999; i >= 0; i = i - 1)
    x[i] = x[i] + s;

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall            ; (assume integer load latency is 1)
      BNE   R1,R2,Loop
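The per-iteration cost can be tallied mechanically (a sketch of my own; the stall counts are the ones marked in the listing above):

```python
# One iteration of the unscheduled loop: each instruction issues in one
# cycle, plus the stall cycles marked on the slide.
instructions = ["L.D", "ADD.D", "S.D", "DADDUI", "BNE"]
stalls = {"L.D": 1,      # result needed by ADD.D in the next cycle
          "ADD.D": 2,    # result needed by S.D
          "DADDUI": 1}   # result needed by BNE
cycles = len(instructions) + sum(stalls.values())
print(cycles)   # 9 cycles per iteration before scheduling
```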
Pipeline Scheduling
Unscheduled code:
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall
      BNE    R1,R2,Loop
Scheduled code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop
Loop Unrolling
Original loop:
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall
      BNE    R1,R2,Loop
Unrolled by a factor of 4, eliminating unnecessary instructions:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)      ; drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)     ; drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)   ; drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
Loop Unrolling/Pipeline Scheduling
 Unrolled loop:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
 Scheduled, unrolled loop:
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
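Putting the three versions side by side (a back-of-the-envelope tally of my own, under the stall assumptions used in the listings above):

```python
# Cycles per array element for the three loop versions discussed above.
unscheduled = 9 / 1          # 9 cycles per iteration, 1 element
scheduled = 7 / 1            # reordering hides two of the stalls
unrolled_scheduled = 14 / 4  # 14 instructions, no stalls, 4 elements
print(unscheduled, scheduled, unrolled_scheduled)   # 9.0 7.0 3.5
```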
Dynamic Scheduling
 Rearrange the execution order of instructions to reduce stalls while maintaining data flow
 Advantages:
 The compiler doesn't need knowledge of the microarchitecture
 Handles cases where dependences are unknown at compile time
 Disadvantage:
 Substantial increase in hardware complexity
How dynamic scheduling works?
 Limitation of simple pipelining:
 In-order instruction issue and execution
 Instructions are issued in program order
 If an instruction is stalled in the pipeline, no later instruction can proceed
 If instruction j depends on a long-running instruction i currently executing in the pipeline, then all instructions after j must be stalled until i finishes and j can execute
How dynamic scheduling works?
 Separate the issue process into two parts:
 checking for structural hazards
 waiting for the absence of a data hazard
 Use in-order instruction issue, but let an instruction begin execution as soon as its data operands are available
 Out-of-order execution implies out-of-order completion
 OOO execution introduces the possibility of WAR and WAW hazards
How dynamic scheduling works?
 To allow out-of-order execution, split the ID stage in two:
 Issue: decode instructions, check for structural hazards
 Read operands: wait until no data hazards remain, then read operands
 In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order
Simple in-order pipelining
 Pipelining: the goal was to complete one instruction per clock cycle
[Timing figure: Instructions 1-5, each starting one cycle after the previous one]
Superpipelining
 Superpipelining: increase the depth of the pipeline to increase the clock rate
[Timing figure: Instructions 1-5 flowing through shorter, more numerous stages]
Superscalar – multiple-issue
 Fetch (and execute) more than one instruction at a time
 Expand every stage to accommodate multiple instructions
[Timing figure: Instructions 1-6 issued two per cycle]
Advanced Pipelining
 Increase the depth of the pipeline to increase the clock rate: superpipelining
 Fetch (and execute) more than one instruction at a time, expanding every pipeline stage to accommodate multiple instructions: multiple issue (superscalar)
 Launching multiple instructions per stage allows the instruction execution rate, CPI, to be less than 1
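As a quick illustration of CPI falling below 1 (the function and the example numbers are mine, not from the slides):

```python
# CPI = cycles per instruction; its reciprocal is IPC.
def cpi(cycles, instructions):
    return cycles / instructions

# e.g. a 2-issue pipeline retiring 2 instructions per cycle on average:
print(cpi(cycles=100, instructions=200))   # 0.5
```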
Superscalar Processor
Multithreading vs Hyperthreading
Amdahl's Law
 Amdahl’s Law defines the speedup that can be gained by improving
some portion of a computer.
 The performance improvement to be gained from using some faster
mode of execution is limited by the fraction of the time the faster mode
can be used.

ET_new = ET_old × ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )

Speedup_overall = ET_old / ET_new
                = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
Amdahl's Law: Illustration
 Example: Suppose we want to enhance the floating-point operations of a processor by introducing a new advanced FPU. Let the new FPU be 10 times faster on floating-point computations than the original processor. Assuming a program has 40% floating-point operations, what is the overall speedup gained by incorporating the enhancement?
 Solution:
Fraction_enhanced = 0.4
Speedup_enhanced = 10
Speedup_overall = 1 / ((1 − 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56
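The same calculation written as a small Python helper (the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup from Amdahl's Law."""
    return 1 / ((1 - fraction_enhanced)
                + fraction_enhanced / speedup_enhanced)

# The FPU example above: 40% of the work sped up 10x
print(amdahl_speedup(0.4, 10))   # about 1.5625
```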
Amdahl's Law for Parallel Processing
[Figure: a program of five 100-unit phases. As more phases are parallelized, even with infinite processors (parallel time ≈ 0) the remaining serial phases dominate:
Work 500, Time 500: Sp = 1X
Work 500, Time 400: Sp = 1.25X
Work 500, Time 350: Sp = 1.4X
Work 500, Time 300: Sp = 1.7X]
How much speedup can you achieve?
[email protected]
http://www.iitg.ac.in/johnjose/