CS 211: Computer Architecture: Instructor: Prof. Bhagi Narahari

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 82

CS 211: Computer Architecture

Instructor: Prof. Bhagi Narahari


Dept. of Computer Science
Course URL:
www.seas.gwu.edu/~narahari/cs211/

CS211 1

How to improve performance?


Recall performance is function of
CPI: cycles per instruction
Clock cycle
Instruction count

Reducing any of the 3 factors will lead to


improved performance

CS211 2

How to improve performance?


First step is to apply concept of pipelining to
the instruction execution process
Overlap computations

What does this do?


Decrease clock cycle
Decrease effective CPU time compared to original
clock cycle

Appendix A of Textbook
Also parts of Chapter 2

CS211 3

Pipeline Approach to Improve System


Performance
Analogous to fluid flow in pipelines and
assembly line in factories
Divide process into stages and send tasks
into a pipeline
Overlap computations of different tasks by
operating on them concurrently in different stages

CS211 4

Instruction Pipeline
Instruction execution process lends itself
naturally to pipelining
overlap the subtasks of instruction fetch, decode
and execute

CS211 5

Linear Pipeline
Processor

Linear pipeline processes a sequence of


subtasks with linear precedence
At a higher level - Sequence of processors
Data flowing in streams from stage S1 to the
final stage Sk
Control of data flow : synchronous or
asynchronous

S1
S2
Sk

CS211 6

Synchronous Pipeline
All transfers simultaneous
One task or operation enters the pipeline per cycle
Processors reservation table : diagonal

CS211 7

Time Space Utilization of Pipeline


Full pipeline after 4 cycles

S3
T1

S2
S1

T1

T2

T2

T3

T1

T2

T3

T4

Time (in pipeline cycles)


CS211 8

Asynchronous Pipeline
Transfers performed when individual processors are ready
Handshaking protocol between processors
Mainly used in multiprocessor systems with message-passing

CS211 9

Pipeline Clock and


Timing
Si

Si+1

Clock cycle of the pipeline :


Latch delay : d

= max { m } + d

Pipeline frequency : f
f=1/

CS211 10

Speedup and Efficiency

k-stage pipeline processes n tasks in k + (n-1) clock


cycles:
k cycles for the first task and n-1 cycles
for the remaining n-1 tasks
Total time to process n tasks
Tk = [ k + (n-1)]
For the non-pipelined processor
T1 = n k
Speedup factorS =
k

T1
Tk

nk
= [ k + (n-1)]

nk
k + (n-1)
CS211 11

Efficiency and Throughput

10

Efficiency of the k-stages pipeline :


Ek =

Sk
k

n
k + (n-1)

Pipeline throughput (the number of tasks per unit time) :


note equivalence to IPC
Hk =

n
=
[ k + (n-1)]

nf
k + (n-1)

CS211 12

Pipeline Performance: Example

Task has 4 subtasks with time: t1=60, t2=50, t3=90, and t4=80 ns
(nanoseconds)
latch delay = 10
Pipeline cycle time = 90+10 = 100 ns
For non-pipelined execution
time = 60+50+90+80 = 280 ns

Speedup for above case is: 280/100 = 2.8 !!


Pipeline Time for 1000 tasks = 1000 + 4-1= 1003*100 ns
Sequential time = 1000*280ns
Throughput= 1000/1003
What is the problem here ?
How to improve performance ?

CS211 13

Non-linear pipelines and pipeline


control algorithms
Can have non-linear path in pipeline
How to schedule instructions so they do no conflict
for resources

How does one control the pipeline at the


microarchitecture level
How to build a scheduler in hardware ?
How much time does scheduler have to make
decision ?

CS211 14

Non-linear Dynamic Pipelines

Multiple processors (k-stages) as linear pipeline


Variable functions of individual processors
Functions may be dynamically assigned
Feedforward and feedback connections

CS211 15

Reservation Tables
Reservation table : displays time-space flow of data
through the pipeline analogous to opcode of pipeline
Not diagonal, as in linear pipelines

Multiple reservation tables for different functions


Functions may be dynamically assigned

Feedforward and feedback connections


The number of columns in the reservation table :
evaluation time of a given function

CS211 16

14

Reservation Tables (Examples)

CS211 17

Latency Analysis
Latency : the number of clock cycles
between two initiations of the pipeline
Collision : an attempt by two initiations to use
the same pipeline stage at the same time
Some latencies cause collision, some not

CS211 18

16

Collisions (Example)

x1

x2
x1

x3

x1

x4

x1x2

x2x3

x3x4

x4

x1x2

x1

x2x3
x1x2

x1x2x4

10

x2x3x4

Latency = 2
CS211 19

17

Latency Cycle

x1

9 10 11 12 13 14 15 16 17 18

x1 x2 x1
x1

x1
x1

x2 x3 x2

x2
x1

x1

x3

x2
x2

x3

x2
Cycle

x2

x3
x3

x3

Cycle

Latency cycle : the sequence of initiations which has repetitive


subsequence and without collisions
Latency sequence length : the number of time intervals
within the cycle
Average latency : the sum of all latencies divided by
the number of latencies along the cycle
CS211 20

18

Collision Free Scheduling

Goal : to find the shortest average latency


Lengths : for reservation table with n columns, maximum forbidden
latency is m <= n 1, and permissible latency p is
1 <= p <= m 1
Ideal case : p = 1 (static pipeline)
Collision vector : C = (CmCm-1 . . .C2C1)
[ Ci = 1 if latency i causes collision ]
[ Ci = 0 for permissible latencies ]

CS211 21

19

Collision Vector
Reservation Table

x
x

x
x

x1

x1
x1

x1
x1

Value X1

x2

x2
x2

x1

x2
x2

x2

Value X2

C = (? ? . . . ? ?)
CS211 22

Back to our focus: Computer


Pipelines
Execute billions of instructions, so
throughput is what matters
MIPS desirable features:
all instructions same length,
registers located in same place in instruction
format,
memory operands only in loads or stores

CS211 23

Designing a Pipelined
Processor

Go back and examine your datapath and control


diagram
associated resources with states
ensure that flows do not conflict, or figure out how
to resolve
assert control in appropriate stage

CS211 24

5 Steps of MIPS Datapath


What do we need to do to pipeline the process ?
Instruction
Fetch

Instr. Decode
Reg. Fetch

Execute
Addr. Calc

Next SEQ PC

Adder

Zero?

RS1

L
M
D

MUX

Data
Memory

ALU

Imm

MUX MUX

RD

Reg File

Inst

Memory

Address

RS2

Write
Back

MUX

Next PC

Memory
Access

Sign
Extend

WB Data

CS211 25

5 Steps of MIPS/DLX Datapath


Execute
Addr. Calc

Instr. Decode
Reg. Fetch
Next SEQ PC

Next SEQ PC

Adder

Zero?

RS1

MUX

MEM/WB

Data
Memory

EX/MEM

ALU

MUX MUX

ID/EX

Imm

Reg File

IF/ID

Memory

Address

RS2

Write
Back

MUX

Next PC

Memory
Access

WB Data

Instruction
Fetch

Sign
Extend

RD

RD

RD

Data stationary control


local decode for each instruction phase / pipeline stage

CS211 26

Graphically Representing
Pipelines

Can help with answering questions like:


how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
use this representation to help understand datapaths
CS211 27

Visualizing Pipelining
Time (clock cycles)

Ifetch

DMem

Reg

Ifetch

Reg

Ifetch

Reg

DMem

Reg

Reg

DMem

ALU

Reg

ALU

O
r
d
e
r

Ifetch

ALU

I
n
s
t
r.

ALU

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Reg

DMem

Reg

CS211 28

Conventional Pipelined Execution


Representation
Time
IFetch Dcd

Exec

IFetch Dcd

Mem

WB

Exec

Mem

WB

Exec

Mem

WB

Exec

Mem

WB

Exec

Mem

WB

Exec

Mem

IFetch Dcd

IFetch Dcd

IFetch Dcd
Program Flow

IFetch Dcd

WB

CS211 29

Clk

Single Cycle,
Multiple
Cycle,
vs.
Cycle 1
Cycle 2
Pipeline

Single Cycle Implementation:


Load

Store

Waste

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Clk
Multiple Cycle Implementation:
Load
Ifetch

Reg

Exec

Mem

Wr

Exec

Mem

Wr

Reg

Exec

Mem

Store
Ifetch

Reg

Exec

Mem

R-type
Ifetch

Pipeline Implementation:
Load Ifetch

Reg

Store Ifetch

R-type Ifetch

Reg

Exec

Wr
Mem

Wr

CS211 30

The Five Stages of


Load Cycle 1 Cycle 2 Cycle 3
Load Ifetch

Reg/Dec

Exec

Cycle 4 Cycle 5

Mem

Wr

Ifetch: Instruction Fetch


Fetch the instruction from the Instruction Memory

Reg/Dec: Registers Fetch and Instruction Decode


Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file

CS211 31

The Four Stages of Rtype Cycle 1 Cycle 2 Cycle 3 Cycle 4


R-type Ifetch

Reg/Dec

Exec

Wr

Ifetch: Instruction Fetch


Fetch the instruction from the Instruction Memory

Reg/Dec: Registers Fetch and Instruction Decode


Exec:
ALU operates on the two register operands
Update PC

Wr: Write the ALU output back to the register file

CS211 32

Clock

Pipelining the R-type and Load


Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Instruction

R-type Ifetch
R-type

Reg/Dec

Exec

Ifetch

Reg/Dec

Exec

Ifetch

Reg/Dec

Load

Ops! We have a problem!

Wr

R-type Ifetch

Cycle 8 Cycle 9

Wr
Exec

Mem

Wr

Reg/Dec

Exec

Wr

R-type Ifetch

Reg/Dec

Exec

Wr

We have pipeline conflict or structural hazard:


Two instructions try to write to the register file at the same
time!
Only one write port
CS211 33

Important
Observation
Each
functional unit can only be used once per

instruction
Each functional unit must be used at the same stage
for all instructions:
Load uses Register Files Write Port during its 5th stage
Load

Ifetch

Reg/Dec

3
Exec

Mem

Wr

R-type uses Register Files Write Port during its 4th stage
1
R-type Ifetch

2
Reg/Dec

3
Exec

4
Wr

2 ways to solve this pipeline hazard.

CS211 34

Solution 1: Insert Bubble into the


Pipeline
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8

Cycle 9

Clock
Ifetch
Load

Reg/Dec

Exec

Ifetch

Reg/Dec

R-type Ifetch

Wr
Exec

Mem

Reg/Dec

Exec

Wr
Wr

R-type Ifetch Reg/Dec Pipeline Exec


R-type Ifetch
Bubble Reg/Dec
Ifetch

Wr
Exec
Reg/Dec

Wr
Exec

Insert a bubble into the pipeline to prevent 2 writes


at the same cycle
The control logic can be complex.
Lose instruction fetch and issue opportunity.

No instruction is started in Cycle 6!


CS211 35

Solution 2: Delay R-types Write by One


Cycle
Delay
R-types register write by one cycle:
Now R-type instructions also use Reg Files write port at Stage 5
Mem stage is a NOOP stage: nothing is being done.
1

R-type Ifetch

Cycle 1 Cycle 2

Reg/Dec

3
Exec

4
Mem

5
Wr

Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9

Clock
R-type Ifetch
R-type

Reg/Dec

Exec

Mem

Wr

Ifetch

Reg/Dec

Exec

Mem

Wr

Ifetch

Reg/Dec

Exec

Mem

Wr

Reg/Dec

Exec

Mem

Wr

Reg/Dec

Exec

Mem

Load

R-type Ifetch

R-type Ifetch

Wr
CS211 36

Why
Pipeline?

Suppose we execute 100 instructions


Single Cycle Machine
45 ns/cycle x 1 CPI x 100 inst = 4500 ns

Multicycle Machine
10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns

Ideal pipelined machine


10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns

CS211 37

Why Pipeline? Because the resources are


there!
Time (clock cycles)

Inst 3

Im

Dm

Reg

Dm

Im

Reg

Im

Reg

Reg

Reg

Dm

ALU

Inst 4

Reg

Reg

ALU

Inst 2

Im

Dm

ALU

Inst 1

Reg

ALU

O
r
d
e
r

Inst 0

Im

ALU

I
n
s
t
r.

Reg

Dm

Reg

CS211 38

Problems with Pipeline processors?


Limits to pipelining: Hazards prevent next instruction from
executing during its designated clock cycle and introduce
stall cycles which increase CPI
Structural hazards: HW cannot support this combination of
instructions - two dogs fighting for the same bone
Data hazards: Instruction depends on result of prior
instruction still in the pipeline
Data dependencies
Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps).
Control dependencies

Can always resolve hazards by stalling


More stall cycles = more CPU time = less performance
Increase performance = decrease stall cycles

CS211 39

Back to our old friend: CPU time


equation
Recall equation for CPU time

So what are we doing by pipelining the instruction


execution process ?
Clock ?
Instruction Count ?
CPI ?
How is CPI effected by the various hazards ?
CS211 40

Speed Up Equation for Pipelining


CPIpipelined Ideal CPI Average Stall cycles per Inst

Cycle Timeunpipelined
Ideal CPI Pipeline depth
Speedup

Ideal CPI Pipeline stall CPI


Cycle Timepipelined

For simple RISC pipeline, CPI = 1:


Cycle Timeunpipelined
Pipeline depth
Speedup

1 Pipeline stall CPI


Cycle Timepipelined
CS211 41

One Memory Port/Structural Hazards


Time (clock cycles)

Instr 4

Ifetch

Reg

Ifetch

DMem

ALU

Reg

Reg

Reg

Ifetch

Reg

DMem

Reg

DMem

Reg

ALU

Instr 3

Ifetch

DMem

ALU

O
r
d
e
r

Instr 2

Reg

ALU

I Load Ifetch
n
s
t Instr 1
r.

ALU

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Reg

DMem

Reg

CS211 42

One Memory Port/Structural Hazards


Time (clock cycles)

Stall
Instr 3

Reg

Ifetch

Reg

Bubble

Reg

DMem

ALU

Ifetch

DMem

Reg

DMem

Bubble Bubble

Ifetch

Reg

Reg

Bubble
ALU

O
r
d
e
r

Instr 2

Reg

ALU

I Load Ifetch
n
s
t Instr 1
r.

ALU

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Bubble

DMem

Reg

CS211 43

Example: Dual-port vs. Single-port


Machine A: Dual ported memory (Harvard
Architecture)
Machine B: Single ported memory, but its
pipelined implementation has a 1.05 times faster
clock rate
Ideal CPI = 1 for both
Note - Loads will cause stalls of 1 cycle
Recall our friend:
CPU = IC*CPI*Clk
CPI= ideal CPI + stalls
CS211 44

Example
Machine A: Dual ported memory (Harvard
Architecture)
Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed
SpeedUpA = Pipe. Depth/(1 + 0) x (clock unpipe/clockpipe)
= Pipeline Depth
SpeedUpB = Pipe. Depth/(1 + 0.4 x 1) x
(clockunpipe/(clockunpipe /
1.05)
= (Pipe. Depth/1.4) x 1.05
= 0.75 x Pipe. Depth
SpeedUpA / SpeedUpB = Pipe. Depth/(0.75 x Pipe. Depth) = 1.33

Machine A is 1.33 times faster


CS211 45

Data Dependencies
True dependencies and False dependencies
false implies we can remove the dependency
true implies we are stuck with it!

Three types of data dependencies defined in


terms of how succeeding instruction depends
on preceding instruction
RAW: Read after Write or Flow dependency
WAR: Write after Read or anti-dependency
WAW: Write after Write

CS211 46

Three Generic Data Hazards


Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a Dependence (in compiler
nomenclature). This hazard results from an actual
need for communication.

CS211 47

RAW Dependency
Example program (a) with two instructions
i1: load r1, a;
i2: add r2, r1,r1;

Program (b) with two instructions


i1: mul r1, r4, r5;
i2: add r2, r1, r1;

Both cases we cannot read in i2 until i1 has


completed writing the result
In (a) this is due to load-use dependency
In (b) this is due to define-use dependency
CS211 48

Three Generic Data Hazards


Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an anti-dependence by compiler writers.
This results from reuse of the name r1.
Cant happen in MIPS 5 stage pipeline because:

All instructions take 5 stages, and


Reads are always in stage 2, and
Writes are always in stage 5
CS211 49

Three Generic Data Hazards


Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
Called an output dependence by compiler writers
This also results from the reuse of name r1.
Cant happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and
Writes are always in stage 5
Will see WAR and WAW in later more complicated
pipes
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

CS211 50

WAR and WAW Dependency

Example program (a):


i1: mul r1, r2, r3;
i2: add r2, r4, r5;

Example program (b):


i1: mul r1, r2, r3;
i2: add r1, r4, r5;

both cases we have dependence between i1 and i2


in (a) due to r2 must be read before it is written into
in (b) due to r1 must be written by i2 after it has been written
into by i1

CS211 51

What to do with WAR and WAW ?


Problem:
i1: mul r1, r2, r3;
i2: add r2, r4, r5;

Is this really a dependence/hazard ?

CS211 52

What to do with WAR and WAW


Solution: Rename Registers
i1: mul r1, r2, r3;
i2: add r6, r4, r5;

Register renaming can solve many of these


false dependencies
note the role that the compiler plays in this
specifically, the register allocation process--i.e., the
process that assigns registers to variables

CS211 53

Hazard Detection in H/W


Suppose instruction i is about to be issued and a
predecessor instruction j is in the instruction pipeline
How to detect and store potential hazard information
Note that hazards in machine code are based on register
usage
Keep track of results in registers and their usage
Constructing a register data flow graph

For each instruction i construct set of Read registers


and Write registers
Rregs(i) is set of registers that instruction i reads from
Wregs(i) is set of registers that instruction i writes to
Use these to define the 3 types of data hazards
CS211 54

Hazard Detection in
Hardware
A RAW hazard exists on register if Rregs( i ) Wregs( j )
Keep a record of pending writes (for inst's in the pipe) and
compare with operand regs of current instruction.
When instruction issues, reserve its result register.
When on operation completes, remove its write reservation.

A WAW hazard exists on register if Wregs( i ) Wregs( j )


A WAR hazard exists on register if Wregs( i ) Rregs( j )

CS211 55

Internal Forwarding: Getting rid of


some hazards
In some cases the data needed by the next
instruction at the ALU stage has been
computed by the ALU (or some stage defining
it) but has not been written back to the
registers
Can we forward this result by bypassing
stages ?

CS211 56

Data Hazard on R1
Time (clock cycles)

and r6,r1,r7
or

r8,r1,r9

xor r10,r1,r11

Ifetch

DMem

Reg

DMem

Ifetch

Reg

DMem

Reg

DMem

Reg

ALU

sub r4,r1,r3

Reg

ALU

Ifetch

ALU

O
r
d
e
r

add r1,r2,r3

WB

ALU

I
n
s
t
r.

MEM

ALU

IF ID/RF EX

Ifetch

Reg

Ifetch

Reg

Reg

Reg

DMem

CS211 57

Reg

Forwarding to Avoid Data Hazard

or

r8,r1,r9

xor r10,r1,r11

Reg

DMem

Ifetch

Reg

DMem

Reg

DMem

Reg

ALU

and r6,r1,r7

Ifetch

DMem

ALU

sub r4,r1,r3

Reg

ALU

O
r
d
e
r

add r1,r2,r3 Ifetch

ALU

I
n
s
t
r.

ALU

Time (clock cycles)

Ifetch

Reg

Ifetch

Reg

Reg

Reg

DMem

Reg

CS211 58

Internal Forwarding of Instructions


Forward result from ALU/Execute unit to execute
unit in next stage
Also can be used in cases of memory access
in some cases, operand fetched from memory has been
computed previously by the program
can we forward this result to a later stage thus
avoiding an extra read from memory ?
Who does this ?
Internal forwarding cases
Stage i to Stage i+k in pipeline
store-load forwarding
load-store forwarding
store-store forwarding
CS211 59

38

Internal Data
Forwarding
Store-load forwarding

Memory
M

Memory
M

Access Unit

Access Unit

R1
STO M,R1

R2
LD R2,M

R1
STO M,R1

R2
MOVE R2,R1

CS211 60

39

Internal Data
Forwarding
Load-load forwarding

Memory
M

Memory
M

Access Unit

Access Unit

R1
LD R1,M

R2
LD R2,M

R1
LD R1,M

R2
MOVE R2,R1

CS211 61

40

Internal Data
Forwarding
Store-store forwarding

Memory
M

Memory
M

Access Unit

Access Unit

R1
STO M, R1

R2
STO M,R2

R1

R2
STO M,R2

CS211 62

HW Change for Forwarding

NextPC

mux
MEM/WR

EX/MEM

ALU

mux

ID/EX

Registers

mux

Immediate

Data
Memory

CS211 63

What about memory


If instructions are initiated in order and
operations?
operations always occur in the same stage,

op Rd Ra Rb

there can be no hazards between memory


operations!
What does delaying WB on arithmetic
operations cost?
cycles ?
hardware ?

op Rd Ra Rb

What about data dependence on loads?


R1 <- R4 + R5
R2 <- Mem[ R2 + I ]
R3 <- R2 + R1
Delayed Loads

Rd

Can recognize this in decode stage and


introduce bubble while stalling fetch stage

Rd

Tricky situation:
R1 <- Mem[ R2 + I ]
Mem[R3+34] <- R1
Handle with bypass in memory stage!

Mem
T

to reg
file

CS211 64

Data Hazard Even with Forwarding

and r6,r1,r7
or

r8,r1,r9

DMem

Ifetch

Reg

DMem

Reg

Ifetch

Ifetch

Reg

Reg

Reg

DMem

ALU

sub r4,r1,r6

Reg

ALU

O
r
d
e
r

lw r1, 0(r2) Ifetch

ALU

I
n
s
t
r.

ALU

Time (clock cycles)

Reg

DMem

Reg

CS211 65

Data Hazard Even with Forwarding

and r6,r1,r7
or r8,r1,r9

Reg

DMem

Ifetch

Reg

Bubble

Ifetch

Bubble

Reg

Bubble

Ifetch

Reg

DMem

Reg

Reg

DMem

ALU

sub r4,r1,r6

Ifetch

ALU

O
r
d
e
r

lw r1, 0(r2)

ALU

I
n
s
t
r.

ALU

Time (clock cycles)

Reg

DMem

CS211 66

What can we (S/W) do?

CS211 67

Software Scheduling to Avoid Load


Hazards
Try producing fast code for
a = b + c;
d = e f;
assuming a, b, c, d ,e, and f in memory.
Slow code:
LW
LW
ADD
SW
LW
LW
SUB
SW

Rb,b
Rc,c
Ra,Rb,Rc
a,Ra
Re,e
Rf,f
Rd,Re,Rf
d,Rd

Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd

CS211 68

Control Hazards: Branches


Instruction flow
Stream of instructions processed by Inst. Fetch unit
Speed of input flow puts bound on rate of
outputs generated

Branch instruction affects instruction flow


Do not know next instruction to be executed until
branch outcome known

When we hit a branch instruction


Need to compute target address (where to branch)
Resolution of branch condition (true or false)
Might need to flush pipeline if other instructions
have been fetched for execution
CS211 69

22: add r8,r1,r9


36: xor r10,r1,r11

Reg

DMem

Ifetch

Reg

Ifetch

Reg

Reg

DMem

Reg

DMem

Ifetch

Reg

ALU

r6,r1,r7

Ifetch

DMem

ALU

18: or

Reg

ALU

14: and r2,r3,r5

Ifetch

ALU

10: beq r1,r3,36

ALU

Control Hazard on Branches


Three Stage Stall

Reg

Reg

DMem

CS211 70

Reg

Branch Stall Impact


If CPI = 1, 30% branch,
Stall 3 cycles => new CPI = 1.9!
Two part solution:
Determine branch taken or not sooner, AND
Compute taken branch address earlier

MIPS branch tests if register = 0 or 0


MIPS Solution:
Move Zero test to ID/RF stage
Adder to calculate new PC in ID/RF stage
1 clock cycle penalty for branch versus 3
CS211 71

Pipelined MIPS (DLX) Datapath


Instruction
Fetch

Instr. Decode
Reg. Fetch

Execute
Addr. Calc.

Memory
Access

Write
Back

This is the correct 1 cycle


latency implementation!

CS211 72

Four Branch Hazard Alternatives


#1: Stall until branch direction is clear flushing pipe
#2: Predict Branch Not Taken

Execute successor instructions in sequence


Squash instructions in pipeline if branch actually taken
Advantage of late pipeline state update
47% DLX branches not taken on average
PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken


53% DLX branches taken on average
But havent calculated branch target address in DLX
DLX still incurs 1 cycle branch penalty
Other machines: branch target known before outcome

CS211 73

Four Branch Hazard Alternatives


#4: Delayed Branch
Define branch to take place AFTER a following instruction
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken

Branch delay of length n

1 slot delay allows proper decision and branch target address in


5 stage pipeline
DLX uses this

CS211 74

CS211 75

Delayed Branch
Where to get instructions to fill branch delay slot?

Before branch instruction


From the target address: only valuable when branch taken
From fall through: only valuable when branch not taken
Cancelling branches allow more slots to be filled

Compiler effectiveness for single branch delay slot:


Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots useful
in computation
About 50% (60% x 80%) of slots usefully filled

Delayed Branch downside: 7-8 stage pipelines, multiple


instructions issued per clock (superscalar)

CS211 76

Evaluating Branch Alternatives


Pipelinespeedup=

Pipelinedepth
1+BranchfrequencyBranchpenalty

Scheduling
BranchCPIspeedup v.speedup v. scheme
penalty
unpipelined
stall
Stall pipeline
31.423.5
1.0
Predict taken
11.144.4
1.26
Predict not taken
11.094.5
1.29
Delayed branch
0.51.074.6
1.31

Conditional & Unconditional = 14%, 65% change PC

CS211 77

Branch Prediction based on history


Can we use history of branch behaviour to predict
branch outcome ?
Simplest scheme: use 1 bit of history
Set bit to Predict Taken (T) or Predict Not-taken (NT)
Pipeline checks bit value and predicts
If incorrect then need to invalidate instruction
Actual outcome used to set the bit value

Example: let initial value = T, actual outcome of


branches is- NT, T,T,NT,T,T
Predictions are: T, NT,T,T,NT,T
3 wrong (in red), 3 correct = 50% accuracy

In general, can have k-bit predictors: more when we


cover superscalar processors.
CS211 78

Summary :
Control and Pipelining
Just overlap tasks; easy if tasks are independent
Speed Up Pipeline Depth; if ideal CPI is 1, then:
Cycle Timeunpipelined
Pipeline depth
Speedup

1 Pipeline stall CPI


Cycle Timepipelined

Hazards limit performance on computers:


Structural: need more HW resources
Data (RAW,WAR,WAW): need forwarding, compiler
scheduling
Control: delayed branch, prediction

CS211 79

Summary #1/2:
Pipelining
What makes it easy
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores

What makes it hard? HAZARDS!


structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous
instruction

Pipelines pass control information down the pipe just


as data moves down pipe
Forwarding/Stalls handled by local control
Exceptions stop the pipeline
CS211 80

Introduction to ILP
What is ILP?
Processor and Compiler design techniques that
speed up execution by causing individual machine
operations to execute in parallel

ILP is transparent to the user


Multiple operations executed in parallel even though
the system is handed a single program written with
a sequential processor in mind

Same execution hardware as a normal RISC


machine
May be more than one of any given type of hardware
CS211 81

Compiler vs. Processor


Compiler

Hardware

FrontendandOptimizer

Superscalar

DetermineDependences

Dataflow

DetermineDependences

DetermineIndependences

Indep.Arch.

DetermineIndependences

BindOperationstoFunctionUnits

VLIW

BindOperationstoFunctionUnits

BindTransportstoBusses

BindTransportstoBusses

TTA
Execute

B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel: History overview, and perspective. The
Journal of Supercomputing, 7(1-2):9-50, May 1993.

CS211 82

You might also like