c6 Cao

Download as pdf or txt
Download as pdf or txt
You are on page 1of 77

Faculty of Electrical and Electronic Engineering

Semester II, Session 2016/2017

BEC30303
Computer Architecture and Organization

Chapter 6:
Pipelining
Mohamad Hairol Jabbar
Department of Computer Engineering
http://fkee.uthm.edu.my/mhjabbar
OUTLINE
1. Pipeline organization
2. Data dependencies
3. Pipeline issues
4. Branch delays
5. Superscalar operation
6. Performance evaluation

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 2


1. PIPELINE ORGANIZATION
IMPROVE EXECUTION OF PROGRAMS
• Use faster circuit technology to build the
processor and the main memory.
• Arrange the hardware so that more than one
operation can be performed at the same time.
• In the latter way, the number of operations
performed per second is increased even
though the elapsed time needed to perform
any one operation is not changed.

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 4


LAUNDRY EXAMPLE
•Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
•Washer takes 30 minutes
•Dryer takes 40 minutes
•Folder takes 20 minutes

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 5


SEQUENTIAL CONCEPT
6 PM 7 8 9 10 11 Midnight

Time

30 40 20 30 40 20 30 40 20 30 40 20

A • Sequential laundry takes 6


hours for 4 loads
Task • If they learned pipelining, how
order B long would laundry take?

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 6


PIPELINE CONCEPT
• Pipelining doesn’t help latency
of single task, it helps
throughput of entire workload
• Pipeline rate limited by slowest
pipeline stage
• Multiple tasks operating
simultaneously using different
resources
• Potential speedup = Number
pipe stages
• Unbalanced lengths of pipe
stages reduces speedup
• Time to “fill” pipeline and time
to “drain” it reduces speedup
• Stall for Dependences

• Pipelined laundry takes 3.5


hours for 4 loads
• What is the speedup? = 6/3.5

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 7


LATENCY VS THROUGHPUT
• Latency:…
• Throughput:…
• Speedup due to increased throughput
– Latency (time for each instruction) does not
decrease

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 8


NON-PIPELINED DESIGN
• Single-cycle implementation:
– The cycle time depends on the slowest instruction
– Every instruction takes the same amount of time
• Multi-cycle implementation:
– Divide the execution of an instruction into multiple
steps
– Each instruction may take variable number of
steps (clock cycles)

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 9


SINGLE CYCLE
• The cycle time depends on the slowest
instruction
• Every instruction takes the same amount of
time

Problem:
- Long cycle time (means what?), to finish each instruction

Source: fetweb.ju.edu.jo

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 10


MULTI CYCLE
• Divide the execution of an instruction into
multiple steps
• Each instruction may take variable number of
steps (clock cycles)

Problem:
- improve the clock cycle period, but some instructions
finish longer than the single cycle implementation Source: fetweb.ju.edu.jo

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 11


PIPELINE IMPLEMENTATION

Ideal multi cycle implementation with


pipeline

Advantages:
- Once the pipeline is full, we get CPI = 1 (means what?)

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 12


SINGLE, MULTI, PIPELINE

Lw- load word


Sw- store word

Source: fetweb.ju.edu.jo

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 13


REALISTIC PIPELINE EXECUTION

Each stage executes


within the same clock
period/cycle, but
some stages
complete early
1 clk cycle 1 clk cycle 1 clk cycle

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 14


CPI, IPC
• CPI:
– Cycle per instruction
– Smaller is better
– For ideal case (all steps finish in 1 clock cycle),
thus CPI = 1
– Realistic case, some steps take multi cycle
execution, thus use average CPI
• IPC:
– Instruction per cycle
– = 1/CPI

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 15


CPI EXAMPLE
• A program executes:
– ALU, Load, Store
– Cycles per instruction type: ALU = 1, Load = 2,
Store = 3

• What is the CPI?:


– (33% * 1) + (33% * 2) + (33% * 3) = 2
– Also = total no. of cycles / total no. of instructions

Source: http://www.cis.upenn.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 16


PIPELINE PERFORMANCE
• Time to start executing the fourth
instruction:
• Single-Cycle = 1000 x 3 =
3000 ps,
• Pipelining = 3 x 200 = 600
ps(we fetch one instruction
every clock cycle
• Speedup = 3000/600 = 5
instructions over single
cycle

• Total execution time for three


Example: sequence
load instructions:
of Load instructions
• Single-cycle = 3 x 1000 =
3000 ps, clock cycle?
• Pipelined = 1200 ps, clock
cycle?
• Speedup = ???
Source: fetweb.ju.edu.jo

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 17


2. DATA DEPENDENCIES
DATA DEPENDENCE
• Data dependence:
– one instruction is dependent on another
instruction to provide its operands.
• Control dependence (aka branch
dependences):
– one instruction determines whether another
instruction gets executed or not.
• Control dependences are particularly critical
with conditional branches.

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 19


EXAMPLE OF DATA DEPENDENCIES

• Control dependence will always


creates control hazard, why?
• Data dependence not
necessarily creates data hazard

Source: http://cseweb.ucsd.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 20


EXAMPLE
Instruction:

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 21


3. PIPELINE ISSUES
HAZARDS
• Any condition that causes a pipeline to stall is
called a hazard.
• Pipeline stall causes degradation in pipeline
performance.

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 23


TYPES OF HAZARDS
• Data hazard – any condition in which either
the source or the destination operands of an
instruction are not available at the time
expected in the pipeline.
• Structural hazard – the situation when two
instructions require the use of a given
hardware resource at the same time.
• Instruction (control) hazard – a delay in the
availability of an instruction causes the
pipeline to stall.

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 24


DATA HAZARD
• Data/operands is not ready when it is needed

1 2 3 4 5 6 7

Read-after-write hazard Source: www.cs.utexas.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 25


TYPES OF DATA HAZARD
• RAW (read after write)
– only hazard for ‘fixed’ pipelines
– later instruction must read after earlier write
• WAW (write after write)
– variable-length pipeline
– later instruction must write after earlier write
• WAR (write after read)
– pipelines with late read
– later instruction must write after earlier read

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 26


HANDLING DATA HAZARDS
• In software:
– insert independent instructions (or no-ops), by the
compiler
• In hardware:
– insert bubbles (i.e. stall the pipeline): solve all
hazards
– data forwarding: sometimes does not solve
hazards

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 27


INSERTING NOP (1/2)
• Let the compiler detect and handle the
hazard:
add $1, x, x
NOP
NOP
sub $4, $1, $5
add $6, $1, $7
...
• The compiler can reorder the instructions to
perform some useful work during the NOP
slots.
Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 28
INSERTING NOP (2/2)

1 2 3 4 5 6 7

Source: http://cseweb.ucsd.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 29


STALL THE PIPELINE

What will
happen to
the CPI?
1 2 3 4 5 6 7

Source: www.cs.utexas.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 30


OPERAND FORWARDING (1/2)
• Instead of from the register file, the second
instruction can get data directly from the
output of ALU after the previous instruction is
completed.
• A special arrangement needs to be made to
“forward” the output of ALU to the input of
ALU.

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 31


OPERAND FORWARDING (2/2)
1 2 3 4 5 6 7

Source: www.cs.utexas.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 32


EXAMPLE

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 33


FORWARDING HARDWARE

Result is forwarded to
the ALU needed for next
instruction execution

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 34


FORWARDING PROBLEM
1 2 3 4 5 6 7

Still must stall for 1 clock


cycle (for instruction #2)
Source: www.cs.utexas.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 35


STRUCTURAL HAZARDS
• The situation when two instructions require the use
of a given hardware resource at the same time.

Structural
hazard due to
the accessing
the same
registers at
the same time

Clock cycle 1 2 3 4 5 6 7 Source: www.cs.utexas.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 36


FIXING STRUCTURAL HAZARDS
• Fix using:
– Stall
– Separate memory architecture for instruction and
data or separate the memory access time

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 37


FIXING THE HAZARD
1 2 3 4 5 6 7

Source: www.cs.utexas.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 38


CONTROL HAZARD
• Whenever the stream of instructions supplied
by the instruction fetch unit is interrupted, the
pipeline stalls.
• Examples of interruption:
– Cache miss
– Branch instruction

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 39


CONTROL HAZARD

1 2 3 4 5 6 7

Source: http://cseweb.ucsd.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 40


FIXING CONTROL HAZARD

1 2 3 4 5 6 7

Stall (inserting
bubbles) the pipeline
for several clock
cycles before the real
Source: http://cseweb.ucsd.edu/
instruction is fetched

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 41


4. BRANCH DELAYS
BRANCH TYPES
• Unconditional, example?
• Conditional, example?

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 43


CONDITIONAL BRANCHES
• A conditional branch instruction introduces
the added hazard caused by the dependency
of the branch condition on the result of a
preceding instruction.
• The decision to branch cannot be made until
the execution of that instruction has been
completed.
• Branch instructions represent about 20% of
the dynamic instruction count of most
programs.

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 44


BRANCH PENALTY
1 2 3 4 5 6 7

3 cycles penalty for branch instruction


Source: http://cseweb.ucsd.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 45


REDUCING BRANCH PENALTY
• For unconditional branch:
– Compute the branch target address earlier in the
pipeline
– Instruction prefetching
– Delayed branch
– Branch prediction

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 46


BRANCH TARGET ADDRESS

Determine the branch


target address during
decode stage – save 1
clock cycle

Determine the branch


target address during
execute stage - 2 clock
cycles branch penalty

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 47


INSTRUCTION QUEUE/PREFETCHING

Instruction fetch unit


Instruction queue
F:
instruction
Fetch
Memory access is slow,
thus prefetch the
instructions to avoid
long delay (waiting)
when branch happens
D : Dispatch/
Decode E : ecute W:
instruction
Ex results
Write
unit

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 48


DELAYED BRANCH (1/2)
• The instructions in the delay slots are always
fetched. Therefore, we would like to arrange
for them to be fully executed whether or not
the branch is taken.
• The objective is to place useful instructions in
these slots.
• The effectiveness of the delayed branch
approach depends on how often it is possible
to reorder instructions.

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 49


DELAYED BRANCH (2/2)
• Scheduling techniques:
– The compiler statically schedules an independent
instruction in the branch delay slot.
• The instruction in the branch delay slot is
executed whether or not the branch is taken

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 50


EXAMPLE
• A previous add instruction with any effects on
the branch is scheduled in the branch delay
slot

Source: http://home.deib.polimi.it/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 51


BRANCH NOT TAKEN
• If the branch is not taken, the execution
continues with the next instruction

Source: http://home.deib.polimi.it/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 52


BRANCH TAKEN
• If the branch is taken = execution continues
at the branch target, after the delayed
instruction slot is executed

Source: http://home.deib.polimi.it/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 53


BRANCH PREDICTION (1/2)
• To predict whether or not a particular branch
will be taken.
• Simplest form:
– assume branch will not take place and continue to
fetch instructions in sequential address order.
• Until the branch is evaluated, instruction
execution along the predicted path must be
done on a speculative basis.
• Speculative execution:
– instructions are executed before the processor is
certain that they are in the correct execution
sequence.
Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 54
BRANCH PREDICTION (2/2)
• Better performance can be achieved if we arrange
for some branch instructions to be predicted as
taken and others as not taken.
• Use hardware to observe whether the target
address is lower or higher than that of the branch
instruction.
• Let compiler include a branch prediction bit.
• So far the branch prediction decision is always the
same every time a given instruction is executed –
static branch prediction.

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 55


BRANCH NOT TAKEN
• For branch not taken assumption

When it is right, no
penalty (no waste
clock cycles)

Source: http://cseweb.ucsd.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 56


BRANCH TAKEN
• When branch is taken

Same performance as
stalling when it is wrong

Source: http://cseweb.ucsd.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 57


PREDICTION METHODS
• Fixed:
– Prediction is fixed
– Example: branch-never-taken
– Not proper for loop structures
• Static branch prediction:
– base guess on instruction types
• Dynamic branch prediction:
– base guess on execution history

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 58


DYNAMIC BRANCH PREDICTION
• Using branch predictors to save the history of
branch executions

Source: http://www.owlnet.rice.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 59


5. SUPERSCALAR OPERATION
OVERVIEW
• The maximum throughput of a pipelined
processor is one instruction per clock cycle.
• If we equip the processor with multiple
processing units to handle several
instructions in parallel in each processing
stage, several instructions start execution in
the same clock cycle – multiple-issue.
• Processors are capable of achieving an
instruction execution throughput of more than
one instruction per cycle – superscalar
processors.
• Multiple-issue requires a wider path to the
cache and multiple execution units.
Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 61
SCALAR AND SUPERSCALAR

Source: W. M. Johnson, 1989


Source: http://cseweb.ucsd.edu/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 62


SUPERSCALAR OPERATION
• Increases the ability of the processor to use
instruction level parallelism
• Multiple instructions are issued every cycle:
– multiple pipelines or functional units operating in
parallel
• Example:
– 3 parallel pipelines
– 3-way superscalar processor
– 3-issue processor

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 63


EXECUTION ORDER
• In order execution:
– Instructions are fetched, executed and completed in
compiler generated order
– One stalls, they all stall
– Instructions are statistically scheduled
• Out-of-order execution:
– Instructions are fetched in compiler-generated order
– Instruction completion may be in-order or out-of-order
– In between they may be executed in some other order
– Instructions are dynamically scheduled

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 64


PIPELINE EVOLUTION
• Basic pipeline:
– Single issue, in-order issue
• First extension:
– Multiple issue (superscalar)
– In-order issue
• Second extension:
– Multiple issue (superscalar)
– Out-of-order issue

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 65


EXAMPLE

Source: www.ida.liu.se/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 66


IOI, IOC
• In-order issue, in-order completion
In order In order
execution completion

Source: www.ida.liu.se/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 67


OOI, OOC
• Out-of-order issue, out-of-order completion

Out of order Out of order


execution completion
Source: www.ida.liu.se/

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 68


EXECUTION COMPLETION
• It is desirable to use out-of-order execution,
so that an execution unit is freed to execute
other instructions as soon as possible.
• At the same time, instructions must be
completed in program order to allow precise
exceptions.
• Using temporary registers

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 69


SUPER COMPUTER AT FKEE

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 70


SUPER COMPUTER AT FKEE

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 71


QUESTION: WHICH ONE IS THE FASTEST?

GPU? Microprocessor? CPU?


Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 72
6. PERFORMANCE EVALUATION
PIPELINE PERFORMANCE (1/3)
• The execution time, T of a program that has a
dynamic instruction count N is given by:

where S is the average number of clock cycles it


takes to fetch and execute one instruction, and R is
the clock rate.
• Instruction throughput is defined as the
number of instructions executed per second.

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 74


PIPELINE PERFORMANCE (2/3)
• An n-stage pipeline has the potential to
increase the throughput by n times.
• However, the only real measure of
performance is the total execution time of a
program.
• Higher instruction throughput will not
necessarily lead to higher performance in
terms of program’s execution time.
• Question regarding pipelining:
– What is good value of n?

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 75


PIPELINE PERFORMANCE (3/3)
• Since an n-stage pipeline has the potential to
increase the throughput by n times, how
about we use a 10,000-stage pipeline?
• As the number of stages increase, the
probability of the pipeline being stalled
increases.
• The inherent delay in the basic operations
increases.
• Hardware considerations (area, power,
complexity, etc.)

Computer Architecture and Organization (BEC30303) | Chapter 5: Pipelining 76


FINISH

You might also like