Ch#4 Part 1, 2,34


EE227 Computer Organization and Architecture

Chapter # 4
The Processor

1
Outline
 4.1 Introduction
 4.2 Logic Design Conventions
 4.3 Building a Data path
 4.4 A Simple Implementation Scheme
 4.5 An Overview of Pipelining
 4.6 Pipelined Data path and Control
 4.7 Data Hazards: Forwarding versus Stalling
 4.8 Control Hazards
 4.9 Exceptions
 4.10 Parallelism via Instructions
 4.11 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines

2
Today’s Topic

 Single – Cycle CPU

 Multi-cycle CPU

3
Basic MIPS Architecture

• Now that we understand clocks and state storage, we’ll design a simple CPU that executes:
 basic math (add, sub, and, or, slt) – 32-bit ALU
 memory access (lw and sw) – exchanging values with memory
 branch and jump instructions (beq and j) – control flow
R-Format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
I-Format: op (6 bits) | rs (5 bits) | rt (5 bits) | offset (16 bits)
J-Format: op (6 bits) | address (26 bits)
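As a rough illustration of the three formats, the following Python sketch slices a 32-bit instruction word into its fields. The function name `decode_fields` and the sample word are assumptions made for this example; the register numbers ($t1 = 9, $t2 = 10, $t3 = 11) and the funct value 0x20 for add follow the standard MIPS encoding rather than anything stated on the slide.

```python
def decode_fields(word):
    """Split a 32-bit MIPS instruction word into the fields of all three formats."""
    return {
        "op":      (word >> 26) & 0x3F,   # bits 31:26, present in every format
        "rs":      (word >> 21) & 0x1F,   # bits 25:21 (R and I formats)
        "rt":      (word >> 16) & 0x1F,   # bits 20:16 (R and I formats)
        "rd":      (word >> 11) & 0x1F,   # bits 15:11 (R format)
        "shamt":   (word >> 6) & 0x1F,    # bits 10:6  (R format)
        "funct":   word & 0x3F,           # bits 5:0   (R format)
        "offset":  word & 0xFFFF,         # bits 15:0  (I format)
        "address": word & 0x3FFFFFF,      # bits 25:0  (J format)
    }

# Hypothetical sample: add $t1, $t2, $t3 encoded as an R-format word.
fields = decode_fields(0x014B4820)
```

Which fields are meaningful depends on the opcode; the decoder simply extracts all of them, much as the hardware wires every field out of the instruction in parallel.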

4
Implementation Overview – Requirements

• We need memory units
 to store instructions
 to store data
 for now, let’s make them separate units (see later)

• We need registers (a scratch pad for input operands and results), a 32-bit ALU, and a whole lot of control logic (how an instruction navigates the datapath, what inputs are provided, and where the output goes)

• CPU operations common to all instructions:
 use the program counter (PC) to pull the instruction out of instruction memory
 read register values (selected by fields of the fetched instruction)
5
Overview: Processor Implementation Styles

 Single Cycle
   perform each instruction in 1 clock cycle
   clock cycle must be long enough for the slowest instruction
   disadvantage: every instruction runs only as fast as the slowest one
 Multi-Cycle
   break fetch/execute cycle into multiple steps
   perform 1 step in each clock cycle
   advantage: each instruction uses only as many cycles as it needs
 Pipelined
   execute each instruction in multiple steps
   perform 1 step / instruction in each clock cycle
   process multiple instructions in parallel – assembly line

6
Single Cycle – Abstract Design

[Figure: high-level abstract design of the single-cycle datapath. Source: H&P textbook. Note: multiplexors are not shown.]

• Single-cycle CPU design
• Every cycle a new instruction is fetched and fully processed within that single cycle
• At each rising clock edge the PC gets updated
7
Single Cycle – Abstract Level

Note: multiplexors are not shown

• What is the role of the Add units?
• Explain the inputs to the data memory unit
• Explain the inputs to the ALU
• Explain the inputs to the register unit
8
Clocking Methodology(1)

Source: H&P textbook


• Which of the above units need a clock?
• What is being saved (latched) on the rising edge of the clock?
Keep in mind that the latched value remains there for an entire cycle
9
Clocking Methodology(2)

• Which of the above units need a clock?

• The Program Counter needs a clock -> at every rising edge it records the value of the PC for the next cycle

• The Instruction Memory unit doesn’t need a clock for reads -> once we provide an address as input, the instruction comes out at the other end after a few picoseconds

• The Register unit doesn’t need a clock for reads -> once we provide the register numbers (e.g. $t0 & $t1), the values stored there come out a few picoseconds later as inputs to the ALU block. Writes are state changes, so they take effect on the clock edge when RegWrite is asserted

• The ALU doesn’t need a clock -> ALU/Add units are combinational circuits and don’t need a clock to coordinate their inputs

• The Data Memory doesn’t need a clock for reads -> similar to the IM unit. As with the register file, writes take effect on the clock edge

10
Implementing R-type Instructions

• Instructions of the form add $t1, $t2, $t3


• Explain the role of each signal

Control
Signal

If RegWrite = 1, the register file is written on the clock edge; otherwise (e.g. for a store or a branch) no register is written
11
Implementing Loads/Stores

• Instructions of the form lw $t1, 8($t2) and sw $t1, 8($t2)

Where does this input come from?

12
Source: H&P textbook
Implementing Branch Instructions

• Instructions of the form beq $t1, $t2, offset

Source: H&P textbook


13
Single-cycle Implementation of MIPS

 Our first implementation of MIPS will use a single long clock cycle for every instruction

 Every instruction begins on one up (or down) clock edge and ends on the next up (or down) clock edge

 This approach is not practical as it is much slower than a multicycle implementation where different instruction classes can take different numbers of cycles
   in a single-cycle implementation every instruction must take the same amount of time as the slowest instruction
   in a multicycle implementation this problem is avoided by allowing quicker instructions to use fewer cycles

 Even though the single-cycle approach is not practical, it is simple and useful to understand first

14
Datapath: Instruction Store/Fetch & PC Increment

[Figure: a. Instruction memory (instruction address in, instruction out), b. Program counter, c. Adder (computes PC + 4)]

Three elements used to store and fetch instructions and increment the PC

15
Animating the Datapath

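[Figure: the PC drives the memory address input (ADDR), the memory read port (RD) supplies the instruction, and an adder computes PC + 4]

Instruction <- MEM[PC]
PC <- PC + 4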
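The two register-transfer steps of the fetch stage can be sketched in Python as follows. The function name `fetch`, the dictionary used as instruction memory, and the two sample instruction words are assumptions made for this illustration.

```python
def fetch(instr_mem, pc):
    """Model one fetch step; returns (instruction, next_pc)."""
    instruction = instr_mem[pc]        # Instruction <- MEM[PC]
    next_pc = (pc + 4) & 0xFFFFFFFF    # PC <- PC + 4, kept to 32 bits
    return instruction, next_pc

# Hypothetical two-instruction program stored at byte addresses 0 and 4.
program = {0: 0x014B4820, 4: 0x8E480020}
instr, pc = fetch(program, 0)
```

In hardware both steps happen in the same cycle: the adder computes PC + 4 while the memory is being read, and the new PC value is captured at the next clock edge.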

16
Datapath: R-Type Instruction

[Figure: a. Registers (two 5-bit read register numbers, one 5-bit write register number, write data, RegWrite; outputs read data 1 and read data 2), b. ALU (3-bit ALU operation control; outputs Zero and ALU result)]

Two elements used to implement R-type instructions

17
Animating the Datapath

add rd, rs, rt

Instruction fields: op rs rt rd shamt funct
R[rd] <- R[rs] + R[rt];

[Figure: register file (RN1, RN2, WN, WD, RD1, RD2, RegWrite) feeding the ALU (3-bit Operation input, Zero output); the ALU result returns to WD]

18
Datapath: Load/Store Instruction

[Figure: a. Data memory unit (Address, Write data, Read data, MemWrite, MemRead), b. Sign-extension unit (16-bit input, 32-bit output)]

Two additional elements used to implement loads/stores

19
Animating the Datapath

lw rt, offset(rs)

Instruction fields: op rs rt offset/immediate
R[rt] <- MEM[R[rs] + s_extend(offset)];

[Figure: register file, 16-to-32-bit sign extender (EXTND), ALU, and data memory (ADDR, WD, RD, MemWrite, MemRead); the memory read data returns to the register file WD input]

20
Animating the Datapath

sw rt, offset(rs)

Instruction fields: op rs rt offset/immediate
MEM[R[rs] + sign_extend(offset)] <- R[rt]

[Figure: register file, 16-to-32-bit sign extender (EXTND), ALU, and data memory (ADDR, WD, RD, MemWrite, MemRead); RD2 drives the memory WD input]
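The register-transfer descriptions of lw and sw can be tried out with a short Python sketch. The helper names (`sign_extend16`, `lw`, `sw`), the list-based register file, and the dictionary-based memory are assumptions for illustration only.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def lw(regs, mem, rt, offset, rs):
    """R[rt] <- MEM[R[rs] + s_extend(offset)]"""
    addr = (regs[rs] + sign_extend16(offset)) & 0xFFFFFFFF
    regs[rt] = mem[addr]

def sw(regs, mem, rt, offset, rs):
    """MEM[R[rs] + s_extend(offset)] <- R[rt]"""
    addr = (regs[rs] + sign_extend16(offset)) & 0xFFFFFFFF
    mem[addr] = regs[rt]

regs = [0] * 32
mem = {}
regs[10] = 0x100          # $t2 holds a hypothetical base address
regs[9] = 42              # $t1 holds the value to store
sw(regs, mem, 9, 8, 10)   # sw $t1, 8($t2)
regs[9] = 0
lw(regs, mem, 9, 8, 10)   # lw $t1, 8($t2)
```

Both instructions use the ALU for the same address addition; they differ only in the direction of the data transfer and in which of MemRead, MemWrite, and RegWrite is asserted.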

21
Datapath: Branch Instruction

[Figure: the register file supplies two operands to the ALU, whose Zero output goes to the branch control logic; the 16-bit offset is sign-extended to 32 bits, shifted left by 2, and added to PC + 4 (from the instruction datapath) to form the branch target]
22
Animating the Datapath

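beq rs, rt, offset

Instruction fields: op rs rt offset/immediate

[Figure: register file and ALU (Zero output) compare R[rs] and R[rt]; the 16-bit offset passes through the sign extender (EXTND) and a shift-left-by-2 unit (<<2) into an adder together with PC + 4 from the instruction datapath]

if (R[rs] == R[rt]) then
PC <- PC+4 + (s_extend(offset) << 2)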
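A small Python sketch of the branch decision, following the datapath order (sign-extend first, then shift left by 2). The function names are hypothetical and the addresses in the usage lines are made up for the example.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def beq_next_pc(pc, rs_val, rt_val, offset16):
    """Return the next PC for beq: the branch target if equal, else PC + 4."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    if rs_val == rt_val:
        # The offset counts words, so it is multiplied by 4 (shifted left by 2).
        return (pc_plus_4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF
    return pc_plus_4

taken_forward = beq_next_pc(0x1000, 5, 5, 3)        # 3 instructions ahead of PC + 4
taken_backward = beq_next_pc(0x1000, 5, 5, 0xFFFF)  # offset -1: branch to itself
not_taken = beq_next_pc(0x1000, 5, 6, 3)
```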
23
Implementation including Multiplexers
(View from 10,000 feet)

24
Source: H&P textbook
Animating the Datapath: R-type Instruction

Instruction: add rd,rs,rt

[Figure: combined datapath with multiplexers (the ALUSrc mux selects the second ALU input, the MemtoReg mux selects the register write data), with the path used by an R-type instruction highlighted]

25
Animating the Datapath: Load Instruction

Instruction: lw rt,offset(rs)

[Figure: combined datapath with multiplexers (ALUSrc, MemtoReg), with the path used by a load highlighted]

26
Animating the Datapath: Store Instruction

Instruction: sw rt,offset(rs)

[Figure: combined datapath with multiplexers (ALUSrc, MemtoReg), with the path used by a store highlighted]

27
Functional Elements

 Two types of functional elements in the hardware:

   elements that operate on data (called combinational elements)

   elements that contain data (called state or sequential elements)

28
Combinational Elements

 Works as an input -> output function, e.g., ALU

 Output depends only on the current inputs

 Combinational logic reads input data from one register and writes output data to another, or the same, register

 The read and write happen in a single cycle – a combinational element cannot store data from one cycle to a future one

[Figure: combinational logic hardware units]

29
State Elements
 State elements contain data in internal storage
 All state elements together define the state of the machine
 Flipflops and latches are 1-bit state elements, equivalently,
they are 1-bit memories
 The output(s) of a flipflop or latch always depends on the bit
value stored, i.e., its state, and can be called 1/0 or high/low
or true/false
 The input to a flipflop or latch can change its state depending
on whether it is clocked or not…
 Instruction and data memories, as well as the registers, are all
examples of state elements.

30
State / Combinational Elements

[Figure: combinational logic hardware units placed between state elements; state element 1 -> combinational logic -> state element 2, spanning one clock cycle]

31
Clocking Methodology
 The approach used to determine when data is valid and stable relative to the clock.
 An edge-triggered clocking methodology means that any values stored in a sequential logic element are updated only on a clock edge (a transition from low to high or from high to low)

 Because only state elements can store a data value, any collection of combinational logic must have its inputs come from a set of state elements and its outputs written into a set of state elements.
 The inputs are values that were written in a previous clock cycle, while the outputs are values that can be used in a following clock cycle.

32
Edge Triggered Clocking Methodology(1)
 A clocking scheme in which all state changes occur on a clock edge.

[Figure: state element 1 -> combinational logic -> state element 2, spanning one clock cycle]

33
Edge Triggered Clocking Methodology(2)

[Figure: state element 1 -> combinational logic -> state element 2, spanning one clock cycle]

 The figure shows two state elements surrounding a block of combinational logic, which operates in a single clock cycle: all signals must propagate from state element 1, through the combinational logic, and to state element 2 in the time of one clock cycle.

 The time necessary for the signals to reach state element 2 defines the length of the clock cycle.
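The edge-triggered behaviour can be mimicked in software with a toy model. The `Register` class, its `d`/`q` attribute names, and the `comb` function are all assumptions made for this sketch; it ignores propagation delay and only shows that a state element changes its visible value at the clock edge.

```python
class Register:
    """Toy edge-triggered state element: q changes only when clock_edge() is called."""
    def __init__(self, value=0):
        self.q = value   # output, stable for the whole clock cycle
        self.d = value   # input, may change freely during the cycle

    def clock_edge(self):
        self.q = self.d  # capture the input at the clock edge

def comb(x):
    """Stand-in for a block of combinational logic (here: add 1)."""
    return x + 1

elem1, elem2 = Register(3), Register(0)
elem2.d = comb(elem1.q)   # signals propagate during the cycle
before = elem2.q          # state element 2 still holds its old value
elem2.clock_edge()
after = elem2.q           # the new value becomes visible after the edge
```

The clock period has to be long enough for `comb` to settle before `clock_edge()` fires, which is the sense in which the propagation time defines the cycle length.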

34
Control Signal

 A signal used for multiplexor selection or for directing the operation of a functional unit is called a control signal

 It contrasts with a data signal, which contains information that is operated on by a functional unit.

 The term asserted indicates a signal that is logically high, and assert specifies that a signal should be driven logically high

 deassert or deasserted represents logically low.

35
Revisiting MIPS inst. format

36
View from 5,000 Feet

37
Source: H&P textbook
Control signals overview
 RegDst: which instruction field to use as the destination register specifier?
   Inst[20:16] vs. Inst[15:11]
 ALUSrc: which one to use for ALU source 2?
   Immediate vs. register read port 2
 MemtoReg: is it a memory load?
 RegWrite: update register?
 MemRead: read memory?
 MemWrite: write to memory?
 Branch: is it a branch?
 ALUOp: what type of ALU operation?

38
Example: lw r8, 32(r18)

 Let’s assume r18 has 1,000


 Let’s assume M[1032] has 0x11223344

39
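Example: lw r8, 32(r18)

[Figure: datapath annotated with the values for this instruction: opcode = 35, rs = 18, rt = 8, immediate = 32; read data 1 = 1000; ALU result = 1032; memory read data = 0x11223344, written back to r8; RegDst=0, ALUSrc=1, MemtoReg=1, Branch=0, with RegWrite and MemRead asserted; next PC = PC+4]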
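The same example can be traced step by step in Python. The variable names and the dictionary of control values are assumptions for this sketch; the numbers (r18 = 1000, M[1032] = 0x11223344, immediate 32) come from the example itself.

```python
regs = [0] * 32
regs[18] = 1000                      # assumed contents of r18
data_mem = {1032: 0x11223344}        # assumed contents of M[1032]

# Control values for lw as annotated in the example.
control = {"RegDst": 0, "ALUSrc": 1, "MemtoReg": 1,
           "RegWrite": 1, "MemRead": 1, "MemWrite": 0, "Branch": 0}

rs, rt, rd_field, imm = 18, 8, 0, 32  # rd_field is Inst[15:11] of the immediate

read_data1 = regs[rs]
read_data2 = regs[rt]
alu_in2 = imm if control["ALUSrc"] else read_data2           # ALUSrc mux
alu_result = read_data1 + alu_in2                            # address = 1000 + 32
mem_data = data_mem[alu_result] if control["MemRead"] else None
write_reg = rd_field if control["RegDst"] else rt            # RegDst mux
write_data = mem_data if control["MemtoReg"] else alu_result # MemtoReg mux
if control["RegWrite"]:
    regs[write_reg] = write_data
```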

40
ALU Control Lines(1)
 The MIPS ALU defines the following 6 combinations of four control inputs:

41
ALU Control Lines(2)
 For load word and store word instructions, we use the ALU to compute the memory address by addition.

 For the R-type instructions, the ALU needs to perform one of the five actions (AND, OR, subtract, add, or set on less than), depending on the value of the 6-bit funct (or function) field in the low-order bits of the instruction.

 For branch equal, the ALU must perform a subtraction.

42
ALU Control Input / ALUOp(1)
 We can generate the 4-bit ALU control input using a small control unit that has as inputs the function field of the instruction and a 2-bit control field, which we call ALUOp.

 ALUOp indicates whether the operation to be performed should be add (00) for loads and stores, subtract (01) for beq, or determined by the operation encoded in the funct field (10).

 The output of the ALU control unit is a 4-bit signal that directly controls the ALU by generating one of the 4-bit combinations shown previously.
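A minimal Python sketch of this ALU control unit follows. The 4-bit codes (AND 0000, OR 0001, add 0010, subtract 0110, set on less than 0111) and the funct values are the standard ones from the H&P MIPS design and are assumptions here, since the table on the slide did not survive extraction; only the 0010 code for ALUOp = 00 is confirmed by the text. The names `FUNCT_TO_ALU_CTRL` and `alu_control` are hypothetical.

```python
# funct field -> 4-bit ALU control (assumed standard MIPS / H&P encodings)
FUNCT_TO_ALU_CTRL = {
    0x20: 0b0010,  # add
    0x22: 0b0110,  # sub
    0x24: 0b0000,  # and
    0x25: 0b0001,  # or
    0x2A: 0b0111,  # slt
}

def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and the 6-bit funct field to the 4-bit ALU control."""
    if alu_op == 0b00:            # lw / sw: add, funct is a don't care
        return 0b0010
    if alu_op == 0b01:            # beq: subtract, funct is a don't care
        return 0b0110
    if alu_op == 0b10:            # R-type: operation encoded in funct
        return FUNCT_TO_ALU_CTRL[funct & 0x3F]
    raise ValueError("unsupported ALUOp")
```

Note that the funct argument is ignored for ALUOp 00 and 01, which is exactly where a truth table would carry don't-care entries.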

43
ALU Control Input / ALUOp(2)

44
Mapping of ALUOp & Funct. Field to ALU Control Input (1)
 This style of using multiple levels of decoding (the main control unit generates the ALUOp bits, which are then used as input to the ALU control that generates the actual signals to control the ALU) is a common implementation technique.

 Using multiple levels of control can reduce the size of the main control unit. Using several smaller control units may also increase the speed of the control unit.

 Such optimizations are important, since the speed of the control unit is often critical to clock cycle time.

 There are several different ways to implement the mapping from the 2-bit ALUOp field and the 6-bit funct field to the four ALU operation control bits.

45
Mapping of ALUOp & Funct. Field to ALU Control Input (2)
 As only a small number of the 64 possible values of the function field are of interest and the function field is used only when the ALUOp bits equal 10, we can use a small piece of logic that recognizes the subset of possible values and causes the correct setting of the ALU control bits.

 As a step in designing this logic, it is useful to create a truth table for the interesting combinations of the function code field and the ALUOp bits.

46
Mapping of ALUOp & Funct. Field to ALU Control Input (3)

47
Don’t Care Terms
 In many instances we do not care about the values of some of the inputs, and because we wish to keep the tables compact, we include don’t-care terms.

 A don’t-care term in the truth table (represented by an X in an input column) indicates that the output does not depend on the value of the input corresponding to that column.

 When the ALUOp bits are 00, as in the first row of the truth table discussed, we always set the ALU control to 0010, independent of the function code. In this case, the function code inputs are don’t cares in this line of the truth table.

48
Control Signals Review

49
Source: H&P textbook
All control Signals Table(1)

 The first row of the table corresponds to the R-format instructions (add, sub, AND, OR, and slt).

 For all these instructions, the source register fields are rs and rt, and the destination register field is rd; this defines how the signals ALUSrc and RegDst are set.

 Furthermore, an R-type instruction writes a register (RegWrite=1), but neither reads nor writes data memory. When the Branch control signal is 0, the PC is unconditionally replaced with PC+4; otherwise, the PC is replaced by the branch target if the Zero output of the ALU is also high.

 The ALUOp field for R-type instructions is set to 10 to indicate that the ALU control should be generated from the funct field.

50
All control Signals Table(2)

 The second and third rows of this table give the control signal settings for lw and sw.

 The ALUSrc and ALUOp fields are set to perform the address calculation. MemRead and MemWrite are set to perform the memory access.

 Finally, RegDst and RegWrite are set for a load to cause the result to be stored into the rt register.

51
All control Signals Table(3)

 The branch instruction is similar to an R-format operation, since it sends the rs and rt registers to the ALU. The ALUOp field for branch is set for a subtract (ALUOp=01), which is used to test for equality.

 Notice that the MemtoReg field is irrelevant when the RegWrite signal is 0: since the register is not being written, the value of the data on the register data write port is not used. Thus, the entry MemtoReg in the last two rows of the table is replaced with X for don’t care.

 Don’t cares can also be added to RegDst when RegWrite is 0.

52
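The control settings described across the preceding slides can be gathered into one lookup table. This is a sketch under stated assumptions: the table image itself is missing from the extracted text, so the rows below are reconstructed from the prose together with the standard H&P single-cycle design, and the opcode values (0 for R-format, 35 for lw, 43 for sw, 4 for beq) are the standard MIPS encodings. `X`, `CONTROL`, `OPCODE_TO_CLASS`, and `main_control` are hypothetical names.

```python
X = None  # don't care: the value has no effect on the result

CONTROL = {
    "R-format": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                     MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    "lw":       dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                     MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    "sw":       dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
                     MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    "beq":      dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
                     MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}

# Assumed standard MIPS opcodes for the four instruction classes.
OPCODE_TO_CLASS = {0: "R-format", 35: "lw", 43: "sw", 4: "beq"}

def main_control(opcode):
    """Return the control signal settings for a 6-bit opcode."""
    return CONTROL[OPCODE_TO_CLASS[opcode]]
```

The don't-care entries appear exactly in the rows where RegWrite is 0, matching the observation that RegDst and MemtoReg are irrelevant when no register is written.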

53
Control Function for Single Cycle Implementation

54
Combining R-type & I-type Data paths

[Figure: datapath with PC, +4 adder, instruction memory, register file (RA, RB, RW, BusA, BusB, BusW, clk), extender (Imm16), and ALU]

 A mux selects RW as either Rt or Rd
 Another mux selects the 2nd ALU input as either data on BusB or the extended immediate
 Control signals
   ALUCtrl is derived from either the Op or the funct field
   RegWrite enables the writing of the ALU result
   ExtOp controls the extension of the 16-bit immediate
   RegDst selects the register destination as either Rt or Rd
   ALUSrc selects the 2nd ALU source as BusB or extended immediate
55
Controlling ALU Instructions

[Figure: two copies of the datapath, one for R-type and one for I-type ALU instructions; the active part of the datapath is shown in green]

 For R-type ALU instructions, RegWrite = 1, RegDst is ‘1’ to select Rd on RW, and ALUSrc is ‘0’ to select BusB as second ALU input.
 For I-type ALU instructions, RegWrite = 1, RegDst is ‘0’ to select Rt on RW, and ALUSrc is ‘1’ to select the extended immediate as second ALU input.

56
Details of the Extender
 Two types of extensions
   Zero-extension for unsigned constants
   Sign-extension for signed constants
 Control signal ExtOp indicates type of extension
 Extender Implementation: wiring and one AND gate

[Figure: Imm16 is wired straight to the lower 16 bits; every one of the upper 16 bits is driven by ExtOp ANDed with the sign bit of Imm16]

 ExtOp = 0 -> Upper16 = 0
 ExtOp = 1 -> Upper16 = sign bit
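The extender is small enough to model directly. The following Python sketch mirrors the "wiring and one AND gate" description; the function name `extend` is an assumption for this example.

```python
def extend(imm16, ext_op):
    """Extend a 16-bit immediate to 32 bits: zero-extend if ext_op is 0, sign-extend if 1."""
    imm16 &= 0xFFFF
    sign_bit = (imm16 >> 15) & 1
    upper_bit = ext_op & sign_bit          # the single AND gate
    upper16 = 0xFFFF if upper_bit else 0   # that one bit is fanned out to all 16 upper wires
    return (upper16 << 16) | imm16         # the lower 16 bits are just wiring
```

For example, `extend(0x8000, 1)` yields 0xFFFF8000 while `extend(0x8000, 0)` yields 0x00008000; positive immediates are unchanged by either mode.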

57
Adding Data Memory to Datapath

 A data memory is added for load and store instructions

[Figure: datapath extended with a data memory (Address, Data_in, Data_out) after the ALU and a mux in front of BusW]

 ALU calculates data memory address
 A 3rd mux selects data on BusW as either ALU result or memory Data_out
 BusB is connected to Data_in of Data Memory for store instructions
 Additional Control signals
   MemRead for load instructions
   MemWrite for store instructions
   MemtoReg selects data on BusW as ALU result or Memory Data_out

58
Controlling the Execution of Load

[Figure: datapath annotated with the control values for a load]

 RegDst = ‘0’ selects Rt as destination register
 RegWrite = ‘1’ to enable writing of register file
 ExtOp = 1 to sign-extend Immediate16 to 32 bits
 ALUSrc = ‘1’ selects extended immediate as second ALU input
 ALUCtrl = ‘ADD’ to calculate data memory address as Reg(Rs) + sign-extend(Imm16)
 MemRead = ‘1’ to read data memory
 MemWrite = ‘0’
 MemtoReg = ‘1’ places the data read from memory on BusW
 Clock edge updates PC and Register Rt

59
Controlling the Execution of Store

[Figure: datapath annotated with the control values for a store]

 RegDst = ‘X’ because no register is written
 RegWrite = ‘0’ to disable writing of register file
 ExtOp = 1 to sign-extend Immediate16 to 32 bits
 ALUSrc = ‘1’ selects extended immediate as second ALU input
 ALUCtrl = ‘ADD’ to calculate data memory address as Reg(Rs) + sign-extend(Imm16)
 MemRead = ‘0’
 MemWrite = ‘1’ to write data memory
 MemtoReg = ‘X’ because we don’t care what data is put on BusW
 Clock edge updates PC and Data Memory

60
Adding Jump and Branch to Datapath

[Figure: datapath with a Next PC block that takes Imm26, Imm16, the incremented PC, the ALU zero flag, and the J, Beq, Bne signals, and outputs the 30-bit jump or branch target address; a PCSrc mux selects between PC + 4 and that target. Note on the figure: two adder blocks as shown in previous slides]

 Additional Control Signals
   J, Beq, Bne for jump and branch instructions
   Zero flag of the ALU is examined
   Next PC logic computes jump or branch target address
   PCSrc = 1 for jump & taken branch instruction
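One plausible reading of the PCSrc rule, written as a Python sketch: the slide only says that PCSrc is 1 for a jump or a taken branch, so the exact gate-level form below (and the function name `pc_src`) is an assumption.

```python
def pc_src(j, beq, bne, zero):
    """Return 1 when the next PC should come from the jump/branch target, else 0."""
    taken_beq = beq and zero          # beq is taken when the operands are equal (Zero = 1)
    taken_bne = bne and not zero      # bne is taken when the operands differ (Zero = 0)
    return int(bool(j or taken_beq or taken_bne))
```

With this formulation a jump ignores the Zero flag entirely, while each branch type is taken for exactly one value of it.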

61
Controlling the Execution of Jump

[Figure: datapath annotated with the control values for a jump]

 J = 1 to control jump (Beq = 0, Bne = 0); PCSrc = 1
 Next PC outputs Jump Target Address
 MemRead, MemWrite, and RegWrite are 0
 We don’t care about RegDst, ExtOp, ALUSrc, ALUCtrl, and MemtoReg
 Clock edge updates PC register only

62
Controlling the Execution of Branch

[Figure: datapath annotated with the control values for a branch]

 Either Beq = 1 or Bne = 1 depending on opcode (J = 0)
 Next PC outputs branch target address
 ALUSrc = 0 to select value on BusB
 ALUCtrl = SUB to generate Zero Flag
 PCSrc = 1 if branch is taken
 RegWrite, MemRead, and MemWrite are 0
 RegDst, ExtOp, and MemtoReg are don’t cares
 Clock edge updates PC register only

63
Main Control and ALU Control

[Figure: the instruction feeds Op6 to Main Control and Op6 plus funct6 to ALU Control; Main Control drives RegDst, RegWrite, ExtOp, ALUSrc, MemRead, MemWrite, MemtoReg, Beq, Bne, and J; ALU Control drives ALUCtrl]

Main Control Input:
 6-bit opcode field from instruction
Main Control Output:
 10 control signals for the Datapath
ALU Control Input:
 6-bit opcode field from instruction
 6-bit function field from instruction
ALU Control Output:
 ALUCtrl signal for ALU

64
Single-Cycle Data path + Control

[Figure: complete single-cycle datapath with control. Main Control decodes Op and drives RegDst, RegWrite, ExtOp, ALUSrc, MemRead, MemWrite, MemtoReg, J, Beq, Bne, and ALUop; ALU Ctrl combines ALUop with func to drive the ALU; the Next PC block outputs the 30-bit jump or branch target address. Note on the figure: two adder blocks as shown in previous slides]

65
Drawbacks of Single Cycle Processor

 Long cycle time
 All instructions take as much time as the slowest instruction

ALU:    Instruction Fetch | Decode, Reg Read | ALU | Reg Write
Load:   Instruction Fetch | Decode, Reg Read | Compute Address | Memory Read | Reg Write (longest delay)
Store:  Instruction Fetch | Decode, Reg Read | Compute Address | Memory Write
Branch: Instruction Fetch | Reg Read, Br Target | Compare & PC Write
Jump:   Instruction Fetch | Decode, PC Write

66
Alternative: Multi cycle Implementation

 Break instruction execution into five steps
   Instruction fetch
   Instruction decode, register read, target address for jump/branch
   Execution, memory address calculation, or branch outcome
   Memory access or ALU instruction completion
   Load instruction completion
 One clock cycle per step (clock cycle is reduced)
 First 2 steps are the same for all instructions

Instruction | # cycles
ALU & Store | 4
Load        | 5
Branch      | 3
Jump        | 2

67
Performance Example
 Assume the following operation times for components:
 Instruction and data memories: 200 ps
 ALU and adders: 180 ps
 Decode and Register file access (read or write): 150 ps
 Ignore the delays in PC, mux, extender, and wires

 Which of the following would be faster and by how much?


 Single-cycle implementation for all instructions
 Multicycle implementation optimized for every class of instructions

 Assume the following instruction mix:


 40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps

68
Solution

Instruction Class | Instruction Memory | Register Read | ALU Operation | Data Memory | Register Write | Total
ALU    | 200 | 150 | 180 | -   | 150 | 680 ps
Load   | 200 | 150 | 180 | 200 | 150 | 880 ps
Store  | 200 | 150 | 180 | 200 | -   | 730 ps
Branch | 200 | 150 | 180 | -   | -   | 530 ps (ALU step: compare and write PC)
Jump   | 200 | 150 | -   | -   | -   | 350 ps (second step: decode and write PC)

 For fixed single-cycle implementation:
   Clock cycle = 880 ps determined by longest delay (load instruction)

 For multi-cycle implementation:
   Clock cycle = max (200, 150, 180) = 200 ps (maximum delay at any step)
   Average CPI = 0.4×4 + 0.2×5 + 0.1×4 + 0.2×3 + 0.1×2 = 3.8

 Speedup = 880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16
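The arithmetic above can be reproduced with a few lines of Python. The dictionary layout and variable names are assumptions for this sketch; the delays, instruction mix, and cycle counts are the ones given in the example.

```python
IMEM = DMEM = 200   # ps, instruction and data memories
ALU = 180           # ps, ALU and adders
REG = 150           # ps, decode and register file access (read or write)

# Components used by each instruction class, in order.
path = {
    "alu":    [IMEM, REG, ALU, REG],
    "load":   [IMEM, REG, ALU, DMEM, REG],
    "store":  [IMEM, REG, ALU, DMEM],
    "branch": [IMEM, REG, ALU],
    "jump":   [IMEM, REG],
}
mix = {"alu": 0.4, "load": 0.2, "store": 0.1, "branch": 0.2, "jump": 0.1}
cycles = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 2}

single_cycle_ps = max(sum(steps) for steps in path.values())  # slowest instruction
multi_cycle_ps = max(max(steps) for steps in path.values())   # slowest single step
avg_cpi = sum(mix[k] * cycles[k] for k in mix)
speedup = single_cycle_ps / (avg_cpi * multi_cycle_ps)
```

Changing the mix or the component delays shows how sensitive the modest speedup is to the share of loads, which are the only class that needs all five steps.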


Summary

 5 steps to design a processor
   Analyze instruction set => datapath requirements
   Select datapath components & establish clocking methodology
   Assemble datapath meeting the requirements
   Analyze implementation of each instruction to determine control signals
   Assemble the control logic
 MIPS makes Control easier
   Instructions are of same size
   Source registers always in same place
   Immediates are of same size and same location
   Operations are always on registers/immediates
 Single cycle datapath => CPI=1, but long clock cycle

70
