Ch#4 Part 1, 2,34
and Architecture
Chapter # 4
The Processor
Outline
4.1 Introduction
4.2 Logic Design Conventions
4.3 Building a Datapath
4.4 A Simple Implementation Scheme
4.5 An Overview of Pipelining
4.6 Pipelined Datapath and Control
4.7 Data Hazards: Forwarding versus Stalling
4.8 Control Hazards
4.9 Exceptions
4.10 Parallelism via Instructions
4.11 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines
Today’s Topic
Multi-cycle CPU
Basic MIPS Architecture
I-Format:  op (6 bits) | rs (5 bits) | rt (5 bits) | offset (16 bits)
J-Format:  op (6 bits) | address (26 bits)
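As a rough illustration of the two formats above, the fields can be extracted from a 32-bit instruction word with shifts and masks. This is a minimal sketch; the function name `decode` and the dictionary layout are assumptions made for this example, not part of the slides.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into I-format and J-format fields."""
    return {
        "op": (instr >> 26) & 0x3F,      # bits 31:26, 6-bit opcode
        "rs": (instr >> 21) & 0x1F,      # bits 25:21 (I-format)
        "rt": (instr >> 16) & 0x1F,      # bits 20:16 (I-format)
        "offset": instr & 0xFFFF,        # bits 15:0  (I-format)
        "address": instr & 0x3FFFFFF,    # bits 25:0  (J-format)
    }


# Encode lw $8, 32($18) by hand: op = 35, rs = 18, rt = 8, offset = 32
word = (35 << 26) | (18 << 21) | (8 << 16) | 32
fields = decode(word)
```

Which fields are meaningful depends on the opcode: an I-format instruction uses rs, rt, and offset, while a J-format instruction uses only the 26-bit address.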
Implementation Overview – Requirements
Single Cycle
perform each instruction in 1 clock cycle
clock cycle must be long enough for slowest instruction; therefore,
disadvantage: only as fast as slowest instruction
Multi-Cycle
break fetch/execute cycle into multiple steps
perform 1 step in each clock cycle
advantage: each instruction uses only as many cycles as it needs
Pipelined
execute each instruction in multiple steps
perform 1 step / instruction in each clock cycle
process multiple instructions in parallel – assembly line
Single Cycle – Abstract Design
• The Program Counter needs a clock: at every rising edge it records the PC value
for the next cycle.
• The instruction memory does not need a clock: once an address is provided as input,
the instruction appears at the output after a few picoseconds.
• The register file does not need a clock for reads: once the register numbers are provided (e.g. $t0 and
$t1), the values stored at those locations appear after a few picoseconds as the
inputs to the ALU block.
• The ALU does not need a clock: the ALU and adder units are combinational circuits and
need no clock to coordinate their inputs.
Implementing R-type Instructions
Control
Signal
If RegWrite = 1, the register file is written; otherwise the instruction does not write a register (e.g. a store or a branch)
Implementing Loads/Stores
Source: H&P textbook
Implementing J-type Instructions
Datapath: Instruction Store/Fetch & PC Increment
[Figure: a. Instruction memory (input: instruction address; output: instruction), b. Program counter (PC), c. Adder (output: sum)]
Three elements are used to store and fetch instructions: the instruction memory, the program counter, and an adder that increments the PC by 4
Animating the Datapath
[Figure: the PC drives the ADDR input of the instruction memory; the RD output delivers the instruction]
Datapath: R-Type Instruction
[Figure: a. Registers (with the RegWrite control signal), b. ALU]
Animating the Datapath
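[Figure: register file with 5-bit inputs RN1, RN2, and WN, data input WD, outputs RD1 and RD2, and control signal RegWrite; ALU with a 3-bit Operation input and a Zero output]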
Datapath: Load/Store Instruction
[Figure: register file, ALU (3-bit ALU operation input, Zero output), 16-to-32-bit sign-extend unit, and data memory (Address, Write data, Read data), with control signals RegWrite, MemRead, and MemWrite]
Animating the Datapath
[Figure: data memory with ports ADDR, WD, and RD; register file output RD2; 16-to-32-bit sign extender (EXTND); control signals RegWrite, MemRead, and MemWrite]
Datapath: Branch Instruction
[Figure: register file, ALU (Zero output to the branch control logic), 16-to-32-bit sign-extend unit, and shift-left-2 unit, with control signals RegWrite and ALU operation]
Animating the Datapath
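[Figure: instruction fields op, rs, rt, offset/immediate; register file (RN1, RN2, WN, WD, RD1, RD2, RegWrite); ALU with Zero output; 16-to-32-bit sign extender (EXTND) followed by <<2; ADD unit that combines PC+4 from the instruction datapath with the shifted offset]
beq rs, rt, offset
if (R[rs] == R[rt]) then
PC <- PC+4 + (s_extend(offset) << 2)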
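The beq behaviour above can be sketched in a few lines. This is an illustrative model only: the helper names `sign_extend16` and `beq_next_pc` are assumptions for this example, and the PC is treated as a 32-bit byte address.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def beq_next_pc(pc: int, rs_val: int, rt_val: int, offset: int) -> int:
    """Return the next PC for beq: branch target if equal, otherwise PC+4."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    if rs_val == rt_val:
        # Sign-extend first, then shift left by 2 (word offset to byte offset)
        return (pc_plus_4 + (sign_extend16(offset) << 2)) & 0xFFFFFFFF
    return pc_plus_4
```

The order matters: the 16-bit offset is sign-extended to 32 bits before the shift, so a negative offset still produces a backward branch.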
Implementation including Multiplexers
(View from 10,000 feet)
Source: H&P textbook
Animating the Datapath:
R-type Instruction
[Figure: datapath with multiplexers (ALUSrc, MemtoReg), register file output RD2, sign extender (EXTND), and data memory (ADDR, WD, RD); control signals RegWrite, MemRead, and MemWrite]
Animating the Datapath:
Load Instruction
Instruction: lw rt,offset(rs)
[Figure: datapath with multiplexers (ALUSrc, MemtoReg), register file (RN1, RN2, WN, WD, RD1, RD2), ALU with Zero output, sign extender (EXTND), and data memory (ADDR, WD, RD); control signals RegWrite, MemRead, and MemWrite]
Animating the Datapath:
Store Instruction
Instruction: sw rt,offset(rs)
[Figure: datapath with multiplexers (ALUSrc, MemtoReg), register file (RN1, RN2, WN, WD, RD1, RD2), ALU with Zero output, sign extender (EXTND), and data memory (ADDR, WD, RD); control signals RegWrite, MemRead, and MemWrite]
Function Elements
Combinational Elements
State Elements
State elements contain data in internal storage
All state elements together define the state of the machine
Flip-flops and latches are 1-bit state elements; equivalently,
they are 1-bit memories
The output(s) of a flip-flop or latch always depend on the bit
value stored, i.e., its state, and can be called 1/0 or high/low
or true/false
The input to a flip-flop or latch can change its state depending
on whether it is clocked or not
Instruction and data memories, as well as the registers, are all
examples of state elements.
State / Combinational Elements
Clock cycle
Clocking Methodology
The approach used to determine when data is valid and stable
relative to the clock.
An edge-triggered clocking methodology means that any
values stored in a sequential logic element are updated only
on a clock edge (a transition from low to high or from high to low)
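To make the idea concrete, here is a small sketch of a state element that updates only on a clock edge. It assumes rising-edge triggering, and the class name `Register` and its `tick` method are hypothetical names chosen for this example.

```python
class Register:
    """State element that captures its input only on a rising clock edge."""

    def __init__(self, value: int = 0):
        self.value = value      # stored state, always visible at the output
        self._prev_clk = 0      # previous clock level, used to detect an edge

    def tick(self, clk: int, d: int) -> int:
        """Present clock level clk and input d; return the stored value."""
        if self._prev_clk == 0 and clk == 1:   # rising edge (assumed polarity)
            self.value = d
        self._prev_clk = clk
        return self.value


pc = Register(0)
trace = []
for clk in (0, 1, 1, 0, 1):
    next_value = pc.value + 4            # combinational logic between edges
    trace.append(pc.tick(clk, next_value))
```

Between edges the input may change freely without affecting the stored value, which is why the combinational logic has a full clock cycle to settle.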
Edge Triggered Clocking Methodology(1)
A clocking scheme in which all state changes occur on a clock
edge.
[Figure: state element 1 -> combinational logic -> state element 2, spanning one clock cycle]
Edge Triggered Clocking Methodology(2)
[Figure: state element 1 -> combinational logic -> state element 2, spanning one clock cycle]
Control Signal
Revisiting MIPS inst. format
View from 5,000 Feet
Source: H&P textbook
Control signals overview
RegDst: which instr. field to use for dst. register
specifier?
Inst[20:16] vs. Inst[15:11]
Example: lw r8, 32(r18)
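[Figure: datapath trace for lw r8, 32(r18)]
Opcode = 35 (lw), rs = 18, rt = 8, offset = 32; Branch = 0, so the next PC is PC+4
Register r18 holds 1000; RegDst = 0 selects rt (r8) as the destination register
ALUSrc = 1: the ALU adds the base register and the offset, 1000 + 32 = 1032
MemRead: the data memory returns 0x11223344 from address 1032
MemtoReg = 1 and RegWrite: 0x11223344 is written to r8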
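The lw r8, 32(r18) example can be replayed with a toy model. This is a sketch under simplifying assumptions: registers are a Python list, data memory is a dictionary keyed by byte address, and `execute_lw` is a hypothetical helper, not anything defined in the slides.

```python
regs = [0] * 32
regs[18] = 1000                      # base register value from the example
memory = {1032: 0x11223344}          # data memory contents from the example


def execute_lw(rs: int, rt: int, offset: int):
    """Model lw rt, offset(rs) and return the address and loaded data."""
    address = regs[rs] + offset      # ALUSrc = 1: ALU adds base and offset
    data = memory[address]           # MemRead = 1: read the data memory
    regs[rt] = data                  # MemtoReg = 1, RegDst = 0, RegWrite = 1
    return address, data


address, data = execute_lw(rs=18, rt=8, offset=32)
```

After the call, register 8 holds the word read from address 1032, matching the values in the trace.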
ALU Control Lines(1)
The MIPS ALU defines the following 6
combinations of its four control inputs:
0000 AND
0001 OR
0010 add
0110 subtract
0111 set on less than
1100 NOR
ALU Control Lines(2)
For load word and store word instructions, we use the
ALU to compute the memory address by addition.
ALU Control Input / ALUOp(1)
We can generate the 4-bit ALU control input using a
small control unit that has as inputs the function field of
the instruction and a 2-bit control field, which we call
ALUOp.
ALU Control Input / ALUOp(2)
Mapping of ALUOp & Funct. Field to ALU Control
Input (1)
This style of using multiple levels of decoding (the main
control unit generates the ALUOp bits, which are then used as input
to the ALU control that generates the actual signals to control the
ALU) is a common implementation technique.
Using multiple levels of control can reduce the size of the main
control unit. Using several smaller control units may also
increase the speed of the control unit.
Such optimizations are important, since the speed of the control unit
is often critical to clock cycle time.
There are several different ways to implement the mapping from the
2-bit ALUOp field and the 6-bit funct field to the four ALU
operation control bits.
Mapping of ALUOp & Funct. Field to ALU Control
Input (2)
As only a small number of the 64 possible values of the function
field are of interest and the function field is used only when the
ALUOp bits equal 10, we can use a small piece of logic that
recognizes the subset of possible values and causes the correct
setting of the ALU control bits.
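One possible way to express this mapping in software is a small lookup function. This is a sketch, not the textbook's gate-level logic; it assumes the usual MIPS funct codes for add, sub, and, or, and slt, and the encoding ALUOp = 00 for lw/sw, 01 for beq, and 10 for R-type. The names `alu_control` and `FUNCT_TO_ALU` are made up for this example.

```python
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# Only a handful of the 64 possible funct values matter
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,   # add
    0b100010: ALU_SUB,   # sub
    0b100100: ALU_AND,   # and
    0b100101: ALU_OR,    # or
    0b101010: ALU_SLT,   # slt
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Map the 2-bit ALUOp and the 6-bit funct field to the 4-bit ALU control."""
    if alu_op == 0b00:               # lw/sw: add, funct is a don't care
        return ALU_ADD
    if alu_op == 0b01:               # beq: subtract, funct is a don't care
        return ALU_SUB
    if alu_op == 0b10:               # R-type: decode the funct field
        return FUNCT_TO_ALU[funct]
    raise ValueError("unsupported ALUOp")
```

The funct field is consulted only in the ALUOp = 10 branch, which mirrors the observation that it is a don't care for loads, stores, and branches.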
Mapping of ALUOp & Funct. Field to ALU Control
Input (3)
Don’t Care Terms
In many instances we do not care about the values of some of the
inputs, and because we wish to keep the tables compact, we
include don’t-care terms.
For example, when the ALUOp bits are 00, as in the first row of the truth table,
we always set the ALU control to 0010, independent of
the function code. In this case, the function code inputs are
don’t cares in this line of the truth table.
Control Signals Review
Source: H&P textbook
All control Signals Table(1)
The first row of the table corresponds to the R-format instructions (add, sub, AND,
OR, and slt).
For all these instructions, the source register fields are rs and rt, and the destination
register field is rd; this defines how the signals ALUSrc and RegDst are set.
The ALUOp field for R-type instructions is set to 10 to indicate that the ALU control
should be generated from the funct. field
All control Signals Table(2)
The second and third rows of this table give the control signal settings for lw
and sw.
The ALUSrc and ALUOp fields are set to perform the address calculation.
MemRead and MemWrite are set to perform the memory access.
Finally, RegDst and RegWrite are set for a load to cause the result to be
stored into the rt register.
All control Signals Table(3)
Notice that the MemtoReg field is irrelevant when the RegWrite signal is 0:
since the register is not being written, the value of the data on the register
data write port is not used. Thus, the entry MemtoReg in the last two rows of
the table is replaced with X for don’t care
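The control settings discussed for this table can be written down as a lookup keyed by opcode. The sketch below assumes the four-row table from the H&P textbook (R-format, lw, sw, beq) with the standard MIPS opcodes 0, 35, 43, and 4; `None` stands in for a don't care (X), and the name `main_control` is hypothetical.

```python
X = None  # don't care

# Rows: R-format (op 0), lw (op 35), sw (op 43), beq (op 4)
CONTROL = {
    0:  dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
             MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
             MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    43: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
             MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    4:  dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
             MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}


def main_control(opcode: int) -> dict:
    """Return the control signal settings for a 6-bit opcode."""
    return CONTROL[opcode]
```

Every row with RegWrite = 0 has MemtoReg set to X, since the value on the register write port is never used in those cases.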
All control Signals Table(4)
Control Function for Single Cycle Implementation
Combining R-type & I-type Data paths
[Figure: combined R-type and I-type datapath: PC with +4 incrementer, instruction memory, register file (RA, RB, RW, BusA, BusB, BusW), extender for Imm16, and ALU]
A mux selects RW as either Rt or Rd
Another mux selects the 2nd ALU input as either data on BusB or the extended immediate
Control signals
ALUCtrl is derived from either the Op or the funct field
RegWrite enables the writing of the ALU result
ExtOp controls the extension of the 16-bit immediate
RegDst selects the register destination as either Rt or Rd
ALUSrc selects the 2nd ALU source as BusB or extended
immediate
Controlling ALU Instructions
For R-type ALU instructions, RegWrite = 1, RegDst is ‘1’ to select Rd on RW, and ALUSrc is ‘0’ to select BusB as the second ALU input.
[Figure: R-type datapath; the active part of the datapath is shown in green]
For I-type ALU instructions, RegWrite = 1, RegDst is ‘0’ to select Rt on RW, and ALUSrc is ‘1’ to select the extended immediate as the second ALU input.
[Figure: I-type datapath; the active part of the datapath is shown in green]
Details of the Extender
Two types of extensions
Zero-extension for unsigned constants
Sign-extension for signed constants
ExtOp = 0: Upper16 = 0
ExtOp = 1: Upper16 = sign bit
[Figure: extender; Imm16 passes through as the lower 16 bits, and ExtOp selects the upper 16 bits]
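The extender behaviour can be sketched as a single function. This is an illustrative model, with the function name `extend` chosen for this example; it follows the ExtOp encoding given above (0 for zero-extension, 1 for sign-extension).

```python
def extend(imm16: int, ext_op: int) -> int:
    """Extend a 16-bit immediate to 32 bits under control of ExtOp."""
    imm16 &= 0xFFFF
    sign_bit = (imm16 >> 15) & 1
    # ExtOp = 0: upper 16 bits are 0; ExtOp = 1: upper 16 bits copy the sign bit
    upper16 = 0xFFFF if (ext_op == 1 and sign_bit == 1) else 0x0000
    return (upper16 << 16) | imm16
```

The lower 16 bits always pass through unchanged; only the upper half depends on ExtOp.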
Adding Data Memory to Datapath
[Figure: datapath extended with a data memory (Address, Data_in, Data_out)]
ALU calculates the data memory address
BusB is connected to Data_in of the Data Memory for store instructions
A 3rd mux selects data on BusW as either the ALU result or memory Data_out
Additional control signals
MemRead for load instructions
MemWrite for store instructions
MemtoReg selects data on BusW as ALU result or Memory Data_out
Controlling the Execution of Load
Control signals: ExtOp = 1, ALUSrc = 1, ALUCtrl = ADD, MemRead = 1, MemWrite = 0, MemtoReg = 1, RegDst = 0, RegWrite = 1
MemRead = ‘1’ to read the data memory
MemtoReg = ‘1’ places the data read from memory on BusW
The clock edge updates the PC and register Rt
[Figure: datapath with the load path active]
Controlling the Execution of Store
Control signals: ExtOp = 1, ALUSrc = 1, ALUCtrl = ADD, MemRead = 0, MemWrite = 1, MemtoReg = X, RegDst = X, RegWrite = 0
[Figure: datapath with the store path active]
Adding Jump and Branch to Datapath
[Figure: datapath extended with a Next PC block that takes Imm26, Imm16, the ALU zero flag, and the control signals J, Beq, and Bne, and produces PCSrc and the 30-bit jump or branch target address; the block contains two adders, as shown in previous slides]
Control signals: RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg
Controlling the Execution of Jump
Control signals: J = 1, Beq = 0, Bne = 0, PCSrc = 1, RegDst = X, RegWrite = 0
[Figure: datapath with the 30-bit jump target address path active]
Controlling the Execution of Branch
Control signals: J = 0, Beq = 1, Bne = 0, RegDst = X, RegWrite = 0
Zero = 1 and PCSrc = 1 (branch taken)
[Figure: datapath with the 30-bit branch target address path active]
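The jump and branch cases can be tied together with a sketch of the next-PC selection. The slides give only the signal values, so the combining equation PCSrc = J or (Beq and Zero) or (Bne and not Zero) is an assumption here, as are the function names. The slides use a 30-bit word-addressed PC; this sketch uses 32-bit byte addresses for readability.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pc_src(j: int, beq: int, bne: int, zero: int) -> int:
    """Assumed logic: take the target on a jump or a satisfied branch."""
    return int(bool(j or (beq and zero) or (bne and not zero)))


def next_pc(pc: int, j: int, beq: int, bne: int, zero: int,
            imm26: int = 0, imm16: int = 0) -> int:
    """Select PC+4, the jump target, or the branch target."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    if not pc_src(j, beq, bne, zero):
        return pc_plus_4
    if j:
        # Jump: upper 4 bits of PC+4 concatenated with Imm26 and two zero bits
        return (pc_plus_4 & 0xF0000000) | ((imm26 & 0x3FFFFFF) << 2)
    # Branch: PC+4 plus the sign-extended, word-aligned offset
    return (pc_plus_4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
```

With J = 1 the sketch selects the jump target regardless of Zero; with Beq = 1 it selects the branch target only when Zero = 1.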
Main Control and ALU Control
[Figure: the instruction memory feeds the Main Control (input Op6) and the ALU Control (inputs Op6 and funct6); the Main Control drives the datapath signals RegDst, RegWrite, ExtOp, ALUSrc, MemRead, MemWrite, MemtoReg, Beq, Bne, and J; the ALU Control drives ALUCtrl]
Main Control Input:
6-bit opcode field from instruction
Main Control Output:
10 control signals for the Datapath
ALU Control Input:
6-bit opcode field from instruction
6-bit function field from instruction
ALU Control Output:
ALUCtrl signal for ALU
Single-Cycle Data path + Control
[Figure: complete single-cycle datapath with the Main Control (input Op; outputs RegDst, RegWrite, ExtOp, ALUSrc, MemRead, MemWrite, MemtoReg, ALUop) and the ALU Control (inputs ALUop and func; output ALUCtrl); the Next PC block contains two adders, as shown in previous slides]
Drawbacks of Single Cycle Processor
[Figure: timing of two instructions in a single-cycle processor]
Load (longest delay): Instruction Fetch, Decode / Reg Read, Compute Address, Memory Read, Reg Write
Jump: Instruction Fetch, Decode, PC Write
Alternative: Multi cycle Implementation
Performance Example
Assume the following operation times for components:
Instruction and data memories: 200 ps
ALU and adders: 180 ps
Decode and Register file access (read or write): 150 ps
Ignore the delays in PC, mux, extender, and wires
Solution
Instruction  Instruction  Register  ALU        Data    Register  Total
Class        Memory       Read      Operation  Memory  Write
ALU          200          150       180                150       680 ps
Load         200          150       180        200     150       880 ps
Store        200          150       180        200               730 ps
Branch       200          150       180 (compare and write PC)   530 ps
Jump         200          150 (decode and write PC)              350 ps
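The totals in the solution can be reproduced by summing the delays of the units each instruction class uses. This is a small sketch; the dictionary names and stage labels are chosen for this example, and the delays are the ones given in the problem statement.

```python
# Component delays in picoseconds, from the problem statement
DELAY = {"imem": 200, "reg_read": 150, "alu": 180, "dmem": 200, "reg_write": 150}

# Units used by each instruction class
STAGES = {
    "ALU":    ["imem", "reg_read", "alu", "reg_write"],
    "Load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "Store":  ["imem", "reg_read", "alu", "dmem"],
    "Branch": ["imem", "reg_read", "alu"],
    "Jump":   ["imem", "reg_read"],
}

totals = {cls: sum(DELAY[s] for s in stages) for cls, stages in STAGES.items()}

# A single-cycle clock must accommodate the slowest instruction class
single_cycle_clock = max(totals.values())
```

The load determines the single-cycle clock period, so every other instruction class also pays the full load latency.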