Chapter 6, Part III: Processor Structure and Function


Processor Organization

Processor Requirements:
 Fetch instruction
 The processor reads an instruction from memory (register, cache, main memory)

 Interpret instruction
 The instruction is decoded to determine what action is required

 Fetch data
 The execution of an instruction may require reading data from memory or an I/O
module

 Process data
 The execution of an instruction may require performing some arithmetic or logical
operation on data

 Write data
 The results of an execution may require writing data to memory or an I/O module

 In order to do these things the processor needs to store some data temporarily and therefore needs a small internal memory

Register Organization
 Within the processor there is a set of registers that
function as a level of memory above main memory and
cache in the hierarchy

 The registers in the processor perform two roles:

 User-Visible Registers
  Enable the machine or assembly language programmer to minimize main memory references by optimizing use of registers

 Control and Status Registers
  Used by the control unit to control the operation of the processor and by privileged operating system programs to control the execution of programs
User-Visible Registers

 Referenced by means of the machine language that the processor executes

 Categories:
 • General purpose
   • Can be assigned to a variety of functions by the programmer
 • Data
   • May be used only to hold data and cannot be employed in the calculation of an operand address
 • Address
   • May be somewhat general purpose or may be devoted to a particular addressing mode
   • Examples: segment pointers, index registers, stack pointer
 • Condition codes
   • Also referred to as flags
   • Bits set by the processor hardware as the result of operations
Table 14.1 Condition Codes

Control and Status Registers

 Four registers are essential to instruction execution:

  Program counter (PC)
   Contains the address of an instruction to be fetched

  Instruction register (IR)
   Contains the instruction most recently fetched

  Memory address register (MAR)
   Contains the address of a location in memory

  Memory buffer register (MBR)
   Contains a word of data to be written to memory or the word most recently read
Program Status Word (PSW)

 Register or set of registers that contain status information

 Common fields or flags include (see the sketch below):
 • Sign
 • Zero
 • Carry
 • Equal
 • Overflow
 • Interrupt Enable/Disable
 • Supervisor
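
As a rough illustration of how such flags are typically packed into one register, here is a hedged C sketch; the names and bit positions are hypothetical, not from any particular ISA:

    #include <stdint.h>

    /* Hypothetical PSW layout: one bit per flag in a 32-bit register. */
    enum {
        PSW_SIGN       = 1u << 0,  /* result was negative */
        PSW_ZERO       = 1u << 1,  /* result was zero */
        PSW_CARRY      = 1u << 2,  /* carry out of the most significant bit */
        PSW_EQUAL      = 1u << 3,  /* comparison operands were equal */
        PSW_OVERFLOW   = 1u << 4,  /* signed arithmetic overflow */
        PSW_INT_ENABLE = 1u << 5,  /* interrupts enabled */
        PSW_SUPERVISOR = 1u << 6   /* executing in supervisor mode */
    };

    /* Recompute the sign and zero flags after an ALU result. */
    static uint32_t update_flags(uint32_t psw, int32_t result) {
        psw &= ~(uint32_t)(PSW_ZERO | PSW_SIGN);
        if (result == 0) psw |= PSW_ZERO;
        if (result < 0)  psw |= PSW_SIGN;
        return psw;
    }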
The point of this comparison should be clear. There is no universally accepted philosophy concerning the best way to organize processor registers [TOON81]. As with overall instruction set design and so many other processor design issues, it is still a matter of judgment and taste.

A second instructive point concerning register organization design is illustrated in Figure 14.3c, which shows the user-visible register organization for the Intel 80386 [ELAY85], a 32-bit microprocessor designed as an extension of the 8086. The 80386 uses 32-bit registers. However, to provide upward compatibility for programs written on the earlier machine, the 80386 retains the original register organization embedded in the new organization. Given this design constraint, the architects of the 32-bit processors had limited flexibility in designing the register organization.

Instruction Cycle

 Includes the following stages:

 Fetch: Read the next instruction from memory into the processor
 Execute: Interpret the opcode and perform the indicated operation
 Interrupt: If interrupts are enabled and an interrupt has occurred, save the current process state and service the interrupt

• The main line of activity consists of alternating instruction fetch and instruction
execution activities.
• After an instruction is fetched, it is examined to determine if any indirect
addressing is involved. If so, the required operands are fetched using indirect
addressing.
• Following execution, an interrupt may be processed before the next instruction
fetch.

During the fetch cycle, an instruction is read from memory. Figure 14.6 shows the flow of data
during this cycle. The PC contains the address of the next instruction to be fetched. This address is
moved to the MAR and placed on the address bus. The control unit requests a memory read, and
the result is placed on the data bus and copied into the MBR and then moved to the IR.
Meanwhile, the PC is incremented by 1, preparatory for the next fetch.
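
The data flow of the fetch cycle can be written as a short sequence of register transfers. A minimal C sketch of a toy, word-addressed machine (all names here are illustrative, not from any real ISA):

    #include <stdint.h>

    #define MEM_WORDS 4096
    uint32_t memory[MEM_WORDS];
    uint32_t PC, IR, MAR, MBR;

    /* Fetch cycle: PC -> MAR -> memory read -> MBR -> IR, then bump the PC. */
    void fetch_cycle(void) {
        MAR = PC;           /* address of the next instruction */
        MBR = memory[MAR];  /* control unit requests a memory read */
        IR  = MBR;          /* instruction moves into the IR for decoding */
        PC  = PC + 1;       /* preparatory for the next fetch */
    }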

Once the fetch cycle is over, the control unit examines the contents of the IR to determine if it contains an operand specifier using indirect addressing. If so, an indirect cycle is performed. As shown in Figure 14.7, this is a simple cycle. The rightmost N bits of the MBR, which contain the address reference, are transferred to the MAR. Then the control unit requests a memory read, to get the desired address of the operand into the MBR.
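
Continuing the same toy-machine sketch, the indirect cycle replaces the address reference in the MBR with the word it points to; the field width N is an assumption for illustration:

    #define ADDR_BITS 12                      /* assumed width N of the address field */
    #define ADDR_MASK ((1u << ADDR_BITS) - 1)

    /* Indirect cycle: the rightmost N bits of the MBR hold an address;
       read that location to get the operand's actual address into the MBR. */
    void indirect_cycle(void) {
        MAR = MBR & ADDR_MASK;
        MBR = memory[MAR];
    }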

Like the fetch and indirect cycles, the interrupt cycle is simple and predictable (Figure 14.8). The
current contents of the PC must be saved so that the processor can resume normal activity after the
interrupt. Thus, the contents of the PC are transferred to the MBR to be written into memory. The
special memory location reserved for this purpose is loaded into the MAR from the control unit. It
might, for example, be a stack pointer. The PC is loaded with the address of the interrupt routine. As a
result, the next instruction cycle will begin by fetching the appropriate instruction.
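
In the same sketch, the interrupt cycle saves the PC and redirects it; the reserved save location and handler address below are placeholders, not real values:

    #define PC_SAVE_ADDR 0x010u  /* assumed location reserved for the saved PC */
    #define ISR_ADDR     0x020u  /* assumed address of the interrupt routine */

    /* Interrupt cycle: save the PC so the processor can resume later,
       then load the PC with the address of the interrupt routine. */
    void interrupt_cycle(void) {
        MBR = PC;                /* PC -> MBR */
        MAR = PC_SAVE_ADDR;      /* save address supplied by the control unit */
        memory[MAR] = MBR;       /* write old PC to memory */
        PC  = ISR_ADDR;          /* next fetch comes from the interrupt routine */
    }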

Pipelining Strategy

 Similar to the use of an assembly line in a manufacturing plant

 To apply this concept to instruction execution we must recognize that an instruction has a number of stages

 New inputs are accepted at one end before previously accepted inputs appear as outputs at the other end

While the second stage is executing the instruction, the first stage takes advantage of any unused memory cycles to fetch and buffer the next
instruction. This is called instruction prefetch or fetch overlap. Note that this approach, which involves instruction buffering, requires more
registers. In general, pipelining requires registers to store data between stages.

It should be clear that this process will speed up instruction execution. If the fetch and execute stages were of equal duration, the instruction
cycle time would be halved. However, if we look more closely at this pipeline (Figure 14.9b), we will see that this doubling of execution rate
is unlikely for two reasons:
-The execution time will generally be longer than the fetch time. Execution will involve reading and storing operands and the performance
of some operation. Thus, the fetch stage may have to wait for some time before it can empty its buffer.
-A conditional branch instruction makes the address of the next instruction to be fetched unknown. Thus, the fetch stage must wait until it
receives the next instruction address from the execute stage. The execute stage may then have to wait while the next instruction is fetched.

Guessing can reduce the time loss from the second reason. A simple rule is the following: When a conditional branch instruction is passed on
from the fetch to the execute stage, the fetch stage fetches the next instruction in memory after the branch instruction. Then, if the branch is
not taken, no time is lost. If the branch is taken, the fetched instruction must be discarded and a new instruction fetched.
Additional Stages

 Fetch instruction (FI)
  Read the next expected instruction into a buffer

 Decode instruction (DI)
  Determine the opcode and the operand specifiers

 Calculate operands (CO)
  Calculate the effective address of each source operand
  This may involve displacement, register indirect, indirect, or other forms of address calculation

 Fetch operands (FO)
  Fetch each operand from memory
  Operands in registers need not be fetched

 Execute instruction (EI)
  Perform the indicated operation and store the result, if any, in the specified destination operand location

 Write operand (WO)
  Store the result in memory

For the sake of illustration, let us assume equal duration. Using this assumption, Figure 14.10 shows that a six-
stage pipeline can reduce the execution time for 9 instructions from 54 time units to 14 time units.
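
The arithmetic behind the 54-versus-14 claim is worth spelling out; a quick check in C, with n and k taken from the figure:

    #include <stdio.h>

    int main(void) {
        int n = 9;  /* instructions */
        int k = 6;  /* pipeline stages, each of equal duration */

        /* Without pipelining, each instruction takes k units end to end. */
        int sequential = n * k;       /* 9 * 6 = 54 time units */

        /* With pipelining, the first instruction completes after k units
           and one more completes every unit thereafter. */
        int pipelined = k + (n - 1);  /* 6 + 8 = 14 time units */

        printf("sequential=%d pipelined=%d\n", sequential, pipelined);
        return 0;
    }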

Several comments are in order: The diagram assumes that each instruction goes through all six stages of the
pipeline. This will not always be the case. For example, a load instruction does not need the WO stage. However,
to simplify the pipeline hardware, the timing is set up assuming that each instruction requires all six stages. Also,
the diagram assumes that all of the stages can be performed in parallel. In particular, it is assumed that there are no
memory conflicts. For example, the FI, FO, and WO stages involve a memory access. The diagram implies that all
these accesses can occur simultaneously. Most memory systems will not permit that. However, the desired value
may be in cache, or the FO or WO stage may be null. Thus, much of the time, memory conflicts will not slow
down the pipeline.

Another difficulty is the conditional branch instruction, which can invalidate several instruction fetches.
A similar unpredictable event is an interrupt.
Figure 14.11 illustrates the effects of the conditional branch, using the same program as Figure 14.10.
Assume that instruction 3 is a conditional branch to instruction 15. Until the instruction is executed,
there is no way of knowing which instruction will come next. The pipeline, in this example, simply loads
the next instruction in sequence (instruction 4) and proceeds. The branch is not determined until the end
of time unit 7. At this point, the pipeline must be cleared of instructions that are not useful. During time
unit 8, instruction 15 enters the pipeline. No instructions complete during time units 9 through 12; this is
the performance penalty incurred because we could not anticipate the branch.

Figure 14.12 indicates the logic needed for pipelining to account for branches and
interrupts.

In Figure 14.13a (which corresponds to Figure 14.10), the pipeline is full at time 6, with 6 different instructions in
various stages of execution, and remains full through time 9; we assume that instruction I9 is the last instruction to be
executed. In Figure 14.13b, (which corresponds to Figure 14.11), the pipeline is full at times 6 and 7. At time 7,
instruction 3 is in the execute stage and executes a branch to instruction 15. At this point, instructions I4 through I7
are flushed from the pipeline, so that at time 8, only two instructions are in the pipeline, I3 and I15.

Figure 14.14a plots the speedup factor as a function of the number of instructions that are executed without a branch. Figure 14.14b shows the speedup factor as a function of the number of stages in the instruction pipeline. Thus, the larger the number of pipeline stages, the greater the potential for speedup. However, as a practical matter, the potential gains of additional pipeline stages are countered by increases in cost, delays between stages, and the fact that branches will be encountered, requiring the flushing of the pipeline.
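
The shape of those curves follows from the usual ideal-pipeline expression; a minimal sketch, assuming k equal-duration stages and n instructions with no branches:

    /* Ideal speedup of a k-stage pipeline over sequential execution:
       S = (n * k) / (k + (n - 1)).  As n grows large, S approaches k. */
    double pipeline_speedup(int n, int k) {
        return (double)(n * k) / (double)(k + (n - 1));
    }

For the running example, pipeline_speedup(9, 6) = 54.0/14 ≈ 3.86.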
Pipeline Hazards

 A pipeline hazard occurs when the pipeline, or some portion of the pipeline, must stall because conditions do not permit continued execution

 Such a pipeline stall is also referred to as a pipeline bubble

 There are three types of hazards:
 • Resource
 • Data
 • Control

A resource hazard occurs when two (or more) instructions that are already in the pipeline need the same resource. The result is that the instructions must be executed in serial rather than in parallel for a portion of the pipeline. A resource hazard is sometimes referred to as a structural hazard.

Now assume that main memory has a single port and that all instruction fetches and data reads and writes must be performed one at a time. Further, ignore the cache. In this case, an operand read from or write to memory cannot be performed in parallel with an instruction fetch. This is illustrated in Figure 14.15b, which assumes that the source operand for instruction I1 is in memory, rather than a register. Therefore, the fetch instruction stage of the pipeline must idle for one cycle before beginning the instruction fetch for instruction I3. The figure assumes that all other operands are in registers.
Figure 14.16 Example of Data Hazard

Clock cycle     1    2    3    4     5    6    7    8    9    10
ADD EAX, EBX    FI   DI   FO   EI    WO
SUB ECX, EAX         FI   DI   Idle  FO   EI   WO
I3                        FI   DI    FO   EI   WO
I4                             FI    DI   FO   EI   WO

ADD EAX, EBX   /* EAX = EAX + EBX */
SUB ECX, EAX   /* ECX = ECX - EAX */

A data hazard occurs when there is a conflict in the access of an operand location. In general terms, we
can state the hazard in this form: Two instructions in a program are to be executed in sequence and both
access a particular memory or register operand. If the two instructions are executed in strict sequence, no
problem occurs. However, if the instructions are executed in a pipeline, then it is possible for the operand
value to be updated in such a way as to produce a different result than would occur with strict sequential
execution. In other words, the program produces an incorrect result because of the use of pipelining.
Types of Data Hazard

 Read after write (RAW), or true dependency
  An instruction modifies a register or memory location
  Succeeding instruction reads data in that memory or register location
  Hazard occurs if the read takes place before the write operation is complete

 Write after read (WAR), or antidependency
  An instruction reads a register or memory location
  Succeeding instruction writes to the location
  Hazard occurs if the write operation completes before the read operation takes place

 Write after write (WAW), or output dependency
  Two instructions both write to the same location
  Hazard occurs if the write operations take place in the reverse order of the intended sequence (see the sketch after this list for examples of all three)
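
A hedged sketch of the three dependency types, reusing the register names from the earlier x86 fragment as plain C variables (illustration only; a real pipeline sees these as register reads and writes):

    int EAX = 1, EBX = 2, ECX = 3;  /* registers modeled as variables */

    void hazard_examples(void) {
        /* RAW (true dependency): SUB must see the result ADD writes. */
        EAX = EAX + EBX;   /* ADD EAX, EBX -- writes EAX */
        ECX = ECX - EAX;   /* SUB ECX, EAX -- reads EAX  */

        /* WAR (antidependency): the write to EBX must not overtake the read. */
        ECX = EAX + EBX;   /* reads EBX  */
        EBX = 0;           /* writes EBX */

        /* WAW (output dependency): both writes target EAX; order matters. */
        EAX = ECX + 1;     /* first write to EAX  */
        EAX = EBX + 2;     /* second write to EAX */
    }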
Control Hazard

 Also known as a branch hazard
 Occurs when the pipeline makes the wrong decision on a branch prediction
 Brings instructions into the pipeline that must subsequently be discarded

 Dealing with Branches:
  Multiple streams
  Prefetch branch target
  Loop buffer
  Branch prediction
  Delayed branch
Multiple Streams

 A simple pipeline suffers a penalty for a branch instruction because it must choose one of two instructions to fetch next and may make the wrong choice

 A brute-force approach is to replicate the initial portions of the pipeline and allow the pipeline to fetch both instructions, making use of two streams

 Drawbacks:
 • With multiple pipelines there are contention delays for access to the registers and to memory
 • Additional branch instructions may enter the pipeline before the original branch decision is resolved

Prefetch Branch Target

 When a conditional branch is recognized, the target of the branch is prefetched, in addition to the instruction following the branch
 The target is then saved until the branch instruction is executed
 If the branch is taken, the target has already been prefetched

 The IBM 360/91 uses this approach

Loop Buffer

 Small, very high-speed memory maintained by the instruction fetch stage of the pipeline, containing the n most recently fetched instructions, in sequence

 Benefits:
  Instructions fetched in sequence will be available without the usual memory access time
  If a branch occurs to a target just a few locations ahead of the address of the branch instruction, the target will already be in the buffer
  This strategy is particularly well suited to dealing with loops

 Similar in principle to a cache dedicated to instructions
 Differences:
  The loop buffer only retains instructions in sequence
  It is much smaller in size and hence lower in cost

Figure 14.17 Loop Buffer: a 256-byte buffer holds recently fetched instructions; the least significant 8 bits of the branch address index the buffer (selecting the instruction to be decoded in case of a hit), while the most significant address bits are compared to determine a hit.

Figure 14.17 gives an example of a loop buffer. If the buffer contains 256 bytes, and byte
addressing is used, then the least significant 8 bits are used to index the buffer. The remaining
most significant bits are checked to determine if the branch target lies within the environment
captured by the buffer.
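
That hit check reduces to bit slicing on the fetch address. A minimal C sketch, assuming the 256-byte buffer of Figure 14.17 (the names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define BUF_SIZE 256u  /* loop buffer capacity in bytes */

    uint8_t  loop_buf[BUF_SIZE];
    uint32_t buf_base;     /* base address of the region the buffer currently
                              holds (low 8 bits zero) */

    /* The least significant 8 bits index the buffer; the remaining most
       significant bits must match for a hit. */
    bool loop_buf_hit(uint32_t addr) {
        return (addr & ~(BUF_SIZE - 1u)) == buf_base;
    }

    uint8_t loop_buf_fetch(uint32_t addr) {
        return loop_buf[addr & (BUF_SIZE - 1u)];
    }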
Among the machines using a loop buffer are some of the CDC machines (Star-100, 6600, 7600) and the CRAY-1. A specialized form of loop buffer is available on the Motorola 68010, for executing a three-instruction loop involving the DBcc (decrement and branch on condition) instruction (see Problem 14.14). A three-word buffer is maintained, and the processor executes these instructions repeatedly until the loop condition is satisfied.
Branch Prediction

 Various techniques can be used to predict whether a branch will be taken:

 Static approaches (they do not depend on the execution history up to the time of the conditional branch instruction):
 1. Predict never taken
 2. Predict always taken
 3. Predict by opcode

 Dynamic approaches (they depend on the execution history):
 4. Taken/not taken switch
 5. Branch history table

The first two approaches are the simplest. These either always assume that the branch will not be taken and continue to fetch instructions in sequence, or always assume that the branch will be taken and fetch from the branch target. The predict-never-taken approach is the most popular of all the branch prediction methods. Studies analyzing program behavior have shown that conditional branches are taken more than 50% of the time [LILJ88], so if the cost of prefetching from either path is the same, then always prefetching from the branch target address should give better performance than always prefetching from the sequential path.

The final static approach makes the decision based on the opcode of the branch instruction. The processor assumes that the
branch will be taken for certain branch opcodes and not for others. [LILJ88] reports success rates of greater than 75% with
this strategy.

With a single bit, all that can be recorded is whether the last execution of this instruction resulted in a branch or not. A
shortcoming of using a single bit appears in the case of a conditional branch instruction that is almost always taken, such
as a loop instruction. With only one bit of history, an error in prediction will occur twice for each use of the loop: once on
entering the loop, and once on exiting.

If two bits are used, they can be used to record the result of the last two instances of the execution of the associated
instruction, or to record a state in some other fashion. Figure 14.18 shows a typical approach (see Problem 14.13 for other
possibilities). Assume that the algorithm starts at the upper-left-hand corner of the flowchart. As long as each succeeding
conditional branch instruction that is encountered is taken, the decision process predicts that the next branch will be
taken. If a single prediction is wrong, the algorithm continues to predict that the next branch is taken. Only if two
successive branches are not taken does the algorithm shift to the right-hand side of the flowchart. Subsequently, the
algorithm will predict that branches are not taken until two branches in a row are taken. Thus, the algorithm requires two
consecutive wrong predictions to change the prediction decision.
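
A compact way to realize this decision process is a 2-bit saturating counter per branch. A sketch of that standard scheme in C (the state encoding is an assumption; the behavior matches the description above):

    /* States 0-1 predict taken, 2-3 predict not taken. Two consecutive
       wrong predictions are required to flip the prediction. */
    typedef enum {
        TAKEN_STRONG, TAKEN_WEAK, NOT_TAKEN_WEAK, NOT_TAKEN_STRONG
    } pred_t;

    int predict_taken(pred_t s) {
        return s == TAKEN_STRONG || s == TAKEN_WEAK;
    }

    pred_t update(pred_t s, int was_taken) {
        if (was_taken)   /* move toward TAKEN_STRONG, saturating */
            return (s == TAKEN_STRONG) ? TAKEN_STRONG : (pred_t)(s - 1);
        else             /* move toward NOT_TAKEN_STRONG, saturating */
            return (s == NOT_TAKEN_STRONG) ? NOT_TAKEN_STRONG : (pred_t)(s + 1);
    }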

The decision process can be represented more compactly by a finite-state machine, shown in Figure 14.19.

The use of history bits, as just described, has one drawback: if the decision is made to take the branch, the target instruction cannot be fetched until the target address, which is an operand in the conditional branch instruction, is decoded. Greater efficiency could be achieved if the instruction fetch could be initiated as soon as the branch decision is made. For this purpose, more information must be saved, in what is known as a branch target buffer, or a branch history table.
Figure 14.20 contrasts the branch history table scheme with a predict-never-taken strategy. With the predict-never-taken strategy (Figure 14.20a), the instruction fetch stage always fetches the next sequential address. If a branch is taken, some logic in the processor detects this and instructs that the next instruction be fetched from the target address (in addition to flushing the pipeline).

With the branch history table strategy (Figure 14.20b), the table is treated as a cache. Each entry holds the branch instruction's address, the target address, and state information, and the fetch address is supplied by the IPFAR (instruction prefix address register). Each prefetch triggers a lookup in the branch history table. If no match is found, the next sequential address is used for the fetch. If a match is found, a prediction is made based on the state of the instruction: either the next sequential address or the branch target address is fed to the select logic.

When the branch instruction is executed, the execute stage signals the branch history table logic with the result. The state of the instruction is updated to reflect a correct or incorrect prediction. If the prediction is incorrect, the select logic is redirected to the correct address for the next fetch. When a conditional branch instruction is encountered that is not in the table, it is added to the table and one of the existing entries is discarded, using one of the cache replacement algorithms discussed in Chapter 4.

Figure 14.20 Dealing with Branches
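
The lookup/update protocol above maps naturally onto a small cache-like structure. A minimal sketch of a direct-mapped branch history table, reusing the 2-bit state idea from earlier; the table size, field names, and 4-byte instruction width are assumptions:

    #include <stdint.h>

    #define BHT_ENTRIES 256u  /* assumed table size, direct-mapped */

    struct bht_entry {
        uint32_t branch_addr;  /* address of the branch instruction */
        uint32_t target_addr;  /* branch target address */
        uint8_t  state;        /* 2-bit prediction state; 0-1 = predict taken */
        uint8_t  valid;
    };
    static struct bht_entry bht[BHT_ENTRIES];

    /* On prefetch: if the fetch address hits and the state predicts taken,
       fetch from the stored target; otherwise use the next sequential address. */
    uint32_t next_fetch_addr(uint32_t fetch_addr) {
        struct bht_entry *e = &bht[fetch_addr % BHT_ENTRIES];
        if (e->valid && e->branch_addr == fetch_addr && e->state <= 1)
            return e->target_addr;
        return fetch_addr + 4;  /* assumed fixed 4-byte instructions */
    }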
Intel 80486 Pipelining

 Fetch
  Operates independently of the other stages to keep the prefetch buffers full
  Objective is to fill the prefetch buffers with new data as soon as the old data have been consumed by the instruction decoder

 Decode stage 1 (D1)
  All opcode and addressing-mode information is decoded in the D1 stage
  3 bytes of instruction are passed to the D1 stage from the prefetch buffers
  The D1 decoder can then direct the D2 stage to capture the rest of the instruction

 Decode stage 2 (D2)
  Expands each opcode into control signals for the ALU
  Also controls the computation of the more complex addressing modes

 Execute
  Stage includes ALU operations, cache access, and register update

 Write back
  Updates registers and status flags modified during the preceding execute stage
Figure 14.21 shows examples of the operation of the pipeline. Figure 14.21a shows that there is no delay introduced into the pipeline when a memory access is required.

However, as Figure 14.21b shows, there can be a delay for values used to compute memory addresses. That is, if a value is loaded from memory into a register and that register is then used as a base register in the next instruction, the processor will stall for one cycle.

Figure 14.21c illustrates the timing of a branch instruction, assuming that the branch is taken. The compare instruction updates condition codes in the WB stage, and bypass paths make this available to the EX stage of the jump instruction at the same time. In parallel, the processor runs a speculative fetch cycle to the target of the jump during the EX stage of the jump instruction. If the processor determines a false branch condition, it discards this prefetch and continues execution with the next sequential instruction (already fetched and decoded).
The x86 Processor Family

 The x86 organization has evolved dramatically over the years. In this section we examine some of the details of the most recent processor organizations, concentrating on common elements in single processors.

 Register Organization:
  General: There are eight 32-bit general-purpose registers (see Figure 14.3c). These may be used for all types of x86 instructions; they can also hold operands for address calculations. In addition, some of these registers also serve special purposes. For example, string instructions use the contents of the ECX, ESI, and EDI registers as operands without having to reference these registers explicitly in the instruction. As a result, a number of instructions can be encoded more compactly. In 64-bit mode, there are sixteen 64-bit general-purpose registers.
Table 14.2 x86 Processor Registers

Interrupt Processing

Interrupts and Exceptions

 Interrupts
  Generated by a signal from hardware; may occur at random times during the execution of a program
  Maskable: the processor does not recognize a maskable interrupt unless the interrupt enable flag (IF) is set
  Nonmaskable: recognition of such interrupts cannot be prevented

 Exceptions
  Generated from software; provoked by the execution of an instruction
  Processor detected
  Programmed

 Interrupt vector table
  Every type of interrupt is assigned a number
  The number is used to index into the interrupt vector table (a dispatch sketch follows below)
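
Dispatch through such a table is an indexed lookup. A minimal C sketch; the table size and handler type are simplifications for illustration, not the actual x86 descriptor format:

    #define NUM_VECTORS 256u

    typedef void (*isr_t)(void);            /* handler entry point */
    static isr_t vector_table[NUM_VECTORS];

    /* The interrupt number indexes the table to locate the handler. */
    void dispatch_interrupt(unsigned vector) {
        if (vector < NUM_VECTORS && vector_table[vector] != 0)
            vector_table[vector]();         /* transfer control to the routine */
    }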
Table 14.3 x86 Exception and Interrupt Vector Table
The ARM Processor

 ARM is primarily a RISC system with the following attributes:
  Moderate array of uniform registers
  A load/store model of data processing in which operations perform only on operands in registers and not directly in memory
  A uniform fixed-length instruction of 32 bits for the standard set and 16 bits for the Thumb instruction set
  Separate arithmetic logic unit (ALU) and shifter units
  A small number of addressing modes, with all load/store addresses determined from registers and instruction fields
  Auto-increment and auto-decrement addressing modes used to improve the operation of program loops
  Conditional execution of instructions, which minimizes the need for conditional branch instructions, thereby improving pipeline efficiency, because pipeline flushing is reduced
ARM data processing instructions typically have two source registers, Rn and Rm, and a single result or destination register, Rd. The source register values feed into the ALU or a separate multiply unit that makes use of an additional register to accumulate partial results. The ARM processor also includes a hardware unit that can shift or rotate the Rm value before it enters the ALU. This shift or rotate occurs within the cycle time of the instruction and increases the power and flexibility of many data processing operations.
Processor Modes

 The ARM architecture supports seven execution modes

 Most application programs execute in user mode
 • While the processor is in user mode, the program being executed is unable to access protected system resources or to change mode, other than by causing an exception to occur

 The remaining six execution modes are referred to as privileged modes
 • These modes are used to run system software

 Advantages to defining so many different privileged modes:
 • The OS can tailor the use of system software to a variety of circumstances
 • Certain registers are dedicated for use by each of the privileged modes, which allows swifter changes in context
Exception Modes

 Have full access to system resources and can change modes freely
 Entered when specific exceptions occur

 The exception modes are:
 • Supervisor mode: Usually what the OS runs in. It is entered when the processor encounters a software interrupt instruction. Software interrupts are a standard way to invoke operating system services on ARM.
 • Abort mode: Entered in response to memory faults.
 • Undefined mode: Entered when the processor attempts to execute an instruction that is supported neither by the main integer core nor by one of the coprocessors.
 • Fast interrupt mode: Entered whenever the processor receives an interrupt signal from the designated fast interrupt source. A fast interrupt cannot be interrupted, but a fast interrupt may interrupt a normal interrupt.
 • Interrupt mode: Entered whenever the processor receives an interrupt signal from any other interrupt source (other than fast interrupt). An interrupt may only be interrupted by a fast interrupt.

 The remaining privileged mode is the System mode. This mode is not entered by any exception and uses the same registers available in User mode. The System mode is used for running certain privileged operating system tasks. System mode tasks may be interrupted by any of the five exception categories.
 The ARM processor has a total of 37 32-bit registers, classified as follows:
 • Thirty-one registers referred to in the ARM manual as general-purpose registers. In fact, some of these, such as the program counters, have special purposes.
 • Six program status registers.

 Registers are arranged in partially overlapping banks, with the current processor mode determining which bank is available. At any time, sixteen numbered registers and one or two program status registers are visible, for a total of 17 or 18 software-visible registers.

 Registers R0 through R7, register R15 (the program counter), and the current program status register (CPSR) are visible in and shared by all modes.

 Registers R8 through R12 are shared by all modes except fast interrupt, which has its own dedicated registers R8_fiq through R12_fiq.

 All the exception modes have their own versions of registers R13 and R14.

 All the exception modes have a dedicated saved program status register (SPSR). (A toy sketch of mode-based bank selection follows below.)
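
Bank selection can be pictured as a mode-indexed mapping from register number to physical register. A toy C sketch covering only the user and fast-interrupt banks (a simplification; as noted above, real ARM also banks R13/R14 for each exception mode):

    #include <stdint.h>

    enum mode { MODE_USER, MODE_FIQ };

    static uint32_t regs_user[16];      /* R0-R15 as seen in user mode */
    static uint32_t regs_fiq_8_12[5];   /* dedicated R8_fiq .. R12_fiq */

    /* Map an architectural register number to its physical storage,
       depending on the current processor mode. */
    uint32_t *reg_ptr(enum mode m, int n) {
        if (m == MODE_FIQ && n >= 8 && n <= 12)
            return &regs_fiq_8_12[n - 8];
        return &regs_user[n];           /* shared by all modes */
    }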
Table 14.4 ARM Interrupt Vector
Summary: Processor Structure and Function (Chapter 6)
 Processor organization
 Register organization
  User-visible registers
  Control and status registers
 Instruction cycle
  The indirect cycle
  Data flow
 Instruction pipelining
  Pipelining strategy
  Pipeline performance
  Pipeline hazards
  Dealing with branches
  Intel 80486 pipelining
 The x86 processor family
  Register organization
  Interrupt processing
 The ARM processor
  Processor organization
  Processor modes
  Register organization
  Interrupt processing