Unit 3 Students
UNIT 3
• Basic Processing Unit: Concepts - Instruction Execution - Hardware Components - Instruction Fetch and
Execution Steps -Control Signals - Hardwired Control.
• Pipelining: Basic Concept - Pipeline Organization- Pipelining Issues - Data Dependencies - Memory Delays -
Branch Delays - Resource Limitations - Performance Evaluation -Superscalar Operation.
22-Sep-2024
Processor Organization

[Figure: single-bus processor organization. An internal processor bus connects the PC, the IR, the MAR, the MDR, the general-purpose registers R0 through R(n-1), the temporary registers Y and TEMP, and the ALU. The instruction decoder and control logic generate the control signals. A multiplexer (MUX) selects either register Y or the constant 4 as one ALU input (A); the other input (B) comes from the bus. ALU control lines select operations such as Add, Sub, and XOR, with a carry-in. The MAR and MDR connect the processor to memory through the address lines and data lines of the memory bus.]
Datapath
• The primary function of a computer system is to execute a program, a sequence of instructions. These instructions are stored in the computer's memory.
• These instructions are executed to process data that have already been loaded into the computer's memory through some input device.
• After processing the data, the result is either stored in the memory for further reference, or it is sent to the outside world through some output port.
• To execute an instruction, the processor contains, in addition to the arithmetic logic unit and the control unit, a number of registers used for temporary storage of data, as well as some special-function registers.
• The special-function registers include the program counter (PC), the instruction register (IR), the memory address register (MAR), and the memory data register (MDR).
• The program counter is one of the most critical registers in the CPU. It monitors the execution of instructions: it keeps track of which instruction is being executed and what the next instruction will be.
• The instruction register (IR) holds the instruction that is currently being executed.
• The contents of the IR are available to the control unit, which generates the timing signals that control the various processing elements involved in executing the instruction.
• The two registers MAR and MDR handle data transfers between the main memory and the processor.
• The MAR holds the address of the main memory location to or from which data are to be transferred.
• The MDR contains the data to be written into or read from the addressed word of the main memory.
Instruction Execution
• It refers to the process by which a computer's processor (CPU) carries out the instructions specified in a program.
This process involves several key steps and components.
1. Instruction Fetch
• Program Counter (PC): The CPU uses the Program Counter to keep track of the address of the next
instruction to execute.
• Memory Access: The instruction is fetched from memory (RAM) based on the address in the Program
Counter.
2. Instruction Decode
• Instruction Register (IR): The fetched instruction is loaded into the Instruction Register.
• Decoding: The CPU decodes the instruction to determine what operation is to be performed. This involves
interpreting the opcode (operation code) and identifying the operands (data or addresses involved).
3. Operand Fetch
• Registers or Memory Access: Depending on the instruction, operands might need to be fetched from registers
or memory locations. For instructions that involve data in memory, the CPU may need to access specific
memory locations.
4. Instruction Execution
• Execution Unit: The actual operation specified by the instruction is performed by the CPU's execution units.
This could involve arithmetic operations (e.g., addition, subtraction), logical operations (e.g., AND, OR), or
data movement operations (e.g., loading data into a register).
5. Write Back
• Update Registers or Memory: The result of the execution is written back to a register or memory location,
depending on the instruction’s requirements.
6. Update Program Counter
• Next Instruction Address: The Program Counter is updated to point to the address of the next instruction,
preparing the CPU to fetch the next instruction in the sequence.
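The six-step cycle above can be sketched as a toy interpreter loop. The accumulator-style instruction set (LOAD/ADD/STORE/HALT) is invented purely for illustration and does not correspond to any real ISA:

```python
# Toy fetch-decode-execute loop for a hypothetical accumulator machine.
# The program is a list of (opcode, operand) tuples; memory is a dict.

def run(program, memory, registers):
    pc = 0                                  # Program Counter
    while pc < len(program):
        ir = program[pc]                    # 1. Fetch: instruction into IR
        opcode, operand = ir                # 2. Decode: split opcode/operand
        if opcode == "LOAD":                # 3. Operand fetch from memory
            registers["ACC"] = memory[operand]
        elif opcode == "ADD":               # 4. Execute: arithmetic in the ALU
            registers["ACC"] += memory[operand]
        elif opcode == "STORE":             # 5. Write back: result to memory
            memory[operand] = registers["ACC"]
        elif opcode == "HALT":
            break
        pc += 1                             # 6. Update PC to next instruction
    return registers, memory

# Example: compute memory[2] = memory[0] + memory[1]
mem = {0: 5, 1: 7, 2: 0}
regs, mem = run([("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)],
                mem, {"ACC": 0})
print(mem[2])   # 12
```

A real CPU performs these steps in hardware and often overlaps them across instructions, as the pipelining discussion later in this unit describes.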
Key Concepts
• Pipelining: Modern CPUs use pipelining to overlap these stages for improved performance. Different
instructions can be at different stages of execution simultaneously.
• Instruction Set Architecture (ISA): The ISA defines the set of instructions a CPU can execute and the format of
these instructions.
• Control Unit: This part of the CPU coordinates the fetching, decoding, execution, and write-back processes.
• Data Path: The internal pathways through which data and instructions travel within the CPU.
• Clock Cycles: Each step of the process typically takes one or more clock cycles, which are the basic unit of
time in the CPU’s operation.
• This sequence of steps is often referred to as the Instruction Cycle or Fetch-Decode-Execute Cycle. The
specific details of instruction execution can vary depending on the architecture of the CPU, such as whether it's
a RISC (Reduced Instruction Set Computer) or CISC (Complex Instruction Set Computer) architecture.
1. Fetch Phase:
▪ Fetch the instruction from the memory location pointed to by the PC, load it into the IR, and increment the PC.
2. Decode Phase:
▪ Decode the instruction from the IR.
3. Execute Phase
▪ Carry out the actions specified by the instruction in the IR by executing it.
▪ Performing the operation specified in the instruction constitutes the instruction execution phase. With
few exceptions, the operation specified by an instruction can be carried out by performing one or
more of the following actions:
➢ Read the contents of a given memory location and load them into a processor register.
➢ Perform an arithmetic or logic operation and place the result into a processor register.
Hardware Components
• The processor communicates with the memory through the
processor-memory interface, which transfers data from and
to the memory during Read and Write operations.
• The instruction address generator updates the contents of
the PC after every instruction is fetched.
• The register file is a memory unit whose storage locations
are organized to form the processor’s general-purpose
registers.
• During execution, the contents of the registers named in
an instruction that performs an arithmetic or logic
operation are sent to the arithmetic and logic unit (ALU),
which performs the required computation. The results of
the computation are stored in a register in the register file.
Register File
• General-purpose registers are usually implemented in the form of a register file, which is a small and fast
memory block. It consists of an array of storage elements, with access circuitry that enables data to be read
from or written into any register.
• The access circuitry is designed to enable two registers to be read at the same time, making their contents
available at two separate outputs, A and B.
• The register file has two address inputs that select the two registers to be read. These inputs are connected to
the fields in the IR that specify the source registers, so that the required registers can be read. The register file
also has a data input, C, and a corresponding address input to select the register into which data are to be
written. This address input is connected to the IR field that specifies the destination register of the instruction.
• The inputs and outputs of any memory unit are often called input and output ports.
• Two alternatives for implementing a dual-ported register file. A memory unit that has two output ports is said to
be dual-ported.
• One possibility is to use a single set of registers with duplicate data paths and access circuitry that enable two
registers to be read at the same time.
• An alternative is to use two memory blocks, each containing one copy of the register file. Whenever data are
written into a register, they are written into both copies of that register. Thus, the two files have identical
contents. When an instruction requires data from two registers, one register is accessed in each file. In effect, the
two register files together function as a single dual-ported register file.
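The behavior of a dual-ported register file can be modeled in a few lines. The class and method names below are illustrative; only the port structure (two read ports A and B, one write port C) comes from the text:

```python
class DualPortedRegisterFile:
    """Behavioral model: two read ports (A, B) and one write port (C)."""

    def __init__(self, n):
        self.regs = [0] * n

    def read(self, addr_a, addr_b):
        # Both registers are read in the same cycle via separate ports,
        # so their contents are available at outputs A and B simultaneously.
        return self.regs[addr_a], self.regs[addr_b]

    def write(self, addr_c, data_c):
        # Data input C is written into the register selected by the
        # destination-register field of the IR.
        self.regs[addr_c] = data_c

rf = DualPortedRegisterFile(8)
rf.write(3, 10)
rf.write(5, 20)
a, b = rf.read(3, 5)   # simultaneous read of R3 and R5
print(a + b)           # 30
```

In the two-copy hardware alternative described above, `write` would update both memory blocks, and each `read` port would be served by its own block.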
ALU
• The arithmetic and logic unit is used to manipulate data. It performs arithmetic operations such as addition and
subtraction, and logic operations such as AND, OR, and XOR.
• When an instruction that performs an arithmetic or logic operation is
being executed, the contents of the two registers specified in the
instruction are read from the register file and become available at
outputs A and B.
• Output A is connected directly to the first input of the ALU, InA, and
output B is connected to a multiplexer, MuxB.
• The multiplexer selects either output B of the register file or the
immediate value in the IR to be connected to the second ALU input,
InB.
• The output of the ALU is connected to the data input, C, of the register
file so that the results of a computation can be loaded into the
destination register.
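The path from the register file through MuxB into the ALU can be sketched as follows. The operation names and the boolean select signal are assumptions for illustration:

```python
def mux_b(select_imm, reg_b, immediate):
    # MuxB: forwards either register-file output B or the immediate
    # value from the IR to the ALU's second input, InB.
    return immediate if select_imm else reg_b

def alu(op, in_a, in_b):
    # A small combinational ALU covering the operations named in the text.
    ops = {"ADD": in_a + in_b, "SUB": in_a - in_b,
           "AND": in_a & in_b, "OR": in_a | in_b, "XOR": in_a ^ in_b}
    return ops[op]

# Register-immediate add: InA = register output A, InB = IR immediate.
result = alu("ADD", 9, mux_b(True, reg_b=4, immediate=6))
print(result)  # 15
```

In hardware the ALU output would then drive data input C of the register file, loading the result into the destination register.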
Datapath
• Instruction processing consists of two phases: the fetch phase and the execution phase. It is convenient to divide
the processor hardware into two corresponding sections. One section fetches instructions and the other executes
them.
• The section that fetches instructions is also responsible for decoding them and for generating the control signals
that cause appropriate actions to take place in the execution section.
• The execution section reads the data operands specified in an instruction, performs the required computations,
and stores the results
• An instruction is fetched in step 1 by hardware stage 1 and placed into the IR. It is decoded, and its source
registers are read in step 2. The information in the IR is used to generate the control signals for all subsequent
steps. Therefore, the IR must continue to hold the instruction until its execution is completed.
• Data read from the register file are placed in registers RA and RB. Register RA provides the data to input InA of
the ALU. Multiplexer MuxB forwards either the contents of RB or the immediate value in the IR to the ALU’s
second input, InB. The ALU constitutes stage 3, and the result of the computation it performs is placed in
register RZ.
• Recall that for computational instructions, such as an Add instruction, no processing actions take place in step 4.
• During that step, multiplexer MuxY selects register RZ, and its contents are placed in register RY. The contents of RY are transferred to the register file in step 5 and loaded into the destination register.
• For this reason, the register file is in both stages 2 and 5. It
is a part of stage 2 because it contains the source registers
and a part of stage 5 because it contains the destination
register.
• For Load and Store instructions, the effective address of the
memory operand is computed by the ALU in step 3 and
loaded into register RZ. From there, it is sent to the
memory, which is stage 4.
1. Load Instructions
▪ The above example uses the Index addressing mode to load a word of data
from memory location X + [R7] into register R5.
22
11
22-Sep-2024
6. Use the sum X + [R7] as the effective address of the source operand, and read the contents of that
location in the memory.
7. Load the data received from the memory into the destination register, R5.
Instructions that involve an arithmetic or logic operation can be executed using similar steps.
▪ There are either two source registers, or a source register and an immediate source operand.
▪ Once the instruction is loaded into the IR, the immediate value is available for use in the
addition operation.
▪ The step sequence can be used, with steps 2 and 3 modified as:
3. Store Instructions
▪ The five-step sequence is suitable for all Load and Store instructions, because the addressing modes that can be
used in these instructions are special cases of the Index mode.
▪ Most RISC-style processors provide one general-purpose register, usually register R0, that always contains the
value zero.
Store R1, X(R0) ; Store the value in register R1 to memory address (X + R0)
▪ When R0 is used as the index register, the effective address of the operand is the immediate value X. This is
the Absolute addressing mode.
▪ Alternatively, if the offset X is set to zero, the effective address is the contents of the index register, Ri. This is the Register-indirect addressing mode.
6. Interrupt Signals
• Notify the CPU that an event requiring immediate attention has occurred.
• When an interrupt signal is received, the CPU temporarily halts its current operations to address the interrupting event.
7. Control Bus Signals
• Manage various control operations across the system.
• Include signals such as Memory Read (MEMR), Memory Write (MEMW), and Input/Output Read (IOR), among others.
8. Status Signals
• Provide information about the status of different components.
• Signals such as Interrupt Request (IRQ) or Flag Status might indicate the status of a device or the result of an operation.
9. Bus Control Signals
• Manage access to the system bus.
• Signals like Bus Request (BRQ) and Bus Grant (BG) control access to the shared system bus, coordinating which
component can use the bus at a given time.
10. DMA (Direct Memory Access) Signals
• Manage data transfers directly between memory and I/O devices without CPU intervention.
• Signals such as DMA Request (DMARQ) and DMA Acknowledge (DMAACK) coordinate the data transfer process in DMA
operations.
• Control signals are essential for the proper operation of a computer system. They ensure that different parts of the system
communicate and coordinate effectively, allowing for efficient execution of instructions and management of data.
• A Central Processing Unit is the most important component of a computer system. A control unit is a part of
the CPU.
• A control unit controls the operations of all parts of the computer but it does not carry out any data processing
operations.
Functions of the Control Unit
• It coordinates the sequence of data movements into, out of, and between a processor’s many sub-units.
• It interprets instructions.
• It controls data flow inside the processor.
• It receives external instructions or commands, which it converts into sequences of control signals.
• It controls the many execution units (e.g., the ALU, data buffers, and registers) contained within a CPU.
• It also handles multiple tasks, such as fetching, decoding, handling execution, and storing results.
Types of Control Unit
There are two types of control units:
• Hardwired control unit
• Micro-programmed control unit
• The generated opcode bits are received by the instruction decoder, which interprets the operation and the instruction's addressing mode. Based on the addressing mode and the operation present in the instruction register, the instruction decoder sets the corresponding instruction signal INSi to 1.
• Five steps are used to execute each instruction, i.e., instruction fetch, decode, operand fetch, ALU and
memory store.
• The control unit must know which step of the instruction is currently being executed. For this, a Step Counter is implemented, which provides the signals T1, ..., T5. Based on the step the instruction is in, exactly one of the step counter's signals T1 to T5 is set to 1.
• One clock cycle is completed for each step. For example, if the step counter sets T3 to 1, then after one clock cycle is completed, the step counter will set T4 to 1.
• What will happen if the execution of an instruction is interrupted for some reason? Will the step counter still be triggered by the clock? No. Until the current step's execution is completed, the Counter Enable signal disables the Step Counter, so it stops incrementing to the next step signal.
• What if the execution of an instruction depends on some condition? In this case, the Condition Signals are used. Various condition signals, such as less than, greater than, less than or equal, and greater than or equal, are generated and supplied to the control signal generator.
• The last input is the external input. It informs the Control Signal Generator about interrupts, which affect the execution of an instruction.
• So, based on the inputs obtained from the condition signals, the step counter, the external inputs, and the instruction register, the control signals are generated by the Control Signal Generator.
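The idea that a hardwired control unit is a combinational function of the step signal and the decoded instruction can be sketched as a lookup table. The signal names (PC_out, MAR_in, and so on) are illustrative, not taken from any particular design:

```python
# Hardwired control sketch: the control signals are a pure combinational
# function of the current step (T1..T5) and the decoded instruction (INSi).
CONTROL_TABLE = {
    # (step, instruction): set of asserted control signals
    (1, "ANY"): {"PC_out", "MAR_in", "Read", "PC_increment"},  # fetch
    (2, "ANY"): {"MDR_out", "IR_in"},                          # decode
    (3, "ADD"): {"RA_out", "RB_out", "ALU_add"},               # compute
    (4, "ADD"): set(),                                         # no memory access
    (5, "ADD"): {"RZ_out", "RF_write"},                        # write back
}

def control_signals(step, instruction):
    # Instruction-specific entry wins; otherwise fall back to the
    # common ("ANY") behavior for that step.
    return CONTROL_TABLE.get((step, instruction),
                             CONTROL_TABLE.get((step, "ANY"), set()))

print(sorted(control_signals(3, "ADD")))
```

A real hardwired design realizes this table directly in gates, which is why it is fast but inflexible compared to a micro-programmed unit.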
Hardwired control unit: used in computers that employ RISC (Reduced Instruction Set Computer) architectures. The hardware generates only the required control signals directly, so this control unit is faster than the micro-programmed control unit.

Micro-programmed control unit: used in computers that employ CISC (Complex Instruction Set Computer) architectures. Microinstructions are used to generate the control signals, so this control unit is slower than the hardwired control unit.
Pipelining
• To improve the performance of a CPU we have two options:
• Improve the hardware by introducing faster circuits.
• Arrange the hardware such that more than one operation can be performed at the same time. Since there
is a limit on the speed of hardware and the cost of faster circuits is quite high, we have to adopt the
2nd option.
• Pipelining is a process of arrangement of hardware elements of the CPU such that its overall performance is
increased. Simultaneous execution of more than one instruction takes place in a pipelined processor.
• Let us see a real-life example that works on the concept of pipelined operation. Consider a water bottle
packaging plant. Let there be 3 stages that a bottle should pass through, Inserting the bottle(I), Filling water in
the bottle(F), and Sealing the bottle(S).
• Let us consider these stages as stage 1, stage 2, and stage 3 respectively. Let each stage take 1 minute to
complete its operation. Now, in a non-pipelined operation, a bottle is first inserted in the plant, after 1 minute
it is moved to stage 2 where water is filled. Now, in stage 1 nothing is happening. Similarly, when the bottle
moves to stage 3, both stage 1 and stage 2 are idle. But in pipelined operation, when the bottle is in stage 2,
another bottle can be loaded at stage 1. Similarly, when the bottle is in stage 3, there can be one bottle each in
stage 1 and stage 2. So, after each minute, we get a new bottle at the end of stage 3.
• Hence, the average time taken to manufacture 1 bottle is:

Without pipelining: the 3 bottles take 9 minutes, so the average is 9/3 = 3 minutes per bottle.

I F S | | | | | |
| | | I F S | | |
| | | | | | I F S   (9 minutes)

With pipelining: the first bottle emerges after 3 minutes and one more emerges every minute thereafter, so the 3 bottles take 5 minutes and the average is 5/3 ≈ 1.67 minutes per bottle.

I F S | |
| I F S |
| | I F S   (5 minutes)
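The bottle-plant arithmetic can be checked with two small functions (3 stages, 1 minute per stage):

```python
def non_pipelined_time(n_items, n_stages, stage_time=1):
    # Each item passes through all stages before the next one starts.
    return n_items * n_stages * stage_time

def pipelined_time(n_items, n_stages, stage_time=1):
    # The first item takes n_stages cycles; each later item adds one cycle.
    return (n_stages + (n_items - 1)) * stage_time

print(non_pipelined_time(3, 3))   # 9 minutes -> average 3 min/bottle
print(pipelined_time(3, 3))       # 5 minutes -> average 5/3 min/bottle
```

The same two formulas reappear later in the Speed Up section, with minutes replaced by clock cycles.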
Pipeline Organization
Pipeline Stages: A RISC processor has a 5-stage instruction pipeline to execute all of the instructions in the RISC instruction set.
Following are the 5 stages of the RISC pipeline with their respective operations:
• Stage 1 (Instruction Fetch): In this stage, the CPU reads the instruction from the memory address stored in the program counter.
• Stage 2 (Instruction Decode): In this stage, the instruction is decoded and the register file is accessed to get the values of the registers used in the instruction.
• Stage 3 (Instruction Compute): In this stage, ALU operations are performed.
• Stage 4 (Memory Access): In this stage, memory operands are read from or written to the memory address present in the instruction.
• Stage 5 (Write Back): In this stage, the computed or fetched value is written back to the register specified in the instruction.
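The overlap of the five stages can be visualized with a small scheduler that prints which stage each instruction occupies in each cycle, assuming an ideal pipeline with no hazards or stalls:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions):
    # In an ideal pipeline, instruction i occupies stage s in cycle i + s,
    # so n instructions finish in n + (number of stages) - 1 cycles.
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["..."] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name.ljust(3)
        rows.append("I%d: %s" % (i + 1, " ".join(row)))
    return rows

for line in schedule(4):
    print(line)
```

Each printed row is one instruction; reading a column top to bottom shows that every stage processes a different instruction in the same cycle.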
• In the first stage of the pipeline, the program counter (PC) is used to fetch a new instruction. As other instructions are
fetched, execution proceeds through successive stages.
• At any given time, each stage of the pipeline is processing a different instruction. Information such as register addresses,
immediate data, and the operations to be performed must be carried through the pipeline as each instruction proceeds
from one stage to the next. This information is held in interstage buffers.
The interstage buffers are used as follows:
• Interstage buffer B1 feeds the Decode stage with a newly-fetched instruction.
• Interstage buffer B2 feeds the Compute stage with the two operands read from the register file, the source/destination
register identifiers, the immediate value derived from the instruction, the incremented PC value used as the return address
for a subroutine call, and the settings of control signals determined by the instruction decoder. The settings for control
signals move through the pipeline to determine the ALU operation, the memory operation, and a possible write into the
register file.
• Interstage buffer B3 holds the result of the ALU operation, which may be data to be written into the register file or an
address that feeds the Memory stage. In the case of a write access to memory, buffer B3 holds the data to be written.
These data were read from the register file in the Decode stage. The buffer also holds the incremented PC value passed
from the previous stage, in case it is needed as the return address for a subroutine-call instruction.
• Interstage buffer B4 feeds the Write stage with a value to be written into the register file. This value may be the ALU
result from the Compute stage, the result of the Memory access stage, or the incremented PC value that is used as the
return address for a subroutine-call instruction.
Example: R1 to R5 are registers.
Pipeline example: contents of the registers at each clock pulse.

First clock pulse:
• Transfers A1 and B1 into R1 and R2.
Second clock pulse:
• Transfers the product of R1 and R2 into R3, and C1 into R4.
• The same clock pulse transfers A2 and B2 into R1 and R2.
Third clock pulse:
• Operates on all segments simultaneously.
• Places A3 and B3 into R1 and R2.
• Transfers the product of R1 and R2 into R3, and C2 into R4.
• Places the sum of R3 and R4 into R5.
• From then on, each clock pulse produces a new output and moves the data one step down the pipeline.
• This continues as long as new input data flow into the system.
• When input data are no longer available, the clock must continue until the last output emerges from the pipeline.
Advantages/Disadvantages
Advantages:
■ More efficient use of processor
■ Faster execution of a large number of instructions
Disadvantages:
▪ Pipelining involves adding hardware to the chip
▪ Inability to continuously run the pipeline at full speed because of pipeline
hazards which disrupt the smooth execution of the pipeline.
Speed Up
For a pipelined processor:
• A k-stage pipeline with a clock cycle time tp is used to execute n tasks.
• The time required for the first task T1 to complete = k*tp (for k segments in the pipe).
• The time required to complete the remaining (n-1) tasks = (n-1)*tp.
• Therefore, the time to complete n tasks using a k-segment pipeline = (k + (n-1))*tp.
For the non-pipelined processor:
• Time to complete each task = tn.
• Total time required to complete n tasks = n*tn.
• Speedup S = (n*tn) / ((k + (n-1))*tp). As n grows large, S approaches tn/tp; if each task takes the same time on both processors (tn = k*tp), the maximum speedup approaches k.
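The two timing expressions above can be combined into the speedup ratio S = (n*tn) / ((k + (n-1))*tp). A quick numerical check, with assumed values k = 5 stages, tp = 1 ns, and tn = k*tp = 5 ns:

```python
def pipeline_time(n, k, tp):
    # Time to finish n tasks on a k-stage pipeline with cycle time tp:
    # k cycles for the first task, one more cycle for each of the rest.
    return (k + (n - 1)) * tp

def speedup(n, k, tp, tn):
    # Ratio of non-pipelined time (n * tn) to pipelined time.
    return (n * tn) / pipeline_time(n, k, tp)

k, tp, tn = 5, 1e-9, 5e-9
print(speedup(100, k, tp, tn))   # 500/104, about 4.81: approaches k = 5
```

Note that the speedup never reaches k exactly for finite n, because the first instruction must still fill all k stages.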
Mov R1, A
Mov R2, B
Add R3, R1, R2
Inc R3
Store C, R3
Mov R1, A        F D E M W
Mov R2, B          F D E M W
Add R3, R1, R2       F D E M W
Inc R3                 F D E M W
Store C, R3              F D E M W
• Structural Hazards: Occur when hardware resources are insufficient to support all
concurrent operations in a pipeline. For example, if a single memory unit is used for both
instruction fetch and data access, a conflict can arise.
• Data Hazards: Arise when an instruction depends on the result of a previous instruction
that has not yet completed. This can be classified into:
– Read After Read (RAR), which does not cause an actual hazard, since reads do not change state
– Read After Write (RAW)
– Write After Read (WAR)
– Write After Write (WAW)
• Control Hazards: Occur due to branch instructions that change the flow of execution,
making it uncertain which instruction should be fetched next. This can cause delays while
the pipeline waits to resolve the branch.
Structural Hazards
• Structural hazard: the hardware cannot support this combination of instructions; two instructions need the same resource.
• It may be too expensive to eliminate a structural hazard, in which case the pipeline should stall.
• Stalling: The pipeline might need to stall the execution of one instruction until the other
completes, reducing overall performance.
• When the pipeline stalls, no instructions are issued until the hazard has been resolved
• Methods to prevent/overcome Structural Hazards
– Duplicate resources
– Pipeline the resource
– Reorder the instructions
Mitigation Strategies for RAW Hazards
Data Forwarding:
The processor can forward the result of the ADD operation directly to the SUB instruction without writing it back to the
register file first, allowing the second instruction to use the correct value.
Stalling:
The pipeline can introduce a stall (delay) for the second instruction until the first instruction has completed its write
operation.
Out-of-Order Execution:
More advanced architectures may allow instructions to be executed out of order, enabling the CPU to continue executing
other instructions while waiting for the result of the dependent instruction.
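Forwarding can be illustrated with a toy RAW sequence (ADD producing R1, followed by SUB consuming it). The model below is a deliberate simplification: the "forwarding path" is just a function argument that bypasses the register file:

```python
# Toy illustration of data forwarding for a RAW dependence:
#   ADD R1, R2, R3   ; R1 = R2 + R3
#   SUB R4, R1, R5   ; needs the new R1 before it is written back

regs = {"R1": 0, "R2": 6, "R3": 4, "R4": 0, "R5": 3}

# The EX stage of ADD produces the result; it sits in a pipeline latch
# and has NOT yet been written back to the register file.
alu_result = regs["R2"] + regs["R3"]          # 10

def operand(name, forward_from=None):
    # Forwarding path: bypass the register file when the producing
    # instruction's ALU result targets the register being read.
    if forward_from and forward_from[0] == name:
        return forward_from[1]
    return regs[name]

# SUB reads R1 via the forwarding path instead of the stale register file copy.
sub_result = operand("R1", ("R1", alu_result)) - operand("R5", ("R1", alu_result))
print(sub_result)  # 7, the correct value; without forwarding it would be -3
```

Real forwarding hardware does the same comparison with register numbers carried in the interstage buffers, using multiplexers at the ALU inputs rather than function calls.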
Mitigation Strategies for WAR Hazards
In-Order Execution:
Ensure that instructions are executed in their original order to prevent conflicts between reads and writes.
Pipeline Interlocks:
Implement hardware mechanisms that detect potential WAR hazards and stall the pipeline until it is safe to proceed.
Register Renaming:
Use register renaming to allocate different physical registers for different uses, ensuring that the write does not interfere
with the read operation.
Mitigation Strategies for WAW Hazards
In-Order Execution:
Ensure that instructions are executed in the order they appear in the program, preventing the possibility of a WAW hazard.
Register Renaming:
Use register renaming to allocate different physical registers for the results of different instructions. This eliminates the
conflict by ensuring that each write goes to a unique register.
Pipeline Interlocks:
Implement hardware interlocks that detect potential WAW hazards and stall the pipeline until it is safe to proceed.
Control Hazards
• Control hazards occur in pipelined processors when the flow of instruction execution is
altered, typically due to branch or jump instructions. These hazards can lead to incorrect
instruction fetching and execution, resulting in performance issues. Here’s a detailed
overview:
Causes of Control Hazards
Branch Instructions:
• Conditional branches (e.g., if statements) may change the next instruction to be executed
based on the outcome of a condition.
Jump Instructions:
• Unconditional jumps redirect the flow of execution to a specified address, which may not
be the next sequential instruction.
Consequences of Control Hazards
• Stalling: The pipeline may have to wait until the branch condition is evaluated before it can
fetch the next instruction.
• Incorrect Fetching: If the wrong instructions are fetched (before the branch decision is
resolved), it can lead to executing incorrect code.
Control Hazards
Example
Consider the following sequence of instructions:
• 1. BEQ R1, R2, Label ; Branch to Label if R1 equals R2
• 2. ADD R3, R4, R5 ; This instruction might be executed if the branch is not taken
• 3. Label: SUB R6, R7, R8
Memory Delays
• Memory delays in pipelining refer to the latency introduced when accessing memory during instruction execution.
These delays can significantly impact the performance of pipelined processors, especially since memory operations
often take longer than other operations. Here’s an overview of memory delays and their effects in a pipelined
architecture:
Sources of Memory Delays
Access Time:
• The time it takes to read from or write to memory can vary based on memory hierarchy (e.g., cache, main memory).
Accessing lower-level memory (like RAM) generally takes longer than accessing cache.
Cache Misses:
• If the required data is not found in the cache, the processor must retrieve it from slower main memory, introducing
significant delays.
Instruction Fetching:
• Each instruction must be fetched from memory, which can contribute to pipeline stalls if it takes longer than expected.
Load/Store Instructions:
• Memory access for load and store operations can create delays, particularly when the data is not in the cache.
Impact of Memory Delays on Pipelining
• Stalls: When an instruction that depends on a memory operation is in the pipeline, subsequent instructions may need
to wait, causing pipeline stalls.
• Reduced Throughput: Frequent stalls can lead to lower instruction throughput, negating the performance benefits of
pipelining.
• Complexity: Managing memory delays adds complexity to pipeline design and instruction scheduling.
Branch Delays
• Branch delays in pipelining refer to the performance penalties that occur when the flow of instruction execution is
altered by branch instructions (like jumps or conditional branches). These delays can disrupt the smooth flow of
instructions through the pipeline, leading to inefficiencies.
Causes of Branch Delays
Uncertainty in Control Flow:
• When a branch instruction is encountered, the processor may not immediately know which instruction to fetch next,
leading to potential stalls while the outcome of the branch is determined.
Pipeline Flushing:
• If the prediction of a branch outcome is incorrect, the instructions that were fetched speculatively need to be
discarded (flushed), wasting cycles and resources.
Instruction Fetching:
• The time taken to fetch the next instruction after a branch can lead to stalls, especially if the branch is taken.
Impact of Branch Delays on Performance
• Stalls: The pipeline may need to wait until the branch condition is resolved, introducing bubbles (NOPs) that can slow
down overall execution.
• Reduced Throughput: Frequent branches can lower instruction throughput, as the pipeline can spend significant time
resolving branches.
• Increased Complexity: Managing branch delays requires additional logic and mechanisms in the processor, increasing
design complexity.
Resource Limitations
• Resource limitations in pipelining refer to constraints on hardware resources that can affect the efficiency and performance
of a pipelined processor. These limitations can lead to various hazards and inefficiencies during instruction execution.
Types of Resource Limitations
Functional Units:
• Description: The number of arithmetic logic units (ALUs), floating-point units (FPUs), or other functional units can limit the
number of instructions that can be executed simultaneously.
• Impact: If multiple instructions require the same functional unit, it can lead to pipeline stalls while waiting for resources to
become available.
Registers:
• Description: Limited physical registers can constrain the ability to hold intermediate values.
• Impact: Reusing registers because of the shortage can introduce name-dependency hazards, particularly in Write After Write (WAW) and Write After Read (WAR)
scenarios.
Memory Bandwidth:
• Description: The speed and capacity of memory accesses can limit how quickly data can be fetched or stored.
• Impact: Memory bottlenecks can cause delays in instruction execution, especially with frequent load/store operations.
Cache Size and Hierarchy:
• Description: The size and levels of caches (L1, L2, L3) can limit the amount of frequently accessed data that can be stored
close to the CPU.
• Impact: Cache misses can introduce significant latency, affecting overall performance and leading to pipeline stalls.
Pipeline Stages:
• Description: The number of stages in a pipeline can affect how instructions are processed.
• Impact: A longer pipeline may lead to more complex hazard management, while a shorter pipeline may not fully exploit
parallelism.
Consequences of Resource Limitations
• Pipeline Stalls: Conflicts for functional units or registers can lead to delays, reducing overall throughput.
• Underutilization: If certain resources are idle while others are overused, it can lead to inefficient execution and wasted
potential.
• Increased Complexity: Managing limited resources requires additional control logic and mechanisms, complicating the
design and operation of the processor.
Strategies to Mitigate Resource Limitations
Resource Duplication:
• Increase the number of functional units, registers, or cache levels to reduce contention and increase parallelism.
Instruction Scheduling:
• Reorder instructions during compilation to minimize conflicts and optimize the use of available resources.
Out-of-Order Execution:
• Allow instructions to be executed based on resource availability rather than strict program order, helping to keep the
pipeline busy.
Register Renaming:
• Use additional physical registers to eliminate WAW and WAR hazards, allowing multiple instructions to execute without
conflicts.
Efficient Cache Design:
• Optimize cache sizes and hierarchies to balance speed and capacity, reducing the likelihood of cache misses.
Memory Management Techniques:
• Implement techniques such as prefetching and memory interleaving to improve memory bandwidth and reduce access
times.
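Register renaming, listed above as a mitigation strategy, can be sketched in a few lines. The following Python model is a simplified illustration (the three-field instruction format and the register names are hypothetical): every write to an architectural register is mapped to a brand-new physical register, so later writes to the same name no longer conflict with earlier reads or writes, while true RAW dependencies are preserved through the current mapping.

```python
from itertools import count

# Sketch of register renaming: sources read the current mapping
# (preserving true RAW dependencies); every destination gets a fresh
# physical register, eliminating WAW and WAR name hazards.
# Hypothetical instruction format: (dest, src1, src2).

def rename(instructions):
    phys = count()      # supply of fresh physical register numbers
    mapping = {}        # architectural name -> current physical register

    def read(reg):
        # A source never written before gets a register on first use.
        if reg not in mapping:
            mapping[reg] = next(phys)
        return mapping[reg]

    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = read(src1), read(src2)   # keeps true RAW dependencies
        mapping[dest] = next(phys)        # fresh reg: removes WAW/WAR
        renamed.append((mapping[dest], s1, s2))
    return renamed

code = [("r1", "r2", "r3"),   # r1 = r2 op r3
        ("r4", "r1", "r5"),   # reads r1: true RAW dependency, kept
        ("r1", "r6", "r7")]   # rewrites r1: WAW hazard, removed
print(rename(code))
```

After renaming, the third instruction writes a different physical register than the first, so the two writes to `r1` can proceed independently.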
Performance Evaluation
Throughput:
• The number of instructions completed per unit of time (e.g., instructions per second).
• Importance: Higher throughput indicates better utilization of the pipeline.
Latency:
• The time taken to complete a single instruction from start to finish.
• Importance: Lower latency is desirable for quick instruction execution, but pipelining can increase the
latency of an individual instruction because of the register overhead added between pipeline stages.
Speedup:
• The ratio of the time taken to execute a program on a non-pipelined processor to the time taken on a
pipelined processor.
• Importance: Indicates how much faster a program runs on a pipelined processor compared to a non-
pipelined one.
Utilization:
• The fraction of time the pipeline is busy executing instructions.
• Importance: High utilization suggests that the pipeline is being effectively used, while low utilization
indicates idle time and inefficiencies.
Cycle Time:
• The time taken to complete one clock cycle in the pipeline.
• Importance: Cycle time is set by the slowest pipeline stage; shorter cycle times generally lead to better
performance.
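The speedup metric above follows the standard textbook model: an ideal k-stage pipeline with no hazards finishes n instructions in k + n - 1 cycles (k cycles to fill the pipeline, then one result per cycle), versus n * k cycles without pipelining. A short Python sketch of these formulas:

```python
# Idealized pipeline performance (textbook model, no hazards or stalls):
# a k-stage pipeline completes n instructions in (k + n - 1) cycles,
# versus n * k cycles on a non-pipelined processor.

def pipelined_cycles(n, k):
    return k + n - 1            # fill for k cycles, then one per cycle

def speedup(n, k):
    return (n * k) / pipelined_cycles(n, k)

n, k = 1000, 5
print(pipelined_cycles(n, k))    # 1004 cycles
print(round(speedup(n, k), 2))   # 4.98, approaching k for large n
```

Note that the speedup approaches k only as n grows large; hazards, stalls, and branch delays reduce it further in practice.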
Superscalar Operations
• Superscalar operations in pipelining refer to the ability of a processor to execute multiple instructions simultaneously in a
single clock cycle. This is achieved by having multiple execution units and pipelines within a single processor core, allowing
for greater throughput and improved performance.
Key Concepts of Superscalar Architecture
Multiple Functional Units:
• Superscalar processors have multiple ALUs, FPUs, and load/store units, allowing them to execute more than one instruction
at a time.
Instruction-Level Parallelism (ILP):
• Superscalar architectures exploit ILP by issuing multiple instructions per clock cycle. The degree of parallelism depends on
the dependencies between instructions.
Dynamic Scheduling:
• Instructions are dynamically scheduled at runtime, which means the processor can decide the order of instruction
execution based on resource availability and data dependencies, rather than following the original program order.
Instruction Dispatching:
• Instructions are fetched, decoded, and dispatched to various execution units in parallel. The dispatch unit determines which
instructions can be executed simultaneously based on their dependencies.
Benefits of Superscalar Operations
• Increased Throughput: By executing multiple instructions in parallel, superscalar architectures can significantly increase the
number of instructions processed per cycle.
• Better Resource Utilization: Multiple functional units can be utilized effectively, reducing idle times and enhancing overall
performance.
• Higher Performance for Diverse Workloads: Superscalar designs are particularly effective for applications with high
instruction-level parallelism, such as scientific computing and graphics processing.
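The dispatch logic described above can be illustrated with a simplified in-order dual-issue model. This is a sketch under strong assumptions (a hypothetical three-field instruction format, issue width 2, and only RAW hazards considered; real dispatch units also check structural and cross-cycle hazards): two consecutive instructions issue in the same cycle only when the second does not read a result produced by the first.

```python
# Sketch of in-order dual-issue: consecutive instructions share a cycle
# when the later one does not read a register written by an instruction
# issued in the same cycle (no intra-packet RAW hazard).
# Hypothetical instruction format: (dest, src1, src2).

def dual_issue_cycles(instructions, width=2):
    cycles, i, n = 0, 0, len(instructions)
    while i < n:
        issued = [instructions[i]]
        # Try to co-issue following instructions in the same cycle.
        while i + len(issued) < n and len(issued) < width:
            cand = instructions[i + len(issued)]
            dests = {instr[0] for instr in issued}
            if dests & {cand[1], cand[2]}:    # RAW within the packet
                break
            issued.append(cand)
        i += len(issued)
        cycles += 1
    return cycles

prog = [("r1", "r2", "r3"),
        ("r4", "r1", "r5"),   # reads r1: cannot pair with the first
        ("r6", "r7", "r8"),   # independent: pairs with the second
        ("r9", "r6", "r2")]   # last instruction, issues alone
print(dual_issue_cycles(prog))   # 3 cycles instead of 4
```

With fully independent instructions this model reaches the ideal two instructions per cycle; dependencies pull it back toward one, which is exactly the ILP limitation mentioned above.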