Unit-5 (Coa) Notes
Introduction to pipelining
The term Pipelining refers to a technique of decomposing a sequential process into sub-
operations, with each sub-operation being executed in a dedicated segment that operates
concurrently with all other segments.
The most important characteristic of a pipeline technique is that several computations can be
in progress in distinct segments at the same time. The overlapping of computation is made
possible by associating a register with each segment in the pipeline. The registers provide
isolation between each segment so that each can operate on distinct data simultaneously.
The structure of a pipeline organization can be represented simply by including an input register
for each segment followed by a combinational circuit.
Let us consider an example of combined multiplication and addition operation to get a better
understanding of the pipeline organization.
The combined multiplication and addition operation is performed on a stream of numbers, such as:
Ai * Bi + Ci   for i = 1, 2, 3, ..., 7
The operation to be performed on the numbers is decomposed into sub-operations with each
sub-operation to be implemented in a segment within a pipeline.
The sub-operations performed in each segment of the pipeline are defined as:
R1 <- Ai, R2 <- Bi   (input Ai and Bi)
R3 <- R1 * R2, R4 <- Ci   (multiply, and input Ci)
R5 <- R3 + R4   (add Ci to the product)
The following block diagram represents the combined as well as the sub-operations performed
in each segment of the pipeline.
Registers R1, R2, R3, and R4 hold the data, and the combinational circuits operate on it in each segment.
The output generated by the combinational circuit in a given segment is applied to the input register of the next segment. For instance, from the block diagram, we can see that register R3 is used as one of the input registers for the combinational adder circuit.
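The segment-by-segment flow described above can be sketched as a small simulation. The decomposition into segments and the register names R1-R5 follow the description in the text; the function name and the Python modelling are illustrative assumptions, not an authoritative implementation:

```python
# Minimal sketch of a pipeline computing Ai * Bi + Ci.
# Segment 1 latches the inputs, segment 2 multiplies (and latches Ci),
# segment 3 adds. Registers R1..R5 isolate the segments as in the text.

def pipeline_multiply_add(A, B, C):
    n = len(A)
    R1 = R2 = R3 = R4 = None
    results = []
    # Run enough clock cycles to drain the pipeline (n items + 2 fill cycles).
    for clock in range(n + 2):
        # Segment 3: add the latched product and Ci (values latched last cycle).
        if R3 is not None:
            results.append(R3 + R4)           # R5 <- R3 + R4
        # Segment 2: multiply latched inputs, latch Ci alongside the product.
        if R1 is not None:
            R3, R4 = R1 * R2, C[clock - 1]    # R3 <- R1 * R2, R4 <- Ci
        else:
            R3 = R4 = None
        # Segment 1: latch the next pair of inputs.
        if clock < n:
            R1, R2 = A[clock], B[clock]       # R1 <- Ai, R2 <- Bi
        else:
            R1 = R2 = None
    return results

print(pipeline_multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # [11, 18, 27]
```

Note how each segment reads only registers written in the previous cycle, which is what lets all segments operate on distinct data simultaneously.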
In general, the pipeline organization is applicable for two areas of computer design which
includes:
1. Arithmetic Pipeline
2. Instruction Pipeline
Arithmetic Pipeline
Arithmetic Pipelines are mostly used in high-speed computers. They are used to implement
floating-point operations, multiplication of fixed-point numbers, and similar computations
encountered in scientific problems.
To understand the concepts of arithmetic pipeline in a more convenient way, let us consider an
example of a pipeline unit for floating-point addition and subtraction.
The inputs to the floating-point adder pipeline are two normalized floating-point binary
numbers defined as:
X = A * 2^a = 0.9504 * 10^3
Y = B * 2^b = 0.8200 * 10^2
where A and B are two fractions that represent the mantissas, and a and b are the exponents. (The example values use base 10 for readability.)
The combined operation of floating-point addition and subtraction is divided into four
segments. Each segment contains the corresponding sub operation to be performed in the given
pipeline. The sub operations that are shown in the four segments are:
We will discuss each sub operation in a more detailed manner later in this section.
The following block diagram represents the sub operations performed in each segment of the
pipeline.
Note: Registers are placed after each sub operation to store the intermediate results.
1. Compare exponents by subtraction:
The exponents are compared by subtracting them to determine their difference. The larger
exponent is chosen as the exponent of the result.
The difference of the exponents, i.e., 3 - 2 = 1, determines how many times the mantissa associated with the smaller exponent must be shifted to the right.
2. Align the mantissas:
The mantissa associated with the smaller exponent is shifted right according to the difference of exponents determined in segment one:
X = 0.9504 * 10^3
Y = 0.08200 * 10^3
3. Add mantissas:
Z = X + Y = 1.0324 * 10^3
4. Normalize the result:
Z = 0.10324 * 10^4
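The four sub-operations can be sketched as one function per pipeline segment. The (mantissa, exponent) pairs use base 10 for readability, matching the worked example; the function names and this representation are assumptions for illustration:

```python
# Sketch of the four floating-point-addition pipeline segments,
# using (mantissa, exponent) pairs in base 10 as in the worked example.

def compare_exponents(x, y):
    # Segment 1: the exponent difference tells how far to shift
    # the mantissa associated with the smaller exponent.
    (ma, ea), (mb, eb) = x, y
    return x, y, abs(ea - eb)

def align_mantissas(x, y, diff):
    # Segment 2: shift the mantissa with the smaller exponent right.
    (ma, ea), (mb, eb) = x, y
    if ea >= eb:
        return (ma, ea), (mb / 10**diff, ea)
    return (ma / 10**diff, eb), (mb, eb)

def add_mantissas(x, y):
    # Segment 3: add the aligned mantissas; both share one exponent now.
    (ma, e), (mb, _) = x, y
    return (ma + mb, e)

def normalize(z):
    # Segment 4: renormalize so the mantissa is a fraction below 1.
    m, e = z
    while abs(m) >= 1.0:
        m /= 10
        e += 1
    return (round(m, 6), e)

x, y = (0.9504, 3), (0.8200, 2)
a, b, d = compare_exponents(x, y)
a, b = align_mantissas(a, b, d)
z = normalize(add_mantissas(a, b))
print(z)  # (0.10324, 4)
```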
Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream as well.
Most digital computers with complex instructions require an instruction pipeline to carry out operations like fetching, decoding, and executing instructions.
In general, the computer needs to process each instruction with the following sequence of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
The organization of an instruction pipeline will be more efficient if the instruction cycle is divided
into segments of equal duration. One of the most common examples of this type of organization
is a Four-segment instruction pipeline.
A four-segment instruction pipeline combines two or more of these steps into a single segment. For instance, the decoding of the instruction can be combined with the calculation of the effective address into one segment.
The following block diagram shows a typical example of a four-segment instruction pipeline. The
instruction cycle is completed in four segments.
Segment 1:
The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2:
The instruction fetched from memory is decoded in the second segment, and eventually, the
effective address is calculated in a separate arithmetic circuit.
Segment 3:
An operand is fetched from memory in the third segment.
Segment 4:
The instruction is finally executed in the last segment of the pipeline organization.
Pipelining is a process of arrangement of hardware elements of the CPU such that its overall
performance is increased. Simultaneous execution of more than one instruction takes place in
a pipelined processor. Let us see a real-life example that works on the concept of pipelined
operation. Consider a water bottle packaging plant. Let there be 3 stages that a bottle should
pass through, Inserting the bottle (I), Filling water in the bottle (F), and Sealing the bottle(S).
Let us consider these stages as stage 1, stage 2, and stage 3 respectively. Let each stage take
1 minute to complete its operation. Now, in a non-pipelined operation, a bottle is first inserted
in the plant, after 1 minute it is moved to stage 2 where water is filled. Now, in stage 1 nothing
is happening. Similarly, when the bottle moves to stage 3, both stage 1 and stage 2 are idle.
But in pipelined operation, when the bottle is in stage 2, another bottle can be loaded at stage
1. Similarly, when the bottle is in stage 3, there can be one bottle each in stage 1 and stage 2.
So, after each minute, we get a new bottle at the end of stage 3. Hence, the average time
taken to manufacture 1 bottle is:
Without pipelining (3 bottles):
Bottle 1:  I F S . . . . . .
Bottle 2:  . . . I F S . . .
Bottle 3:  . . . . . . I F S    (9 minutes)

With pipelining (3 bottles):
Bottle 1:  I F S . .
Bottle 2:  . I F S .
Bottle 3:  . . I F S    (5 minutes)
Thus, pipelined operation increases the efficiency of a system.
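The 9-minute and 5-minute figures above follow from two simple formulas, which can be checked with a short sketch (the function names are illustrative):

```python
# Time to process n items through k equal stages of t minutes each.

def non_pipelined_time(n, k, t):
    # Each item must finish all k stages before the next one starts.
    return n * k * t

def pipelined_time(n, k, t):
    # The first item takes k stages; each later item finishes one stage later.
    return (k + (n - 1)) * t

# 3 bottles, 3 stages (Insert, Fill, Seal), 1 minute per stage.
print(non_pipelined_time(3, 3, 1))  # 9 minutes
print(pipelined_time(3, 3, 1))      # 5 minutes
```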
Design of a basic pipeline
• In a pipelined processor, a pipeline has two ends, the input end and the output end.
Between these ends, there are multiple stages/segments such that the output of one stage
is connected to the input of the next stage and each stage performs a specific operation.
• Interface registers are used to hold the intermediate output between two stages. These
interface registers are also called latch or buffer.
• All the stages in the pipeline along with the interface registers are controlled by a common
clock.
• Execution in a pipelined processor: The execution sequence of instructions in a pipelined processor can be visualized using a space-time diagram. For example, consider a processor having 4 stages and let there be 2 instructions to be executed. We can visualize the execution sequence through the following space-time diagrams:
• Non-overlapped execution:
Stage / Cycle   1    2    3    4    5    6    7    8
S1              I1                  I2
S2                   I1                  I2
S3                        I1                  I2
S4                             I1                  I2
Total time = 8 cycles

• Overlapped (pipelined) execution:
Stage / Cycle   1    2    3    4    5
S1              I1   I2
S2                   I1   I2
S3                        I1   I2
S4                             I1   I2
Total time = 5 cycles

Pipeline Stages
A RISC processor has a 5-stage instruction pipeline to execute all the instructions in the RISC instruction set. Following are the 5 stages of the RISC pipeline with their respective operations:
• Stage 1 (Instruction Fetch) In this stage the CPU reads instructions from the address in
the memory whose value is present in the program counter.
• Stage 2 (Instruction Decode) In this stage, instruction is decoded and the register file is
accessed to get the values from the registers used in the instruction.
• Stage 3 (Instruction Execute) In this stage, ALU operations are performed.
• Stage 4 (Memory Access) In this stage, memory operands are read from or written to the memory address specified in the instruction.
• Stage 5 (Write Back) In this stage, the computed/fetched value is written back to the register specified in the instruction.
• Performance of a pipelined processor Consider a ‘k’ segment pipeline with clock cycle
time as ‘Tp’. Let there be ‘n’ tasks to be completed in the pipelined processor. Now, the
first instruction is going to take ‘k’ cycles to come out of the pipeline but the other ‘n –
1’ instructions will take only ‘1’ cycle each, i.e., a total of ‘n – 1’ cycles. So, time taken to
execute ‘n’ instructions in a pipelined processor:
ETpipeline = k + n – 1 cycles
= (k + n – 1) Tp
• In the same case, for a non-pipelined processor, the execution time of ‘n’ instructions
will be:
ETnon-pipeline = n * k * Tp
• So, speedup (S) of the pipelined processor over the non-pipelined processor, when ‘n’
tasks are executed on the same processor is:
S = Performance of pipelined processor /
Performance of non-pipelined processor
• As the performance of a processor is inversely proportional to the execution time, we
have,
S = ETnon-pipeline / ETpipeline
=> S = [n * k * Tp] / [(k + n – 1) * Tp]
S = [n * k] / [k + n – 1]
• When the number of tasks 'n' is significantly larger than k, that is, n >> k:
S = [n * k] / [n]
S = k
where 'k' is the number of stages in the pipeline.
Efficiency = Given speedup / Maximum speedup = S / Smax
We know that Smax = k. So,
Efficiency = S / k
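The speedup and efficiency formulas can be checked numerically; the values of k and n below are arbitrary examples:

```python
# Speedup and efficiency of a k-stage pipeline over a non-pipelined
# processor for n tasks. The clock period Tp cancels out of the ratio.

def pipeline_speedup(n, k):
    # S = [n * k] / [k + n - 1]
    return (n * k) / (k + n - 1)

def pipeline_efficiency(n, k):
    # Efficiency = S / Smax, where Smax = k.
    return pipeline_speedup(n, k) / k

k = 4
for n in (4, 100, 10_000):
    print(n, round(pipeline_speedup(n, k), 3), round(pipeline_efficiency(n, k), 3))
# As n >> k, the speedup approaches k = 4 and the efficiency approaches 1.
```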
Dependencies in a pipelined processor
There are mainly three types of dependencies possible in a pipelined processor:
1. Structural dependencies
2. Data dependencies
3. Control dependencies
Because of these dependencies, stalls are introduced into the pipeline. A stall is a cycle in the pipeline without new input. In other words, a stall happens when a later instruction depends on the output of an earlier instruction.
Structural dependencies
Structural dependency arises because of resource conflicts in the pipeline. A resource conflict is a situation in which more than one instruction tries to access the same resource (such as the ALU, memory, or a register) in the same cycle.
Example:
Instructions / Cycle   1        2        3        4        5
I1                     IF(Mem)  ID       EX       Mem
I2                              IF(Mem)  ID       EX
I3                                       IF(Mem)  ID       EX
I4                                                IF(Mem)  ID
The above table contains four instructions I1, I2, I3, and I4, and five cycles. In cycle 4 there is a resource conflict, because I1 and I4 try to access the same resource; in our case, the resource is memory. The solution to this problem is to keep the later instruction waiting until the required resource becomes available. Because of this wait, stalls are introduced into the pipeline, as shown below:
Instructions / Cycle   1        2        3        4        5        6        7        8
I1                     IF(Mem)  ID       EX       Mem      WB
I2                              IF(Mem)  ID       EX       Mem      WB
I3                                       IF(Mem)  ID       EX       Mem      WB
I4                                                -        -        -        IF(Mem)
With the help of a hardware mechanism, we can minimize structural-dependency stalls in a pipeline. The mechanism is known as renaming.
Renaming: In this mechanism, the memory is divided into two independent modules, known as data memory (DM) and code memory (CM). All the instructions are stored in the CM, and all the operands required by the instructions are stored in the DM.
Instructions / Cycle   1       2       3       4       5       6       7
I1                     IF(CM)  ID      EX      DM      WB
I2                             IF(CM)  ID      EX      DM      WB
I3                                     IF(CM)  ID      EX      DM      WB
I4                                             IF(CM)  ID      EX      DM
I5                                                     IF(CM)  ID      EX
I6                                                             IF(CM)  ID
I7                                                                     IF(CM)
Control Dependency (Branch Hazards)
Control dependency arises when the pipeline does not yet know the target of a branch or jump instruction. For example, assume a program with the following sequence of instructions:
100: I1
101: I2
102: I3
.
.
250: BI1
Expected output sequence: I1 → I2 → BI1
Note: Only after the ID stage does the processor know the target address of the JMP instruction.
Instructions / Cycle   1    2    3            4    5    6
I1                     IF   ID   EX           MEM  WB
I2                          IF   ID(PC:250)   EX   MEM  WB
I3                               IF           ID   EX   MEM
BI1                                           IF   ID   EX
Actual output sequence: I1 → I2 → I3 → BI1
The above example shows that the actual output sequence is not equal to the expected output sequence: I3 is fetched even though the jump should skip it, so the pipeline is not correctly implemented.
We can correct this problem by stopping instruction fetch until the target address of the branch instruction is known. For this, we introduce a delay slot until the target address is available, as described in the following table:
Instructions / Cycle   1    2    3            4    5    6
I1                     IF   ID   EX           MEM  WB
I2                          IF   ID(PC:250)   EX   MEM  WB
Delay                            -            -    -    -
BI1                                           IF   ID   EX
In the above example, no operation is performed during the delay slot; the instruction fetch simply waits. As a result, the output sequence now matches the expected output, but because of this slot a stall is introduced into the pipeline.
In the control dependency, we can eliminate the stalls in the pipeline with the help of a method known as branch prediction. The prediction about whether the branch will be taken is made in the first stage of the pipeline; when the prediction is correct, the branch penalty is zero.
Branch penalty: The branch penalty is the number of stalls introduced during a branch operation in the pipelined processor.
Data Dependency (Data Hazards)
To illustrate this, assume two instructions I1 and I2, where I2 reads a register that I1 writes, before I1 has written its new value:
Instructions / Cycle 1 2 3 4
I1 IF ID EX DM
I2 IF ID (Old value) EX
Here we use operand forwarding to minimize the stalls caused by data dependency.
Operand forwarding: In this technique, we use the interface registers that exist between the stages to hold the intermediate output. With the help of these intermediate registers, the dependent instruction can directly access the new value.
To explain this, we will take the same example:
Instructions / Cycle 1 2 3 4
I1 IF ID EX DM
I2 IF ID EX
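The effect of operand forwarding on the stall count can be sketched as follows. The model is a textbook 5-stage pipeline, not any specific CPU, and it assumes the register file can be written and read in the same cycle:

```python
# Space-time schedule for two back-to-back instructions in a 5-stage
# pipeline (IF ID EX MEM WB) where I2 reads a register that I1 writes.
# Assumptions: registers are written in WB and can be read by ID in that
# same cycle; with forwarding, I1's EX result is bypassed straight to I2.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(forwarding: bool):
    i1 = {s: c + 1 for c, s in enumerate(STAGES)}   # I1 runs in cycles 1..5
    if forwarding:
        stalls = 0                                   # the bypass removes the wait
    else:
        # Without forwarding, I2's ID (naturally cycle 3) must wait
        # until I1's WB (cycle 5) to read the new register value.
        stalls = i1["WB"] - 3
    i2 = {s: c + 2 + stalls for c, s in enumerate(STAGES)}
    return stalls, i2

print(schedule(forwarding=False))  # 2 stall cycles; I2's ID slips to cycle 5
print(schedule(forwarding=True))   # 0 stall cycles; I2 runs in cycles 2..6
```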
Data Hazards
Data hazards occur because of data dependency. A data hazard arises when data is modified in different stages of a pipeline by instructions that exhibit data dependency. When instructions read or write registers that are used by other instructions, instruction hazards occur. Because of a data hazard, there will be a delay in the pipeline. Data hazards are basically of three types:
1. RAW
2. WAR
3. WAW
To understand these hazards, we will assume we have two instructions I1 and I2, in such a way
that I2 follows I1. The hazards are described as follows:
RAW:
RAW stands for 'Read After Write'. It is also known as flow/true data dependency. A RAW hazard occurs if a later instruction tries to read an operand before an earlier instruction writes it. The condition to detect a RAW hazard: the output set O(n) of instruction n and the input set I(n+1) of instruction n+1 have at least one operand in common.
For example:
Instructions / Cycle   1    2    3    4    5    6
I1                     IF   ID   EX   MEM  WB
I2                          IF   ID   EX   MEM  WB
WAR
WAR stands for 'Write After Read'. It is also known as anti-data dependency. A WAR hazard occurs if a later instruction tries to write an operand before an earlier instruction reads it. The condition to detect a WAR hazard: the input set I(n) of instruction n and the output set O(n+1) of instruction n+1 have at least one operand in common.
For example:
Here the subtraction instruction creates a WAR hazard because it writes R2, which is read by the earlier addition instruction. In a reasonable (in-order) pipeline, the WAR hazard is very uncommon or impossible. The hazard for the instructions 'add R1, R2, R3' and 'sub R2, R5, R4' is shown below:
Instructions / Cycle   1    2    3    4    5    6
I1                     IF   ID   EX   MEM  WB
I2                          IF   ID   EX   MEM  WB
When an instruction enters the write-back stage of the pipeline, all previous instructions in the program have already passed through the register-read stage and read their input values, so the write instruction can write its destination register without causing any problem. WAR hazards cause fewer problems than WAW hazards because the register-read stage occurs before the write-back stage of the pipeline.
WAW
WAW stands for 'Write After Write'. It is also known as output data dependency. A WAW hazard occurs if a later instruction tries to write an operand before an earlier instruction writes it. The condition to detect a WAW hazard: the output set O(n) of instruction n and the output set O(n+1) of instruction n+1 have at least one operand in common.
For example:
Here the subtraction instruction creates a WAW hazard because it writes to the same register as the addition instruction. The hazard for the instructions 'add R1, R2, R3' and 'sub R1, R2, R4' is shown below:
Instructions / Cycle   1    2    3    4    5    6
I1                     IF   ID   EX   MEM  WB
I2                          IF   ID   EX   MEM  WB
In the write-back stage of the pipeline, the output register of an instruction is written. Instructions with a WAW hazard enter the write-back stage in the same order in which they appear in the program, so their results are written into the register in the right order. A processor that allows instructions to execute in a different order can achieve improved performance compared to strict program order, but it must then preserve this write order.
WAR and WAW hazards occur because the processor contains a finite number of registers. For this reason, these hazards are also known as name dependencies. If the processor had an infinite number of registers, it would use a different register for the output of each instruction, and there would be no chance of WAR and WAW hazards occurring.
WAR and WAW hazards cause no delay if the processor uses the same pipeline for all instructions and executes them in the same order in which they appear in the program. This follows from the way instructions flow through the pipeline.
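The three detection conditions (O(n) ∩ I(n+1) for RAW, I(n) ∩ O(n+1) for WAR, O(n) ∩ O(n+1) for WAW) translate directly into set intersections. The encoding of an instruction as (output registers, input registers) is a simplification assumed for illustration:

```python
# Classify RAW / WAR / WAW hazards between two instructions using the
# set conditions from the text. An instruction is modelled as a pair
# (output register set, input register set).

def hazards(first, second):
    out1, in1 = first
    out2, in2 = second
    found = set()
    if out1 & in2:
        found.add("RAW")   # the second reads what the first writes
    if in1 & out2:
        found.add("WAR")   # the second writes what the first reads
    if out1 & out2:
        found.add("WAW")   # both write the same register
    return found

add_ = ({"R1"}, {"R2", "R3"})   # add R1, R2, R3
sub1 = ({"R2"}, {"R4", "R5"})   # sub R2, R5, R4  -> WAR with add
sub2 = ({"R1"}, {"R2", "R4"})   # sub R1, R2, R4  -> WAW with add

print(hazards(add_, sub1))  # {'WAR'}
print(hazards(add_, sub2))  # {'WAW'}
```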
What is a Microprocessor?
A microprocessor is a computer processor that is found in most modern personal computers,
smartphones, and other electronic devices. It is a central processing unit (CPU) that performs
most of the processing tasks in a computer system. The microprocessor is a key component of
a computer, as it controls the fetching, decoding, and execution of instructions that are stored
in memory. You can say that the microprocessor acts as the brain of a computing device, controlling its overall execution and operations. The development of microprocessors has played a significant role in the evolution of computers and has made it possible for them to become smaller, faster, and more powerful over time.
Basics of Microprocessor –
A microprocessor takes a set of machine-language instructions and executes them; these instructions tell the processor what it has to do. The microprocessor performs three basic things while executing an instruction:
1. It performs basic operations like addition, subtraction, multiplication, division, and some logical operations using its Arithmetic and Logic Unit (ALU). Newer microprocessors also perform operations on floating-point numbers.
2. It moves data from one memory location to another.
3. It has a Program Counter (PC) register that stores the address of the next instruction; based on the value of the PC, the microprocessor jumps from one location to another and makes decisions.
Evolution of Microprocessors
We can categorize the microprocessor according to the generations or according to the size of
the microprocessor:
First Generation (4-bit Microprocessors)
The first generation of microprocessors was introduced in 1971-1972 by Intel Corporation. It was named the Intel 4004, since it was a 4-bit processor.
It was a processor on a single chip. It could perform simple arithmetic and logical operations such as addition, subtraction, Boolean OR, and Boolean AND.
It had a control unit capable of performing control functions like fetching an instruction from memory, decoding it, and then generating control pulses to execute it.
Second Generation (8-bit Microprocessors)
The second generation of microprocessors was introduced in 1973, again by Intel. It was the first 8-bit microprocessor, which could perform arithmetic and logic operations on 8-bit words. It was the Intel 8008; a later improved version was the Intel 8080.
Third Generation (16-bit Microprocessors)
The third generation of microprocessors, introduced in 1978, was represented by Intel's 8086, the Zilog Z8000, and the Intel 80286, which were 16-bit processors with performance comparable to minicomputers.
Fourth Generation (32 - bit Microprocessors)
Several different companies introduced the 32-bit microprocessors, but the most popular one
is the Intel 80386.
Fifth Generation (64-bit Microprocessors)
From 1995 until now we are in the fifth generation. After the 80486, Intel came out with a new processor, namely the Pentium processor, followed by the Pentium Pro CPU, which allows multiple CPUs in a single system to achieve multiprocessing.
Other improved 64-bit processors include the Celeron and the Dual, Quad, and Octa core processors.
Types of Processor:
1) Reduced Instruction Set Computer (RISC) –
RISC is a microprocessor architecture that uses a small, highly optimized set of instructions, each of which typically executes in a single clock cycle.
Example:
1. IBM RS6000
2. MC88100
3. DEC Alpha 21064
4. DEC Alpha 21164
5. DEC Alpha 21264
2) Complex Instruction Set Computer (CISC) –
CISC or Complex Instruction Set Computer is a computer architecture in which a single instruction can execute multiple low-level operations, such as loading from memory, storing into memory, or an arithmetic operation. It supports multiple addressing modes within a single instruction. CISC makes use of very few registers.
Example:
1. Intel 386
2. Intel 486
3. Pentium
4. Pentium Pro
5. Pentium II
6. Pentium III
7. Motorola 68000
8. Motorola 68020
9. Motorola 68040 etc.
RISC Processor
RISC stands for Reduced Instruction Set Computer, a microprocessor architecture with a simple and highly customized set of instructions. It is built to minimize instruction execution time by optimizing and limiting the number of instructions. Each instruction cycle requires only one clock cycle, and each cycle contains three phases: fetch, decode, and execute. Complex operations are performed by combining simpler instructions. RISC chips require fewer transistors, which makes them cheaper to design and reduces the execution time of instructions.
RISC Architecture
RISC architecture uses a highly customized set of instructions and is used in portable devices such as the Apple iPod, mobiles/smartphones, and the Nintendo DS, owing to its system reliability. Its main features are:
1. One-cycle execution time: RISC processors execute each instruction with one CPI (clock cycle per instruction), and each cycle includes the fetch, decode, and execute phases of the instruction.
2. Pipelining technique: The pipelining technique is used in the RISC processors to execute
multiple parts or stages of instructions to perform more efficiently.
3. A large number of registers: RISC processors are designed with multiple registers that can store instructions and operands, respond quickly, and minimize interaction with computer memory.
4. It supports a simple addressing mode and fixed length of instruction for executing the
pipeline.
5. It uses LOAD and STORE instruction to access the memory location.
6. Simple and limited instruction reduces the execution time of a process in a RISC.
CISC Processor
CISC stands for Complex Instruction Set Computer; the approach was popularized by Intel. A CISC processor has a large collection of instructions, ranging from simple to very complex and specialized, at the assembly-language level, and these instructions can take a long time to execute. The CISC approach tries to reduce the number of instructions per program while accepting a larger number of cycles per instruction. It emphasizes building complex instructions directly into the hardware, because hardware is always faster than software. CISC chips are relatively slower per instruction than RISC chips, but they use fewer instructions than RISC. Examples of CISC processors are the VAX, the System/360, and the Intel x86 and AMD x86 CPUs.
The CISC architecture helps reduce program code by embedding multiple operations in each program instruction, which makes the CISC processor more complex. The CISC architecture was designed to decrease memory costs: large programs need large memory to store them, and a large amount of memory is expensive, so shorter programs reduce cost.
Advantages of the CISC processor:
1. The compiler requires little effort to translate high-level programs or statements into assembly or machine language.
2. The code length is quite short, which minimizes the memory requirement.
3. Storing a program requires very little RAM.
4. Execution of a single instruction accomplishes several low-level tasks.
5. CISC creates a process to manage power usage that adjusts clock speed and voltage.
6. It needs fewer instructions than RISC to perform the same task.
Disadvantages of the CISC processor:
1. CISC chips are slower than RISC chips in executing each instruction cycle of a program.
2. The performance of the machine decreases because of the slower clock speed.
3. Pipelining in a CISC processor is complicated to implement.
4. CISC chips require more transistors than a RISC design.
5. Typically, only about 20% of the existing instructions are used in a program.
RISC vs CISC:
• RISC has a hard-wired control unit; CISC has a microprogrammed control unit.
• RISC requires multiple register sets to store instructions; CISC requires a single register set.
• RISC has simple decoding of instructions; CISC has complex decoding of instructions.
• Use of the pipeline is simple in RISC; use of the pipeline is difficult in CISC.
• RISC uses a limited number of instructions that require less time to execute; CISC uses a large number of instructions that require more time to execute.
• RISC uses LOAD and STORE as independent instructions in register-to-register interaction; CISC uses LOAD and STORE within the memory-to-memory interaction of a program.
• RISC has more transistors on memory registers; CISC has transistors to store complex instructions.
• The execution time of RISC is very short; the execution time of CISC is longer.
• RISC architecture can be used with high-end applications like telecommunication, image processing, and video processing; CISC architecture can be used with low-end applications like home automation and security systems.
• A program written for RISC architecture tends to take more space in memory; a program written for CISC architecture tends to take less space in memory.
• Examples of RISC: ARM, PA-RISC, Power Architecture, Alpha, AVR, ARC, and SPARC. Examples of CISC: VAX, the Motorola 68000 family, System/360, and AMD and Intel x86 CPUs.
Introduction of Multiprocessor
Multiprocessor:
A multiprocessor is a computer system with two or more central processing units (CPUs) that share full access to a common RAM. The main objective of using a multiprocessor is to boost the system's execution speed, with other objectives being fault tolerance and application matching.
There are two types of multiprocessors: shared memory multiprocessors and distributed memory multiprocessors. In a shared memory multiprocessor, all the CPUs share the common memory, but in a distributed memory multiprocessor, every CPU has its own private memory.
Applications of Multiprocessor –
1. As a uniprocessor, such as single instruction, single data stream (SISD).
2. As a multiprocessor, such as single instruction, multiple data stream (SIMD), which is
usually used for vector processing.
3. For multiple series of instructions operating on a single data stream (MISD), which is used to describe hyper-threading or pipelined processors.
4. For executing multiple, individual series of instructions on multiple data streams inside a single system (MIMD).
Benefits of using a multiprocessor:
• Enhanced performance.
• Multiple applications.
• Multi-tasking inside an application.
• High throughput and responsiveness.
• Hardware sharing among CPUs.
Multicomputer:
A multicomputer system is a computer system with multiple processors that are connected together to solve a problem. Each processor has its own memory, accessible only by that particular processor, and the processors communicate with each other via an interconnection network.
As the multicomputer is capable of message passing between the processors, a task can be divided among the processors to complete it. Hence, a multicomputer can be used for distributed computing. It is more cost-effective and easier to build a multicomputer than a multiprocessor.
Differences between multiprocessor and multicomputer:
1. A multiprocessor is a system with two or more central processing units (CPUs) capable of performing multiple tasks, whereas a multicomputer is a system with multiple processors attached via an interconnection network to perform a computation task.
2. A multiprocessor system is a single computer that operates with multiple CPUs, whereas a multicomputer system is a cluster of computers that operate as a single computer.
3. Construction of a multicomputer is easier and more cost-effective than that of a multiprocessor.
4. In a multiprocessor system, programming tends to be easier, whereas in a multicomputer system, programming tends to be more difficult.
5. A multiprocessor supports parallel computing; a multicomputer supports distributed computing.
Cache Coherence
A cache coherence issue results from the concurrent operation of several processors and the
possibility that various caches may hold different versions of the identical memory block. The
practice of cache coherence makes sure that alterations in the contents of associated operands
are quickly transmitted across the system.
The cache coherence problem is the issue that arises when several copies of the same data
are kept at various levels of memory.
The two methods listed below can be used to resolve the cache coherence issue:
o Write Through
o Write Back
Write Through
The easiest and most popular method is to write through. Every memory write operation
updates the main memory. If the word is present in the cache memory at the requested address,
the cache memory is also updated simultaneously with the main memory.
The benefit of this approach is that the RAM and cache always hold the same information. This quality is crucial in systems with direct memory access (DMA) transfer. It ensures that the information in main memory is up to date at all times, so that a device communicating through DMA can access the most recent information.
Write Back
Only the cache location is updated during a write operation in this approach. The location is flagged as dirty, and when the word is removed from the cache it is copied back into main memory. The write-back approach was developed because words may be updated numerous times while they are in the cache. As long as they remain there, it does not matter whether the copy in main memory is outdated, because requests for those words are fulfilled from the cache. An accurate copy need only be transferred back to main memory when the word is evicted from the cache. Analytical results indicate that between 10% and 30% of all memory references in a typical program are writes to memory.
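The two write policies can be contrasted with a toy model that counts main-memory writes; the single-word "cache" below is a deliberate simplification, not a real cache design:

```python
# Toy model contrasting write-through and write-back for a single cached
# word, counting how many times main memory is written.

class Cache:
    def __init__(self, write_back: bool):
        self.write_back = write_back
        self.value = None
        self.dirty = False
        self.memory_writes = 0

    def write(self, value):
        self.value = value
        if self.write_back:
            self.dirty = True          # defer the main-memory update
        else:
            self.memory_writes += 1    # write-through: update memory now

    def evict(self):
        if self.write_back and self.dirty:
            self.memory_writes += 1    # copy the word back on eviction
            self.dirty = False

for policy in (False, True):
    c = Cache(write_back=policy)
    for v in range(10):                # the word is updated 10 times
        c.write(v)
    c.evict()
    print("write-back" if policy else "write-through", c.memory_writes)
# Write-through updates memory on every write (10 times);
# write-back writes memory only once, at eviction.
```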
The important terms related to the data or information stored in the cache as well as in the
main memory are as follows:
o Modified - The modified term signifies that the data stored in the cache and main memory
are different. This means the data in the cache has been modified, and the changes need
to be reflected in the main memory.
o Exclusive - The exclusive term signifies that the data is clean, i.e., the cache and the main
memory hold identical data.
o Shared - Shared refers to the fact that the cache value contains the most current data
copy, which is then shared across the whole cache as well as main memory.
o Owned - The owned term indicates that the block is currently held by the cache and that
it has acquired ownership of it, i.e., complete privileges to that specific block.
o Invalid - When a cache block is marked as invalid, it means that it needs to be fetched
from another cache or main memory.
Types of Coherence:
There are three coherency mechanisms, which are listed below:
1. Directory-based: The sharing status of a block of physical memory is kept in just one location, called the directory.
2. Snooping: Every cache that has a copy of the data from a block of physical memory monitors (snoops on) the bus to track the sharing status of the block.
3. Snarfing: A cache controller watches both the address and the data on the bus, and updates its own copy of a memory location when a second master modifies a location in main memory.
Vector Processor:
Vector processing is performed by a central processing unit that can operate on an entire vector of data with a single instruction. A vector processor is a complete unit of hardware resources that applies a single instruction to a sequential set of similar data elements in memory.
Scientific and research computations involve many calculations that require extensive, high-powered computers. When run on a conventional computer, these computations may take days or weeks to complete. Science and engineering problems can be expressed in terms of vectors and matrices and handled with vector processing.
• Each clock period, two successive pairs of elements are processed. During a single clock period, the dual vector pipes and the dual sets of vector functional units allow the processing of two pairs of elements.
As each pair of operations completes, the results are delivered to the appropriate elements of the result register. The operation continues until the number of elements processed equals the count specified by the vector length register.
• In parallel vector processing, more than two results are generated per clock cycle. The
parallel vector operations are automatically started under the following two
circumstances.
• When successive vector instructions use different functional units and different vector registers.
• When successive vector instructions use the result stream from one vector register as the operand of another operation using a different functional unit. This process is known as chaining.
• A vector processor performs better with longer vectors, because the pipeline start-up delay is amortized over more elements.
Parallel Processing
Parallel processing can be described as a class of techniques which enables the system to
achieve simultaneous data-processing tasks to increase the computational speed of a computer
system.
A parallel processing system can carry out simultaneous data-processing to achieve faster
execution time. For instance, while an instruction is being processed in the ALU component of
the CPU, the next instruction can be read from memory.
The primary purpose of parallel processing is to enhance the computer processing capability
and increase its throughput, i.e. the amount of processing that can be accomplished during a
given interval of time.
A parallel processing system can be achieved by having a multiplicity of functional units that
perform identical or different operations simultaneously. The data can be distributed among
various multiple functional units.
The following diagram shows one possible way of separating the execution unit into eight functional units operating in parallel.
The operation performed in each functional unit is indicated in each block of the diagram:
o The adder and integer multiplier perform arithmetic operations on integer numbers.
o The floating-point operations are separated into three circuits operating in parallel.
o The logic, shift, and increment operations can be performed concurrently on different
data. All units are independent of each other, so one number can be shifted while another
number is being incremented.