Unit-5 (Coa) Notes
Introduction to pipelining
The term Pipelining refers to a technique of decomposing a sequential process into sub-
operations, with each sub-operation being executed in a dedicated segment that operates
concurrently with all other segments.
The most important characteristic of a pipeline technique is that several computations can be
in progress in distinct segments at the same time. The overlapping of computation is made
possible by associating a register with each segment in the pipeline. The registers provide
isolation between each segment so that each can operate on distinct data simultaneously.
The structure of a pipeline organization can be represented simply by including an input register
for each segment followed by a combinational circuit.
Let us consider an example of combined multiplication and addition operation to get a better
understanding of the pipeline organization.
The combined multiplication and addition operation is performed on a stream of numbers, such as:
Ai * Bi + Ci   for i = 1, 2, 3, ..., 7
The operation to be performed on the numbers is decomposed into sub-operations with each
sub-operation to be implemented in a segment within a pipeline.
The sub-operations performed in each segment of the pipeline are defined as:
R1 <- Ai, R2 <- Bi   (input Ai and Bi)
R3 <- R1 * R2, R4 <- Ci   (multiply, and input Ci)
R5 <- R3 + R4   (add Ci to the product)
The following block diagram represents the combined as well as the sub-operations performed
in each segment of the pipeline.
Registers R1, R2, R3, and R4 hold the data, and the combinational circuits operate on it in each segment.
The output generated by the combinational circuit in a given segment is applied to the input register of the next segment. For instance, from the block diagram, we can see that register R3 is used as one of the input registers for the combinational adder circuit.
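The segment-by-segment flow described above can be sketched as a small simulation. The decomposition into segments and the register names R1-R5 follow the description in the text; the function name and the Python modelling are illustrative assumptions, not an authoritative implementation:

```python
# Minimal sketch of a pipeline computing Ai * Bi + Ci.
# Segment 1 latches the inputs, segment 2 multiplies (and latches Ci),
# segment 3 adds. Registers R1..R5 isolate the segments as in the text.

def pipeline_multiply_add(A, B, C):
    n = len(A)
    R1 = R2 = R3 = R4 = None
    results = []
    # Run enough clock cycles to drain the pipeline (n items + 2 fill cycles).
    for clock in range(n + 2):
        # Segment 3: add the latched product and Ci (values latched last cycle).
        if R3 is not None:
            results.append(R3 + R4)           # R5 <- R3 + R4
        # Segment 2: multiply latched inputs, latch Ci alongside the product.
        if R1 is not None:
            R3, R4 = R1 * R2, C[clock - 1]    # R3 <- R1 * R2, R4 <- Ci
        else:
            R3 = R4 = None
        # Segment 1: latch the next pair of inputs.
        if clock < n:
            R1, R2 = A[clock], B[clock]       # R1 <- Ai, R2 <- Bi
        else:
            R1 = R2 = None
    return results

print(pipeline_multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # [11, 18, 27]
```

Note how each segment reads only registers written in the previous cycle, which is what lets all segments operate on distinct data simultaneously.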
In general, the pipeline organization is applicable for two areas of computer design which
includes:
1. Arithmetic Pipeline
2. Instruction Pipeline
Arithmetic Pipeline
Arithmetic Pipelines are mostly used in high-speed computers. They are used to implement
floating-point operations, multiplication of fixed-point numbers, and similar computations
encountered in scientific problems.
To understand the concepts of arithmetic pipeline in a more convenient way, let us consider an
example of a pipeline unit for floating-point addition and subtraction.
The inputs to the floating-point adder pipeline are two normalized floating-point binary
numbers defined as:
X = A * 2^a = 0.9504 * 10^3
Y = B * 2^b = 0.8200 * 10^2
where A and B are two fractions that represent the mantissas, and a and b are the exponents. (The example values use base 10 for readability.)
The combined operation of floating-point addition and subtraction is divided into four
segments. Each segment contains the corresponding sub operation to be performed in the given
pipeline. The sub operations that are shown in the four segments are:
We will discuss each sub operation in a more detailed manner later in this section.
The following block diagram represents the sub operations performed in each segment of the
pipeline.
Note: Registers are placed after each sub operation to store the intermediate results.
1. Compare exponents by subtraction:
The exponents are compared by subtracting them to determine their difference. The larger
exponent is chosen as the exponent of the result.
The difference of the exponents, i.e., 3 - 2 = 1, determines how many times the mantissa associated with the smaller exponent must be shifted to the right.
2. Align the mantissas:
The mantissa associated with the smaller exponent is shifted right according to the difference of exponents determined in segment one:
X = 0.9504 * 10^3
Y = 0.08200 * 10^3
3. Add mantissas:
Z = X + Y = 1.0324 * 10^3
4. Normalize the result:
Z = 0.10324 * 10^4
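The four sub-operations can be sketched as one function per pipeline segment. The (mantissa, exponent) pairs use base 10 for readability, matching the worked example; the function names and this representation are assumptions for illustration:

```python
# Sketch of the four floating-point-addition pipeline segments,
# using (mantissa, exponent) pairs in base 10 as in the worked example.

def compare_exponents(x, y):
    # Segment 1: the exponent difference tells how far to shift
    # the mantissa associated with the smaller exponent.
    (ma, ea), (mb, eb) = x, y
    return x, y, abs(ea - eb)

def align_mantissas(x, y, diff):
    # Segment 2: shift the mantissa with the smaller exponent right.
    (ma, ea), (mb, eb) = x, y
    if ea >= eb:
        return (ma, ea), (mb / 10**diff, ea)
    return (ma / 10**diff, eb), (mb, eb)

def add_mantissas(x, y):
    # Segment 3: add the aligned mantissas; both share one exponent now.
    (ma, e), (mb, _) = x, y
    return (ma + mb, e)

def normalize(z):
    # Segment 4: renormalize so the mantissa is a fraction below 1.
    m, e = z
    while abs(m) >= 1.0:
        m /= 10
        e += 1
    return (round(m, 6), e)

x, y = (0.9504, 3), (0.8200, 2)
a, b, d = compare_exponents(x, y)
a, b = align_mantissas(a, b, d)
z = normalize(add_mantissas(a, b))
print(z)  # (0.10324, 4)
```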
Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream as well.
Most digital computers with complex instructions require an instruction pipeline to carry out operations like fetching, decoding, and executing instructions.
In general, the computer needs to process each instruction with the following sequence of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
The organization of an instruction pipeline will be more efficient if the instruction cycle is divided
into segments of equal duration. One of the most common examples of this type of organization
is a Four-segment instruction pipeline.
A four-segment instruction pipeline combines two or more of these steps into a single segment. For instance, the decoding of the instruction can be combined with the calculation of the effective address into one segment.
The following block diagram shows a typical example of a four-segment instruction pipeline. The
instruction cycle is completed in four segments.
Segment 1:
The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2:
The instruction fetched from memory is decoded in the second segment, and eventually, the
effective address is calculated in a separate arithmetic circuit.
Segment 3:
An operand is fetched from memory in the third segment.
Segment 4:
The instruction is finally executed in the last segment of the pipeline organization.
Pipelining is a process of arrangement of hardware elements of the CPU such that its overall
performance is increased. Simultaneous execution of more than one instruction takes place in
a pipelined processor. Let us see a real-life example that works on the concept of pipelined
operation. Consider a water bottle packaging plant. Let there be 3 stages that a bottle should
pass through, Inserting the bottle (I), Filling water in the bottle (F), and Sealing the bottle(S).
Let us consider these stages as stage 1, stage 2, and stage 3 respectively. Let each stage take
1 minute to complete its operation. Now, in a non-pipelined operation, a bottle is first inserted
in the plant, after 1 minute it is moved to stage 2 where water is filled. Now, in stage 1 nothing
is happening. Similarly, when the bottle moves to stage 3, both stage 1 and stage 2 are idle.
But in pipelined operation, when the bottle is in stage 2, another bottle can be loaded at stage
1. Similarly, when the bottle is in stage 3, there can be one bottle each in stage 1 and stage 2.
So, after each minute, we get a new bottle at the end of stage 3. Hence, the average time
taken to manufacture 1 bottle is:
Without pipelining (3 bottles):
Bottle 1:  I F S . . . . . .
Bottle 2:  . . . I F S . . .
Bottle 3:  . . . . . . I F S    (9 minutes)

With pipelining (3 bottles):
Bottle 1:  I F S . .
Bottle 2:  . I F S .
Bottle 3:  . . I F S    (5 minutes)
Thus, pipelined operation increases the efficiency of a system.
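The 9-minute and 5-minute figures above follow from two simple formulas, which can be checked with a short sketch (the function names are illustrative):

```python
# Time to process n items through k equal stages of t minutes each.

def non_pipelined_time(n, k, t):
    # Each item must finish all k stages before the next one starts.
    return n * k * t

def pipelined_time(n, k, t):
    # The first item takes k stages; each later item finishes one stage later.
    return (k + (n - 1)) * t

# 3 bottles, 3 stages (Insert, Fill, Seal), 1 minute per stage.
print(non_pipelined_time(3, 3, 1))  # 9 minutes
print(pipelined_time(3, 3, 1))      # 5 minutes
```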
Design of a basic pipeline
• In a pipelined processor, a pipeline has two ends, the input end and the output end.
Between these ends, there are multiple stages/segments such that the output of one stage
is connected to the input of the next stage and each stage performs a specific operation.
• Interface registers are used to hold the intermediate output between two stages. These
interface registers are also called latch or buffer.
• All the stages in the pipeline along with the interface registers are controlled by a common
clock.
• Execution in a pipelined processor: The execution sequence of instructions in a pipelined processor can be visualized using a space-time diagram. For example, consider a processor having 4 stages and let there be 2 instructions to be executed. We can visualize the execution sequence through the following space-time diagrams:
• Non-overlapped execution:
Stage / Cycle   1    2    3    4    5    6    7    8
S1              I1                  I2
S2                   I1                  I2
S3                        I1                  I2
S4                             I1                  I2
Total time = 8 cycles

• Overlapped (pipelined) execution:
Stage / Cycle   1    2    3    4    5
S1              I1   I2
S2                   I1   I2
S3                        I1   I2
S4                             I1   I2
Total time = 5 cycles

Pipeline Stages
A RISC processor has a 5-stage instruction pipeline to execute all the instructions in the RISC instruction set. Following are the 5 stages of the RISC pipeline with their respective operations:
• Stage 1 (Instruction Fetch) In this stage the CPU reads instructions from the address in
the memory whose value is present in the program counter.
• Stage 2 (Instruction Decode) In this stage, instruction is decoded and the register file is
accessed to get the values from the registers used in the instruction.
• Stage 3 (Instruction Execute) In this stage, ALU operations are performed.
• Stage 4 (Memory Access) In this stage, memory operands are read from or written to the memory address specified in the instruction.
• Stage 5 (Write Back) In this stage, the computed/fetched value is written back to the register specified in the instruction.
• Performance of a pipelined processor Consider a ‘k’ segment pipeline with clock cycle
time as ‘Tp’. Let there be ‘n’ tasks to be completed in the pipelined processor. Now, the
first instruction is going to take ‘k’ cycles to come out of the pipeline but the other ‘n –
1’ instructions will take only ‘1’ cycle each, i.e., a total of ‘n – 1’ cycles. So, time taken to
execute ‘n’ instructions in a pipelined processor:
ETpipeline = k + n – 1 cycles
= (k + n – 1) Tp
• In the same case, for a non-pipelined processor, the execution time of ‘n’ instructions
will be:
ETnon-pipeline = n * k * Tp
• So, speedup (S) of the pipelined processor over the non-pipelined processor, when ‘n’
tasks are executed on the same processor is:
S = Performance of pipelined processor /
Performance of non-pipelined processor
• As the performance of a processor is inversely proportional to the execution time, we
have,
S = ETnon-pipeline / ETpipeline
=> S = [n * k * Tp] / [(k + n – 1) * Tp]
S = [n * k] / [k + n – 1]
• When the number of tasks 'n' is significantly larger than k, that is, n >> k:
S = [n * k] / [n]
S = k
where 'k' is the number of stages in the pipeline.
Efficiency = Given speedup / Maximum speedup = S / Smax
We know that Smax = k. So,
Efficiency = S / k
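The speedup and efficiency formulas can be checked numerically; the values of k and n below are arbitrary examples:

```python
# Speedup and efficiency of a k-stage pipeline over a non-pipelined
# processor for n tasks. The clock period Tp cancels out of the ratio.

def pipeline_speedup(n, k):
    # S = [n * k] / [k + n - 1]
    return (n * k) / (k + n - 1)

def pipeline_efficiency(n, k):
    # Efficiency = S / Smax, where Smax = k.
    return pipeline_speedup(n, k) / k

k = 4
for n in (4, 100, 10_000):
    print(n, round(pipeline_speedup(n, k), 3), round(pipeline_efficiency(n, k), 3))
# As n >> k, the speedup approaches k = 4 and the efficiency approaches 1.
```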
Dependencies in a pipelined processor
There are mainly three types of dependencies possible in a pipelined processor:
1. Structural dependencies
2. Data dependencies
3. Control dependencies
Because of these dependencies, stalls are introduced into the pipeline. A stall is a cycle in the pipeline without new input. In other words, a stall happens when a later instruction depends on the output of an earlier instruction.
Structural dependencies
Structural dependency arises because of resource conflicts in the pipeline. A resource conflict is a situation in which more than one instruction tries to access the same resource (such as the ALU, memory, or a register) in the same cycle.
Example:
Instructions / Cycle   1        2        3        4        5
I1                     IF(Mem)  ID       EX       Mem
I2                              IF(Mem)  ID       EX
I3                                       IF(Mem)  ID       EX
I4                                                IF(Mem)  ID
The above table contains four instructions I1, I2, I3, and I4, and five cycles. In cycle 4 there is a resource conflict, because I1 and I4 try to access the same resource; in our case, the resource is memory. The solution to this problem is to keep the later instruction waiting until the required resource becomes available. Because of this wait, stalls are introduced into the pipeline, as shown below:
Instructions / Cycle   1        2        3        4        5        6        7        8
I1                     IF(Mem)  ID       EX       Mem      WB
I2                              IF(Mem)  ID       EX       Mem      WB
I3                                       IF(Mem)  ID       EX       Mem      WB
I4                                                -        -        -        IF(Mem)
With the help of a hardware mechanism, we can minimize structural-dependency stalls in a pipeline. The mechanism is known as renaming.
Renaming: In this mechanism, the memory is divided into two independent modules, known as data memory (DM) and code memory (CM). All the instructions are stored in the CM, and all the operands required by the instructions are stored in the DM.
Instructions / Cycle   1       2       3       4       5       6       7
I1                     IF(CM)  ID      EX      DM      WB
I2                             IF(CM)  ID      EX      DM      WB
I3                                     IF(CM)  ID      EX      DM      WB
I4                                             IF(CM)  ID      EX      DM
I5                                                     IF(CM)  ID      EX
I6                                                             IF(CM)  ID
I7                                                                     IF(CM)
Control Dependency (Branch Hazards)
Control dependency arises when the pipeline does not yet know the target of a branch or jump instruction. For example, assume a program with the following sequence of instructions:
100: I1
101: I2
102: I3
.
.
250: BI1
Expected output sequence: I1 → I2 → BI1
Note: Only after the ID stage does the processor know the target address of the JMP instruction.
Instructions / Cycle   1    2    3            4    5    6
I1                     IF   ID   EX           MEM  WB
I2                          IF   ID(PC:250)   EX   MEM  WB
I3                               IF           ID   EX   MEM
BI1                                           IF   ID   EX
Actual output sequence: I1 → I2 → I3 → BI1
The above example shows that the actual output sequence is not equal to the expected output sequence: I3 is fetched even though the jump should skip it, so the pipeline is not correctly implemented.
We can correct this problem by stopping instruction fetch until the target address of the branch instruction is known. For this, we introduce a delay slot until the target address is available, as described in the following table:
Instructions / Cycle   1    2    3            4    5    6
I1                     IF   ID   EX           MEM  WB
I2                          IF   ID(PC:250)   EX   MEM  WB
Delay                            -            -    -    -
BI1                                           IF   ID   EX
In the above example, no operation is performed during the delay slot; the instruction fetch simply waits. As a result, the output sequence now matches the expected output, but because of this slot a stall is introduced into the pipeline.
In the control dependency, we can eliminate the stalls in the pipeline with the help of a method known as branch prediction. The prediction about whether the branch will be taken is made in the first stage of the pipeline; when the prediction is correct, the branch penalty is zero.
Branch penalty: The branch penalty is the number of stalls introduced during a branch operation in the pipelined processor.
Data Dependency (Data Hazards)
To illustrate this, assume two instructions I1 and I2, where I2 reads a register that I1 writes, before I1 has written its new value:
Instructions / Cycle 1 2 3 4
I1 IF ID EX DM
I2 IF ID (Old value) EX
Here we use operand forwarding to minimize the stalls caused by data dependency.
Operand forwarding: In this technique, we use the interface registers that exist between the stages to hold the intermediate output. With the help of these intermediate registers, the dependent instruction can directly access the new value.
To explain this, we will take the same example:
Instructions / Cycle 1 2 3 4
I1 IF ID EX DM
I2 IF ID EX
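The effect of operand forwarding on the stall count can be sketched as follows. The model is a textbook 5-stage pipeline, not any specific CPU, and it assumes the register file can be written and read in the same cycle:

```python
# Space-time schedule for two back-to-back instructions in a 5-stage
# pipeline (IF ID EX MEM WB) where I2 reads a register that I1 writes.
# Assumptions: registers are written in WB and can be read by ID in that
# same cycle; with forwarding, I1's EX result is bypassed straight to I2.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(forwarding: bool):
    i1 = {s: c + 1 for c, s in enumerate(STAGES)}   # I1 runs in cycles 1..5
    if forwarding:
        stalls = 0                                   # the bypass removes the wait
    else:
        # Without forwarding, I2's ID (naturally cycle 3) must wait
        # until I1's WB (cycle 5) to read the new register value.
        stalls = i1["WB"] - 3
    i2 = {s: c + 2 + stalls for c, s in enumerate(STAGES)}
    return stalls, i2

print(schedule(forwarding=False))  # 2 stall cycles; I2's ID slips to cycle 5
print(schedule(forwarding=True))   # 0 stall cycles; I2 runs in cycles 2..6
```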
Data Hazards
Data hazards occur because of data dependency. A data hazard arises when data is modified in different stages of a pipeline by instructions that exhibit data dependency. When instructions read or write registers that are used by other instructions, instruction hazards occur. Because of a data hazard, there will be a delay in the pipeline. Data hazards are basically of three types:
1. RAW
2. WAR
3. WAW
To understand these hazards, we will assume we have two instructions I1 and I2, in such a way
that I2 follows I1. The hazards are described as follows:
RAW:
RAW stands for 'Read After Write'. It is also known as flow/true data dependency. A RAW hazard occurs if a later instruction tries to read an operand before an earlier instruction writes it. The condition to detect a RAW hazard: the output set O(n) of instruction n and the input set I(n+1) of instruction n+1 have at least one operand in common.
For example:
Instructions / Cycle   1    2    3    4    5    6
I1                     IF   ID   EX   MEM  WB
I2                          IF   ID   EX   MEM  WB
WAR
WAR stands for 'Write After Read'. It is also known as anti-data dependency. A WAR hazard occurs if a later instruction tries to write an operand before an earlier instruction reads it. The condition to detect a WAR hazard: the input set I(n) of instruction n and the output set O(n+1) of instruction n+1 have at least one operand in common.
For example:
Here the subtraction instruction creates a WAR hazard because it writes R2, which is read by the earlier addition instruction. In a reasonable (in-order) pipeline, the WAR hazard is very uncommon or impossible. The hazard for the instructions 'add R1, R2, R3' and 'sub R2, R5, R4' is shown below:
Instructions / Cycle   1    2    3    4    5    6
I1                     IF   ID   EX   MEM  WB
I2                          IF   ID   EX   MEM  WB
When an instruction enters the write-back stage of the pipeline, all previous instructions in the program have already passed through the register-read stage and read their input values, so the write instruction can write its destination register without causing any problem. WAR hazards cause fewer problems than WAW hazards because the register-read stage occurs before the write-back stage of the pipeline.
WAW
WAW stands for 'Write After Write'. It is also known as output data dependency. A WAW hazard occurs if a later instruction tries to write an operand before an earlier instruction writes it. The condition to detect a WAW hazard: the output set O(n) of instruction n and the output set O(n+1) of instruction n+1 have at least one operand in common.
For example:
Here the subtraction instruction creates a WAW hazard because it writes to the same register as the addition instruction. The hazard for the instructions 'add R1, R2, R3' and 'sub R1, R2, R4' is shown below:
Instructions / Cycle   1    2    3    4    5    6
I1                     IF   ID   EX   MEM  WB
I2                          IF   ID   EX   MEM  WB
In the write-back stage of the pipeline, the output register of an instruction is written. Instructions with a WAW hazard enter the write-back stage in the same order in which they appear in the program, so their results are written into the register in the right order. A processor that allows instructions to execute in a different order can achieve improved performance compared to strict program order, but it must then preserve this write order.
WAR and WAW hazards occur because the processor contains a finite number of registers. For this reason, these hazards are also known as name dependencies. If the processor had an infinite number of registers, it would use a different register for the output of each instruction, and there would be no chance of WAR and WAW hazards occurring.
WAR and WAW hazards cause no delay if the processor uses the same pipeline for all instructions and executes them in the same order in which they appear in the program. This follows from the way instructions flow through the pipeline.
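The three detection conditions (O(n) ∩ I(n+1) for RAW, I(n) ∩ O(n+1) for WAR, O(n) ∩ O(n+1) for WAW) translate directly into set intersections. The encoding of an instruction as (output registers, input registers) is a simplification assumed for illustration:

```python
# Classify RAW / WAR / WAW hazards between two instructions using the
# set conditions from the text. An instruction is modelled as a pair
# (output register set, input register set).

def hazards(first, second):
    out1, in1 = first
    out2, in2 = second
    found = set()
    if out1 & in2:
        found.add("RAW")   # the second reads what the first writes
    if in1 & out2:
        found.add("WAR")   # the second writes what the first reads
    if out1 & out2:
        found.add("WAW")   # both write the same register
    return found

add_ = ({"R1"}, {"R2", "R3"})   # add R1, R2, R3
sub1 = ({"R2"}, {"R4", "R5"})   # sub R2, R5, R4  -> WAR with add
sub2 = ({"R1"}, {"R2", "R4"})   # sub R1, R2, R4  -> WAW with add

print(hazards(add_, sub1))  # {'WAR'}
print(hazards(add_, sub2))  # {'WAW'}
```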
What is a Microprocessor?
A microprocessor is a computer processor that is found in most modern personal computers,
smartphones, and other electronic devices. It is a central processing unit (CPU) that performs
most of the processing tasks in a computer system. The microprocessor is a key component of
a computer, as it controls the fetching, decoding, and execution of instructions that are stored
in memory. You can say that the microprocessor acts as the brain of a computing device, controlling its overall execution and operations. The development of microprocessors has played a significant role in the evolution of computers and has made it possible for them to become smaller, faster, and more powerful over time.
Basics of Microprocessor –
A microprocessor takes a set of machine-language instructions and executes them; these instructions tell the processor what it has to do. The microprocessor performs three basic things while executing an instruction:
1. It performs basic operations like addition, subtraction, multiplication, division, and some logical operations using its Arithmetic and Logic Unit (ALU). Newer microprocessors also perform operations on floating-point numbers.
2. It moves data from one memory location to another.
3. It has a Program Counter (PC) register that stores the address of the next instruction; based on the value of the PC, the microprocessor jumps from one location to another and makes decisions.
Evolution of Microprocessors
We can categorize the microprocessor according to the generations or according to the size of
the microprocessor:
First Generation (4-bit Microprocessors)
The first generation of microprocessors was introduced in 1971-1972 by Intel Corporation. It was named the Intel 4004, since it was a 4-bit processor.
It was a processor on a single chip. It could perform simple arithmetic and logical operations such as addition, subtraction, Boolean OR, and Boolean AND.
It had a control unit capable of performing control functions like fetching an instruction from memory, decoding it, and then generating control pulses to execute it.
Second Generation (8-bit Microprocessors)
The second generation of microprocessors was introduced in 1973, again by Intel. It was the first 8-bit microprocessor, which could perform arithmetic and logic operations on 8-bit words. It was the Intel 8008; a later improved version was the Intel 8080.
Third Generation (16-bit Microprocessors)
The third generation of microprocessors, introduced in 1978, was represented by Intel's 8086, the Zilog Z8000, and the Intel 80286, which were 16-bit processors with performance comparable to minicomputers.
Fourth Generation (32 - bit Microprocessors)
Several different companies introduced the 32-bit microprocessors, but the most popular one
is the Intel 80386.
Fifth Generation (64-bit Microprocessors)
From 1995 until now we are in the fifth generation. After the 80486, Intel came out with a new processor, namely the Pentium processor, followed by the Pentium Pro CPU, which allows multiple CPUs in a single system to achieve multiprocessing.
Other improved 64-bit processors include the Celeron and the Dual, Quad, and Octa core processors.
Types of Processor:
1) Reduced Instruction Set Computer (RISC) –
RISC is a microprocessor architecture that uses a small, highly optimized set of instructions, each of which typically executes in a single clock cycle.
Example:
1. IBM RS6000
2. MC88100
3. DEC Alpha 21064
4. DEC Alpha 21164
5. DEC Alpha 21264
2) Complex Instruction Set Computer (CISC) –
CISC or Complex Instruction Set Computer is a computer architecture in which a single instruction can execute multiple low-level operations, such as loading from memory, storing into memory, or an arithmetic operation. It supports multiple addressing modes within a single instruction. CISC makes use of very few registers.
Example:
1. Intel 386
2. Intel 486
3. Pentium
4. Pentium Pro
5. Pentium II
6. Pentium III
7. Motorola 68000
8. Motorola 68020
9. Motorola 68040 etc.
RISC Processor
RISC stands for Reduced Instruction Set Computer, a microprocessor architecture with a simple and highly customized set of instructions. It is built to minimize instruction execution time by optimizing and limiting the number of instructions. Each instruction cycle requires only one clock cycle, and each cycle contains three phases: fetch, decode, and execute. Complex operations are performed by combining simpler instructions. RISC chips require fewer transistors, which makes them cheaper to design and reduces the execution time of instructions.
RISC Architecture
RISC architecture uses a highly customized set of instructions and is used in portable devices such as the Apple iPod, mobiles/smartphones, and the Nintendo DS, owing to its system reliability. Its main features are:
1. One-cycle execution time: RISC processors execute each instruction with one CPI (clock cycle per instruction), and each cycle includes the fetch, decode, and execute phases of the instruction.
2. Pipelining technique: The pipelining technique is used in the RISC processors to execute
multiple parts or stages of instructions to perform more efficiently.
3. A large number of registers: RISC processors are designed with multiple registers that can store instructions and operands, respond quickly, and minimize interaction with computer memory.
4. It supports a simple addressing mode and fixed length of instruction for executing the
pipeline.
5. It uses LOAD and STORE instruction to access the memory location.
6. Simple and limited instruction reduces the execution time of a process in a RISC.
CISC Processor
CISC stands for Complex Instruction Set Computer; the approach was popularized by Intel. A CISC processor has a large collection of instructions, ranging from simple to very complex and specialized, at the assembly-language level, and these instructions can take a long time to execute. The CISC approach tries to reduce the number of instructions per program while accepting a larger number of cycles per instruction. It emphasizes building complex instructions directly into the hardware, because hardware is always faster than software. CISC chips are relatively slower per instruction than RISC chips, but they use fewer instructions than RISC. Examples of CISC processors are the VAX, the System/360, and the Intel x86 and AMD x86 CPUs.
The CISC architecture helps reduce program code by embedding multiple operations in each program instruction, which makes the CISC processor more complex. The CISC architecture was designed to decrease memory costs: large programs need large memory to store them, and a large amount of memory is expensive, so shorter programs reduce cost.
Advantages of the CISC processor:
1. The compiler requires little effort to translate high-level programs or statements into assembly or machine language.
2. The code length is quite short, which minimizes the memory requirement.
3. Storing a program requires very little RAM.
4. Execution of a single instruction accomplishes several low-level tasks.
5. CISC creates a process to manage power usage that adjusts clock speed and voltage.
6. It needs fewer instructions than RISC to perform the same task.
Disadvantages of the CISC processor:
1. CISC chips are slower than RISC chips in executing each instruction cycle of a program.
2. The performance of the machine decreases because of the slower clock speed.
3. Pipelining in a CISC processor is complicated to implement.
4. CISC chips require more transistors than a RISC design.
5. Typically, only about 20% of the existing instructions are used in a program.
RISC vs CISC:
• RISC has a hard-wired control unit; CISC has a microprogrammed control unit.
• RISC requires multiple register sets to store instructions; CISC requires a single register set.
• RISC has simple decoding of instructions; CISC has complex decoding of instructions.
• Use of the pipeline is simple in RISC; use of the pipeline is difficult in CISC.
• RISC uses a limited number of instructions that require less time to execute; CISC uses a large number of instructions that require more time to execute.
• RISC uses LOAD and STORE as independent instructions in register-to-register interaction; CISC uses LOAD and STORE within the memory-to-memory interaction of a program.
• RISC has more transistors on memory registers; CISC has transistors to store complex instructions.
• The execution time of RISC is very short; the execution time of CISC is longer.
• RISC architecture can be used with high-end applications like telecommunication, image processing, and video processing; CISC architecture can be used with low-end applications like home automation and security systems.
• A program written for RISC architecture tends to take more space in memory; a program written for CISC architecture tends to take less space in memory.
• Examples of RISC: ARM, PA-RISC, Power Architecture, Alpha, AVR, ARC, and SPARC. Examples of CISC: VAX, the Motorola 68000 family, System/360, and AMD and Intel x86 CPUs.
Introduction of Multiprocessor
Multiprocessor:
A multiprocessor is a computer system with two or more central processing units (CPUs) that share full access to a common RAM. The main objective of using a multiprocessor is to boost the system's execution speed, with other objectives being fault tolerance and application matching.
There are two types of multiprocessors: shared memory multiprocessors and distributed memory multiprocessors. In a shared memory multiprocessor, all the CPUs share the common memory, but in a distributed memory multiprocessor, every CPU has its own private memory.
Applications of Multiprocessor –
1. As a uniprocessor, such as single instruction, single data stream (SISD).
2. As a multiprocessor, such as single instruction, multiple data stream (SIMD), which is
usually used for vector processing.
3. For multiple series of instructions operating on a single data stream (MISD), which is used to describe hyper-threading or pipelined processors.
4. For executing multiple, individual series of instructions on multiple data streams inside a single system (MIMD).
Benefits of using a multiprocessor:
• Enhanced performance.
• Multiple applications.
• Multi-tasking inside an application.
• High throughput and responsiveness.
• Hardware sharing among CPUs.
Multicomputer:
A multicomputer system is a computer system with multiple processors that are connected together to solve a problem. Each processor has its own memory, accessible only by that particular processor, and the processors communicate with each other via an interconnection network.
As the multicomputer is capable of message passing between the processors, a task can be divided among the processors to complete it. Hence, a multicomputer can be used for distributed computing. It is more cost-effective and easier to build a multicomputer than a multiprocessor.
Differences between multiprocessor and multicomputer:
1. A multiprocessor is a system with two or more central processing units (CPUs) capable of performing multiple tasks, whereas a multicomputer is a system with multiple processors attached via an interconnection network to perform a computation task.
2. A multiprocessor system is a single computer that operates with multiple CPUs, whereas a multicomputer system is a cluster of computers that operate as a single computer.
3. Construction of a multicomputer is easier and more cost-effective than that of a multiprocessor.
4. In a multiprocessor system, programming tends to be easier, whereas in a multicomputer system, programming tends to be more difficult.
5. A multiprocessor supports parallel computing; a multicomputer supports distributed computing.
Cache Coherence
A cache coherence issue results from the concurrent operation of several processors and the
possibility that various caches may hold different versions of the identical memory block. The
practice of cache coherence makes sure that alterations in the contents of associated operands
are quickly transmitted across the system.
The cache coherence problem is the issue that arises when several copies of the same data
are kept at various levels of memory.
The two methods listed below can be used to resolve the cache coherence issue:
o Write Through
o Write Back
Write Through
The easiest and most popular method is to write through. Every memory write operation
updates the main memory. If the word is present in the cache memory at the requested address,
the cache memory is also updated simultaneously with the main memory.
The benefit of this approach is that the RAM and cache always hold the same information. This quality is crucial in systems with direct memory access (DMA) transfer. It ensures that the information in main memory is up to date at all times, so that a device communicating through DMA can access the most recent information.
Write Back
Only the cache location is updated during a write operation in this approach. The location is flagged as dirty, and when the word is removed from the cache it is copied back into main memory. The write-back approach was developed because words may be updated numerous times while they are in the cache. As long as they remain there, it does not matter whether the copy in main memory is outdated, because requests for those words are fulfilled from the cache. An accurate copy need only be transferred back to main memory when the word is evicted from the cache. Analytical results indicate that between 10% and 30% of all memory references in a typical program are writes to memory.
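The two write policies can be contrasted with a toy model that counts main-memory writes; the single-word "cache" below is a deliberate simplification, not a real cache design:

```python
# Toy model contrasting write-through and write-back for a single cached
# word, counting how many times main memory is written.

class Cache:
    def __init__(self, write_back: bool):
        self.write_back = write_back
        self.value = None
        self.dirty = False
        self.memory_writes = 0

    def write(self, value):
        self.value = value
        if self.write_back:
            self.dirty = True          # defer the main-memory update
        else:
            self.memory_writes += 1    # write-through: update memory now

    def evict(self):
        if self.write_back and self.dirty:
            self.memory_writes += 1    # copy the word back on eviction
            self.dirty = False

for policy in (False, True):
    c = Cache(write_back=policy)
    for v in range(10):                # the word is updated 10 times
        c.write(v)
    c.evict()
    print("write-back" if policy else "write-through", c.memory_writes)
# Write-through updates memory on every write (10 times);
# write-back writes memory only once, at eviction.
```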
The important terms related to the data or information stored in the cache as well as in the
main memory are as follows:
o Modified - The modified term signifies that the data stored in the cache and main memory
are different. This means the data in the cache has been modified, and the changes need
to be reflected in the main memory.
o Exclusive - The exclusive term signifies that the data is clean, i.e., the cache and the main
memory hold identical data.
o Shared - Shared refers to the fact that the cache value contains the most current data
copy, which is then shared across the whole cache as well as main memory.
o Owned - The owned term indicates that the block is currently held by the cache and that
it has acquired ownership of it, i.e., complete privileges to that specific block.
o Invalid - When a cache block is marked as invalid, it means that it needs to be fetched
from another cache or main memory.
Types of Coherence:
There are three coherency mechanisms, which are listed below:
1. Directory-based: The sharing status of a block of physical memory is kept in just one location, called the directory.
2. Snooping: Every cache that has a copy of the data from a block of physical memory monitors (snoops on) the bus to track the sharing status of the block.
3. Snarfing: A cache controller watches both the address and the data on the bus, and updates its own copy of a memory location when a second master modifies a location in main memory.
Vector Processor:
Vector processing is performed by a central processing unit that can operate on an entire vector of data with a single instruction. A vector processor is a complete unit of hardware resources that applies a single instruction to a sequential set of similar data elements in memory.
Scientific and research computations involve many calculations that require extensive, high-powered computers. When run on a conventional computer, these computations may take days or weeks to complete. Science and engineering problems can be expressed in terms of vectors and matrices and handled with vector processing.
• Each clock period, two successive pairs of elements are processed. During a single clock period, the dual vector pipes and the dual sets of vector functional units allow the processing of two pairs of elements.
As each pair of operations completes, the results are delivered to the appropriate elements of the result register. The operation continues until the number of elements processed equals the count specified by the vector length register.
• In parallel vector processing, more than two results are generated per clock cycle. The
parallel vector operations are automatically started under the following two
circumstances.
• When successive vector instructions use different functional units and different vector registers.
• When successive vector instructions use the result stream from one vector register as the operand of another operation using a different functional unit. This process is known as chaining.
• A vector processor performs better with longer vectors, because the pipeline start-up delay is amortized over more elements.
Parallel Processing
Parallel processing can be described as a class of techniques which enables the system to
achieve simultaneous data-processing tasks to increase the computational speed of a computer
system.
A parallel processing system can carry out simultaneous data-processing to achieve faster
execution time. For instance, while an instruction is being processed in the ALU component of
the CPU, the next instruction can be read from memory.
The primary purpose of parallel processing is to enhance the computer processing capability
and increase its throughput, i.e. the amount of processing that can be accomplished during a
given interval of time.
A parallel processing system can be achieved by having a multiplicity of functional units that
perform identical or different operations simultaneously. The data can be distributed among
various multiple functional units.
The following diagram shows one possible way of separating the execution unit into eight functional units operating in parallel.
The operation performed in each functional unit is indicated in each block of the diagram:
o The adder and integer multiplier perform arithmetic operations on integer numbers.
o The floating-point operations are separated into three circuits operating in parallel.
o The logic, shift, and increment operations can be performed concurrently on different
data. All units are independent of each other, so one number can be shifted while another
number is being incremented.