Advanced Computer Architecture
UNIT – 1
Contents:
1.1 Introduction
Over the past four decades the computer industry has experienced four generations of
development, moving from vacuum tubes (1940s – 1950s), to discrete diodes and transistors
(1950s – 1960s), to small- and medium-scale integrated circuits (1960s – 1970s), and to very
large scale integrated devices (1970s and beyond). Increases in device speed and reliability
and reductions in hardware cost and physical size have greatly enhanced computer performance.
This lesson also demonstrates the relationships between data, information, knowledge and intelligence.
Parallel processing demands concurrent execution of many programs in a computer. The highest
level of parallel processing is conducted among multiple jobs through multiprogramming, time
sharing and multiprocessing.
The first IBM scientific transistorized computer, the IBM 1620, became available in 1960.
1978 - Visicalc spreadsheet software was written by Daniel Bricklin and Bob Frankston
1979 - Micropro released Wordstar that set the standard for word processing software
1980 - IBM signed a contract with the Microsoft Co. of Bill Gates and Paul Allen and
Steve Ballmer to supply an operating system for IBM's new PC model. Microsoft paid
$25,000 to Seattle Computer for the rights to QDOS that became Microsoft DOS, and
Microsoft began its climb to become the dominant computer company in the world.
1984 - Apple Computer introduced the Macintosh personal computer January 24.
1987 - Bill Atkinson of Apple Computers created a software program called HyperCard
that was bundled free with all Macintosh computers.
Computer usage started with data processing, which is still a major task of today's
computers. With more and more data structures developed, many users are shifting their computer
use from pure data processing to information processing. A high degree of parallelism has been
found at these levels. As accumulated knowledge bases have expanded rapidly in recent years,
there has grown a strong demand to use computers for knowledge processing. Intelligence is very
difficult to create; its processing is even more so.
Today's computers are very fast and obedient and have many reliable memory cells, qualifying them
for data, information and knowledge processing.
Computers are still far from satisfactory in performing theorem proving, logical inference and
creative thinking.
From an operating point of view, computer systems have improved chronologically in four
phases:
batch processing
multiprogramming
time sharing
multiprocessing
Figure 1.1 The spaces of data, information, knowledge and intelligence from the viewpoint of computer
processing (the complexity and sophistication of processing increase, while the volume of raw material
to be processed decreases, moving up from data processing through information and knowledge processing
to intelligence processing)
In these four operating modes, the degree of parallelism increases sharply from phase to phase.
We define parallel processing as follows:
Parallel processing is an efficient form of information processing which emphasizes the
exploitation of concurrent events in the computing process. Concurrency implies parallelism,
simultaneity, and pipelining. Parallel processing demands concurrent execution of many
programs in the computer. The highest level of parallel processing is conducted among multiple
jobs or programs through multiprogramming, time sharing, and multiprocessing.
(Figure: system architecture of the supermini VAX-11/780 uniprocessor system. A console and disk attach
to the CPU, which contains the program counter, general registers, ALU and local memory, and a main
memory of 2^32 words of 32 bits each.)
The trend is also supported by the increasing demand for faster real-time, resource-sharing
and fault-tolerant computing environments.
It requires a broad knowledge of and experience with all aspects of algorithms, languages,
software, hardware, performance evaluation and computing alternatives.
Achieving parallel processing requires the development of more capable and cost-effective
computer systems.
With respect to parallel processing, the general architecture trend has shifted from
conventional uniprocessor systems to multiprocessor systems, and to arrays of processing elements
controlled by one uniprocessor. From the operating system point of view, computer systems have
been improved through batch processing, multiprogramming, time sharing and multiprocessing.
Computers of the 1990s are regarded as the next generation; very large scale integrated
chips will be used with high-density modular design. More than 1000 mega floating-point
operations per second are expected of these future supercomputers. The evolution of computer systems
helps in learning the generations of computer systems.
Successive generations moved from vacuum tubes (1940s – 1950s), to discrete diodes and transistors
(1950s – 1960s), to small- and medium-scale integrated circuits (1960s – 1970s), and to very
large scale integrated devices (1970s and beyond).
1.6 References
Contents:
The main aim of this lesson is to introduce the architectural concepts of uniprocessor systems. The
development of uniprocessor systems is introduced category by category.
2.1 Introduction
The typical uniprocessor system consists of three major components: the main memory,
the central processing unit (CPU) and the input-output (I/O) subsystem. The CPU contains an
arithmetic and logic unit (ALU) with an optional floating-point accelerator, and some local cache
memory with an optional diagnostic memory. The CPU, the main memory and the I/O
subsystems are all connected to a common bus, the synchronous backplane interconnect (SBI).
Through this bus, all I/O devices can communicate with each other, with the CPU, or with the
memory.
A number of parallel processing mechanisms have been developed in uniprocessor
computers; they are identified as multiplicity of functional units, parallelism and pipelining
within the CPU, overlapped CPU and I/O operations, use of a hierarchical memory system, and
multiprogramming and time sharing.
A typical uniprocessor computer consists of three major components: the main memory,
the central processing unit (CPU), and the input-output (I/O) subsystem.
The architectures of two commercially available uniprocessor computers are given below to
show the possible interconnection of structures among the three subsystems.
There are sixteen 32-bit general purpose registers, one of which serves as the program
counter (PC). There is also a special CPU status register containing information about the current
state of the processor and of the program being executed. The CPU contains an arithmetic and
logic unit (ALU) with an optional floating-point accelerator, and some local cache memory with
an optional diagnostic memory.
The CPU, the main memory and the I/O subsystems are all connected to a common bus, the
synchronous backplane interconnect (SBI). Through this bus, all I/O devices can communicate
with each other, with the CPU, or with the memory. Peripheral storage or I/O devices can be
connected directly to the SBI through the Unibus and its controller or through a Massbus and its
controller.
(Figure: system architecture of the mainframe IBM System 370/Model 168 uniprocessor computer. The CPU
is connected through a storage controller to four logical storage units LSU0 to LSU3 and to the I/O
channels.)
The CPU contains the instruction decoding and execution units as well as a cache. Main
memory is divided into four units, referred to as logical storage units, that are four-way
interleaved. The storage controller provides multiport connections between the CPU and the
four LSUs. Peripherals are connected to the system via high-speed I/O channels which operate
asynchronously with the CPU.
Early computers had only one ALU in the CPU, and hence performing a long sequence
of ALU instructions took a long time. The CDC-6600 has 10 functional units built into
its CPU.
These 10 units are independent of each other and may operate simultaneously.
A scoreboard is used to keep track of the availability of the functional units and registers being
demanded. With 10 functional units and 24 registers available, the instruction issue rate can be
significantly increased.
Another good example of a multifunction uniprocessor is the IBM 360/91, which has two
parallel execution units: one for fixed-point arithmetic and the other for floating-point arithmetic.
Within the floating-point E-unit are two functional units: one for floating-point add-subtract and
the other for floating-point multiply-divide. The IBM 360/91 is a highly pipelined, multifunction
scientific uniprocessor.
Parallel adders, using such techniques as carry-lookahead and carry-save, are now built
into almost all ALUs. This is in contrast to the bit-serial adders used in first-generation
machines. High-speed multiplier recoding and convergence division are techniques for
exploiting parallelism and sharing hardware resources for the multiply and divide functions.
The use of multiple functional units is a form of parallelism within the CPU.
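To make the carry-lookahead idea concrete, here is a minimal Python sketch (illustrative only, not any
particular machine's adder): it forms the generate and propagate signals for a 4-bit group and derives
every carry from them and the carry-in, instead of rippling the carry from bit to bit.

# Sketch of 4-bit carry-lookahead addition (illustrative only).
def cla_add4(a, b, c_in=0):
    """Add two 4-bit numbers using carry-lookahead carries."""
    g = [(a >> i & 1) & (b >> i & 1) for i in range(4)]   # generate signals
    p = [(a >> i & 1) | (b >> i & 1) for i in range(4)]   # propagate signals
    c = [c_in]
    for i in range(4):
        # in hardware all these carries come out of two-level logic in parallel;
        # here we simply evaluate the same expressions in order
        c.append(g[i] | (p[i] & c[i]))
    s = 0
    for i in range(4):
        s |= ((a >> i & 1) ^ (b >> i & 1) ^ c[i]) << i
    return s, c[4]

print(cla_add4(0b1011, 0b0110))   # (1, 1): sum bits 0001 with carry-out 1, i.e. 11 + 6 = 17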
Various phases of instruction execution are now pipelined, including instruction
fetch, decode, operand fetch, arithmetic-logic execution, and store result.
I/O operations can be performed simultaneously with CPU computations by using
separate I/O controllers, channels, or I/O processors.
The direct memory access (DMA) channel can be used to provide direct information transfer
between the I/O devices and the main memory. The DMA is conducted on a cycle-stealing basis,
which is transparent to the CPU.
The CPU is orders of magnitude faster than memory access. A hierarchical memory system can be
used to close up the speed gap. The hierarchical order, from the innermost level outward, is
Registers
Cache
Main memory
Magnetic disk
Magnetic tape
The innermost level is the register file, directly addressable by the ALU.
Cache memory serves as a buffer between the CPU and the main memory. Virtual
memory space can be established with the use of disks and tapes at the outer levels.
The CPU is the fastest unit in the computer. The bandwidth of a system is defined as the number
of operations performed per unit time. In the case of main memory, the memory bandwidth is
measured by the number of words that can be accessed per unit time.
The speed gap between the CPU and the main memory can be closed by using a fast
cache memory between them. A block of memory words is moved from the main memory into
the cache so that immediate instructions can be available most of the time from the cache.
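As a rough illustration of why this works, the sketch below models a tiny direct-mapped cache with
hypothetical sizes (8 lines of 4 words each); sequential instruction fetches miss once per block and then
hit in the cache.

# Toy direct-mapped cache: 8 lines, 4 words per block (hypothetical sizes).
LINES, BLOCK = 8, 4
tags = [None] * LINES

def access(addr):
    block = addr // BLOCK
    index = block % LINES
    if tags[index] == block:          # word already present in the cache
        return "hit"
    tags[index] = block               # fetch the whole block from main memory
    return "miss"

refs = list(range(0, 16))             # sequential instruction fetches
print([access(a) for a in refs])      # one miss per block, then hits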
Input-output channels with different speeds can be used between the slow I/O devices
and the main memory. The I/O channels perform buffering and multiplexing functions to transfer
data from multiple disks into the main memory by stealing cycles from the CPU.
Multiprogramming
Within the same interval of time, there may be multiple processes active in a computer,
competing for memory, I/O and CPU resources. Some computers are I/O bound and some are
CPU bound. Various types of programs are mixed up to balance bandwidths among functional
units.
Example
Whenever a process P1 is tied up with the I/O processor performing an input-output
operation, the CPU can at the same moment be tied up with a process P2. This allows simultaneous
execution of programs. The interleaving of CPU and I/O operations among several
programs is called multiprogramming.
Time-Sharing
The mainframes of the batch era were firmly established by the late 1960s when advances
in semiconductor technology made the solid-state memory and integrated circuit feasible. These
12
advances in hardware technology spawned the minicomputer era. They were small, fast, and
inexpensive enough to be spread throughout the company at the divisional level.
Multiprogramming mainly deals with the sharing of the CPU by many programs. Sometimes
high-priority programs may occupy the CPU for a long time, and other programs are put in a
queue. This problem can be overcome by a concept called time sharing, in which every
process is allotted a time slice of CPU time; after its respective time slice is over, the
CPU is allotted to the next program. If a process is not completed, it waits in the queue for
its next chance to receive CPU time.
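A minimal sketch of the time-slice idea, with hypothetical process names and CPU requirements: a
round-robin loop hands out fixed slices so that no long job can monopolize the CPU.

from collections import deque

# Remaining CPU time needed by each process (hypothetical figures, in ms)
ready = deque([("P1", 30), ("P2", 5), ("P3", 12)])
SLICE = 10                               # time slice allotted per turn

clock = 0
while ready:
    name, need = ready.popleft()
    run = min(SLICE, need)
    clock += run
    if need - run > 0:
        ready.append((name, need - run))   # back to the end of the queue
    else:
        print(f"{name} finished at t = {clock} ms")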
The architectural design of uniprocessor systems has been discussed with the help of two
examples: the system architecture of the supermini VAX-11/780 uniprocessor system, and the system
architecture of the mainframe IBM System 370/Model 168 uniprocessor computer. Various
components such as the main memory, the Unibus adapter, the Massbus adapter, the SBI and the I/O
devices have been discussed.
A number of parallel processing mechanisms have been developed in uniprocessor
computers, and a categorization has been made to understand the various forms of parallelism.
2.6 References
3.1 Introduction
Parallel computers are those systems that emphasize parallel processing. The process of
executing an instruction in a digital computer involves four major steps: instruction fetch,
instruction decoding, operand fetch, and execution.
In a pipelined computer, successive instructions are executed in an overlapped fashion.
In a non-pipelined computer, these four steps must be completed before the next instruction can
be issued.
An array processor is a synchronous parallel computer with multiple arithmetic logic
units called processing elements (PE) that can operate in parallel in lock step fashion.
By replication one can achieve spatial parallelism. The PEs are synchronized to perform the
same function at the same time.
A basic multiprocessor contains two or more processors of comparable capabilities. All
processors share access to common sets of memory modules, I/O channels and peripheral
devices.
Parallel computers are those systems that emphasize parallel processing. We divide parallel
computers into three architectural configurations:
Pipeline computers
Array processors
Multiprocessors
The four major steps of instruction execution are:
Instruction fetch
Instruction decoding
Operand fetch
Execution
In a pipelined computer, successive instructions are executed in an overlapped fashion.
In a non-pipelined computer, these four steps must be completed before the next instruction
can be issued.
(Figure: a linear pipeline with stages S0 through S4.)
By replication one can achieve spatial parallelism. The PEs are synchronized to perform the
same function at the same time.
Scalar and control types of instructions are directly executed in the control unit (CU).
Each PE consists of an ALU, registers and a local memory. The PEs are interconnected by a data-
routing network. Vector instructions are broadcast to the PEs for distributed execution over
different component operands fetched directly from the local memories. Array processors designed
with associative memories are called associative processors.
Figure 3.3 Functional structure of a modern pipeline computer with scalar and vector
capabilities
The fundamental difference between an array processor and a multiprocessor system is that the
processing elements in an array processor operate synchronously, whereas the processors in a
multiprocessor system may not operate synchronously.
3.6 References
From Net : Tarek A. El-Ghazawi, Dept. of Electrical and Computer Engineering, The George
Washington University
Contents:
The main objective is to learn various architectural classification schemes, Flynn’s classification,
Feng’s classification, and Handler’s Classification.
4.1 Introduction
Flynn's classification scheme is based on the multiplicity of instruction streams and
data streams in a computer system. Feng's scheme is based on serial versus parallel processing.
Handler's classification is determined by the degree of parallelism and pipelining in various
subsystem levels.
The most popular taxonomy of computer architecture was defined by Flynn in 1966.
Flynn's classification scheme is based on the notion of a stream of information. Two types of
information flow into a processor: instructions and data. The instruction stream is defined as the
sequence of instructions performed by the processing unit. The data stream is defined as the data
traffic exchanged between the memory and the processing unit.
According to Flynn's classification, either of the instruction or data streams can be single or
multiple.
Computer architecture can be classified into the following four distinct categories:
single-instruction single-data streams (SISD);
single-instruction multiple-data streams (SIMD);
multiple-instruction single-data streams (MISD); and
multiple-instruction multiple-data streams (MIMD).
CU-control unit
PU-processor unit
MM-memory module
SM-Shared memory
IS-instruction stream
DS-data stream
In the MISD organization there are n processor units, each receiving distinct instructions operating
over the same data stream and its derivatives. The output of one processor becomes the input of the
next in the macropipe. No real embodiment of this class exists.
MIMD is currently the most common type of parallel computer; most modern parallel computers fall
into this category.
Multiple instruction: every processor may be executing a different instruction stream.
Multiple data: every processor may be working with a different data stream.
Execution can be synchronous or asynchronous, deterministic or non-deterministic.
Examples: most current supercomputers, networked parallel computer "grids" and multi-
processor SMP computers, including some types of PCs.
Flynn's classes and example systems:
1. SISD                                     IBM 701; IBM 1620; IBM 7090; PDP VAX11/780
2. SISD (with multiple functional units)    IBM 360/91 (3); IBM 370/168 UP
3. SIMD (word-slice processing)             Illiac-IV; PEPE
4. SIMD (bit-slice processing)              STARAN; MPP; DAP
5. MIMD (loosely coupled)                   IBM 370/168 MP; Univac 1100/80
6. MIMD (tightly coupled)                   Burroughs D-825
Tse-yun Feng suggested the use of degree of parallelism to classify various computer
architectures.
Serial Versus Parallel Processing
The maximum number of binary digits that can be processed within a unit time by a
computer system is called the maximum parallelism degree P.
A bit slice is a string of bits one from each of the words at the same vertical position.
There are 4 types of methods under above classification
Word Serial and Bit Serial (WSBS)
Word Parallel and Bit Serial (WPBS)
Word Serial and Bit Parallel(WSBP)
Word Parallel and Bit Parallel (WPBP)
WSBS has been called bit-serial processing because one bit (n = m = 1) is processed at a time.
WPBS has been called bit-slice processing because an m-bit slice is processed at a time.
WSBP is found in most existing computers and has been called word-slice processing
because one word of n bits is processed at a time.
WPBP is known as fully parallel processing, in which an array of n x m bits is processed at one
time.
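The contrast between word-slice and bit-slice processing can be sketched as follows. In this
illustrative Python fragment the same m = 3 words of n = 4 bits are processed first one whole word at a
time (as in WSBP) and then one bit-slice at a time (as in WPBS), a bit-slice being one vertical column of
bits taken from all the words.

words = [0b1011, 0b0110, 0b1101]      # m = 3 words of n = 4 bits each
n = 4

# Word-slice (WSBP): one whole n-bit word processed per step
for w in words:
    print(f"word {w:0{n}b}")

# Bit-slice (WPBS): one m-bit slice per step, i.e. bit i of every word together
for i in range(n):
    bit_slice = [(w >> i) & 1 for w in words]
    print(f"slice of bit {i}: {bit_slice}")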
Mode                              Computer model     Degree of parallelism (n, m)
WSBS (n = 1, m = 1)               The 'MINIMA'       (1, 1)
WPBS (n = 1, m > 1)               STARAN             (1, 256)
  (bit-slice processing)          MPP                (1, 16384)
                                  DAP                (1, 4096)
WSBP (n > 1, m = 1)               IBM 370/168 UP     (64, 1)
  (word-slice processing)         CDC 6600           (60, 1)
                                  Burroughs 7700     (48, 1)
                                  VAX 11/780         (16/32, 1)
WPBP (n > 1, m > 1)               Illiac IV          (64, 64)
  (fully parallel processing)
Wolfgang Handler has proposed a classification scheme for identifying the parallelism
degree and pipelining degree built into the hardware structure of a computer system. He
considers parallelism at three subsystem levels: the processor control unit (PCU), the arithmetic
logic unit (ALU) and the bit-level circuit (BLC).
Each PCU corresponds to one processor or one CPU. The ALU is equivalent to a processing
element (PE). The BLC corresponds to the combinational logic circuitry needed to perform one-bit
operations in the ALU.
A computer system C can be characterized by a triple containing six independent entities,
T(C) = <K x K', D x D', W x W'>, where K is the number of PCUs, K' the number of PCUs that can be
pipelined, D the number of ALUs under the control of one PCU, D' the number of ALUs that can be
pipelined, W the word length of an ALU, and W' the number of pipeline stages in an ALU.
The architectural classification schemes have been presented in this lesson under three different
headings: Flynn's, Feng's and Handler's classifications. The instruction-format representation has also
been given for Flynn's scheme, and examples of all the classifications have been discussed.
4.6 References
http://en.wikipedia.org/wiki/Multiprocessing
Free On-line Dictionary of Computing, which is licensed under the GFDL.
Contents:
The main objective of this lesson is to introduce some representative applications of high-
performance computers. This helps in understanding the computational needs of important
applications.
5.1 Introduction
Fast and efficient computers are in high demand in many scientific, engineering,
energy resource, medical, military, artificial intelligence and basic research areas. Large-scale
computations are performed in these application areas. Parallel processing computers are needed
to meet these demands.
Fast and efficient computers are in high demand in many scientific, engineering, energy
resource, medical, military, AI, and basic research areas.
Parallel processing computers are needed to meet these demands.
Large scale scientific problem solving involves three interactive disciplines;
Theories
Experiments
Computations
Theoretical scientists develop mathematical models that computer engineers solve numerically;
the numerical results then suggest new theories. Experimental science provides data for
computational science, and the latter can model processes that are hard to approach in the
laboratory.
Computer simulation has several advantages:
It is far cheaper than physical experiments.
It can solve a much wider range of problems than specific laboratory equipment can.
Computational approaches are limited only by computer speed and memory capacity, while
physical experiments have many special practical constraints.
Predictive modelling is done through extensive computer simulation experiments, which often
involve large-scale computations to achieve the desired accuracy and turnaround time.
Weather modelling is necessary for short-range forecasts and for long-range hazard predictions,
such as flood, drought and environmental pollution.
Large computers are in great demand in the areas of econometrics, social engineering,
government census, crime control, and the modelling of the world’s economy for the year 2000.
Fast computers have been in high demand for solving many engineering design problems, such
as the finite-element analysis needed for structural designs and the wind tunnel experiments needed
for aerodynamic studies.
The design of dams, bridges, ships, supersonic jets, high buildings, and space vehicles requires
the resolution of a large system of algebraic equations.
B) Computational Aerodynamics
Large scale computers have made significant contributions in providing new technological
capabilities and economies in pressing ahead with aircraft and spacecraft lift and turbulence
studies.
Intelligent I/O interfaces are being demanded for future supercomputers that must directly
communicate with human beings in images, speech, and natural languages. The various
intelligent functions that demand parallel processing are:
Image Processing
Pattern Recognition
Computer Vision
Speech Understanding
Machine Interface
CAD/CAM/CAI/OA
Intelligent Robotics
Expert Computer Systems
Knowledge Engineering
Computer analysis of remotely sensed earth resource data has many potential applications in
agriculture, forestry, geology, and water resources.
Energy affects the progress of the entire economy on a global basis. Computers can play an
important role in the discovery of oil and gas and the management of their recovery, in the
development of workable plasma fusion energy, and in ensuring nuclear reactor safety.
A) Seismic Exploration
Many oil companies are investing in the use of attached array processors or vector
supercomputers for seismic data processing, which accounts for about 10 percent of oil-
finding costs. Seismic exploration sets off a sonic wave by explosive or by jamming a heavy
hydraulic ram into the ground; sensors spread about the spot are used to pick up the echoes.
B) Reservoir Modelling
Supercomputers are used to perform three-dimensional modelling of oil fields.
Nuclear fusion researchers want to use a computer 100 times more powerful than any existing one to
model the plasma dynamics in the proposed Tokamak fusion power generator.
Nuclear reactor design and safety control can both be aided by computer simulation studies.
These studies attempt to provide for:
Fast computers are needed in computer-assisted tomography, artificial heart design, liver
diagnosis, brain damage estimation, and genetic engineering studies. Military defence needs
supercomputers for weapon design, effects simulation, and other electronic warfare.
The human body can be modelled by computer assisted tomography (CAT) scanning.
B) Genetic Engineering
Many of the aforementioned application areas are related to basic scientific research.
The above are some of the applications of parallel processing; without supercomputers,
many of these challenges to the advance of human civilization could hardly be realized.
1. How can parallel processing be applied in engineering design and simulation? Give
examples.
2. How can parallel processing be applied in medicine and military research? Give examples.
Computational approaches are only limited by computer speed and memory capacity, while
physical experiments have many special practical constraints.
Various parallel processing applications have been outlined above.
5.6 References
UNIT – II
Contents:
The main objective of this lesson is to show, by example, how a simple job can be solved in
parallel in many different ways.
6.1 Introduction
The simple example given here is the correction of answer sheets by teachers; it explains
the concept of parallelism and how tasks are allocated to processors to get maximum
efficiency in solving problems in parallel. The term temporal means pertaining to time; a
method that breaks up a job into a set of tasks to be executed overlapped in time uses
temporal parallelism. In the case of data parallelism, the input data is divided into independent sets
and processed simultaneously.
The term temporal means pertaining to time. As an example, suppose 1000
candidates appear in an examination. There are answers to 4 questions in each answer book, and a
teacher has to correct the answers based upon the following instructions.
Instructions given to a teacher to correct an answer book:
Step 1 : Take an answer book from the pile of answer books.
Step 2 : Correct the answer to Q1 namely A1.
Step 3 : Repeat for Q2, Q3 and Q4 as A2, A3 and A4
Step 4 : Add marks given for each answer.
Step 5 : Put answer books in a pile of corrected answer books.
Step 6 : Repeat steps 1 to 5 until no more answer books are left in the input.
If a teacher takes 20 minutes to correct a paper, then 20,000 minutes will be taken to correct all
1000 papers. To speed up the correction, the following methods can be applied.
Four teachers can sit cooperatively to correct an answer book. The first teacher
corrects the answer to Q1 and passes the book to the second teacher, who corrects Q2 (the first
teacher immediately takes up the second paper) and passes it to the third teacher, who corrects Q3
and passes it to the fourth teacher, who corrects Q4. After the first 4 papers have entered this
pipeline, all the teachers are kept busy.
If the time taken to correct A1 = time taken to correct A2 = time taken to correct A3 =
time taken to correct A4 = 5 minutes, then a corrected paper emerges every 5 minutes once the
pipeline is full. The total time taken to correct 1000 papers will be 20 + (999 x 5) = 5015 minutes.
This is about 1/4th of the time taken by one teacher.
This method is called parallel processing, as 4 teachers work in parallel. The type of parallelism
used in this method is called temporal parallelism. The term temporal means pertaining to
time. As this method breaks up the job into a set of tasks to be executed overlapped in time, it is
said to use temporal parallelism. It is also known as assembly-line processing, pipeline processing
or vector processing.
If p is the time to correct one complete paper and k the number of pipeline stages (teachers), each
stage takes p/k minutes, and the time to correct n papers using the pipeline is

    p + (n - 1)p/k = p(k + n - 1)/k

A single teacher would take np minutes, so the

    speedup due to pipeline processing = np / [p(k + n - 1)/k] = k / (1 + (k - 1)/n)

which approaches k when n is much larger than k.
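A short calculation, using the figures of this example, confirms the numbers given by the formula above.

p, k, n = 20, 4, 1000          # minutes per paper, pipeline stages, number of papers

single = n * p                               # one teacher working alone
pipeline = p * (k + n - 1) / k               # k-teacher pipeline
speedup = k / (1 + (k - 1) / n)

print(single, pipeline, speedup)             # 20000, 5015.0, about 3.99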
The answer books are divided into four piles and each pile is given to a teacher. Assume
each teacher takes 20 minutes to correct an answer book; every teacher gets 250 answer
books, and the time taken to correct all 1000 papers is 5000 minutes. This type of parallelism
is called data parallelism, as the input data is divided into independent sets and processed
simultaneously.
P1 to P250 allotted to T1
P251 to P500 allotted to T2
P501 to P750 allotted to T3
P751 to P1000 allotted to T4
Assuming it takes q minutes to distribute a pile of answer books to each of the k teachers (kq minutes
in total), the

    speedup due to parallel processing = np / (np/k + kq) = k / (1 + k^2 q / (np))

The speedup is not directly proportional to the number of teachers, as the time to distribute jobs to
the teachers (which is an unavoidable overhead) increases as the number of teachers is increased.
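A similar sketch shows why the data-parallel speedup is not proportional to k: the distribution overhead
kq grows with the number of teachers while the useful work per teacher shrinks. The overhead q = 1 minute
per pile is a hypothetical figure chosen only for illustration.

p, n, q = 20, 1000, 1           # minutes per paper, papers, minutes to hand out one pile

for k in (2, 4, 8, 16, 32):     # number of teachers
    speedup = (n * p) / (n * p / k + k * q)
    print(k, round(speedup, 2))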
The previous two methods are combined to get this method. This method almost halves the
time taken by a single pipeline. Even though this method reduces the time to complete the set of
jobs, it also has the drawbacks of both temporal and data parallelism. This method is effective
only if the number of jobs given to each pipeline is much larger than the number of stages in the
pipeline.
In this method the head examiner gives one answer paper to each teacher and keeps the
rest with him. All teachers simultaneously correct the paper given to them. A teacher who
completes the correction goes to the head examiner for another paper, which is given to him for
correction. If second teacher completes then he queues up in front of the head examiner and
waits for his turn to get an answer sheet. This procedure is repeated until all the papers are
corrected.
It is difficult to increase the number of teachers, as that increases the probability of many
teachers completing their jobs simultaneously, thereby leading to long queues of teachers
waiting to get an answer paper.
Each teacher may be given a small bunch of papers for correction rather than one paper at a
time, so that less time is spent held up in the queue before the head examiner.
The assignment of jobs to teachers in method 3 is a static schedule: the assignment is done
initially and is not changed. Here the assignment is done dynamically; whoever corrects papers
faster will get more answer sheets.
There are many ways of solving a problem in parallel. The main advantages and disadvantages
with reference to each method have been discussed and this helps in choosing the method based
on the situation.
1. What are the methods available for achieving data parallelism? Elaborate.
2. What are the methods available for achieving temporal parallelism? Elaborate.
The term temporal means pertaining to time. A method that breaks up a job into a set of
tasks to be executed overlapped in time is said to use temporal parallelism. It is also known as
assembly-line processing, pipeline processing or vector processing.
In case of data parallelism, the input data is divided into independent sets and processed
simultaneously.
6.6 References
Lesson 7: Comparison of temporal and data parallel processing & Data parallel processing
with specialized processors
Contents:
The main objective of this lesson is to differentiate between temporal and data parallel
processing and to discuss a few more methods of data parallel processing with specialized
processors.
7.1 Introduction
In the case of temporal parallel processing, a job is divided into a set of tasks which are given
roughly equal time; bubbles in the stream of jobs lead to idling of processors, task assignment is
static, and the method is not tolerant to processor faults.
In the case of data parallelism, full jobs are assigned to each processor, the jobs may take different
times, no synchronization is needed, and the method tolerates processor faults.
Various other methods are also examined: in specialist data parallelism the head examiner despatches
answer sheets to specialist examiners; in coarse-grained temporal parallel processing the answer sheets
are divided and placed in each teacher's in-tray and out-tray; and in agenda parallelism the answer
book is treated as an agenda of questions.
Temporal parallelism                              Data parallelism
4. Task assignment is static                      4. Job assignment may be static, dynamic
                                                     or quasi-dynamic
5. Not tolerant to processor faults               5. Tolerates processor faults
6. Efficient with fine-grained tasks              6. Efficient with coarse-grained tasks and
                                                     quasi-dynamic scheduling
7. Scales well as long as the number of           7. Scales well as long as the number of
   data items to be processed is much                jobs is much greater than the number of
   larger than the number of processors              processors and the processing time is
   in the pipeline and the time taken to             much higher than the time to distribute
   communicate a task from one processor             data to the processors
   to the next is negligible
Data parallel processing is more fault-tolerant but requires each teacher to be capable of correcting
answers to all questions with equal ease.
Here all teachers work independently and simultaneously. Many teachers spend a lot of time
waiting for other teachers to complete their work.
Procedure 2.3 Coarse-grained specialist temporal processing
Answer papers are divided into 4 equal piles and put in the in-tray of each teacher. All four teachers
simultaneously repeat steps 1 to 5.
For teachers Ti (i = 1 to 4) do in parallel
(Figure: classification of parallel processing into specialist, agenda and general modes.)
A comparative study of data parallel processing and temporal parallel processing has been made,
and other methods of data parallel processing with specialized processors have been examined.
In method 6, a head examiner dispatches the answer sheets to the teachers.
In method 7, all teachers work independently and simultaneously; many teachers spend a lot of
time waiting for other teachers to complete their work.
In method 8, the answer book is thought of as an agenda of questions to be graded.
Besides the various issues, there are also constraints placed by the structure and interconnection
of computers in a parallel computing system. Thus picking a suitable method is also governed by
the architecture of the parallel computer using which a problem is solved.
7.6 References
Parallel Computer Architecture and programming – V. Rajaraman and C.Siva Ram Murthy
Contents:
The main aim of this lesson is to bring out important types of problems encountered in
parallel computing and to show how parallelism can be exploited in them.
8.1 Introduction
The problem of preparing Chinese vegetable fried rice from a recipe is taken, and a task graph is
drawn for the tasks involved. The tasks may be independent of one another, or may depend on
previous tasks. The task graph is shown, and a table allotting times to the tasks of each of a
number of cooks is also given.
In general, the tasks of a job are inter-related. Some tasks can be done simultaneously and
independently, while others have to wait for the completion of previous tasks. The inter-relation
of the various tasks of a job may be represented graphically as a task graph.
Circles represent tasks.
A line with an arrow connecting two circles shows a dependency between tasks.
The direction of the arrow shows precedence.
A task at the head end of an arrow can be done only after all tasks at the respective tail ends are
done.
T8 Drop carrots and French beans separately in boiling water and keep for 1 minute.
T9 Drain and cool carrots and French beans.
T10 Dice carrots.
T11 Dice French beans.
T12 Peel onion and dice into small pieces. Wash and chop spring onions
T13 Clean cauliflower. Cut into small pieces
T14 Heat oil in iron pan and fry diced onion and cauliflower for 1 minute in heated oil.
T15 Add diced carrots, French beans to above and fry for 2 minutes
T16 Add cooled rice, chopped spring onions and Soya sauce to the above and stir and
fry for 5 minutes
There are 16 tasks in cooking Chinese vegetable fried rice. Some of these tasks can be carried
out simultaneously and others have to be done in sequence. For instance T1, T2, T5,T6, T7, T12
and T13 can be done simultaneously whereas T8 cannot be done unless T5, T6,T7 are done.
Suppose this dish has to be prepared for 50 people and 4 cooks are available to do it. The task
assignments for the cooks must be such that they work simultaneously and stay synchronized.
Task T1 T2 T3 T4 T5 T6 T7 T8
Time 5 10 15 10 5 10 8 1
Task T9 T10 T11 T12 T13 T14 T15 T16
Time 10 10 10 15 12 4 2 5
Table 8.1 Time (in minutes) for each task
(Figure: task graph for preparing Chinese vegetable fried rice. Level 1 contains T1, T2, T5, T6, T7, T12
and T13; T3, T8 and T14 are at the next level; these are followed by T4 and T9, then T10 and T11, then
T15, and finally T16.)
Cook 1: T1 (5), T2 (10), T3 (15), T4 (10), T16 (5)
Cook 2: T5 (5), T6 (10), T8 (1), T9 (10), T10 (10), T15 (2), idle
Cook 3: T7 (8), T13 (12), T14 (4), idle
Cook 4: T12 (15), idle (11), T11 (10), idle
(Times in minutes.)
Step 1 : Find the tasks which can be carried out in parallel at level 1 and sum their total time. In
the above case the sum of the times of tasks 1, 2, 5, 6, 7, 12 and 13 is 65 minutes. There are 4 cooks,
so the work assigned to each cook should be about 65 / 4, roughly 16 minutes.
The assignment based on the logic is
Cook1 (T1,T2) (15 minutes)
Cook 2 (T5,T6) (15 minutes)
Cook 3 (T7, T13) ( 20 minutes)
Cook 4 (T12) (15 minutes)
Step 2 : Find the tasks that can be carried out at level 2. They are T3, T8, and T14. There are
4 cooks. T14 cannot start unless T12 and T13 are completed. Thus Cook 1 is allocated T3,
Cook 2 T8, Cook 3 T14, and Cook 4 no task. At the end of step 2 Cook 1 has
worked for 30 minutes.
Step 3 : The tasks which can be carried out in parallel are T4 and T9. T4 has to follow T3
and T9 has to follow T8. Cook 1 is allocated T4 and Cook 2 is allocated T9.
Step 4 : The next tasks to be allocated are T10 and T11, each taking 10 minutes. They follow the
completion of T9. The assignments are Cook 2 T10 and Cook 4 T11. Cook 4 cannot
start T11 till Cook 2 completes T9.
Step 5 : At the next level T15 can be allocated; it has to follow the completion of T10
and T11. It is allocated to Cook 2.
Step 6 : Only T16 is left, and it cannot start till T4, T15 and T14 are complete. T4 is the last to
finish, and T16 is allocated to Cook 2.
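The step-by-step assignment above is essentially a greedy, level-by-level schedule. The sketch below
reproduces it in Python using the task times of Table 8.1 and the precedence constraints stated in the
text (the dependence of T3 on T1 and T2 is inferred from the cook assignment); the scheduling rule
(longest ready task to the earliest-free cook) is only illustrative and need not give exactly the
schedule shown above.

times = {"T1": 5, "T2": 10, "T3": 15, "T4": 10, "T5": 5, "T6": 10, "T7": 8,
         "T8": 1, "T9": 10, "T10": 10, "T11": 10, "T12": 15, "T13": 12,
         "T14": 4, "T15": 2, "T16": 5}
deps = {"T3": ["T1", "T2"], "T4": ["T3"], "T8": ["T5", "T6", "T7"],
        "T9": ["T8"], "T10": ["T9"], "T11": ["T9"], "T14": ["T12", "T13"],
        "T15": ["T10", "T11", "T14"], "T16": ["T4", "T14", "T15"]}

cooks = [0.0] * 4                      # time at which each cook becomes free
finish = {}                            # completion time of each scheduled task
pending = set(times)

while pending:
    # tasks whose predecessors have all been scheduled
    ready = [t for t in pending if all(d in finish for d in deps.get(t, []))]
    ready.sort(key=lambda t: -times[t])          # longest ready task first
    for t in ready:
        c = min(range(4), key=lambda i: cooks[i])
        start = max(cooks[c], max((finish[d] for d in deps.get(t, [])), default=0))
        finish[t] = start + times[t]
        cooks[c] = finish[t]
        pending.remove(t)

print("all tasks done at", max(finish.values()), "minutes")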
This problem is an example of parallel computing; it brings out important issues encountered
when exploiting parallelism.
The lesson indicated the problems encountered in parallelising a job; the task graph depicted the
dependent and independent tasks. Tasks were assigned to the cooks using the task graph, and the
time required by each cook to carry out the assigned tasks was worked out.
8.6 References
Parallel Computer Architecture and programming – V. Rajaraman and C.Siva Ram Murthy
Contents:
The main aim of this lesson is to show how a number of instructions can be executed in parallel by
scheduling them suitably on a single processor.
9.1 Introduction
One of the important methods of increasing the speed of PEs is pipelined execution of
instructions. An architecture called SMAC2P is used to explain the instruction cycle, which
is broken into 5 steps: fetch, decode, execute, memory access and store register.
One of the important methods of increasing the speed of PEs is pipelined execution of
instructions.
Pipelining is an effective method of increasing the execution speed of processors
provided the following “ideal” conditions are satisfied:
1. It is possible to break an instruction into a number of independent parts, each part taking
nearly equal time to execute.
2. There is so called locality in instruction execution. If the instructions are not executed in
sequence but “jump around” due to many branch instructions, then pipelining is not
effective.
3. Successive instructions are such that the work done during the execution of one instruction
can be effectively used by the next and successive instructions. Successive instructions are
also independent of one another.
4. Sufficient resources are available in a processor so that if a resource is required by
successive instructions in the pipeline it is readily available.
In actual practice these “ideal” conditions are not always satisfied. The non-ideal situation arises
because of the following reasons:
1. It is not possible to break up an instruction execution into a number of parts each taking
exactly the same time. For ex., executing a floating point division will normally take
much longer than say, decoding an instruction. It is, however, possible to introduce
delays (if needed) so that different parts of an instruction take equal time.
2. All real programs have branch instructions which disturb the sequential execution of
instructions. Fortunately statistical studies show that the probability of encountering a
branch instruction is low; around 17%. Further it may be possible to predict when
branches will be taken a little ahead of time and take some pre-emptive action.
3. Successive instructions are not always independent. The results produced by an
instruction may be required by the next instruction and must be made available at the
right time.
4. There are always resource constraints in a processor, as the chip size is limited. It will not
be cost-effective to duplicate some resources indiscriminately. For ex., it may not be
useful to have more than two floating-point arithmetic units.
The challenge is thus to make pipelining work under these non-ideal conditions.
The computer is a Reduced Instruction Set Computer (RISC) which is designed to facilitate
pipelined instruction execution. The computer is similar to SMAC2. It has the following units.
(Figure: instruction fetch data path. The PC is copied into IMAR, the instruction memory is read, and
the instruction is placed in IR.)
A data cache (or memory) and an instruction cache (or memory). It is assumed that the
instructions to be executed are stored in the instruction memory and the data to be read or
written are stored in the data memory. These two memories have their own address and data
registers (IMAR - memory address register of the instruction memory, DMAR - memory
address register of the data memory, IR - data register of the instruction memory, MDR - data
register of the data memory).
A Program Counter (PC) which contains the address of the next instruction to be executed.
The machine is word addressable. Each word is 32 bits long.
A register file with 32 registers. These are general purpose registers used to store operands
and index values.
The instructions are all of equal length and the relative positions of the operands are fixed. There
are 3 instruction types, as shown in the figure.
The only instructions which access memory are load and store instructions. A load
instruction reads a word from the data memory and stores it in a specified register in the register
file, and a store instruction stores the contents of a register in the data memory. This is called a
load-store architecture.
An ALU which carries out one integer arithmetic or one logic operation in one clock cycle.
Solid lines indicate Data paths and dashed lines indicate control signal.
IMAR ← PC
IR ← IMEM[IMAR]
NAR ← PC + 1
All the above operations are carried out in one clock cycle.
The instruction is decoded and the operand fields are used to read the values of the specified
registers. The operations carried out during this step are shown below:
B1 ← Reg[IR21..25]
B2 ← Reg[IR16..20]
IMM ← IR0..15 (M-type instruction)
All these operations are carried out in one clock cycle. Observe that the outputs from the register
file are stored in two buffer registers B1 and B2.
(Figure: decode and register-read data path. Fields of IR address the register file, whose outputs are
latched in the buffer registers B1 and B2, while the immediate field of IR is latched in IMM.)
The ALU operations are carried out. The instructions which use the ALU may be classified as
shown below:
Two registers are operands:
B3 ← B1 <operation> B2
where the operation is ADD, SUB, MUL or DIV and B3 is the buffer register where the ALU
output is stored.
Increment/decrement (INC/DEC/BCT):
B3 ← B1 + 1 (increment)
B3 ← B1 - 1 (decrement or BCT)
Branch on equal (JEQ):
B3 ← B1 - B2
If B3 = 0, set the zero flag in the status register to 1.
Effective address calculation (for LD/ST):
For load and store instructions the effective address is stored in B3:
B3 ← B2 + IMM
(Figure: ALU execution step data path. Multiplexers controlled by the control unit select the ALU
inputs, and the ALU result is stored in the buffer register B3.)
For branch instructions the address of the next instruction to be executed must be found.
The four branch instructions in our computer are JMP, JMI, JEQ and BCT
Figure 9.4 Data flow for the load/store and next-address step (a multiplexer controlled by the branch
condition selects the next instruction address, taken from B1, IR0..25, IMM or NAR, to be loaded into
the PC)
Step 5: Store register (SR)
The result of an ALU operation, a load immediate, or a load from memory instruction is stored back
in the specified register in the register file. The 3 cases are:
The conditional branch instructions JEQ, JMI and BCT require 3 clock cycles. The instructions INC,
DEC and LDI do not need a memory access cycle and can complete in 3 cycles.
The proportion of these instructions in actual programs has to be found statistically, by
executing a large number of programs and counting the number of such instructions.
(Figure: store register step data path. A multiplexer controlled by the control unit selects B3, MDR or
IMM as the value to be written back into the register file.)
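The overlap of the five SMAC2P steps can be visualized with a small simulation. This sketch is purely
illustrative: it assumes every instruction passes through all five stages, ignores hazards, and uses
abbreviated stage names (FI, DE, EX, MEM, SR) that are not taken from the text.

stages = ["FI", "DE", "EX", "MEM", "SR"]   # fetch, decode, execute, memory access, store register
n_instr = 6

print("cycle " + " ".join(f"I{i+1:<3}" for i in range(n_instr)))
for cycle in range(1, n_instr + len(stages)):
    row = []
    for i in range(n_instr):
        s = cycle - 1 - i                  # stage occupied by instruction i in this cycle
        row.append(f"{stages[s]:<4}" if 0 <= s < len(stages) else "    ")
    print(f"{cycle:>5} " + " ".join(row))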
9.6 References
Parallel Computer Architecture and programming – V. Rajaraman and C.Siva Ram Murthy
Contents:
The main objective of this lesson is to study the various pipeline hazards and the preventive
measures against them.
10.1 Introduction
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock
cycle.
– Structural hazards: the hardware cannot support this combination of instructions (a single person to
fold and put clothes away)
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (a
missing sock)
– Control hazards: pipelining of branches and other instructions that change the PC
– A common solution is to stall the pipeline until the hazard is resolved, inserting one or more
"bubbles" in the pipeline
Delays in the pipelined execution of instructions due to non-ideal conditions are called pipeline
hazards.
The non-ideal conditions are:
Available resources in a processor are limited.
Successive instructions are not independent of one another.
All programs have branches and loops.
Each of the above conditions causes delays in the pipeline. The resulting hazards can be classified as
follows.
Structural Hazards. They arise from resource conflicts when the hardware cannot support all
possible combinations of instructions in simultaneous overlapped execution.
Data Hazards. They arise when an instruction depends on the result of a previous instruction in
a way that is exposed by the overlapping of instructions in the pipeline.
Control Hazards. They arise from the pipelining of branches and other instructions that change
the PC.
Structural hazards arise when:
Some functional unit is not fully pipelined, so that a sequence of instructions using that
unpipelined unit cannot proceed at the rate of one per clock cycle.
Some resource has not been duplicated enough to allow all combinations of instructions
in the pipeline to execute.
Example1:
A machine may have only one register-file write port, but in some cases the pipeline might want
to perform two writes in a clock cycle.
Example2:
A machine has a shared single-memory pipeline for data and instructions. As a result, when an
instruction contains a data-memory reference (a load), it will conflict with the instruction fetch of
a later instruction (instr 3):
To resolve this, we stall the pipeline for one clock cycle when the data-memory access occurs. The
effect of the stall is to occupy the resources for that instruction slot. The following table
shows how the stall is actually implemented.
Instr      1    2    3    4       5       6       7       8       9
Load       IF   ID   EX   MEM     WB
Instr 1         IF   ID   EX      MEM     WB
Instr 2              IF   ID      EX      MEM     WB
Stall                     bubble  bubble  bubble  bubble  bubble
Instr 3                           IF      ID      EX      MEM     WB
A major effect of pipelining is to change the relative timing of instructions by overlapping their
execution. This introduces data and control hazards. Data hazards occur when the pipeline
changes the order of read/write accesses to operands so that the order differs from the order seen
by sequentially executing instructions on the unpipelined machine.
                    1    2      3      4      5      6      7      8      9
ADD R1, R2, R3      IF   ID     EX     MEM    WB
SUB R4, R5, R1           IF     IDsub  EX     MEM    WB
AND R6, R1, R7                  IF     IDand  EX     MEM    WB
OR  R8, R1, R9                         IF     IDor   EX     MEM    WB
XOR R10, R1, R11                              IF     IDxor  EX     MEM    WB
All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD
instruction writes the value of R1 in the WB stage, and the SUB instruction reads
the value during its ID stage (IDsub). This problem is called a data hazard. Unless precautions are
taken to prevent it, the SUB instruction will read the wrong value and try to use it.
The AND instruction is also affected by this data hazard. The write of R1 does not complete until
the end of cycle 5. Thus, the AND instruction, which reads the registers during cycle
4 (IDand), will receive the wrong result.
The XOR instruction operates properly, because its register read occurs in cycle 6, after the
register write by ADD.
The next page discusses forwarding, a technique to eliminate the stalls for the hazard involving
the SUB and AND instructions.
We will also classify the data hazards and consider the cases when stalls cannot be eliminated.
We will see what the compiler can do to schedule the pipeline to avoid stalls.
The data hazards introduced by this sequence of instructions can be solved with a
simple hardware technique called forwarding.
                    1    2      3      4      5      6      7
ADD R1, R2, R3      IF   ID     EX     MEM    WB
SUB R4, R5, R1           IF     IDsub  EX     MEM    WB
AND R6, R1, R7                  IF     IDand  EX     MEM    WB
The key insight in forwarding is that the result is not really needed by SUB until after the ADD
actually produces it. The only problem is to make it available for SUB when it needs it.
If the result can be moved from where the ADD produces it (EX/MEM register), to where the
SUB needs it (ALU input latch), then the need for a stall can be avoided.
Using this observation, forwarding works as follows:
The ALU result from the EX/MEM register is always fed back to the ALU input latches. If the
forwarding hardware detects that the previous ALU operation has written the register
corresponding to a source of the current ALU operation, control logic selects the forwarded
result as the ALU input rather than the value read from the register file.
Forwarding of results to the ALU requires the addition of three extra inputs on each ALU
multiplexer and the addition of three paths to the new inputs.
                    1    2      3      4      5      6      7      8      9
ADD R1, R2, R3      IF   ID     EX     MEM    WB
SUB R4, R5, R1           IF     stall  stall  IDsub  EX     MEM    WB
AND R6, R1, R7                  stall  stall  IF     IDand  EX     MEM    WB
As our example shows, we need to forward results not only from the immediately previous
instruction, but possibly from an instruction that started three cycles earlier. Forwarding can be
arranged from MEM/WB latch to ALU input also. Using those forwarding paths the code
sequence can be executed without stalls:
                    1    2      3      4      5      6      7
ADD R1, R2, R3      IF   ID     EXadd  MEMadd WB
SUB R4, R5, R1           IF     ID     EXsub  MEM    WB
AND R6, R1, R7                  IF     ID     EXand  MEM    WB
Forwarding can be generalized to include passing the result directly to the functional unit that
requires it: a result is forwarded from the output of one unit to the input of another, rather than
just from the result of a unit to the input of the same unit.
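The forwarding decision can be sketched in a few lines of Python. This is an illustrative fragment, not
the DLX control logic itself: for each source register of each instruction it reports whether the value
would come from the EX/MEM latch, the MEM/WB latch, or the register file.

# Each instruction: (destination, source1, source2) - the example used in the text
program = [("R1", "R2", "R3"),   # ADD R1, R2, R3
           ("R4", "R5", "R1"),   # SUB R4, R5, R1
           ("R6", "R1", "R7")]   # AND R6, R1, R7

for i, (dst, s1, s2) in enumerate(program):
    for src in (s1, s2):
        if i >= 1 and program[i - 1][0] == src:      # produced by the previous instruction
            print(f"instr {i} ({dst}): {src} forwarded from the EX/MEM latch")
        elif i >= 2 and program[i - 2][0] == src:    # produced two instructions back
            print(f"instr {i} ({dst}): {src} forwarded from the MEM/WB latch")
        else:
            print(f"instr {i} ({dst}): {src} read from the register file")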
Control hazards can cause a greater performance loss for the DLX pipeline than data hazards.
When a branch is executed, it may or may not change the PC (program counter) to something
other than its current value plus 4. If a branch changes the PC to its target address, it is a taken
branch; if it falls through, it is not taken.
If instruction i is a taken branch, then the PC is normally not changed until the end of MEM
stage, after the completion of the address calculation and comparison (see diagram).
The simplest method of dealing with branches is to stall the pipeline as soon as the branch is
detected until we reach the MEM stage, which determines the new PC. The pipeline behavior
looks like :
Branch IF ID EX MEM WB
There are many methods to deal with the pipeline stalls caused by branch delay. We discuss four
simple compile-time schemes in which predictions are static - they are fixed for each branch
during the entire execution, and the predictions are compile-time guesses.
Stall pipeline
Predict taken
Predict not taken
Delayed branch
Stall pipeline
The simplest scheme to handle branches is to freeze or flush the pipeline, holding or deleting any
instructions after the branch until the branch destination is known.
Advantage: simple for both software and hardware (this is the solution described earlier).
Predict not taken
A higher-performance, and only slightly more complex, scheme is to predict the branch as not
taken, simply allowing the hardware to continue as if the branch were not executed. Care must be
taken not to change the machine state until the branch outcome is definitely known.
The complexity arises from:
we have to know when the state might be changed by an instruction;
we have to know how to "back out" a change.
The pipeline with this scheme implemented behaves as shown below:
Untaken branch instr   IF   ID    EX    MEM   WB
Instr i+1                   IF    ID    EX    MEM   WB
Instr i+2                         IF    ID    EX    MEM   WB

Taken branch instr     IF   ID    EX    MEM   WB
Instr i+1                   IF    idle  idle  idle  idle
Branch target                     IF    ID    EX    MEM   WB
Branch target+1                         IF    ID    EX    MEM   WB
When the branch is not taken, as determined during ID, we have already fetched the fall-through
instruction and just continue. If the branch is taken, as determined during ID, we restart the fetch at
the branch target. This causes all instructions following the branch to stall one clock cycle.
Predict Taken
An alternative scheme is to predict the branch as taken. As soon as the branch is decoded and the
target address is computed, we assume the branch to be taken and begin fetching and executing
at the target address.
Because in the DLX pipeline the target address is not known any earlier than the branch outcome,
there is no advantage in this approach. In some machines where the target address is known
before the branch outcome, a predict-taken scheme might make sense.
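The cost of these static schemes can be estimated with a short calculation. The sketch below assumes the
17% branch frequency quoted earlier and hypothetical penalty figures (3 stall cycles when the branch is
resolved in MEM, 1 cycle when it is resolved in ID, and an assumed 60% of branches taken); the exact
numbers are for illustration only.

branch_freq = 0.17        # fraction of instructions that are branches (from the text)
taken_frac = 0.6          # fraction of branches that are taken (hypothetical)

schemes = {
    "stall until MEM":   3.0,                  # every branch pays 3 stall cycles
    "predict not taken": taken_frac * 1.0,     # 1-cycle penalty only when the branch is taken
    "predict taken":     1.0,                  # target unknown before the outcome, so 1 cycle here
}

for name, penalty_per_branch in schemes.items():
    cpi = 1.0 + branch_freq * penalty_per_branch
    print(f"{name:18s} effective CPI = {cpi:.3f}")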
Delayed Branch
In the delayed branch scheme the instruction slot immediately following the branch (the branch delay
slot) is always executed, and the compiler tries to fill this slot with an instruction that is useful
whether or not the branch is taken.
Another difficulty which makes pipeline processing challenging is the interruption of the
normal flow of a program by events such as illegal instruction codes, page faults and I/O
calls. These are called exception conditions.
A list of exception conditions can be drawn up, indicating for each whether, after the exception,
normal processing can be resumed or not.
The problem of restarting computation is complicated by the fact that several instructions will be
in various stages of completion in the pipeline. If the pipeline processing can be stopped when an
exception condition is detected in such a way that all instructions which occur before the one
causing the exception are completed and all instructions which were in progress at the instant
exception occurred can be restarted from the beginning, the pipeline is said to have precise
exceptions.
This lesson discussed the various delays in pipeline execution, the different hazards (structural,
data and control hazards), and the prevention mechanisms for these hazards.
1. What delays are encountered in pipeline execution? How can they be overcome?
2. Discuss the difficulties involved in pipelining. Also, explain the prevention measures.
Structural Hazard
Data Hazard
Control Hazard
Branch prediction Buffer
10.6 References
Materials from Net: Shankar Balachandran, Dept. of Computer Science and Engineering, IIT
Madras, [email protected]
UNIT – III
Contents:
The main objective of this lesson is to know the basic properties of pipelining, the
classification of pipeline processors and the required memory support.
11.1 Introduction
A pipeline is similar to an assembly line in an industrial plant. To achieve pipelining one must
divide the input process into a sequence of subtasks, each of which can be executed
concurrently with the other stages. The various classes of pipeline processors (arithmetic
pipelines, instruction pipelines and processor pipelines) are also briefly
discussed.
11.2 Pipelining
Types of Pipelines
Linear Pipelines
Non-linear Pipelines
Single Function Pipelines
Multifunctional Pipelines
» Static
» Dynamic
Let T be a task which can be partitioned into k subtasks according to the linear precedence
relation
T = {T1, T2, .........., Tk}
i.e., a subtask Tj cannot start until all Ti with i < j are finished. This can be modelled with a
linear precedence graph.
A linear pipeline (with no feedback) can always be constructed to process a succession of tasks
with a linear precedence graph.
Space-Time Diagrams
Consider a four stage linear pipeline processor and a sequence of tasks.
T1, T2,...
where each task has 4 subtasks (the first subscript is for the task, the second for the subtask) as
follows:
T1 => {T11, T12, T13, T14}
T2 => {T21, T22, T23, T24}
…
Tn => {Tn1, Tn2, Tn3, Tn4}
A space-time diagram can be constructed to illustrate the overlapped execution of the tasks, as
follows.
Speedup: the speedup Sk of a k-stage linear pipeline processor (over an equivalent non-pipelined
processor) is given by

    Sk = T1 / Tk = n k / (k + n - 1)

With a non-pipelined processor each task takes k clocks, thus for n tasks T1 = n k clocks.
With a pipelined processor we need k clocks to fill the pipe and generate the first result, and
n - 1 further clocks to generate the remaining n - 1 results, so Tk = k + (n - 1) clocks.
Efficiency E is the ratio of the busy time span to the overall time span (note: E is easy to
read off the space-time diagram).
Note that:
Overall time span <=> allocated computational power
Busy time span <=> utilized computational power
Thus E <=> pipeline utilization, and for a k-stage linear pipeline processing n tasks

    E = Sk / k = n / (k + n - 1)
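These relations can be checked with a short sketch for a k-stage linear pipeline processing n tasks, one
task entering the pipe per clock once it is full.

def pipeline_metrics(k, n):
    t1 = n * k                 # clocks on an equivalent non-pipelined processor
    tk = k + (n - 1)           # clocks on the k-stage pipeline
    speedup = t1 / tk          # Sk = n k / (k + n - 1)
    efficiency = speedup / k   # E  = n / (k + n - 1)
    return speedup, efficiency

for n in (4, 16, 64, 1024):
    s, e = pipeline_metrics(4, n)
    print(f"n = {n:5d}  Sk = {s:5.2f}  E = {e:4.2f}")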
To illustrate the operating principles of pipeline computation, the design of a pipelined floating-
point adder is given. It is constructed in four stages. The inputs are

    A = a x 2^p
    B = b x 2^q

where a and b are two fractions and p and q are their exponents; base 2 is assumed here.
To compute the sum

    C = A + B = c x 2^r = d x 2^s

the operations performed in the four pipeline stages are specified below.
1. Compare the 2 exponents p and q to reveal the larger exponent r =max(p,q) and to
determine their difference t =p-q
2. Shift right the fraction associated with the smaller exponent by t bits to equalize the
two components before fraction addition.
3. Add the preshifted fraction with the other fraction to produce the intermediate sum
fraction c where 0 <= c <1.
4. Count the number of leading zeroes, say u, in fraction c and shift left c by u bits to
produce the normalized fraction sum d = c x 2^u, with a leading bit 1. Update the larger
exponent by s = r - u to produce the output exponent.
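The data flow through the four stages can be sketched sequentially in Python as below. This is only an illustration of the stage operations (in the actual pipeline each stage works on a different operand pair every clock), and the helper name is hypothetical:

    def pipelined_fp_add(a, p, b, q):
        """Four-stage floating-point addition of A = a*2**p and B = b*2**q,
        where a and b are fractions and base 2 is assumed."""
        # Stage 1: compare exponents, r = max(p, q), t = |p - q|
        r, t = max(p, q), abs(p - q)
        # Stage 2: shift the fraction with the smaller exponent right by t bits
        if p < q:
            a = a / (2 ** t)
        else:
            b = b / (2 ** t)
        # Stage 3: add the aligned fractions (assumes 0 <= c < 1 as in step 3 above)
        c = a + b
        # Stage 4: count leading zeros (u) and normalize: d = c * 2**u, s = r - u
        u, d = 0, c
        while d != 0 and abs(d) < 0.5:      # normalized fraction has leading bit 1
            d *= 2
            u += 1
        s = r - u
        return d, s                          # result = d * 2**s

    print(pipelined_fp_add(0.5, 3, 0.75, 1))   # 4.0 + 1.5 -> (0.6875, 3), i.e. 5.5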
Arithmetic Pipelining
The arithmetic and logic units of a computer can be segmented for pipeline operations in
various data formats. Well known arithmetic pipeline examples are
Star 100
The eight stage pipes used in TI-ASC
Instruction Pipelining
The execution of a stream of instructions can be pipelined by overlapping the execution of the
current instruction with fetch, decode and operand fetch of subsequent instructions. This
technique is known as instruction look-ahead.
Processor pipelining
This refers to the pipelined processing of the same data stream by a cascade of processors, each of
which processes a specific task. The data stream passes the first processor, with results stored in a
memory block which is also accessible by the second processor. The second processor then
passes the refined results to the third, and so on.
A pipeline with a fixed and dedicated function, such as a floating-point adder, is called a unifunctional
pipeline. E.g.: Cray-1
A multifunction pipe may perform different functions, either at different times or at the same
time, by interconnecting different subsets of stages in the pipeline.
Eg : TI-ASC
A scalar pipeline processes a sequence of scalar operands under the control of a DO loop.
A vector pipeline is designed to handle vector instructions over vector operands.
The basics of pipelining have been discussed, such as the structure of a linear pipeline processor and the space-
time diagram of a linear pipeline processor for overlapped processing of multiple tasks. Four
pipeline stages have been explained with a pipelined floating point adder. Various classification
schemes for pipeline processors have been explained.
11.6 References
Lesson 12 : General Pipeline and Reservation Tables, Arithmetic Pipeline Design Examples
Contents:
The main objective of this lesson is to learn about reservation tables and how successive pipeline
stages are utilized for a specific evaluation function.
12.1 Introduction
The interconnection structures and data flow patterns in general pipelines are
characterized by either feedforward or feedback connections, in addition to the cascaded
connections of a linear pipeline. A 2D chart known as a reservation table shows how the
successive pipeline stages are utilized for a specific function evaluation in successive pipeline
cycles. Multiplication of two numbers is done by repeated addition and shift operations.
Reservation tables show how successive pipeline stages are utilized for a specific
function evaluation.
Consider an example of a pipeline structure with both feedforward and feedback
connections. The pipeline is dual functional, denoted as function A and function B. The pipeline
stages are numbered S1, S2 and S3. A feedforward connection connects a stage Si to a stage
Sj such that j ≥ i + 2, and a feedback connection connects a stage Si to a stage Sj such that j <= i.
Reservation table for function A (stages S1-S3 against clock times t0-t7; evaluation time = 8):
stage S1 is marked in three time units, S2 in two and S3 in three.
Reservation table for function B (stages S1-S3 against clock times t0-t6; evaluation time = 7):
stages S1 and S2 are each marked in two time units and S3 in three.
(Figure: the sample pipeline, with input to stage S1, feedback connections among S1, S2 and S3, and separate outputs for function A and function B.)
The two tables correspond to the two functions of the sample pipeline. The rows correspond to pipeline
stages and the columns to clock time units. The total number of clock units in the table is called
the evaluation time. A reservation table represents the flow of data through the pipeline for one
complete evaluation of a given function. A marked entry in the (i,j)th square of the table
indicates that stage Si will be used j time units after initiation of the function evaluation.
The multiplication of two fixed point numbers is done by repeated add-shift operations, using an ALU
which has built-in add and shift functions. Multiple number additions can be realized with a
multilevel tree adder. The conventional carry propagation adder (CPA) adds two input numbers, say
A and B, to produce one output number called the sum A+B. A carry save adder (CSA) receives
three input numbers, say A, B and D, and produces two output numbers, the sum vector S and the carry vector C.
A = 1 1 1 1 0 1
B = 0 1 0 1 1 0
D = 1 1 0 1 1 1
C = 1 1 0 1 1 1
S = 0 1 1 1 0 0
A+B+D = S + (C shifted left one bit) = 1 0 0 0 1 0 1 0
A CPA can be implemented with a cascade of full adders with the carry-out of a lower
stage connected to the carry-in of a higher stage. A carry-save adder can be implemented with a
set of full adders with all the carry-in terminals serving as the input lines for the third input
number D, and all the carry-out terminals serving as the output lines for the carry vector C.
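A minimal Python sketch of carry-save addition, using the example operands above: the bitwise XOR of the three inputs gives the sum vector and the bitwise majority gives the carry vector.

    def carry_save_add(a: int, b: int, d: int):
        """Carry-save addition of three numbers: returns the sum vector S
        (bitwise XOR) and the carry vector C (bitwise majority). The true
        sum is S + (C shifted left by one bit)."""
        s = a ^ b ^ d                          # sum bit of each full adder
        c = (a & b) | (a & d) | (b & d)        # carry bit of each full adder
        return s, c

    a, b, d = 0b111101, 0b010110, 0b110111    # the example operands above
    s, c = carry_save_add(a, b, d)
    print(bin(s), bin(c), bin(s + (c << 1)))  # final CPA step: S + 2*C = A+B+D
    assert s + (c << 1) == a + b + d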
This pipeline is designed to multiply two 6-bit numbers. There are five pipeline stages.
The first stage is for the generation of all 6 x 6 = 36 immediate product terms, which form the
six rows of shifted multiplicands. The six numbers are then fed into two CSAs in the second
stage. In total four CSAs are interconnected to form a three-level CSA tree which merges the six numbers into two
numbers: the sum vector S and the carry vector C. The final stage is a CPA, which adds the two
numbers C and S to produce the final output, the product A x B.
(Figure: the five-stage pipelined multiplier. Stage S1 is the shifted multiplicand generator producing W1-W6; stages S2-S4 contain the four CSAs, each producing a carry vector C and a sum vector S; stage S5 is the CPA producing P = A x B.)
Many interesting pipeline utilization patterns can be revealed by the reservation table. It is possible to
have multiple marks in a row or in a column. A CSA (carry save adder) is used to perform
multiple number additions.
1. Give the reservation tables and sample pipeline for any two functions.
2. With example, discuss carry propagation adder (CPA) and carry save adder (CSA).
The conventional carry propagation adder (CPA) adds two input numbers and produces an
output number called the sum.
A carry save adder (CSA) receives three input numbers A, B, D and outputs two numbers,
the sum vector and the carry vector.
12.6 References
Lesson 13 : Data Buffering and Busing Structure, Internal Forwarding and Register
Tagging, Hazard Detection and Resolution, Job Sequencing and Collision Prevention.
Contents:
The objective of this lesson is to become familiar with busing structures, register tagging,
the various pipeline hazards and their preventive measures, and job sequencing and collision
prevention.
13.1 Introduction
Buffers are used to close up the speed gap between memory accesses for either
instructions or operands. Buffering can avoid unnecessary idling of the processing stages caused
by memory access conflicts or by unexpected branching or interrupts. The concept of busing is
discussed, which eliminates the time delay to store and to retrieve intermediate results to or from
the registers.
The computer performance can be greatly enhanced if one can eliminate unnecessary
memory accesses and combine transitive or multiple fetch-store operations with faster register
operations. This is carried out by register tagging and forwarding. A pipeline hazard refers to a
situation in which a correct program ceases to work correctly due to implementing the processor
with a pipeline.
There are three fundamental types of hazard:
Data hazards,
Branch hazards, and
Structural hazards.
Another method to smooth the traffic flow in a pipeline is to use buffers to close up the
speed gap between the memory accesses for either instructions or operands and the arithmetic and
logic executions in the functional pipes. The instruction or operand buffers provide a continuous
supply of instructions or operands to the appropriate pipeline units. Buffering can avoid
unnecessary idling of the processing stages caused by memory access conflicts or by unexpected
branching or interrupts. Sometimes an entire loop of instructions can be stored in the buffer to
avoid repeated fetches of the same instruction loop, if the buffer size is sufficiently large. Buffering
is used heavily in pipeline computers.
Three buffer types are used for various instruction and data types. Instructions are fetched into the
instruction fetch buffer before being sent to the instruction unit. After decoding, fixed point
and floating point instructions and data are sent to their dedicated buffers. The store address and
data buffers are used for continuously storing results back to memory. The storage conflict buffer
is used only when memory access conflicts occur.
Figure 13.1 Data Buffers, transfer paths, reservation stations and common data bus
Busing Buffers
The sub function being executed by one stage should be independent of the other sub
functions being executed by the remaining stages; otherwise some process in the pipeline must
be halted until the dependency is removed. When an instruction waiting to be executed must first
be modified by a future instruction, the execution of this instruction must be suspended until
the dependency is released.
Another example is the conflicting use of some registers or memory locations by
different segments of a pipeline. These problems cause additional time delays. An efficient
internal busing structure is desired to route results to the requesting stations with minimum time delays.
In the AP 120B or FPS 164 attached processor the busing structure is even more
sophisticated. Seven data buses provide multiple data paths. The output of the floating point
adder in the AP 120B can be directly routed back to the input of the floating point adder, to the
input of the floating point multiplier, to the data pad, or to the data memory. Similar busing is
provided for the output of the floating point multiplier. This eliminates the time delay to store
and to retrieve intermediate results to or from the registers.
Store-Fetch Forwarding
The store-then-fetch sequence can be replaced by two parallel operations, one store and one register transfer.
Two memory accesses:
Mi ← (R1) (store)
R2 ← (Mi) (fetch)
are replaced by
only one memory access:
Mi ← (R1) (store)
R2 ← (R1) (register transfer)
Fetch-Fetch Forwarding
The following fetch operations can be replaced by one fetch and one register transfer. One
memory access has been eliminated.
Two memory accesses:
R1 ← (Mi) (fetch)
R2 ← (Mi) (fetch)
are replaced by
only one memory access:
R1 ← (Mi) (fetch)
R2 ← (R1) (register transfer)
(Figure: internal forwarding examples showing the memory word Mi and registers R1 and R2 before and after store-fetch, fetch-fetch and store-store forwarding.)
Store-Store Overwriting
The following two memory updates of the same word can be combined into one, since the
second store overwrites the first.
Two memory accesses:
Mi ← (R1) (store)
Mi ← (R2) (store)
are replaced by
only one memory access:
Mi ← (R2) (store)
The above steps show how to apply internal forwarding to simplify a sequence of arithmetic and
memory access operations.
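As an illustration only, the three forwarding rules can be sketched as a rewriting pass over a list of register-transfer operations; the tuple encoding used here is purely hypothetical and not part of any real machine description:

    # Operations are written as ("store", Mi, Rj), ("fetch", Rj, Mi) or
    # ("move", Rj, Rk); the rules below rewrite adjacent pairs.
    def forward(ops):
        out = []
        for op in ops:
            if out:
                kind, dst, src = op
                pkind, pdst, psrc = out[-1]
                if pkind == "store" and kind == "fetch" and src == pdst:
                    op = ("move", dst, psrc)          # store-fetch forwarding
                elif pkind == "fetch" and kind == "fetch" and src == psrc:
                    op = ("move", dst, pdst)          # fetch-fetch forwarding
                elif pkind == "store" and kind == "store" and dst == pdst:
                    out.pop()                         # store-store overwriting
            out.append(op)
        return out

    print(forward([("store", "Mi", "R1"), ("fetch", "R2", "Mi")]))
    # [('store', 'Mi', 'R1'), ('move', 'R2', 'R1')]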
Defining hazards
The next issue in pipelining is hazards. A pipeline hazard refers to a situation in which a correct
program ceases to work correctly due to implementing the processor with a pipeline.
There are three fundamental types of hazard:
Data hazards,
Branch hazards, and
Structural hazards.
Structural Hazards
These occur when a single piece of hardware is used in more than one stage of the pipeline, so
it's possible for two instructions to need it at the same time.
So, for instance, suppose we'd only used a single memory unit instead of separate instruction
memory and data memories. A simple (non-pipelined) implementation would work equally well
with either approach, but in a pipelined implementation we'd run into trouble any time we
wanted to fetch an instruction at the same time a lw or sw was reading or writing its data.
In effect, the pipeline design we're starting from has anticipated and resolved this hazard by
adding extra hardware.
Interestingly, the earlier editions of our text used a simple implementation with only a single
memory, and separated it into an instruction memory and a data memory when they introduced
pipelining. This edition starts right off with the two memories.
Also, the first Sparc implementations (remember, Sparc is almost exactly the RISC machine
defined by one of the authors) did have exactly this hazard, with the result that load instructions
took an extra cycle and store instructions took two extra cycles.
Data Hazards
This is when reads and writes of data occur in a different order in the pipeline than in the
program code. There are three different types of data hazard (named according to the order of
operations that must be maintained):
RAW
A Read After Write hazard occurs when, in the code as written, one instruction reads a location
after an earlier instruction writes new data to it, but in the pipeline the write occurs after the read
(so the instruction doing the read gets stale data).
WAR
A Write After Read hazard is the reverse of a RAW: in the code a write occurs after a read, but
the pipeline causes write to happen first.
WAW
A Write After Write hazard is a situation in which two writes occur out of order. We normally
only consider it a WAW hazard when there is no read in between; if there is, then we have a
RAW and/or WAR hazard to resolve, and by the time we've gotten that straightened out the
WAW has likely taken care of itself.
(the text defines data hazards, but doesn't mention the further subdivision into RAW, WAR, and
WAW. Their graduate level text mentions those)
Control Hazards
This is when a decision needs to be made, but the information needed to make the decision is not
available yet. A Control Hazard is actually the same thing as a RAW data hazard (see above), but
is considered separately because different techniques can be employed to resolve it - in effect,
we'll make it less important by trying to make good guesses as to what the decision is going to
be.
Two notes: First, there is no such thing as a RAR hazard, since it doesn't matter if reads occur
out of order. Second, in the MIPS pipeline, the only hazards possible are branch hazards and
RAW data hazards.
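A small Python sketch, illustrative only, that classifies the hazard between two instructions appearing in program order, given their register read and write sets:

    def classify_hazard(first_writes, first_reads, second_writes, second_reads):
        """Classify the data hazard between two instructions in program order
        (first, second), given their register read/write sets."""
        hazards = []
        if first_writes & second_reads:
            hazards.append("RAW")   # second reads what first writes
        if first_reads & second_writes:
            hazards.append("WAR")   # second overwrites what first still reads
        if first_writes & second_writes:
            hazards.append("WAW")   # both write the same location
        return hazards or ["none (RAR is not a hazard)"]

    # add r1, r2, r3  followed by  sub r4, r1, r5  ->  RAW on r1
    print(classify_hazard({"r1"}, {"r2", "r3"}, {"r4"}, {"r1", "r5"}))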
Resolving Hazards
There are four possible techniques for resolving a hazard. In order of preference, they are:
Forward. If the data is available somewhere, but is just not where we want it, we can create
extra data paths to "forward" the data to where it is needed. This is the best solution, since it
doesn't slow the machine down and doesn't change the semantics of the instruction set. All of the
hazards in the example above can be handled by forwarding.
Add hardware. This is most appropriate to structural hazards; if a piece of hardware has to be
used twice in an instruction, see if there is a way to duplicate the hardware. This is exactly what
the example MIPS pipeline does with the two memory units (if there were only one, as was the
case with RISC and early SPARC, instruction fetches would have to stall waiting for memory
reads and writes), and the use of an ALU and two dedicated adders.
Stall. We can simply make the later instruction wait until the hazard resolves itself. This is
undesirable because it slows down the machine, but may be necessary. Handling a hazard on
waiting for data from memory by stalling would be an example here. Notice that the hazard is
guaranteed to resolve itself eventually, since it wouldn't have existed if the machine hadn't been
pipelined. By the time the entire downstream pipe is empty the effect is the same as if the
machine hadn't been pipelined, so the hazard has to be gone by then.
Document (AKA punt). Define instruction sequences that are forbidden, or change the
semantics of instructions, to account for the hazard. Examples are delayed loads and delayed
branches. This is the worst solution, both because it results in obscure conditions on permissible
instruction sequences, and (more importantly) because it ties the instruction set to a particular
pipeline implementation. A later implementation is likely to have to use forwarding or stalls
anyway, while emulating the hazards that existed in the earlier implementation. Both Sparc and
MIPS have been bitten by this; one of the nice things about the late, lamented Alpha was the
effort they put into creating an exceptionally "clean" sequential semantics for the instruction set,
to avoid backward compatibility issues tying them to old implementations.
Problem:
Fundamental concepts:
Latency - number of time units between two initiations (any positive integer 1, 2,…)
Latency sequence – sequence of latencies between successive initiations
Latency cycle – a latency sequence that repeats itself
Control strategy – the procedure to choose a latency sequence
Greedy strategy – a control strategy that always minimizes the latency between the
current initiation and the very last initiation
Definitions:
1. A collision occurs when two tasks are initiated with latency (initiation interval) equal to the
column distance between two “X” on some row of the reservation table.
2. The set of column distances F ={l1,l2,…,lr} between all possible pairs of “X” on each row of
the reservation table is called the forbidden set of latencies.
3. The collision vector is a binary vector C = (Cn…C2 C1),
Where Ci=1 if i belongs to F (set of forbidden latencies) and Ci=0 otherwise.
Example: Let us consider a Reservation Table with the following set of forbidden
latencies F and permitted latencies P (complementation of F).
Facts:
1. The collision vector shows both permitted and forbidden latencies from the same
reservation table.
2. One can use an n-bit shift register to hold the collision vector for implementing a control strategy
for successive task initiations in the pipeline. Upon initiation of the first task, the collision vector
is parallel-loaded into the shift register as the initial state. The shift register is then shifted right
one bit at a time, entering 0's from the left end. A collision free initiation is allowed at time
instant t+k if a 0 bit is being shifted out of the register after k shifts from time t.
A state diagram is used to characterize the successive initiations of tasks in the
pipeline in order to find the shortest latency sequence to optimize the control strategy. A state on
the diagram is represented by the contents of the shift register after the proper number of shifts is
made, which is equal to the latency between the current and next task initiations.
3. The successive collision vectors are used to prevent future task collisions with
previously initiated tasks, while the collision vector C is used to prevent possible
collisions with the current task. If a collision vector has a “1” in the ith bit (from the
right), at time t, then the task sequence should avoid the initiation of a task at time t+i.
4. Closed loops or cycles in the state diagram indicate the steady-state sustainable latency
sequence of task initiations without collisions.
The average latency of a cycle is the sum of its latencies (period) divided by the
number of states in the cycle.
5. The throughput of a pipeline is inversely proportional to the average latency (i.e. it is the reciprocal of the average latency).
A latency sequence is called permissible if no collisions exist in the successive initiations
governed by the given latency sequence.
6. The maximum throughput is achieved by an optimal scheduling strategy that achieves the
minimum average latency (MAL) without collisions.
Corollaries:
1. The job-sequencing problem is equivalent to finding a permissible latency cycle with the
MAL in the state diagram.
2. The minimum number of X's in any single row of the reservation table is a lower
bound on the MAL.
Simple cycles are those latency cycles in which each state appears only once per
each iteration of the cycle.
A simple cycle is a greedy cycle if each latency contained in the cycle is the
minimal latency (outgoing arc) from a state in the cycle.
A good task-initiation sequence should include the greedy cycle.
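The derivation of forbidden latencies and the collision vector from a reservation table can be sketched in Python as below; the reservation table used in the example is a hypothetical one, since the actual table of this lesson appears only in the missing figure:

    def forbidden_latencies(table):
        """table[i] is the set of cycles in which stage i is used.
        Forbidden latencies are all column distances between marks in a row."""
        forbidden = set()
        for row in table:
            marks = sorted(row)
            forbidden |= {b - a for a in marks for b in marks if b > a}
        return forbidden

    def collision_vector(table):
        f = forbidden_latencies(table)
        n = max(f) if f else 1
        return [1 if i in f else 0 for i in range(1, n + 1)]   # listed as (C1 ... Cn)

    # A hypothetical 3-stage reservation table (not the one in the figure):
    # stage S1 used at t0 and t5, S2 at t1 and t3, S3 at t2 and t4.
    table = [{0, 5}, {1, 3}, {2, 4}]
    print(forbidden_latencies(table))   # {2, 5}
    print(collision_vector(table))      # [0, 1, 0, 0, 1]

The maximum number of marks in any single row of this hypothetical table is 2, so 2 is a lower bound on the MAL for it.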
Buffers help in closing up the speed gap and in avoiding idling of the processing
stages caused by memory accesses. Busing concepts eliminate the time delay in storing and
retrieving intermediate results. A pipeline hazard refers to a situation in which a correct program
ceases to work correctly due to implementing the processor with a pipeline. The various pipeline
hazards are data hazards, branch hazards and structural hazards.
1. How buffering can be done using Data Buffering and Busing Structure? Explain.
2. Define Hazard. What are the types of hazards? How they can be detected and resolved?
3. Discuss i. Store-Fetch Forwarding ii. Fetch-Fetch Forwarding iii. Store-Store overwriting
4. Discuss Job Sequencing and Collision Prevention
13.6 References
The main aim of this lesson is to learn the vector processing requirements and characteristics, the
instructions used by vector processors, and to perform a comparative study of the differences between
scalar and vector operations.
14.1 Introduction
A vector processor consists of a scalar processor and a vector unit, which could be
thought of as an independent functional unit capable of efficient vector operations. Various
pipelined vector processing methods are Horizontal Processing, in which vector computations
are performed horizontally from left to right in row fashion.
Vertical processing, in which vector computations are carried out vertically from top to
bottom in column fashion. Vector looping, in which segmented vector loop computations are
performed from left to right and top to bottom in a combined horizontal and vertical method.
A vector operand contains an ordered set of n elements, where n is called the length of
the vector. Each element in a vector is a scalar quantity, which may be a floating point number,
an integer, a logical value or a character.
A vector processor consists of a scalar processor and a vector unit, which could be thought of as
an independent functional unit capable of efficient vector operations.
Vector Hardware
Vector computers have hardware to perform the vector operations efficiently. Operands can not
be used directly from memory but rather are loaded into registers and are put back in registers
after the operation. Vector hardware has the special ability to overlap or pipeline operand
processing.
Vector functional units are pipelined and fully segmented: each stage of the pipeline performs a step of
the function on different operand(s), and once the pipeline is full, a new result is produced each clock
period (cp).
Pipelining
The pipeline is divided up into individual segments, each of which is completely independent
and involves no hardware sharing. This means that the machine can be working on separate
operands at the same time. This ability enables it to produce one result per clock period as soon
as the pipeline is full. The same instruction is obeyed repeatedly using the pipeline technique so
the vector processor processes all the elements of a vector in exactly the same way. The pipeline
segments arithmetic operation such as floating point multiply into stages passing the output of
one stage to the next stage as input. The next pair of operands may enter the pipeline after the
first stage has processed the previous pair of operands. The processing of a number of operands
may be carried out simultaneously.
The loading of a vector register is itself a pipelined operation, with the ability to load one
element each clock period after some initial startup overhead.
Chaining
Theoretical speedup depends on the number of segments in the pipeline so there is a direct
relationship between the number of stages in the pipeline you can keep full and the performance
of the code. The size of the pipeline can be increased by chaining thus the Cray combines more
than one pipeline to increase its effective size. Chaining means that the result from a pipeline can
be used as an operand in a second pipeline as illustrated in the next diagram.
This example shows how two pipelines can be chained together to form an effectively single
pipeline containing more segments. The output from the first segment is fed directly into the
second set of segments thus giving a resultant effective pipeline length of 8. Speedup (over scalar
code) is dependent on the number of stages in the pipeline. Chaining increases the number of
stages
The ISA of a scalar processor is augmented with vector instructions of the following types:
Vector-vector instructions:
f1: Vi → Vj (e.g. MOVE Va, Vb)
f2: Vj x Vk → Vi (e.g. ADD Va, Vb, Vc)
Vector-scalar instructions:
f3: s x Vi → Vj (e.g. ADD R1, Va, Vb)
Vector-memory instructions:
f4: M → V (e.g. Vector Load)
f5: V → M (e.g. Vector Store)
Masking instructions:
f6: Va x Vm → Vb (e.g. MMOVE V1, V2, V3)
Gather and scatter are used to process sparse matrices/vectors. The gather operation, uses a base
address and a set of indices to access from memory "few" of the elements of a large vector into
one of the vector registers. The scatter operation does the opposite. The masking operations
allows conditional execution of an instruction based on a "masking" register.
A Boolean vector can be generated as a result of comparing two vectors, and can be used as a
masking vector for enabling and disabling component operations in a vector instruction.
A compress instruction will shorten a vector under the control of a masking vector.
A merge instruction combines two vectors under the control of a masking vector.
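These operations can be illustrated with NumPy, used here only as a stand-in for the semantics of the hardware instructions; the arrays, index vector and mask are made up for the example:

    import numpy as np

    v = np.array([10, 20, 30, 40, 50])
    idx = np.array([4, 0, 2])                  # index vector
    mask = np.array([True, False, True, False, True])

    gathered = v[idx]                          # gather:  [50 10 30]
    scattered = np.zeros_like(v)
    scattered[idx] = np.array([1, 2, 3])       # scatter: [2 0 3 0 1]
    compressed = v[mask]                       # compress under a mask: [10 30 50]
    merged = np.where(mask, v, -v)             # merge two vectors under a mask
    print(gathered, scattered, compressed, merged)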
In general, machine operations suitable for pipelining should have the following properties:
Identical Processes (or functions) are repeatedly invoked many times, each of which
can be subdivided into subprocesses (or sub functions)
Successive Operands are fed through the pipeline segments and require as few buffers
and local controls as possible.
Operations executed by distinct pipelines should be able to share expensive resources,
such as memories and buses in the system.
The operation code must be specified in order to select the functional unit or to
reconfigure a multifunctional unit to perform the specified operation.
For a memory reference instruction, the base addresses are needed for both source
operands and result vectors. If the operands and results are located in the vector register
file, the designated vector registers must be specified.
The address increment between the elements must be specified.
The address offset relative to the base address should be specified. Using the base
address and the offset the relative effective address can be calculated.
The Vector length is needed to determine the termination of a vector instruction.
The Relative Vector/Scalar Performance and Amdahl's Law
A scalar operation works on only one pair of operands from the S register and returns the
result to another S register whereas a vector operation can work on 64 pairs of operands together
to produce 64 results executing only one instruction. Computational efficiency is achieved by
processing each element of a vector identically eg initializing all the elements of a vector to zero.
A vector instruction provides iterative processing of successive vector register elements by
obtaining the operands from the first element of one or more V registers and delivering the result
to another V register. Successive operand pairs are transmitted to a functional unit in each clock
period so that the first result emerges after the start up time of the functional unit and successive
results appear each clock cycle.
Vector overhead is larger than scalar overhead, one reason being the vector length which has to
be computed to determine how many vector registers are going to be needed (ie the number of
elements divided by 64).
Each vector register can hold up to 64 words so vectors can only be processed in 64 element
segments. This is important when it comes to programming as one situation to be avoided is
where the number of elements to be processed exceeds the register capacity by a small amount
e.g. a vector length of 65. In this case the first 64 elements are processed
from one register, and the 65th element must then be processed using a separate register after the
first 64 elements have been completed. The functional unit will process this element in a time
equal to the start up time instead of one clock cycle, hence reducing the computational efficiency.
There is a sharp decrease in performance at each point where the vector length spills over into a
new register.
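A sketch of this strip-mining behaviour, processing a 65-element vector in 64-element segments; NumPy is used here only to stand in for the vector registers, and the function name is hypothetical:

    import numpy as np

    def strip_mined_add(a, b, vl=64):
        """Add two arrays in segments of at most vl elements, mimicking how a
        vector machine with 64-element registers processes a long vector."""
        c = np.empty_like(a)
        n = len(a)
        for start in range(0, n, vl):          # one vector instruction per segment
            end = min(start + vl, n)
            c[start:end] = a[start:end] + b[start:end]
        return c

    a = np.arange(65.0)                         # 65 elements: 64 plus one spill-over
    b = np.ones(65)
    print(strip_mined_add(a, b)[-3:])           # [63. 64. 65.]

The second segment here holds a single element, which is why the 65-element case pays a full start-up time for very little useful work.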
The Cray can receive a result by a vector register and retransmit it as an operand to a subsequent
operation in the same clock period. In other words a register may be both a result and an operand
register which allows the chaining of two or more vector operations together as seen earlier. In
this way two or more results may be produced per clock cycle.
Parallelism is also possible as the functional units can operate concurrently and two or more
units may be co-operating at once. This combined with chaining, using the result of one
functional unit as the input of another, leads to very high processing speeds.
Vector Processing
Load a series of elements from array KK to a vector register and a series of elements from array
LL to another vector register (these operations occur simultaneously except for instruction issue
time)
Add the corresponding elements from the two vector registers and send the results to another
vector register, representing array JJ
Store the register used for array JJ to memory
This sequence would be repeated if the array had more elements than the maximum elements
used in vector processing ie 64.
DO 10 I =1, 3
L(I) = J(I) + K(I)
N(I) = L(I) + M(I)
10 CONTINUE
Scalar Version
The two statements within this loop are each executed three times, with the operations
alternating;
L(I) is calculated before N(I) in each iteration
the new value of L(I) is used to calculate the value of N(I).
Vector Version
With vector processing the first line within the loop processes all elements of the array before the
second line is executed.
Scalar: the loop repeats the loop-control overhead in each iteration.
Vector: using pipelines, the overhead is reduced; a vector length register is used to control
the vector operations.
The pipeline vector computers can be divided into 2 architectural configurations according to
where the operands are received in a vector processor.
They are :
Memory -to- memory Architecture, in which source operands, intermediate and final results are
retrieved directly from the main memory.
Register-to-register architecture, in which operands and results are retrieved indirectly from the
main memory through the use of large number of vector or scalar registers.
Vector computations are often involved in processing large arrays of data. By ordering
successive computations in the array, vector array processing can be classified into three
types:
Let {ai for 1 <= i <= n} be n scalar constants, xj = (x1j, x2j, ..., xmj)T for j = 1, 2, ..., n be n
column vectors, and y = (y1, y2, ..., ym)T be a column vector of m components. The
computation to be performed is
y = a1.x1 + a2.x2 + .... + an.xn
The various vector instructions have been discussed, and the comparative study between scalar
and vector operations helps in knowing the differences between them. The division of pipeline vector
computers into two architectural configurations, according to where the operands are received in a
vector processor, has been discussed, and the various pipelined vector processing methods have
been mentioned.
Pipelining
Difference between vector and scalar
Vector Processing methods
14.6 References
http://www.cs.berkeley.edu/~pattrsn/252F96/Lecture06.ps
http://www.pcc.qub.ac.uk/tec/courses/cray/stu-notes/CRAY-completeMIF_2.html
http://www-ugrad.cs.colorado.edu/~csci4576/VectorArch/VectorArch.html
UNIT – IV
Lesson 15 : SIMD Array Processors, Organization, Masking and Data Routing & Inter PE
communications
The aim of this lesson is to know about various organization and control mechanisms of
array processors.
15.1 Introduction
Depending on whether there is one or several of these streams we have 4 classes of computers.
• Single Instruction Stream, Single Data Stream: SISD
• Multiple Instruction Stream, Single Data Stream: MISD
• Single Instruction Stream, Multiple Data Stream: SIMD
• Multiple Instruction Stream, Multiple Data Stream: MIMD
All N identical processors operate under the control of a single instruction stream issued by a
central control unit.
There are N data streams; one per processor, so different data can be used in each processor.
• The processors operate synchronously and a global clock is used to ensure lockstep
operation, i.e., at each step (global clock tick) all processors execute the same instruction,
each on a different datum.
• Array processors such as the ICL DAP, Connection Machine CM-200, and MasPar are
SIMD computers.
• SIMD machines are particularly useful at exploiting data parallelism to solve problems
having a regular structure in which the same instructions are applied to subsets of data.
The same instruction is issued to all 4 processors (add two numbers), and all processors execute
the instructions simultaneously. It takes one step to add the matrices, compared with 4 steps on a
SISD machine.
• In this example the instruction is simple, but in general it could be more complex such as
merging two lists of numbers.
• The data may be simple (one number) or complex (several numbers).
Each processor PEi has its own memory PEMi. It has a set of working registers and flags
named Ai, Bi and Ci. It contains an arithmetic and logic unit, a status flag Si, a local index register Ii, an
address register Di and a data routing register Ri. The Ri of each PEi is connected to the Rj of
other PEs via the interconnection network. When data transfer among PEs occurs, it is the
content of the Ri registers that is being transferred.
Some array processors may use two routing registers, one for input and the other for output. Each PEi
is either active or in the inactive mode during each instruction cycle. If a PEi is active, it executes
the instruction broadcast to it by the CU. If a PEi is inactive, it will not execute the broadcast
instruction.
The masking schemes are used to specify the status flag Si of PEi. The conventions Si = 1 is
chosen for an active PEi and Si = 0 for an inactive PEi. In the CU, there is a global index register
I and a Masking register M. The M register has N bits.
The physical length of a vector is determined by the number of PEs. The CU performs the
segmentation of a long vector into vector loops, the setting of a global address, and the offset
increment.
(Figure: components of a processing element PEi, showing registers Ai, Bi, Ci, Di, Ii and Ri, the status flag Si, the ALU and the local memory PEMi; Ri connects to the other PEs via the interconnection network, and the PE is controlled by the CU.)
Operation Mode
Control Strategy
Switching Technology
Network Topology
A network can be depicted by a graph in which nodes represent switching points and edges
represent communication links. The topologies can be of two types:
Static
Dynamic
In a static network, links between two processors are passive, and the dedicated buses cannot be reconfigured for
direct connection to other processors.
In a dynamic network, connections can be reconfigured by setting the network's active switching elements.
The SIMD computer organization has been explained. The various data masking and
routing concepts have been explained. The concept of master and slave processes has been
discussed. These are fundamental decisions in determining the appropriate structure of an
interconnection network for an SIMD machine. The decisions are made between operation
modes, control strategies, switching methodologies and network topologies.
In circuit switching a physical path is actually established between source and destination
before transmission of data.
This is suitable for bulk data transmission.
In packet switching, data is put in a packet and routed through interconnection network
without establishing a physical connection path.
Synchronous communication is needed for establishing communication paths synchronously
for either a data manipulating function or for a data instruction broadcast.
Asynchronous communication is needed for multiprocessing in which connection requests
are issued dynamically.
A system may also be designed to facilitate both synchronous and asynchronous operations
Contents:
16.0 Aims and Objectives
16.1 Introduction
16.2 SIMD Interconnection Networks
16.2.1 Static Versus Dynamic Networks
16.2.2 Mesh-Connected Illiac Networks
16.2.3 Cube Interconnection Networks
16.2.4 Shuffle-Exchange Omega Networks
16.3 Let us Sum Up
16.4 Lesson-end Activities
16.5 Points for Discussions
16.6 References
The main aim of this lesson is to learn about single stage, recirculating networks and multistage
networks. Important network classes have been included, such as the Illiac network, the Cube network
and the Omega network.
16.1 Introduction
The SIMD networks are classified into two categories based on topology, called static
networks and dynamic networks. The diagrammatical representation of static interconnection
networks is shown. Dynamic networks are further classified as single stage versus
multistage. An example of a single stage network is the one implemented in the ILLIAC, and an example
of a multistage network is the Omega network.
The topological structure of an SIMD array processor is mainly characterized by the data routing
network used in interconnecting the processing elements.
Interconnection Networks
• Parallel computers with many processors do not use shared memory hardware.
• Instead each processor has its own local memory and data communication takes place via
message passing over an interconnection network.
In this case each processor has its own private (local) memory and there is no global,
shared memory. The processors need to be connected in some way to allow them to
communicate data.
(Figure: multicomputer organizations in which each processor has its own local memory and the processors communicate either through an interconnection network or over a shared bus.)
The SIMD interconnection networks are classified into the following 2 categories based on
network topologies
Static Networks
Dynamic Networks
Static Networks
Topologies in the static networks can be classified according to the dimension required for
layout.
One dimensional topologies include
Linear array
Two dimensional topologies include
The ring
Star
Tree
Mesh
Systolic Array
Three dimensional topologies include
Completely connected
Chordal ring
3-cube
3-cube-connected cycle
Dynamic Networks
A single stage switching network has N input selectors (IS) and N output selectors (OS).
Each IS is essentially a 1-to-D demultiplexer and each OS is an M-to-1 multiplexer. The crossbar
network is a single stage network.
The single stage network is also called a recirculating network. Data items may have to
recirculate through the single stage several times before reaching their final destinations. The
number of recirculations depends on the connectivity in the single stage network.
Multistage Networks
Multi-stage networks are based on the so called shuffle-exchange switching element,
which is basically a 2 x 2 crossbar. Multiple layers of these elements are connected and
form the network.
Many stages of interconnected switches form a multistage SIMD networks.
Three characterizing features describe multistage networks:
The Switch Box
Network Topology
Control Structure
Each box is essentially an interchange device with 2 inputs and 2 outputs. There are 4 states in a
switch box. They are
Straight
Exchange
Upper Broadcast
Lower broadcast.
Figure 16.4 A two-by-two switching box and its four interconnection states
Banyan
Baseline
Cube
Delta
Flip
Indirect cube
Omega
In blocking networks, simultaneous connections of more than one terminal pair may result in
conflicts in the use of network communication links.
Examples: Data Manipulator, Flip, N-cube, Omega
In rearrangeable network, a network can perform all possible connections between inputs and
outputs by rearranging its existing connections so that a connection path for a new input-output
pair can always be established.
Example : Benes Network
In a mesh network nodes are arranged as a q-dimensional lattice, and communication is allowed
only between neighboring nodes.
In a periodic mesh, nodes on the edge of the mesh have wrap-around connections to nodes on the
other side. This is sometimes called a toroidal mesh.
Mesh Metrics
For a q-dimensional non-periodic lattice with k^q nodes:
• Network connectivity = q
• Network diameter = q(k-1)
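These two formulas can be checked with a tiny Python helper; the function name is hypothetical:

    def mesh_metrics(q: int, k: int):
        """Node count, connectivity and diameter of a q-dimensional
        non-periodic k x ... x k mesh."""
        nodes = k ** q
        connectivity = q                 # one dimension to move in per axis
        diameter = q * (k - 1)           # worst case: k-1 links in every dimension
        return nodes, connectivity, diameter

    print(mesh_metrics(2, 4))   # (16, 2, 6): a 4 x 4 mesh of 16 nodes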
Each PEi is directly connected to its four neighbors in the mesh network. Horizontally, all PEs of
all rows form a linear circular list as governed by the following two permutations, each with a
single cycle of order N. The permutation cycles (a b c)(d e) stand for the permutations a→b, b→c,
c→a and d→e, e→d, in a circular fashion within each pair of parentheses.
R+1 = (0 1 2 .... N-1)
R-1 = (N-1 ..... 2 1 0)
For the example network of N = 16 and r = √16 = 4, a shift by a distance of four is specified by
the following permutations, each with four cycles of order four:
The cube network consists of m functions defined by
cubei(Pm-1 ... Pi+1 Pi Pi-1 ... P0) = Pm-1 ... Pi+1 Pi' Pi-1 ... P0
for 0 <= i < m, where Pi' denotes the complement of bit Pi. When the PE addresses are considered as the corners of an m-dimensional cube,
this network connects each PE to its m neighbors.
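A minimal sketch of the cube_i routing function in Python, assuming PE addresses are plain integers:

    def cube(i: int, addr: int) -> int:
        """cube_i: complement bit i of the PE address, connecting each PE to its
        neighbour across dimension i of the m-dimensional hypercube."""
        return addr ^ (1 << i)

    m = 3                                             # 8 PEs, addresses 0..7
    print([cube(i, 0b101) for i in range(m)])         # neighbours of PE 5: [4, 7, 1]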
Straight and exchange switch boxes are used in constructing multistage cube
networks. The stages are numbered 0 at the input end, increasing to n-1 at the output stage.
Exchange:
E(An-1 ... A1 A0) = (An-1 ... A1 A0') = cube0. Complementing the least significant bit means data exchange
between two PEs with adjacent addresses.
Exchange connections link nodes whose numbers differ in their lowest bit.
Perfect shuffle connections link node i with node 2i mod(n-1), except for node n-1 which is
connected to itself.
Example of n = 8
The perfect shuffle cuts the deck into 2 halves from the center and remixes them evenly.
The inverse perfect shuffle does the opposite to restore the original ordering.
Figure 16.13 (a) Perfect Shuffle and Inverse Perfect Shuffle for n = 8
(Figure: an 8-node shuffle-exchange network over nodes 0-7, in which shuffle links are shown with solid lines and exchange links with dashed lines.)
Shuffle-Exchange Networks
• Let ak-1, ak-2,…, a1, a0 be the address of a node in a shuffle-exchange network in binary.
• A datum at this node will be at node number
ak-2,…, a1, a0, ak-1
after a shuffle operation.
• This corresponds to a cyclic leftward shift in the binary address.
• After k shuffle operations we get back to the node we started with, and nodes through which
we pass are called a necklace.
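The shuffle, exchange and necklace definitions above can be sketched in Python as follows, assuming node addresses are k-bit integers:

    def shuffle(addr: int, k: int) -> int:
        """Perfect shuffle on 2**k nodes: cyclic left shift of the k-bit address,
        i.e. node i is linked to node (2*i) mod (2**k - 1), except the last node."""
        return ((addr << 1) | (addr >> (k - 1))) & ((1 << k) - 1)

    def exchange(addr: int) -> int:
        """Exchange: complement the lowest-order address bit."""
        return addr ^ 1

    def necklace(addr: int, k: int):
        """Nodes reached from addr by repeated shuffles (back to addr after k shuffles)."""
        nodes, cur = [], addr
        for _ in range(k):
            nodes.append(cur)
            cur = shuffle(cur, k)
        return nodes

    print(shuffle(0b011, 3), exchange(0b011))   # 6 2
    print(necklace(0b001, 3))                   # [1, 2, 4]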
Various SIMD interconnection networks have been explained in detail, along with the classification
of networks into static and dynamic networks. The Illiac network, the Cube network and the
Omega network have been explained.
16.6 References
Contents:
The main aim of this lesson is to learn the architectural models of multiprocessors, defined as
loosely coupled and tightly coupled multiprocessors. A number of architectural features are
described below for a processor to be effective in a multiprocessing environment.
17.1 Introduction
A multiprocessor is expected to reach faster speeds than the fastest uniprocessor. Multiprocessor
characteristics are interconnection structures, interprocessor arbitration, interprocessor
communication and synchronization, and cache coherence. Multiprocessing sometimes refers to
the execution of multiple concurrent software processes in a system, as opposed to a single
process at any one instant. The two architectural models of multiprocessors are:
A tightly coupled multiprocessor, in which tasks and/or processors communicate in a highly
synchronized fashion and communicate through a common shared memory (a shared memory
system).
A loosely coupled system, in which tasks or processors do not communicate in a
synchronized fashion but communicate by message passing packets; the overhead for data exchange
is high (a distributed memory system).
A number of processors (two or more) are connected in a manner that allows them to share the
simultaneous execution of a single task. In addition, a multiprocessor consisting of a number of
single uni-processors is expected to be more cost effective than building a high-performance
single processor. An additional advantage of a multiprocessor consisting of n processors is that if
a single processor fails, the remaining fault-free n-1 processors should be able to provide
continued service albeit with degraded performance.
In loosely coupled Multiprocessors each processor has a set of input-output devices and a
large local memory from where instructions and data are accessed.
Computer Module is a combination of
o Processor
o Local Memory
o I/O Interface
Processes which executes on different computer modules communicate by exchanging
messages through a Message Transfer System (MTS).
The Degree of coupling is very Loose. Hence it is often referred as distributed system.
Loosely coupled system is usually efficient when the interactions between tasks are
minimal.
(Figure: a loosely coupled multiprocessor. Each computer module consists of a processor (P), local memory (LM) and I/O devices attached to a local bus; the modules are connected through the message transfer system.)
It consists of a
processor
local memory
Local Input-output devices
Interface to other computer Modules
The interface may contain a channel and arbiter switch (CAS)
If requests from two or more different modules collide in accessing a physical segment of the
MTS, the arbiter is responsible for choosing one of the simultaneous requests according to a
given service discipline.
It is also responsible for delaying requests until the servicing of the selected request is
completed.
The channel within the CAS may have a high speed communication memory which is used
for buffering block transfers of messages.
The message transfer system is a time shared bus or a shared memory system.
An example of an LCS is the Cm* architecture. Each computer module of the Cm* includes a local
switch called the Slocal. The Slocal intercepts and routes the processor's requests to the memory
and the I/O devices outside of the computer module via a map bus. It also accepts references
from other computer modules to its local memory and I/O devices. The address translation uses
the four higher-order bits of the processor's address, along with the current address space indicated by the X-bit of the
LSI-11 processor status word (PSW), to access a mapping table which determines whether the
reference is local or nonlocal.
A virtual address of the nonlocal reference is formed by concatenating the nonlocal address field
given by the mapping table and the source processors identification. The virtual address is
fetched by the Kmap via the map bus in response to a service request for nonlocal access. A
number of computer modules may be connected to a map bus so that they share the use of a
single KMap. The Kmap is a processor that is responsible for mapping addresses and routing
data between Slocals.
A cluster, regarded as the lowest level, is made up of computer modules, a Kmap and the map
bus. Clustering can enhance the cooperative ability of the processors in a cluster to operate on
shared data with low communication overhead.
The three processors in the Kmap are the Kbus, the Linc and the Pmap. The KBus is the bus
controller which arbitrates requests to the map bus. The Pmap is the mapping processor which
responds to requests from the Kbus and Linc. It also performs most of the request processing.
The Pmap communicates with the computer modules in its cluster via the map bus which is a
packet switched bus.
Three sets of queues provide interfaces between the Kbus, the Linc and the Pmap. Since the Pmap is
much faster than the memory in the computer modules, it is multiprogrammed to handle up to
eight concurrent requests. Each of the eight partitions is called a context and exists in the Pmap.
Typically, each context processes one transaction. If one context needs to wait for a message
packet to return with the reply to some request, the Pmap switches to another context that is
ready to run so that some other transaction can proceed concurrently.
(Figure: structure of the Kmap, showing the Kbus and Pmap connected through the service queue, run queue, send queues and out queue, with the map bus and the two intercluster buses.)
The throughput of the hierarchical loosely coupled multiprocessor may be too slow for some
applications that require fast response times. If high speed or real time processing is required the
TCS may be used.
Two typical models are discussed
In the first model, it consists of
o p processors
o l memory modules
o d input-output channels
The above units are connected through a set of three interconnection networks namely the
o processor-memory interconnection network (PMIN)
o I/O processor interconnection network (IOPIN)
o Interrupt Signal Interconnection Network (ISIN)
The PMIN is a switch which can connect every processor to every memory module. It has p x l sets of
cross points. It is a multistage network. A memory module can satisfy only one processor's request in a
given memory cycle. Hence if more than one processor attempts to access the same memory
module, a conflict occurs and it is resolved by the PMIN.
Another method to resolve conflicts is to associate a reserved storage area with each
processor; it is called the unmapped local memory (ULM). It helps in reducing the traffic in the
PMIN.
Since each memory reference goes through the PMIN, it encounters a delay in the processor-
memory switch; this can be overcome by using a private cache with each processor.
A multiprocessor organization which uses a private cache with each processor is shown. This
multiprocessor organization encounters the cache coherence problem: more than one inconsistent
copy of data may exist in the system.
In the figure there is a module attached to each processor that directs the memory reference to
either the ULM or the private cache of that processor. This module is called the memory map
and is similar in operation to the Slocal.
(Figure: a tightly coupled multiprocessor configuration showing the central processing system with processors CP0 and CP1, the central memory controller, the central memory and the peripheral processing subsystem.)
CM : Central Memory
CMC : Central Memory Controller
CPi : ith central processor
CPS : Central Processing System
PPS : peripheral processing subsystem
A number of desirable architectural features are described below for a processor to be effective
in a multiprocessing environment.
Processor recoverability
The process and the processor are two different entities. If the processor fails, there should be another
processor to take over the process. Reliability of the processor should be present.
A large general purpose register file can be used by a multi-programmed
processor. For effective utilization it is necessary for the processor to support more than one
addressing domain and hence to provide a domain change or context switching operation.
A processor intended to be used in the construction of a general purpose medium to large scale
multiprocessor must support a large physical address space. In addition a large virtual space is
also needed.
The processor design must provide some implementation of indivisible actions which serve as the
basis for synchronization primitives.
The set of processors used in a multiprocessor must have an efficient means of interprocessor
communication.
Instruction Set
The instruction set of the processor should have adequate facilities for implementing high level
languages that permit effective concurrency at the procedural level and for efficiently
manipulating data structures.
The architectural models of multiprocessor systems have been discussed, namely tightly and
loosely coupled multiprocessors; their block diagrams and examples have been explained. The
tightly coupled processors share a common memory and the loosely coupled processors do not
share a common memory. The concept of cache coherence has also been discussed. Various
processor characteristics for multiprocessing were the last topic of this lesson.
o Multiprocessor
o Tightly Coupled Multiprocessor
o Loosely Coupled Multiprocessor
o Cache Coherence
17.6 References
Contents:
The main objective of this lesson is to define the characteristics of multiprocessor systems, i.e.
the ability of each processor to share a set of memory modules and I/O devices. This sharing
capability is provided through a set of interconnection networks.
18.1 Introduction
Interconnection networks help in sharing resources: one network is between the processors and the
memory modules, and the other between the processors and the I/O systems. There are different
physical forms available for interconnection networks. They are the time shared or common bus, the
crossbar switch and multistage networks for multiprocessors.
The organization and performance of a multiple processor system are greatly influenced
by the interconnection network used to connect them.
perform with the data being transferred, after which the data transfer is finally initiated. A
receiving unit recognizes its address placed on the bus and responds to the control signals from
the sender.
A multiprocessor with unidirectional buses uses both buses, yet not much is actually gained.
This method increases the complexity; the interconnection subsystem becomes an active device.
When multiple devices concurrently request the use of the bus, the device with the highest
priority is granted the bus. This method is called daisy chaining, and all the devices are
assigned static priorities according to their locations along a bus grant control line. The device
closest to the bus control unit gets the highest priority.
Requests are made through a common line, BRQ (bus request). If the central bus control accepts
the request, it passes a BGT (bus grant) signal to the concerned device if and only if the SACK (bus busy) line
is free or idle.
The available bus bandwidth is divided into fixed time slices which are then offered to the various
devices in a round-robin fashion. If the allotted device does not use its time slice, then the time
slice is wasted. This is called fixed time slicing or time division multiplexing.
When the priorities allocated to the devices are assigned dynamically, every device gets a
chance to use the bus and does not suffer a long turnaround time.
The two algorithms are
o Least Recently used (LRU)
o Rotating daisy chain (RDC)
The LRU gives the highest priority to the requesting device that has not used the bus for the
longest interval.
In an RDC scheme, no central controller exists, and the bus grant line is connected from
the last device back to the first in a closed loop. Whichever device is granted access to the bus
serves as the bus controller for the following arbitration (an arbitrary device is selected to have
initial access to the bus). Each device's priority for a given arbitration is determined by that
device's distance along the bus grant line from the device currently serving as bus controller;
the latter device has the lowest priority. Hence the priorities change dynamically with each bus cycle.
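A toy Python sketch of rotating daisy-chain arbitration; the device numbering and request encoding are assumptions made only for this illustration:

    def rotating_daisy_chain(current_controller: int, requests: set, n: int) -> int:
        """Grant the bus to the requesting device closest (going around the grant
        loop) to the current bus controller, which itself has the lowest priority."""
        for distance in range(1, n + 1):
            candidate = (current_controller + distance) % n
            if candidate in requests:
                return candidate
        return current_controller          # no requests: the controller keeps the bus

    # Devices 0..3; device 1 currently holds the bus; devices 0 and 3 request it.
    print(rotating_daisy_chain(1, {0, 3}, 4))   # 3 (closer along the grant loop)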
In FCFS, requests are simply honored in the order received. It favors no particular
processor or device on the bus. The performance measures may be degraded in the case of FCFS,
since a device may have to wait for the bus for a long period of time just because its request was
made late.
If the number of buses in a time shared bus system is increased, a point is reached at which there is a
separate path available for each memory unit. The interconnection network is then called a
nonblocking crossbar.
Advantages
- Multiple paths -> high transfer rate
Disadvantages
- Memory control logic
- Large number of cables and connections
Figure 18.5 Cross Bar switch system organization for Multiprocessors
In order to design multistage networks, the basic principles involved in the construction
and control of simple crossbar has to be understood.
This 2 x 2 switch has the capability of connecting the input A to either the output labelled
0 or the output labelled 1, depending on the value of some control bit CA of the input A. If
CA = 0 the input is connected to the upper output, and if CA = 1, the connection is made to the
lower output. Terminal B of the switch behaves similarly with a control bit CB. The 2 x 2
module also has the capability to arbitrate between conflicting requests. If both inputs A and B
require the same output terminal, then only one of them will be connected and the other will be
blocked or rejected.
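A small Python sketch of the routing behaviour of such a 2 x 2 switch box; the assumption that input A wins a conflict is made only for this illustration, since the text does not specify which request is blocked:

    def switch_2x2(a, b, ca: int, cb: int):
        """Route inputs A and B of a 2 x 2 switch: control bit 0 selects the
        upper output, 1 the lower output. On conflict, A wins and B is blocked."""
        outputs = [None, None]                # [upper, lower]
        outputs[ca] = a
        if outputs[cb] is None:
            outputs[cb] = b
        else:
            print("conflict: input B blocked")
        return outputs

    print(switch_2x2("A", "B", 0, 1))   # ['A', 'B']  (straight)
    print(switch_2x2("A", "B", 1, 0))   # ['B', 'A']  (exchange)
    print(switch_2x2("A", "B", 0, 0))   # conflict: ['A', None]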
The interconnection networks such as the time shared bus, the crossbar switch and multiport
memory have been discussed. For the time shared bus, various bus arbitration algorithms such as daisy
chaining, round robin, LRU and FCFS have been explained. The crossbar switch is the most complex
interconnection system and has the potential for the highest total transfer rate. The multiport
memory is also an expensive memory unit since the switching circuitry is included in the
memory unit. A final table clearly differentiates among all three interconnection
networks.
18.6 References
UNIT – V
Contents:
The main aim is to learn about parallel algorithms and how a problem can be solved on a parallel
computer. The abstract machine models are discussed; these models are useful in the design
and analysis of parallel algorithms.
19.1 Introduction
In order to simplify the design and analysis of parallel algorithms, parallel computers are
represented by various abstract models. These models capture the important features of a parallel
computer. The Random Access Machine is the preliminary model, and the PRAM is one of the
popular models for designing parallel algorithms. The variants of the PRAM model are EREW,
CREW, ERCW and CRCW. The concept of interconnection networks has been discussed, since the
exchange of data among processors takes place through the shared memory. A combinational
circuit, which can be viewed as a device with a set of input lines at one end and a set of output lines at
the other end, is also a model of parallel computers.
Algorithms designed for sequential computers are known as sequential algorithms.
Algorithms worked out for a parallel computer are termed parallel algorithms.
Traditionally, software has been written for serial computation:
o To be run on a single computer having a single Central Processing Unit (CPU);
o A problem is broken into a discrete series of instructions.
o Instructions are executed one after another.
o Only one instruction may execute at any moment in time.
In the simplest sense, parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem.
o To be run using multiple CPUs
o A problem is broken into discrete parts that can be solved concurrently
o Each part is further broken down to a series of instructions
o Instructions from each part execute simultaneously on different CPUs
Various abstract machine models are discussed below; they help in designing parallel algorithms.
Figure: The RAM model - a processor connected to the memory through a memory access unit.
Execute: The processor performs basic arithmetic and logic operations on the contents
of one or two of its registers.
Each PRAM step consists of three phases, executed in the following order:
A read phase in which each processor may read a value from shared memory
A compute phase in which each processor may perform basic arithmetic/logical operations
on their local data.
A write phase where each processor may write a value to shared memory.
Note that this prevents reads and writes from being simultaneous.
The above requires a PRAM step to be sufficiently long to allow processors to do different
arithmetic/logic operations simultaneously.
Parallel computers with many processors do not use shared memory hardware.
Instead each processor has its own local memory and data communication takes place via
message passing over an interconnection network.
The characteristics of the interconnection network are important in determining the
performance of a multicomputer.
If network is too slow for an application, processors may have to wait for data to arrive.
Combinational circuits form another model of parallel computation. Their component characteristics are:
o Has no program
o Has no feedback
o Fan-out: Number of output lines carrying data to the outside world or to the next stage.
o Depth: The number of stages in a circuit. Gives the worst case running time for the problem.
o Width: Maximal number of components per stage.
o Size: The total number of components.
The butterfly circuit has n inputs and n outputs. The depth of the circuit is (1 + log n) and the
width of the circuit is n. The size in this case is (n + n log n) where n = 8
Various models of computation have been discussed with their schematic diagrams. The
PRAM is one of the popular models for designing parallel algorithms; the phases of its algorithms
and its subdivision based on memory access into EREW, CREW, ERCW and
CRCW have been discussed. The concept of a combinational circuit and its important parameters such as
width, size and depth have been illustrated with the butterfly circuit as an example.
19.6 References
PRAM Models, Advanced Algorithms & Data Structures, Lecture Theme 13, Prof.
Dr. Th. Ottmann
PRAM and Basic Algorithms by Shietung Peng
Introduction to Parallel Processing: Algorithms and Architectures by Behrooz Parhami
Contents:
The main aim of this lesson is to analyse parallel algorithms based on the principal
criteria: running time, number of processors and cost.
20.1 Introduction
Any sequential algorithm is evaluated based on two parameters, the running time
complexity and the space complexity. Parallel algorithms are evaluated based on running time,
number of processors and cost. The concept of prefix computation, a very useful sub-operation
in many parallel algorithms, has also been discussed.
The running time of an algorithm is a function of the input given to the algorithm. The
algorithm may perform well for certain inputs and relatively bad for certain others. The worst
case running time is defined as the maximum running time of the algorithm taken over all the
inputs. The average case running time is the average running time of the algorithm over all the
inputs.
The running time depends not only on the algorithm but also on the machine on which it is executed.
Notion of Order
Consider two functions f(n) and g(n) defined from the set of positive integers to the set of
positive reals.
The function g(n) is said to be of order at least f(n), denoted Ω(f(n)), if there exist
positive constants c, n0 such that g(n) >= c f(n) for all n >= n0.
The function g(n) is said to be of order at most f(n), denoted O(f(n)), if there exist
positive constants c, n0 such that g(n) <= c f(n) for all n >= n0.
The function g(n) is said to be of the same order as f(n) if g(n) is in O(f(n)) and g(n) is in
Ω(f(n)).
The lower bound on a problem indicates the minimum number of steps required to
solve the problem in the worst case. Every problem is associated with a lower bound and an upper
bound.
Speedup
Speed-up is the ratio of the time taken to run the best sequential algorithm on one
processor of the parallel machine divided by the time to run on N processors of the parallel
machine.
S(N) = Tseq/Tpar(N)
Efficiency is the speed-up per processor.
E(N) = S(N)/N = (1/N)(Tseq/Tpar(N))
Overhead is defined as
f(N) = 1/E(N) - 1
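For example (with illustrative numbers, not taken from the text), if the best sequential algorithm takes Tseq = 120 s and the parallel version takes Tpar(4) = 40 s on N = 4 processors, then S(4) = 120/40 = 3, E(4) = 3/4 = 0.75, and the overhead is f(4) = 1/0.75 - 1, which is approximately 0.33.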
The next important criterion for evaluating a parallel algorithm is the number of processors
required to solve the problem. Given a problem of input size n, the number of processors
required by an algorithm is a function of n, denoted by p(n).
20.2.3 Cost
The cost of a parallel algorithm is defined as the product of the running time of the
parallel algorithm and the number of processors used.
The efficiency of a parallel algorithm is defined as the ratio of worst case running time of the
fastest sequential algorithm and the cost of the parallel algorithm.
Prefix computation is a very useful sub-operation in many parallel algorithms. The prefix
computation problem is defined as follows.
A set X = {x0, x1, ..., xn-1} is given, together with an operation * on the set, such that
1. * is a binary operation
2. X is closed under *
3. * is associative
Define
s0 = x0
s1 = x0 * x1
.
.
.
sn-1 = x0 * x1 * ... * xn-1
Obtaining the set S = {s0, s1, ..., sn-1} given the set X is known as the prefix computation.
The indices of the elements used to compute si form the string 0 1 2 ... i, which is a prefix of the
string 0 1 2 ... n-1, and hence the name prefix computation for this problem.
Assume that the set is the set of natural numbers and the operation is the + operation.
Let X = {x0, x1, ..., xn-1} be the input. The partial sums si (0 <= i <= n-1), where
si = x0 + x1 + ... + xi, can be found easily on a RAM in O(n) time.
The PRAM algorithm given here requires n processors P0, P1, …. , Pn-1, where n is a
power of 2. There are n memory locations m0, m1, … m n-1. Initially the memory location
mi(0<=i<=n-1) contains the input xi. When the algorithm terminates the location mi (0<=i<=n-1)
has the partial sum output si. The algorithm is given below.
It consists of log n iterations; during each iteration, the binary operation + is performed by
pairs of processors whose indices are separated by a distance twice that in the previous iteration.
Analysis
There are log n iterations in this algorithm. In each iteration one addition is done in
parallel, and hence each iteration takes O(1) time. The time complexity is therefore O(log n). Since the
algorithm uses n processors, the cost of the algorithm is O(n log n), which is clearly not cost
optimal. This algorithm nevertheless has the optimal time complexity among non-CRCW machines,
because the prefix computation of sn-1 requires combining n values, which has a lower bound of
Ω(log n) steps on any non-CRCW parallel machine.
A second algorithm also runs in O(log n) time but makes use of fewer processors to achieve this. Let
X = {x0, x1, ..., xn-1} be the input to the prefix computation. Let k = log n and m = n/k. The
algorithm uses m processors P0, P1, …. , Pm-1. The input sequence X is split into m subsequences,
each of size k, namely
Y0 = x0, x1, ….., xk-1
Y1 = xk, xk+1, ….., x2k-1
.
.
.
Ym-1 = xn-k, xn-k+1, ….., xn-1
The algorithm given below proceeds in 3 steps.
Step 1: The processor Pi (0 <= i <= m-1) works on the subsequence Yi and computes the prefix sums
of Yi using a sequential algorithm. Thus the processor Pi computes sik, sik+1, ..., s(i+1)k-1, where
sik+j = xik + xik+1 + ... + xik+j (0 <= j <= k-1).
Step 2: The processors P0, P1, ..., Pm-1 perform a parallel prefix computation on the sequence
{sk-1, s2k-1, ..., sn-1}. This parallel prefix computation is done using the algorithm described earlier.
When this step is completed, sik-1 will have been replaced by sk-1 + s2k-1 + ... + sik-1, which is the
true prefix sum x0 + x1 + ... + xik-1.
Step 3: Each processor Pi (1 <= i <= m-1), in parallel, adds the final value of s(ik-1) to each of the
partial sums sik, sik+1, ..., s(i+1)k-2 computed in Step 1, so that all the prefix sums become correct.
Sequence – 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
Use n / [lg n] PEs with lg(n) items each
0,1,2,3 4,5,6,7 8,9,10,11 12,13,14,15
STEP 1: Each PE performs sequential prefix sum
0,1,3,6 4,9,15,22 8,17,27,38 12,25,39,54
STEP 2: Perform parallel prefix sum on last nr. in PEs
0,1,3,6 4,9,15,28 8,17,27,66 12,25,39,120
Now prefix value is correct for last number in each PE
STEP 3: Add last number of each sequence to incorrect sums in next sequence (in parallel)
0,1,3,6 10,15,21,28 36,45,55,66 78,91,105,120
Algorithm Analysis:
Step 1 takes O(k) = O(lg n) time.
Step 2 takes
o O(lg m) = O(lg n/k)
o O(lg n- lg k) = O(lg n - lg lg n)
o O(lg n)
Step 3 takes O(k) = O(lg n) time
o The overall time for this algorithm is O(lg n).
o The overall cost is O((lg n) x (n / lg n)) = O(n), which is cost optimal.
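A sequential Python sketch of these three steps (illustrative only; the "parallel" loops are simulated, and the block carries of Step 2 are shown as a simple running sum) reproduces the 16-element example above:

import math

def optimal_prefix_sum(x):
    n = len(x)
    k = max(1, int(math.log2(n)))             # block size lg n
    blocks = [x[i:i + k] for i in range(0, n, k)]
    # Step 1: each processor computes local prefix sums of its block.
    for blk in blocks:
        for i in range(1, len(blk)):
            blk[i] += blk[i - 1]
    # Step 2: prefix computation over the last element of every block,
    # kept here as the running total of the preceding blocks.
    carry = 0
    carries = []
    for blk in blocks:
        carries.append(carry)
        carry += blk[-1]
    # Step 3: each processor adds the carry from the preceding blocks.
    for blk, c in zip(blocks, carries):
        for i in range(len(blk)):
            blk[i] += c
    return [v for blk in blocks for v in blk]

print(optimal_prefix_sum(list(range(16))))
# [0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120]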
The technique used for prefix computation on a linked list is called pointer jumping. The
data are stored in pointer-based structures such as linked lists and trees.
• Given a singly linked list L with n objects, compute, for each object in L, its distance
from the end of the list.
• Formally: suppose next is the pointer field; then
– d[i] = 0 if next[i] = nil
– d[i] = d[next[i]] + 1 if next[i] ≠ nil
• Serial algorithm: Θ(n).
The parallel algorithm first initializes, for each processor i in parallel:
if next[i] = nil then d[i] ← 0
else d[i] ← 1
It then repeats:
while there exists an object i such that next[i] ≠ nil
do for each processor i, in parallel
do if next[i] ≠ nil
then d[i] ← d[i] + d[next[i]]
next[i] ← next[next[i]]
Loop invariant: for each i, the sum of d values in the sublist headed by i is the correct
distance from i to the end of the original list L.
Parallel memory must be synchronized: the reads on the right must occur before the
writes on the left. Moreover, read d[i] and then read d[next[i]].
An EREW algorithm: every read and write is exclusive. For an object i, its processor
reads d[i], and then its precedent processor reads its d[i]. Writes are all in distinct
locations.
The three principal criteria used in evaluating parallel algorithms, namely running
time, number of processors and cost, have been discussed. Prefix computation, a very useful sub-operation
in many parallel algorithms, has been defined; a procedure for prefix computation has
been analysed and its complexity calculated. The pointer jumping technique for prefix
computation helps in solving problems whose data are stored in pointer-based structures.
1. Analyse Parallel Algorithms in terms of Running Time, Number of Processors and Cost.
2. Explain prefix computation in detail.
Pointer Jumping
The technique used for prefix computation on a linked list is called pointer
jumping. The data are stored in pointer-based structures such as linked lists and trees.
20.6 References
Parallel Computer architectures and Programming, V. Rajaraman & C. Siva Ram Murthy
Lesson 21 : Sorting
Contents:
The main aim of this lesson is to learn the concept of sorting, how it is implemented on parallel
processors, and the design of various sorting algorithms.
21.1 Introduction
The first algorithm discussed is sorting with a combinational circuit based on bitonic sequences,
in which a0 through ai is monotonically increasing and ai through an-1 is monotonically decreasing.
The next algorithm discussed is merge sorting, in which the unsorted sequence is divided into two
halves, each half is sorted recursively, and the halves are then merged together to obtain the sorted
sequence.
Various sorting algorithms on PRAM models have been discussed. For sorting on a linear
array, the odd-even transposition method has been adopted, and sorting on a hypercube based on
the bitonic method has been discussed.
21.2 Sorting
The problem of sorting is important as sorting data is at the heart of many computations. Various
sorting algorithms are described below.
The input to the combinational circuit is n unsorted numbers and the output is the sorted
permutation of the n numbers.
The basic processing unit in the combinational circuit is the comparator. A comparator has two
inputs x and y, and two outputs x' and y'.
A sequence <a0, a1, ..., an-1> is a bitonic sequence if (1) there exists an index i such that a0 through ai is
monotonically increasing and ai through an-1 is monotonically decreasing, or (2) there exists a
cyclic shift of indices so that the first condition is satisfied.
Let s = <a0,a1,…,an-1> be a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ an/2-1 and an/2 ≥
an/2+1 ≥ ··· ≥ an-1.
Consider the following subsequences of s:
s1 = <min{a0,an/2},min{a1,an/2+1},…,min{an/2-1,an-1 }>
s2 = <max{a0,an/2},max{a1,an/2+1},…,max{an/2-1,an-1}>
Note that s1 and s2 are both bitonic and each element of s1 is less than every element in s2.
The procedure can be applied recursively on s1 and s2 to get the sorted sequence.
Figure 21.2 Merging a 16-element bitonic sequence through a series of log 16 bitonic splits.
We can easily build a sorting network to implement this bitonic merge algorithm.
Such a network is called a bitonic merging network.
144
The network contains log n columns. Each column contains n/2 comparators and performs one
step of the bitonic merge.
We denote a bitonic merging network with n inputs by BM[n].
Replacing the ⊕ (increasing) comparators by Ө (decreasing) comparators results in a decreasing output sequence; such a
network is denoted by ӨBM[n].
A bitonic merging network for n = 16. The input wires are numbered 0,1,…, n - 1, and the binary
representation of these numbers is shown. Each column of comparators is drawn separately; the
entire figure represents a BM[16] bitonic merging network. The network takes a bitonic
sequence and outputs it in sorted order.
We must first build a single bitonic sequence from the given sequence.
A sequence of length 2 is a bitonic sequence.
A bitonic sequence of length 4 can be built by sorting the first two elements using BM[2] and
next two, using ӨBM[2].
This process can be repeated to generate larger bitonic sequences.
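The whole construction can be sketched recursively in Python (an illustrative sequential simulation of the network for power-of-two inputs; the function names are made up, not part of the text):

def bitonic_merge(seq, ascending=True):
    # seq is assumed bitonic; one split plus two recursive merges sorts it.
    n = len(seq)
    if n == 1:
        return seq
    half = n // 2
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    if not ascending:
        lo, hi = hi, lo
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

def bitonic_sort(seq, ascending=True):
    # Sort the two halves in opposite directions to build a bitonic
    # sequence, then merge it.
    n = len(seq)
    if n == 1:
        return seq
    first = bitonic_sort(seq[:n // 2], True)
    second = bitonic_sort(seq[n // 2:], False)
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([10, 3, 7, 1, 15, 4, 8, 2]))   # [1, 2, 3, 4, 7, 8, 10, 15]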
A schematic representation of a network that converts an input sequence into a bitonic sequence.
In this example, ⊕BM[k] and ӨBM[k] denote bitonic merging networks of input size k that use
⊕ and Ө comparators, respectively. The last merging network (⊕BM[16]) sorts the input. In this
example, n = 16.
Analysis
The bitonic sorting network for sorting a sequence of n numbers consists of log n stages. The last
stage of the network uses a ⊕BM[n], which has a depth of log n. The remaining stages perform
a complete sort of n/2 elements. The depth d(n) of the sorting network is given by the recurrence relation
d(n) = d(n/2) + log n. Solving this, we get d(n) = ((log n)^2 + log n)/2 = O(log^2 n).
The second combinational circuit for sorting is based on the merge sort algorithm for a
sequential computer.
First a merging circuit is developed. The idea is to split the two sequences to be merged into
two subsequences each, merge them separately, and then merge the two merged sequences into
one.
First, the even-indexed and odd-indexed elements of the sorted input sequences are merged by the
same odd-even merge algorithm, applied recursively.
Then the two output sequences are merged into the final sorted sequence. Each element
only needs to be compared with neighbouring elements from the other sequence, and hence this
can be done in one additional time step.
Analysis
Width : Each comparator has exactly 2 inputs and 2 outputs. The circuit takes 2m inputs and
produces 2m outputs. Thus it has width of m.
Depth: The merging circuit consists of 2 smaller circuits for merging sequences of length m/2,
followed by one stage of comparators. Thus the depth d(m) of the circuit is
d(m) = 1 (m = 1)
d(m) = d(m/2) + 1 (m > 1)
Size: The size p(m) of the circuit can be obtained by solving the following recurrence relations.
p(1) = 1 (m = 1)
p(m) = 2p(m/2) + m - 1 (m > 1)
Solving, we get p(m) = 1 + m log m.
We now consider various sorting algorithms for the PRAM models of a parallel computer. All these algorithms are
based on the idea of sorting by enumeration: every
element is compared with all the other elements and then placed directly at its position in the sorted sequence.
Let S = <s1, s2, ..., sn> denote the unsorted input sequence. The position of each element si of S in
the sorted sequence, known as the rank of si, is determined by computing ri, the number of elements
smaller than it.
An array R = <r1, r2, ..., rn> is used to keep track of the ranks of the elements in the input
sequence. The array R is initialized to 0 in all the algorithms. At the end, ri contains the number of
elements of the input sequence smaller than si.
CRCW Sorting
An algorithm is described for the SUM CRCW model of a parallel computer, which has n2 processors.
When more than one processor attempts to write to a common memory location, the sum of all
the individual values is written into the memory location.
The n2 processors are arranged as an n x n matrix; Pi,j denotes the processor in the ith row and jth
column. The processor Pi,j writes the value 1 into ri if si > sj, or si = sj and i > j.
For i := 1 to n do in parallel
For j := 1 to n do in parallel
If si > sj or (si = sj and i >j) then
Pi,j writes 1 to ri
End if
End for
End for
For i := 1 to n do in parallel
Pi,1 puts si in (ri +1) position of S
End for
The above algorithm takes O(1) time and uses O(n2) processors, which is very large.
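The effect of this SUM CRCW enumeration sort can be simulated in Python (illustrative only; the n2 "processors" become a double loop, and the concurrent writes of 1 into ri are summed):

def crcw_enumeration_sort(s):
    n = len(s)
    # r[i] counts, over all (i, j) "processors", how many elements precede s[i].
    r = [0] * n
    for i in range(n):
        for j in range(n):
            if s[i] > s[j] or (s[i] == s[j] and i > j):
                r[i] += 1            # SUM-CRCW: concurrent writes of 1 are added
    out = [None] * n
    for i in range(n):
        out[r[i]] = s[i]             # processor P(i,1) places s[i] at rank r[i]
    return out

print(crcw_enumeration_sort([5, 2, 9, 2, 7]))   # [2, 2, 5, 7, 9]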
CREW Sorting
The sorting algorithm for the CREW model is very similar to the previous one. The algorithm uses
n processors and has a time complexity of O(n). Let the processors be denoted by P1, P2, ..., Pn.
The processor Pi computes the value ri after n iterations. In the jth iteration of the algorithm, all
the processors read sj, and the processor Pi increments ri if si > sj, or si = sj and i > j. Therefore, after
n iterations ri contains the number of elements smaller than si in the input sequence.
For i := 1 to n do in parallel
For j := 1 to n do
If si > sj or (si = sj and i > j) then
Pi adds 1 to ri
End if
End for
Pi puts si in (ri + 1) position of S
End for
EREW Sorting
Concurrent read conflicts can be avoided if we ensure that each processor reads a unique element
in each iteration. This can be achieved by allowing the processors to read the elements in a cyclic
manner. In the first iteration the processors P1, P2, ..., Pn read s1, s2, ..., sn; in the second iteration
the processors read s2, s3, ..., sn, s1 respectively; and so on. Thus in the jth iteration the
processors P1, P2, ..., Pn read sj, sj+1, ..., sj-1 and update the corresponding r values. This has a
running time complexity of O(n).
For i := 1 to n do in parallel
For j := 0 to n-1 do
k := ((i + j - 1) mod n) + 1
If si > sk or (si = sk and i > k) then
Pi adds 1 to ri
End if
End for
Pi puts si in (ri + 1) position of S
End for
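A Python sketch of this EREW variant (illustrative, 0-based indices) shows the cyclic indexing that keeps every simulated processor reading a different element in each iteration:

def erew_enumeration_sort(s):
    n = len(s)
    r = [0] * n
    for j in range(n):                       # the j-th "parallel" iteration
        for i in range(n):                   # processor i
            k = (i + j) % n                  # each processor reads a distinct s[k]
            if s[i] > s[k] or (s[i] == s[k] and i > k):
                r[i] += 1
    out = [None] * n
    for i in range(n):
        out[r[i]] = s[i]
    return out

print(erew_enumeration_sort([5, 2, 9, 2, 7]))   # [2, 2, 5, 7, 9]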
In interconnection networks the comparisons are made as follows. Consider two adjacent processors Pi and
Pj. The processor Pi has the element ai and the processor Pj has the element aj. The
processor Pi sends the element ai to the processor Pj and the processor Pj sends the element aj to
the processor Pi. The processor Pi keeps the element min(ai, aj) and the processor Pj keeps the
element max(ai, aj). This is referred to as compare-exchange(Pi, Pj).
At odd steps, compare contents of cell 1 and 2, cells 3 and 4, etc, and swap values if
necessary so that the smaller value ends up in the leftmost cell. At even steps do the
same but for the pairs 2 and 3, 4 and 5, etc.
For i := 1 to n do
If i is odd then
For k := 1, 3, ..., 2[n/2] - 1 do in parallel
Compare-exchange (Pk, Pk+1)
Else
For k := 2, 4, ..., 2[(n-1)/2] do in parallel
Compare-exchange (Pk, Pk+1)
End for
End if
End for
There are n parallel phases in the algorithm, and during each phase the odd-indexed or even-indexed
processors perform compare-exchange operations, which take O(1) time. The time
complexity of the algorithm is O(n). The number of processors used is n. The cost is O(n2), which
is not optimal.
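A sequential Python simulation of the n phases (illustrative only; indices are 0-based, so the "odd" phases of the text compare positions 0 and 1, 2 and 3, and so on) is:

def odd_even_transposition_sort(a):
    a = list(a)
    n = len(a)
    for phase in range(1, n + 1):
        # Odd phases compare pairs (0,1), (2,3), ...; even phases (1,2), (3,4), ...
        start = 0 if phase % 2 == 1 else 1
        for k in range(start, n - 1, 2):     # done in parallel on the linear array
            if a[k] > a[k + 1]:
                a[k], a[k + 1] = a[k + 1], a[k]
    return a

print(odd_even_transposition_sort([6, 1, 4, 8, 2, 5, 3, 7]))
# [1, 2, 3, 4, 5, 6, 7, 8]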
Sorting on a Hypercube
The bitonic sort algorithm can be efficiently mapped onto the hypercube network without much
communication overhead. The basic idea is to simulate the functioning of the combinational
circuit for bitonic sorting using a hypercube network. Let the input wires (lines) in the
combinational circuit be numbered 0000, 0001, ..., 1111 in binary from top to bottom.
The comparison operation is performed between two wires whose indices differ in exactly one bit.
In a hypercube, processors whose indices differ in only one bit are neighbours. Thus an optimal
mapping of input wires to hypercube processors is the one that maps the input wire with index l
to the processor with index l, where l = 0, 1, ..., n-1 and n is the number of processors. Whenever a
comparison is to be performed between two wires of the circuit, the corresponding neighbouring
processors in the hypercube perform a compare-exchange over the link connecting them.
Various sorting techniques for parallel processors have been discussed: bitonic sort,
odd-even merge sort, CRCW sorting, CREW sorting and EREW sorting. A method called
odd-even transposition sort has been discussed, and sorting on a hypercube has been implemented
using the bitonic sort method.
21.6 References
Parallel Computer architectures and Programming, V. Rajaraman & C. Siva Ram Murthy
IFI Technical Reports, Institute of Computer Science, Clausthal University
of Technology
Contents:
22.1 Introduction
22.2 Searching
If the array is sorted in nondecreasing order then a binary search algorithm can provide the
answer in O(log n) time. This is optimal, since roughly this many bits are required to identify an
element of the array.
EREW Searching
In this case the value being searched for, x, must first be broadcast to all of the N processors, requiring
O(log N) time.
The sequence can then be subdivided into N subsequences of length n/N each, and each
processor is assigned one of them.
Binary search on these subsequences requires O(log(n/N)) time, so that the parallel algorithm
runs in O(log N + log(n/N)) = O(log n) time.
CREW Searching
In this case, x does not need to be broadcast, since all processors can simply read the value concurrently.
Recall that the sequential binary search algorithm first compares x with the element at the middle of the
array, making a decision to keep the upper or lower
half of the array (if the middle element is not itself equal to x).
At each stage of the parallel algorithm the array is split into N + 1 subsequences of equal length and the
N processors simultaneously check the boundaries of their assigned subsequences. Each
processor determines whether the target element is to the left or the right of its boundary.
The next stage divides the selected subsequence into N + 1 subsequences and continues.
Since each stage is applied to an array whose length is 1/(N + 1) of the length of the previous sequence,
only O(log n / log(N + 1)) stages are needed.
As an intuitive proof (Akl, 1989), let k be the smallest integer such that n <= (N + 1)^k, that is,
k = ceil(log(n + 1) / log(N + 1)); for large n this gives a running time of O(log n / log(N + 1)).
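To make the idea concrete, here is a Python sketch of searching a sorted array by repeatedly splitting it into N + 1 segments (an illustrative sequential simulation; the function name, probe positions and loop structure are assumptions, not taken from the text):

def crew_search(a, x, n_proc):
    # a is sorted; each stage keeps only one of N + 1 roughly equal segments.
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        # The N processors probe N boundary positions "simultaneously".
        step = (hi - lo + 1) / (n_proc + 1)
        probes = [lo + int(step * (p + 1)) for p in range(n_proc)]
        for idx in probes:
            if a[idx] == x:
                return idx
        # Keep the segment that can still contain x.
        left = lo
        for idx in probes:
            if a[idx] < x:
                left = idx + 1
        right = hi
        for idx in reversed(probes):
            if a[idx] > x:
                right = idx - 1
        if (left, right) == (lo, hi):
            break                     # no progress possible: x is absent
        lo, hi = left, right
    return -1

data = [1, 4, 6, 9, 12, 15, 18, 21, 25, 30, 33]
print(crew_search(data, 18, n_proc=3))    # 6
print(crew_search(data, 5, n_proc=3))     # -1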
Although this is not cost optimal, it can be shown to be the best possible running time for parallel
searching. Note that any algorithm using N processors can compare an input element to at
most N elements of the array simultaneously; the remaining subsequence that must still be searched
will, in the worst case, contain at least (n - N)/(N + 1) elements.
CRCW Searching
For arrays of non-unique elements, several processors may well attempt to return the index of a
found element in parallel.
Similar to the broadcast, such write conflicts among the N processors can be resolved in O(log N)
steps, and this additional cost can eliminate any benefit, and hence any speedup, of the concurrent
writes.
Matrix operations are also fundamental components of many numerical and non-numerical
algorithms.
For i := 1 to n do in parallel
For j := 1 to n do in parallel
Ci,j := 0
For k := 1 to n do
Ci,j := Ci,j + ai,k * bk,j
End for
End for
End for
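The same computation, simulated sequentially in Python (illustrative only; the two outer loops stand for the n2 processors, each of which reads rows and columns concurrently in the CREW model), looks like:

def crew_matrix_multiply(a, b):
    n = len(a)
    c = [[0] * n for _ in range(n)]
    # One "processor" per (i, j) pair; each accumulates its own Ci,j.
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(crew_matrix_multiply(a, b))    # [[19, 22], [43, 50]]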
In the case of the CREW model, the advantage is that a memory location can be read by more than one
processor at the same time. In the EREW model one needs to ensure that every processor reads a value from a
memory location which is not being accessed by any other processor.
For i := 1 to n do in parallel
For j := 1 to n do in parallel
Ci,j := 0
For k := 1 to n do
lk := ((i + j + k) mod n) + 1
Ci,j := Ci,j + ai,lk * blk,j
End for
End for
End for
For i := 1 to n do in parallel
For j := 1 to n do in parallel
Ci,j := 0
For s := 1 to n do in parallel
Ci,j := Ci,j + ai,s * bs,j
End for
End for
End for
The problem of solving a system of linear equations (Ax = b) is central to many problems in
scientific and engineering computing. One of the methods for solving linear equations is the Gaussian
elimination method.
a00 x0 + a01 x1 + ... + a0,n-1 xn-1 = b0
a10 x0 + a11 x1 + ... + a1,n-1 xn-1 = b1
.......
Gaussian elimination reduces this system to the upper-triangular form
x0 + u01 x1 + ... + u0,n-1 xn-1 = y0
x1 + ... + u1,n-1 xn-1 = y1
........
xn-1 = yn-1
which is then solved by back substitution.
Phase 2: Eliminate
Using k-th row, make k-th column of A zero for row# > k
for i = k+1 to n-1
for j = k+1 to n-1 Aij -= Aik*Akj
bi -= aik * yk
Aik = 0
for k = 0 to n-1
k-th phase 1
performed by Pk
sequentially, no communication
k-th phase 2
Pk broadcasts k-th row to Pk+1, ... , Pn-1
performed in parallel by Pk+1, ... , Pn-1
p = n, row-stripes partition
for all Pi (i = 0 .. n-1) do in parallel
for k = 0 to n-1
if (i > k)
receive row k, send it down
perform k-th phase 2
if (i == k)
perform k-th phase 1
send row k down
For k := 1 to n do
For j := k+1 to n do
A[k,j] := A[k,j] / A[k,k]
End for
Y[k] := B[k] / A[k,k]
A[k,k] := 1
For i := k+1 to n do
For j := k+1 to n do
A[i,j] := A[i,j] - A[i,k] * A[k,j]
End for
B[i] := B[i] - A[i,k] * Y[k]
A[i,k] := 0
End for
End for
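A plain Python version of this elimination followed by back substitution (an illustrative sequential sketch without pivoting, not the parallel program) is:

def gaussian_elimination(A, b):
    # Reduce A x = b to upper-triangular form with a unit diagonal,
    # then back substitute. A and b are modified in place.
    n = len(A)
    y = [0.0] * n
    for k in range(n):
        for j in range(k + 1, n):
            A[k][j] /= A[k][k]
        y[k] = b[k] / A[k][k]
        A[k][k] = 1.0
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
            b[i] -= A[i][k] * y[k]
            A[i][k] = 0.0
    # Back substitution on the upper-triangular system U x = y.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
    return x

A = [[2.0, 1.0], [1.0, 3.0]]
b = [5.0, 10.0]
print(gaussian_elimination(A, b))    # [1.0, 3.0]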
Various algorithms for searching using PRAM models have been presented,
and the differences in their execution have been discussed. Matrix multiplication has been
implemented on the CREW, EREW and CRCW models, and Gaussian elimination has been solved
using parallel algorithms.
22.6 References
http://www-unix.mcs.anl.gov/dbpp/text/node45.html#eqlamatmulc
Parallel and Distributed Processing, CSE 8380, SMU school of engineering