Module-1: Metrics and Measures
Computers have gone through two major stages of development: mechanical and electronic. Prior
to 1945, computers were made with mechanical or electromechanical parts. The earliest mechanical
computer can be traced back to 500 BC in the form of the abacus used in China.
Computer Generations:
Over the past five decades, electronic computers have gone through five generations of
development. Each of the first three generations lasted about 10 years. The fourth generation covered
a time span of 15 years. We have just entered the fifth generation with the use of processors and
memory devices with more than 1 million transistors on a single silicon chip.
Elements of a modern computer system (figure):
- Computing problems
- Algorithms and data structures
- High-level languages
- Hardware resources
- System software (operating system, application software)
- Compiler support
- Computing Problems
The use of a computer is driven by real-life problems demanding fast and accurate solutions.
Depending on the nature of the problems, the solutions may require different computing resources.
For numerical problems in science and technology, the solutions demand complex mathematical
formulations and tedious integer or floating-point computations.
For alphanumerical problems in business and government, the solutions demand accurate
transactions, large database management, and information retrieval operations.
For artificial intelligence (AI) problems, the solutions demand logic inferences and symbolic
manipulations.
These computing problems have been labeled numerical computing, transaction processing, and
logical reasoning.
Some complex problems may demand a combination of these processing modes.
- Hardware Resources
A modern computer system demonstrates its power through coordinated efforts by hardware
resources, an operating system, and application software.
- Peripheral devices are connected to mainframe computers directly or through local- or wide-area
networks.
- Operating System
An effective operating system manages the allocation and deallocation of resources during the
execution of user programs.
Beyond the OS, application software must be developed to benefit the users.
Standard benchmark programs are needed for performance evaluation.
Mapping is a bidirectional process matching algorithmic structure with hardware architecture, and
vice versa.
Efficient mapping will benefit the programmer and produce better source code.
The mapping of algorithmic and data structures onto the machine architecture includes processor
scheduling, memory maps, interprocessor communications, etc.
These activities are usually architecture-dependent.
- Compiler Support
There are three compiler upgrade approaches:
Preprocessor: A preprocessor uses a sequential compiler and a low-level library of the target
computer to implement high-level parallel constructs.
Precompiler: The precompiler approach requires some program flow analysis, dependence checking,
and limited optimizations toward parallelism detection.
Parallelizing Compiler: This approach demands a fully developed parallelizing or vectorizing
compiler which can automatically detect parallelism in source code and transform sequential code
into parallel constructs.
Prepared by: C. VALARMATHI, AP/CSE, Sri Sairam College of Engineering, Bengaluru
Regulation: 18CS733 - ADVANCED COMPUTER ARCHITECTURE (2018 Scheme)
The study of computer architecture involves both hardware organization and programming/software
requirements. As seen by an assembly language programmer, computer architecture is abstracted by
its instruction set, which includes opcode (operation codes), addressing modes, registers, virtual
memory, etc. From the hardware implementation point of view, the abstract machine is organized
with CPUs, caches, buses, microcode, pipelines, physical memory, etc. Therefore, the study of
architecture covers both instruction-set architectures and machine implementation organizations.
Lookahead techniques were introduced to prefetch instructions in order to overlap I/E (instruction
fetch/decode and execution) operations and to enable functional parallelism. Functional parallelism was
supported by two approaches:
1. using multiple functional units simultaneously, and
2. practicing pipelining at various processing levels.
Flynn's Classification
Michael Flynn (1972) introduced a classification of various computer architectures based on notions of
instruction and data streams.
1. SISD (Single Instruction stream over a Single Data stream) computers
2. SIMD (Single Instruction stream over Multiple Data streams) machines
3. MIMD (Multiple Instruction streams over Multiple Data streams) machines.
4. MISD (Multiple Instruction streams and a Single Data stream) machines
Thus in these computers the same data stream flows through a linear array of processors executing
different instruction streams.
This architecture is also known as a systolic array, for pipelined execution of specific instructions.
Some conceivable uses might be:
1. multiple frequency filters operating on a single signal stream, or
2. multiple cryptography algorithms attempting to crack a single coded message.
Clock Rate and CPI
• An accurate estimate of the average CPI requires a large amount of program code to be traced over
a long period of time.
• Unless specifically focusing on a single instruction type, we simply use the term CPI to mean the
average value with respect to a given instruction set and a given program mix.
Performance Factors
• Let Ic be the number of instructions in a given program, or the instruction count.
• The CPU time (T in seconds/program) needed to execute the program is estimated by finding the
product of three contributing factors:
T = Ic x CPI x τ (1.1)
• The execution of an instruction requires going through a cycle of events involving the instruction
fetch, decode, operand(s) fetch, execution, and store results.
• In this cycle, only the instruction decode and execution phases are carried out in the CPU.
• The remaining three operations may be required to access the memory. We define a memory cycle
as the time needed to complete one memory reference.
• Usually, a memory cycle is k times the processor cycle τ.
• The value of k depends on the speed of the memory technology and processor-memory
interconnection scheme used.
• The CPI of an instruction type can be divided into two component terms corresponding to the total
processor cycles and memory cycles needed to complete the execution of the instruction.
• Depending on the instruction type, the complete instruction cycle may involve one to four memory
references (one for instruction fetch, two for operand fetch, and one for store results). Therefore we
can rewrite Eq. 1.1 as follows;
T = Ic x (p + m x k) x τ (1.2)
where p is the number of processor cycles needed for the instruction decode and execution,
m is the number of memory references needed,
k is the ratio between memory cycle and processor cycle,
Ic is the instruction count,
τ is the processor cycle time.
Equation 1.2 can be further refined once the CPI components (p, m, k) are weighted over the entire
instruction set.
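As a sketch of how Eq. 1.2 is evaluated (the numeric values below are illustrative assumptions, not figures from the text):

```python
# Sketch of Eq. 1.2, T = Ic * (p + m*k) * tau, with assumed
# (not measured) values for a hypothetical processor.

def cpu_time(Ic, p, m, k, tau):
    """CPU time in seconds: instruction count times cycles per
    instruction (processor plus memory cycles) times cycle time."""
    return Ic * (p + m * k) * tau

# Assumed example: 1e6 instructions, 2 processor cycles each,
# 1 memory reference each, memory cycle 4x the 10 ns processor cycle.
T = cpu_time(Ic=1_000_000, p=2, m=1, k=4, tau=10e-9)
print(T)  # about 0.06 seconds
```

Here each instruction costs p + m·k = 6 processor cycles on average, so halving k (a faster memory) would cut the total time by a third.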
System Attributes
• The above five performance factors (Ic, p, m, k, τ) are influenced by four system attributes:
instruction-set architecture, compiler technology, CPU implementation and control, and cache and
memory hierarchy, as specified in Table 1.2.
• The instruction-set architecture affects the program length (Ic) and processor cycle needed (p). The
compiler technology affects the values of Ic, p and the memory reference count (m).
• The CPU implementation and control determine the total processor time (p × τ) needed.
• Finally, the memory technology and hierarchy design affect the memory access latency (k × τ). The
above CPU time can be used as a basis in estimating the execution rate of a processor.
MIPS Rate
• Let C be the total number of clock cycles needed to execute a given program.
• Then the CPU time in Eq. 1.2 can be estimated as T = C x τ = C/f.
• Furthermore, CPI = C/Ic and T = Ic x CPI x τ = Ic x CPI/f. The processor speed is often measured
in terms of million instructions per second (MIPS). We simply call it the MIPS rate of a given
processor:
MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = (f x Ic) / (C x 10^6) (1.3)
• Based on Eq. 1.3, the CPU time in Eq. 1.2 can also be written as T = Ic x 10^-6 / MIPS.
• Based on the system attributes identified in Table 1.2 and the above derived expressions, we conclude
by indicating the fact that the MIPS rate of a given computer is directly proportional to the clock
rate and inversely proportional to the CPI.
• All four system attributes, instruction set, compiler, processor, and memory technologies, affect the
MIPS rate, which varies also from program to program.
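The relations above can be checked numerically. The clock rate and CPI below are assumed values for illustration only:

```python
# Eq. 1.3 as code: MIPS rate = f / (CPI x 10^6), and the CPU time
# recovered from it as T = Ic x 10^-6 / MIPS.

def mips_rate(f_hz, cpi):
    """MIPS rate from clock frequency (Hz) and average CPI."""
    return f_hz / (cpi * 1e6)

# Assumed: 40 MHz clock, average CPI of 5, 200,000 instructions.
mips = mips_rate(40e6, 5)
print(mips)        # 8.0 MIPS
Ic = 200_000
T = Ic * 1e-6 / mips
print(T)           # about 0.025 seconds
```

Doubling the clock rate doubles the MIPS rate; doubling the CPI halves it, which is the proportionality stated in the text.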
Throughput Rate
• Number of programs a system can execute per unit time, called the system throughput Ws (in
programs/second).
• In a multiprogrammed system, the system throughput is often lower than the CPU throughput Wp
defined by:
• Note that Wp = (MIPS) x 10^6 / Ic from Eq. 1.3. The unit for Wp is programs/second.
• The CPU throughput is a measure of how many programs can be executed per second, based on the
MIPS rate and average program length (Ic).
• The reason why Ws < Wp is due to the additional system overheads caused by the I/O, compiler, and
OS when multiple programs are interleaved for CPU execution by multiprogramming or timesharing
operations.
• If the CPU is kept busy in a perfect program-interleaving fashion, then Ws = Wp. This will probably
never happen, since the system overhead often causes an extra delay and the CPU may be left idle for
some cycles.
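A small sketch of the CPU throughput bound, continuing the assumed numbers used above (8 MIPS, 200,000-instruction programs):

```python
# Wp = (MIPS x 10^6) / Ic: the upper bound on programs/second
# a CPU can deliver; the observed system throughput Ws sits below it.

def cpu_throughput(mips, Ic):
    """CPU throughput Wp in programs/second for average program length Ic."""
    return mips * 1e6 / Ic

Wp = cpu_throughput(mips=8.0, Ic=200_000)
print(Wp)  # 40.0 programs/second
# With I/O, compiler, and OS overhead under multiprogramming, the
# measured Ws would fall below this bound (e.g. a hypothetical Ws = 35).
```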
Programming Environments
Conventional computers are used in a sequential programming environment with tools developed for
a uniprocessor computer.
Parallel computers need parallel tools that allow specification or easy detection of parallelism and
operating systems that can perform parallel scheduling of concurrent events, shared memory
allocation, and shared peripheral and communication links.
Implicit Parallelism
An implicit approach uses a conventional language, such as C, Fortran, Lisp, or Pascal, to write the
source program.
The sequentially coded source program is translated into parallel object code by a parallelizing
compiler.
The compiler must be able to detect parallelism and assign target machine resources. This compiler
approach has been applied in programming shared-memory multiprocessors.
With parallelism being implicit, success relies heavily on the "intelligence" of a parallelizing
compiler.
This approach requires less effort on the part of the programmer.
Explicit Parallelism
The second approach (Fig. 1.5b) requires more effort by the programmer to develop a source
program using parallel dialects of C, Fortran, Lisp, or Pascal.
1. Shared-Memory Multiprocessors
There are 3 shared-memory multiprocessor models:
i. Uniform Memory-access (UMA) model,
ii. Non uniform-Memory-access (NUMA) model
iii. Cache-Only Memory Architecture (COMA) model.
These models differ in how the memory and peripheral resources are shared or distributed.
In a UMA multiprocessor model (Fig. 1.6), the physical memory is uniformly shared by all the
processors.
All processors have equal access time to all memory words, which is why it is called uniform
memory access.
Each processor may use a private cache. Peripherals are also shared in some fashion.
Multiprocessors are called tightly coupled systems due to the high degree of resource sharing. The
system interconnect takes the form of a common bus, a crossbar switch, or a multistage network.
To coordinate parallel events, synchronization and communication among processors are done
through shared variables in the common memory.
When all processors have equal access to all peripheral devices, the system is called a symmetric
multiprocessor. In this case, all the processors are equally capable of running the executive
programs, such as the OS kernel and I/O service routines.
A NUMA multiprocessor is a shared-memory system in which the access time varies with the
location of the memory word.
Two NUMA machine models are depicted in Fig. 1.7.
The shared memory is physically distributed to all processors, called local memories.
The collection of all local memories forms a global address space accessible by all processors.
It is faster to access a local memory with a local processor. The access of remote memory attached
to other processors takes longer due to the added delay through the interconnection network.
The BBN TC-2000 Butterfly multiprocessor assumes the configuration shown in Fig. 1.7a.
iii. Cache-Only Memory Architecture (COMA) model
2. Distributed-Memory Multicomputers
SIMD computers have a single instruction stream over multiple data streams.
An operational model of an SIMD computer is specified by a 5-tuple:
M = (N,C, I,M, R)
where
1. N is the number of processing elements (PEs) in the machine. For example, the Illiac IV had 64
PEs and the Connection Machine CM-2 had 65,536 PEs.
2. C is the set of instructions directly executed by the control unit (CU), including scalar and
program flow control instructions.
3. I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include
arithmetic, logic, data routing, masking, and other local operations executed by each active PE
over data within that PE.
4. M is the set of masking schemes, where each mask partitions the set of PEs into enabled and
disabled subsets.
5. R is the set of data-routing functions, specifying various patterns to be set up in the
interconnection network for inter-PE communications.
PRAM is a theoretical model of parallel computation in which an arbitrary but finite number of
processors can access any value in an arbitrarily large shared memory in a single time step.
The Parallel Random Access Machine (PRAM) is an abstract model for parallel computation which
assumes that all the processors operate synchronously under a single clock and are able to randomly
access a large shared memory.
Processors may execute different instruction streams, but work synchronously. This model assumes
a shared memory, multiprocessor machine as shown:
The machine size n can be arbitrarily large.
The machine is synchronous at the instruction level. That is, each processor is executing its own
series of instructions, and the entire machine operates at a basic time step (cycle). Within each cycle,
each processor executes exactly one operation or does nothing, i.e. it is idle.
An instruction can be any random access machine instruction, such as: fetch some operands from
memory, perform an ALU operation on the data, and store the result back in memory.
All processors implicitly synchronize on each cycle and the synchronization overhead is assumed
to be zero.
Communication is done through reading and writing of shared variables.
Memory access can be specified to be UMA, NUMA, EREW, CREW, or CRCW with a defined
conflict policy.
The PRAM model can apply to SIMD class machines if all processors execute identical instructions
on the same cycle or to MIMD class machines if the processors are executing different instructions.
Load imbalance is the only form of overhead in the PRAM model.
The PRAM model must specify how concurrent read and concurrent write of memory are handled.
Four memory-update options are possible:
Exclusive Read (ER) — This allows at most one processor to read from any memory location in
each cycle, a rather restrictive policy.
Exclusive Write (EW) — This allows at most one processor to write into a memory location at a
time.
Concurrent Read (CR) — This allows multiple processors to read the same information from the
same memory cell in the same cycle.
Concurrent Write (CW) — This allows simultaneous writes to the same memory location.
In order to avoid confusion, some policy must be set up to resolve the write conflicts. Various
combinations of the above options lead to several variants of the PRAM model as specified below.
PRAM Variants
There are 4 variants of the PRAM model, depending on how the memory reads and writes are handled.
EREW – PRAM Model (Exclusive Read, Exclusive Write): This model forbids more than one
processor from reading or writing the same memory cell simultaneously. This is the most restrictive
PRAM model proposed.
CREW – PRAM Model (Concurrent Read, Exclusive Write): The write conflicts are avoided by
mutual exclusion. Concurrent reads to the same memory location are allowed.
ERCW – PRAM Model (Exclusive Read, Concurrent Write): This allows concurrent writes to the
same memory location, while reads remain exclusive.
CRCW – PRAM Model (Concurrent Read, Concurrent Write): This model allows either
concurrent reads or concurrent writes to the same memory location.
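The write-conflict policies for CRCW can be made concrete in a toy resolver. The policy names (common, arbitrary, priority) are the standard ones from the PRAM literature; the code is only an illustration, not part of any real machine:

```python
# Toy resolution of concurrent writes to a single memory cell under
# three common CRCW policies from the PRAM literature.

def resolve_crcw(writes, policy):
    """writes: list of (processor_id, value) pairs aimed at one cell."""
    if policy == "common":
        # All writers must agree on the value being written.
        vals = {v for _, v in writes}
        if len(vals) > 1:
            raise ValueError("common CRCW requires all writers to agree")
        return vals.pop()
    if policy == "arbitrary":
        return writes[0][1]      # any single writer may win
    if policy == "priority":
        return min(writes)[1]    # lowest-numbered processor wins
    raise ValueError(policy)

print(resolve_crcw([(3, 7), (1, 9), (2, 5)], "priority"))  # 9
```

Under the priority policy, processor 1 wins here even though it was listed second, because resolution is by processor number, not arrival order.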
Condition of parallelism
2.2.1 Data and Resource Dependence
The ability to execute several program segments in parallel requires each segment to be
independent of the other segments. We use a dependence graph to describe the relations.
The nodes of a dependence graph correspond to the program statement (instructions), and directed
edges with different labels are used to represent the ordered relations among the statements.
The analysis of dependence graphs shows where opportunity exists for parallelization and
vectorization.
Data dependence:
The ordering relationship between statements is indicated by the data dependence. Five types of data
dependence are defined below:
1. Flow dependence: A statement S2 is flow-dependent on S1 if an execution path exists from S1 to S2
and if at least one output (variable assigned) of S1 feeds in as input (operand to be used) to S2.
Also called a RAW hazard.
2. Antidependence: A statement S2 is antidependent on S1 if S2 follows S1 in program order and if
the output of S2 overlaps the input to S1. Also called a WAR hazard.
3. Output dependence: Two statements are output dependent if they produce (write) the same output
variable. Also called a WAW hazard.
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same
variable is involved but because the same file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation between two statements cannot be determined in
the following situations:
• The subscript of a variable is itself subscribed (indirect addressing)
• The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different coefficients of the loop
variable.
• The subscript is nonlinear in the loop index variable.
Parallel execution of program segments which do not have total data independence can produce non-
deterministic results.
Example: Consider the following fragment of a program:
S1: Load R1, A    /R1 ← Memory(A)/
S2: Add R2, R1    /R2 ← (R1) + (R2)/
S3: Move R1, R3   /R1 ← (R3)/
S4: Store B, R1   /Memory(B) ← (R1)/
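The dependences in this fragment can be derived mechanically from each statement's read and write sets. This is a simplified sketch: it compares every pair of statements and ignores intervening redefinitions, so it is coarser than a real compiler analysis:

```python
# Derive data dependences in the four-instruction fragment above
# from each statement's read set and write set.

stmts = {
    "S1": {"reads": {"A"},        "writes": {"R1"}},  # Load R1, A
    "S2": {"reads": {"R1", "R2"}, "writes": {"R2"}},  # Add R2, R1
    "S3": {"reads": {"R3"},       "writes": {"R1"}},  # Move R1, R3
    "S4": {"reads": {"R1"},       "writes": {"B"}},   # Store B, R1
}

order = list(stmts)
deps = []
for i, a in enumerate(order):
    for b in order[i + 1:]:
        if stmts[a]["writes"] & stmts[b]["reads"]:
            deps.append((a, b, "flow (RAW)"))
        if stmts[a]["reads"] & stmts[b]["writes"]:
            deps.append((a, b, "anti (WAR)"))
        if stmts[a]["writes"] & stmts[b]["writes"]:
            deps.append((a, b, "output (WAW)"))

# Note: because intervening redefinitions are ignored, the pairwise
# check also reports S1->S4 as flow-dependent even though S3
# rewrites R1 in between.
for d in deps:
    print(d)
```

The output includes S1→S2 (flow on R1), S2→S3 (anti on R1), S1→S3 (output on R1), and S3→S4 (flow on R1), matching the hazards one would read off the fragment by hand.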
Control Dependence:
This refers to the situation where the order of the execution of statements cannot be determined
before run time.
For example, all conditional statements, where the flow of control depends on the outcome of the
condition.
Different paths taken after a conditional branch may depend on the data; hence this data dependence
among the instructions needs to be eliminated.
This dependence also exists between operations performed in successive iterations of looping
procedure. Control dependence often prohibits parallelism from being exploited.
Control-independent example:
for (i=0; i<n; i++)
{
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
Control-dependent example:
for (i=1; i<n; i++)
{
if (a[i-1] < 0) a[i] = 1;
}
Control dependence also prevents parallelism from being exploited. Compilers are used to eliminate
this control dependence and exploit the parallelism.
Resource dependence:
Data and control dependencies are based on the independence of the work to be done.
Resource dependence is concerned with conflicts in using shared resources, such as registers,
integer and floating-point ALUs, etc. ALU conflicts are called ALU dependence.
Memory (storage) conflicts are called storage dependence.
Bernstein’s Conditions
Bernstein’s conditions are a set of conditions which must exist if two processes can execute in parallel.
Notation
Ii is the set of all input variables for a process Pi. Ii is also called the read set or domain of Pi. Oi is
the set of all output variables for a process Pi . Oi is also called write set.
If P1 and P2 can execute in parallel (which is written as P1 || P2), then: I1 ∩ O2 = ∅, I2 ∩ O1 = ∅,
and O1 ∩ O2 = ∅.
In terms of data dependencies, Bernstein’s conditions imply that two processes can execute in
parallel if they are flow-independent, anti-independent, and output-independent.
The parallelism relation || is commutative (Pi || Pj implies Pj || Pi ), but not transitive (Pi || Pj and Pj
|| Pk does not imply Pi || Pk ) .
Therefore, || is not an equivalence relation. Intersection of the input sets is allowed.
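The three conditions translate directly into set operations. The process read/write sets below are hypothetical, chosen only to show one parallelizable pair and one flow-dependent pair:

```python
# Bernstein's conditions: P1 || P2 iff I1 and O2, I2 and O1, and
# O1 and O2 are pairwise disjoint. Shared inputs are allowed.

def parallel(I1, O1, I2, O2):
    """True iff two processes satisfy Bernstein's three conditions."""
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

# Hypothetical: P1 reads {A, B}, writes {C}; P2 reads {A, D}, writes {E}.
print(parallel({"A", "B"}, {"C"}, {"A", "D"}, {"E"}))  # True: shared input A is fine
# P3 reads {C}, so it is flow-dependent on P1.
print(parallel({"A", "B"}, {"C"}, {"C"}, {"F"}))       # False
```

This also shows why || is not transitive: disjointness of P1 with P2 and of P2 with P3 says nothing about the sets of P1 versus P3.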
Assume that each statement requires one step to execute. No pipelining is considered here. The
dependence graph shown in 2.2a demonstrates flow dependence as well as resource dependence. In
sequential execution, five steps are needed (Fig. 2.2b).
If two adders are available simultaneously, the parallel execution requires only 3 steps as shown in
Fig 2.2c.
Pairwise, there are 10 pairs of statements to check against Bernstein’s conditions. Only 5 pairs,
P1||P5, P2||P3, P2||P5, P5||P3 and P4||P5 can execute in parallel as revealed in Fig 2.2a if there are no
resource conflicts.
Collectively, only P2||P3||P5 is possible (Fig. 2.2c) because P2||P3, P3||P5 and P5||P2 are all possible.
Software Parallelism
Software parallelism is defined by the control and data dependence of programs, and is revealed in the
program’s flow graph i.e., it is defined by dependencies with in the code and is a function of algorithm,
programming style, and compiler optimization.
Consider the example program graph in Fig. 2.3a. There are eight instructions (four loads and four
arithmetic operations) to be executed in three consecutive machine cycles.
Four load operations are performed in the first cycle, followed by two multiply operations in the
second cycle and two add/subtract operations in the third cycle.
Therefore, the parallelism varies from 4 to 2 in three cycles. The average software parallelism is
equal to 8/3 = 2.67 instructions per cycle in this example program.
Now consider execution of the same program by a two-issue processor which can execute one
memory access (load or write) and one arithmetic (add, subtract, multiply, etc.) operation
simultaneously.
With this hardware restriction, the program must execute in seven machine cycles as shown in Fig.
2.3b. Therefore, the hardware parallelism displays an average value of 8/7 = 1.14 instructions
executed per cycle.
This demonstrates a mismatch between the software parallelism and the hardware parallelism.
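The mismatch is just a ratio of instructions to cycles, which can be checked directly from the numbers in the text:

```python
# Average parallelism = instructions executed / machine cycles needed,
# for the 8-instruction example program in Fig. 2.3.

instructions = 8
sw_cycles = 3   # dataflow-limited: 4 loads, then 2 multiplies, then 2 add/sub
hw_cycles = 7   # two-issue processor: one memory op + one ALU op per cycle

print(round(instructions / sw_cycles, 2))  # 2.67 software parallelism
print(round(instructions / hw_cycles, 2))  # 1.14 hardware parallelism
```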
Let us try to match the software parallelism shown in Fig. 2.3a in a hardware platform of a dual-
processor system, where single-issue processors are used.
The achievable hardware parallelism is shown in Fig 2.4. Six processor cycles are needed to
execute 12 instructions by two processors.
S1 and S2 are two inserted store operations, l5 and l6 are two inserted load operations for
interprocessor communication through the shared memory.
Loop-level Parallelism
A typical loop has fewer than 500 instructions. If a loop operation is independent between iterations,
it can be handled by a pipeline or by a SIMD machine.
Loops are the most optimized program construct to execute on a parallel or vector machine.
Some loops (e.g. recursive) are difficult to handle. Loop-level parallelism is still considered fine-grain computation.
Procedure-level Parallelism
Medium-sized grain; usually less than 2000 instructions.
Detection of parallelism is more difficult than with smaller grains; interprocedural dependence
analysis is difficult and history-sensitive.
Communication requirement is less than at the instruction level. SPMD (single procedure multiple
data) is a special case. Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; grain typically has thousands of instructions; medium- or coarse-grain level.
Job steps can overlap across different jobs. Multiprogramming is conducted at this level. No
compilers are available to exploit medium- or coarse-grain parallelism at present.
Communication Latency
Balancing granularity and latency can yield better performance. Various latencies attributed to machine
architecture, technology, and communication patterns used.
Latency imposes a limiting factor on machine scalability.
Ex: Memory latency increases as memory capacity increases, limiting the amount of memory that can
be used with a given tolerance for communication latency.
Interprocessor Communication Latency
• Needs to be minimized by system designer
• Affected by signal delays and communication patterns Ex: n communicating tasks may require n
(n - 1)/2 communication links, and the complexity grows quadratically, effectively limiting the
number of processors in the system.
Communication Patterns
• Determined by algorithms used and architectural support provided
• Patterns include permutation, broadcast, multicast, and conference.
• Tradeoffs often exist between granularity of parallelism and communication demand.
There is an obvious tradeoff between the time spent scheduling and synchronizing parallel grains and
the speedup obtained by parallel execution.
One approach to the problem is called "grain packing."
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay
requirements are observed, and no two nodes are executing on the same processor at the same time.
Some general scheduling goals are:
• Schedule all fine-grain activities in a node to the same processor to minimize communication
delays.
• Select grain sizes for packing to achieve better schedules for a particular parallel machine.
With respect to the fine-grain versus coarse-grain program graphs in Fig. 2.6, two multiprocessor
schedules are shown in Fig. 2.7. The fine-grain schedule is longer (42 time units) because more
communication delays were included as shown by the shaded area.
The coarse-grain schedule is shorter (38 time units) because communication delays among nodes 12,
13 and 14 within the same node D (and also the delays among 15, 16 and 17 within the node E)
are eliminated after grain packing.
Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not always
produce a shorter schedule.
By duplicating nodes (that is, executing some instructions on multiple processors), we may
eliminate some interprocessor communication, and thus produce a shorter schedule.
Figure 2.8a shows a schedule without duplicating any of the 5 nodes. This schedule contains idle
time as well as long interprocessor delays (8 units) between P1 and P2.
In Fig 2.8b, node A is duplicated into A’ and assigned to P2 besides retaining the original copy A
in P1.
Similarly, a duplicated node C' is copied into P1 besides the original node C in P2.
The new schedule shown in Fig. 2.8b is almost 50% shorter than that in Fig. 2.8a. The reduction
in schedule time is caused by elimination of the (a, 8) and (c, 8) delays between the two processors.
Grain packing and node duplication are often used jointly to determine the best grain size and
corresponding schedule.
Four major steps are involved in the grain determination and the process of scheduling optimization:
Step 1: Construct a fine-grain program graph
Step 2: Schedule the fine-grain computation
Step 3: Perform grain packing to produce the coarse grains.
Step 4: Generate a parallel schedule based on the packed graph.
Control flow machines used shared memory for instructions and data.
Since variables are updated by many instructions, there may be side effects on other instructions.
These side effects frequently prevent parallel processing.
Single processor systems are inherently sequential.
Instructions in dataflow machines are unordered and can be executed as soon as their operands are
available; data is held in the instructions themselves. Data tokens are passed from an instruction to
its dependents to trigger execution.
• Control flow mechanism: Conventional machines used control flow mechanism in which order of
program execution explicitly stated in user programs.
• Reduction machines trigger an instruction’s execution based on the demand for its results.
No need for
• shared memory
• program counter
• control sequencer
A Dataflow Architecture
• The Arvind machine (MIT) has N PEs and an N-by-N interconnection network.
• Each PE has a token-matching mechanism that dispatches only instructions with data tokens
available.
• Each datum is tagged with
o address of instruction to which it belongs
o context in which the instruction is being executed
• Tagged tokens enter PE through local path (pipelined), and can also be communicated to other PEs
through the routing network.
• Instruction address(es) effectively replace the program counter in a control flow machine.
• Context identifier effectively replaces the frame base register in a control flow machine.
• Since the dataflow machine matches the data tags from one instruction with successors,
synchronized instruction execution is implicit.
• An I-structure in each PE is provided to eliminate excessive copying of data structures.
• Each word of the I-structure has a two-bit tag indicating whether the value is empty, full or has
pending read requests.
• This is a retreat from the pure dataflow approach.
• Special compiler technology needed for dataflow machines.
Demand-Driven Mechanisms
• Graph-reduction model:
o expression graph reduced by evaluation of branches or subgraphs, possibly in parallel,
with demanders given pointers to results of reductions.
o based on sharing of pointers to arguments; traversal and reversal of pointers continues
until constant arguments are encountered.
Various types of interconnection networks have been suggested for SIMD computers. Based on their
network topologies, these are classified into two categories:
1. Static Networks
2. Dynamic Networks
The topology of an interconnection network can be either static or dynamic. Static networks are
formed of point-to-point direct connections which will not change during program execution.
Dynamic networks are implemented with switched channels, which are dynamically configured to
match the communication demand in user programs.
Packet switching and routing play an important role in modern multiprocessor architecture.
Bisection Width:
When a given network is cut into two equal halves, the minimum number of edges (channels) along the
cut is called the bisection width b. In the case of a communication network, each edge may correspond
to a channel with w bit wires.
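The bisection width can be computed for common static topologies using the standard textbook formulas; the function below is a sketch under that assumption (n nodes, and a square r x r mesh with n = r·r):

```python
# Bisection widths of some common static network topologies,
# using the standard formulas: hypercube n/2, r x r mesh r, ring 2.

import math

def bisection_width(topology, n):
    if topology == "hypercube":
        return n // 2              # cut between the two (n/2)-node subcubes
    if topology == "mesh":
        return int(math.isqrt(n))  # r links cross the middle of an r x r mesh
    if topology == "ring":
        return 2                   # any halving of a ring cuts exactly two links
    raise ValueError(topology)

print(bisection_width("hypercube", 64))  # 32
print(bisection_width("mesh", 64))       # 8
```

With w-bit channels, the wire bisection width is simply b times w.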
To summarize the above discussions, the performance of an interconnection network is affected by the
following factors:
Functionality: This refers to how the network supports data routing, interrupt handling,
synchronization, request/message combining, and coherence.
Network Latency: This refers to the worst-case time delay for a unit message to be transferred
through the network.
Bandwidth: This refers to the maximum data transfer rate, in terms of Mbps or Gbps, transmitted
through the network.
Hardware Complexity: This refers to implementation costs such as those for wires, switches,
connectors, arbitration, and interface logic.
Scalability: This refers to the ability of a network to be modularly expandable with scalable
performance as machine resources increase.
Digital Buses
• A bus system (contention bus, time-sharing bus) has
o a collection of wires and connectors
o multiple modules (processors, memories, peripherals, etc.) which connect to the wires
o data transactions between pairs of modules
The degree of parallelism reflects the extent to which software parallelism matches hardware parallelism.
Degree of Parallelism: The execution of a program on a parallel computer may use different numbers
of processors at different time periods during the execution cycle. For each time period, the number of
processors used to execute a program is defined as the degree of parallelism (DOP).
The plot of the DOP as a function of time is called the parallelism profile of a given program.
Fluctuation of the profile during an observation period depends on the algorithmic structure, program
optimization, resource utilization, and run-time conditions of a computer system. The DOP was defined
under the assumption of having an unbounded number of available processors and other necessary resources.
The DOP may not always be achievable on a real computer with limited resources.
Average Parallelism
Parallelism profiles
Asymptotic speedup factor
System efficiency, utilization and quality
Standard performance measures
Degree of parallelism
• Reflects the matching of software and hardware parallelism.
• Discrete time function: measures, for each time period, the number of processors used.
• The parallelism profile is a plot of the DOP as a function of time.
• Ideally assumes unlimited resources.
Factors affecting parallelism profiles
• Algorithm structure
• Program optimization
• Resource utilization
• Run-time conditions
• Realistically limited by the number of available processors, memory, and other non-processor
resources.
Average parallelism variables:
• n – number of homogeneous processors
• m – maximum parallelism in a profile
• Δ – computing capacity of a single processor (execution rate only, no overhead)
• DOP = i – number of processors busy during an observation period
Scalability Metrics
Machine size (n) : # of processors
• Amdahl’s Law [based on fixed problem size or fixed workload]
• Gustafson’s Law [for scaled problems, where problem size increases with machine size, i.e. the
number of processors]
Amdahl’s Law (1967)
• For a given problem size, the speedup does not increase linearly as the number of processors increases.
In fact, the speedup tends to become saturated.
• This is a consequence of Amdahl’s Law.
• According to Amdahl’s Law, a program contains two types of operations:
• completely sequential operations
• completely parallel operations
• Let, the time Ts taken to perform sequential operations be a fraction α (0<α≤1) of the total execution
time T(1) of the program, then the time Tp to perform parallel operations shall be (1-α) of T(1)
Amdahl’s Law:
Thus, Ts = α.T(1) and Tp = (1-α).T(1)
Assuming that the parallel operations achieve linear speedup
(i.e. these operations use 1/n of the time taken to perform on each processor), then
T(n) = Ts + Tp/n = α.T(1) + (1-α).T(1)/n
Thus, the speedup with n processors will be:
S(n) = T(1)/T(n) = n / (1 + (n-1)α)
Sequential operations will tend to dominate the speedup as n becomes very large.
As n → ∞, S(n) → 1/α
This means, no matter how many processors are employed, the speedup in this problem is limited to 1/α.
This is known as sequential bottleneck of the problem.
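The saturation is easy to see numerically. The serial fraction α = 0.1 below is an assumed example value:

```python
# Amdahl's Law: S(n) = n / (1 + (n - 1) * alpha) for serial
# fraction alpha; the speedup saturates at 1/alpha.

def amdahl_speedup(n, alpha):
    return n / (1 + (n - 1) * alpha)

for n in (10, 100, 10**6):
    print(n, round(amdahl_speedup(n, alpha=0.1), 2))
# With alpha = 0.1 the speedup saturates near 1/alpha = 10,
# no matter how many processors are added.
```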
Gustafson’s Law relaxed the restriction of fixed problem size and used the notion of fixed execution
time for getting over the sequential bottleneck.
According to Gustafson’s Law,
If the number of parallel operations in the problem is increased (or scaled up) sufficiently,
Then sequential operations will no longer be a bottleneck.
In accuracy-critical applications, it is desirable to solve the largest problem size on a larger machine rather
than solving a smaller problem on a smaller machine, with almost the same execution time.
As the machine size increases, the work load (or problem size) is also increased so as to keep the fixed
execution time for the problem.
Let, Ts be the constant time taken to perform sequential operations; and
Tp(n,W) be the time taken to perform parallel operation of problem size or workload W using n processors;
Then the speedup with n processors is:
S'(n) = (Ts + n.Tp(n,W)) / (Ts + Tp(n,W))
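Written in terms of the sequential fraction α of the single-processor run, the scaled speedup is S'(n) = α + n(1-α). A sketch, with α = 0.1 again an assumed example value:

```python
# Gustafson's scaled speedup: the workload grows with n so the
# parallel time per processor stays fixed, giving S'(n) = alpha + n*(1 - alpha).

def gustafson_speedup(n, alpha):
    """alpha is the sequential fraction of the single-processor run."""
    return alpha + n * (1 - alpha)

print(round(gustafson_speedup(100, 0.1), 2))  # 90.1
```

Contrast this with Amdahl's fixed-size bound of 1/α = 10 for the same α: scaling the problem with the machine keeps the speedup growing nearly linearly.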