IV. Physical Organization and Models: March 9, 2009
Disadvantages:
the interconnection network is often far slower than the fully integrated ones found in parallel machines
the connection of each node to the network is done through that node's bus, which often has restricted bandwidth and/or latency
Distributed Clusters - Grids
the cluster idea has been expanded to connecting computers all over the world via the Internet
a logical result of:
the great improvements in wide-area networks during the last few years
the ever-growing demand for more and more powerful computational systems
Advantages:
gathering a very large number of machines => larger computational power and larger memory capacity.
Drawbacks:
communications between clusters
are generally much slower than
those inside local clusters
security issues
Trends in currently used configurations
Hierarchical parallel systems, mixing shared and distributed memory
E.g. IBM BlueGene
The PRAM (Parallel Random-Access Machine) model
Each of the theoretical processors can access the global shared memory in one uninterruptible unit of time.
Each processor can perform various arithmetic and logical operations in parallel.
The PRAM model has both concurrent and exclusive read algorithms.
Concurrent read algorithms allow multiple processors to read the same memory location simultaneously, with no data corruption.
Exclusive read algorithms ensure that no two processors ever read the same memory location at the same time.
The PRAM model also has both concurrent and exclusive write algorithms.
Concurrent write algorithms allow multiple processors to write to the same memory location; simultaneous writes are resolved by one of the following policies (a small sketch follows the list):
Detecting (CRCW-D):
A special code representing “detected collision” is written.
Common (CRCW-C):
Multiple writes allowed only if all store the same value.
Random (CRCW-R):
The value written is randomly chosen from among those offered.
Priority (CRCW-P):
The processor with the lowest index succeeds in writing its value.
Max/Min (CRCW-M):
The largest/smallest of the multiple values is written.
Reduction:
The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or some other combination of the multiple values is written.
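To make these policies concrete, here is a small Python sketch (an illustration, not part of the slides) that resolves one conflicting write cycle to a single memory cell; the function name resolve_write and the request format are assumptions of the sketch.

import random

def resolve_write(requests, policy):
    # requests: list of (processor_index, value), all aimed at one cell
    values = [v for _, v in requests]
    if policy == "detecting":    # CRCW-D: record a collision marker
        return values[0] if len(values) == 1 else "COLLISION"
    if policy == "common":       # CRCW-C: all written values must agree
        assert len(set(values)) == 1
        return values[0]
    if policy == "random":       # CRCW-R: an arbitrary value is chosen
        return random.choice(values)
    if policy == "priority":     # CRCW-P: lowest-index processor wins
        return min(requests)[1]
    if policy == "max":          # CRCW-M: the largest value is written
        return max(values)
    if policy == "sum":          # CRCW-S: reduction by arithmetic sum
        return sum(values)

print(resolve_write([(2, 5), (0, 9), (1, 5)], "priority"))  # 9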
Order the submodels
One way to order these submodels is by their computational power.
Two PRAM submodels are equally powerful if each can emulate the other
with a constant-factor slowdown.
A PRAM submodel is (strictly) less powerful than another submodel
(denoted by the “<” symbol) if there exist problems for which the former
requires significantly more computational steps than the latter.
Example:
CRCW-D PRAM submodel < CRCW-M (max/min):
CRCW-M can find the largest number in a vector A of size p in a single step:
Processor i reads A[i] and writes it to an agreed-upon location x, which will then hold the maximum value for all processors to see.
CRCW-D requires Ω(log p) steps for the same problem.
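Reusing resolve_write from the earlier sketch, CRCW-M's one-step maximum amounts to a single conflicting write resolved by the Max policy:

A = [7, 3, 9, 1]
# Every "processor" i reads A[i] and writes it to the common cell x in
# the same cycle; the Max policy keeps the largest of the written values.
x = resolve_write([(i, A[i]) for i in range(len(A))], "max")
print(x)  # 9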
The “less powerful or equal” relationship “≤” between submodels can be
similarly defined.
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
EREW can simulate any CRCW submodel with at most logarithmic slowdown.
A p-processor CRCW-P (priority) PRAM can be simulated by a p-processor EREW PRAM with a slowdown factor of O(log p).
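A sketch of the standard simulation idea (sequential Python standing in for the parallel steps; on a real EREW PRAM the sort below is done in O(log p) steps with a parallel sorting algorithm):

def simulate_crcw_p_write(write_requests, memory):
    # write_requests: list of (address, processor_index, value)
    # Step 1: sort requests by (address, processor index); this sort is
    # the O(log p) part on an EREW PRAM.
    reqs = sorted(write_requests)
    # Step 2: each processor inspects its left neighbor in sorted order;
    # only the first request per address wins -- exactly the priority
    # rule (lowest processor index succeeds).
    for k, (addr, proc, val) in enumerate(reqs):
        if k == 0 or reqs[k - 1][0] != addr:
            memory[addr] = val   # exclusive write: one writer per cell

memory = {}
simulate_crcw_p_write([(5, 2, 'b'), (5, 0, 'a'), (7, 1, 'c')], memory)
print(memory)  # {5: 'a', 7: 'c'} -- processor 0 wins at address 5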
Alg.1: Data broadcasting – how?
One processor needs to send a data value to all other processors.
In the CREW/CRCW models, broadcasting is trivial: the sending processor writes the data value into a memory location, and all processors read that data value in the following cycle
=> simple broadcasting is done in O(1) steps.
Multicasting within groups is equally simple if each processor knows its group membership(s) and only members of each group read the multicast data.
All-to-all broadcasting: each of the p processors needs to send a data value to all other processors; it can be done through p separate broadcast operations in O(p) steps.
The above scheme is clearly inapplicable to broadcasting in the EREW model.
The simplest scheme for EREW broadcasting is to
make p copies of the data value, say in a broadcast vector B of length p,
and then let each processor read its own copy by accessing B[j].
Initially, Processor i writes its data value into B[0].
Recursive doubling is used to copy B[0] into all elements of B in O(log2 p) steps.
Finally, Processor j, 0 ≤ j < p, reads B[j] to get the data value broadcast by Processor i.
The O(p)-step all-to-all scheme is not optimal in that the O(p²) computational work involved in it is significantly greater than the O(p log p) work required, for example, for sorting p elements on a single processor.
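A sequential Python sketch of the recursive doubling broadcast (an illustration under stated assumptions; each iteration of the outer loop models one EREW PRAM step):

import math

def erew_broadcast(value, p):
    B = [None] * p
    B[0] = value                  # the source processor writes B[0]
    filled = 1
    # In each step, every processor that holds a copy reads its own B[j]
    # (exclusive read) and writes it to a distinct empty slot (exclusive
    # write), doubling the number of copies.
    for _ in range(math.ceil(math.log2(p))):
        for j in range(min(filled, p - filled)):
            B[filled + j] = B[j]
        filled = min(2 * filled, p)
    return B

print(erew_broadcast(42, 8))  # [42, 42, 42, 42, 42, 42, 42, 42] in 3 steps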
Semigroup and prefix comp.
Semigroup computation (fan-in computation):
defined based on an associative binary operator ◦
trivial for a CRCW PRAM of the "reduction" variety
Examples: computing the arithmetic sum (logical AND, logical XOR) of p values, one per processor:
trivial for the CRCW-S (CRCW-A, CRCW-X) PRAM;
it can be done in a single cycle by each processor writing its corresponding value into a common location that will then hold the arithmetic sum of all of the values.
The recursive doubling scheme can be used on an EREW PRAM, with the only difference appearing in the final broadcasting step.
Parallel prefix computation:
consists of the first phase of the semigroup computation.
The divide-and-conquer paradigm views the problem as composed of two subproblems:
1. computing the odd-indexed results s1, s3, s5, ...
2. computing the even-indexed results s0, s2, s4, ...
The first subproblem is solved as follows: pairs of consecutive elements in the input list (x0 and x1, x2 and x3, x4 and x5, and so on) are combined to obtain a list of half the size. Performing parallel prefix computation on this list yields values for all odd-indexed results.
The even-indexed results are then found in a single PRAM step by combining each even-indexed input with the immediately preceding odd-indexed result.
The total computation time is given by the recurrence T(p) = T(p/2) + 2, whose solution is T(p) = 2 log2 p.
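The following Python sketch (an illustration, not from the slides) models the divide-and-conquer prefix scheme sequentially; the function name parallel_prefix and the assumption that the input length is a power of two are mine.

import operator

def parallel_prefix(x, op):
    # Returns all prefixes s0, s1, ..., s(n-1) under the associative
    # operator op; assumes len(x) is a power of two, as in the analysis.
    n = len(x)
    if n == 1:
        return x[:]
    # Combine consecutive pairs: one PRAM step with n/2 processors.
    pairs = [op(x[2 * i], x[2 * i + 1]) for i in range(n // 2)]
    odd = parallel_prefix(pairs, op)   # yields s1, s3, s5, ...
    s = [None] * n
    s[0] = x[0]
    for i in range(n // 2):
        s[2 * i + 1] = odd[i]
        if 2 * i + 2 < n:
            # Even-indexed results in a single additional PRAM step.
            s[2 * i + 2] = op(odd[i], x[2 * i + 2])
    return s

print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8], operator.add))
# [1, 3, 6, 10, 15, 21, 28, 36] -- two steps per level, T(p) = 2 log2 p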
Matrix multiplication
Given m x m matrices A and B, with elements a[i,j] and b[i,j], their product C can be obtained with an O(m³)-step sequential algorithm.
If the PRAM has p = m³ processors, then matrix multiplication can be done in O(log m) time:
let one processor compute each product a[i,k] x b[k,j], then let groups of m processors perform m-input additions (semigroup computation) in O(log m) time.
Because we are usually not interested in parallel processing for matrix
multiplication unless m is fairly large, this is not a practical solution!
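A sequential Python sketch (an illustrative assumption) of the p = m³ approach: all m³ products are formed in one modeled step, then each group of m products is summed by a log2(m)-depth reduction tree.

def matmul_m3(A, B):
    m = len(A)
    # One "processor" per (i, j, k) computes a single product a[i,k]*b[k,j].
    prod = [[[A[i][k] * B[k][j] for k in range(m)] for j in range(m)]
            for i in range(m)]
    # Each (i, j) group of m processors sums its products in a tree of
    # depth log2(m) -- the semigroup (reduction) computation.
    C = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            vals = prod[i][j]
            while len(vals) > 1:   # one modeled PRAM step per halving
                vals = [vals[t] + vals[t + 1] if t + 1 < len(vals) else vals[t]
                        for t in range(0, len(vals), 2)]
            C[i][j] = vals[0]
    return C

print(matmul_m3([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]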
Assume that the PRAM has p = m² processors.
In this case, matrix multiplication can be done in O(m) time by using
one processor to compute each element c[i,j] of the product matrix C.
Processor responsible for computing c[i,j]
reads the elements of Row i in A and the elements of Column j in B,
multiplies their corresponding kth elements, and
adds each of the products thus obtained to a running total t.
Parallelize the i and j loops in the sequential algorithm.
Label the m² processors with two indices (i, j), each ranging from 0 to
m – 1, rather than with a single index ranging from 0 to m² – 1.
CREW implementation of A x B
PRAM matrix multiplication algorithm using m² processors
Processor (i, j), 0 ≤ i, j < m, do
begin
  t := 0
  for k = 0 to m – 1 do
    t := t + a[i,k] × b[k,j]
  endfor
  c[i,j] := t
end
In a given iteration of the k loop, all processors (i, y), 0 ≤ y < m, access the same element a[i,k] of A, hence the need for concurrent reads (the "CR" in CREW); each processor writes only its own c[i,j], so writes remain exclusive.
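A minimal runnable Python version of the algorithm above (sequential loops standing in for the m² parallel processors; an illustration, not the slides' own code):

def matmul_m2(a, b):
    m = len(a)
    c = [[0] * m for _ in range(m)]
    # The i and j loops correspond to the m^2 processors and would run
    # fully in parallel on the PRAM; the parallel time is the O(m)
    # inner k loop.
    for i in range(m):
        for j in range(m):
            t = 0
            for k in range(m):
                # On a CREW PRAM, all processors (i, *) read a[i][k]
                # concurrently in the same k iteration.
                t += a[i][k] * b[k][j]
            c[i][j] = t
    return c

print(matmul_m2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]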