Emulations, Scheduling and Patterns
Content
Emulations among architectures
Task scheduling problem
Scheduling algorithms
Load balancing
Patterns: task decomposition, data decomposition, group tasks, order tasks, data sharing, design evaluation
Reasons:
the desire to quickly develop algorithms for a new architecture without expending the significant resources that would be required for native algorithm development.
the desire to develop algorithms for architectures that are easier to program (e.g., shared memory or PRAM) and then have them run on machines that are realizable, easily accessible, or affordable.
The developed algorithm then serves as source code that can be quickly "compiled" to run on any given architecture via emulation.
Emulation results are sometimes used for purposes other than practical
porting of software.
For example, we know that the hypercube is a powerful architecture and can
execute many algorithms efficiently. Thus, one way to show that a new
architecture is useful or efficient, without a need for developing a large set of
algorithms for it, is to show that it can emulate the hypercube efficiently.
If architecture A emulates architecture B with O(f(p)) slowdown and B in turn emulates C with O(g(p)) slowdown (assuming, for simplicity, that they all contain p processors), then A can emulate C with O(f(p) g(p)) slowdown.
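For example (an illustrative composition, not a claim about particular machines): if f(p) = log p and g(p) = log p, then A can emulate C with O(log p) x O(log p) = O((log p)^2) slowdown.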
This emulation problem can be solved with graph theory: embedding one graph (the guest architecture) into another (the host).
Task scheduling problem
Given a task system characterizing a parallel computation, determine how the tasks can be assigned to processing resources (scheduled on them) so as to satisfy certain optimality criteria.
The task system is usually defined in the form of a directed graph, with nodes specifying
computational tasks and links corresponding to data dependencies or communications
Optimality criteria may include minimizing the total running time, meeting task deadlines, or maximizing resource utilization.
Task parameters:
Execution or running time: the worst case, average case, or the probability distribution.
Creation: a fixed set of tasks known at compile time, or a probability distribution for task creation times.
Relationship with other tasks: criticality, priority order, and/or data dependencies.
Start or end time: a task's release time is the time before which the task should not be executed. A hard or soft deadline may be associated with each task: a hard deadline is specified when the results of a task become worthless if not obtained by a certain time; a soft deadline may penalize late results but does not render them totally worthless.
Preemption: whether the execution of a task may be interrupted and later resumed.
Granularity.
Complexity.
List scheduling
Consider the simple case of unit-time tasks, and ignore scheduling and communication overheads.
First, the depth T of the task graph, which is an indicator of its minimum possible execution time, is determined.
Then T is taken as the goal for the total running time Tp with p processors, and the latest possible time step in which each task can be scheduled, if this goal is to be met, is determined.
This is done by layering the nodes beginning with the output node.
Priorities are then assigned to tasks in the order of these latest possible times.
Ties can be broken in various ways, e.g., by giving priority to a task with a larger number of descendants.
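A minimal sketch of this scheme in Python follows; the graph representation, helper names, and the diamond-shaped example graph are assumptions for illustration, not part of the original algorithm statement.

# Minimal sketch of list scheduling for unit-time tasks (illustrative only).
from collections import defaultdict

def list_schedule(succ, p):
    # succ maps each task to the tasks that depend on it (its successors).
    tasks = set(succ) | {t for vs in succ.values() for t in vs}
    pred = defaultdict(set)
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)

    # Layer the nodes from the output: the graph depth T bounds the
    # minimum possible execution time.
    depth = {}
    def chain(t):  # longest chain from t down to an output node
        if t not in depth:
            depth[t] = 1 + max((chain(v) for v in succ.get(t, ())), default=0)
        return depth[t]
    T = max(chain(t) for t in tasks)
    latest = {t: T - depth[t] + 1 for t in tasks}  # latest feasible time step

    # Priority: smaller "latest time" first; break ties in favor of
    # tasks with more descendants.
    def descendants(t):
        seen, stack = set(), [t]
        while stack:
            for v in succ.get(stack.pop(), ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen)
    prio = sorted(tasks, key=lambda t: (latest[t], -descendants(t)))

    done, schedule, step = set(), [], 0
    while len(done) < len(tasks):
        step += 1
        ready = [t for t in prio if t not in done and pred[t] <= done]
        schedule.append((step, ready[:p]))  # at most p unit-time tasks per step
        done.update(ready[:p])
    return schedule

# Hypothetical diamond-shaped task graph: a feeds b and c, which feed d.
print(list_schedule({'a': ['b', 'c'], 'b': ['d'], 'c': ['d']}, p=2))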
When the possibility of failure for processors and other resources is considered
in task assignment or reassignment, we have fault-tolerant scheduling.
Distributed scheduling:
initially distribute the tasks among the available processors, based on some criteria;
then let each processor do its own internal scheduling (ordering the execution of its set of tasks) according to the interdependencies of tasks and the results received from other processors.
The advantage of this approach is that most scheduling decisions are performed in a
distributed manner.
A possible drawback is that the results may be far from optimal.
If such a scheme is combined with a method for redistributing the tasks when the initial
distribution proves to be inappropriate, good results may be achieved.
Suppose the tasks are distributed to processors in such a way that the total expected
running times of the tasks assigned to each processor are roughly the same.
Because task running times are not constants, a processor may run out of things to do
before other processors complete their assigned tasks.
Also, some processors may remain idle for long periods of time as they wait for
prerequisite tasks on other processors to be executed.
In these cases, a load balancing policy may be applied in an attempt to make the
load distribution more uniform.
The basic remedy is to switch as-yet-unexecuted tasks from an overloaded processor to a less loaded one.
Load balancing can be initiated by an idle or lightly loaded processor (receiver-initiated) or by an overburdened processor (sender-initiated).
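A minimal sketch of the receiver-initiated variant, sequentialized for illustration; the Processor class, the donor rule, and the task names are assumptions.

# Receiver-initiated load balancing sketch: idle processors pull work.
import collections
import random

class Processor:
    def __init__(self, pid, tasks):
        self.pid = pid
        self.queue = collections.deque(tasks)  # local queue of pending tasks

def balance_step(procs):
    # Each idle processor (the "receiver") pulls one task from the most
    # heavily loaded processor, provided the donor keeps at least one task.
    for p in procs:
        if not p.queue:
            donor = max(procs, key=lambda q: len(q.queue))
            if len(donor.queue) > 1:
                p.queue.append(donor.queue.pop())

random.seed(1)
procs = [Processor(i, ["t%d.%d" % (i, j) for j in range(random.randint(0, 6))])
         for i in range(4)]
balance_step(procs)
for p in procs:
    print(p.pid, list(p.queue))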
Load balancing may involve a great deal of overhead that reduces the
potential gains.
If moving a task from one processor to another means copying a large program with huge amounts of data and then updating various status tables to indicate the new location of the task, then the communication overhead is significant.
If the tasks belong to a standard set of tasks, each of which is invoked with a
small set of parameters (data) and with copies already available locally to
every processor, then moving the tasks may involve only a small broadcast
message to pass the parameters and update the system status tables.
The excess (deficiency) of work load at some nodes may be viewed as flow sources
(sinks) and the requirement is to allow the excess work load to flow from sources
to sinks via paths that are, to the extent possible, disjoint and thus free from
conflicts.
Self-scheduling systems
Decomposition Patterns: Task Decomposition, Data Decomposition, Group Tasks, Order Tasks, Data Sharing.
In a task decomposition, the operations that make up a task should be largely independent of the operations taking place inside other tasks.
In a data decomposition, the computation associated with the data chunks will only be efficient if the chunks can be operated upon relatively independently.
Ex: medical imaging
The images formed from the distribution of emitted radiation are of low resolution, due in part to the scattering of the radiation as it passes through the body.
It is also difficult to reason from the absolute radiation intensities, because different pathways through the body attenuate the radiation differently.
To solve this problem, models of how radiation propagates through the body are used to correct the images.
A common approach is to build a Monte Carlo model.
Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed.
As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory.
To create a statistically significant simulation, thousands or millions of trajectories are followed.
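A toy sketch of the trajectory-following loop; the attenuation values and segment counts are invented placeholders, not a real body model.

# Toy Monte Carlo sketch: each trajectory is an independent unit of work.
import random

def follow_trajectory(attenuations):
    # Attenuate the ray through each traversed segment until it exits
    # the body and hits the camera model.
    intensity = 1.0
    for a in attenuations:
        intensity *= a
    return intensity

random.seed(0)
n_rays = 100_000  # thousands to millions of independent trajectories
total = 0.0
for _ in range(n_rays):
    path = [random.uniform(0.8, 1.0) for _ in range(random.randint(1, 5))]
    total += follow_trajectory(path)
print("mean recorded intensity:", total / n_rays)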
Ex: molecular dynamics
For example, molecular dynamics simulations show how a large protein moves around and how differently shaped drugs might interact with the protein.
Molecular dynamics is extremely important in the pharmaceutical industry.
It is also a useful test problem for computer scientists working on parallel computing: it is straightforward to understand, relevant to science at large, and difficult to parallelize effectively.
The basic idea is to treat a molecule as a large collection of balls connected by springs.
At each time step, the force on each atom is computed and then standard classical mechanics
techniques are used to compute how the force moves the atoms.
This process is carried out repeatedly to step through time and compute a trajectory for the
molecular system.
The forces due to the chemical bonds (the "springs") are relatively simple to compute.
These correspond to the vibrations and rotations of the chemical bonds themselves.
These are short range forces that can be computed with knowledge of the handful of atoms that
share chemical bonds.
The major difficulty arises because the atoms have partial electrical charges.
Hence, while atoms only interact with a small neighborhood of atoms through their chemical bonds,
the electrical charges cause every atom to apply a force on every other atom.
Even though each atom exerts a force on every other atom, this force decreases with the square of the
distance between the atoms.
Hence, it should be possible to pick a distance beyond which the force contribution is so small that it
can be ignored.
By ignoring the atoms that exceed this cutoff, the problem is reduced to one that scales as O(N x n),
where n is the number of atoms within the cutoff volume, usually hundreds.
The computation is still huge, and it dominates the overall runtime for the simulation, but at least the
problem is tractable.
The primary data structures hold the atomic positions (atoms), the velocities of each atom (velocity), the forces exerted on each atom (forces), and the lists of atoms within the cutoff distance of each atom (neighbors).
The program itself is a time-stepping loop, in which each iteration computes the short-range force terms, updates the neighbor lists, and then finds the non-bonded forces.
After the force on each atom has been computed, a simple ordinary differential equation is solved to update the positions and velocities.
Physical properties based on atomic motions are then updated.
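A minimal sketch of one iteration of this loop, assuming numpy; the pair force, unit masses, and cutoff value are toy placeholders rather than a real potential.

# One MD time step: update neighbor lists, accumulate non-bonded forces,
# then integrate positions and velocities.
import numpy as np

def neighbor_lists(atoms, cutoff):
    # For each atom, the indices of the other atoms within the cutoff distance.
    d = np.linalg.norm(atoms[:, None, :] - atoms[None, :, :], axis=-1)
    return [np.nonzero((row > 0) & (row < cutoff))[0] for row in d]

def step(atoms, velocity, dt, cutoff):
    forces = np.zeros_like(atoms)
    neighbors = neighbor_lists(atoms, cutoff)   # update neighbor lists
    for i, nbrs in enumerate(neighbors):        # non-bonded terms: O(N x n)
        for j in nbrs:
            r = atoms[j] - atoms[i]
            forces[i] -= r / np.dot(r, r) ** 2  # toy repulsive pair force
    velocity += forces * dt                     # simple ODE update
    atoms += velocity * dt                      # (unit masses assumed)
    return atoms, velocity

atoms = np.random.rand(10, 3)
velocity = np.zeros_like(atoms)
atoms, velocity = step(atoms, velocity, dt=1e-3, cutoff=0.5)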
A task decomposition must:
ensure that the tasks are sufficiently independent, so that managing dependencies takes only a small fraction of the program's overall execution time;
ensure that the execution of the tasks can be evenly distributed among the ensemble of PEs (the load-balancing problem).
This is done by hand, based on knowledge of the problem and the code required to solve it.
Tasks can be found in many different places.
Defining a task for each function call leads to what is sometimes called a functional decomposition.
Another place to find tasks is in the distinct iterations of the loops within an algorithm.
If the iterations are independent and there are enough of them, then it might work well to base a task decomposition on mapping each iteration onto a task.
This style of task-based decomposition leads to what are sometimes called loop-splitting algorithms, as sketched below.
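A minimal loop-splitting sketch in Python, with multiprocessing.Pool standing in for the ensemble of PEs and a placeholder iteration body.

# Loop splitting: independent iterations become tasks mapped onto PEs.
from multiprocessing import Pool

def iteration(i):
    return i * i  # stand-in for one independent loop iteration

if __name__ == "__main__":
    with Pool(4) as pool:  # 4 PEs; each iteration becomes one task
        results = pool.map(iteration, range(100))
    print(sum(results))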
In a data decomposition, a large data structure is decomposed, and multiple units of execution concurrently update different chunks of the data structure.
In this case, the tasks are those updates on individual chunks.
This can be very time consuming and can waste a great deal of memory.
If memory and/or network bandwidth is a limiting factor, a decomposition that focuses on the data might be more effective.
Ex: matrix multiplication
A natural decomposition: the calculation of each element of the product matrix as a separate task.
This is straightforward to implement in a shared-memory environment.
Memory access time is slow compared to floating-point arithmetic, however, so the bandwidth of the memory subsystem would limit the performance.
A better approach would be to design an algorithm that maximizes the reuse of data loaded into a processor's caches.
Ex: molecular dynamics
The key data structures are the neighbor list, the atomic coordinates, the atomic velocities, and the force vector.
Every iteration that updates the force vector needs the coordinates of a neighborhood of atoms.
The computation of non-bonded forces, however, potentially needs the coordinates of all the atoms.
Array-based computations are the most common example.
Another is recursive data structures: for example, decomposing the parallel update of a large tree data structure into subtrees that can be updated concurrently.
Ex: medical imaging
The body model is the large central data structure around which the computation can be organized.
The model is broken into segments, and one or more segments are associated
with each processing element.
The body segments are only read, not written, during the trajectory
computations, so there are no data dependencies created by the decomposition
of the body model.
Each trajectory passing through the data segment defines a task.
The trajectories are initiated and propagated within a segment.
When a segment boundary is encountered, the trajectory must be passed
between segments.
It is this transfer that defines the dependencies between data chunks.
Ex: matrix multiplication
One approach: decompose the product matrix C into a set of row blocks (sets of adjacent rows).
An even more effective approach, which does not require the replication of the full A matrix, is to decompose all three matrices into submatrices or blocks.
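A sketch of the block approach for C = A B, assuming numpy; the block size b is an arbitrary illustrative choice (it must divide n here).

# Block decomposition of matrix multiplication: each (i, j) block of C
# is an independent task that streams blocks of A and B through the cache.
import numpy as np

def block_matmul(A, B, b):
    n = A.shape[0]  # square matrices assumed for brevity
    C = np.zeros((n, n))
    for i in range(0, n, b):        # each (i, j) pair is one task ...
        for j in range(0, n, b):    # ... owning block C[i:i+b, j:j+b]
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(block_matmul(A, B, b=4), A @ B)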
Ex: molecular dynamics
An element of the velocity array is used only by the task owning the atom.
From Newton's third law, the force from atom i on atom j is the negative of the force from
atom j on atom i.
Exploit this symmetry: cut the amount of computation in half as we accumulate the force terms.
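A sketch of this symmetry in a toy 1-D setting; the positions and the pair force are placeholders for a real potential.

# Exploiting Newton's third law: each unordered pair is computed once
# and the reaction force is reused, halving the force computation.
n = 4
x = [0.0, 1.0, 2.5, 4.0]
forces = [0.0] * n
for i in range(n):
    for j in range(i + 1, n):         # each pair (i, j) visited once
        f = 1.0 / (x[j] - x[i]) ** 2  # toy repulsive force magnitude
        forces[i] -= f                # push atom i away from atom j
        forces[j] += f                # reaction: equal and opposite, for free
print(forces)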
The values in the force array are not used in the computation until the last steps, in which the coordinates and velocities are updated.
Therefore, the approach used is to initialize the entire force array on each PE and have the tasks accumulate partial sums of the force terms into this array.
After all the partial force terms have been computed, we sum all the PEs' arrays together to produce the final force array.
Group Tasks
Problem: How can the tasks that make up a problem be grouped to simplify the job of managing dependencies?
The idea is to define groups of tasks that share constraints, and to simplify the problem of managing constraints by dealing with groups rather than individual tasks.
Constraints among tasks fall into a few major categories.
One type of constraint is a temporal dependency: if task A depends on the results of task B, for example, then task A must wait until task B completes before it can execute.
Another type of constraint occurs when a collection of tasks must run at the same time.
Ex: the original problem domain is divided into multiple regions that can be updated in parallel, and the update of any given region requires information about the boundaries of its neighboring regions.
If all of the regions are not processed at the same time, the parallel program could stall or deadlock as some regions wait for data from inactive regions.
The goal of this pattern is to group tasks based on these constraints, because a group of tasks sharing a constraint can then be managed as a single unit.
Ex: matrix multiplication
Ex: molecular dynamics
Each item in the previous list corresponds to a high level operation in the original problem and
defines a task group.
In each case the updates implied in the force functions are independent - the only
dependency is the summation of the forces into a single force array.
The tasks in the first two groups are independent but share the same constraints.
In both cases, coordinates for a small neighborhood of atoms are read and local contributions are made to the force array, so these can be merged into a single group for bonded interactions.
The other groups have distinct temporal or ordering constraints and should not be merged.
Temporal dependencies.
Requirements that particular tasks must execute at the same time (e.g., because each requires information that will be produced by the others).
Lack of constraint, that is, total independence.
Order Tasks
The purpose of this pattern is to help find and correctly account for dependencies resulting from constraints on the order of execution of a collection of tasks.
There are two goals to be met when identifying ordering constraints among
tasks and defining a partial order among task groups.
The ordering must be restrictive enough to satisfy all the constraints so that the
resulting design is correct.
The ordering should not be more restrictive than it needs to be.
Overly constraining the solution limits design options and can impair program efficiency;
The fewer the constraints, the more flexibility you have to shift tasks around to balance
the computational load among PEs.
Data Sharing
Problem: Given a data and task decomposition for a problem, how is data shared among the tasks?
The goal of this pattern is to identify the data shared among tasks and to determine how access to it should be managed.
After the shared data has been identified, it needs to be analyzed to see how it is used.
Shared data falls into one of the following three categories.
Read-only: the data is read but never written, so no protection is needed.
Effectively local: the data is partitioned into chunks, each accessed by only one of the tasks.
Read-write: any access to the data (read or write) must be protected with some type of exclusive-access mechanism (locks, semaphores, etc.), which can be very expensive.
Special cases of read-write data, e.g., accumulate or multiple-read/single-write, can often be handled more cheaply.
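A minimal sketch of protecting read-write data with a lock, using Python threading and a placeholder shared counter.

# Read-write shared data: every read-modify-write is guarded by a lock.
import threading

shared_total = [0]
lock = threading.Lock()

def worker():
    for _ in range(1000):
        with lock:               # exclusive access on each update
            shared_total[0] += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared_total[0])           # always 4000, regardless of interleaving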
Ex: molecular dynamics
1. The atomic coordinates, used by each group.
2. The force array, used by each group except for the neighbor list update.
3. The neighbor list, shared between the nonbonded force group and the neighbor list update group.
These coordinates are treated as read-only data by the bonded force group, the nonbonded force group, and the neighbor list update group.
This data is read-write for the position update group.
The position update group executes alone after the other three groups are done.
In the first three groups, accesses to the position data can therefore be left unprotected, or the data can even be replicated.
For the position update group, the position data belongs to the read write category, and access
to this data will need to be controlled carefully.
The force array is used as read-only data by the position update group and as accumulate data for the bonded and nonbonded force groups.
This array can be put in the accumulate category for the force groups and in the read-only
category for the position update group.
The standard procedure for molecular dynamics simulations begins by initializing the force array
as a local array on each UE.
Contributions to elements of the force array are then computed by each UE, with the precise
terms computed being unpredictable because of the way the molecule folds in space.
After all the forces have been computed, the local arrays are reduced into a single array, a copy of which is placed on each UE.
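A sketch of this replicate-and-reduce scheme, assuming numpy; the number of UEs, the atom count, and the unit contributions are placeholders.

# Replicate-and-reduce for the force array: one local array per UE,
# partial sums accumulated locally, then a final reduction.
import numpy as np

n_atoms, n_ues = 8, 4
local = [np.zeros(n_atoms) for _ in range(n_ues)]   # one force array per UE

for ue in range(n_ues):
    # Each UE accumulates whichever partial force terms it computes.
    for i in range(ue, n_atoms, n_ues):
        local[ue][i] += 1.0                         # placeholder contribution

forces = np.sum(local, axis=0)  # reduce all local arrays into the final array
print(forces)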
The neighbor list, shared between the nonbonded force group and the neighbor list update group, is essentially local data for the neighbor list update group and read-only data for the nonbonded force computation.
The list can be managed in local storage on each UE.
Design evaluation
Suitability for the target platform: issues such as the number of processors and how data structures are shared will influence the efficiency of any design, but the more the design depends on the target architecture, the less flexible it will be.
Design quality: simplicity, flexibility, and efficiency are all desirable, but possibly conflicting, attributes.