Intro To Parallel Computing
September 2018
Mike Jarsulic
Introduction
So Far...
Year Users Jobs CPU Hours Support Tickets
2012 36 * 1.2M *
2013 71 * 1.9M 141
2014 84 739K 5.3M 391
2015 125 1.5M 5.2M 372
2016 153 1.6M 6.1M 363
2017 256 4.8M 14.7M 759
2018 296 3.0M 8.8M 631
TOTAL * 11.7M 40.2M 2657
However...
The BSD research community does not understand how to utilize resources
properly
Example:
• August 1st, 2018
• 1021 cores in use
• 39 jobs had requested a total of 358 cores
• All 39 jobs were single core jobs
• Thus, 319 cores were dedicated, but idle
My takeaways:
• The BSD research community does not really understand parallel computing!
• I have not done a good job of training the BSD research community in this area!
Office Hours
Held every Tuesday and Thursday at Peck Pavilion N161
Users come in to work one-on-one to better utilize our HPC resources
Three common questions on parallel computing:
1. Is this a worthwhile skill to learn? Aren't single processor systems fast
enough and always improving?
2. Why can't processor manufacturers just make faster processors?
3. Will you write me a program that will automatically convert my serial
application into a parallel application?
Obligatory Cost per Genome Slide
Motivation
Goals (of this presentation)
Understand the basic terminology found in parallel computing
Recognize the different types of parallel architectures
Learn how parallel performance is measured
Gain a high-level understanding of the different paradigms in parallel
programming and what libraries are available to implement those
paradigms
Please note that you will not be a parallel programming rock star
after this three-hour introduction
Overall Goals
Hopefully, there will be enough interest after this presentation to
support a workshop series on parallel computing
Possible topics for workshops are:
• Scientific Programming
• Parameterization
• OpenMP
• MPI
• CUDA
• OpenACC
Today's Outline
Terminology
Parallel Architectures
Measuring Parallel Performance
Embarrassingly Parallel Workflows
Shared Memory
Distributed Memory
Vectorization
Hybrid Computing
A Note on Sources
Presentation given at SC17 by Stout and Jablonowski from the University of
Michigan
Introduction to Parallel Computing, 2nd Edition, by Grama, Gupta, et al.
Introduction to Parallel Programming by Pacheco
Using OpenMP by Chapman, Jost, and Van Der Pas
Using MPI by Gropp, Lusk, Skjellum
Advanced MPI by Gropp, Hoefler, et al.
CUDA for Engineers by Storti and Yurtoglu
CUDA by Example by Sanders and Kandrot
Parallel Programming with OpenACC by Farber
The LLNL OpenMP and MPI tutorials
A Note on Format
Today's seminar will be recorded for an ITM/CTSA grant
• Questions
• Whiteboard
• Swearing
Terminology
Caveats
There are no standardized definitions for terms in parallel computing,
thus resources outside of this presentation may use the terms in a
different way
Many algorithms or software applications that you will find in the real
world do not fall neatly into the categories mentioned in this
presentation, since they blend approaches in ways that are difficult to
categorize.
Basic Terminology
Parallel Computing – solving a problem in which multiple tasks cooperate
closely and make simultaneous use of multiple processors
Embarrassingly Parallel – solving many similar, independent tasks;
parameterization
Multiprocessor – multiple processors (cores) on a single chip
Symmetric Multiprocessing (SMP) - multiple processors sharing a single
address space, OS instance, storage, etc. All processors are treated equally
by the OS
UMA – Uniform Memory Access; all processors share the memory space
uniformly
NUMA – Non-Uniform Memory Access; memory access time depends on
the location of the memory relative to the processor
More Basic Terminology
Cluster Computing – A parallel system consisting of a combination of
commodity units (processors or SMPs)
Capacity Cluster – A cluster designed to run a variety of problem types,
regardless of size; the goal is to perform as much
research as possible
Capability Cluster – The fastest, largest machines designed to solve
large problems; the goal is to demonstrate capability of the parallel
system
High Performance Computing – Solving large problems via a cluster
of computers that consist of fast networks and massive storage
Even More Basic Terminology
Instruction-Level Parallelism (ILP) - Improves processor performance
by having multiple processor components simultaneously executing
instructions
Pipelining – Breaking a task into steps performed by different
processors with inputs streaming through, much like an assembly line
Superscalar Execution - The ability for a processor to execute more
than one instruction during a clock cycle by simultaneously
dispatching multiple instructions to different execution units on the
processor
Cache Coherence - Synchronization between local caches for each
processor
Crash Simulation Example
Simplified model based on a crash simulation for the Ford Motor
Company
Illustrates various aspects common to many simulations and
applications
This example was provided by Q. Stout and C. Jablonowski of the
University of Michigan
Finite Element Representation
Car is modeled by a triangulated surface (elements)
The simulation models the movement of the elements, incorporating
the forces on the elements to determine their new position
In each time step, the movement of each element depends on its
interaction with the other elements to which it is physically adjacent
In a crash, elements may end up touching that were not touching
initially
The state of an element is its location, velocity, and information such
as whether it is metal that is bending
Car
Serial Crash Simulation
1. For all elements
2. Read State(element), Properties(element), Neighbor_list(element)
3. For time=1 to end_of_simulation
4. For element=1 to num_elements
5. Compute State(element) for next time step, based on the
previous state of the element and its neighbors and the
properties of the element
Simple Approach to Parallelization
(Concepts)
Distributed Memory – Parallel system based on processors linked
with a fast network; processors communicate via messages
Owner Computes – Distribute elements to processors; each
processor updates the elements which it contains
Single Program Multiple Data (SPMD) - All machines run the same
program on independent data; dominant form of parallel computing
Distributed Car
Basic Parallel Version
Concurrently for all processors P
1. For all elements assigned to P
2. Read State(element), Properties(element), Neighbor_list(element)
3. For time=1 to end_of_simulation
4. For element=1 to num_elements_in_P
5. Compute State(element) for next time step, based on
previous state of element and its neighbors, and on
properties of the element
Notes
Most of the code is the same as, or similar to, serial code.
High-level structure remains the same: a sequence of steps
• The sequence is a serial construct, but
• Now the steps are performed in parallel, but
• Calculations for individual elements are serial
Question:
• How does each processor keep track of adjacency info for neighbors in other
processors?
Distributed Car (ghost cells)
Parallel Architectures
Flynn's Taxonomies
Hardware Classifications
{ S, M } I { S, M } D
Single Instruction (SI) - System in which all processors execute the same
instruction
Multiple Instruction (MI) - System in which different processors may
execute different instructions
Single Data (SD) - System in which all processors operate on the same data
Multiple Data (MD) - System in which different processors may operate on
different data
M. J. Flynn. Some computer organizations and their effectiveness. IEEE
Transactions on Computers, C-21(9):948–960, 1972.
Flynn's Taxonomies
SISD – Classic von Neumann architecture; serial computer
MIMD – Collections of autonomous processors that can execute
multiple independent programs, each of which can have its own data
stream
SIMD – Data is divided among the processors and each data item is
subjected to the same sequence of instructions; GPUs, Advanced
Vector Extensions (AVX)
MISD – Very rare; systolic arrays; smart phones carried by
Chupacabras
CRAY-1 Vector Machine (1976)
Vector Machines Today
Software Taxonomies
Data Parallel (SIMD)
• Parallelism that is a result of identical operations being applied concurrently
on different data items; e.g., many matrix algorithms
• Difficult to apply to complex problems
DM software on SM hardware:
• Message Passing Interface (MPI) - designed for DM hardware but available on
SM systems
SM software on DM hardware:
• Remote Memory Access (RMA) included within MPI-3
• Partitioned Global Address Space (PGAS) languages
• Unified Parallel C (extension to ISO C 99)
• Coarray Fortran (Fortran 2008)
Difficulties
Serialization causes bottlenecks
Workload is not distributed
Debugging is hard
Serial approach doesn't parallelize
Parallel Performance
Definitions
SerialTime(n) - Time for a serial program to solve a problem with an input
of size n
ParallelTime(n,p) - Time for a parallel program to solve the same problem
with an input size of n using p processors
SerialTime(n) ≤ ParallelTime(n,1)
Why, then, might the parallel program sometimes win anyway?
• The parallel computer has p times as much RAM, so a higher fraction of program
memory is in RAM instead of on disk.
• The parallel program uses a different algorithm which was not possible with the
serial program.
Amdahl's Law
Let f be the fraction of time spent on serial operations
Let ParallelOps = 1 – f and assume linear speedup
For p processors:
• ParallelTime(n,p) ≥ SerialTime(n) * [f + ParallelOps/p]
• Speedup(p) ≤ 1 / (f + ParallelOps/p)
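A worked example of the bound: with f = 0.1 (10% of the work is serial) and p = 16,
Speedup(16) ≤ 1 / (0.1 + 0.9/16) ≈ 6.4. Even with unlimited processors, the speedup
can never exceed 1/f = 10.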
Techniques
• Parameter sweeps (modify parameters mathematically for each task)
• List-driven analysis (modify parameters based on a list of values)
Example: List-Driven Analysis
Problem: Calculate the theoretical performance for the CRI HPC
clusters (historically)
Rpeak = Nodes * CoresPerNode * ClockSpeed * InstructPerCycle
List
Cluster Nodes CoresPerNode ClockSpeed InstructPerCycle
ibicluster 99 8 2.66e09 4
tarbell 40 64 2.20e09 8
gardner 126 28 2.00e09 16
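As a quick check of the formula, the gardner row gives
Rpeak = 126 * 28 * 2.00e9 * 16 ≈ 1.13e14 FLOPS, or roughly 113 TFLOPS.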
Base
#PBS -N Rpeak.@Cluster
#PBS -l nodes=1:ppn=1
#PBS -l mem=1gb
#PBS -l walltime=20:00
!$OMP DO
DO i = 1, n
   a(i) = b(i) + c(i)
END DO
!$OMP END DO
Parallel Workshares
• Fortran only
• Used to parallelize assignments
Matrix-Vector Multiplication: OpenMP C
#include <stdio.h>
#include <stdlib.h>
void Omp_mat_vect(int m, int n, double * restrict a,
double * restrict b, double * restrict c) {
int i, j;
#pragma omp parallel for default (none) \
shared(m, n, a, b, c) private (i, j)
for (i = 0; i < m; i++) {
a[i] = 0.0;
for (j = 0; j < n; j++)
a[i] += b[i*n+j] * c[j];
    } /* End of omp parallel for */
} /* End of Omp_mat_vect */
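To try the C routine above, it needs a small driver that allocates and fills the arrays;
assuming GCC and a hypothetical driver file, a typical build-and-run looks like:
gcc -fopenmp driver.c omp_mat_vect.c -o matvec
export OMP_NUM_THREADS=8
./matvec
The -fopenmp flag enables the pragmas; without it, the directives are ignored and the
code runs serially.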
Matrix-Vector Multiplication: OpenMP Fortran
subroutine Pth_mat_vect(m, n, a, b, c)
   implicit none
   integer(kind=4) :: m, n
   real(kind=8) :: a(1:m), b(1:m, 1:n), c(1:n)
   integer :: i, j
!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP SHARED(m, n, a, b, c) PRIVATE(i, j)
   do i = 1, m
      a(i) = 0.0
      do j = 1, n
         a(i) = a(i) + b(i, j) * c(j)
      end do
   end do
!$OMP END PARALLEL DO
   return
end subroutine Pth_mat_vect
OpenMP Errors
Variable Scoping: Which are shared and which are private
Sequential I/O in a parallel region can lead to an unpredictable order
False sharing: two or more processors access different variables in
the same cache line (Performance)
Race Conditions
Deadlock
Summation Example
What is wrong with the following code?
sum = 0.0
!$OMP PARALLEL DO
DO i = 1, n
   sum = sum + a(i)
END DO
!$OMP END PARALLEL DO
Source
Avoidance
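One common way to avoid the race on sum is OpenMP's reduction clause, which gives each
thread a private copy of sum and combines the copies at the end of the loop. A minimal,
self-contained C sketch (the array and its contents are dummy data for illustration):

#include <stdio.h>

int main(void) {
    const int n = 1000;
    double a[1000], sum = 0.0;

    for (int i = 0; i < n; i++)
        a[i] = 1.0;                               /* dummy data */

    #pragma omp parallel for reduction(+:sum)     /* private partial sums, combined at the end */
    for (int i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %f\n", sum);                    /* expect 1000.0 */
    return 0;
}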
Non-Blocking Communication
Non-blocking operations
• MPI_ISEND()
• MPI_IRECV()
• MPI_WAIT()
The user can check for data at a later stage in the program without
waiting
• MPI_TEST()
Non-blocking operations can perform better than blocking operations
Possible to overlap communication with computation
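A minimal sketch of the non-blocking pattern in C (illustrative: rank 1 sends one value to
rank 0, which posts the receive early, does unrelated work, and waits only when it needs
the data):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double value = 3.14, result = 0.0;
    MPI_Request request;
    MPI_Status  status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Irecv(&result, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &request);
        /* ... computation that does not depend on result ... */
        MPI_Wait(&request, &status);              /* result is now safe to read */
        printf("received %f\n", result);
    } else if (rank == 1) {
        MPI_Send(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}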
MPI Program Template
#include "mpi.h"
* IniCalize MPI */
MPI_Init(&argc, &argv);
* Allow each processor to determine its role */
nt num_processor, my_rank;
MPI_Comm_size(MPI_COMM_WORLD, &num_processor);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
* Send and receive some stuff here */
* Finalize */
MPI_Finalize()
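A template like this is typically compiled with the MPI wrapper compiler and launched
with a process manager; exact commands vary by MPI distribution, and the file name here
is a placeholder:
mpicc template.c -o template
mpirun -np 4 ./template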
MPI Summation
/* Send and receive some stuff here */
if (my_rank == 0) {
    sum = 0.0;
    for (source = 1; source < num_procs; source++) {
        MPI_Recv(&value, 1, MPI_FLOAT, source, tag,
                 MPI_COMM_WORLD, &status);
        sum += value;
    }
}
else {
    MPI_Send(&value, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD);
}
A More Efficient Receive
In the previous example, rank 0 received the messages in order based
on ranks.
Thus, if processor 2 was delayed, then processor 0 was also delayed
Make the following modification:
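One common form of this modification is to receive from MPI_ANY_SOURCE, so values are
consumed in whatever order they arrive; a sketch reusing the variables from the MPI
Summation example (illustrative):
/* Accept messages in arrival order instead of rank order */
sum = 0.0;
for (source = 1; source < num_procs; source++) {
    MPI_Recv(&value, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag,
             MPI_COMM_WORLD, &status);
    sum += value;              /* status.MPI_SOURCE records which rank sent it */
}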
Reduction operations:
• MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD
• MPI_LAND, MPI_LOR
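These operations are used with collective calls such as MPI_Reduce, which replaces the
hand-written receive loop entirely; a minimal sketch (variable values are illustrative):
/* Every rank contributes its local value; rank 0 receives the total */
float value = my_rank * 1.0f;      /* stand-in for each rank's local result */
float sum   = 0.0f;
MPI_Reduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
/* On rank 0, sum now holds the sum over all ranks */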
Collective Communication
Most frequently invoked MPI routines after send and receive
Improve clarity, run faster than send and receive, and reduce the
likelihood of making an error
Examples (a short sketch follows this list):
• Broadcast (MPI_Bcast)
• Scatter (MPI_Scatter)
• Gather (MPI_Gather)
• All Gather (MPI_Allgather)
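A minimal sketch of broadcast and gather in C (illustrative; assumes my_rank and the
communicator setup from the MPI template, and at most 64 ranks):
/* Rank 0 broadcasts one parameter to all ranks, then gathers one result from each */
double parameter = 0.0, my_result;
double results[64];                                       /* room for up to 64 ranks */

if (my_rank == 0) parameter = 2.5;                        /* known only on the root at first */
MPI_Bcast(&parameter, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* now every rank has it */

my_result = parameter * my_rank;                          /* stand-in for real work */
MPI_Gather(&my_result, 1, MPI_DOUBLE,
           results, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);    /* rank 0 collects in rank order */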
Broadcast
Scatter
Gather
All Gather
MPI Data Types
Primitive data types (correspond to those found in the underlying
language)
• Fortran
• MPI_CHARACTER
• MPI_INTEGER
• MPI_REAL, MPI_DOUBLE_PRECISION
• MPI_LOGICAL
• C/C++ (include many more variants than Fortran)
• MPI_CHAR
• MPI_INT, MPI_UNSIGNED, MPI_LONG
• MPI_FLOAT, MPI_DOUBLE
• MPI_C_BOOL
MPI Data Types
Derived data types are also possible
• Vector – data separated by a constant stride (a short sketch follows this list)
• Contiguous – data separated by a stride of 1
• Struct – mixed types
• Indexed – array of indices
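For example, a vector type can describe one column of a row-major matrix so it can be
sent as a single message; a sketch (the matrix size, dest, and tag are illustrative
placeholders):
/* One column of a 10x10 row-major matrix of doubles:
   10 blocks of 1 element, separated by a stride of 10 */
double matrix[10][10];
MPI_Datatype column_type;

MPI_Type_vector(10, 1, 10, MPI_DOUBLE, &column_type);
MPI_Type_commit(&column_type);

MPI_Send(&matrix[0][3], 1, column_type, dest, tag, MPI_COMM_WORLD);   /* send column 3 */

MPI_Type_free(&column_type);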
MPI Vector
MPI Contiguous
MPI Struct
MPI Indexed
MPI Synchronization
Implicit synchronization
• Blocking communication
• Collective communication
Explicit synchronization
• MPI_Wait
• MPI_Waitany
• MPI_Barrier
Remember, synchronization can hinder performance
Additional MPI Features
Virtual topologies
User-created communicators
User-specified derived data types
Parallel I/O
One-sided communication (MPI_Put, MPI_Get)
Non-blocking collective communication
Remote Memory Access
Vectorization
Accelerators
Principle: Split an operation into independent parts and execute the
parts concurrently in an accelerator (SIMD)
DO I = 1, 1000
A(I) = B(I) + C(I)
END DO
High-Level Programming
• OpenACC
• OpenMP 4.5
• Allows for a single code base that may be compiled for both multicore CPUs and
GPUs (a short sketch follows)
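A minimal sketch of the OpenMP 4.5 offload style in C (illustrative; the same loop can
also be compiled for the host's cores when no accelerator is available):
/* Offload a vector add to an accelerator with OpenMP 4.5 target directives */
void vector_add(int n, const float *b, const float *c, float *a) {
    #pragma omp target teams distribute parallel for \
            map(to: b[0:n], c[0:n]) map(from: a[0:n])
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}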
CUDA Matrix Multiplication
int main(void) {
    int *a, *b, *c;                  // CPU copies
    int *dev_a, *dev_b, *dev_c;      // GPU copies
    int arraysize = n * sizeof(int); /* n and threads_per_block are assumed to be #define'd constants */

    /* Allocate memory on GPU */
    cudaMalloc((void**)&dev_a, arraysize);
    cudaMalloc((void**)&dev_b, arraysize);
    cudaMalloc((void**)&dev_c, sizeof(int));

    /* use malloc for a, b, c and initialize a, b */

    /* Move a and b to the GPU */
    cudaMemcpy(dev_a, a, arraysize, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, arraysize, cudaMemcpyHostToDevice);

    /* Launch GPU kernel */
    dot<<< n/threads_per_block, threads_per_block >>>(dev_a, dev_b, dev_c);

    /* Copy results back to CPU */
    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    return 0;
}
CUDA Matrix Multiplication
__global__ void dot(int *a, int *b, int *c) {
    __shared__ int temp[threads_per_block];

    /* Convert the thread index to a global array index */
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];

    /* Wait for every thread in the block before summing (avoids a race on temp) */
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < threads_per_block; i++)
            sum += temp[i];
        /* add to global sum */
        atomicAdd(c, sum);
    }
}
OpenACC and OpenMP
Share the same model of computation
Host-centric – the host device offloads code regions and data to
accelerators
Device – has independent memory and multiple threads
Mapping clause – defines the relationship between memory on the
host device and memory on the accelerator
OpenACC Matrix Multiplication
#include <stdlib.h>

int main(int argc, char** argv) {
    int n = atoi(argv[1]);
    float *a = (float *)malloc(sizeof(float) * n * n);
    float *b = (float *)malloc(sizeof(float) * n * n);
    float *c = (float *)malloc(sizeof(float) * n * n);

    #pragma acc data copyin(a[0:n*n], b[0:n*n]), copy(c[0:n*n])
    {
        int i, j, k;

        #pragma acc kernels
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                a[i*n+j] = (float) i + j;
                b[i*n+j] = (float) i - j;
                c[i*n+j] = 0.0;
            }
        }
OpenACC Matrix Multiplication
        #pragma acc kernels
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                for (k = 0; k < n; k++) {
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
                }
            }
        }
    }   /* end of acc data region */

    return 0;
}
Speedup?
Some problems can achieve a speedup of 10x-100x by offloading
some of the work to a GPU (e.g., Linpack, matrix multiplication)
In reality:
• Only a small fraction of applications can utilize a GPU effectively
• The average speedup of those applications is ~2x-4x
• Optimizing your code using multicore technologies is probably a better use of
your time
Problem Areas
Branching can cut your performance by 50% since instructions need
to be issued for both branches
Memory bandwidth on GPUs is poor
• Need to make sure your operands are used more than once
• For the NVidia Volta, all operands need to be used 67x to achieve peak FLOPs
Accelerator Trends
Intel has canceled Knights Landing and Knights Hill; future Xeon Phi status
is unknown
GPUs are becoming less rigidly SIMD and including better memory
bandwidth
More programmers are moving from low-level programming like CUDA to
high-level directives
The PGI compiler was purchased by NVidia in 2013; support for OpenCL was cut
OpenACC and OpenMP still have not merged
Efficiency of GPUs is still problematic and the programming model is a
moving target
New motivation: Deep Learning
Hybrid Computing
Current Trend
Accelerator-based machines are grabbing the high rankings on the
Top 500, but clusters with standard cores are still most important
economically
Economics dictates a distributed memory system of shared memory
boards with multicore commodity chips, perhaps with some form of
accelerator distributed sparingly
MPI+X
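In practice the "X" is most often OpenMP: MPI distributes work across nodes while OpenMP
threads share memory within each node. A minimal hybrid sketch in C (illustrative):

#include "mpi.h"
#include <omp.h>
#include <stdio.h>

/* MPI between nodes, OpenMP threads within each rank */
int main(int argc, char **argv) {
    int provided, rank;

    /* Request thread support where only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}

Built with the MPI wrapper plus -fopenmp, each rank spawns its own team of threads.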
Conclusion
Topics for Possible Future Workshops
Scientific Computing (C++ or Fortran or Both)
Parallel Computing Theory (e.g., load balancing, measuring performance)
Embarrassingly Parallel
pthreads
OpenMP
MPI
OpenACC
CUDA
Parallel Debugging/Profiling