Intro To Parallel Computing


Introduction to Parallel Computing
September 2018

Mike Jarsulic
Introduction
So Far...
Year    Users   Jobs    CPU Hours   Support Tickets
2012    36      *       1.2M        *
2013    71      *       1.9M        141
2014    84      739K    5.3M        391
2015    125     1.5M    5.2M        372
2016    153     1.6M    6.1M        363
2017    256     4.8M    14.7M       759
2018    296     3.0M    8.8M        631
TOTAL   *       11.7M   40.2M       2657
However...
The BSD research community does not understand how to utilize resources
properly
Example:
• August 1st, 2018
• 1021 cores in use
• 39 jobs had requested a total of 358 cores
• All 39 jobs were single core jobs
• Thus, 319 cores were dedicated, but idle

My takeaways:
• The BSD research community does not really understand parallel computing!
• I have not done a good job of training the BSD research community in this area!
Office Hours
Held every Tuesday and Thursday at Peck Pavilion N161
Users come in to work one-on-one to better utilize our HPC resources
Three common questions on parallel computing:
1. Is this a worthwhile skill to learn? Aren't single processor systems fast
enough and always improving?
2. Why can't processor manufacturers just make faster processors?
3. Will you write me a program that will automatically convert my serial
application into a parallel application?
Obligatory Cost per Genome Slide
Motivation
Goals (of this presentation)
Understand the basic terminology found in parallel computing
Recognize the different types of parallel architectures
Learn how parallel performance is measured
Gain a high-level understanding of the different paradigms in parallel
programming and what libraries are available to implement those
paradigms

Please note that you will not be a parallel programming rock star
after this three-hour introduction
Overall Goals
Hopefully, there will be enough interest after this presentation to
support a workshop series on parallel computing
Possible topics for workshops are:
• Scientific Programming
• Parameterization
• OpenMP
• MPI
• CUDA
• OpenACC
Today's Outline
Terminology
Parallel Architectures
Measuring Parallel Performance
Embarrassingly Parallel Workflows
Shared Memory
Distributed Memory
Vectorization
Hybrid Computing
A Note on Sources
Presentation given at SC17 by Stout and Jablonowski from the University of
Michigan
Introduction to Parallel Computing, 2nd Edition by Grama, Gupta, et al.
Introduction to Parallel Programming by Pacheco
Using OpenMP by Chapman, Jost, and Van Der Pas
Using MPI by Gropp, Lusk, Skjellum
Advanced MPI by Gropp, Hoefler, et al.
CUDA for Engineers by Storti and Yurtoglu
CUDA by Example by Sanders and Kandrot
Parallel Programming with OpenACC by Farber
The LLNL OpenMP and MPI tutorials
A Note on Format
Today's Seminar will be recorded for an ITM/CTSA grant
• Questions
• Whiteboard
• Swearing
Terminology
Caveats
There are no standardized definitions for terms in parallel computing,
thus resources outside of this presentation may use the terms in a
different way
Many algorithms or software applications that you will find in the real
world do not fall neatly into the categories mentioned in this
presentation since they blend approaches in ways that are difficult to
categorize.
Basic Terminology
Parallel Computing – solving a problem in which multiple tasks cooperate
closely and make simultaneous use of multiple processors
Embarrassingly Parallel – solving many similar, independent tasks;
parameterization
Multiprocessor – multiple processors (cores) on a single chip
Symmetric Multiprocessing (SMP) - multiple processors sharing a single
address space, OS instance, storage, etc. All processors are treated equally
by the OS
UMA – Uniform Memory Access; all processors share the memory space
uniformly
NUMA – Non-Uniform Memory Access; memory access time depends on
the location of the memory relative to the processor
More Basic Terminology
Cluster Computing – A parallel system consisting of a combination of
commodity units (processors or SMPs)
Capacity Cluster – A cluster designed to run a variety of types of
problems regardless of the size; the goal is to perform as much
research as possible
Capability Cluster – The fastest, largest machines designed to solve
large problems; the goal is to demonstrate the capability of the parallel
system
High Performance Computing – Solving large problems via a cluster
of computers that consist of fast networks and massive storage
Even More Basic Terminology
Instruction-Level Parallelism (ILP) - Improves processor performance
by having multiple processor components simultaneously executing
instructions
Pipelining – Breaking a task into steps performed by different
processors with inputs streaming through, much like an assembly line
Superscalar Execution - The ability for a processor to execute more
than one instruction during a clock cycle by simultaneously
dispatching multiple instructions to different execution units on the
processor
Cache Coherence - Synchronization between local caches for each
processor
Crash Simulation Example
Simplified model based on a crash simulation for the Ford Motor
Company
Illustrates various aspects common to many simulations and
applications
This example was provided by Q. Stout and C. Jablonowski of the
University of Michigan
Finite Element Representation
Car is modeled by a triangulated surface (elements)
The simulation models the movement of the elements, incorporating
the forces on the elements to determine their new position
In each time step, the movement of each element depends on its
interaction with the other elements to which it is physically adjacent
In a crash, elements may end up touching that were not touching
initially
The state of an element is its location, velocity, and information such
as whether it is metal that is bending
Car
Serial Crash Simulation
1. For all elements
2. Read State(element), Properties(element), Neighbor_list(element)
3. For time=1 to end_of_simulation
4. For element=1 to num_elements
5. Compute State(element) for next time step, based on the
previous state of the element and its neighbors and the
properties of the element
Simple Approach to Parallelization
(Concepts)
Distributed Memory – Parallel system based on processors linked
with a fast network; processors communicate via messages
Owner Computes – Distribute elements to processors; each
processor updates the elements which it contains
Single Program Multiple Data (SPMD) - All machines run the same
program on independent data; dominant form of parallel computing
Distributed Car
Basic Parallel Version
Concurrently for all processors P
1. For all elements assigned to P
2. Read State(element), Properties(element), Neighbor-
list(element)
3. For time=1 to end_of_simulation
4. For element=1 to num_elements_in_P
5. Compute State(element) for next time step, based on
previous state of element and its neighbors, and on
properties of the element
Notes
Most of the code is the same as, or similar to, serial code.
High-level structure remains the same: a sequence of steps
• The sequence is a serial construct, but
• Now the steps are performed in parallel, but
• Calculations for individual elements are serial

Question:
• How does each processor keep track of adjacency info for neighbors in other
processors?
Distributed Car (ghost cells)
Parallel Architectures
Flynn's Taxonomies
Hardware Classifications
{ S, M } I { S, M } D
Single Instruction (SI) - System in which all processors execute the same
instruction
Multiple Instruction (MI) - System in which different processors may
execute different instructions
Single Data (SD) - System in which all processors operate on the same data
Multiple Data (MD) - System in which different processors may operate on
different data
M. J. Flynn. Some computer organizations and their effectiveness. IEEE
Transactions on Computers, C-21(9):948–960, 1972.
Flynn's Taxonomies
SISD – Classic von Neumann architecture; serial computer
MIMD – Collections of autonomous processors that can execute
multiple independent programs, each of which can have its own data
stream
SIMD – Data is divided among the processors and each data item is
subjected to the same sequence of instructions; GPUs, Advanced
Vector Extensions (AVX)
MISD – Very rare; systolic arrays; smart phones carried by
Chupacabras
CRAY-1 Vector Machine (1976)
Vector Machines Today
Software Taxonomies
Data Parallel (SIMD)
• Parallelism that is a result of identical operations being applied concurrently
on different data items; e.g., many matrix algorithms
• Difficult to apply to complex problems

Single Program, Multiple Data (SPMD)
• A single application is run across multiple processes/threads on a MIMD
architecture
• Most processes execute the same code but do not work in lock-step
• Dominant form of parallel programming
SISD vs. SIMD
MIMD Architectures (Shared Memory)

Uniform Memory Access (UMA) Non-Uniform Memory Access (NUMA)


More MIMD Architectures

Distributed Memory Hybrid Memory


Shared Memory (SM)
Attributes:
• Global memory space
• Each processor will utilize its own cache for a portion of global memory;
consistency of the cache is maintained by hardware
Advantages:
• User-friendly programming techniques (OpenMP and OpenACC)
• Low latency for data sharing between tasks
Disadvantages:
• Global memory space-to-CPU path may be a bottleneck
• Non-Uniform Memory Access
• Programmer responsible for synchronization
Distributed Memory (DM)
Attributes:
• Memory is distributed among processors; data is shared through message passing
Advantages:
• Memory scales based on the number of processors
• Access to a processor's own memory is fast
• Cost effective
Disadvantages:
• Error prone; programmers are responsible for the details of the
communication
• Complex data structures may be difficult to distribute
Hardware/Software Models
Software and hardware models do not need to match

DM software on SM hardware:
• Message Passing Interface (MPI) - designed for DM hardware but available on
SM systems

SM software on DM hardware
• Remote Memory Access (RMA) included within MPI-3
• Partitioned Global Address Space (PGAS) languages
• Unified Parallel C (extension to ISO C 99)
• Coarray Fortran (Fortran 2008)
Difficulties
Serialization causes bottlenecks
Workload is not distributed
Debugging is hard
Serial approach doesn't parallelize
Parallel Performance
Definitions
SerialTime(n) - Time for a serial program to solve a problem with an input
of size n
ParallelTime(n,p) - Time for a parallel program to solve the same problem
with an input size of n using p processors

SerialTime(n) ≤ ParallelTime(n,1)

Speedup(n,p) = SerialTime(n) / ParallelTime(n,p)


Work(n,p) = p * ParallelTime(n,p)
Efficiency(n,p) = SerialTime(n) / [p * ParallelTime(n,p)]
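
A quick worked example (illustrative numbers, not from the measurements above): if
SerialTime(n) = 100 s and ParallelTime(n,4) = 30 s, then Speedup(n,4) = 100/30 ≈ 3.3,
Work(n,4) = 4 * 30 = 120 s, and Efficiency(n,4) = 100/120 ≈ 0.83.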
Definitions
Expectations:
• 0 < Speedup ≤ p
• SerialWork ≤ ParallelWork < ∞
• 0 < Efficiency ≤ 1

Linear Speedup: Speedup = p, given some restriction on the
relationship between n and p
In The Real World
Superlinear Speedup
Superlinear Speedup: Speedup > p, thus Efficiency > 1

Why?
• Parallel computer has p times as much RAM so a higher fraction of program
memory is in RAM instead of disk.
• Parallel program used a different algorithm which was not possible with the
serial program.
Amdahl's Law
Let f be the fraction of time spent on serial operations
Let ParallelOps = 1 – f and assume linear speedup

For p processors:
• ParallelTime(n,p) ≥ SerialTime(n) * [f + (ParallelOps/p)]
• Speedup(n,p) ≤ 1 / (f + ParallelOps/p)

Thus, Speedup(n,p) ≤ 1/f, no matter the number of processors used
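
A quick worked example (illustrative numbers): if f = 0.1, then with p = 16 processors
Speedup ≤ 1 / (0.1 + 0.9/16) ≈ 6.4, and even with unlimited processors Speedup ≤ 1/f = 10.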


Maximum Possible Performance
Pessimistic Interpretation
Optimistic Interpretation
We should be more optimistic than Amdahl's Law because:
• Algorithms with much smaller values of f
• More time spent in RAM than disk
• Time spent in f is a decreasing fraction of the total time as problem size
increases
Common Program Structure
Scaling Strategy
Increase n as p increases
• Amdahl considered strong scaling where n is fixed

Utilize weak scaling
• The amount of data per processor is fixed
• Efficiency can remain high if communication does not increase excessively
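
For example, under weak scaling with one million elements per processor, a 10-processor
run solves n = 10M and a 100-processor run solves n = 100M; the work per processor, and
ideally the run time, stays roughly constant.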
Embarrassingly Parallel Jobs
Parameterization
Requirements
• Independent tasks
• Common applications
• One job submission per task

Techniques
• Parameter sweeps (modify parameters mathematically for each task)
• List-driven analysis (modify parameters based on a list of values)
Example: List-Driven Analysis
Problem: Calculate the theoretical performance for the CRI HPC
clusters (historically)
Rpeak = Nodes * CoresPerNode * ClockSpeed * InstructPerCycle
List
Cluster      Nodes   CoresPerNode   ClockSpeed   InstructPerCycle
ibicluster   99      8              2.66e09      4
tarbell      40      64             2.20e09      8
gardner      126     28             2.00e09      16
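
For example, plugging the gardner row into the formula above:
Rpeak = 126 * 28 * 2.00e09 * 16 ≈ 1.13e14 FLOPS, or roughly 113 TFLOPS.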
Base
#PBS -N Rpeak.@Cluster
#PBS -l nodes=1:ppn=1
#PBS -l mem=1gb
#PBS -l walltime=20:00

./rpeak -n @Nodes -c @CoresPerNode \
        -s @ClockSpeed -i @InstructPerCycle
Shared Memory Parallelism
Shared Memory
All processors can directly access all the memory in the system
(though access times may be different due to NUMA)
Software portability – Applications can be developed on serial
machines and run on parallel machines without any changes
Latency Hiding – Ability to mask the access latency for memory
access, I/O, and communications
Scheduling – Supports system-level dynamic mapping of tasks to
processors
Ease of programming – Significantly easier (in my opinion) to program
using a shared memory model than a distributed memory model
Parallelization Techniques: pthreads
POSIX threads
Standard threading implementation for UNIX-like operating systems
(Linux, Mac OS X)
Library that can be linked with C programs
Some compilers include a Fortran version of pthreads
Knowledge of pthreads can easily be transferred to programming
other widely used threading specifications (e.g., Java threads)
Matrix-Vector Multiplication (pthreads)
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Global variables: accessible to all threads */
/* (m, n, A, x, and y are assumed to be globals as well) */
int thread_count;

void* Pth_mat_vect(void* rank); /* Thread function */

int main(int argc, char* argv[]) {
   long thread; /* Use long in case of a 64-bit system */
   pthread_t* thread_handles;

   /* Get number of threads from the command line */
   thread_count = strtol(argv[1], NULL, 10);
   thread_handles = malloc(thread_count*sizeof(pthread_t));
Matrix-Vector Multiplication (pthreads)
   /* Create a thread for each rank */
   for (thread = 0; thread < thread_count; thread++)
      pthread_create(&thread_handles[thread], NULL,
                     Pth_mat_vect, (void*) thread);

   /* Join the results from each thread */
   for (thread = 0; thread < thread_count; thread++)
      pthread_join(thread_handles[thread], NULL);

   free(thread_handles);
   return 0;
}  /* main */
Matrix-Vector Multiplication (pthreads)
void* Pth_mat_vect(void* rank) {
   long my_rank = (long) rank;
   int i, j;
   int local_m = m/thread_count;
   int my_first_row = my_rank * local_m;
   int my_last_row = (my_rank + 1) * local_m - 1;

   for (i = my_first_row; i <= my_last_row; i++) {
      y[i] = 0.0;
      for (j = 0; j < n; j++)
         y[i] += A[i][j]*x[j];
   }
   return NULL;
}  /* Pth_mat_vect */
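
To try the example (assuming the globals m, n, A, x, and y are defined elsewhere, as the
thread function implies), it would typically be compiled against the pthread library, e.g.
gcc -o pth_mat_vect pth_mat_vect.c -lpthread, and run with the thread count as its only
command-line argument, e.g. ./pth_mat_vect 4.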
Parallelization Techniques: OpenMP
OpenMP is the de facto HPC standard for shared memory programming
OpenMP version 4.5 was released in 2015 and includes accelerator
support as an advanced feature
API for thread-based parallelism
Explicit programming model; the compiler interprets directives
Based on a combination of compiler directives, library routines, and
environment variables
Uses the fork-join model of parallel execution
Available in most Fortran and C compilers
OpenMP Goals
Standardization: standard among all shared memory architectures
and hardware platforms
Lean: simple and limited set of compiler directives for shared memory
programming. Often significant performance gains using just 4-6
directives in complex applications.
Ease of use: supports incremental parallelization of a serial program,
not an all-or-nothing approach.
Portability: supports Fortran, C, and C++
OpenMP Building Blocks
Compiler Directives (embedded in code)
• Parallel regions (PARALLEL)
• Parallel loops (PARALLEL DO)
• Parallel workshare (PARALLEL WORKSHARE)
• Parallel sections (PARALLEL SECTIONS)
• Parallel tasks (PARALLEL TASK)
• Serial sections (SINGLE)
• Synchronization (BARRIER, CRITICAL, ATOMIC, …)
• Data structures (PRIVATE, SHARED, REDUCTION)
Run-time library routines (called in code)
• OMP_SET_NUM_THREADS
• OMP_GET_NUM_THREADS
UNIX Environment Variables (set before program execution)
• OMP_NUM_THREADS
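
To make the building blocks concrete, here is a minimal C sketch (not from the original
slides) that combines a parallel-region directive, two run-time library routines, and the
OMP_NUM_THREADS environment variable:

#include <stdio.h>
#include <omp.h>

int main(void) {
   /* OMP_NUM_THREADS (or omp_set_num_threads) controls the team size */
   #pragma omp parallel
   {
      int tid = omp_get_thread_num();        /* run-time library routine */
      int nthreads = omp_get_num_threads();  /* run-time library routine */
      printf("Hello from thread %d of %d\n", tid, nthreads);
   }  /* implicit barrier; threads join at the end of the parallel region */
   return 0;
}

Compile with OpenMP support (e.g., gcc -fopenmp) and run with, for example,
OMP_NUM_THREADS=4 ./a.out.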
Fork-Join Model
Parallel execution is achieved by generating threads which are
executed in parallel
Master thread executes in serial until the first parallel region is
encountered
Fork – The master thread creates a team of threads which are
executed in parallel
Join – When the team members complete the work, they synchronize
and terminate. The master thread continues sequentially.
Fork-Join Model
Work-Sharing Constructs
Barriers
Barriers may be needed for
correctness
Synchronization degrades
performance
OpenMP Notation
OpenMP recognizes the following compiler directives that start with:
• !$OMP (in Fortran)
• #pragma omp (in C/C++)
Parallel Loops
Each thread executes part of the loop
The number of iterations is statically assigned to the threads upon
entry to the loop
Number of iterations cannot be changed during execution
Implicit BARRIER at the end of the loop
High efficiency
Parallel Loop Example (Fortran)
!$OMP PARALLEL
!$OMP DO
DO i = 1, n
   a(i) = b(i) + c(i)
END DO
!$OMP END DO
!$OMP END PARALLEL


Parallel Sections
Multiple independent sections can be executed concurrently by
separate threads
Each parallel section is assigned to a specific thread which executes
the work from start to finish
Implicit BARRIER at the end of each SECTION
Nested parallel sections are possible, but impractical
Parallel Sections Example (Fortran)
!$OMP PARALLEL SECTIONS
!$OMP SECTION
DO i = 1, n
   a(i) = b(i) + c(i)
END DO
!$OMP SECTION
DO i = 1, k
   d(i) = e(i) + f(i)
END DO
!$OMP END PARALLEL SECTIONS
Parallel Tasks and Workshares
Parallel Tasks
• Unstructured parallelism
• Irregular problems like:
• Recursive algorithms
• Unbounded loops

Parallel Workshares
• Fortran only
• Used to parallelize assignments
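
As an illustration of the task construct on a recursive algorithm, here is a minimal C
sketch (not from the original slides; the Fibonacci function is just a stand-in for any
recursive workload):

#include <stdio.h>
#include <omp.h>

/* Naive recursive Fibonacci, with each recursive call spawned as a task */
long fib(int n) {
   long x, y;
   if (n < 2) return n;
   #pragma omp task shared(x)
   x = fib(n - 1);
   #pragma omp task shared(y)
   y = fib(n - 2);
   #pragma omp taskwait   /* wait for both child tasks to finish */
   return x + y;
}

int main(void) {
   long result;
   #pragma omp parallel
   #pragma omp single     /* one thread builds the task tree; all threads execute tasks */
   result = fib(30);
   printf("fib(30) = %ld\n", result);
   return 0;
}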
Matrix-Vector Multiplication: OpenMP C
#include <stdio.h>
#include <stdlib.h>

void Omp_mat_vect(int m, int n, double * restrict a,
                  double * restrict b, double * restrict c) {
   int i, j;
#  pragma omp parallel for default(none) \
      shared(m, n, a, b, c) private(i, j)
   for (i = 0; i < m; i++) {
      a[i] = 0.0;
      for (j = 0; j < n; j++)
         a[i] += b[i*n+j] * c[j];
   }  /* End of omp parallel for */
}
Matrix-Vector Multiplication: OpenMP Fortran
subroutine Pth_mat_vect(m, n, a, b, c)
   implicit none
   integer(kind=4) :: m, n
   real(kind=8) :: a(1:m), b(1:m, 1:n), c(1:n)
   integer :: i, j
!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP SHARED(m, n, a, b, c) PRIVATE(i, j)
   do i = 1, m
      a(i) = 0.0
      do j = 1, n
         a(i) = a(i) + b(i, j) * c(j)
      end do
   end do
!$OMP END PARALLEL DO
   return
end subroutine Pth_mat_vect
OpenMP Errors
Variable Scoping: Which are shared and which are private
Sequential I/O in a parallel region can lead to an unpredictable order
False sharing: two or more processors access different variables in
the same cache line (Performance)
Race Conditions
Deadlock
Summation Example
What is wrong with the following code?

sum = 0.0
!$OMP PARALLEL DO
DO i = 1, n
   sum = sum + a(i)
END DO
!$OMP END PARALLEL DO


Summation Example Correction
sum = 0.0
!$OMP PARALLEL DO REDUCTION(+:sum)
DO i = 1, n
   sum = sum + a(i)
END DO
!$OMP END PARALLEL DO

The original loop has a race condition: every thread updates the shared
variable sum. The REDUCTION clause gives each thread a private copy of
sum and combines the copies at the end of the loop.


Distributed Memory Parallelism
Message Passing
Communication on distributed memory systems is a significant
aspect of performance and correctness
Messages are relatively slow, with startup times (latency) taking
thousands of cycles
Once message passing has started, the additional time per byte
(bandwidth) is relatively small
Performance on Gardner
Intel Xeon E5-2683 Processor (Haswell)
Processor speed: 2,000 cycles per microsecond (µsec)
16 FLOPs/cycle: 32,000 FLOPs per µsec
MPI message latency = ~2.5 µsec or 80,000 FLOPs
MPI message bandwidth = ~7,000 bytes/µsec = 4.57 FLOPs/byte
Reducing Latency
Reduce the number of messages by mapping communicating entities
onto the same processor
Combine messages having the same sender and destination
If processor A has data needed by processor B, have A send it to B,
rather than waiting for B to request it. Processor A should send as
soon as the data is ready; processor B should read it as late as
possible to increase the probability that the data has arrived
Other Issues With Message Passing
Network Congestion
Deadlock
• Blocking: a processor cannot proceed until the message is finished
• With blocking communication, you may reach a point where no processor can
proceed
• Non-blocking communication: easiest way to prevent deadlock; processors
can send and proceed before the receive is finished
Message Passing Interface (MPI)
International Standard
MPI 1.0 released in 1994
Current version is 3.1 (June 2015)
MPI 4.0 standard in the works
Available on all parallel systems
Interfaces in C/C++ and Fortran
Bindings for MATLAB, Python, R, and Java
Works on both distributed memory and shared memory hardware
Hundreds of functions, but you will need ~6-10 to get started
MPI Basics
Two common MPI functions:
• MPI_SEND() - to send a message
• MPI_RECV() - to receive a message
They function like write and read statements
Both are blocking operations
However, a system buffer is used that allows small messages to be
non-blocking, but large messages will be blocking
The system buffer is based on the MPI implementation, not the
standard
Blocking communication may lead to deadlocks
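
A minimal sketch (not from the original slides) of how a blocking send and receive pair
up between two ranks:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
   int my_rank, value, tag = 0;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank == 0) {
      value = 42;
      /* Blocking send: returns once the send buffer can be reused */
      MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
   } else if (my_rank == 1) {
      /* Blocking receive: returns once the message has arrived */
      MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
      printf("Rank 1 received %d\n", value);
   }

   MPI_Finalize();
   return 0;
}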
Small Messages
Large Messages
Deadlocks

Source Avoidance
Non-Blocking Communication
Non-blocking operations
• MPI_ISEND()
• MPI_IRECV()
• MPI_WAIT()
The user can check for data at a later stage in the program without
waiting
• MPI_TEST()
Non-blocking operations will perform better than blocking operations
Possible to overlap communication with computation
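
A minimal sketch (not from the original slides) of the non-blocking calls, with room for
computation between posting the communication and waiting on it:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
   int my_rank, tag = 0;
   float value = 0.0f;
   MPI_Request request;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank == 0) {
      /* Post the receive early, then wait only when the data is needed */
      MPI_Irecv(&value, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag,
                MPI_COMM_WORLD, &request);
      /* ... useful local computation could overlap with the transfer here ... */
      MPI_Wait(&request, &status);
      printf("Rank 0 received %f\n", value);
   } else if (my_rank == 1) {
      value = 3.14f;
      MPI_Isend(&value, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &request);
      /* The send buffer must not be reused until the send completes */
      MPI_Wait(&request, &status);
   }

   MPI_Finalize();
   return 0;
}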
MPI Program Template
#include "mpi.h"
* IniCalize MPI */
MPI_Init(&argc, &argv);
* Allow each processor to determine its role */
nt num_processor, my_rank;
MPI_Comm_size(MPI_COMM_WORLD, &num_processor);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
* Send and receive some stuff here */

* Finalize */
MPI_Finalize()
MPI Summation
MPI Summation
/* Send and receive some stuff here */
if (my_rank == 0) {
   sum = 0.0;
   for (source = 1; source < num_procs; source++) {
      MPI_Recv(&value, 1, MPI_FLOAT, source, tag,
               MPI_COMM_WORLD, &status);
      sum += value;
   }
} else {
   MPI_Send(&value, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD);
}
A More Efficient Receive
In the previous example, rank 0 received the messages in order based
on ranks.
Thus, if processor 2 was delayed, then processor 0 was also delayed
Make the following modification:

MPI_Recv(&value, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag,
         MPI_COMM_WORLD, &status);

Now processor 0 can process messages as soon as they arrive


MPI Reduction
Like OpenMP, MPI also has reduction operations that can be used for
tasks such as summation
Reduction is a type of collective communication

MPI_Reduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM,
           0, MPI_COMM_WORLD);

Reduction operations:
• MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD
• MPI_LAND, MPI_LOR
Collective Communication
Most frequently invoked MPI routines after send and receive
Improve clarity, run faster than send and receive, and reduce the
likelihood of making an error
Examples
• Broadcast (MPI_Bcast)
• Scatter (MPI_Scatter)
• Gather (MPI_Gather)
• All Gather (MPI_Allgather)
Broadcast
Scatter
Gather
All Gather
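
A minimal sketch (not from the original slides) of scatter and gather in C: the root
distributes one value to each rank, every rank works on its piece, and the root collects
the results:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
   int my_rank, num_procs, i;
   float *data = NULL, *results = NULL, local;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

   if (my_rank == 0) {
      data    = malloc(num_procs * sizeof(float));
      results = malloc(num_procs * sizeof(float));
      for (i = 0; i < num_procs; i++) data[i] = (float) i;
   }

   /* One element per rank goes out, the squared value comes back */
   MPI_Scatter(data, 1, MPI_FLOAT, &local, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
   local = local * local;
   MPI_Gather(&local, 1, MPI_FLOAT, results, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

   if (my_rank == 0) {
      for (i = 0; i < num_procs; i++) printf("%.1f ", results[i]);
      printf("\n");
      free(data); free(results);
   }

   MPI_Finalize();
   return 0;
}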
MPI Data Types
Primitive data types (correspond to those found in the underlying
language)
• Fortran
• MPI_CHARACTER
• MPI_INTEGER
• MPI_REAL, MPI_DOUBLE_PRECISION
• MPI_LOGICAL
• C/C++ (include many more variants than Fortran)
• MPI_CHAR
• MPI_INT, MPI_UNSIGNED, MPI_LONG
• MPI_FLOAT, MPI_DOUBLE
• MPI_C_BOOL
MPI Data Types
Derived data types are also possible
• Vector – data separated by a constant stride
• Contiguous – data separated by a stride of 1
• Struct – mixed types
• Indexed – array of indices
MPI Vector
MPI Contiguous
MPI Struct
MPI Indexed
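
A minimal sketch (not from the original slides) of the Vector derived type: describing a
column of a row-major matrix as elements separated by a constant stride, so it can be
sent as a single message:

#include "mpi.h"

#define N 4

int main(int argc, char* argv[]) {
   double A[N][N];
   int my_rank, i, j, tag = 0;
   MPI_Datatype column_type;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   /* A column of a row-major N x N matrix: N blocks of 1 element, stride N */
   MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_type);
   MPI_Type_commit(&column_type);

   if (my_rank == 0) {
      for (i = 0; i < N; i++)
         for (j = 0; j < N; j++)
            A[i][j] = i * N + j;
      /* Send column 2 of A as one message */
      MPI_Send(&A[0][2], 1, column_type, 1, tag, MPI_COMM_WORLD);
   } else if (my_rank == 1) {
      /* Receive it directly into column 2 of this rank's matrix */
      MPI_Recv(&A[0][2], 1, column_type, 0, tag, MPI_COMM_WORLD, &status);
   }

   MPI_Type_free(&column_type);
   MPI_Finalize();
   return 0;
}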
MPI Synchronization
Implicit synchronization
• Blocking communication
• Collective communication
Explicit synchronization
• MPI_Wait
• MPI_Waitany
• MPI_Barrier
Remember, synchronization can hinder performance
Additional MPI Features
Virtual topologies
User-created communicators
User-specified derived data types
Parallel I/O
One-sided communication (MPI_Put, MPI_Get)
Non-blocking collective communication
Remote Memory Access
Vectorization
Accelerators
Principle: Split an operation into independent parts and execute the
parts concurrently in an accelerator (SIMD)

DO I = 1, 1000
   A(I) = B(I) + C(I)
END DO

Accelerators like Graphics Processing Units are specialized for SIMD
operations
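
On a CPU, one portable way to request this kind of SIMD execution on the vector units
(e.g., AVX) is OpenMP's simd directive (OpenMP 4.0 and later). A minimal C sketch of the
same loop, not from the original slides:

#include <stdio.h>

#define N 1000

int main(void) {
   float a[N], b[N], c[N];
   int i;

   for (i = 0; i < N; i++) { b[i] = (float) i; c[i] = 2.0f * i; }

   /* Each iteration is independent, so groups of iterations can be
      executed as single vector (SIMD) instructions */
   #pragma omp simd
   for (i = 0; i < N; i++)
      a[i] = b[i] + c[i];

   printf("a[10] = %f\n", a[10]);
   return 0;
}

Compile with OpenMP SIMD support enabled (e.g., gcc -fopenmp-simd).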
Popularity
In 2012, 13% of the performance share of the Top 500
Supercomputers was provided through accelerators
Today, the performance share for accelerators is ~38%
Trend: Build diverse computing platforms with multiprocessors, SMP
nodes, and accelerators
Caveat: Difficult to use heterogeneous hardware effectively
Modern Accelerators

NVidia Volta:
• 7.5 TFLOPs double precision
• 30 TFLOPs half precision
• 5120 cores
• 16 GB RAM
• 900 GB/s memory bandwidth

Intel Xeon Phi Knight's Landing:
• 3 TFLOPs double precision
• 72 cores (288 threads)
• 16 GB RAM
• 384 GB/s memory bandwidth
• Includes both a scalar and a vector unit
NVidia Architecture
SIMT: Single Instruction,
Multiple Thread (similar to
SIMD)
Thread blocks are executed in
warps of 32 threads
Performance is dependent on
memory alignment and
eliminating shared memory
access
Programming GPUs
Low-Level Programming
• Proprietary – NVidia's CUDA
• Portable – OpenCL

High-Level Programming
• OpenACC
• OpenMP 4.5
• Allows for a single code base that may be compiled for both multicore CPUs and
GPUs
CUDA Dot Product
int main(void) {
   int *a, *b, *c;              // CPU copies
   int *dev_a, *dev_b, *dev_c;  // GPU copies

   /* Allocate memory on the GPU */
   int arraysize = n * sizeof(int);
   cudaMalloc((void**)&dev_a, arraysize);
   cudaMalloc((void**)&dev_b, arraysize);
   cudaMalloc((void**)&dev_c, sizeof(int));

   /* Use malloc for a, b, c and initialize a and b */

   /* Move a and b to the GPU */
   cudaMemcpy(dev_a, a, arraysize, cudaMemcpyHostToDevice);
   cudaMemcpy(dev_b, b, arraysize, cudaMemcpyHostToDevice);

   /* Launch the GPU kernel */
   dot<<<n/threads_per_block, threads_per_block>>>(dev_a, dev_b, dev_c);

   /* Copy the result back to the CPU */
   cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

   return 0;
}
CUDA Dot Product
__global__ void dot(int *a, int *b, int *c) {
   __shared__ int temp[threads_per_block];

   /* Convert the thread/block index to a global index */
   int index = threadIdx.x + blockIdx.x * blockDim.x;
   temp[threadIdx.x] = a[index] * b[index];

   /* Avoid a race condition on the sum within the block */
   __syncthreads();

   if (threadIdx.x == 0) {
      int sum = 0;
      for (int i = 0; i < threads_per_block; i++)
         sum += temp[i];
      /* Add to the global sum */
      atomicAdd(c, sum);
   }
}
OpenACC and OpenMP
Share the same model of computation
Host-centric – host device offloads code regions and data to
accelerators
Device – has independent memory and multiple threads
Mapping clause – Defines the relationship between memory on the
host device and memory on the accelerator
OpenACC Matrix Multiplication
int main(int argc, char** argv) {
   int n = atoi(argv[1]);
   float *a = (float *)malloc(sizeof(float) * n * n);
   float *b = (float *)malloc(sizeof(float) * n * n);
   float *c = (float *)malloc(sizeof(float) * n * n);

#pragma acc data copyin(a[0:n*n], b[0:n*n]), copy(c[0:n*n])
   {
      int i, j, k;

#pragma acc kernels
      for (i = 0; i < n; i++) {
         for (j = 0; j < n; j++) {
            a[i*n+j] = (float) i + j;
            b[i*n+j] = (float) i - j;
            c[i*n+j] = 0.0;
         }
      }
OpenACC Matrix Multiplication
#pragma acc kernels
      for (i = 0; i < n; i++) {
         for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
               c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
         }
      }
   }  /* end of acc data region */
   return 0;
}
Speedup?
Some problems can achieve a speedup of 10x-100x by offloading
some of the work to a GPU (e.g., Linpack, matrix multiplication)
In reality:
• Only a small fraction of applications can utilize a GPU effectively
• The average speedup of those applications is ~2x-4x
• Optimizing your code using multicore technologies is probably a better use of
your time
Problem Areas
Branching can cut your performance by 50% since instructions need
to be issued for both branches
Memory bandwidth on GPUs is poor
• Need to make sure your operands are used more than once
• For the NVidia Volta, all operands need to be used 67x to achieve peak FLOPs
Accelerator Trends
Intel has canceled Knight's Landing and Knight's Hill; future Xeon Phi status
is unknown
GPUs are becoming less rigidly SIMD and including better memory
bandwidth
More programmers moving from low-level programming like CUDA to
high-level directives
PGI Compiler was purchased by NVidia in 2013; cut support for OpenCL
OpenACC and OpenMP still have not merged
Efficiency of GPUs is still problematic and the programming model is a
moving target
New Motivation: Deep Learning
Hybrid Computing
Current Trend
Accelerator-based machines are grabbing the high rankings on the
Top 500, but clusters with standard cores are still most important
economically
Economics dictates distributed memory systems of shared memory
boards with multicore commodity chips, perhaps with some form of
accelerator distributed sparingly
MPI+X
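
MPI+X most commonly means MPI between nodes plus OpenMP threads within a node. A
minimal hybrid sketch (not from the original slides):

#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
   int provided, my_rank;

   /* Request thread support: FUNNELED means only the master thread makes MPI calls */
   MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   /* MPI distributes ranks across nodes; OpenMP uses the cores within each node */
   #pragma omp parallel
   {
      printf("Rank %d, thread %d of %d\n",
             my_rank, omp_get_thread_num(), omp_get_num_threads());
   }

   MPI_Finalize();
   return 0;
}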
Conclusion
Topics for Possible Future Workshops
Scientific Computing (C++ or Fortran or Both)
Parallel Computing Theory (e.g., load balancing, measuring performance)
Embarrassingly Parallel
pthreads
OpenMP
MPI
OpenACC
CUDA
Parallel Debugging/Profiling
