OpenMP Presentation


OpenMP

Basic Architecture

Older processors had only one CPU core to execute instructions. Modern processors have 4 or more independent CPU cores to execute instructions.

[Diagram: a single-core processor and a 4-core processor, each connected to MEMORY]
Basic Architecture (EOS)
An EOS node consists of two processors (4 CPU cores each) and a total memory of 24GB.

[Diagram: NODE containing two 4-core processors sharing 24GB MEMORY]
Sequential Program

When you run a sequential program:

  Instructions are executed on 1 core
  The other cores are idle

[Diagram: instructions executing on one core of a 4-core processor; the other cores idle]

This is a waste of the available resources. We want all cores to be used to execute the program.

HOW?
What is OpenMP?

De facto standard API for writing shared-memory parallel applications in C, C++, and Fortran

OpenMP API consists of:

Compiler Directives
Runtime subroutines/functions
Environment variables
HelloWorld
#include <iostream>
#include <omp.h>

int main() {
  #pragma omp parallel          // OMP compiler directive
  {
    std::cout << "Hello World\n";
  }
  return 0;
}

Compile:
  intel: icc -openmp -o hi.x hello.cpp
  pgi:   pgcpp -mp -o hi.x hello.cpp
  gnu:   g++ -fopenmp -o hi.x hello.cpp

Run:
  export OMP_NUM_THREADS=4      // OMP environment variable
  ./hi.x


C/C++

The C/C++ directive format is:

#pragma omp parallel [clauses]
{
  :
}

(the structured block starts on a new line after the directive)

All OpenMP directives follow this format.
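
As a small illustration (a sketch that is not part of the original slides), a clause such as num_threads can be appended to the directive:

#include <iostream>

int main() {
    // num_threads(2) is an example of a clause: it requests a team of 2 threads.
    #pragma omp parallel num_threads(2)
    {
        std::cout << "inside a parallel region\n";
    }
    return 0;
}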


At run time

Program Hello ... end

The runtime creates 3 additional worker threads at the start of the OpenMP region.

In the OMP region, every thread (Thread #0 through Thread #3) executes all the instructions in the OMP block.

[Diagram: four threads, each executing Print *,hello]
Fork/Join Model

OpenMP follows the fork/join model:

OpenMP programs start with a single thread: the master thread (Thread #0)
At the start of a parallel region the master creates a team of parallel worker threads (FORK)
Statements in the parallel block are executed in parallel by every thread
At the end of the parallel region, all threads synchronize and join the master thread (JOIN)
This is an implicit barrier. We will discuss synchronization later
[Diagram: Thread #0 FORKs into a team of Threads #0-#3 at the start of the parallel region; the threads JOIN back into Thread #0 at the end]
OpenMP Threads versus Cores

What are threads and cores, and how do they relate?

A thread is an independent sequence of execution of program code
  A block of code with one entry and one exit
  For our purposes it is a more abstract concept, unrelated to cores/CPUs

OpenMP threads are mapped onto physical cores
  It is possible to map more than 1 thread onto a core
  In practice it is best to have a one-to-one mapping.
More About OpenMP Threads
The number of OpenMP threads can be set using:

  The environment variable OMP_NUM_THREADS
  The runtime function omp_set_num_threads(n)

Other useful functions to get information about threads:

  Runtime function omp_get_num_threads()
    Returns the number of threads in the parallel region
    Returns 1 if called outside a parallel region
  Runtime function omp_get_thread_num()
    Returns the id of the thread in the team
    Value between [0,n-1] // where n = #threads
    The master thread always has id 0
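
A minimal usage sketch of these functions (not part of the original slides):

#include <iostream>
#include <omp.h>

int main() {
    omp_set_num_threads(4);                        // same effect as OMP_NUM_THREADS=4

    std::cout << omp_get_num_threads() << "\n";    // prints 1: we are outside a parallel region

    #pragma omp parallel
    {
        int id = omp_get_thread_num();             // id in [0,3]; the master thread has id 0
        int n  = omp_get_num_threads();            // 4 inside this parallel region
        #pragma omp critical                       // avoid interleaved output
        std::cout << "hello from thread " << id << " out of " << n << "\n";
    }
    return 0;
}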
HelloThreads Exercise

Extend the program below to make it parallel, so that every thread prints out its id and the total number of threads.

C++

int main() {
  int threads = 100;
  int id = 100;

  std::cout << "hello from thread: " << id << " out of ";
  std::cout << threads << "\n";

  return 0;
}
Shared Memory Model
The memory is (logically) shared by all the CPUs.

[Diagram: Threads 0-3 all accessing the shared variables THREADS and ID in shared memory]

All threads try to access the same variable (possibly at the same time). This can lead to a race condition. Different runs of the same program might give different results because of race conditions.
Shared and Private Variables

OpenMP provides a way to declare variables private or shared within an OpenMP block. This is done using the following OpenMP clauses:

SHARED ( list )
  All variables in list will be considered shared.
  Every OpenMP thread has access to all these variables.

PRIVATE ( list )
  Every OpenMP thread will have its own private copy of the variables in list.
  No other OpenMP thread has access to this private copy.

Example: #pragma omp parallel private(a,b,c) (C/C++)

By default most variables are considered shared in OpenMP. Exceptions include loop index variables (C/C++) and variables declared inside the parallel region (C/C++).
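
A minimal sketch of the effect of these clauses (not from the slides; the variable names are illustrative):

#include <iostream>
#include <omp.h>

int main() {
    int shared_counter = 0;   // shared: all threads see the same variable
    int scratch = -1;         // listed as private below: each thread gets its own copy

    #pragma omp parallel private(scratch) shared(shared_counter)
    {
        scratch = omp_get_thread_num();   // safe: every thread writes only its own copy
        #pragma omp critical
        shared_counter += scratch;        // shared: must be protected against races
    }

    std::cout << "sum of thread ids: " << shared_counter << "\n";
    return 0;
}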
HelloThreads Exercise (2)

Adapt the program below such that all relevant variables are correctly classified as OpenMP private/shared.

C++

int threads = 100;
int id = 100;

#pragma omp parallel
{
  threads = omp_get_num_threads();
  id = omp_get_thread_num();
  std::cout << "hello from thread: " << id << " out of ";
  std::cout << threads << "\n";
}
return 0;
Revisit Shared Memory Model
The memory is (logically) shared by all the CPUs.
There is also private memory for every OpenMP thread.

[Diagram: Threads 0-3, each with its own private copies of THREADS and ID, plus the shared variables THREADS and ID in shared memory]

The shared variables threads and id still exist, but every thread also has a private copy of the variables threads and id. There will not be any race condition for these private variables.
Work Sharing (manual approach)
So far we have only discussed parallel regions where every thread did the same work. That is not very useful. What we want is to share the work among all threads so we can solve our problems faster.

#pragma omp parallel for shared(sum) num_threads(4)
for (i = 0; i < 1024; i++)
{
  #pragma omp critical
  sum = sum + i;
}
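
A complete, runnable version of this fragment might look as follows (a sketch only; the declarations and output around the loop are assumptions not shown on the slide):

#include <iostream>
#include <omp.h>

int main() {
    long sum = 0;

    // The 1024 iterations are shared among the 4 threads of the team.
    #pragma omp parallel for shared(sum) num_threads(4)
    for (int i = 0; i < 1024; i++)
    {
        // critical serializes the update so the shared sum is not corrupted by races
        #pragma omp critical
        sum = sum + i;
    }

    std::cout << "sum = " << sum << "\n";   // expected: 1023*1024/2 = 523776
    return 0;
}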
Exercise

Create a program that computes a simple matrix multiplication in C/C++. Use OpenMP directives to make it run in parallel.
Reductions
A common type of computation is something like:

Fortran:
DO i=1,10
  a = a op expr
ENDDO

C/C++:
for (int i=0;i<10;++i) {
  a = a op expr
}

a = a op expr is called a reduction operation. Obviously, there is a loop-carried flow dependence for variable 'a' which prohibits parallelization.

For these kinds of cases OpenMP provides the REDUCTION(op:list) clause.

This clause can be applied when the following restrictions are met:
  a is a scalar variable in the list
  expr is a scalar expression that does not reference a
  Only certain kinds of op are allowed; e.g. +, *, -
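
For illustration, a minimal C++ sketch of the REDUCTION clause applied to such a loop (not from the slides):

#include <iostream>
#include <omp.h>

int main() {
    long sum = 0;

    // reduction(+:sum) gives each thread a private copy of sum initialized to 0;
    // at the end of the loop the private copies are combined with '+' into the shared sum.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1024; i++)
        sum = sum + i;

    std::cout << "sum = " << sum << "\n";   // 523776; same result as the critical-section
                                            // version, typically with less overhead
    return 0;
}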


Reductions Exercise

Create a program that computes the sum of all the elements in an array A (C/C++) or a program that finds the largest number in an array A (Fortran). Use OpenMP directives to make it run in parallel.
TIP: OpenMP Scoping Rules

So far we have had all the directives nested within the same routine (!$OMP PARALLEL outermost). However, OpenMP provides more flexible scoping rules. E.g. it is allowed to have a routine containing only an !$OMP DO; in this case we call the !$OMP DO an orphaned directive.

Note: there are some rules (e.g. when an !$OMP DO directive is encountered, the program should be in a parallel region).
How does OMP schedule iterations?
Although the OpenMP standard does not specify how a loop should be partitioned, most compilers split the loop into N/p chunks (N = #iterations, p = #threads) by default. This is called a static schedule (with chunk size N/p).

For example, suppose we have a loop with 1000 iterations and 4 OMP threads. The loop is partitioned as follows:

  THREAD 1: iterations    1 -  250
  THREAD 2: iterations  251 -  500
  THREAD 3: iterations  501 -  750
  THREAD 4: iterations  751 - 1000


Static Schedule
To explicitly tell the compiler to use a static schedule (or a different schedule, as we will see later) OpenMP provides the SCHEDULE clause:

!$OMP DO SCHEDULE(STATIC,n) // (n is the chunk size)

For example, suppose we set the chunk size n=10 for the loop of 1000 iterations. Chunks of 10 iterations (1-10, 11-20, 21-30, ...) are handed out to threads 1,2,3,4,1,2,3,4,... in round-robin fashion.

NOTE: example static_schedule, vary chunk size


Issues with Static schedule
With static scheduling the number of iterations is evenly distributed among all OpenMP threads (i.e. every thread will be assigned a similar number of iterations). This is not always the best way to partition. Why is this?

[Diagram: time per iteration for iterations 0-7, showing that different iterations can take very different amounts of time on their assigned threads]


Dynamic Schedule
With a dynamic schedule new chunks are assigned to threads as they become available. OpenMP provides two dynamic schedules:

!$OMP DO SCHEDULE(DYNAMIC,n) // n is the chunk size
  Loop iterations are divided into pieces of size chunk. When a thread finishes one chunk, it is dynamically assigned another.

!$OMP DO SCHEDULE(GUIDED,n) // n is the chunk size
  Similar to DYNAMIC, but the chunk size is relative to the number of iterations left.

Keep in mind: although dynamic scheduling might be the preferred choice to prevent load imbalance in some situations, there is a significant overhead involved compared to static scheduling.
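
For illustration, a minimal C++ sketch of how the schedule clause is written in C/C++ (not from the slides; the work() function is a placeholder whose cost varies per iteration):

#include <cstdio>
#include <omp.h>

// Placeholder workload whose cost grows with i (an assumption for illustration).
static void work(int i) {
    volatile double x = 0;
    for (int k = 0; k < i; k++) x += k;
}

int main() {
    const int n = 1000;

    // static: iterations split into chunks of 10 and handed out round-robin up front
    #pragma omp parallel for schedule(static, 10)
    for (int i = 0; i < n; i++) work(i);

    // dynamic: a thread grabs the next chunk of 10 whenever it finishes its previous one
    #pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < n; i++) work(i);

    // guided: like dynamic, but chunk sizes shrink as fewer iterations remain (minimum 10)
    #pragma omp parallel for schedule(guided, 10)
    for (int i = 0; i < n; i++) work(i);

    std::printf("done\n");
    return 0;
}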
Data Dependencies

Not all loops can be parallelized. Before adding OpenMP directives we need to check for any dependencies.

We categorize three types of dependencies:

  Flow dependence: Read after Write (RAW)
  Anti dependence: Write after Read (WAR)
  Output dependence: Write after Write (WAW)

FLOW:          ANTI:          OUTPUT:
X = 21         PRINT *, X     X = 21
PRINT *, X     X = 21         X = 21
Data Dependencies (2)
For our purpose (OpenMP parallel loops) we only care about loop-carried dependencies (dependencies between instructions in different iterations of the loop).

What are the dependencies in the following loop?

S1: DO i=1,10
S2:   B(i) = temp
S3:   A(i+1) = B(i+1)
S4:   temp = A(i)
S5: ENDDO
Data Dependencies (2)
For our purpose (OpenMP parallel loops) we only care about loop-carried dependencies (dependencies between instructions in different iterations of the loop).

What are the dependencies in the following loop?

S1: DO i=1,10
S2:   B(i) = temp
S3:   A(i+1) = B(i+1)
S4:   temp = A(i)
S5: ENDDO

1: S3 -> S2 anti (B)
2: S3 -> S4 flow (A)
3: S4 -> S2 flow (temp)
4: S4 -> S4 output (temp)

Sometimes it helps to unroll part of the loop to see the loop-carried dependencies more clearly:

iteration i:
S2: B(1) = temp
S3: A(2) = B(2)
S4: temp = A(1)

iteration i+1:
S2: B(2) = temp
S3: A(3) = B(3)
S4: temp = A(2)
Case Study: Jacobi

Implement a parallel version of the Jacobi algorithm using OpenMP. A sequential version is provided.
Data Dependencies (3)

Loop-carried anti- and output dependencies are not true dependencies (they are re-use of the same name) and in many cases can be resolved relatively easily.

Flow dependencies are true dependencies (there is a flow from a definition to its use) and in many cases cannot be removed easily. This might require rewriting the algorithm (if possible).
Resolving Anti/Output Deps

Use the PRIVATE clause:
  We already saw this in the hello_threads example

Rename variables (if possible):
  Example: in-place left shift

  Sequential (in place):
  DO i=1,n-1
    A(i) = A(i+1)
  ENDDO

  Renamed and parallelized:
  !$OMP PARALLEL DO
  DO i=1,n-1
    ANEW(i) = A(i+1)
  ENDDO
  !$OMP END PARALLEL DO

If it has to be in-place we can do it in two steps:
  !$OMP PARALLEL
  !$OMP DO
  DO i=1,n-1
    T(i) = A(i+1)
  ENDDO
  !$OMP END DO
  !$OMP DO
  DO i=1,n-1
    A(i) = T(i)
  ENDDO
  !$OMP END DO
  !$OMP END PARALLEL
More about shared/private vars

Besides the clauses described before, OpenMP provides some additional data-scope clauses that are very useful:

FIRSTPRIVATE ( list ):
  Same as PRIVATE, but every private copy of a variable 'x' will be initialized with the original value of 'x' (from before the omp region started).
LASTPRIVATE ( list ):
  Same as PRIVATE, but the private copies of the variables in list from the last iteration of the work-sharing construct will be copied back to the shared version. To be used with the !$OMP DO directive.
DEFAULT (SHARED | PRIVATE | FIRSTPRIVATE | LASTPRIVATE):
  Specifies the default scope for all variables in the omp region.

NOTE: example data scope
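
A minimal C++ sketch of FIRSTPRIVATE and LASTPRIVATE (an illustration, not from the slides):

#include <iostream>
#include <omp.h>

int main() {
    int offset = 10;   // firstprivate: each thread's private copy starts at 10
    int last   = -1;   // lastprivate: the value from the last iteration is copied back

    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < 8; i++) {
        last = i + offset;   // writes go to the private copies inside the loop
    }

    // After the loop, 'last' holds the value from the sequentially last iteration (7 + 10).
    std::cout << "last = " << last << "\n";
    return 0;
}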


Case Study: Removing Flow Deps

Y = prefix(X):  Y(1) = X(1);  Y(i) = Y(i-1) + X(i)

Example:  X = 1 1 1 1   ->   Y = 1 2 3 4

SEQUENTIAL:
Y[1] = X[1]
DO i=2,n,1
  Y[i] = Y[i-1] + X[i]
ENDDO

PARALLEL (naive attempt):
Y[1] = X[1]
!$OMP PARALLEL DO
DO i=2,n,1
  Y[i] = Y[i-1] + X[i]
ENDDO
!$OMP END PARALLEL DO

WHY does this naive parallel version not work?
Case Study: Removing Flow Deps
REWRITE THE ALGORITHM

X = 1 1 1 1 | 1 1 1 1 | 1 1 1 1 | 1 1 1 1

STEP 1: split X among the threads; every thread computes its own (partial) prefix sum:

    1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4

STEP 2: create an array T with T[1]=0, T[i] = X[(length/threads)*(i-1)], and perform a simple prefix sum on T
(this collects the last element from every thread (except the last) and performs a simple prefix sum):

    T = 0 4 4 4   ->   T = 0 4 8 12

STEP 3: every thread adds T[threadid] to all its elements:

    +0 | +4 | +8 | +12
    1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 13 14 15 16

STEP 4: Finished; we rewrote the prefix sum by removing the dependencies.


Prefix Sum Implementation

How to implement the algorithm on the previous slide?

Three separate steps:
  Steps 1 and 3 can be done in parallel
  Step 2 has to be done sequentially
  Step 1 has to be performed before step 2
  Step 2 has to be performed before step 3

NOTE: For illustration purposes we can assume the array length is a multiple of #threads

NOTE: exercise prefix
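
One possible OpenMP implementation of these steps in C++ (a sketch only, assuming the array length is a multiple of the number of threads; this is one way to approach the exercise, not the official solution):

#include <iostream>
#include <vector>
#include <omp.h>

// In-place prefix sum of x, parallelized as described in steps 1-3.
// Assumes x.size() is a multiple of the number of threads.
void parallel_prefix_sum(std::vector<long>& x) {
    int nthreads = 1;
    std::vector<long> t;                       // the T array from step 2

    #pragma omp parallel
    {
        #pragma omp single
        {
            nthreads = omp_get_num_threads();
            t.assign(nthreads, 0);             // T[0] = 0
        }                                      // implicit barrier: all threads see nthreads, t

        int  id  = omp_get_thread_num();
        long len = static_cast<long>(x.size()) / nthreads;
        long lo  = id * len;
        long hi  = lo + len;

        // STEP 1: every thread computes a partial prefix sum of its own block
        for (long i = lo + 1; i < hi; ++i) x[i] += x[i - 1];

        #pragma omp barrier                    // step 1 must finish before step 2

        // STEP 2: one thread collects the last element of every block (except the last)
        //         and performs a simple sequential prefix sum on T
        #pragma omp single
        {
            for (int p = 1; p < nthreads; ++p) t[p] = t[p - 1] + x[p * len - 1];
        }                                      // implicit barrier: step 2 before step 3

        // STEP 3: every thread adds T[threadid] to all its elements
        for (long i = lo; i < hi; ++i) x[i] += t[id];
    }
}

int main() {
    omp_set_num_threads(4);                    // 16 elements, 4 threads: length is a multiple
    std::vector<long> x(16, 1);                // the example from the slide: sixteen 1s
    parallel_prefix_sum(x);
    for (long v : x) std::cout << v << " ";    // expected: 1 2 3 ... 16
    std::cout << "\n";
    return 0;
}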


Case Study: Removing Flow Deps

This case study showed an example of an algorithm with real (flow) dependencies.

  Sometimes we can rewrite the algorithm to run in parallel
  Most of the time this is not trivial
  The speedup is often much less impressive
OpenMP Sections
Suppose you have blocks of code that can be executed in parallel (i.e. no dependencies). To execute them in parallel OpenMP provides the !$OMP SECTIONS directive. The syntax will look something like:

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
  // WORK 1
!$OMP SECTION
  // WORK 2
!$OMP END SECTIONS
!$OMP END PARALLEL

Combined version:

!$OMP PARALLEL SECTIONS
!$OMP SECTION
  // WORK 1
!$OMP SECTION
  // WORK 2
!$OMP END PARALLEL SECTIONS

This will execute WORK 1 and WORK 2 in parallel.

Exercise

Create a program that computes the adjusted prefix sum below for 4 different arrays and then adds all of them up. Use OpenMP directives to make it run in parallel.

A(1) = A(1)
A(2) = A(2) + A(1)
A(i) = A(i-2) + A(i-1)
