OpenMP Presentation
Basic Architecture
[Diagram: one processor with a single core, connected to MEMORY]
Basic Architecture
[Diagram: two processors, each with a core, each connected to its own MEMORY]
Basic Architecture (EOS)
An EOS node consists of two processors (4 CPU cores each) and a total of 24 GB of memory.
[Diagram: a NODE containing two processors sharing 24 GB of MEMORY]
Sequential Program
[Diagram: when you run a sequential program, its instructions execute on a single core; both cores share MEMORY]
Sequential Program
[Diagram: the sequential program's instructions again execute on a single core]
This is a waste of the available resources. We want all cores to be used to execute the program.
HOW?
What is OpenMP?
Compiler Directives
Runtime subroutines/functions
Environment variables
HelloWorld
#include <iostream>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    std::cout << "Hello World\n";
  }
  return 0;
}

export OMP_NUM_THREADS=4
./hi.x
HelloWorld
#include <iostream>
#include <omp.h>

int main() {
  #pragma omp parallel        <-- OMP COMPILER DIRECTIVE
  {
    std::cout << "Hello World\n";
  }
  return 0;
}

export OMP_NUM_THREADS=4     <-- OMP ENVIRONMENT VARIABLE
./hi.x
Fork/Join Model
OpenMP programs start with a single thread: the master thread (thread #0)
At the start of a parallel region, the master thread creates a team of parallel worker threads (FORK)
The statements in the parallel block are executed in parallel by every thread
At the end of the parallel region, all threads synchronize and join the master thread (JOIN)
This join is an implicit barrier; synchronization is discussed later
Fork/Join Model
[Diagram: thread #0 FORKs into a team of threads #0-#3, which execute in parallel and then JOIN back into thread #0]
OpenMP Threads versus Cores
C++
int main() {
  int threads = 100;
  int id = 100;
  return 0;
}
Shared Memory Model
The memory is (logically) shared by all the CPUs.

SHARED ( list )
All variables in list will be considered shared.
Every OpenMP thread has access to all of these variables.

PRIVATE ( list )
Every OpenMP thread will have its own private copy of the variables in list.
No other OpenMP thread has access to this private copy.

[Diagram: variables THREADS and ID residing in shared memory]
Work Sharing (manual approach)
So far we have only discussed parallel regions in which every thread does the same work. That is not very useful: what we want is to share the work among all threads so we can solve our problems faster, e.g. divide the iterations of a loop such as:

for (i = 0; i < n; i++) {
  sum = sum + i;
}
Exercise
So far we have had all the directives nested within the same routine (with !$OMP PARALLEL outermost). However, OpenMP provides more flexible scoping rules: it is allowed, for example, to have a routine containing only an !$OMP DO. In this case we call the !$OMP DO an orphaned directive.
For example, suppose we have a loop with 1000 iterations and 4 OpenMP threads. The loop is partitioned as follows:

[Diagram: the 1000 iterations are dealt out to threads 1 2 3 4 round-robin (1 2 3 4, 1 2 3 4, ..., 1 2 3 4), with the time per iteration shown underneath]
S1: DO i = 1, 10
S2:   B(i) = temp
S3:   A(i+1) = B(i+1)
S4:   temp = A(i)
S5: ENDDO
Data Dependencies (2)
For our purpose (OpenMP parallel loops) we only care about loop-carried dependencies: dependencies between instructions in different iterations of the loop.
FIRSTPRIVATE ( list ):
Same as PRIVATE, but every private copy of a variable 'x' is initialized with the original value of 'x' (from before the OpenMP region started).

LASTPRIVATE ( list ):
Same as PRIVATE, but the private copy from the (sequentially) last iteration of the work-sharing construct is copied back to the shared variable. To be used with the !$OMP DO directive.

DEFAULT ( SHARED | PRIVATE | FIRSTPRIVATE | NONE ):
Specifies the default scope for all variables in the OpenMP region.
SEQUENTIAL

Y(1) = X(1)
DO i = 2, n
  Y(i) = Y(i-1) + X(i)
ENDDO
Case Study: Removing Flow Deps
Y = prefix(X): Y(1) = X(1); Y(i) = Y(i-1) + X(i)

X = 1 1 1 1  →  Y = 1 2 3 4

[Diagram: sequential vs. parallel execution] WHY can this loop not simply be run in parallel? Because Y(i) reads Y(i-1), which is written in the previous iteration: a loop-carried flow dependence.
Case Study: Removing Flow Deps
REWRITE THE ALGORITHM

X = 1 1 1 1 | 1 1 1 1 | 1 1 1 1 | 1 1 1 1

STEP 1: split X among the threads; every thread computes its own (partial) prefix sum:
1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4

STEP 2: create an array T with T[1] = 0 and T[i] = X[(length/threads)*(i-1)] (the last element of thread i-1's slice), then perform a simple prefix sum on T (this collects the last element from every thread except the last):
T = 0 4 4 4  →  prefix(T) = 0 4 8 12

STEP 3: every thread adds T[threadid] to all of its elements:
+0 | +4 | +8 | +12
1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 13 14 15 16

NOTE: for illustration purposes we assume the array length is a multiple of the number of threads.
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
// WORK 1
!$OMP SECTION
// WORK 2
!$OMP END SECTIONS
!$OMP END PARALLEL

Combined version:

!$OMP PARALLEL SECTIONS
!$OMP SECTION
// WORK 1
!$OMP SECTION
// WORK 2
!$OMP END PARALLEL SECTIONS
A(1) = A(1)
A(2) = A(2) + A(1)
A(i) = A(i-2) + A(i-1)