Lab1 PAR
Fall 2024–2025
Contents

1 Experimental setup
1.1 Accessing the boada cluster
1.2 Node architecture and memory
1.3 Execution modes: interactive vs. queued
1.4 Serial compilation and execution
1.5 Compilation and execution of OpenMP programs
1.6 Strong vs. weak scalability
1.7 Discussion at the laboratory
B Modelfactors
B.1 Modelfactors script mfLite.py: extracting information
B.2 Computing the Modelfactors metrics for OpenMP
D.2 Hint OpenMP taskloop/in taskloop construction
D.3 Hint OpenMP tasking/explicit tasks duration
D.4 More hints
Objectives
Chapter 1
Experimental setup
The objective of this laboratory session is to familiarise yourself with the hardware and software
environments that you will be using during this semester to carry out all the laboratory assignments
in PAR.
In this course you are going to use nodes boada-6 to boada-8 interactively and nodes boada-11
to boada-14 through the execution partition, as explained in subsequent sections.
All the nodes have access to a shared network attached storage (NAS) disk. You can access it
through
cd /scratch/nas/1/par????
which corresponds to your home directory (also referred to as ~/).
All the necessary files to carry out each laboratory assignment will be posted in
/scratch/nas/1/par0/sessions
For today’s session, copy the file lab1.tar.gz from that location into your home directory:
cp /scratch/nas/1/par0/sessions/lab1.tar.gz ~/
and, from your home directory, uncompress it with this command line:
tar -zxvf lab1.tar.gz
Finally, in order to set up all the environment variables, you have to process the environment.bash
file now available in your home directory with:
source ~/environment.bash
Note that you must load this environment every time you log in to boada. Alternatively, you can
add this command to your .bashrc so that the environment is automatically loaded every
time you access the cluster. You can achieve this by executing the following command from your
home directory:
echo "source ~/environment.bash" >> .bashrc
In case you need to transfer files from boada to your local machine (laptop or desktop in the
laboratory room), or vice versa, you have to use the secure copy command scp. For example, if you
type the following command:
scp [email protected]:lab1/pi/pi_seq.c .
on your local machine, you will be copying the source file pi_seq.c, located in the directory lab1/pi
of your home directory on boada, to the current directory of the local machine (represented by "."),
keeping the same name.
In some cases, you will need to add the -O option to be able to run scp:
scp -O [email protected]:lab1/pi/pi_seq.c .
due to different versions of the scp protocol.
Use the fig2dev tool (e.g., "fig2dev -L pdf map.fig map.pdf") to convert from .fig to .pdf; look for
alternative output graphic languages by typing "man fig2dev".
Alternatively, remember that you can also take screenshots by pressing the dedicated Prtsc
keyboard button.
The three generated files will help you to figure out:
the number of sockets, cores per socket, and threads per core in a specific node;
the amount of main memory in a specific node, and each NUMA node;
the cache memory hierarchy (i.e., L1, L2, and L3), private or shared to each core/socket.
We suggest you fill in Table 1.2, which will help you understand some of the performance degradation
that may occur when increasing the number of threads in future exercises.
Table 1.2: Architecture and memory hierarchy characteristics of any of the boada execution nodes
boada-11 to boada-14.
Characteristic Value
Number of sockets per node
Number of cores per socket
Number of threads per core
Maximum core frequency
L1-I cache size (per-core)
L1-D cache size (per-core)
L2 cache size (per-core)
Last-level cache size (per-socket)
Main memory size (per socket)
Main memory size (per node)
You can check the status of your submitted jobs with:
squeue -u par????
Finally, you can cancel a previously submitted job via
scancel <jobid>
where <jobid> refers to the job identifier. Every submitted execution generates two additional
files that share the same script name followed by .e and .o together with the job identifier. Such
files contain the messages sent to the standard error and output streams, respectively, which we
encourage you to check.
The program computes an approximation of π based on the integral
\[ \int_0^1 \frac{4.0}{1+x^2}\, dx = \pi , \]
which is approximated numerically as the sum of the areas
of rectangles:
\[ \sum_{i=0}^{\mathit{num\_steps}-1} F(x_i)\,\Delta x \approx \pi , \qquad \text{with } F(x) = \frac{4.0}{1+x^2} . \]
Code 1 shows a simplified version of the pi_seq.c code. The variable num_steps defines the
number of rectangles, whose area is computed at each iteration of the loop.
void main () {
    double x, pi, sum = 0.0, step = 1.0/(double) num_steps;
    for (long int i = 0; i < num_steps; ++i) {
        x = (i + 0.5)*step;
        sum += 4.0/(1.0 + x*x);
    }
    pi = step*sum;
}
Code 1: Computation of π.
Figure 1.2 shows the compilation and execution flow for a sequential program. Program
compilation and binary generation is done with a Makefile that declares multiple targets with
specific rules. This course uses icx (Intel’s oneAPI C compiler) to generate the binary files.
Now let's compile pi_seq.c using the Makefile and execute the generated binary, both interactively
and through the queueing system:
1. Open the Makefile file and identify the target employed to compile the sequential code.
Observe how the compiler is invoked.
2. Interactively execute the generated binary file using the command:
./run-seq.sh <name> <iterations>
with the corresponding executable name and a fixed number of iterations equal to 10^9.
The execution returns the user and system CPU time, the elapsed execution time, and the
percentage of CPU used.
3. Look at the source code and identify the function invocations and data structures used to
measure the execution time (a minimal timing sketch is shown after these steps for reference).
4. Submit the execution script to the queuing system via:
sbatch submit-seq.sh <name> <iterations>
with similar arguments: the binary name and 10^9 iterations. Use the squeue command to
check your job.
5. Open the generated standard output and error files time-pi_seq-boada-??, where ?? refers
to the node where the execution took place.
Take a closer look at the run-seq.sh and submit-seq.sh scripts to understand how the binaries
are executed.
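For reference in step 3, the following is a minimal sketch of one common way to measure elapsed (wall-clock) time in C, using gettimeofday; the actual pi_seq.c may rely on a different interface, so treat this only as an illustration.

#include <stdio.h>
#include <sys/time.h>

int main(void) {
    struct timeval start, end;
    gettimeofday(&start, NULL);
    /* ... code region to be measured, e.g. the loop of Code 1 ... */
    gettimeofday(&end, NULL);
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1e6;   /* elapsed time in seconds */
    printf("Elapsed time: %f seconds\n", elapsed);
    return 0;
}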
Figure 1.3: Compilation and execution flow for parallel programs using OpenMP.
Figure 1.3 shows the compilation and execution flow for an OpenMP program. The main
difference with respect to Fig. 1.2 is that now the Makefile includes the appropriate compilation
flag that enables OpenMP.
We propose that you carry out the following steps:
1. Open the parallelised code pi_omp.c and figure out what the newly added lines of code do
(a hedged sketch of one possible parallelisation is shown after these steps).
2. Open the Makefile and identify the target associated with the OpenMP code compilation
(the only difference is the -qopenmp compilation flag).
3. Take a look at the script to learn how to specify the number of threads to OpenMP.
4. Interactively execute the OpenMP code with 1, 2, 4, 8, 16, and 20 threads using 10^9 iterations:
./run-omp.sh <name> <iterations> <threads>
5. What is the time command telling you about the user and system CPU time, the elapsed
time, and the percentage of CPU used?
6. Submit the execution script to the queuing system via:
sbatch submit-omp.sh <name> <iterations> <threads>
specifying the name of the OpenMP binary, 10^9 iterations, and an increasing number of threads:
1, 2, 4, 8, 16, and 20. Look at all the generated files time-pi_omp-¿¿-boada-??, where ¿¿ is
the number of threads and ?? the execution node.
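As a reference for step 1, the following is a minimal sketch of one possible way to parallelise the loop of Code 1 with an OpenMP reduction; the distributed pi_omp.c may use a different OpenMP construct, so treat this only as an illustration.

#include <omp.h>

double compute_pi(long num_steps) {
    double sum = 0.0, step = 1.0/(double) num_steps;
    #pragma omp parallel for reduction(+:sum)   /* iterations are split among threads    */
    for (long i = 0; i < num_steps; ++i) {
        double x = (i + 0.5)*step;              /* x is private to each iteration        */
        sum += 4.0/(1.0 + x*x);                 /* partial sums are combined at the end  */
    }
    return step*sum;
}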
We suggest you fill in Table 1.3 to compare the differences between interactive and queued
executions.
ps2pdf pi_omp-???.ps
xpdf pi_omp-???.pdf
and observe how the execution time and speedup vary with the number of threads.
3. Set np_NMAX=40 within submit-strong-omp.sh and repeat the previous steps. Can you
explain the behaviour observed in the scalability plot?
4. Submit the weak scalability script (no arguments required):
sbatch submit-weak-omp.sh
and observe how the parallel efficiency varies with the number of threads.
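As a reminder of the metrics plotted by the two scripts (standard definitions, consistent with those used later in this document): in a strong scalability test the problem size is fixed while the number of threads p grows, whereas in a weak scalability test the problem size grows proportionally to p, so that ideally the execution time stays constant:
\[ S_p = \frac{T_1}{T_p} , \qquad E_p = \frac{S_p}{p} \quad \text{(strong scaling)} ; \qquad E_p^{weak} = \frac{T_1}{T_p} \quad \text{(weak scaling)} . \]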
Remember to take screenshots by pressing the dedicated Prtsc keyboard button.
Chapter 2
In this laboratory session, you will learn how to use Tareador, a tool that analyses the potential
parallelism that can be obtained when a given task decomposition strategy is applied to your
sequential code. In this way, the programmer simply needs to identify the tasks of the
strategy that they want to evaluate.
These are the main features of Tareador and Tareador GUI:
Tareador traces the execution of the program based on the specification of potential tasks to
be run in parallel;
Tareador records all static/dynamic data allocations and memory accesses in order to build
the task dependency graph (TDG);
Tareador GUI uses Tareador output information to simulate the parallel execution of the
tasks on a certain number of processors in order to estimate the potential speedup.
Figure 2.1: Compilation and execution flow for Tareador.
Figure 2.1 shows the compilation and execution flow for Tareador, starting from the taskified
source code. Potential tasks are specified in the source code by wrapping code regions with the following calls:
tareador_start_task("NameOfTask");
/* Code region to be a potential task */
tareador_end_task("NameOfTask");
where the string NameOfTask identifies that task in the graph produced by Tareador. In order to
enable the analysis with Tareador, the programmer must invoke:
tareador_ON();
/* Main program */
tareador_OFF();
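As an illustration only, a complete annotation of a small sequential code could look as follows; the header name tareador.h and the functions init_data and compute are assumptions made for this sketch, not part of the lab sources.

#include "tareador.h"                      /* assumed header exposing the tareador_* API */

void init_data(void) { /* hypothetical initialisation work */ }
void compute(void)   { /* hypothetical computation work    */ }

int main(void) {
    tareador_ON();                         /* enable the Tareador analysis  */

    tareador_start_task("init_data");
    init_data();                           /* first potential task          */
    tareador_end_task("init_data");

    tareador_start_task("compute");
    compute();                             /* second potential task         */
    tareador_end_task("compute");

    tareador_OFF();                        /* disable the Tareador analysis */
    return 0;
}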
Figure 2.2: Task dependency graph (TDG) for the initial task decomposition expressed in
3dfft_tar.c.
The execution of the run_tareador.sh script opens a new window in which the TDG is visualised
(see Figure 2.2):
each node of the graph represents a task: different shapes and colours are used to identify
task instances generated from the same task definition and each one is labeled with a task
instance number;
each node contains the number of instructions that the task instance has executed, as an
indication of the task granularity;
the size of the node also reflects in some way this granularity.
You can zoom in and out to see the names of the tasks (the same that were provided in the
source code) and the information reported for each node.
Figure 2.3: Variables provoking data dependencies between tasks for a specific or all edges: real
dependency tab.
Edges in the graph represent dependencies between task instances. Different colours/patterns
are used to represent different types of dependencies: e.g., blue colour for data dependencies and
green/yellow for control dependencies. Moreover, Tareador allows you to analyse the variables
whose access provokes data dependencies between a pair of nodes. Try the following:
1. Move the mouse over an edge, e.g., the edge going from the red task (init_complex_grid)
to the yellow task (ffts1_and_transpositions).
2. Right click with the mouse and select Dataview → edge. This will open a window similar
to the one shown in Fig. 2.3.
3. In the Real dependency tab, you can see the variable that causes that dependency (the
access to the array in_fftw).
4. Alternatively, right click on a task (e.g., ffts1_and_transpositions) and select Dataview
→ Edges-in. This will open a window similar to the previous one.
5. You can do the same for the edges going out of a task by selecting Dataview → Edges-out
when clicking on top of a task.
For each node you can also analyse the variables that are accessed during the execution of the
associated task. For example, with the mouse on node ffts1_and_transpositions, carry out the
following steps:
1. Right click with the mouse and select Dataview → node. You can select either the Task
view tab or the Data view tab in that window, as shown in Fig. 2.4.
2. In the Task view tab you can see the variables that are:
read (i.e., with a load memory access) in green colour in the window, as in this case
variable p1d;
written (i.e., with a store memory access) in blue colour in the window;
both read and written (i.e., the intersection) in orange, as is the case for in_fftw.
3. For each variable in the list you have its name and its storage:
Figure 2.4: Variables provoking data dependencies between tasks for a specific or all edges: dataview
and taskview tabs.
G for global;
H for the heap or dynamically allocated data;
S for the stack or function local variables;
and additional information (size and allocation) is obtained by placing the mouse on the
name.
4. In the Data view tab you can see, for each variable (selected in the chooser), the kinds of access
(store, load or both, using the same colours) that are performed by the task.
5. Finally, you can save the TDG by clicking on the Save results button in the main Tareador
window.
6. Alternatively, you can simply take a screenshot by pressing the Prtsc keyboard button.
Figure 2.5: Paraver trace of the simulated execution with 4 processors, full view and after zooming
into the initial part of the trace.
You can also simulate the execution of the task graph in an ideal machine with a certain number
of processors by clicking on View Simulation in the main window of Tareador. This will open
a Paraver window similar to Fig. 2.5 showing the timeline for the simulated execution with the
following characteristics:
each horizontal line shows the task(s) executed by each processor (CPU1.x, with x={1,2,3,4});
colours represent different tasks and match those of the TDG;
the number on the lower-right corner of the window indicates the simulated execution time
(in time units, assuming that each instruction takes one time unit) for the parallel execution;
yellow lines show task dependencies (and task creations).
You can zoom into the initial part of the timeline by pressing the left button of your mouse
and selecting the zone you want to zoom into. You can undo zooms by clicking Undo zoom
or Fit time scale at the top of the Paraver timeline windows. You can save the timeline for
the simulated parallel execution by clicking Save → Save image at the top of the Paraver timeline
window; alternatively, you can take a screenshot.
Tareador also provides a mechanism to disable the dependency analysis of a given variable within
a code region:
tareador_disable_object(&name_var)
/* Code region with memory accesses to variable name_var */
tareador_enable_object(&name_var)
With this mechanism you remove all the dependencies caused by the selected variable in a
specific code region. For example, if you disable the analysis of the variable in_fftw, then you will
observe that in the new TDG the tasks init_complex_grid and ffts1_and_transpositions are
executed in parallel.
For this version pay special attention to the data dependencies that appear in the TDG:
analyze the Edges-in for one of the transposition tasks and make sure that you understand
what is reported by Tareador.
1 Herein the word “replace” actually means: firstly, remove the original task definitions; and secondly, add the
new ones.
v3. Starting from v2, replace the definition of the tasks associated with the function invocations
transpose_xy_planes and transpose_zx_planes with fine-grained tasks inside the corresponding
body functions, associated with individual iterations of the k loop, as you did in v2 for ffts1_planes.
Try to understand what is causing the data dependencies.
v4. Starting from v3, replace the definition of the task for the init_complex_grid function with
fine-grained tasks inside the body function. For this version, simulate the parallel execution
for 1, 2, 4, 8, 16, and 32 processors, drawing a graph or table showing the potential strong
scalability. What is limiting the scalability of v4?
v5. Create a new version to explore an even finer granularity. Take into consideration that, due
to the large number of tasks, Tareador may take a while to compute and draw the TDG.
Once more, simulate the parallel execution for 1, 2, 4, 8, 16, 32, and 128 (i.e., ∞) processors.
According to the new results, is it worth going to this granularity level?
Chapter 3
This chapter presents a methodology to analyse and improve the performance of your parallel
implementation with OpenMP. It is based upon the following three steps, which you can repeat until
you are satisfied with the performance:
1. Use of Modelfactors in order to analyse the overall performance and scalability, the actual
parallel fraction (φ) of the task implementation, and its efficiency based on load balancing
and overheads.
2. Use of Paraver in order to analyse the traces generated from the parallel execution that will
help you to diagnose the performance inefficiencies previously reported by Modelfactors.
Figure 3.1: Compilation and execution flow for the instrumented execution of an OpenMP program
with Extrae, generating the traces (.prv, .pcf and .row files) that are analysed with Paraver.
The steps described above are based on the generation of traces from the instrumented execution
of a parallel program. Figure 3.1 shows the complete compilation and execution flow that needs to
be taken in order to generate such traces. The environment is mainly composed of:
Paraver. It visualises the trace and analyses the execution of the program.
1 Extrae also collects the values of hardware counters that report the information about the processor activity and
memory accesses.
When using Modelfactors, the execution of the Extrae instrumented program is repeated for an
increasing number of processors, generating traces for each of them. Next, Modelfactors analyses
all of them in order to generate the reports.
Table 3.1: Analysis done on Thu Jul 28 07:54:35 AM CEST 2024, par0.
Overview of whole program execution metrics

Number of Processors     1      2      4
Elapsed time (sec)       1.27   0.72   0.41
Speedup                  1.00   1.76   3.11
Efficiency               1.00   0.88   0.78
Each table targets a different metric. Starting with the first one (see Table 3.1 obtained from a
parallel program executed on up to 4 processors), Modelfactors provides a summary of the overall
program execution metrics for an increasing number p of threads:
elapsed parallel execution time T_p, measured in seconds;
speedup, calculated as S_p = T_1 ÷ T_p;
efficiency, calculated as E_p = S_p ÷ p.
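For example, taking the values reported in Table 3.1 for 2 processors:
\[ S_2 = \frac{T_1}{T_2} = \frac{1.27}{0.72} \approx 1.76 , \qquad E_2 = \frac{S_2}{2} \approx 0.88 . \]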
With this information you can figure out the scalability of your code. If the reported results
seem unsatisfactory, you can go to the next table to better understand the reasons.
Table 3.2: Analysis done on Thu Jul 28 07:54:35 AM CEST 2024, par0.
Overview of the efficiency metrics in parallel fraction, φ = 99.94

Number of Processors                     1         2         4
Global efficiency                        99.97%    87.95%    77.95%
Parallelisation strategy efficiency      99.97%    99.50%    95.91%
  Load balancing                         100.00%   99.60%    96.60%
  In execution efficiency                99.97%    99.91%    99.28%
Scalability for computation tasks        100.00%   88.38%    81.27%
  IPC scalability                        100.00%   89.43%    85.12%
  Instruction scalability                100.00%   100.00%   100.00%
  Frequency scalability                  100.00%   98.83%    95.48%
The second table of Modelfactors (e.g., Table 3.2) provides a deeper dive into the efficiency metric,
but only for the parallel fraction of the program, denoted by φ, that is, the part of the program
that is currently parallelised. This is a good metric to understand whether the inefficiencies come
from a large serial part of your program that has not yet been parallelised or cannot be further
optimised. Note that there are two metrics, namely the “parallelisation strategy efficiency” and the
“scalability for computation tasks”, whose product forms the “global efficiency” of the program.
The parallelisation strategy efficiency is calculated as the product of the “load balancing” and
the “in execution efficiency”. The former reflects the ratio between the average useful code executed
by the threads and the maximum useful code executed by any of them, while the latter shows the
lack of overheads due to work generation and synchronisation in the critical path of the execution.
The higher and closer to 100% these two values are, the better. With these two metrics it can
often be inferred what to do next with regard to the analysis and optimisation of the code.
The scalability for computation tasks is related to how the processors are actually executing the
computation tasks, in terms of the total number of useful instructions executed, the number of useful
instructions executed per cycle (IPC), and the frequency (number of useful cycles of operation
per second). As with the previous metric, it is calculated as the product of these three
variables. The value of these metrics for p threads is relative to 1 thread and, once more, the closer
to 100% these values are, the better. Note that lower values are to be expected in specific cases
where there are plenty of synchronisation overheads.
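As a consistency check, the values reported in Table 3.2 for 2 processors satisfy these multiplicative relations (up to rounding):
\[ \underbrace{0.9960 \times 0.9991}_{\text{load balancing} \times \text{in execution eff.}} \approx 0.9950 , \qquad
   \underbrace{0.8943 \times 1.0000 \times 0.9883}_{\text{IPC} \times \text{instruction} \times \text{frequency scal.}} \approx 0.8838 , \qquad
   \underbrace{0.9950 \times 0.8838}_{\text{strategy eff.} \times \text{computation scal.}} \approx 0.8795 . \]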
Table 3.3: Analysis done on Thu Jul 28 07:54:35 AM CEST 2024, par0.
Statistics about explicit tasks in parallel fraction

Number of Processors                           1          2          4
Number of explicit tasks executed (total)      16.0       32.0       64.0
LB (number of explicit tasks executed)         1.0        1.0        1.0
LB (time executing explicit tasks)             1.0        1.0        0.98
Time per explicit task (average us)            79070.36   44729.83   24320.53
Overhead per explicit task (synch %)           0.0        0.47       4.23
Overhead per explicit task (sched %)           0.02       0.02       0.02
Number of taskwait/taskgroup (total)           8.0        8.0        8.0
Finally, the third Modelfactors table (see Table 3.3) completes the report. If the code contains
explicit tasks, then details about them will be provided in this table. Otherwise, information
about implicit tasks will be reported. These metrics can be very helpful to detect issues related to
overheads caused by synchronizations or an excessive number of tasks created. LB stands for load
balancing.
You can find more details on how to calculate the metrics of Tables 3.2–3.3 in Annex B.2.
tasks that it created.
iii. When all the explicit tasks have been executed, the parallel region finishes,
returning to serial execution until a new parallel region is found, if any remains in the code.
In your program, three parallel regions are initially defined.
3. Compile 3dfft_omp.c using the appropriate entry in the Makefile:
make 3dfft_omp
4. Submit the execution of the submit-strong-extrae.sh script indicating the name of the
binary program 3dfft_omp:
sbatch submit-strong-extrae.sh 3dfft_omp
This may take up to several minutes as the script traces the execution with 1, 4, 8, 12, 16,
and 20 threads and also performs the analysis with Modelfactors.
5. Monitor the status of your submitted job in the queue:
watch squeue
to check whether it has finished (press Ctrl-C to stop the command).
6. After the completion of the job, check that the directory 3dfft_omp-strong-extrae exists,
and, inside it, the file modelfactor-tables.pdf or its equivalent text version modelfactors.out.
7. Open the PDF via the following command:
xpdf modelfactor-tables.pdf
Before continuing, make sure to rename the 3dfft_omp-strong-extrae directory before running
a new Modelfactors analysis in order to avoid data deletion (since the same directory name will
be reused). Furthermore, beware of the limited disk quota assigned to your account and consider
deleting old traces once they are no longer necessary to prepare your laboratory deliverables.
Table 3.4: Results of the Paraver execution analysis on three versions of the FFT benchmark.
Version                                               φ    ideal S_12    T_1    T_12    real S_12
Initial version in 3dfft_omp.c
New version with reduced parallelisation overheads
Final version with improved φ
Let’s dive into the parallel execution of the application by visualising one of the execution traces
generated by Modelfactors. In this guided tour you will learn the basic features of Paraver, the
graphical browser of the traces generated by Extrae. For your information, in Atenea (Paraver
package section) we have prepared three packages (for Linux, Windows and Mac) of the Paraver
software so that you can locally install the Paraver binaries on your computer to analyse the
generated traces. This is usually faster than remotely opening the Paraver GUI when you are
outside the UPC campus.
We suggest you keep the parallel fraction value (found in the second table
of Modelfactors) and capture the Paraver timeline window(s) in order to make appropriate
comparisons between the three proposed versions of the FFT benchmark:
initial,
reduced overhead,
improved parallel fraction,
using the timeline window and profile of thread states. We will ask you to fill in Table 3.4.
Figure 3.2: Paraver’s main control window, timeline, and 2D analyzer windows.
1. Launch Paraver by typing wxparaver on the command line, which will open the so-called
“main control window”, shown in Fig. 3.2 (left).
2. Load the trace corresponding to 12 processors from the main menu, by selecting File → Load
Trace..., and navigate through the directory structure until you find the trace file (.prv)
generated from the instrumented execution of the 3dfft_omp binary during the execution of
the Modelfactors tool (they should be inside the directory 3dfft_omp-strong-extrae).
3. Once the file is loaded, click on the box New single timeline window (top left icon in the
main window).
4. A new window, similar to the one shown in Fig. 3.2 (top-right), appears showing a timeline
where the horizontal axis represents time (advancing from left to right) and with the state
(encoded in colour) of each thread (vertical axis) in the parallel program.
5. While moving the mouse over the window, a textual description of the meaning of each colour
is shown (at the bottom of the same window):
light blue (idle),
dark blue (running),
red (synchronisation),
white (not created),
yellow (scheduling and fork-join),
. . .;
note that the meaning of each colour is specific to each window.
6. Right-click at any point inside the window and select Info Panel and then the Colors tab
to see the colouring table for the window.
7. Double click with the left button of your mouse on any point in the red or yellow areas in the
timeline. This action will activate the What/Where tab of the info panel, showing in textual
form the information at the selected point in the trace (the thread and time where you clicked,
the thread state at that point, and the duration of the interval with that state).
8. With Right Button → Info Panel in the trace window you can decide to show or hide this
panel.
It is interesting to see how the OpenMP parallel execution model is visualised:
Firstly, a master thread executes the sequential part of the program (dark blue with all the
other threads not yet created, shown in white) until the parallel region is reached.
At that moment, threads are created and the already mentioned explicit tasks are created
and executed, synchronising when necessary.
Once the parallel region is finished, only the master continues its execution (dark blue) with
all other threads remaining idle (light blue).
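The pattern that produces this behaviour is, schematically, a parallel region in which a single thread creates explicit tasks with a taskloop. The sketch below is only illustrative and does not reproduce the actual loops of 3dfft_omp.c.

#include <stdio.h>
#define N 1024

int main(void) {
    static double v[N];
    /* serial part: only the master thread runs here, the others are not yet created */
    #pragma omp parallel          /* the team of threads is created here         */
    #pragma omp single            /* one thread generates the work for the team  */
    {
        #pragma omp taskloop      /* loop iterations become explicit tasks       */
        for (int i = 0; i < N; ++i)
            v[i] = 4.0 / (1.0 + (double) i);
    }                             /* implicit barrier; afterwards only the master continues */
    printf("v[0] = %f\n", v[0]);
    return 0;
}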
2. You can observe the dark blue bursts representing the execution of tasks and the red bursts
in between representing synchronisations, with less frequent yellow bursts indicating when
tasks are being generated.
3. Play a little bit more zooming in and out to observe how threads transition between states.
4. Both windows will then have the same size and represent different views (metrics) for the
same part of the trace. If you put one above the other, there is a one-to-one vertical correspondence
between points.
5. You can also synchronise different windows by adding them to the same group. For example,
select Synchronise → 1 after a right-click with your mouse on the initial state timeline
window showing the parallel activity.
6. Repeat this with the hint User functions timeline window.
7. Once synchronised, they will remain aligned after zooming, undoing or redoing a zoom. With
the two windows synchronised it is easy to correlate the duration of tasks with the implicit
task they belong to, allowing you to observe whether parallel regions take the same time and whether
there is some work imbalance among threads in any of them.
8. You can always unsynchronise a window by unselecting Synchronise after a right-click with
your mouse on the timeline window. Note that you can create new groups of synchronised
windows at your convenience.
9. Open the hint Useful/Instantaneous parallelism to see the number of threads executing
tasks in parallel. In order to figure out the existing parallelism at any moment of the execution
trace, you can double-click with your mouse and the information panel will show you the
instantaneous parallelism at that moment (a value of 1 means no parallelism at all).
There are two key factors that influence the overall scalability and final performance. Looking
at the two timeline windows we can see that:
there is one function that is not parallelised;
there is a predominant thread state along the entire timeline window.
The second factor seems to have a greater influence because there is a strong scalability problem,
that is, a slowdown from twelve or more threads, due to the number of tasks and synchronisations
in the program.
3. Open the Paraver hint OpenMP tasking/Histogram of explicit execution task duration.
4. Another window pops up, showing with a colour gradient the distribution of the durations of the
different explicit tasks. You can see the histogram of the different explicit tasks executed by each
thread.
There are other hints for both implicit and explicit tasks that you can see in Annexes C–D.
1. Close and reopen Paraver and load the trace corresponding to 12 threads.
3. Go to the very top of the directory list, look for your lab directory, and click on lab1/3dfft/
4. Load one of the two available configuration files (*.cfg), corresponding to either implicit or
explicit tasks.
6. Make sure that the timelines do cover the entire execution time by clicking on Fit time
scale.
Note that a single configuration file can be used to automatically load several hints and
synchronise the corresponding timelines and profiles. In this particular case, the following Paraver
windows were loaded:
Thread states. You will see different states: (a) running, (b) not created, (c) schedule and
fork/join, (d) synchronization, and (e) idle.
Worksharing constructs. The single region is only shown on the thread executing the single
region.
Implicit/Explicit task function creation (line and file). The thread executing the single re-
gion creates all the tasks in the taskloop. If you move the mouse over any of these regions,
you will see the line and file where the taskloop (task) is created.
Implicit/Explicit task function execution (line and file). Any thread can execute a task
created in the taskloop. If you move the mouse over such a region, you will see the first line
and file where each task starts.
Implicit/Explicit tasks executed duration. If you move the mouse over any of these regions,
you will see the execution time (granularity) of each task.
From the previous hints and configurations, and more specifically the histograms and timeline
windows of the explicit task duration, it can be inferred that the fine granularity employed in this
parallel implementation may not be adequate.
Annexes C–D summarise the most important Paraver hints to be used with implicit and
explicit tasks, respectively. We strongly recommend that you try them out now by using the trace
corresponding to the initial version of the FFT problem executed with twelve threads.
3.3 Example 3DFFT: Parallel Improvements and Analysis
3.3.1 Reducing the parallelisation overheads
Code Optimization
We ask you to coarsen the current task granularity of the FFT problem. This should reduce
the parallelisation overheads and, as a consequence, have a positive impact on the overall
performance. In order to achieve this:
1. Comment out the innermost taskloop inside each parallel region.
2. Uncomment the outermost taskloop inside each parallel region.
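Schematically, the change moves the task creation from the innermost to the outermost loop, so that each explicit task does more work. The sketch below is only illustrative; the actual loop nests and bodies in 3dfft_omp.c are different.

#define N 256
static double a[N][N];

void coarse_grain(void) {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop            /* outermost taskloop: one task per i (coarser tasks)     */
        for (int i = 0; i < N; ++i)
            /* #pragma omp taskloop */  /* innermost taskloop commented out (was one task per j)  */
            for (int j = 0; j < N; ++j)
                a[i][j] = i + 0.5*j;
    }
}

int main(void) { coarse_grain(); return 0; }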
3.3.3 Improving φ
Code Optimization
Note that the function init_complex_grid has not yet been parallelised and its execution time is
limiting the overall performance of the FFT program. We propose another optimisation based
on your previous analysis with both Modelfactors and Paraver:
1. Open the 3dfft omp.c file.
2. Uncomment the pragmas that will allow the parallel execution of the code inside function
init_complex_grid.
3. At this point, only uncomment the taskloop on the outer loop, together with parallel and single.
Appendix A
You can access boada from your laptop, booted with Linux, macOS or Windows, provided a secure shell
client is installed. However, to access the boada server from a computer outside the UPC network
you will need to use a UPC VPN connection.
On Windows you need both putty for secure shell:
https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
and xming for X11:
https://wiki.centos.org/HowTos/Xming
alternatively, you can use MobaXterm which integrates both secure shell and X11:
https://mobaxterm.mobatek.net/download.html
On macOS you need XQuartz and the -Y option:
ssh -Y [email protected]
where ???? refers to your user number.
Appendix B
Modelfactors
\[ LB\_eff = \frac{avg(useful_i)}{max(useful_i)} , \]
while the second factor highlights when threads spend time executing useful code:
\[ in\_exec\_eff = \frac{max(useful_i)}{t} , \]
where t is the total execution time of the parallel fraction and useful_i the time that thread i executes
useful code. Finally, parallel_eff(p) can be computed as follows:
\[ parallel\_eff(p) = \frac{\sum_{i=0}^{p-1} useful_i}{p \times t}
   = \frac{\sum_{i=0}^{p-1} useful_i}{p \times max(useful_i)} \times \frac{max(useful_i)}{t} . \]
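As an illustration of these formulas, the following sketch computes the three metrics from a vector of per-thread useful times; the array useful[] and the elapsed time t of the parallel fraction are assumed inputs for the example, not something Modelfactors exposes in this form.

#include <stdio.h>

void efficiency_metrics(const double *useful, int p, double t) {
    double sum = 0.0, max = useful[0];
    for (int i = 0; i < p; ++i) {
        sum += useful[i];
        if (useful[i] > max) max = useful[i];
    }
    double lb_eff      = (sum / p) / max;   /* avg(useful_i) / max(useful_i) */
    double in_exec_eff = max / t;           /* max(useful_i) / t             */
    printf("LB_eff=%.4f in_exec_eff=%.4f parallel_eff=%.4f\n",
           lb_eff, in_exec_eff, lb_eff * in_exec_eff);
}

int main(void) {
    double useful[4] = {0.90, 0.88, 0.92, 0.85};   /* made-up per-thread useful times (s)           */
    efficiency_metrics(useful, 4, 0.95);           /* made-up elapsed time of the parallel fraction */
    return 0;
}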
The scalability for computation tasks can also be expressed as useful(1) ÷ useful(p), where
useful(p) is the total time all threads execute useful code in the parallel fraction, i.e.
\( \sum_{i=0}^{p-1} useful_i \). In particular, useful(p) is equal to inst(p) × CPI(p) ÷ freq(p), where:
inst(p) is the total number of instructions executed in useful code;
CPI(p) = cycles(p) ÷ inst(p) is the average number of cycles needed to execute one instruction,
where cycles(p) is the total number of cycles spent in useful code;
freq(p) = cycles(p) ÷ useful(p) is the average frequency of the processors while executing
useful code.
In other words, we can compute the scalability for computation tasks as:
\[ \frac{useful(1)}{useful(p)} = \frac{inst(1)}{inst(p)} \times \frac{CPI(1)}{CPI(p)} \times \frac{freq(p)}{freq(1)} , \]
which corresponds to IPC_scal(p) × inst_scal(p) × freq_scal(p) and is precisely the product of the
three rows (“IPC scalability”, “instruction scalability” and “frequency scalability”) of the second
table of Modelfactors. Take into consideration that:
an instruction scalability smaller than unity means that the parallel program needs more
instructions than the sequential code;
an IPC scalability smaller than unity means that the IPC for p threads is worse than its sequential
counterpart;
a frequency scalability smaller than unity means that, when using p threads, the frequency
decreases.
The third table of Modelfactors focuses on the information associated with explicit tasks. In
particular, it exposes different factors that affect the overall load balance and overheads due to
explicit tasks. The factors “LB (number of explicit tasks executed)” and “LB (time executing
explicit tasks)” show the load balance in terms of the number of explicit tasks executed and of the
time spent executing them, respectively.
There should be a good correlation between the load balance information (execution time due to
useful code) shown in the second table and the load balance information (explicit tasks) in the third
table. On the other hand, the percentage of overhead due to the synchronisation and scheduling
can be a good indicator of scalability problems of a given parallel strategy, usually related to the
number of tasks and their granularity.
Appendix C
Observe that in this trace only one thread encounters the parallel region and that there are
9 parallel regions executed (as delimited by the flags): two of these nine regions are fake parallel
regions intentionally introduced to delimit the start and the end of the main program.
You can check the line number in the original source code to see which one of the three parallel
regions identified in the code is executed. Can we know how much time it takes for a thread to
execute these implicit tasks?
C.5 More hints
There are other hints in the Paraver menu gathered in Tables C.1–C.2 associated with timelines
and histograms, respectively.
Table C.1: Paraver hints for implicit tasks that generate timelines.
Paraver hint                   Timeline content
Parallel constructs            When a parallel construct is executed
Implicit tasks                 The implicit task each thread executes in a parallel region
Implicit tasks duration        The duration of the implicit task executed in a parallel region
Worksharing constructs         When threads are in worksharings (in this course, only single)
In barrier synchronisation     When threads are in a barrier synchronisation
In critical synchronisation    When threads are in/out/entering/exiting critical synchronisations
Table C.2: Paraver hints for implicit tasks that generate histograms.
Paraver hint                          Histogram content
Thread state profile                  Thread states (useful, scheduling, synchronisation, . . .)
In critical synchronisation profile   Different phases a thread goes through when synchronising
                                      in a critical construct
Implicit tasks profile                Implicit tasks
Histogram implicit tasks duration     Durations of implicit tasks
Appendix D
Remember that these explicit tasks are created via the task or taskloop constructs present in a
parallel program.
Table D.1: Paraver hints for explicit tasks that generate timelines.
Paraver hint                                            Timeline content
Tasking/Explicit tasks function created and executed    When explicit tasks are created and executed
                                                        (file name and line number)
Tasking/Explicit tasks executed duration                Duration of explicit task functions executed
Taskloop/In taskloop constructs                         When the taskloop construct is executed
Tasking/In taskwait constructs                          When a taskwait construct is executed
Tasking/In taskgroup constructs                         When a task is starting or waiting in a taskgroup
Table D.2: Paraver hints for explicit tasks that generate histograms.
Paraver hint                                               Histogram content
Tasking/Profile of explicit tasks creation and execution   Task creations and executions
Tasking/Histogram of explicit tasks execution duration     Durations of the explicit tasks executed