I would like to begin by thanking the members of the jury for their feedback, and in particular Alessandro Papadopoulos and Alexandru Costan for reading this document from cover to cover.
A huge thank you to Eric and Olivier for welcoming me from the M2 internship onwards, and for putting up with me for almost 4 years. Two advisors, with two very different styles and domains of expertise, yet very complementary ones! You both taught me so much, scientifically and humanly. Thanks to the CtrlA, DataMove, and Polaris teams for such a healthy and stimulating research environment. Thanks to Imma, Annie, and Maud, without whom I would not have gone very far (geographically and administratively). Thanks to the post-docs, engineers, and young researchers for your knowledge and for showing the way forward with so much kindness. I am thinking in particular of Raphaël, Sophie, Millian, Danilo, Adrien, and Jonathan. Thanks to all the PhD students with whom I shared these three years and all those convivial moments around a drink and/or a deck of cards. Thanks to all the residents of office 431 for their friendliness, and for putting up with all my basketball attempts. Thanks to the Brazilians for introducing me to your (sometimes questionable) culinary customs. Thanks to Grid'5000, without which this thesis would have been rather bland experimentally speaking. Thanks to all the teachers who left a mark on me during my studies. Thanks to Frédéric and Thomas for showing me the beauty of parallel and distributed computing, and to Arnaud and Jean-Marc for making me aware of the subtleties of experimental evaluation and statistics.
Thanks to my family for all their support (Luky, it is your turn now!).
And finally, infinite thanks to Sofi for her unconditional daily support, without whom these last months of stress would have been much harder to get through.
Abstract / Résumé
Abstract
High-Performance Computing (HPC) systems have become increasingly complex, and their performance as well as their power consumption make them less and less predictable. This unpredictability requires careful online management in order to guarantee an acceptable quality of service to the users. Such a regulation problem arises in the context of the CiGri computing grid middleware, which aims at harvesting the idle resources of a set of clusters by injecting low-priority jobs. A harvesting strategy that is too aggressive can degrade the performance for all the users of the clusters, while a harvesting that is too timid will leave resources unused and thus waste computing power. There is therefore a trade-off between the amount of resources that can be harvested and the resulting performance degradation for the users' jobs, which can evolve during execution depending on the service-level agreements and on the load of the system.
We claim that such regulation challenges can be addressed with tools from Autonomic Computing, and in particular when coupled with Control Theory. This thesis investigates several regulation problems in the context of CiGri with such tools. We will focus on regulating the harvesting based on the load of a shared distributed file-system, and on improving the overall usage of the computing resources. We will also evaluate and compare the reusability of the proposed control-based solutions in the context of HPC systems.
The experiments done in this thesis also led us to investigate new tools and tech-
niques to improve the cost and reproducibility of the experiments. We will present a
tool named NixOS-compose able to generate and deploy reproducible distributed
software environments. We will also investigate techniques to reduce the number
of machines needed to deploy experiments on grid or cluster middlewares, such as
CiGri, while ensuring an acceptable level of realism for the final deployed system.
Résumé
High-Performance Computing (HPC) systems have become increasingly complex, and their performance as well as their power consumption make them less and less predictable. This unpredictability requires careful online management in order to guarantee an acceptable quality of service to the users. Such a regulation problem arises in the context of the CiGri computing grid middleware, which aims at harvesting the idle resources of a set of clusters by injecting low-priority jobs. A harvesting strategy that is too aggressive can degrade the performance for all the users of the clusters, while a harvesting that is too timid will leave resources unused and thus waste computing power. There is therefore a trade-off between the amount of resources that can be harvested and the resulting performance degradation for the users' jobs. This trade-off can evolve during execution depending on the service-level agreements and on the load of the system.
We claim that such regulation challenges can be addressed with tools from Autonomic Computing, in particular when coupled with Control Theory. This thesis studies several regulation problems in the context of CiGri with such tools. We will focus on regulating the harvesting of idle resources based on the load of a shared distributed file-system, and on improving the overall usage of the computing resources. We will also evaluate and compare the reusability of the proposed solutions in the context of HPC systems.
The experiments carried out in this thesis also led us to look for new tools and techniques to improve the cost and the reproducibility of the experiments. We will present a tool named NixOS-compose able to generate and deploy reproducible distributed software environments. We will also investigate techniques to reduce the number of machines needed to experiment on cluster middlewares, such as CiGri, while guaranteeing an acceptable level of realism for the final deployed system.
In Chapter 1, we present the state of the art of the different facets of our problem (the harvesting of computing resources, the autonomic management of computing systems, and Control Theory applied to computing systems). We also present CiGri, which is the central object of this thesis. This first chapter concludes with the presentation of the research questions that will be addressed in the following chapters.
One of the limitations of the controller proposed in the previous chapter is its strong coupling to the system for which it was designed, which limits its reusability on other systems. In Chapter 4, we consider two additional controllers (three in total) and evaluate their performance and usability. We conclude that there is no globally best controller, and we exhibit situations in which some controllers are more suitable than others.
In Chapter 9 we present a technique to reduce the number of machines to deploy in order to represent a cluster. We focus on the impact of this technique on I/O (Input/Output) performance in order to assess its limits. We evaluate this technique with two distributed file-systems of different natures and with different cluster configurations.
Chapter 10 presents the first steps towards a simulator of the feedback loops in CiGri presented in the first part of this manuscript. It contains a technical description of the implementation of the simulator, and an experimental comparison against a real system. We conclude this chapter with the limits of state-of-the-art simulators for the introduction of complex feedback loops and for the simulation of distributed file-systems.
Contents
Acknowledgments iii
Abstract / Résumé v
Contents ix
Introduction 1
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
This Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Work dissemination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.4. The Need for Regulation . . . . . . . . . . . . . . . . . . . . . 24
1.5. Hypotheses of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.6. Conclusion & Research Questions . . . . . . . . . . . . . . . . . . . . 27
4.2. Considered Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.1. Proportional-Integral Controller . . . . . . . . . . . . . . . . . 66
4.2.2. Adaptive PI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.3. Model-Free Controller . . . . . . . . . . . . . . . . . . . . . . 68
4.3. Evaluation and Comparison . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1. Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.2. Controller configurations . . . . . . . . . . . . . . . . . . . . 70
4.3.3. Experimental protocol . . . . . . . . . . . . . . . . . . . . . . 70
4.3.4. Performance-related comparison . . . . . . . . . . . . . . . . 71
4.3.5. Methodology and Implementation-related comparison . . . . 75
4.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.1. Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.1.1. Context & Motivation . . . . . . . . . . . . . . . . . . . . . . 108
7.1.2. Frequent Traps of the reproducibility of software environments . . . 109
7.1.3. Functional Package Managers . . . . . . . . . . . . . . . . . . 113
7.1.4. Limits of Functional Package Managers . . . . . . . . . . . . . 115
7.1.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2. Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
10.Towards Simulating CiGri and its Control Loop 157
10.1.Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
10.2.BatCiGri . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
10.2.1. Expected Properties of the Simulation . . . . . . . . . . . . . 158
10.2.2. Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
10.2.3. Batsim in a Nutshell . . . . . . . . . . . . . . . . . . . . . . . 159
10.2.4. Two Schedulers . . . . . . . . . . . . . . . . . . . . . . . . . . 159
10.2.5. Broker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
10.2.6. The CiGri Submission Loop . . . . . . . . . . . . . . . . . . . 161
10.2.7. Workload Adjustments . . . . . . . . . . . . . . . . . . . . . . 162
10.3.Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
10.4.Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Bibliography A3
Introduction
Background
[Figure 1: three panels showing the Number of Cores, the CPU Frequency [MHz], and the Performance [GFlop/s] of the Top500 machines over time (log scale)]
Figure 1.: Evolution of the number of cores, CPU frequency, and the maximum performance
of the Top500 machines. The solid line represents the median, and the ribbons
the 5% and 95%. Updated from [Cor21] with the data from [Len21].
(deploying and starting the computations, as well as cleaning the environment when
they completed).
This reservation process is not perfect, and can lead to the idling of some resources due to the jobs' resource requirements. Idle machines still consume energy, around half of what a machine consumes at full power [Hei+17]. This idling can represent a significant loss of energy and thus money. Turning machines on and off is possible, but it introduces new energy costs and degrades the Quality-of-Service (QoS) if users need the resources.

Figure 2.: Illustration of a Resource and Job Management System (RJMS) [Ble17]

As HPC jobs are large and rigid in the number of resources they require, it is difficult to improve the use of the idle CPU time. A lot of effort has been put into the research of better scheduling algorithms for the RJMS, meaning how to map computations to physical
machines. However, these algorithms often require the users to correctly estimate the duration of their computation, which is notoriously difficult for humans to do [CB01; MF01]. In practice, despite all this work, and because of the conservative behavior of the HPC community, the majority of HPC centers use the same scheduling policy, or a variation of it: EASY Backfilling. This scheduling technique is simple to implement and to understand, and offers good enough performance, which eases its adoption in HPC centers.
The idle computing power can be harvested by executing smaller and interruptible
jobs seen as second-class citizens. However, those jobs can introduce perturbations in the system and degrade the performance of the premium users' jobs (e.g., via an overload of the distributed file-system). As the jobs harvesting the idle resources are interruptible, it is tempting to kill them when the system is about to get overloaded. However, this killing also represents a loss of computing power, as these jobs do not usually implement any checkpointing mechanism and will thus need to be restarted from the beginning later on.
In general, HPC systems are very unpredictable [Bha+13; SK05; Dor+14]. The
performance of an HPC application can be affected by many external factors: network contention, file-system load, temperature in the rack, etc. All of these factors are nearly impossible to predict correctly and computationally costly to account for optimally, and thus impossible to incorporate into the decision process of the RJMS.
To deal with such runtime variations of computing systems, the field of Autonomic Computing proposes a feedback-loop point-of-view to regulate the system under control and to steer it towards a desired behavior/state. This feedback loop can be interpreted and implemented in various fashions. One particularly interesting way is to use tools from the field of Control Theory. Control Theory is usually applied to physical dynamical systems, but very rarely to computing systems. However, it provides mathematically proven properties on the controlled system, which makes it quite appealing compared to black-box approaches such as Machine Learning, or to simpler rule-based strategies.
is a crucial reproducibility issue, as it means that an experiment done today might not be reproducible in the future. To still perform the experiment, one might use simulation techniques. But even state-of-the-art simulators fail to fully reproduce the behavior of experiments in real conditions.
This Thesis
When designing and implementing feedback loops, especially using Control Theory tools, the quality and robustness of the experimental setup are crucial. Reproducible experiments are a must to ensure the validity of the resulting controller. Moreover, distributed experiments involving computing cluster middlewares (e.g., batch schedulers, Parallel File-Systems) are especially tricky. Faithful and realistic experiments with such middlewares would require deploying clusters at full scale, whose cost is excessive. Simulation techniques can be used, but they require correct and proven models. Intermediate strategies that reduce the number of machines needed to deploy the experiments while keeping the full-scale behavior of the system would reduce the experimental costs while still allowing meaningful experiments.
From the above two observations, the work of this thesis will be organized in two
parts: In Part I, we investigate an Autonomic Controller using Control Theory tools
to regulate the harvesting of idle HPC cluster resources.
• In Chapter 1, we give some context to this work, and present the CiGri middle-
ware which will be the central piece of this thesis.
• Chapter 10 presents the first step towards a simulator of CiGri and our auto-
nomic controllers of Part I using Batsim.
Some chapters can be read independently. Figure 3 depicts the dependency graph of
the chapters of this thesis.
Work dissemination
International conferences
• Quentin Guilloteau, Bogdan Robu, Cédric Join, Michel Fliess and Eric Rut-
ten. “Model-free control for resource harvesting in computing grids”. CCTA
2022 [Gui+22d]
• Quentin Guilloteau, Olivier Richard, Bogdan Robu, and Eric Rutten. “Con-
trolling the Injection of Best-Effort Tasks to Harvest Idle Computing Grid
Resources”. ICSTCC 2021 [Gui+21a]
National conferences
• Quentin Guilloteau, Adrien Faure, Millian Poquet, and Olivier Richard. “Com-
ment rater la reproductibilité de ses expériences ?” COMPAS 2023 [Gui+23a].
• Quentin Guilloteau, Olivier Richard, and Éric Rutten. “Étude des applications
Bag-of-Tasks du méso-centre Gricad”. COMPAS 2022 [GRR22a]
• Quentin Guilloteau, Olivier Richard, Eric Rutten, and Bogdan Robu. “Collecte
de ressources libres dans une grille en préservant le système de fichiers : une
approche autonomique”. COMPAS 2021 [Gui+21b]
Working papers
• Quentin Guilloteau, Olivier Richard, Raphaël Bleuse, and Eric Rutten. “Folding
a Cluster containing a Distributed File-System”. 2023 [Gui+23d]
Tutorials
• Quentin Guilloteau, Sophie Cerf, Eric Rutten, Raphaël Bleuse, and Bogdan
Robu. Introduction to Control Theory for Computer Scientists. [Gui+b]
Part I
1.1 Harvesting
The motivation for harvesting idle computing resources is two-fold. First, as comput-
ers are expensive and are powered on, it would be a waste of energy and money to
leave them idle. Second, sparse idle machines represent a cheap way to expand the
computing power available to users.
In this Section, we present various techniques from the literature to harvest idle
resources of a set of machines in different contexts.
new task. If the computer is no longer idle during the execution of a task, the task is killed and the CPU resources are given back to the user.
The cloud is also interested in using idle resources as every idle CPU tick is money
lost. Amazon Web Services (AWS) introduced a 90% cheaper deal to use the idle
resources of its EC2 cloud [Amaa; Amab]. Users submit their computations, and AWS
will schedule them on appropriate available idle resources. Those computations are seen as second-class citizens and will be interrupted when EC2 reclaims the resources. In [SRI16], the authors propose a pricing for idle cloud resources based
on a probabilistic model of the potential revocation of the resources. The goal is
to encourage the usage of the idle resources thanks to a fair pricing based on the
offered guarantees. [Liu+20] approaches the problem with a game-theory strategy
to dynamically define the price of cloud resources. They model the problem as a non-cooperative game between the cloud provider and the customers. The authors of [Wan+21] use online learning to predict the demand of users and compute the amount of resources that second-class virtual machines can safely harvest.
HPC systems also suffer from idle resources. One solution to reduce the idle time would be to have finer, more precise scheduling algorithms. But as smart as the scheduling algorithm may be, humans are still submitting the jobs. In particular, humans provide the estimation of the duration of their computation (i.e., the job walltime), and we are notoriously bad at it [CB01; MF01]. Work has been done to
The work on the HPC/Big-Data convergence led to some solutions using Big-Data
workloads to make use of the idle HPC resources.
Bebida [Mer+17] is another approach coupling HPC jobs and Big-Data workflows.
The authors slightly modified the prologue and epilogue scripts of the OAR batch
scheduler [Cap+05] to start and stop Hadoop [Shv+10] workers on the HPC
resources when an HPC job finishes or starts. The idle HPC resources are then
seen as a dynamic resource pool from the point-of-view of the Big-Data resource
manager. Authors showed that Bebida improved the total usage of the machines,
but degraded the mean waiting time of HPC jobs (due to the longer prologue and
epilogue scripts).
Similarly to Bebida, the authors of [Prz+22] propose a solution to use the idle resources of a cluster with FaaS (Function as a Service), or Serverless, workloads. The jobs from such workloads last a few seconds, which makes them good candidates to exploit the “holes” in the schedule. The solution couples the resource managers Slurm and OpenWhisk [Ope]. One limitation of the approach is the actual usage of FaaS for scientific workflows. The HPC community is conservative and the adoption of such complementary tools might be slow [SMM18].
CiGri [GRC07] is the approach that we will focus on in this thesis. CiGri is a
grid middleware that runs on top of several computing clusters managed by the
OAR resources manager. It harvests the idle resources of the grid’s clusters by
periodically submitting jobs to the different resources managers with the lowest
priority (Best-Effort). Those jobs come from Bag-of-Tasks applications, which are
composed of numerous, small, independent, and similar jobs. Contrary to the
previous solutions, the jobs submitted to CiGri are HPC jobs (just with different
“shapes”), and there is a single scheduler decision at the cluster level. The limitation
of CiGri is its submission algorithm which does not take into account the state of the
cluster. This lack of feedback can lead to suboptimal resource usage and contention
on shared resources of the clusters. A more detailed presentation of CiGri is done in
Section 1.4.
HPC systems suffer from idle resources. Those resources are usually harvested by executing smaller, interruptible jobs. Solutions from the literature usually involve two levels of scheduling and jobs of two different natures.
To deal with the increasing variability of computing systems and the error-prone
manual management, IBM introduced in the early 2000s the notion of Autonomic
Computing (AC) [KC03; HM08]. The main idea is to have systems that can self-adapt
to the variations of their environment. This self-adaptation can take several forms
based on the domain of the application.
• self-protection: the system defends itself against malicious attacks and cascad-
ing failures.
To implement this vision onto a system, the latter needs to have a way to interact with
the autonomic controller. Namely, the system must have a sensor and an actuator
(or knob), which correspond to the interface between the autonomic controller and
the managed element.
The main tool of AC is the MAPE-K loop, displayed in Figure 1.1. The acronym
MAPE-K stands for the names of the different phases of the loop:
• Monitor: during this phase, the autonomic manager queries the sensor on the
managed element.
• Analyze: the value of the sensor is then analyzed by the manager to check if a
change of direction is needed.
• Plan: if there is a need for a new trajectory, the “Plan” phase is responsible for defining this trajectory.
• Execute: finally, the new trajectory is translated into the action to execute on
the knobs of the managed element.
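To make this concrete, the loop can be sketched in a few lines of code. This is only an illustrative skeleton, not CiGri's implementation: the sensor, actuator, and planning function are placeholders to be provided by the managed system.

import time

def mape_k_loop(sensor, actuator, plan, reference, period_s=30):
    """Minimal MAPE-K-style loop around a managed element (illustrative sketch)."""
    while True:
        measure = sensor()              # Monitor: query the sensor of the managed element
        error = reference - measure     # Analyze: compare with the desired state (Knowledge)
        action = plan(error)            # Plan: decide on a new trajectory
        actuator(action)                # Execute: act on the knob
        time.sleep(period_s)            # wait for the next iteration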
[Figure 1.2: bar chart of the number of papers per approach (Control theory, MAS, Policy, Protocol, Learning, Multiple, N/A)]
Figure 1.2.: Approaches used to implement an autonomic controller in 210 surveyed papers
of three autonomic computing conferences. Regenerated from [PRD20].
Using Machine Learning techniques is also very popular for implementing an auto-
nomic controller. But it raises several difficult questions. How to be sure that the
model has seen all the possible behaviors during its training to know how to react
accordingly? How to actually be sure that the model will react accordingly? And let
us not forget the computational cost of training the model, especially if the system is
complex and has multiple inputs and outputs. The explainability of a solution is also a crucial factor for its adoption, and Machine Learning models fall short in this aspect as they can be seen as complete black boxes.
This Section summarizes the classical methodology of applying tools from Control Theory to computing systems [Fil+17; Lit+17; RMS17].
For the sake of clarity, we will use the following example. Imagine you have a brand-
new CPU which is quite powerful, but can become very hot. High temperatures can
damage the CPU, and you do not want to buy a new one (it was very expensive).
In this example, we will go through the methodology to set up a controller on your
CPU to avoid breaking it.
Figure 1.3 presents the workflow for designing a controller. Note that there can be cycles in the workflow, where the objectives, knobs, sensors, etc., can be updated.
The first step is to define the objectives of the closed-loop system. What are the metrics, the available sensors on the system, the states that we want to regulate or control? What are the desired guarantees on the closed-loop system? Should the system be fast? Should it be allowed to oscillate, or to go over a certain limit? What about disturbances? What is the nature of the disturbances? Can they be modelled? Can they be anticipated? In Control Theory jargon, the desired state of the system is called the setpoint or reference value.
In our simple example, the objective is to regulate the temperature. Lucky you, you
have a thermometer on your board! So your sensor will be the value returned by
this thermometer. There do not seem to be a lot of disturbances. You can imagine
that your fan could break at some point and thus the temperature would increase.
Once the goals are described, one needs to identify the knobs of action, also called
actuators. This means finding parts of the system whose variations impact the output
of the system, positively or negatively. There might be several knobs of actions,
or even some indirect knobs. Knowing the range of action of your knob is quite
important in order not to ask for impossible inputs.
This phase consists in identifying the relation between the change in the knobs and
the next value of the sensors. In Control Theory terms, a model is a mathematical
relation between the current state of the system (sensors) and the current input
(knobs) to the next state of the system. If you already have knowledge of the
system you can come up with such a relation. In physical systems, such knowledge
could be physical laws (e.g., Maxwell’s equations, heat equation, etc.). Otherwise,
identification experiments are required. During those experiments, the system is put
under different types of inputs, and the output is observed. One kind of input is a
stair function, where the value of the knob is constant for some time, then changes to a new value (smaller or greater), stays constant for some time, etc. Then, with
classical tools, such as linear regression, one is able to extract a model from those
identification experiments.
For our CPU, we might be able to derive a model from thermodynamic equations,
but we can also perform identification experiments. We can perform stair-shaped
inputs, where the frequency would increase and then decrease and log the value of
the temperature. What we might observe is that there is a delay, or inertia, between the action taken on the system and its visible reaction. This behavior can be taken into account in the model.
Once the model has been established, one can use the tools from Control Theory to
design the controller adapted to the problem. There are plenty of controllers in the
literature, all with their pros and cons. They can usually be derived from the model
found previously, either by inserting the model into equations, or by using toolboxes
like Matlab. During this phase, and based on the chosen controller, it might be
possible to choose the closed-loop behavior of the system. The main properties,
illustrated in Figure 1.4, are the following:
• Absence of overshooting: the system does not exceed the desired state (set-
point/reference value).
• Settling time: the system reaches the equilibrium point in a guaranteed time.
• Robustness: the system converges to the setpoint despite the imprecision of the
underlying model.
where u is the value to apply to the knob, y the value of the sensor, and yref the
reference value.
Proportional-Integral Controller
u(k + 1) = Kp × e(k) + Ki × Σ_{j=0}^{k} e(j)    (1.2)
The PI controller is the most popular one as it remains simple but offers good
guarantees.
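For illustration, Equation 1.2 translates directly into code; the sketch below is generic and not tied to any particular system (the gains Kp and Ki are assumed to have been chosen beforehand).

class PIController:
    """Discrete Proportional-Integral controller implementing Equation 1.2."""
    def __init__(self, kp, ki):
        self.kp = kp
        self.ki = ki
        self.error_sum = 0.0          # running sum of past errors

    def next_input(self, reference, measure):
        error = reference - measure   # e(k) = yref - y(k)
        self.error_sum += error       # integral term: sum of e(j) for j = 0..k
        return self.kp * error + self.ki * self.error_sum   # u(k+1)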
Proportional-Integral-Derivative Controller
The P controller reacts on the current error, the PI also on the past errors, and it is possible to add a component to react to future ones. By adding a derivative term on the error to a PI, we get a PID controller. The idea of the derivative is to detect changes in the trajectory of the error and to anticipate them:
u(k + 1) = Kp × e(k) + Ki × Σ_{j=0}^{k} e(j) + Kd × (e(k) − e(k − 1)) / ∆t    (1.3)
When a system is noisy, the computation of the derivative can lead to undesired
behavior. A solution would be to add a filter after the sensor to smooth out the
value and thus compute the derivative. But this defeats the purpose of the derivative
term as the filter will slow down the reaction to the changes in trajectory. Thus, in
practice, the derivative term is rarely used.
After implementing the controller on the system, one should validate experimentally
its closed-loop behavior compared to the expected ones of the design phase. This
usually requires performing experiments where a set-point is given to the controller,
and then observe its behavior.
This section gives more details about CiGri, the middleware of interest in this
thesis.
The Gricad computing center (https://gricad.univ-grenoble-alpes.fr) provides computing and storage infrastructures to the researchers of the region of Grenoble, France. The center is composed of several
CiGri [GRC07; oar23a] is a computing grid middleware set up in the Gricad comput-
ing center. It interacts with the OAR schedulers [Cap+05; oar23b] of each cluster.
The goal of CiGri is to use the idle resources of the entire computing grid. Originally,
CiGri was designed to reduce the stress on the different RJMSs of the grid when
users had to execute large campaigns (tens of thousands of jobs or more).
Once an application is submitted to the middleware, CiGri will submit batches of jobs to the clusters of the grid. The jobs are submitted to the schedulers with the lowest priority (Best-Effort), which allows OAR to kill those Best-Effort jobs if a normal/premium user needs the resources. Figure 1.5a summarizes the interaction
between CiGri and the different schedulers of the computing grid.
(a) Graphical representation of the system with CiGri and the different OAR schedulers of a computing grid. (b) Graphical representation of the interaction between CiGri and an OAR scheduler.
Figure 1.5.: Interactions between CiGri and the different RJMSs of the grid.
The current submission algorithm of CiGri works as a “tap”. CiGri first submits a
batch of Best-Effort jobs to the cluster scheduler, and then waits for all the submitted
jobs to terminate before submitting again. Figure 1.5b represents the sequence
diagram between CiGri and OAR. The size of the batch is defined by the simple
ad-hoc Algorithm 2.
Algorithm 2: CiGri current job submission. It submits jobs like a tap: opens the tap and submits jobs to OAR. Then closes the tap and waits for all the jobs to be executed. Then it opens the tap again and submits jobs. The values of rate and increase_factor have been empirically chosen by the administrators of CiGri.
Input: rate (init. 3), increase_factor (constant 1.5)
if no running jobs then
    rate = min(rate × increase_factor, 100);
    submit rate jobs;
else
    submit 0 job;
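The same logic reads as follows in a short Python sketch; running_jobs stands for the information CiGri gets back from OAR, and the returned value is the size of the next batch (hypothetical helper names, for illustration only).

def cigri_tap(rate, running_jobs, increase_factor=1.5, max_rate=100):
    """One iteration of Algorithm 2: open the tap only when the previous
    batch has fully terminated, and grow the batch size geometrically."""
    if running_jobs == 0:                          # previous batch finished
        rate = min(rate * increase_factor, max_rate)
        return int(rate), rate                     # submit `rate` jobs, keep the new rate
    return 0, rate                                 # tap closed: submit nothing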
One can think of scenarios where such a submission algorithm can lead to both the idling of resources and the unnecessary killing of Best-Effort jobs. For example, we can
think of a situation where there are a lot of idle resources on the cluster, but a single
job from the previous submission is still running and thus CiGri will not submit new
jobs, as it must wait for this last job to terminate.
We believe that the utilization of the system can be improved by introducing feedback mechanisms into the submission decision process. In particular, we think that using
an Autonomic Computing [KC03] point-of-view coupled with tools from Control-
Theory can yield an improvement in the total utilization of the platform. In this
Thesis, we will focus on two regulation problems of CiGri.
One regulation problem arises when considering the file-system of the cluster. As CiGri jobs are mostly short, and as any useful job either reads from or writes to the file-system, having too many CiGri jobs running can lead to an overload and a potential collapse of the distributed file-system. Such a problem is not taken into account in OAR for the scheduling of jobs, but could be with solutions such as [JPV23]. However, the performance of I/O-aware scheduling techniques is not yet satisfactory, and they are not implemented in production. They usually need either instrumentation, with Darshan [Car+11; Car+09] for example, or guesses from the users. Previous work has been done in [Yab+19] by implementing a Model-Predictive controller, but the complexity of the solution, the errors, and the scale of the experimental setup (10 nodes) yielded a model that did not scale correctly. Hence, there is a need to keep exploring different control-based approaches.
In the case where we do not consider the distributed file-system of the cluster, one can think that using all the resources of the cluster is easy, and that CiGri “just” needs to keep enough jobs in the waiting queue of OAR to fill the entire cluster if available. Even if this represents a regulation problem of its own (addressed in [Sta+18]), that work only considered idle resources as wasted computing power. However, killed CiGri jobs also represent wasted computing power, as CiGri jobs do not use checkpointing techniques. The campaigns submitted by the users contain a given amount of work to be executed, and will not be completed until all the work has been done. Thus, getting CiGri jobs killed is also counterproductive.
CiGri is able to harvest idle resources of a set of clusters, but does not take
into account the state/load of the clusters. Applying Autonomic Computing
with Control Theory tools could improve the resources’ utilization.
There is only one cluster in the grid. This hypothesis allows us to focus on a single cluster. The generalization to several clusters can be done by considering one
autonomic controller per cluster. An interesting path to consider would be the affinity
between CiGri campaigns and the different clusters of the grid, with potentially
similar techniques as in [Cas+00].
There are no other Best-Effort jobs in the system besides CiGri’s. Regular users of the cluster can in practice also submit Best-Effort jobs. Having both Best-Effort jobs
from CiGri and regular users could lead to some hidden competition for the idle
resources, and introduce noise in the signals. As a first step, we consider that regular
users cannot submit Best-Effort jobs.
The total number of resources in the cluster is constant. We do not take into account the variation of the availability of the nodes, meaning that no nodes are removed from or added to the set of available nodes. Note that if we have a dynamic sensor of the number of available resources in the cluster, the remainder of this thesis should be easily adaptable.
HPC systems are not making use of all their resources. This is due to the rigid
nature of the jobs. The idle computing power can be harvested by executing smaller
and interruptible jobs. However, those jobs can introduce perturbations in the
system and degrade the performance of the premium users’ jobs. We believe that the
harvesting of idle resources represents a regulation problem which can be tackled via
an Autonomic Computing approach. To provide a runtime behavior with guarantees,
we think that using tools and methods from Control Theory will help in terms of
performance and adoption. The CiGri middleware represents a good framework to
investigate these ideas and methods.
The remainder of this Part presents several autonomic feedback loops in the CiGri middleware to harvest the idle resources of an HPC cluster while considering the degradation of the performance of premium users’ jobs. The chapters of this Part are organized around the following research questions (RQ):
• RQ1: Can we characterize the CiGri jobs? How many jobs are in a campaign?
How long is their execution time?
• RQ3: Controllers are usually tightly tuned for a specific system. To what extent
can we plug a controller designed on a system A, on a system B?
• RQ4: Can a controller in CiGri reduce the computing power lost due to
both idle resources and the killing of low priority jobs with some internal
decision-making information from the scheduler?
A statistical study of the CiGri jobs from the last 10 years, exploring RQ1, is presented in Chapter 2. RQ2 will be addressed in Chapter 3 with the implementation of a Proportional-Integral controller following the methodology of Control Theory. Chapter 4 investigates RQ3 by comparing the controller defined in Chapter 3 and two other types of controllers on their design cost and performance. We present a first step towards answering RQ4 in Chapter 5, and conclude Part I in Chapter 6.
In this section, we are interested in the global characteristics of the CiGri jobs.
Figure 2.1 shows the empirical cumulative distribution function (ecdf) for the
execution times of the CiGri jobs, as well as the Bag-of-Tasks jobs from the DAS2 grid.
DAS2 [DAS] is the second generation of computing grids for Dutch universities. This
is the only available workload containing explicit Bag-of-Tasks applications that we
are aware of. Table 2.1 shows a side-by-side comparison of execution time metrics between the two considered computing grids. Globally, the execution times of the Bag-of-Tasks jobs executed on the Gricad center are longer than on DAS2. Indeed, half of the jobs on DAS2 last less than 30 seconds, whereas half of the jobs on Gricad last more than a minute.
The longest job on DAS2 runs for a few hours and the longest job for Gricad lasts
several days. This difference could be explained by the usage of the grid by the
users, the type of jobs, the number of available machines, etc.
[Figure 2.1: cumulative distribution function of BoT job execution times for the Gricad and DAS2 grids]
Figure 2.1.: Empirical Cumulative Distribution Function (ecdf) for the execution times of
the Bag-of-Tasks jobs executed on the grids Gricad and DAS2. Globally, the
Bag-of-Tasks jobs executed on the Gricad center are longer than on DAS2.
  Metric                 | Gricad         | DAS2
  Number of BoT jobs     | ≃ 4.4 × 10^7   | ≃ 10^5
  Number of clusters     | 8              | 5
  Minimum texec          | 1s             | 1s
  Average texec          | 12m 43s        | 3m 51s
  Maximum texec          | 9d 1h 27m      | 10h
  Median texec           | 1m 13s         | 24s
  Quantile 75% texec     | 5m 25s         | 2m 51s
  Quantile 95% texec     | 48m 16s        | 15m 1s
  Quantile 99% texec     | 2h 49m 13s     | 43m 35s
Table 2.1.: Table summarizing the dataset. texec represents the execution times of the Bag-of-Tasks jobs from the two computing grids Gricad and DAS2.
Figure 2.2.: Quarterly evolution of the number of Bag-of-Tasks jobs executed by CiGri on
the Gricad mesocenter. Every quarter, there are at least 100000 jobs being
executed, and in average around a million jobs.
Figure 2.1 shows the aggregated distributions over the last 10 years. The evolution over the years of the number of jobs executed by CiGri is depicted in Figure 2.2. We can see that there are at least 100000 jobs executed every quarter, and often around a million Bag-of-Tasks jobs. We do not observe any clear increasing or decreasing trend in the usage of CiGri.
Figure 2.3 depicts the evolution of the mean and median execution times of the jobs executed per quarter. The median execution time is on the order of a few minutes, whereas the mean execution time is more on the order of tens of minutes to one hour. We observe variations in the two metrics, which hints at changes in the usage of CiGri. But, as seen in Figure 2.2, the number of jobs does not vary much, which indicates that it is the execution time of the executed jobs that varies.
To understand the reason for this change of behavior through the years, let us look at the projects executed. We remind the reader that a project contains campaigns, which themselves contain jobs. Campaigns from the same project have similar behavior (number of jobs and execution times). Figure 2.4 shows the evolution of the distribution of the job execution times per quarter. We can see that there are often several modes in the distributions. These modes correspond to jobs from the same dominant projects. For example, from the last quarter of 2019
(19-4) to the third quarter of 2022 (22-3) we can see that there is always a mode of
distribution around one minute.
Figure 2.3.: Quarterly evolution of the mean and median execution times of the Bag-of-Tasks
jobs executed by CiGri on the Gricad mesocenter. The median execution time
is on the order of a few minutes, whereas the mean execution time is more on the order of tens of minutes to one hour.
Figure 2.5 shows the proportion of the Bag-of-Tasks jobs executed per quarter belonging to given projects. It also shows the proportion of work (i.e., total execution time multiplied by the number of machines used) for the CiGri projects. We can see that there is often one project submitting the majority of the jobs executed in a given quarter. For example, during the years 2020 and 2021, the project biggnss was responsible for the majority of the Bag-of-Tasks jobs executed. However, it is noteworthy that the project having the majority of jobs executed during a period is not necessarily the one that consumed the most resources (work). For the same period, between 2020 and 2021, the biggnss project does not account for the majority of the CiGri work executed (quarter 4 of 2020 for example). Note that we can now indeed confirm that from 19-4 to 22-3, the same project (biggnss in this case) had the majority of jobs being executed.
The project with the most jobs executed is not always the one doing the most
work. This hints that there are different "shapes" of CiGri campaigns: a lot of
small jobs or fewer jobs but longer.
Let us now focus on the distribution of execution times among the projects.
Figure 2.4.: Distribution of the execution times per quarter. We can see the dominant usage
of CiGri by some project. From 19-4 to 22-3, most of the jobs last around 1
minutes, which hints that most of the jobs come from the same project.
Figure 2.5.: Evolution of the proportion of Bag-of-Tasks jobs executed by CiGri on the Gricad
computing center over the years, as well as the evolution of the work (execution
time times the number of resources) of CiGri jobs. We observe that there is
often one project which has the majority of the jobs executed during a quarter.
However, this project does not necessarily perform the most work. For example,
for the year 2020, the biggnss project has the majority of jobs executed, but
not the majority of executed work.
Figure 2.6.: Distribution of the execution times for the 10 CiGri projects with the most
jobs. We observe that most projects have a clear unique mode of distribution.
This mode can be very wide, like for f-image and simsert, or thin, like for
biggnss and pr-mdcp.
Figure 2.6 depicts the distribution of the execution times of the 10 CiGri projects with the most jobs. We can see that for most of the projects, there is one clear mode for the execution times, as well as a heavy distribution tail. Projects like teembio and sdmtk have two modes in their execution times. In this case, there are several types of campaigns in the same project, one for each mode. The width of the mode also differs from project to project. For example, biggnss and pr-mdcp have a very thin mode of a few minutes, whereas projects like f-image and simsert range from a few seconds to a couple of hours. Note that, as the jobs are executed on a computing grid composed of several clusters, we also plot the clusters on which the jobs have been executed.
To model the distributions shown in Figure 2.6, we perform a goodness-of-fit test with the following long-tail distributions: Normal, Log-Normal, Fréchet, Gamma, and Weibull. We consider only the 10 projects with the most jobs, and follow the methodology presented in [Jav+09; BNW03]. For each campaign of these projects, we randomly chose 50 jobs. As the behavior of the jobs can vary for different clusters of the computing grid, we consider the distributions per (project, cluster) pair. For each of these pairs, and for each of the selected long-tail distributions, we perform a Cramér-von Mises test [Dar57]. This test computes the distance between the empirical distribution function of the execution times of the selected jobs and the empirical distribution function of the tested distribution. The smaller the distance, the more likely the execution times follow the tested distribution.
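As an illustration, such a test can be run with SciPy; the sketch below is a simplified, single-pass version of the procedure on a hypothetical sample of execution times (the actual study repeats it 30 times per (project, cluster) pair and averages the distances).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
exec_times = rng.lognormal(mean=4.0, sigma=1.2, size=50)   # hypothetical sample (seconds)

candidates = {
    "Log-Normal": stats.lognorm,
    "Weibull": stats.weibull_min,
    "Frechet": stats.invweibull,
}
for name, dist in candidates.items():
    params = dist.fit(exec_times)                          # maximum-likelihood fit
    result = stats.cramervonmises(exec_times, dist.cdf, args=params)
    print(name, result.statistic)                          # smaller distance = better fit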
As the Cramér-von Mises test randomly generates the empirical distribution function of the tested distributions, we repeat this process 30 times and take the mean of the distances. Globally, the distributions that best fit the execution times are the Log-Normal, Fréchet, and Weibull laws. The quality of the fit for these laws is nearly identical. This concurs with the results of [IE10; Ios+08] for other computing grids.

Figure 2.7.: Empirical cumulative distribution function of the mean execution time of a campaign. The average mean duration is around 1 hour, but the median mean duration is around 10 minutes.
Figures 2.7 and 2.8 show respectively the empirical cumulative distribution function
for the mean job duration and the number of jobs in a campaign. Half of the
campaigns have less than 50 jobs, and half of the campaigns have a mean execution
time on the order of tens of minutes.
The average campaign has around 2500 jobs with a mean execution time of 1
hour and 15 minutes.
Figure 2.8.: Empirical cumulative distribution function of the number of jobs per campaign.
Half of the campaigns have less than 50 jobs, but there are in average 2500
jobs per campaign.
[Figure 2.9: relationship between the mean execution time and the number of jobs per campaign (log scale)]
Figure 2.9.: Relation between the number of jobs in a campaign and the mean duration of
its jobs. We observe that large campaigns have short jobs, and small campaigns
have long-lasting jobs.
Large campaigns have short jobs, and small campaigns have long-lasting jobs.
In this thesis, we will represent a campaign as a set of jobs having the exact same
theoretical execution time. In practice, we implement a job using the sleep com-
mand with the desired execution time. Choosing this job representation, instead of
executing real Bag-of-Tasks applications, allows us to better control the characteristics
of the campaign and create all sorts of synthetic scenarios.
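A synthetic campaign is then trivially generated; the helper below is a hypothetical illustration of this representation, not the actual experimental scripts.

def make_campaign(n_jobs, duration_s):
    """Synthetic CiGri campaign: n_jobs identical jobs, each a simple sleep."""
    return [f"sleep {duration_s}" for _ in range(n_jobs)]

# e.g. a campaign with the "average" shape found above: 2500 jobs of 75 minutes
campaign = make_campaign(n_jobs=2500, duration_s=75 * 60)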
There are still some open questions. In particular, the origin of the long-tail of the
execution time distribution is still unknown. We suspect that it can come from the
overhead of the scheduler to start and stop the jobs and/or from the load of the
cluster (e.g., file-system, network) at the moment of the execution.
In this Chapter, we are focusing on the research question RQ2 seen in Section 1.6.
We want to reduce the overhead on the I/O operations of the regular users of the
cluster due to the CiGri jobs. In other words, we want to regulate the impact of the
CiGri jobs on the shared file-system of a cluster.
The overhead on the I/O operations cannot easily be observed directly. One way
would be to submit a specific job periodically and measure its performance to have
an idea of the load of the file-system. But this would mean submitting an extra job,
which might stay in the waiting queue and introduce some delay in the measurement.
Moreover, such a job will also use nodes on the cluster. We will instead measure this overhead on I/O operations indirectly, by using a sensor on the load of the
file-server (i.e., the load of the machine hosting the file-system). We decided to use
the loadavg metric [FZ87], present on every UNIX system. This metric has some
interesting properties. First, it is already well known by system administrators. This
means that they know what values of this metric are acceptable on their system. It
is also available on every Unix machine under /proc/loadavg. The value of this
sensor is updated every 5 seconds which is fast enough compared to the duration
of most HPC jobs, see Chapter 2. This metric also carries some inertia due to its
definition with an exponential filter:
Q_i = Q_{i−1} × e^{−T} + q_i × (1 − e^{−T}),  Q_0 = 0    (3.1)
where Q_i is the value of the loadavg at iteration i, and q_i is the number of processes running or waiting for the disk at iteration i. There exist several variations of this metric with different filtering factors (T). These factors correspond to the averaging period of the metric.
$ cat /proc/loadavg
0.62 0.63 0.58 1/1805 49884
The first three values returned correspond respectively to the load averaged over
one minute, five minutes, and fifteen minutes.
The longer the period, the smoother the value of the loadavg, but also the slower
the response of the sensor to a variation. In the following, we consider only the first
value of /proc/loadavg, i.e., the load averaged on the last minute.
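The following short sketch illustrates this trade-off by applying the filter of Equation 3.1 to a synthetic activity signal; it is only an illustration of the metric's behavior, not of how the kernel computes it.

import math

def filtered_load(samples, window_s, sample_period_s=5.0):
    """Exponential filter of Equation 3.1, with T = sample_period_s / window_s."""
    decay = math.exp(-sample_period_s / window_s)
    q, out = 0.0, []
    for n in samples:                    # n: processes running or waiting for the disk
        q = q * decay + n * (1.0 - decay)
        out.append(q)
    return out

# synthetic burst: 8 active processes for one minute, then idle for two minutes
activity = [8] * 12 + [0] * 24
one_minute = filtered_load(activity, window_s=60)        # reacts quickly, noisier
fifteen_minutes = filtered_load(activity, window_s=900)  # smooth, but sluggish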
This variation of the loadavg metric can however be noisy. Indeed, as this is
a system-wide metric, it might capture behaviors that do not belong to the file-
system. Moreover, this metric works nicely for a distributed file-system such as
NFS, where there is a single machine hosting the file-system on the server side.
In the case of parallel file-systems, where there are several I/O nodes, meta-data
nodes, etc., it might be more difficult to use this exact metric. One can average
the different loadavg values of the different nodes belonging to the file-system, but
the aggregated metric would probably lose most of its meaning. In our case, we
are interested in small to medium computing centers, where using NFS as a shared
file-system is a perfectly acceptable option. The implementation of this sensor is
done by opening an SSH connection from the machine hosting CiGri to the file-system server and periodically reading the content of /proc/loadavg:
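A minimal sketch of this sensor is shown below; it assumes passwordless SSH access from the CiGri host to the file-server and is not CiGri's actual implementation.

import subprocess

def fileserver_loadavg(host):
    """Read /proc/loadavg on the file-server over SSH and return the
    1-minute load average (first field)."""
    out = subprocess.run(
        ["ssh", host, "cat", "/proc/loadavg"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.split()[0])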
Only looking at the load of the file-system might lead to some degenerate cases for
the closed-loop system. Imagine the situation where the CiGri jobs perform a very
small quantity of I/O, and that even if all the resources are being used, the load of
the file-server is below the reference value. Then the controller will perceive that it
is possible to keep increasing the number of jobs submitted to reach the reference
value. However, as the cluster is already full, the jobs will go in the waiting queue,
and the only action of the controller is to fill faster and faster the waiting queue
of OAR. Hence, we also need a sensor on the state of the cluster to avoid such
cases. Fortunately, OAR offers an API where we can extract, directly or indirectly,
the number of jobs in the waiting queue or currently running.
Note that in order to be reactive to potential overload of the file-system, CiGri cannot
afford to have a lot of jobs in the waiting queue of OAR. Indeed, as OAR has no
notion of file-system load, if there are idle resources, and (fitting) waiting jobs, those
jobs will be executed. The desired value for this sensor would be close to zero.
To sense the load of the distributed file-system, we use the loadavg metric of
the machine hosting the file-system. For information about the state of the
cluster, we use the OAR API.
3.2 Actuators
At first glance, the only actuator, or knob, at CiGri’s disposal to impact the load of
the file-system is the number of jobs it submits at every iteration (every 30 seconds).
However, we can also consider which jobs are in the submission. Mixing jobs from
different campaigns in a single submission could lead to a better control, where the
additional knob is the proportion of jobs from each campaign. In this thesis, we
consider only the number of jobs submitted by CiGri at each iteration. The work
done in my Master Thesis [Gui20; Gui+21a] explores the proportion of jobs from
two different campaigns to improve the control of the load of the file-server, and
will be summarized in Section 3.7.2. Figure 3.1 summarizes the definition of the
loop in CiGri with the sensors, and the actuator.
As we do not have any a priori knowledge of what would be a model of the system,
we perform identification experiments. The identification aims at finding the relation
governing our system. We intend to determine an expression of the following form:
y(k + 1) = Σ_{i=0}^{k} a_i × y(k − i) + Σ_{j=0}^{k} b_j × u(k − j)    (3.2)
Where:
• y(k): output of the system at step k. In our case, the load of the file-system
• u(k): input to the system at step k. In our case, the number of jobs sent from
CiGri to OAR
To find the coefficients (ai , bj ), we will study the system in open loop. This means
without feedback from the system. We change the input and observe the variation in
the output. We submit steps of number of jobs and see the impact on the load of the
file-server. By step of the number of jobs we mean that at time k, ∀k < kstep , u(k) =
u0 and ∀k ≥ kstep , u(k) = ustep , with ustep ≫ u0 . We then look at the behavior of
the file-server load during these steps to extract the coefficients of the model.
In the following, we consider 4 different sizes of files: 25, 50, 75 and 100 MBytes.
For each size of file, we have 6 steps of values: 1, 10, 20, 30, 40 and 50 concurrent
jobs. For each step, we execute 120 consecutive identical submissions of the step.
Note that we will only consider the writing I/O operation as it is the most costly.
Figure 3.2 represents the results of the identification phase. Figure 3.2a corresponds
to the time to write the files, and Figure 3.2b depicts the file-server load, with the
black dashed line representing its total number of NFS workers. We observe that
increasing the number of simultaneous write requests increases the time to process
these requests. The load of the file-server follows the same pattern.
Note that for a submission of 50 write requests of 100 MBytes, the processing time
skyrockets, and the approximation of the system having a linear behavior is no
longer valid. We see that for such a submission, the load of the file-server reached
the dashed line, representing the number of workers in the system. In this situation,
the file-system is overloaded and cannot deal with all the requests. This motivates
our goal to regulate the load in order not to reach this limit. Thus, the loadavg
metric also provides a way to detect such an overload.
(b) Load of the machine hosting the NFS file-system through the loadavg metric.
Figure 3.2.: Identification experiments. For the different file sizes, we write n concurrent
files onto the NFS file-system and record the time to process each request
(Figure 3.2a) and the load of the machine hosting the file-system (Figure 3.2b).
The dashed line on Figure 3.2b represents the theoretical maximum load of
that the NFS server can manage based on its number of workers. We can
see that writing 50 concurrent 100Mbytes files overloads the file-system, the
processing time explodes and the load is at the theoretical maximum.
[Figure 3.3 panels: Processing Time [s] (top) and Fileserver Load (bottom) as a function of the Number of Jobs in the Submission, one column per file size]
Figure 3.3.: Processing time (top) and the file-server load (bottom) for different submissions
in number of jobs and I/O loads. It represents the identification phase. We
vary the quantity of I/O (columns) and the number of simultaneous write
requests/jobs in x-axis. We observe that the loadavg sensor captures the
(over)load of the file-system.
3.3.2 Modelling
From the data gathered during the identification experiments in Figure 3.2, we
model the relation between the value of the loadavg (y), the I/O load (f ), and the
number of jobs (u). We consider the upper bound of the loadavg in the modelling
to be conservative. By fitting a linear regression on the open-loop data, we get the
following relation:
y = α + β1 f + β2 u + γf × u (3.3)
To design the controller for our system, we need a relation of the form shown in Equation
3.2. As a first approach, we suppose that we are looking for a first order system. The
order of the system corresponds to the degree of dependence of y(k + 1) on previous
values of y (first order: y(k); second order: y(k) and y(k − 1); etc.). First order
systems have a limited set of behaviors, which makes their study easier. Higher order
systems have more than one degree of dependence, which increases their complexity
but also their realism. They can however be approximated by a first order system
under some hypotheses on their poles [Hel+04].
In our case, this means that ∀i, j > 0, a_i = 0, b_j = 0 in Equation 3.2. We are thus
looking for the following relation between the input (u) and the output (y) at step
k:

y(k + 1) = a \times y(k) + b \times u(k)    (3.4)
By definition of the loadavg metric (Equation 3.1 and [FZ87]), we have
a = \exp\left(-\frac{5}{60}\right). The loadavg value updates itself every 5 seconds, faster than a period
of CiGri (∆t = 30s). Thus,

a = \exp\left(-\frac{5}{60}\right)^{\frac{\Delta t}{5}}    (3.5)
[Figure 3.4: Estimation of b (y-axis) versus the Number of Jobs in the Submission (x-axis), one panel per file size.]
Figure 3.4.: Estimation of the parameter b for the model of the system. The bars correspond
to the estimation for a number of jobs (u) and an I/O load (f ). The points
represent the value of the estimation when considering u = +∞. We observe
that the estimations converge towards these limit values. We will take these
limits as values for b.
In order to find the value of b, we must look at the behavior of our system in steady
state, meaning when the system has converged. In a steady state (ss), we have the
following relation:
y_{ss} = a \times y_{ss} + b \times u_{ss} \implies b = \frac{y_{ss} (1 - a)}{u_{ss}}    (3.6)
Figure 3.4 depicts the estimation of b for the values of uss and yss from the identifi-
cation in Figure 3.2.
\lim_{u \to +\infty} b = \lim_{u \to +\infty} \frac{(\alpha + \beta_1 f + \beta_2 u + \gamma f \times u)(1 - a)}{u} = (\beta_2 + \gamma f) \times (1 - a)    (3.7)
This means that the value of b depends on the file size of the current jobs (f ). We
also plot this limit value of b on Figure 3.4 as points. We can see that the estimations
of b converge to this limit value.
a = \exp\left(-\frac{5}{60}\right)^{\frac{\Delta t}{5}} = 0.6065307
b = (\beta_2 + \gamma f) \times (1 - a) = 0.017761 + 6.4273488 \times 10^{-4} \times f    (3.8)
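As an illustration, the following minimal sketch uses the identified coefficients of Equation 3.8 to simulate the open-loop response of the file-server load to a constant submission; the chosen step values and the helper name are ours and purely illustrative.

```python
import math

# Identified first-order model (Equation 3.8): y(k+1) = a*y(k) + b(f)*u(k)
A = math.exp(-5 / 60) ** (30 / 5)           # = exp(-0.5) with dt = 30s

def b(f):
    """Input coefficient, depending on the I/O load f of the jobs (MBytes)."""
    return 0.017761 + 6.4273488e-4 * f

def simulate(u, f, y0=0.0):
    """Open-loop response of the file-server load to a submission sequence u."""
    y, trace = y0, []
    for uk in u:
        y = A * y + b(f) * uk
        trace.append(y)
    return trace

if __name__ == "__main__":
    # Example: a step of 20 concurrent 100 MBytes jobs during 40 CiGri iterations.
    load = simulate([20] * 40, f=100)
    print(f"steady-state load ~ {load[-1]:.2f}")   # ~ b(f)*u / (1 - a)
```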
We evaluate our model by executing random steps of concurrent writes on the NFS
file-system, and log the loadavg of the NFS server. Results can be seen on Figure 3.5.
We observe that for 100Mbytes and 49 concurrent writes, the file-system collapses
and fails to continue, which shows the importance of not overloading the distributed
file-system of a cluster.
We assumed in Section 3.3.2 that our system is of first order. We want to design a
Proportional-Integral (PI) controller to be able to regulate the load of the file-system
with precision and robustness. For a PI, there are two gains to set up: the proportional
gain (Kp) and the integral gain (Ki).
These gains are functions of the model of the system (Equations 3.4 and 3.8) and of
two parameters that allow choosing the closed-loop behavior of the system:
• ks, which represents the maximum time for the closed-loop system to get to
steady state, and which gives the rapidity of the system, i.e., the time to react
to a variation
• Mp, which represents the maximum overshoot allowed for the closed-loop system
From these values given by the administrators of the system, we can derive the gains
for the PI controller [Hel+04]:

K_p = \frac{a - r^2}{b}, \qquad K_i = \frac{1 - 2 r \cos\theta + r^2}{b}    (3.9)

Where:
• r = \exp\left(-\frac{4}{k_s}\right)
• \theta = \pi \frac{\log r}{\log M_p}
[Figure 3.6: Output Values (y-axis) versus Iterations (x-axis), panels for ks = 5, 10, 15 and varying Mp; the ks and Mp annotations mark the settling time and the overshoot.]
Figure 3.6.: Impact of the ks and Mp parameters on the closed-loop behavior of the system.
The smaller ks, the faster the closed-loop system converges to the reference
value. But too small values of ks can lead to some overshooting. Mp controls
the allowed overshoot. The greater Mp, the greater the allowed overshoot.
Figure 3.6 shows the influence of the ks and Mp parameters on the closed-loop
system. We now pick the values of ks and Mp to meet the desired behavior
and extract the gains of the controller from Equation 3.9. In our case, we want to
avoid any overshoot, and thus avoid overloading the file-system, but still have a fast
response. Hence, we will take ks = 12 and Mp = 0.
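A minimal sketch of the gain computation of Equation 3.9 for these design values is given below; the handling of Mp = 0, where θ tends to 0, is our interpretation of the limit case.

```python
import math

def pi_gains(a, b, ks, Mp):
    """Gains of the PI controller from Equation 3.9, given the model (a, b)
    and the closed-loop design parameters (ks, Mp)."""
    r = math.exp(-4 / ks)
    # With Mp = 0 (no allowed overshoot), log(Mp) -> -inf and theta -> 0.
    theta = 0.0 if Mp == 0 else math.pi * math.log(r) / math.log(Mp)
    Kp = (a - r ** 2) / b
    Ki = (1 - 2 * r * math.cos(theta) + r ** 2) / b
    return Kp, Ki

if __name__ == "__main__":
    a = 0.6065307
    b = 0.017761 + 6.4273488e-4 * 100      # 100 MBytes jobs (Equation 3.8)
    print(pi_gains(a, b, ks=12, Mp=0))
```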
[Figure 3.7: load of the file-server (top) and size of the CiGri submission (bottom) versus Time [s].]
Figure 3.7.: Response of the closed-loop system to a step perturbation. The Proportional-
Integral controller increases the number of jobs CiGri submits to OAR (bottom)
to get the load to the reference value (top). At t = 2000s we introduce a
step-shape perturbation resulting in an increase of the load. The controller
detects it and decreases the size of the submission to get the load back to the
reference value.
[Figure 3.8: Idle resources (%) versus the chosen Reference Value (0 to 8).]
Figure 3.8.: Overhead on the MADBench2 I/O benchmark [Bor+07] based on the chosen
reference value for our PI controller.
Figure 3.7 represents the response of the closed-loop system to a step perturbation.
We ask the controller to regulate the load of the fileserver around the value 6
(dashed line). At time t = 2000s, we simulate a premium user job that would
produce a disturbance on the load (dotted and dashed line). We took a step-shaped
disturbance to observe the reaction to a very abrupt change in load and then
observe the convergence to a new stable state. If the disturbance were ramp-based, it
would be easier for the controller to cope with it. We repeat this experiment several
times. Each color on Figure 3.7 represents a single experiment. We also plot the
aggregated mean behavior on these experiments with a continuous line.
We can see that the controller manages to get to the reference value. When the
disturbances start, the controller detects the rise in load due to the step. It then
decreases the number of jobs sent to OAR to get the load back to the reference
value.
The remainder of this chapter presents some modifications of the feedback loop to
improve the behavior of the closed-loop system.
In this chapter, we showed that a PI controller was able to correctly track a reference
value for the load of the file-system. As we can see at t = 2500s on Figure 3.7, the
response time could lead to an overload of the file-system if the reference value is
too high. The choice of the reference value is thus crucial. A reference value that is
too low would lead to poor harvesting of idle resources, while a high reference value
would increase the overhead on priority jobs and the chances of a collapse of
the file-system due to an overload.
One approach would be to dynamically change the reference value based on the
number of premium jobs currently running on the cluster. If there are no premium
jobs, then the reference value is high to allow CiGri to harvest. If there are some
[Figure 3.9: block diagram of the feedback loop with dynamic reference value; the number of premium jobs dk feeds fmodel, whose output yprio,k is combined with ymax to produce the reference yref,k tracked by the controller (inputs ek, uk, yk); see caption.]
Figure 3.9.: Feedback loop representing the control scheme with the dynamic reference
value. The current number of premium jobs (dk) is fed into a model returning
the maximum load that the premium jobs could put on the file-system. This
load is then subtracted from the maximum load for the file-system, and the
result is used as the reference value for the CiGri controller. The value of f̄ represents
a representative file size for the cluster and is chosen by the administrators
of the cluster. This information can be retrieved with Darshan [Car+11] for
example. It could be the mean or median file size, or the 95th percentile to be
more conservative.
premium jobs, then the reference value is decreased to leave more room for the
controller to react. The amount to reduce the reference value depends on how
many nodes are used by priority jobs, and prior global knowledge of the I/O load of
premium jobs.
Figure 3.9 gives an idea of the control loop. The quantity dk represents the number
of resources used by premium jobs at iteration k. fmodel is the estimated maximum
load that can be produced by all the premium jobs simultaneously. It can be found
by performing an identification experiment where, instead of having step inputs as
previously, we have Diracs.
Figure 3.10 shows the results of this new identification experiment. For different file
sizes, we write concurrently to the distributed file-system and observe the load of
the machine hosting the file-system. We then wait for the load to reach (near) zero
and write again. From this experiment we extract a model of the maximum load
(fmodel ) based on the file size and the number of concurrent writes using a linear
regression. Using this model, we can define the reference value for the load of the
file-system:
y_{ref,k} = y_{max} - y_{prio,k} = \max(0, \; y_{max} - f_{model}(d_k, \bar{f}))    (3.10)
The value of f¯ represents a representative file size for the cluster. This information
can be retrieved with Darshan [Car+11] for example. It could be the mean or
median file size, or the 95th percentile to be more conservative. The administrators
of the cluster would set the value of f̄.
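The following sketch illustrates the computation of the dynamic reference value of Equation 3.10, following the description given in the caption of Figure 3.9; the coefficients of f_model are placeholders, as the actual values come from the Dirac identification of Figure 3.10.

```python
def f_model(d_k, f_bar, c0=0.0, c1=0.05, c2=1e-3):
    """Hypothetical linear model of the maximum load produced by premium jobs
    running on d_k resources and writing files of representative size f_bar
    (MBytes). The coefficients are placeholders for the ones identified from
    the Dirac experiments of Figure 3.10."""
    return c0 + c1 * d_k + c2 * f_bar * d_k

def dynamic_reference(y_max, d_k, f_bar):
    """Reference value for the CiGri controller (Equation 3.10):
    the headroom left by premium jobs, clamped at zero."""
    return max(0.0, y_max - f_model(d_k, f_bar))

# Example: NFS server with maximum load 8, 16 resources used by premium jobs,
# representative file size of 100 MBytes.
print(dynamic_reference(y_max=8, d_k=16, f_bar=100))   # -> 5.6
```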
With extra information from the premium jobs, we can adapt the reference
value dynamically to reduce the risks of a collapse of the file-system.
This section summarizes the work done during my master thesis [Gui20] and
published in [Gui+21a]. Originally, CiGri submits batches of jobs from the same
campaign to OAR. While the quantity of I/O done by the jobs of a same campaign is
similar, it differs between campaigns of different projects. Suppose that there are
two campaigns submitted to CiGri, one with high I/O load jobs and the other with
light I/O load jobs.
Figure 3.11 represents a situation using submissions composed of jobs from the
same campaign. If we suppose that we are able to perfectly regulate the number
of jobs submitted for the red campaign to keep the load of the cluster under the
reference load, there will be a “gap” between the actual load of the cluster and the
reference load. We could exploit this “gap” by submitting jobs that have a smaller
impact on the load (blue jobs). Figure 3.12 shows a situation where we submit a
set of jobs coming from two different campaigns to improve the cluster usage, i.e.,
the number of resources used, while keeping the load under the reference value.
[Figure 3.12: sketch of the load over time for a submission mixing jobs from one I/O heavy campaign and one I/O light campaign.]
This consideration gives us one
additional knob of action on the system: the proportion of jobs from each type of
campaign in the batch of jobs to submit to OAR.
Changing the number of jobs in the submission has a bigger impact than changing
the percentage of I/O heavy jobs in the submission. Thus, to regulate the load of the
file-system precisely, we should do these actions in the following order:
1. adjust the number of jobs in the submission (coarse action)
2. adjust the percentage of I/O heavy jobs in the submission (fine action)
The controller will have two modes running in exclusion. The idea could be summed
up as “big step, small step”. The controller will firstly regulate the number of jobs
sent to OAR (big step). Then, when the load of the file-system is “close” to the
reference value, the controller will regulate the proportion of I/O heavy jobs in
the submission (small step). Figure 3.13 gives a graphical representation of the
controller.
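A minimal sketch of this switching logic is given below; the proportional corrections inside each mode are illustrative placeholders for the actual regulation laws used in [Gui+21a].

```python
def two_mode_step(load, load_ref, n_jobs, pct_heavy, T=1.0):
    """One iteration of the two-mode controller (Figure 3.13).
    Far from the reference (|error| >= T): adjust the submission size (big step).
    Close to the reference (|error| < T): adjust the share of I/O heavy jobs
    (small step). The correction factors below are illustrative."""
    error = load_ref - load
    if abs(error) >= T:
        n_jobs = max(0, n_jobs + round(2 * error))                   # big step
    else:
        pct_heavy = min(1.0, max(0.0, pct_heavy + 0.1 * error))      # small step
    return n_jobs, pct_heavy
```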
[Figure 3.13: diagram of the two-mode controller; the Load Sensor drives either the “Nb jobs” action (error ≥ T) or the “% IO heavy” action (error < T).]
Figure 3.13.: Representation of the feedback loop for a submission with different campaigns
with different I/O loads.
[Figure 3.14: Load versus Time sketches, with the Load, Ref, Ref − T and Ref + T lines.]
Figure 3.14.: Graphical representation of the potential behavior of the load if the threshold
is too small (Figure 3.14a) or too big (Figure 3.14b).
The choice of the threshold value (T in Figure 3.13) between the two modes is not
easy, but we give some guidelines below.
If T is too small, then it will be more difficult to get into the mode regulating the
percentage of I/O heavy jobs, and there might be an unavoidable static error plus
some oscillations. Figure 3.14a gives a visual representation of this situation.
If T is too large, the controller might get “stuck” in the mode regulating the percent-
age of I/O heavy jobs. This could lead to reaching a non-optimal stationary state.
Figure 3.14b gives a visual representation of this situation, where the percentage of
I/O heavy jobs reached 100% and the controller cannot get closer to the reference
value. In [Gui+21a], we chose a threshold value of 1 (T = 1), as experiments have
shown it to be neither too small nor too big.
Figure 3.15 depicts an example of regulation of the load of the NFS file-system by
considering I/O heavy jobs and I/O light jobs. The dashed lines on the bottom plot
represent the threshold between the two modes. The solid line is the reference value.
We can see that the controller first enters the threshold zone around 700 seconds,
and then starts regulating the percentage of I/O heavy jobs in the submission (top
plot).
The controller defined in the previous chapter is tightly coupled with its underlying
system. This is due to the identification phase that creates a model of the system.
Implementing such a controller requires a control-theory background. However,
HPC system administrators are not control theory experts, and they cannot develop,
on their own, a new controller for every new cluster, file-system, disk, or network in
their platform. To help with the adoption of control-theory based solutions, and with
software engineering concerns in mind, we explore in this chapter the reusability of
autonomic controllers using control-theory techniques, with CiGri as
a study case. In the following sections, we present the implementation of controllers
of different nature (PI, adaptive PI, and Model-Free Control) to regulate the same
system presented in Chapter 3, and propose a comparative framework to conclude
on their reusability. We further discuss their interdependence and known trade-offs,
from a theoretical point-of-view.
The controllers are compared based on five criteria, each answering a specific design
or evaluation question:
• Nominal Performance: To what extent, and how, does the controlled system
meet its objectives, in the experimental conditions in which the controller has
been designed?
• Portability: How will the controller behave if the system varies from the nominal
version considered for design? What set of systems can the controller manage?
• Setup complexity: To what extent is the controller “plug and play”? How easy
is it to set up the controller?
• Support Guarantees: Is the controller design backed by a methodology
bringing behavioral guarantees? Under which hypothesis do the guarantees
hold?
The next sections define each criterion in detail, discuss their relevance, and propose
metrics to characterize them.
Metrics Classical control theory provides three metrics to characterize the dynamical
behavior of a controlled system. Note that stability is omitted here, as we suppose
all controllers are functional. Precision reflects the ability to reach the desired
target. It is measured using the static error metric, giving the distance between the
measured performance and its reference value, once the system has converged. 0 is the
best value: the smaller, the better. Rapidity characterizes the speed of convergence
of the controlled system. It is measured using the response time metric, computed
as the time needed for the performance signal to reach 95% of its final value.
The smaller, the better. Quality of the signal is also considered in case there are
oscillations in the performance signal. It can be measured via the overshoot metric,
the relative value of the highest peak of the performance signal. 0 is the best value: the
smaller, the better.
Definition When used in real conditions, the controller may have to manage a
system that does not correspond to the nominal situation. Portability captures the
ability of a controller to be applied, without or with limited modification, to a
system different from the nominal design one, and to keep adequate performance.
Two cases can be distinguished for the portability conditions. First, the variation
occurs at runtime, i.e., the system deviates from the nominal version, possibly due
to some disturbances or external events. The ability of a controller to keep good
performance in case of runtime variations of the system is to be linked with the well
studied property in Control Theory of robustness. Robust controllers are able to
reject disturbances, i.e., ensuring that the effect of the disturbance on the system
remains limited and mastered. Second, the controller may be applied to a system
with different configurations, i.e., variations occur offline, on different runs. Such
software or hardware configuration changes can be significant, for instance when
different clusters are considered. This property is rarely considered in Control Theory
for physical systems, where the notion of different runs does not exist: controllers
continuously monitor the same physical system.
Definition Controllers are implemented and used by system experts, who are not a
priori knowledgeable in Control Theory. A controller with low design complexity is
more likely to be understood, correctly implemented, and tuned. If performances
are acceptable, the simplest controller may be the best option. Note that design
complexity is different from the mathematical complexity of the controller algorithm,
as it rather addresses the user experience aspect.
Trading support guarantees for low setup complexity and limited control compe-
tences Strong guarantees on the behavior of the controlled system over a wide
variety of conditions can be achieved, however this often requires using advanced
control techniques. Properly designing such controllers assumes significant control
engineering knowledge, and their implementation and usage are consequently more
complex. This discussion on the simplicity of the controller usage, even at the cost
of a loss of performance guarantees, is very common in the control field. However, it
reflects the reality of systems’ experts when using controllers in practice.
This Section presents the design of the controllers considered in this study of reusabil-
ity: a PI controller (Section 4.2.1), an adaptive PI (Section 4.2.2), and a Model-Free
controller (Section 4.2.3). Sections 4.2.2 and 4.2.3 summarize, respectively, the
work done during the Master 2 internship of Rosa Pagano [Pag23], and [Gui+22d]
published at CCTA 2022 with Bogdan Robu, Cédric Join, Michel Fliess, Eric Rutten,
and Olivier Richard.
Tuning We have two parameters to choose the closed-loop behavior of our system:
ks and Mp . From those parameters and the model (a and b), we find the gains Kp
and Ki of the PI controller.
4.2.2 Adaptive PI
The PI controller defined above is coupled to one kind of jobs (execution time and
I/O load). An adaptive controller, however, would be able to adapt its configuration
at runtime to fit the running jobs even if they are different from those used for the
design of the controller. In our case, the parameter b of our model depends on the
I/O load of the jobs (f ), see Figure 3.3 and Equation 3.8, which thus motivates the
use of adaptive control.
We build adaptation on top of the previously defined PI controller, and the adaptive
algorithm adjusts its parameters to account for the unmodeled dynamics beyond the
scope of the linear model [She+17].
There are two primary approaches for the adaptation: updating the parameters of
the controller (direct), or updating the parameters of the model (indirect). In our
case, the indirect approach appears more suitable because of the dependence of
the previous PI controller on the b parameter of the model. We will thus, at each
iteration of the control loop, estimate the value of b(k) of our model, noted b̂(k),
and recompute the value of the gains (Kp and Ki ) accordingly.
Our objective is to estimate b̂(k) using an online algorithm. We define the prediction
error ε representing how far the estimated model is from the reality measured:
Due to the presence of significant noise in the measured data, see for instance
Figure 3.3, the algorithm should be capable of mitigating noise. Therefore, we use a
modified Recursive Least Squares (RLS) [ÅW08] algorithm to improve the robustness
against noise. That is, we substitute ε with its filtered version Φ(ε(k)), given by:

\Phi(\varepsilon(k)) = \frac{\varepsilon(k)}{1 + \phi \, |\varepsilon(k)|}    (4.2)
With V (k + 1):
where µ represents the forgetting factor. The presence of parameters changing over
time, such as b varying with different campaigns, requires the use of a forgetting effect
with µ ∈ (0, 1]. However, the µ parameter introduces a significant drawback known
as the “bursting phenomenon” [And85; FKY81]: when the system lacks sufficient
excitation, the algorithm may approach a singularity, resulting in large spikes in the
estimated parameter.
The algorithm relies on several parameters: the forgetting factor µ, the smoothing
parameter ϕ, and the initial conditions b̂(0) and V (0). Note also that the adaptive PI
is built on top of the PI, which requires the computation of the model parameter a
and the definition of ks and Mp, two quantities that allow choosing the closed-loop
behavior (see fig. 3.6).
Smoothing parameter ϕ Strong smoothing leads to slow estimation, but also re-
duces overshoots and oscillations. There is no methodology to define ϕ, and we
chose a satisfactory value via an experimental approach.
Initial conditions There are two distinct initial conditions: b̂(0) and V (0). For b̂(0),
a value of 0.5 was chosen. Small values, i.e., underestimations, of b̂(0) might lead to
overshooting due to the relation between u and b. The significance of V (0) lies in its
impact on the algorithm’s initial speed. Both initial conditions were chosen from
experimental results.
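The sketch below illustrates the online estimation of b̂; since the exact modified update of V (k + 1) is not reproduced above, we fall back on a standard scalar Recursive Least Squares with forgetting factor, combined with the error filter of Equation 4.2. It is therefore an assumption-laden illustration rather than the exact algorithm.

```python
import math

A = math.exp(-0.5)   # model parameter a (Equation 3.8), with dt = 30s

class AdaptiveB:
    """Online estimation of the model parameter b with a forgetting factor mu
    and the error filter of Equation 4.2. Standard scalar RLS; the exact
    modified update used in the thesis may differ."""
    def __init__(self, b0=0.5, V0=1e4, mu=0.9, phi=2.0):
        self.b, self.V, self.mu, self.phi = b0, V0, mu, phi

    def update(self, y, y_prev, u_prev):
        eps = y - A * y_prev - self.b * u_prev         # prediction error
        eps = eps / (1 + self.phi * abs(eps))          # filtered error (Eq. 4.2)
        gain = self.V * u_prev / (self.mu + self.V * u_prev ** 2)
        self.b += gain * eps
        self.V = (self.V - gain * u_prev * self.V) / self.mu
        return self.b
```

At each CiGri iteration, the new estimate b̂(k) would then be injected into Equation 3.9 to recompute Kp and Ki, which is the indirect adaptation described above.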
The third type of controller we will compare is one that does not rely on a model.
Many choices exist, such as Active Disturbance Rejection Control [Han09], Model-
Free Control [FJ13], or the ones using learning techniques (e.g., [Pon+18]), each of
them with several variants and improvements.
Model-Free Control (MFC) is an approach that has the advantage of not necessitating
the possibly tedious phase of modeling. The main ideas and formulations are
summarized in the remainder of this section.
The MFC formulation relies on an ultra-local model, but does not require building a
reliable global model. In that sense, it justifies its name of Model-Free Control. The
considered ultra-local model is of the form:

\dot{y}(k) = F(k) + \alpha \times u(k)    (4.5)

with ẏ(k) the derivative of y(k), α ∈ R a constant, and F (k) a term to be estimated
(see eq. (4.7)), reflecting the unknown structure of the system and its disturbances.
The Model-Free approach allows plugging PID controllers, defining the notion of
intelligent PIDs. For an intelligent Proportional (iP), which is linked to a classical
Proportional-Integral controller, the control action is computed as:
where F̂ (k) is the estimated value of F (k), and Kp ∈ R the proportional gain. [FJ13]
present different techniques to compute F̂ . Yet, as a first approach, we use past
values of y and u to estimate it from (4.5):
Tuning There are two parameters to tune: α and Kp. In [FJ13], the authors
recommend taking α such that ẏ and α × u have the same order of magnitude. Thus,
as in [Gui+22d], we rely on experiments to set the parameter α. The value of the
gain Kp accounts for the closed-loop behavior of the system. Small values of Kp yield
conservative and slow controllers, whereas greater values yield more aggressive
controllers prone to overshooting and oscillations. Therefore, the Kp value should
be chosen by trial and error or by performing a Pareto-type analysis.
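To make the approach concrete, here is a minimal sketch of an intelligent Proportional controller following the general formulation of [FJ13]; the one-sample estimation of F, the finite-difference derivative, and the clamping of the submission size are simplifications of ours and not necessarily the exact formulation used in [Gui+22d].

```python
class IntelligentP:
    """Sketch of an intelligent Proportional (iP) controller based on the
    ultra-local model dy/dt = F + alpha * u. F is estimated from the last
    measurement and the last control; this one-sample estimate is a
    simplification of the techniques discussed in [FJ13]."""
    def __init__(self, alpha, Kp, dt=30.0):
        self.alpha, self.Kp, self.dt = alpha, Kp, dt
        self.u_prev, self.y_prev = 0.0, 0.0

    def step(self, y, y_ref):
        # Estimate the unknown term F from the previous sample.
        dy = (y - self.y_prev) / self.dt
        F_hat = dy - self.alpha * self.u_prev
        # iP control law (constant reference, so its derivative is zero).
        e = y - y_ref
        u = max(0.0, -(F_hat + self.Kp * e) / self.alpha)  # submission size >= 0
        self.u_prev, self.y_prev = u, y
        return u
```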
The experiments were carried out on nodes from the Grisou cluster of Grid’5000
[Bal+13], which is a shared French testbed for experimental research in distributed
and parallel computing. Each node of this cluster has two Intel Xeon E5-2630 v3 CPUs
with eight cores per CPU and 128 GiB of memory. Each server of our system is
deployed onto a single Grid’5000 node from a Kameleon system image [Rui+15].
We used four nodes for the deployment setup: one for the CiGri server, one for
The PI is tuned based on the model of Equation 3.3. Based on identification from
experimental data (Figure 3.3), we have the following values: α1 = −0.5071484,
β1 = 0.0086335, β2 = 0.0451394, and γ = 0.001633. The gains of the PI controller are
then defined using eq. (3.9), with the design parameters set to ks = 12 and Mp = 0 in
order to avoid any overshoot, but still have a fast response. The adaptive PI relies on
the PI tuning, with additional parameters set as follows: µ = [0.5, 0.6, 0.7, 0.8, 0.9, 1],
ϕ = 2, b̂(0) = 0.5, and V (0) = 104 . For the MFC, refer to [Gui+22d] for the
numerical details of its configuration.
For the three controllers considered in this study, we will evaluate their performance
by varying the characteristics of the CiGri jobs from the identified system. Controllers
were designed for jobs lasting 30 seconds and writing 100 MBytes to the fileserver.
This configuration represents our nominal system. For testing, we consider variations
on two dimensions: the amount of I/O written by the jobs, and the execution time
of the jobs. For the I/O, we consider the following file sizes: 50, 100, 200, and
400MBytes and for the execution times, we consider: 10, 30, 60, and 120 seconds.
For each triplet (I/O load, execution time, controller), we repeat 5 times the following
scenario:
3. we submit a campaign of jobs to CiGri with the desired characteristics (I/O load,
execution time)
[Figure 4.1: load through time and performance metric panels comparing PI, aPI, and MFC.]
Figure 4.1.: Nominal performance of the different controllers through time. The solid line
represents the aggregated behavior from 5 different experiments (light dots).
The bottom plots compare the controllers on the performance metrics defined
in Section 4.1.
Nominal performance
We first compare the performance of controllers on the nominal system, i.e., the
system’s configuration that was used for design and tuning. The three controllers are
executed, and the measure of the fileserver load through time is reported in Figure
4.1. The top plots represent the first 2000 s of the scenario (before the introduction
of the disturbance) for each controller. The solid lines aggregate the behavior of
5 repetitions of each experiment (light dots), and the dashed line represents the
reference value that the controllers follow.
All controllers successfully track the reference value, with overall similar
performances. The adaptive PI is slightly worse in terms of rapidity. The MFC
is fast and precise, but can lead to an overshoot. The PI performs fairly well
on all criteria.
Portability
Variations of I/O Load All controllers manage to track the reference value; however,
some configurations lead to large oscillations in the fileserver load. The disturbance
does not seem to have a detrimental impact; it even reduces the oscillations in some
cases.
[Figure 4.2: Fileserver Load (y) versus Time [s], one column per controller and one row per configuration (50 MBytes files; 10, 30, 60 and 120 second jobs).]
Figure 4.2.: Global comparison of the controllers’ performance for variations in I/O loads
(fig. 4.2a) and jobs execution times (fig. 4.2b). The solid lines represent the
aggregated behavior from 5 different experiments (light dots).
[Figure 4.3: performance metrics (including Precision and Rapidity) versus File size [MBytes] (left) and Execution times [s] (right).]
(a) Performance with varying I/O of the jobs. (b) Performance with varying jobs execution time.
Figure 4.3.: Comparison of controllers’ portability. The plots represent the performance
metrics computed on various systems between 2000 seconds. The solid lines
link the means over varying configurations for each controller.
For the MFC, the trend is similar to that in nominal conditions: fast, but it can lead
to large overshoots. The PI is slower but has fewer oscillations when the system
is very different from the nominal one. The adaptive PI seems to have worse
performance on the nominal system (100 MBytes) than the other two controllers,
but it does not seem to be much impacted by the variations in I/O load.
1. overshoot in the first 2000 seconds is larger for the adaptive PI, except with
400 MBytes (when we are far away from the nominal system), where both PI
and MFC present huge values
2. precision is similar for all controllers, except with 400 MBytes where the
adaptive PI is significantly better
3. rapidity of the adaptive PI is better than for the other controllers on non-
nominal systems (note that for the experiments with large oscillatory behavior,
the rapidity criterion should be mitigated as convergence is not reached)
The MFC and the PI perform very well close to the nominal system, but have
difficulties not overshooting and oscillating when the system is too far from
the nominal system, i.e., 400 MBytes in our scenario. On the other hand, the
adaptive PI has a similar behavior whatever the system configuration.
Variations of Jobs Execution Time When variation in the jobs’ execution time is
introduced, controllers successfully track the reference, with a better behavior than
in the case of I/O variations, see Figures 4.2b and 4.3b.
For the PI and MFC controllers, reducing the job duration to 10 seconds even
improves the performances by reducing the overshoot. When increasing the du-
ration, the PI seems to better control the system, with few oscillations and small
overshoot.
The relatively poor performance of the adaptive PI for job duration variations can
be theoretically explained. The increase of job duration leads to the addition of
a delay in the system, i.e., it takes more time to see the effect of an action on the
system. The adaptive PI was built to adapt the model parameter b, not
to identify a delay. Thus, the adaptive PI is not designed to deal with this type of
system variation, and its adaptation can be detrimental in some cases.
Set-up complexity
In a first attempt to evaluate the setup complexity, we count the number of parame-
ters to be tuned for each controller.
The PI also requires two parameters: Kp and Ki . They are computed as in Equa-
tion (3.9) based on the model of the system (two parameters in the case of the linear
first order model: a and b), that can be derived from the identification experiments.
Moreover, the tuning allows translating the problem of finding Kp and Ki into the
choice of the design quantities ks and Mp . Thus, two parameters are needed for
setting up the PI.
Finally, the adaptive PI is based on the PI: thus it requires the same two parameters.
It additionally requires parameters specific to the adaptive part: µ, ϕ, b̂(0), V (0),
summing up to a total of six parameters.
Besides the raw count of parameters, we can point out that some parameters are
more sensitive than others in terms of precision needed. We therefore see three
cases:
1. the very sensitive parameters for which an ill-tuned value can even make the
system unstable (e.g., Kp )
2. the parameters for which an ill-tuned value makes the system slower (e.g., ϕ)
The controllers studied in this chapter have different guarantees from the point-of-
view of their behavior.
The adaptive PI provides additional guarantees to the PI, by broadening the condi-
tions in which the controller is operational. That is to say that the guarantees that
one has on the nominal system, also extend to systems close to the nominal one, as
seen in Section 4.3.4. However, knowing in advance if a system falls into the region
of stability of a controller is not straightforward for a computing system, especially
without performing a preliminary modeling identification phase. Additionally, the
adaptive PI has the drawback of requiring specific input signals – known as persistent
excitation – to avoid its divergence, namely the bursting phenomena.
In this perspective, MFC has the advantage of very low requirements, as it does not
require specific engineering education. On the other hand, it is a novel, emerging
method, with only little literature and experience from which to draw practical
guidelines.
In the case of the classical PI, there is a need to perform open-loop identification
experiments. It has a practical cost, but there is only relatively low mathematical diffi-
culty in the equations at play. Therefore, it is accessible and taught to undergraduate
students.
The adaptive PI builds on the PI: it hence requires a greater set of skills to use [Lan+11].
Moreover, over the years, many upgrading elements and variants have been pro-
posed in order to correct and improve certain aspects in its design (e.g., [ÅW13;
IS96; Mid+88]). In terms of skills, this usually is taught in Master curriculums
dedicated to control engineering or even at the PhD level for the more advanced
variants.
4.4 Conclusion
This chapter addressed the need for the reusability of autonomic controllers in
the context of HPC with the study case of CiGri. Controllers from control theory
can be too tightly bound to the nominal system, which can make them lose
performance dramatically when they are used in different contexts, with variations
in time or in space. We proposed a comparison framework to evaluate the reusability
of autonomic controllers, defined criteria of interest, and exposed tradeoffs between
criteria. Using the study case of CiGri, we implemented three controllers of different
nature (PI, adaptive PI, and Model-Free control), and compared them with respect
to our criteria.
Table 4.1 summarizes the comparison of the different controllers on the criteria of
the framework of Section 4.1. More general comparative results are still an open
question due to the great variety of cases according to the control problems. As
expected, there is no clear “best controller” from a computer engineer point-of-view,
but the table exhibits several tradeoffs.
In the case where one wants to limit the design complexity and required
competences, but still wants acceptable nominal performance, the MFC appears as an
excellent candidate as it has a low setup complexity.
In the case where a well documented and explainable solution is required, then the
PI controller has the advantage of being a very well-known control algorithm.
5 A Control Theory Approach to Reduce Wasted Computing Power in HPC
5.1 Introduction
Researchers have been studying the problem of harvesting idle HPC resources
[Mer+17; TTL05; Prz+22] by submitting jobs that are more flexible (smaller, inter-
ruptible). These jobs are viewed as second class citizens by the scheduler and can be
killed if needed. However, these jobs are still being executed on the compute nodes
and impact the shared resources of the cluster (e.g., file-system, communication
network), thus inevitably disturbing the jobs of normal users. This means there is a
trade-off to exploit between the amount of harvesting and the maximum perturba-
tion that these premium users can accept. Unfortunately, these solutions introduce a
new source of computing power waste by killing jobs. Indeed, most of these small
jobs do not implement a check-pointing mechanism; thus, all the computations done
before the jobs are killed will be lost and will need to be started again.
In this chapter, we tackle the problem of harvesting the idle resources of a clus-
ter while reducing the total amount of wasted computing time (from either idle
resources or killed jobs) in the context of CiGri.
Table 5.1.: Summary of the notations used.
Notation   Definition
uk         Number of Best-Effort resources submitted by CiGri at iteration k
wk         Number of Best-Effort resources in the OAR waiting queue at iteration k
rk         Number of resources used by Best-Effort jobs in the cluster at iteration k
yk         Output of the sensor at iteration k (yk = rk + wk)
rmax       Total number of resources in the cluster (supposed constant)
p̄j         Mean processing time of jobs in a CiGri campaign
yref       Reference value, or desired state of the system to maintain (yref = (1 ± ε) × rmax with ε ≥ 0)
∆t         Time between two CiGri submissions (constant, chosen by the system administrator); 30 seconds in this chapter
ek         Control error: difference between the desired state and the current state (ek = yref − yk)
Kp, Ki     Proportional and Integral gains of a PI controller
h          Horizon value: amount of time in the future to look at in the predictive Gantt of OAR
One challenge is to keep enough Best-Effort jobs in the waiting queue of OAR so they
can be started immediately when some resources are freed, but not so many as to
overload the waiting queue. An overload of the waiting queue can lead to a longer
response time for the controller, as OAR will schedule the jobs regardless of whether
they are destined to be killed or not.
Our knob of action is the number of Best-Effort jobs submitted by CiGri at each of
its iterations. To sense the current state of the clusters, we can query the API of
the OAR scheduler to get the number of resources currently used and the number of
Best-Effort jobs waiting.
For the sensor, we will consider resources and not jobs, as we want 100% usage
of the nodes. Considering jobs instead of resources could lead to some degenerate
situations. One example would be Best-Effort jobs using 2 resources each with
a single idle resource on the cluster. By considering jobs instead of resources, the
controller could continue the injection of Bag-of-Tasks jobs thinking that there is
always an idle resource, even though it cannot be used.
We note:
We thus want to regulate the quantity “number of currently used resources + number
of Best-Effort resources waiting” around the total number of resources available in
the cluster. This means that we want to regulate rk + wk around rmax by varying the
value of uk.
We note yk = rk + wk . We also note the reference value yref = rmax , which is the
control theory term for the desired state of the system. The reference value could
also be (1 ± ε) × rmax , with ε ≥ 0, based on the cluster administration preferences.
A reference value greater than rmax could lead to longer response time for the
controller and more killing of Best-Effort jobs, whereas a reference value smaller
than rmax will leave some resources idle but kill less jobs.
We observe that around 1500 seconds, CiGri starts to overflow the waiting queue.
As a result, the value of the sensor does not stabilize but keeps increasing,
until 2600 seconds where there are no more submissions from CiGri. This can be
explained by the fact that CiGri is submitting more jobs than the cluster can process
in one iteration. Let pj be the execution time of the jobs of the campaign. Note
that we suppose that the jobs have the same execution time. Then, the number of
resources freed at each iteration is:
[Figure 5.1: left, identification signal versus Time (s); right, Count histogram of the job execution times.]
(a) Identification experiment. We vary the size of the batch that CiGri sends to OAR as steps: CiGri first submits batches of 5 jobs for 20 iterations, then batches of 10 jobs for another 20 iterations, then 15 jobs and finally 20 jobs.
(b) Histogram of the execution times for a CiGri campaign during the identification experiment. The campaign is composed of synthetic jobs of theoretical duration 60 seconds.
Figure 5.1.: Results of the identification of the system. Figure 5.1a depicts the link between
the input and the output of our system. Figure 5.1b shows that the distribution
of execution times is impacted by the commission and decommission of the
nodes by OAR, and thus must be taken into account in the modelling.
\frac{\Delta t}{p_j} \times r_{max}    (5.1)
However, when we look at the actual execution times of the jobs, we remark that the
execution times are not constant. Figure 5.1b shows the histogram of the execution
times of the CiGri campaign used for the identification experiment. As we can see, no
job actually lasted the theoretical 60 seconds. This is due to the node setup and node
cleaning mechanisms of OAR. To improve the precision of the processing rate, we can
replace pj in Equation 5.1 by the mean of the execution times (p̄j). This modification
leads to a processing rate of about 14.8 jobs per iteration. This corrected rate thus
explains why the value of the sensor starts growing when CiGri starts to submit
batches of 15 jobs.
We want to avoid overflowing the waiting queue. Indeed, if the queue is not
empty, OAR will try to schedule the Best-Effort jobs on the machines regardless of whether they
are going to be killed soon or not. By regulating the number of jobs in the waiting
From the discussion in the previous section, we can model the system as:

r_{k+1} + w_{k+1} = \left(1 - \frac{\Delta t}{\bar{p_j}}\right) \times r_k + w_k + u_k    (5.2)
Equation 5.2 describes perfectly the behavior of the system. However, we cannot
transform this equation into a linear model easily exploitable by control theory tools.
More complex models might be possible, but as a first step we considered linear
models. Linear models have the following form:
y_{k+1} = \sum_i a_i \times y_{k-i} + \sum_i b_i \times u_{k-i}    (5.3)
Approximating rk by rmax in the first term of Equation 5.2, i.e., assuming the cluster is close to fully used, we get:

r_{k+1} + w_{k+1} \simeq r_k + w_k - \frac{\Delta t}{\bar{p_j}} r_{max} + u_k
\Leftrightarrow y_{k+1} = y_k + \left(u_k - \frac{\Delta t}{\bar{p_j}} r_{max}\right)    (5.4)

\frac{\Delta t}{\bar{p_j}} r_{max} is called the operating region of the system.
Knowing the model of the open-loop system, we can design a controller to regulate
the closed-loop system. Similarly to the work in Chapter 3, we decide to use a
Proportional-Integral controller for its precision and robustness. In the following,
we use ks = 10 and Mp = 0.
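As a sanity check, one can simulate the closed loop of Equation 5.4 regulated by such a PI; the positional PI form, the feedforward operating point, and the numerical values below are illustrative and may differ from the actual implementation.

```python
import math

def simulate_cluster(ks=10, r_max=32, dt=30.0, p_mean=65.0, n_iter=60):
    """Closed-loop simulation of the linear model of Equation 5.4 with a PI
    controller. Gains follow Equation 3.9 with the model (a = 1, b = 1) and
    Mp = 0; the operating point (dt/p_mean)*r_max is added as a feedforward
    term. A sketch under simplifying assumptions, not the actual controller."""
    r = math.exp(-4 / ks)
    Kp, Ki = 1 - r ** 2, (1 - r) ** 2          # Equation 3.9 with a = b = 1
    op = dt / p_mean * r_max                   # jobs completed per iteration
    y_ref = r_max
    y, integral, trace = 0.0, 0.0, []
    for _ in range(n_iter):
        e = y_ref - y
        integral += e
        u = max(0.0, op + Kp * e + Ki * integral)   # submission size >= 0
        y = y + u - op                              # Equation 5.4
        trace.append(y)
    return trace

print(round(simulate_cluster()[-1], 2))        # converges towards y_ref = 32
```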
[Figure 5.2: sketch of Jobs #1 to #4 placed on Resources versus Time.]
Figure 5.2.: Graphical explanation of the prediction Gantt sensor. The value returned by the
sensor is the number of resources that will be used by normal jobs in horizon
seconds.
5.2.5 Taking the future into account with help from the scheduler
For now, our controller only reacts to the instant changes in the system (arrival
or departure of normal jobs). To reduce the wasted computing time (idle and
killed), we need a way to predict those changes on the system and take them into
account in the controller. Doing this would allow the controller to proactively
increase or decrease the number of CiGri jobs to submit to OAR to avoid the killing
of Best-Effort jobs or the idling of some resources. We thus need a way to query the
provisional schedule of OAR to extract information.
In the current state of OAR, the information available through the API is not enough
to know the provisional schedule. We thus slightly modified OAR (about 30 lines of
code) to implement a new software sensor by introducing a new route in its API that
returns the number of normal jobs that are predicted to be running at a given time
in the future. We call this time the horizon (h), and its value is decided in CiGri.
Figure 5.2 depicts the idea of this sensor.
There are several ways to inject the information returned by this new sensor into the
controller. We took inspiration from the feedforward techniques of Control Theory,
and decided to change dynamically the reference value. If we note dhk the number of
resources that will be used by normal jobs in h seconds (the horizon), then we can
redefine the reference value for the controller as yref,k = rmax − dhk . Roughly, the
reference value is the number of available resources for the CiGri jobs in h seconds.
We also adapt the operating point in Equation 5.4.
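A minimal sketch of this horizon-based reference computation is shown below; the exact route name and payload of the new OAR API route are not specified here, so the ones used below are placeholders.

```python
import requests   # any HTTP client would do

OAR_API = "http://oar-server/api"     # hypothetical base URL
HORIZON = 60                          # seconds, chosen in CiGri
R_MAX = 32                            # total number of resources

def predicted_busy_resources(horizon):
    """Query the new OAR sensor: number of resources predicted to be used by
    normal jobs `horizon` seconds from now. Route and field are placeholders."""
    resp = requests.get(f"{OAR_API}/gantt/prediction", params={"horizon": horizon})
    return resp.json()["busy_resources"]

def reference_value():
    """Feedforward-style reference: resources available to CiGri in h seconds."""
    d_h = predicted_busy_resources(HORIZON)
    return R_MAX - d_h
```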
[Figure 5.3: block diagram; dhk is subtracted from rmax to form yref,k, and the controller computes uk from the error ek = yref,k − yk applied to the system.]
Figure 5.3.: Feedback loop representing the control scheme. The reference value (yref ) is
proactively changed to take into account the future availability of the resources
(dhk ).
The value of the horizon is a parameter of the controller, which carries a trade-off.
A small horizon value makes the sensor highly sensitive to misestimations of the
walltimes of the normal jobs, and requires a fast response from the controller to meet
the desired state. This would lead to an increase in the killing of Best-Effort jobs. On
the other hand, a large horizon value might be too conservative and lead to more
idle machines.
Figure 5.3 summarizes the control scheme of our system with this new sensor.
5.3 Evaluation
The experiments were carried out on the dahu cluster of Grid’5000 [Bal+13], where
the nodes have 2 Intel Xeon Gold 6130 CPUs with 16 cores per CPU and 192 GiB of
memory. The reproducibility of the deployed environments is ensured by NixOS
Compose [Gui+22b]. The environments are available at [fee23]. For each exper-
iment, we deploy 3 nodes: one for the OAR server, one for CiGri, and one for an
OAR cluster of 32 resources. Note that we do not deploy 32 nodes for the cluster, but
instead deploy a single node and define 32 OAR resources. This choice is made with
energy concerns in mind, as deploying a full size cluster to perform the following
experiments would be an aberration. We do deploy the real software stack (OAR,
CiGri), but no real job is executed, only sleeps (see discussion about the job model
in Chapter 2). This representation of the jobs allows us to emulate several OAR re-
sources on a single physical machine without introducing noise due to the sharing of
computing resources for computation, communication, storage, etc.
We will evaluate the controller on this scenario with different parameters: different
execution times of the CiGri jobs, and different horizon values for the sensor de-
scribed in Section 5.2.5. We consider 3 execution times (pj) for the CiGri campaigns:
30s, 60s, and 240s. Those execution times correspond respectively to the first
quartile, median, and third quartile of the execution times of the CiGri jobs on the
Gricad platform (see Figure 2.1). For the horizon, we consider 9 different durations:
no horizon (or 0 seconds), 30s, 60s, 90s, 120s, 150s, 180s, 210s, and 240s. The horizon
values are multiples of ∆t = 30s, as smaller values would miss some behaviors. The
maximum horizon that we consider is 4 minutes, which is the third quartile of the
execution times. Each experiment is repeated 10 times to reduce the noise due
to the time OAR needs to set up and clean the resources before and after a job.
Figure 5.4a depicts the distribution of the percentage of computing time lost due to
idle resources (left column) and due to killed jobs (right column). We can see that
considering the horizon does not have a noticeable impact on the percentage of lost
time due to idle resources for pj = 30s and 240s. In the case of pj = 60s, we see
that longer horizons lead to more idle resources, which is expected. Note that for
pj = 240s, we see a decrease in the percentage of killed jobs with the increase of the
horizon. This decrease is not noticeable for the other processing times considered.
In our scenario, a constant submission will fill the waiting queue of Best-Effort jobs of
OAR, and the jobs accumulated in the waiting queue will be scheduled by OAR and
get killed when the second job starts.
Taking the future into account in a step-based submission strategy does not
yield any noticeable improvement of the usage of the resources.
Figure 5.4b depicts the distribution of the lost compute time due to idle resources
and killed Best-Effort jobs for a PI Controller with various horizon lengths. We can
see that for small horizons (30 or 60 seconds), this mechanism allows reducing the
idle time of the system. However, the longer the horizon, the more idle time, but the
less killing of jobs. There is thus a trade-off between killing less and harvesting more.
Compared to the constant injection depicted in Figure 5.4a, we can see that the
percentages of lost computing power are slightly different. A PI controller without
horizon will lose around 5% of computing power due to idle resources, while losing
one percent due to killed jobs. On the other hand, the constant controller will lose
less from idle resources (around 2.5%), but lose a lot more by killing jobs
(about 2%). This is because of the accumulation of jobs in the waiting queue, which
are scheduled by OAR even if they are destined to be killed. This highlights again
the added value of the feedback regulation of the system.
[Figure 5.4a: Lost compute time [%] for Idle Resources (left) and Killed Jobs (right), one row per pj (30s, 60s, 240s), versus the Horizon (None to 4m).]
(a) Distribution of the lost compute times for the constant submission with
horizon.
[Figure 5.4b: Lost compute time [%] for Idle Resources (left) and Killed Jobs (right), one row per pj (30s, 60s, 240s), versus the Horizon (None to 4m).]
(b) Distribution of the lost compute times for the PI Controller with horizon.
Figure 5.4.: Distribution of the lost compute times due to idle resources (left column) and
because of killing Best-Effort jobs (right column). The x-axis represents the
horizon of the sensor described in Section 5.2.5. Figure 5.4a presents the
results for the constant submission with horizon, and Figure 5.4b for the PI
Controller with horizon. The dashed line is the mean lost time for the solution
without horizon.
[Figure 5.5: submission size (top) and sensor value (bottom) versus Time [s].]
Figure 5.5.: Control signals for a scenario with pj = 60s and a horizon of 60s. The top
plot represents the number of resources submitted by CiGri through time. The
bottom plot depicts the value of our sensor, as well as the number of resources
available to CiGri in dashed red.
[Figure 5.6: Gantt chart, Resources versus Time [s], with CiGri and Normal jobs.]
Figure 5.6.: Gantt chart for a scenario with pj = 60s and a horizon of 60s. The killed jobs
are depicted with a thicker contour.
Figure 5.5 shows the temporal evolution of the control signals for a scenario with a
campaign of one-minute long jobs and a horizon value of 60 seconds. The number
of available resources through time is depicted as a dashed red line. We can see that
the controller is able to adapt the number of jobs it submits to OAR (top graph)
to meet the reference value (bottom graph). We do notice some overshooting
around 100s and 3000s. These are due to a large variation in the reference value,
but the controller manages to stabilize the system in a couple of minutes. The
overshooting could be tamed by changing the gains of the controller, and especially
the Mp parameter presented in Section 5.2.4. This issue might also come from the
imprecision of our model due to the estimation presented in Section 5.2.3. Even if
we designed the controller to reach the reference value within 10 CiGri iterations, it
seems able to reach it in fewer iterations for small variations of the reference value.
The Gantt chart of the previous scenario is presented in Figure 5.6. The killed
jobs are depicted with a thicker border. We can see that some CiGri jobs are
killed at t = 1000s, which is unavoidable for this scenario as the priority job starts
immediately and the sensor cannot anticipate it. Only three jobs are killed
when the normal job starts at t = 2000s. We observe that there is some idle time
right after the start of the second normal job. This is due to the reaction of the
controller, and it can also be observed on the bottom plot of Figure 5.5 around 2000
seconds, where the output signal is under the reference value (red dashed line).
When using the additional information about the future schedule with a PI
controller, we are able to improve both the number of idle resources and the
number of killed jobs. For large values of horizon, the controller anticipates
too much and it leads to an increase of the idle time.
From the point-of-view of energy consumption, idle resources and killed jobs do
not consume the same amount of energy. Indeed, in practice, the power used by
a machine when a job is running is about twice the power used when the machine is
idle [Hei+17]. In this work, we did not measure the energy consumption of the
jobs during the experiments as we are using sleeps to represent the CPU time.
To estimate the gain in energy consumption of the cluster for each submission
strategy presented in this chapter, we simply multiply by two the weight of the
computing time lost due to the killing of jobs.
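Assuming, as above, that a busy node draws roughly twice the power of an idle one, the energy-weighted loss used to compare the submission strategies can be written as:

loss_{energy} \simeq loss_{idle} + 2 \times loss_{killed}

where loss_{idle} and loss_{killed} are the percentages of computing time lost due to idle resources and to killed jobs, respectively.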
[Figure 5.7: Gain of the PI controller over the constant submission, from the energy point-of-view and from the computing time point-of-view, versus the Horizon (None to 4m).]
Figure 5.7.: Gain of using the PI controller compared to the constant submission algorithm.
The top plot represents the gain from the point-of-view of the total energy
consumption loss. The bottom plot considers the computing time lost. Values
above one indicate that the PI outperforms the constant submission algorithm.
Figure 5.7 shows the gain of considering the prediction sensor in the PI Controller.
We look at two gains: the one in lost computing time and the one in energy
loss. We can see that for all the considered processing times (pj),
just having a horizon of 30 seconds improves the usage of the cluster. The best
improvement is reached for a processing time of 60 seconds and a horizon of 60
seconds, with about 25% gains in computing time and 40% in energy.
We think that it is possible to get the same behavior as the best horizon by changing
the sampling time of CiGri (∆t) to match the characteristics of the CiGri campaign:

\Delta t \simeq \frac{\bar{p_j}}{2} \quad \text{and} \quad h \simeq \bar{p_j}    (5.5)
This choice of ∆t could be explained by the Nyquist frequency, which states that a
process should be sampled at least twice as fast as its dynamics. Concerning the choice
of the horizon value, the intuition is that values smaller than p̄j could lead to some
killing, and values larger than p̄j could lead to the idling of machines. For instance,
for the campaign with p̄j = 60s, this rule gives ∆t ≃ 30s and h ≃ 60s, which matches
the configuration yielding the best gains above.
Using a PI with information about the future improves the global usage of
resources by 25%. If we consider the energy point-of-view, with the rough
approximation that an idle node consumes half as much as a working node,
then our controller can improve the energy usage by 40%.
This chapter has presented an approach to the harvesting of idle computing resources
in an HPC system using a feedback loop approach coupled with Control-Theory tools.
Contrary to the harvesting solutions of the literature, we also considered the killed
jobs as a source of wasted computing time. We have shown that our solution, a
Proportional-Integral Controller, can reduce the total amount of wasted computing
time compared to an ad-hoc solution. We show that by also taking into account
the predictive Gantt chart of the scheduler, we are able to reduce even more the
wasted computing time. The choice of the horizon value has an impact on the
total usage of the cluster. A horizon value of 30 seconds yields an improvement for
all the considered campaigns. It is however possible to reach better performance
by adapting the sampling time of CiGri to the running Bag-of-Tasks campaign (see
Equation 5.5).
[Figure 6.1: block diagram combining the controller of Chapter 5 (top, driven by rmax, dhk and rk + wk) and the controller of Chapter 3 (bottom, driven by loadref and loadk); the minimum of the two computed submission sizes uk is applied to the system; see caption.]
Figure 6.1.: Feedback loop gathering the controllers and objectives from Chapter 3 (bottom) and Chapter 5 (top). At every iteration, we run both
controllers and take the minimum number of jobs in the submission (uk) to satisfy both objectives. In the case of a PI controller, we
would add the error to the integral part only if the chosen submission size comes from this controller. It would also be possible to
add the feedback loop presented in Figure 3.13.
Sampling time of CiGri By default, CiGri’s loop is executed every 30 seconds.
This value is fixed and ad hoc, but does bear some importance. Indeed, this sampling
time must be small enough to capture the behavior of the system, but not so small
that the controller reacts to noise. In Chapter 3, we take the CiGri jobs to have an
execution time of 30 seconds. This helps the controller, as longer lasting jobs would
add some delay in the system, which cannot be captured by a first order model, as
used in this thesis.
One immediate solution would be to adapt the sampling time based on the campaign
currently being executed. In Chapter 4, we showed that the controllers were quite
resistant to variations of the execution times. But in Chapter 5, we concluded that
adapting the sampling time of CiGri to half of the execution time of the jobs of
the campaigns could lead to a significant improvement of the global usage of the
machines.
Sensors for Parallel File-Systems The work presented in this thesis supposes the
use of a centralized distributed file-system like NFS. This assumption makes it easier
to implement a sensor on the file-system. However, in large HPC centers, shared
file-systems are usually parallel file-systems such as Lustre, BeeGFS, or GlusterFS.
Such Parallel File-Systems (PFS) are composed of several nodes to balance the load.
Implementing a sensor on a PFS might be tricky. A simple solution would be to
keep the loadavg and average it over all the nodes of the PFS. But this averaging
might lose the meaning of the loadavg and lead to unexpected behaviors. Another
solution might be to look at the bandwidth of the PFS. But such a metric can vary
greatly over time, and would require some sort of filtering to be usable
by a controller.
It might also be interesting to investigate even lower-level sensors with the help of
the eBPF technology. The development of such sensors would make it possible to
obtain very specific, and less noisy, information on the system.
Phase detection HPC applications are usually iterative and thus have repeating
phases where they perform computation, then I/O operations, then computation
again, etc. Detecting such I/O phases (e.g., [Sto+21; Tar+23]) would provide valuable
additional information to the CiGri controller. This new sensor could then be coupled
with a feedforward control strategy to proactively adapt the submissions of CiGri and
avoid I/O overload of the file-system.
Deployment onto the Gricad computing grid The end goal is of course to deploy
our controllers onto the real platform. From the software management point-of-view,
it should be quite easy as we maintain a fork of CiGri with our modifications [cig23].
It will require the additional work of setting up the controller, which is done with a
single JSON file.
This Section presents some potentially interesting systems which suffer from regulation
problems that could be addressed with tools from Autonomic Computing and Control
Theory.
As a first approach, the size of the partitions is fixed. An interesting question would
be to try regulating Quality-of-Service metrics for the HPC jobs based on the size of
the partitions. The actuator would be the size of the partition dedicated to Big-Data jobs,
and the sensor could be the waiting times or the bounded slowdown of the
HPC jobs.
Turning on and off machines (Chapter 7 of [Poq17]) When cluster nodes are not
being used, one easy way to save energy is to turn them off. However, when users
need the turned-off machines to run their jobs, the switching-on process takes time
and energy, which degrades the Quality-of-Service for the users. A regulation
problem thus arises. The objective would be to decide how many nodes to keep
turned on based on metrics on the jobs (waiting time, bounded slowdown, etc.).
Additional sensors on the platform, like the provisional schedule or historical usage
metrics, could also be used to increase precision. The work from [Cer19] can give
inspiration.
On distributed systems, the delays of communication and I/O between actors must
be taken into account. Setting up an experiment on distributed systems is so
complex that researchers often make experiments directly on computing centers
(e.g., evaluating the parallel file-system of the computing center instead of deploying
one for the experiment).
In the previous part of this thesis, every experiment deploys the actual CiGri software
and an actual OAR cluster, and sets up an NFS file-system. Each piece of the deployment
requires its own software environment and configuration, which can be quite complex
to set up correctly, especially on the first try. Figures 7.1a and 7.1b illustrate the
complexity of a software environment for a grid or cluster middleware, CiGri and
OAR in this case. The usual way to set up experiments on a cluster middleware
Figure 7.1.: Software dependencies of CiGri (Figure 7.1a) and OAR (Figure 7.1b).
is to create software environments and then deploy them onto physical machines.
Both creating the environment and deploying it are time-consuming (around tens of
minutes each). Moreover, experimenters always need more than one iteration
before reaching the desired state of the image. A forgotten package, a typo in a
configuration file, a closed port, forgetting to copy an ssh key: such simple mistakes
require editing the recipe of the image and rebuilding it again (Figure 7.2). Such iteration
times do not encourage experimenters to set up their environment properly, and they
will often call it “good enough” even if some “dirty hacks” remain,
which inevitably deteriorates the reproducibility of the experiments, and thus the
quality of the scientific results.
7.1 Reproducibility
This section is based on [Gui+23a] published at ComPAS 2023 with Adrien Faure,
Millian Poquet and Olivier Richard.
The scientific community as a whole has been going through a reproducibility crisis for
the last decade. Computer science is no exception [RW18; Bak16].
The reproducibility of research work is essential to build robust knowledge, and
it increases the reliability of results while limiting methodology and
analysis biases. In 2015, Collberg et al. [CPW15] studied the reproducibility of 402
experimental papers published in system conferences and journals. Each studied
paper linked the source code used to perform its experiments. Of those 402
papers, 46% were not reproducible. The main causes were: (i) the source code
was actually not available, (ii) the code did not compile or did not run, (iii) the
experiments required specific hardware.
The term reproducibility is often used in a broad sense and gathers several concepts.
The definitions that we will use in the rest of this thesis are the ones defined by ACM
for the validation of submitted artifacts [ACM]. This classification comprises three levels of
reproducibility:
1. Repeatable: the measures can be obtained again by the people at the origin of
the work.
2. Reproducible: the measures can be obtained again by people who do not belong
to the original work and with the original artifact of the authors.
3. Replicable: the measures can be obtained again by people who do not belong
to the original work and with artifacts developed independently from the original ones.
The evaluation of artifacts is a crucial point which helps guarantee the repro-
ducibility of the experiments and the results. However, this reproducibility is not
sufficient. Even if being able to reproduce an experiment is a proof of scientific vali-
dation, the experiment and its environment are often too rigid to be extended by a
third party, or even by the authors themselves. We believe that the notion that should
be pushed by the community is reproducibility with variation. By “variation” we
mean that a third party is able to easily modify the environment of the experiment to
continue the research. This means that the hardware and software environments
as well as the experimental scripts must be correctly defined and easy to modify.
This section focuses on the software environment. For a global vision of the repro-
ducibility problems, the readers might be interested in [IT18].
Section 7.1.1 gives the context and the motivation of reproducibility for distributed
systems. Several frequently seen traps that might break the reproducibility of the
software environment of an experiment are presented in Section 7.1.2. Finally,
Section 7.1.3 motivates the use of Functional Package Managers as a solution to
most of the reproducibility problems of the software environments.
Imagine that in your childhood your grandma cooked a delicious chocolate cake,
and that you now want to know how to do it yourself. You even think that you
can improve on it, and make it tastier! You can try to reproduce the cake based on
your distant memories and culinary intuition, but the result might be disappointing. . .
Maybe your parents know the recipe! But you might just get some fuzzy instructions.
Maybe your grandma just improvised the recipe! Who knows?
You decide to go through the old stuff of your grandma’s house. By chance, you find
an old recipe book. You open it and from it falls a piece of paper with what looks
like a cake recipe. The handwriting matches the one from your grandma!
The recipe is clear, well detailed, and contains all the quantities for all the ingredients,
the order of the different steps, the cooking time, etc. You decide to follow the recipe
literally, but the final result is not what you remembered. . . Maybe you did not use
Most of the current solutions in terms of “reproducibility” fall under the storage
of artifacts (system images, containers, virtual machines) and replay of experi-
ments [Ros+20; Bra+11; Bri+19]. Even if this is an important part of the re-
producibility spectrum, nothing guarantees that the software environment can be
re-built in the future, and thus nothing guarantees that the experiments can be
re-run if the artifacts disappear.
The step of artifact evaluation for conferences is done soon after the artifacts'
initial construction. It is thus very probable that the construction of the artifacts will be
executed in a similar state of the package mirrors (apt, rpm, etc.). However, what
will happen when someone tries to rebuild the environment in 1 year? 5 years?
10 years? The objective of science is to build on robust works to keep moving
forward (standing on the shoulders of giants). This vision of “short term reproducibility”
is a major obstacle to scientific progress and is in complete opposition to the philosophy
of science.
We think that the notion that should be highlighted is the concept of variation [MFR18;
Fei15]. This means allowing a third party to use the environment defined for an
experiment in order to investigate another research idea. An example of variation
would be to change the MPI implementation used in an experiment (e.g., MPICH
instead of OpenMPI). Being able to introduce such a variation is only possible if the
initial environment is correctly defined.
Sharing the Environment An obvious way to break the reproducibility of one's experi-
ments is not to share the environments used, or to share them in a non-perennial place.
Platforms such as Zenodo [zen] or Software-Heritage [Her] allow users to store
artifacts (scripts, data, environments, etc.) permanently.
Tools such as Spack [Gam+15] take a similar approach to pip, but extend it to all the
system packages and their dependencies. It is possible to export the environment as
a text file and to rebuild it on another machine. However, the produced environment
might not be completely identical. Indeed, Spack uses applications that are already
present on the machine to build the packages from source. In particular, Spack
assumes the presence of a C compiler on the system, and will use this C compiler
to build the dependencies of the environment. Hence, if two machines
have two different C compilers, then the resulting environment could differ from
the desired one. One clear advantage of Spack is the ease of introducing a
variation in an environment through the command line.
Solutions such as Spack, pip, and conda only focus on the software stack above the
operating system. However, results from experiments might depend on the version
of the kernel, some drivers, etc. Thus, it is important to capture the software
stack entirely.
Experiments do not only depend on Python packages, but they can also
depend on system packages, or even the version of the Linux kernel. The
software environment must be captured entirely.
Usually, the capture of the entire software stack goes through its encapsulation in
a system image. This image can then be deployed on machines to execute the
experiments. A way to generate a system image is to start from a base image,
deploy this image, execute the commands required to set up the desired envi-
ronment, and finally compress the image. Platforms such as Grid’5000 [Bal+13]
and Chameleon [Kea+20] offer such tools to their users (tgz-g5k [Gri] and
cc-snapshot [Clo] respectively). In the context of repeatability and replicability, if
the image stays available, then this way of producing system images is adequate.
A better approach to generate images is via recipes. Those recipes, like Dockerfiles
for Docker containers or Kameleon [Rui+15] recipes for system images, are a
sequence of commands to execute on a base image to generate the desired environ-
ment. The text format of recipes makes them much more suitable to version, share,
and reconstruct. These base images often have several versions which are
identified by labels called tags. In the case of Docker, the tag of the latest version is
often called latest. Basing an environment on this tag breaks the traceability, and
thus the reconstruction of the image itself. Indeed, if a newer version is available
at the time of a future rebuild of the environment, then the image will be based on
this newer version and not on the original one. Another important question is to
know whether the base image and all its versions are themselves reconstructible.
Be cautious with the longevity of the images on which you base your software
environment. By transitivity, if these images are not reproducible, neither are
yours.
Another frequent problem is that the recipe performs an update of the mirror
(e.g., apt-get update) before installing the packages required for the environment.
This breaks the reproducibility of the image: the image depends on the state of an
external entity which is not controllable. In order to be sure to use the exact same
packages during a rebuild, a solution could be to use a snapshot of the mirror
(e.g., for Debian: http://snapshot.debian.org/). EnOSlib allows users to use such
snapshots for the construction of the environment. Concerning the introduction of
variation, basing an image on a snapshot is quite constraining and can make it
impossible to install specific versions of packages, or can create conflicts with
already installed packages.
When the recipe must download an object from the outside world, it is crucial to
verify the content of the fetched object and compare it to the expected one. If the
object is not checked, the environment depends on the state of the source of the
object at the time of construction. Hence, if the recipe calls curl to get a configuration
file or a snapshot of a mirror for example, the recipe must also check the content of
the downloaded files. The usual way to do so is to compare the cryptographic hash
of the downloaded object with the one of the expected object.
A similar problem arises when the recipe downloads a package via git and builds
it from source. In this case, it is paramount to pin the commit used in the
recipe of the image. Indeed, not knowing the commit used in the original image
leads to a dependency on the latest commit of the main branch of the repository.
Pinning the commit makes the sources used exactly known, and
simplifies the controlled introduction of variation in the environment (by changing
the commit, for example).
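Functional package managers (presented in Section 7.1.3) make both practices explicit. As a hedged illustration in Nix, a source declaration pins the exact commit and the expected content hash; the URL, commit and hash below are placeholders:

{ pkgs ? import <nixpkgs> {} }:
pkgs.fetchgit {
  url = "https://example.org/my-experiment.git";  # hypothetical repository
  rev = "0a1b2c3d";                                # exact commit used for the experiment
  sha256 = "sha256-AAAA...";                       # expected content hash (placeholder)
  # If the repository disappears or its content changes, the hash no longer
  # matches and the build fails instead of silently using different sources.
}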
Figure 7.3.: Comparison between traditional package managers and Nix. Traditional pack-
age managers fetch a built version of the package from a mirror, but information
on how they have been built is unknown. In the case of Nix, the package is
described as a Nix function that takes as input the source code and how to
build it. If the package with these inputs has already been built and is available
in the Nix caches (equivalents of mirrors) it is simply downloaded to the Nix
Store. Otherwise, it is built locally and added to the Nix Store.
Every object coming from the outside of the environment must be examined
to be sure that it contains the expected content. It is better for the
image to fail to build if the content differs from the expected one, rather than
for the image to silently build with a different content.
Tools such as Nix [DJV04] or Guix [Cou13] fix most of the problems described in
the previous section. Nix and Guix share similar concepts; in the following we
will focus on Nix.
Systems like Debian store all the packages in the /usr/bin and /usr/lib directories.
This organization can lead to conflicts between different versions of the same library, and
it thus limits the introduction of variation in the environment without breaking the
system. On the other hand, Nix creates one directory per package. Each directory
name is prefixed by the hash of its sources. Hence, if a user wants to install a different
version of an already installed package, the sources will be different, thus the hash
will be different, and Nix will create a new directory to store the new package.
Those individual directories are stored in the Nix Store, located at /nix/store, in
a read-only file-system. Figure 7.3 summarizes the differences between traditional
package managers and Nix. The advantage of this fine-grained isolation
is that software environments can be managed through a precise definition of the
$PATH environment variable.
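As a minimal sketch (the package names come from nixpkgs, and the store hashes are abbreviated placeholders), an environment is simply a set of symbolic links to the selected store paths, and $PATH only points to that environment:

{ pkgs ? import <nixpkgs> {} }:
# Each package lives in its own store path, for example:
#   /nix/store/<hash1>-hello-2.12/bin/hello
#   /nix/store/<hash2>-python3-3.10.8/bin/python3
# buildEnv links the selected packages into a single profile directory,
# which can then be placed on $PATH.
pkgs.buildEnv {
  name = "experiment-env";
  paths = [ pkgs.hello pkgs.python3 ];
}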
The definition of packages through functions also eases their sharing and distribu-
tion. There is a large base of package definitions maintained by the community, called
nixpkgs [Nix23]. Users can easily base their new packages or environments on
those definitions. It is also possible for independent teams and research groups to
have their own package bases. Guix-HPC [Gui23e], NUR-Kapack [OAR23], or
Ciment-channel [Gri23] are examples of independent package bases for HPC and
distributed systems.
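Coming back to the variation example given earlier (replacing OpenMPI by MPICH), basing an environment on nixpkgs can make such a change a one-line override. The sketch below is only an illustration: it assumes that the ior package of nixpkgs exposes an mpi argument, as many MPI-based packages do.

{ pkgs ? import <nixpkgs> {} }:
let
  # Hypothetical variation: build the IOR benchmark against MPICH instead of
  # the default MPI implementation (assumes an `mpi` argument on the package).
  iorWithMpich = pkgs.ior.override { mpi = pkgs.mpich; };
in
pkgs.mkShell {
  buildInputs = [ iorWithMpich pkgs.mpich ];
}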
A Nix system profile defines the configuration of the system (packages, initrd,
etc.). Among many features, a profile can define filesystems such as NFS and mount
them automatically at boot time. Figure 7.4 depicts an example of a user profile
containing the Batsim application [Dut+16], which requires the SimGrid [Cas+14]
library at runtime. NixOS extends the ideas of Nix to the entire operating system.
A NixOS image can contain several profiles, and Nix can switch between them by
modifying symbolic links and restarting services via systemd.
Even though tools like Nix and Guix greatly improve the state of reproducibility for
software environments, it is still possible to go wrong and make a package impure
or dependent on an exterior state. Nix is currently addressing this issue with the
experimental feature Flakes [Twe].
As Nix needs to recompile from source the packages that are not available in its binary
cache, a future rebuild may become impossible if the host of the source code
disappears [bli]. However, as Software Heritage now performs frequent archives of
open source repositories, it should be possible to find the sources of interest if
needed.
These tools also require a change of perspective in the way software environments
are managed, which might make the learning curve intimidating.
7.1.5 Conclusion
The computer science community is starting to take an interest in the problems of repro-
ducibility of experiments. However, the reproducibility problems at the software
level are not yet truly understood. Setting up reproducible experiments is extremely
complex. The management of the software environment illustrates one facet of this
complexity. The usual tools (pip, spack, docker, etc.) do not solve the repro-
ducibility problems without a huge effort from the experimenters, and only allow
short-term reproducibility. The holy grail of reproducibility is the precise introduction
of variation in an environment defined by a third party. This variation allows
scientists to build on solid contributions to continue research. Even if there is no
perfect solution yet, a change of practice concerning reproducibility is needed.
The experiments conducted in this thesis on the CiGri system are relatively long-
lasting. They must be long enough to observe the impact of the controllers,
which can take a few hours. Moreover, as the system is relatively complex and noisy,
every experiment must be repeated several times in order to compute an average
behavior. They have the advantage of running the actual middlewares (CiGri, OAR,
file-systems).
The experiments of this thesis raised several questions from the experimental point-
of-view.
RQ1 will be addressed in Chapter 8 with the presentation of NixOS Compose,
a tool based on NixOS aiming at reducing the development time needed to create a re-
producible distributed environment. Chapter 9 investigates RQ2 by considering a
technique we call folding, which consists in defining several computing resources on a single physical
node. We will focus on the impact of folding on the performance of the distributed file-system of the
cluster. Chapter 10 presents a first step to answer RQ3 by implementing
a simplified version of CiGri in Batsim, and comparing its behavior to a real experiment
from the point-of-view of Control Theory.
The following chapters aim to explore the different stages of experimentation while
reducing the high experimental cost of distributed experiments on cluster or grid
applications such as CiGri:
1. define and build reproducible distributed software environments with a reduced
development cost (RQ1),
2. deploy experiments at partial scale with lower cost and with an acceptable loss of
realism (RQ2),
3. simulate the system to further reduce the experimental cost while keeping a
behavior representative of the real system (RQ3).
8.1 Introduction
In this chapter, we tackle RQ1. We aim to make the entire software stack involved
in an experiment on distributed systems reproducible. This implies making the
compilation and the deployment of this stack reproducible, allowing one to rerun
the experiment on an identical environment in one week or in ten years.
To reach this goal, we exploit the declarative approach to system configuration
provided by the NixOS [DL08] Linux distribution. NixOS makes use of a configu-
ration that describes the environment of the entire system, from user-space to the
kernel. The distribution is itself based on the purely functional Nix package manager
[DJV04]. The definitions of packages (or of system configurations in the case of NixOS)
are functions without side effects, which enables Nix and NixOS to reproduce the
exact same software when the same inputs are given.
We extend this notion of system configuration to distributed systems in a new tool
named NixOS Compose. NixOS Compose makes it possible to define distributed environments
and to deploy them on various platforms that can either be physical (e.g., on the
Grid’5000 testbed [Bal+13]) or virtualized (Docker or QEMU [Bel05]). NixOS
Compose exposes the exact same user interface to define and deploy environments
regardless of the targeted platform. We think that this functionality, paired with fast
rebuild times of environments, improves the user experience, and we hope that it will
help the adoption of experimental practices that foster reproducibility.
(Figure: in the current state of practice, the environment description is written three times, once per tool (docker-compose, Vagrant, Kameleon), producing a docker container, a VM, and a system image for local and deployed (g5k) targets; with NixOS Compose, a single composition.nix covers all targets.)
Some approaches instead apply the configuration on the machines without rebooting them, which keeps a state on the
machines and is detrimental to reproducibility.
The chapter is structured as follows. Section 8.2 presents NixOS Compose, its main
concepts, the external concepts it relies on, and the users’ utilization workflow
for setting up a complete reproducible system and software stack. Section 8.3
gives technical details on how NixOS Compose works. Section 8.4 presents how
NixOS Compose can be used on a complex example that combines several distributed
middlewares. Section 8.5 offers experimental performance results of NixOS Com-
pose against standard state-of-the-art tools using multiple metrics. Finally, Section
8.6 concludes the chapter with final remarks and perspectives on reproducible works
and environments.
This section gives the main concepts and terminology of NixOS Compose. A software
environment is a set of applications, libraries, configurations and services. We
define the notion of Transposition as the capacity to deploy a uniquely defined
environment on several platforms of different natures. For example, one may want
to deploy an environment on local virtual machines to test and develop, and then
deploy it onto real machines on a distributed platform with the exact same environment.
{ pkgs, ... }:
let k3sToken = "..."; in {
  roles = {
    server = { pkgs, ... }: {
      environment.systemPackages = with pkgs; [
        k3s gzip
      ];
      networking.firewall.allowedTCPPorts = [
        6443
      ];
      services.k3s = {
        enable = true;
        role = "server";
        package = pkgs.k3s;
        extraFlags = "--agent-token ${k3sToken}";
      };
    };
    agent = { pkgs, ... }: {
      environment.systemPackages = with pkgs; [
        k3s gzip
      ];
      services.k3s = {
        enable = true;
        role = "agent";
        serverAddr = "https://server:6443";
        token = k3sToken;
      };
    };
  };
}
Listing 8.1: NixOS Compose composition file example for k3s [K3s].
Role A role is a type of configuration associated with the mission of a node. k3s is
a client-server application where clients are named agents. There are therefore
2 roles: one for the server and another for the agents. As all the agents have
the same configuration, they can use the same role. Note that in cases where there is
one node per role, the notions of role and node overlap.
Listing 8.2: Deployment file example for the k3s example with 1 server and 3 agents where
the users defined the quantity of nodes per role. The hostnames are generated
by NixOS Compose. In this example there would be 3 agents with the hostnames
agent1, agent2 and agent3.
Listing 8.3: Deployment file example for the k3s example with 1 server and
3 agents where the users defined the hostnames of every node per role.
Flavours A flavour is a target for the deployment of the environment. This notion
includes the (virtual or physical) platform onto which the deployment should be
done, as well as the format of the generated artifact:
• docker for local deployments based on docker-compose and docker containers.
• vm-ramdisk for local, in-memory deployments based on QEMU virtual machines.
• g5k-image for full system tarball images that can be deployed on Grid’5000 [Bal+13]
via Kadeploy [Geo+06].
• g5k-ramdisk for initrds that can be quickly deployed in memory without the
need to reboot the host machine on Grid’5000 (via the kexec syscall).
During the development phase of the environment, users can deploy locally, lightly
and quickly with the docker and vm-ramdisk flavours. At a later stage, users can test
their environment on real nodes from the Grid’5000 testbed with the g5k-ramdisk flavour,
which is convenient for trial-and-error operations thanks to its fast boot time. Finally,
the environment can be deployed at real scale on Grid’5000 with the g5k-image
flavour. Please note that some flavours have reproducibility limitations due to the
underlying technologies. For example, controlling the version of the Linux kernel is
impossible when using the docker flavour.
8.2.1 Workflow
This section presents the workflow of NixOS Compose and how it enables users to
simply transpose their environment from one platform to another.
Local Testing When developing an environment, users can work with the docker
and vm-ramdisk flavours with the following workflow:
1. Build the composition for the targeted flavour: nxc build.
2. Deploy the environment: nxc start. By default, NixOS Compose takes the last
composition built.
Distributed Deployment Once the environment has been tested with local flavours,
it can be tested in a distributed setting:
1. Build the composition for a Grid’5000 flavour (g5k-ramdisk or g5k-image): nxc build.
2. Reserve the nodes to use for the deployment: this depends on your platform's
resource manager, for example salloc for Slurm [YJG03] or oarsub for
OAR [Cap+05].
3. Deploy the environment on the reserved nodes: nxc start.
Figure 8.2 summarizes the NixOS Compose workflow. NixOS Compose aims at making
the transition between platforms as seamless as possible. Thus, the workflow in
a distributed setting (Section 8.2.1) is identical to the workflow in a local one
(Section 8.2.1). The only difference is that in a distributed setting, users need to
first reserve the resources before deploying.
Nix can generate a NixOS configuration from a Nix expression, including the boot
and init phases required to start the kernel. Nix stores those phases in the Nix
Store, which enables NixOS Compose to call them later on. NixOS Compose works
in two steps: Building and Deploying. The building phase is done using Nix tools
wrapped with Python for the command-line interface. The deployment is fully done
with Python and mostly consists in the interaction with the different deployment
tools (Kadeploy, docker-compose, QEMU). The Nix part is around 2000 lines of code,
and the Python part around 4000.
The following section (8.3.1) details how NixOS Compose manages g5k-ramdisk,
the flavour that enables quick in-memory deployment without the need to reboot
host machines on Grid’5000. Details are omitted for the other flavours, but please
refer to Table 8.1 for a summary of the differences in the building and deployment
phases for all supported flavours.

Table 8.1.: Summary of the building and deployment phases for each flavour.

Flavour     | Building phase                                                     | Deployment phase                                                                                               | Comments
docker      | Generate a docker-compose configuration and docker containers.    | Call the docker-compose application with the right arguments.                                                 | Fastest and light, but limited in application due to virtualization.
vm-ramdisk  | Generate the kernel and initrd for the roles of the composition.  | Create a virtual network with Virtual Distributed Ethernet (VDE) and start the Virtual Machines with QEMU.    | Fast, but takes a lot of memory. Limited to a couple of VMs on a laptop.
g5k-ramdisk | Generate the kernel and initrd for the roles of the composition.  | Use kexec to quickly start the new kernel without rebooting. Send the deployment information through the kernel parameters. | Long to build but fast to deploy. kexec has reproducibility limitations and consumes a lot of memory, which can be limiting for large images.
g5k-image   | Generate a tarball of the image of the composition.               | Use Kadeploy to deploy the image to the nodes. Send the deployment information through the kernel parameters. | Longer to build and deploy, but it has the best reproducibility properties.
Construction NixOS Compose uses Nix to evaluate the configuration of every role
in the composition. Nix then generates the kernel and the initrd of the profiles.
Deployment NixOS Compose relies on the kexec Linux system call for this flavour.
kexec makes it possible to boot a new kernel from the currently running one. This skips the
initialization of the hardware usually done by the BIOS, which avoids an entire
reboot of the machines and greatly reduces the time to boot the new kernel. The
kexec command takes as input the desired kernel, the kernel parameters and the
initrd. NixOS Compose passes the kernel and initrd generated in the construction
phase to kexec.
As NixOS Compose produces a single image containing the profiles of all the roles,
NixOS Compose needs, at deployment time, to tell each node the role it should
take. To achieve this, we pass this information through the kernel parameters to
set up environment variables based on the role. NixOS Compose also uses the
kernel parameters to pass ssh keys and information about the other hosts (e.g., the
/etc/hosts file). There is however a size limit of 4096 bytes on the kernel parameters,
which prevents us from using this method to send the deployment information to the nodes
when users want to deploy a lot of nodes. To deal with this, NixOS Compose starts a
light HTTP server on the cluster frontend for the duration of the deployment. We
pass the URL of this server using the kernel parameters. Then, the nodes query this
server with wget to retrieve the information associated with their roles. Note that
the deployed images do not include an HTTP server but only the wget application to
fetch the data. Figure 8.3 represents how the nodes get the deployment information
based on the number of nodes involved.
This section shows how complex distributed environments can be developed with
NixOS Compose by taking the Melissa [Ter+17] framework as an example.
Figure 8.3.: Mechanism for the nodes to get the deployment information. For a few nodes,
the information is passed via the kernel parameters. For a higher number of
nodes, this is not possible due to the size limit on the kernel parameters (4096
bytes). In this case NixOS Compose starts a light HTTP server on the frontend
and passes its URL to the nodes via the kernel parameters. The nodes then
query this server to retrieve the deployment information.
NixOS Compose makes it possible to deploy a resource manager (including all the components
it requires, e.g., a database), Melissa, and all the components required by Melissa at
runtime (e.g., a distributed file system).
In the following example, we deploy Melissa with the Slurm resource manager. Four
roles are needed to define the environment.
Some roles share parts of their configuration, like server and computeNode. They
both use Slurm, but their configurations differ in terms of services, as they respectively
enable the slurm.server and slurm.client services.
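A hedged sketch of how such sharing could be expressed in a composition (this is not the exact composition of this chapter; the option names follow the NixOS Slurm module):

{ pkgs, ... }:
let
  # Settings shared by the server and the compute nodes.
  slurmCommon = {
    services.slurm.controlMachine = "server";
  };
in {
  roles = {
    server = { ... }: {
      imports = [ slurmCommon ];
      services.slurm.server.enable = true;   # runs slurmctld
    };
    computeNode = { ... }: {
      imports = [ slurmCommon ];
      services.slurm.client.enable = true;   # runs slurmd
    };
  };
}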
{ pkgs, ... }:
pkgs.stdenv.mkDerivation rec {
  pname = "melissa-${version}";
  version = "0.7.1";
  src = pkgs.fetchgit {
    url = "https://gitlab.inria.fr/melissa/melissa";
    rev = "e6d09...";
    sha256 = "sha256-IiJad...";
  };
  buildInputs = with pkgs; [
    cmake gfortran python3 openmpi
    zeromq pkg-config libsodium
  ];
  # Build phases are omitted
}
This section emphasizes the advantages of using NixOS Compose to deploy the
Melissa distributed environment.
NFS Server Setting up an NFS server with tools like Kameleon or EnOSlib is cumber-
some. The users would first need to install the NFS tools on the nodes and define the
NFS server. Then, the users would have to mount the newly defined server on every
client node. These steps can be automated with scripts, but that would be fragile.
Listing 8.6: Definition of the NFS server. It exposes the /srv/shared folder.
Listing 8.7: Mounting of the NFS server for the compute nodes. The local mounting point
is /data.
The declarative definition of NFS with NixOS is based on systemd services (see
Listings 8.6 and 8.7). This makes them easier to define and more robust, as they can
be restarted until the mount succeeds in the case where the NFS server starts after
the clients.
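As an illustration, here is a minimal sketch of such a declarative NFS setup (this is not the verbatim content of Listings 8.6 and 8.7; the paths follow their captions and the server hostname is assumed to be server):

{ pkgs, ... }: {
  roles = {
    # NFS server role: expose the /srv/shared folder.
    server = { ... }: {
      services.nfs.server.enable = true;
      services.nfs.server.exports = "/srv/shared *(rw,no_subtree_check)";
    };
    # Compute node role: mount the export on /data. NixOS turns this
    # declaration into a systemd mount unit, which is retried until it
    # succeeds, e.g., when the server boots after the clients.
    computeNode = { ... }: {
      fileSystems."/data" = {
        device = "server:/srv/shared";
        fsType = "nfs";
      };
    };
  };
}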
We have deployed the Melissa composition written with NixOS Compose on
13 nodes with the experimental setup described in Section 8.5.1. The deployment
took approximately 2 minutes with the g5k-ramdisk flavour.
NixOS Compose aims to provide the same environment on different target platforms.
This section analyses the content of the Nix Store in the Melissa images generated
for every flavour, as the Nix Store content represents the software environment
available on the node. The docker flavour is omitted here as the containers do not
have a well-defined specific Nix Store but mount the Nix Store of the host machine
instead.
Figure 8.4 presents the content of the Nix Store of the Melissa image for the differ-
ent flavours. The smaller packages are gathered under the others-* name. We
can see that the vast majority of the software stack is shared by several flavours.
There is a 2 GiB set of packages common to every flavour, containing the
NixOS definition and the dependencies of Melissa. The two flavours targeting the
Grid’5000 platform need more packages, for example the firmware to use the nodes’
hardware. Then, for each flavour, about 5 % of the total Nix Store size is made of
packages specific to that flavour. For example, deploying a g5k-image image
requires a complete reboot of the host and to go through the boot loader, hence the
presence of grub in this image.
The resulting environments for the different flavours differ because of the
particularities of the target platform.
8.5 Evaluation
The following experiments have been carried out on the dahu cluster of the Grid’5000 testbed.
This cluster has machines with 2 Intel Xeon Gold 6130 CPUs with 16 cores per
CPU and 192 GiB of memory. The nodes of this cluster have SSD SATA Samsung
MZ7KM240HMHQ0D3 disks with a capacity of 240 GB formatted in ext4.
Figure 8.4.: Packages present in the Nix Store of the Melissa image for the different flavours.
The colors represent the packages common to the flavours. The smaller pack-
ages are gathered under the others-* name. The docker flavour is omitted as
it mounts the Nix Store of the host machine.
Grid’5000 provides base images for several Linux distributions and versions. Users
need to build their own images if they want to use more complete images. This
is usually done with Kameleon on Grid’5000 — in fact all the images provided by
Grid’5000 are generated by Kameleon recipes.
In this study, we want to compare the performance of NixOS Compose and Kameleon to
build images. We will focus on the image build time, as well as the size of the gener-
ated images. We also want to evaluate whether caching the Nix Store enables an
interesting build time speedup.
We first build a base image with NixOS Compose and Kameleon, measuring its build
time and the size of the generated images. base contains the basic software needed
to conduct a distributed experiment: grid5000/debian11-x64-nfs for Kameleon as
this is the most convenient and common image for distributed experiments on
Grid’5000, and all the packages required by the flavour for NixOS Compose. Then
we add the hello package to the recipes and build a new image (base + hello)
while measuring the same metrics.
This experiment has been executed using Grid’5000’s NFS (mounted on /home) or
without it (using local disks on machines mounted on /tmp), in order to compare
the performance of the tools depending on the filesystem setup used.
We clear the Nix Store before building the base image, but not before building
base + hello, in order to evaluate the impact of cached builds on NixOS Compose.
Kameleon has an indirect caching mechanism via the HTTP proxy Polipo [Chr22].
2. https://zenodo.org/record/6568218
Figure 8.5.: Performance comparison between Kameleon and NixOS Compose (nxc). A
base image is first built, then a new image (base + hello) that contains the
additional hello package is built. As Grid’5000 is the targeted platform of this
experiment, images are built with the g5k-ramdisk and g5k-image flavours.
Shown values are averages over 5 repetitions. Error bars represent 99 %
confidence intervals.
However, we did not manage to make it work with Kameleon, and Polipo is no longer
maintained as its utility has become arguable – most of today’s traffic is encrypted,
including fetching packages from mirrors.
Results and Comments As seen on Figure 8.5, NixOS Compose substantially out-
performs Kameleon in terms of image build time. When building from an empty
cache on local disks, NixOS Compose is 11x faster than Kameleon. Moreover, NixOS
Compose uses its local cache efficiently, which enables it to build the image variation
1.7x faster than the initial image build time when the filesystem is used efficiently
(local disks).
Figure 8.5 also shows that NixOS Compose produces bigger images than Kameleon.
This is mainly because we have not optimized the content size of the images as
we write these lines — e.g., many firmwares are kept in the images instead of only
the ones needed on Grid’5000 (see Figure 8.4). Another reason for this image size
comes from our design choice to prioritize compression speed over compression
quality. This is important for NixOS Compose as image variations should be built as
fast as possible to improve user experience. For information, we have measured that
NixOS Compose takes about 25 s to compress each image, which is a non-negligible
portion of a variation build time (30 - 35 %).
Finally, Figure 8.5 shows how the filesystem setup impacts the build time. Here,
using an efficient filesystem setup (local disks) greatly benefits NixOS Compose.
NixOS Compose builds system images faster than Kameleon, but really shines
when introducing a variation in the environment. Nix suffers from the slowness
of NFS, and build times can be further improved by not building on NFS.
Case Studies We chose to study the two following distributed applications that are
already implemented in EnOSlib: a containers' orchestrator (k3s) and a network
benchmarking tool (flent).
Our k3s environment consists of two nodes: one k3s server and one k3s agent. An
nginx web server is deployed on the agent. The run part of the experiment
simply consists in querying the web server to retrieve the nginx web page.
Our flent environment consists of 2 nodes: one server and one client. The run part
of the experiment simply runs the flent benchmark.
For both case studies, we made sure that NixOS Compose and EnOSlib set up similar
environments, and that they execute the same run part of the experiment after the
deployment and provisioning phases have been done.
Figure 8.6.: Time spent in the different phases of the deployment of a distributed experi-
ment (build, submission + deploy, provisioning, run). We compare EnOSlib
and NixOS Compose (with the flavours g5k-ramdisk and g5k-image) on two
examples: a network benchmarking tool (flent) and a containers’ orchestrator
(k3s). The error bars represent the confidence intervals at 99 %.
In NixOS Compose, most of the configuration is done inside the composition and thus
in the image. This enables us to completely skip the provisioning phase for flent. For
k3s, our provisioning phase simply consists in waiting for the web server to become
available.
For both tools, we measure the time to build the image, to deploy it, to execute the
provisioning, and to run the experiment script. We use an empty Nix Store for a fair
build time comparison.
Please note that EnOSlib does not directly provide the time taken by the deployment
phase, but only the time between the submission and the first command of the
provisioning phase.
From Figure 8.6, it seems that packing part of the provisioning into the image improves
the provisioning time without deteriorating the deployment time. The only draw-
back is the non-negligible build time. However, this time can be improved
by using the Nix Store as a local cache (see Section 8.5.2). Note that NixOS
Compose proposes several flavours that can be executed locally to develop and test
the environment. The cost of the construction of these images is thus amortized by
the numerous quick and light local deployments. Finally, please also note that the
build time of EnOSlib is null in Figure 8.6 as a pre-built image is used. However,
depending on their scenario, users may need to actually build an image via another
technology (e.g., Kameleon or NixOS Compose).
Using NixOS Compose instead of EnOSlib can shift the time spent configuring
services after deployment (provisioning) into the build time, thanks to the
declarative nature of NixOS services.
8.6 Conclusion
This chapter has presented NixOS Compose, a free tool under MIT license [Gui+22a]
that enables the generation of reproducible distributed environments. We have
shown that NixOS Compose deploys the exact same software stack on various
platforms of different natures, without requiring specific work from users. The
software stack is reconstructible by design, as NixOS Compose inherits its repro-
ducibility properties from Nix and NixOS. Our experiments showed that NixOS
Compose’s reproducibility and platform versatility properties are achieved with-
out deployment performance overhead in comparison to the existing solutions
Kameleon and EnOSlib.
The experiments conducted in this chapter showed that build caches greatly improve
NixOS Compose’s build times, and that properly using the file system is important
for its performance. From a cluster administration perspective, providing a shared
Nix Store between users would be very interesting to avoid data duplication and
to prevent different NixOS Compose users from building the same packages over and
over. There are many ways to implement a distributed shared Nix Store, and we
think that exploring their trade-offs would provide valuable insights, as reproducibil-
ity improvements should not come at the cost of higher resource waste on
clusters.
User experience is a crucial factor that must be considered for reproducible experi-
mental practices to become the standard. With this in mind, we think that the notion
of Transposition we have defined and implemented in NixOS Compose is
very beneficial. Transposition reduces the development time of distributed environ-
ments, as it enables users to do most of the trial-and-error parts of this iterative
process with fast cycles, without any reproducibility penalty on real-scale deploy-
ments. However, practitioners who adopt NixOS Compose are likely to experience a
paradigm shift if they are not already accustomed to Nix’s approach. We strongly
believe that the reproducibility and serenity gains it brings are worth it. To help with
the adoption of NixOS Compose, a tutorial has been created and presented on two
occasions [Gui+a].
NixOS Compose currently only provides first-class support for Grid’5000. We would
like to support bare-metal and virtualized deployments on other experimental
testbeds such as CloudLab [Dup+19] and Chameleon [Kea+20]. Moreover, the
handover from Grid’5000 to SLICES(-FR) might also require supporting deployment
technologies such as OpenStack [Ope13].
This chapter is based on [Gui+23d] presented at the 15th JLESC workshop with
Olivier Richard, Raphaël Bleuse, and Eric Rutten.
9.1 Introduction
The increase in computing demands from scientists of all fields led to the develop-
ment of computing cluster and grid architectures. And with these new architectures
came new challenges. Due to their high prices, clusters are often shared among
several research laboratories and teams. This sharing motivates the development
of middlewares such as batch schedulers, which are responsible for assigning the user
jobs to physical machines, managing the state of the resources, dealing with reservations,
etc. Such cluster, or grid, middlewares are complex applications of great research
interest, and they must be tested before reaching production. However, they are
usually destined to operate in an environment of hundreds or thousands of machines.
Deploying a full-scale environment is too costly to perform simple tests and very
specific evaluations.
We thus want to answer the following question: Can we reduce the number of
physical machines needed to perform a full-scale experiment while keeping a
behavior similar to that of the full-scale system?
One solution to reduce the number of machines used for experiments would be to
use simulation techniques. The system, the applications, and the middlewares are
modeled and can then be executed on a single node. Besides reducing the number of
machines used, simulators also reduce the execution time of the experiments. In the
context of distributed systems and applications, projects such as SimGrid [Cas+14]
and Batsim [Dut+16] are leading the way. One drawback of simulation is that the
real middleware is not executed, or not fully executed: instead, a partial
or modeled version is executed in a modeled environment. In the case of cluster and
grid middlewares, the applications are often far too complex to be modeled fully
and correctly.
Section 9.2 defines notions and concepts. We present the experimental protocol in
Section 9.3 and perform the evaluation in Section 9.4.
This problem arises in the context of experiments with CiGri, where we must
deploy and experimentally evaluate a modified version of CiGri in
a realistic environment. It is unreasonable to deploy on the entire Gricad meso-
center, or to use simulation due to the complex software stack (CiGri + OAR + user
jobs). We are thus interested in folding strategies to perform the evaluation of our
CiGri modifications.
The statistical description of the execution times of the CiGri jobs presented in
Chapter 2 allows us to use a sleep model to represent the execution times. As sleep
calls are extremely lightweight for a CPU, we are able to fold several
virtual resources onto one physical resource. However, no realistic job only does
computation without reading and/or writing data. We can extend the previous job
model by adding I/O operations before or after the sleep with the dd command.
This addition makes it less obvious how much folding we can afford.
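A hedged sketch of such a job script, here packaged with Nix for the sake of reproducibility (the output path, file size and duration are illustrative; the job identifier is passed as the first argument):

{ pkgs ? import <nixpkgs> {} }:
pkgs.writeShellScriptBin "cigri-job" ''
  # I/O phase: write 100 MB to the shared file-system with dd.
  dd if=/dev/zero of=/data/job-$1.out bs=1M count=100
  # Compute phase: emulated by a sleep, following the statistical model of
  # Chapter 2 (a fixed 60 s here for simplicity).
  sleep 60
''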
(a) Folding with a factor of 1 (4 physical resources and 4 virtual resources). (b) Folding with a factor of 2 (2 physical resources and 4 virtual resources). (c) Folding with a factor of 4 (1 physical resource and 4 virtual resources).
Figure 9.1.: Example of folding a deployment for a system with 4 resources. Figure 9.1a
depicts the system deployed at full scale. Figure 9.1c represents the system
completely folded. Figure 9.1b shows an intermediate folded deployment.
We also define the folding factor (f_fold) as the number of virtual resources in the
deployment divided by the number of physical resources. Intuitively,
it represents the number of virtual resources per physical resource:

f_{fold} = \frac{\#resource_{virtual}}{\#resource_{physical}} \in [1, +\infty[ \qquad (9.1)
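For instance, emulating a cluster of 32 virtual resources on 8 physical machines (illustrative numbers) gives:

f_{fold} = \frac{32}{8} = 4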
9.3 Methodology
The following experiments were carried out on the gros cluster, located in Nancy, of the
Grid’5000 [Bal+13] French test bed. The machines of this cluster have an Intel Xeon
Gold 5220 CPU with 18 cores, 96 GiB of memory, a 2 x 25 Gbps (SR-IOV) network
and a 480 GB SSD SATA Micron MTFDDAK480TDN disk. The reproducibility of the
deployed environment was ensured by NixOS Compose [Gui+22b]. The environment
definitions are available at [hpc23a; hpc23b], the analysis scripts at [Gui+23b], and
the data on Zenodo [Gui+23c].
To evaluate the performance of the cluster file system, we chose to use the IOR
[Cal23] benchmark. IOR is an MPI application to benchmark I/O performance. It is
the most popular benchmark among research on I/O in HPC [Boi+18], and is also
used in the context of the IO500 list [IO5].
IOR has a multitude of parameters, but we give the main ones here. We used the POSIX
protocol (api=POSIX), a transfer size (transferSize) of 1 MByte and a segment
count of 1 (segmentCount). We assigned a single file per IOR process (filePerProc),
and checked the correctness of the writes and reads afterwards (checkWrite and checkRead).
The number of tasks (numTasks), which represents the number of IOR processes, and
the block size (blockSize), which is the total size of the file to read/write in this
case, are parameters of the following experiments. We used OpenMPI with the TCP
backend.
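A hedged sketch of the resulting invocation, packaged with Nix as in the rest of our experimental environments (the number of processes, the hostfile path and the 100 MByte block size are illustrative; the flags follow the standard IOR command-line interface):

{ pkgs ? import <nixpkgs> {} }:
pkgs.writeShellScriptBin "run-ior" ''
  # One MPI process per virtual resource, placed according to the hostfile.
  # --mca btl tcp,self selects the TCP backend mentioned above.
  # IOR flags: -a api, -t transferSize, -b blockSize, -s segmentCount,
  #            -F filePerProc, -W checkWrite, -R checkRead, -i repetitions.
  ${pkgs.openmpi}/bin/mpirun -np 16 --hostfile /tmp/hostfile --mca btl tcp,self \
    ${pkgs.ior}/bin/ior -a POSIX -t 1m -b 100m -s 1 -F -w -r -W -R -i 5
''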
(a) Distributed file system like NFS: multiple clients query a single I/O node that processes all the requests. (b) Parallel file system like OrangeFS: multiple clients query all the I/O nodes in parallel.
Figure 9.2.: Architectures for a distributed file system (Figure 9.2a), and parallel file system
(Figure 9.2b).
NFS (v4) NFS [Paw+00] is a popular distributed file-system for small clusters.
There is only one server. Clients mount the file-system and can perform POSIX
operations. Figure 9.2a depicts the simplified architecture of a distributed file system
like NFS. All the clients query the same I/O node for their files. NFS servers do
have several workers that can manage the requests concurrently. The NFS export
options used are: *(rw,no_subtree_check,fsid=0,no_root_squash). The NFS
server runs under the default configuration (8 workers).
Number of I/O nodes In the case of PFS, like OrangeFS, we did not find any
methodology nor “rule of thumb” to define the number of I/O nodes for a computing
cluster. In the following, we will consider OrangeFS file-systems with 1, 2, or 4 I/O
nodes. Note that in the case of NFS there is only one I/O node.
I/O load We consider 5 different sizes of I/O operations to perform, both in writing
and reading: 1Mbytes, 10Mbytes, 100Mbytes, 500Mbytes, and 1Gbytes. These file
sizes will be the values of the IOR blockSize option.
Protocol Let us consider a system with N CPU nodes. We first deploy the system
at full scale, i.e., N machines plus the I/O nodes of the file system. As IOR is an MPI
application, we compute the hostfile with one slot per compute node. We then
start the IOR benchmark, which we repeat 5 times (the IOR repetitions option), and
gather the performance reports. We then remove one compute node from the hostfile
and recompute the slots for the remaining nodes to keep the number of processes
(numTasks) constant. This protocol is repeated for all the different variations:
number of CPU nodes, size of I/O operations, type of file system, and number of I/O nodes.
Figure 9.3 shows a visual representation of the experimental protocol for an experi-
ment with 4 CPU nodes.
9.4 Evaluation
In this section, we present the results of the experiments described in the previous
section. We first consider the file-system of the cluster to be NFS in Section 9.4.1,
and then OrangeFS in Section 9.4.2.
Figure 9.4 shows the results of the experiments when using NFS. We also plot the
95% confidence intervals. Note that we did not remove the outliers. We notice that
the writing performance (bottom row) does not seem to be affected by the folding
of the deployment, as it remains flat. However, for the reading performance (top
row), we can see that the reading times increase for higher folding factors. This
means that the more we fold, the more we degrade the reading performance.
(a) Initial deployment at full scale: one process per node.
(b) We remove node4 and add one extra slot to node3.
(c) We remove node2 and add one extra slot to node1.
(d) We remove node3 and reach a fully folded deployment.
Figure 9.3.: Graphical representation of the experimental protocol presented in Section 9.3.
We start with a full scale deployment, i.e., one MPI process (viewed as a virtual
resource) per physical node (Figure 9.3a), and remove from the hostfile the
physical nodes one by one while keeping the number of MPI processes (i.e., the
number of virtual resources) constant. We recompute the number of slots per
node to balance the processes (Figures 9.3b and 9.3c). The experiment stops
when there is no more node to remove (i.e., after Figure 9.3d).
[Plot: reading (top row) and writing (bottom row) times in seconds (y-axis) against the number of MPI processes per machine (x-axis), for several cluster sizes.]
Figure 9.4.: Evolution of the reading (top row) and writing (bottom row) times based on the folding factor (x-axis) for experiments with different cluster sizes (i.e., numbers of CPU nodes) and different sizes of file to read and write (point shape). We observe that the writing performances are not affected by the folding, but that the reading ones are, and that the degradation grows quadratically with respect to the folding factor.
[Plot: reading time in seconds (y-axis, linear scale on the top row, log scale on the bottom row) against the number of MPI processes per machine (x-axis).]
Figure 9.5.: Linear regression modeling the reading times (y-axis) against the folding factor (x-axis) and file size (point shape). The top row shows the fitting of the model on the data, and the bottom row the same data in log scale. We can see that the model fits the data correctly for file sizes greater than 10M. 1M files do not seem to be affected by the folding, and their variation in performance seems to be due to noise.
[Figure 9.6a: maximum folding factor (y-axis) against the percentage of degradation compared to the full scale deployment (x-axis), per file size. Figure 9.6b: distribution of read request sizes (I/O size in bytes) on ANL-Theta.]
(a) Maximum folding factor (f_fold) based on the accepted degradation of the reading time compared to the full scale deployment and to the size of the file to read.
(b) Proportion of the size of the read operations on ANL-Theta between 2019 and 2022. The majority is smaller than 100K. These values are extracted from Darshan [Car+11; Car+09] logs.
Figure 9.6.: Figure 9.6a shows the maximum folding factor to use to have a desired overhead on the reading times on NFS based on the file size. Figure 9.6b shows the distribution of the number of read requests per size on ANL-Theta.
We modeled the reading performance based on the size of the file to read and the number of CPU nodes involved. Figure 9.5 shows the results of a linear regression between the reading time and the folding factor, the file size and the number of CPU nodes involved. We fitted a model of the following form:
\[ t_{read}(f_{fold}, f_{size}) \simeq \alpha + \beta_1 f_{fold}^2 + \beta_2 f_{size} + \gamma f_{fold}^2 f_{size} \qquad (9.2) \]
Figure 9.5 shows the fitting of the model on the NFS data. The bottom row represents the same information as the top one, but in log scale. We can see that the model fits all the file sizes except 1M. We believe that the variations in performance for the 1M files are due to noise.
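For illustration, the coefficients of Equation 9.2 can be estimated with an ordinary least-squares fit, as sketched below; this is a minimal sketch assuming the measurements are already loaded as arrays, not the actual analysis scripts (available at [Gui+23b]).

    import numpy as np

    def fit_read_model(f_fold, f_size, t_read):
        """Least-squares fit of t_read ~ alpha + beta1*f_fold^2 + beta2*f_size
        + gamma*f_fold^2*f_size (Equation 9.2)."""
        f_fold = np.asarray(f_fold, dtype=float)
        f_size = np.asarray(f_size, dtype=float)
        X = np.column_stack([
            np.ones_like(f_fold),    # alpha
            f_fold ** 2,             # beta1
            f_size,                  # beta2
            (f_fold ** 2) * f_size,  # gamma
        ])
        coeffs, *_ = np.linalg.lstsq(X, np.asarray(t_read, dtype=float), rcond=None)
        alpha, beta1, beta2, gamma = coeffs
        return alpha, beta1, beta2, gamma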
We are interested in knowing the maximum folding factor (f_fold) possible for a desired file size (f_size). Let p > 1 be the accepted factor of increase of the reading time (e.g., p = 1.05 for a 5% degradation) compared to the full scale deployment containing nb_cpu CPU nodes. By using the definition of the model for t_read in Equation 9.2, we get:
\[ \frac{t_{read}(f, f_{size})}{t_{read}(nb_{cpu}, f_{size})} < p \implies f < \sqrt{\frac{p \times t_{read}(nb_{cpu}, f_{size}) - (\alpha + \beta_2 f_{size})}{\beta_1 + \gamma f_{size}}} \qquad (9.3) \]
For files of size 10M, folding 10 resources onto a single physical resource leads to a degradation of 5%. To reach the same degradation for file sizes of 100M or 1G, the maximum folding factor would be 5 (Figure 9.6a).
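Equation 9.3 translates directly into a small helper; in the sketch below the fitted coefficients are kept as parameters, since their numerical values come from the regression above.

    from math import sqrt

    def max_folding_factor(p, t_read_full, f_size, alpha, beta1, beta2, gamma):
        """Largest folding factor keeping the reading time under p times the
        full-scale reading time t_read_full (Equation 9.3)."""
        return sqrt((p * t_read_full - (alpha + beta2 * f_size)) / (beta1 + gamma * f_size))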
From the model defined previously, Figure 9.6a, and the Darshan [Car+11; Car+09] logs from ANL-Theta between 2019 and 2022 (Figure 9.6b)¹, we can estimate the overhead of replaying these Darshan logs (on NFS and with the job model considered in Chapter 2) with different folding factors. For example, a folding factor of 10 would lead to an overhead of 64 hours over 4 years (i.e., an increase of 0.2%), while requiring 10 times fewer machines.
In the case of OrangeFS, there is an extra dimension to explore: the number of I/O
nodes in the file-system.
Figures 9.7 and 9.8 show respectively the evolution of the writing and reading times of the IOR benchmark for different numbers of CPU nodes in the cluster and I/O nodes in the file-system, as well as different file sizes.
We can see on Figure 9.7 that contrary to NFS, there is a significant loss of per-
formance for the write operations when increasing the folding factor. This loss
of performance appears more significant when there are more I/O nodes in the
file-system.
Concerning the reading performances (Figure 9.8), we observe the same behavior
as for NFS. High folding factors lead to an increase of reading time. The increase
appears greater when there are more I/O nodes in the file-system. The start of this
increase seems to depend on the number of CPU nodes. The more CPU nodes, the
later the increase starts. It is interesting to note that the variation when measuring
the reading time during the experiments is smaller (thus more stable) than for
NFS.
¹ This data was generated from resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
[Plot: writing time in seconds (y-axis) against the number of MPI processes per machine (x-axis), one row per number of I/O nodes (1, 2, 4).]
Figure 9.7.: Evolution of the writing times with OrangeFS based on the folding factor (f_fold) for experiments with different numbers of CPU and I/O nodes and different sizes of file to write.
[Plot: reading time in seconds (y-axis) against the number of MPI processes per machine (x-axis), one column per cluster size (8, 16, 24, 32 CPU nodes) and one row per number of I/O nodes (1, 2, 4).]
Figure 9.8.: Evolution of the reading times with OrangeFS based on the folding factor (f_fold) for experiments with different numbers of CPU and I/O nodes and different sizes of file to read.
[Plot: reading (left) and writing (right) times in seconds against the number of MPI processes per machine, with the fitted breakpoint model overlaid.]
Figure 9.9.: Model of the breaking point in the reading (left) and writing (right) performance for 32 CPU nodes (nb_cpu) and 4 I/O nodes (nb_io). The model (dashed line) comes from Equation 9.4.
For both the reading and writing performances there seem to be two behaviors: first a phase where the folding does not affect the performance, and then, from a given folding factor, a phase where the reading/writing times grow linearly. As we want to know the maximum folding factor up to which the folded system behaves like a full scale system, we are interested in finding the breakpoint where the behavior changes.
Using segmented regression techniques [Mug03], we found a model that we simplified into a rule of thumb (Equation 9.4), where f_break,r and f_break,w are respectively the folding factors of the breakpoints for the reading and writing performances, and nb_cpu and nb_io are the numbers of CPU and I/O nodes in the system.
As f_break,w (Equation 9.4) will always be greater than f_break,r, the overall breaking point in performance for OrangeFS is f_break,r.
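For illustration, such a breakpoint can be located by scanning candidate breakpoints and keeping the one minimizing the residuals of a two-piece fit (flat, then linear); the sketch below is a simplified illustration of this idea and not the exact procedure of [Mug03].

    import numpy as np

    def find_breakpoint(x, y):
        """Return the candidate breakpoint minimizing the residuals of a
        piecewise-linear fit: flat before the breakpoint, linear after it."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        best_bp, best_err = None, np.inf
        for bp in x[1:-1]:
            X = np.column_stack([np.ones_like(x), np.maximum(0.0, x - bp)])
            coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
            err = np.sum((X @ coeffs - y) ** 2)
            if err < best_err:
                best_bp, best_err = bp, err
        return best_bp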
9.5 Conclusion
In this chapter, we investigated the use of folding techniques to reduce the number of deployed machines for the large scale evaluation of applications such as grid and cluster middlewares like CiGri. We have seen that folding requires a change in the job model. We focused on the impact on the performance of the file-system. To do so, we took a distributed I/O benchmark, IOR, and ran it for several cluster sizes, I/O loads and distributed file-systems. We analyzed the results of the benchmark and reached the following conclusions:
• Write operations on NFS are not affected by folding (Take away 9.1).
• There are breaking points in reading and writing performance for OrangeFS when folding the cluster. Equation 9.4 gives rules of thumb to estimate these breakpoints (Take away 9.4).
This study presents some limitations. The studied file-systems are not among the most popular in large HPC centers. It would be interesting to consider file-systems such as Lustre [Lus], BeeGFS [Bee], GlusterFS [BBP12], and Ceph [Wei+06]. During his Master internship, Alexandre Lithaud helped set up distributed environments in NixOS Compose containing such PFSs, which will allow extending this study [Lit23]. The reasons for this loss of performance due to folding are still unclear. The main suspect is the network. We did observe a speed (23 Gbps) similar to the one advertised on the NIC (25 Gbps). We think that the overhead can be due to the
² https://www.labri.fr/perso/fzanonboito/
Distributed experiments are complex and often require several machines for several hours or days. Such experiments are time and resource consuming, but are nevertheless mandatory to validate research work. Deploying and running long-lasting experiments during the exploratory phases of research is an obstacle to careful and sane work and must be addressed.
Experiments on systems such as grid or cluster middlewares, like CiGri, also suffer from high experimental costs and could benefit from simulation techniques. However, due to this additional middleware layer, the simulators cited above are not directly equipped to simulate such systems.
For our experiments with CiGri, having access to information about the state of
the machines, the state of the scheduler, etc. would be ideal to implement sensors
and feedback loops. However, simulators usually expose a strict interface and set
of events to the users, without the possibility for the users to obtain more internal
information inside the simulator core.
In this chapter, we present and evaluate BatCiGri, a simulator of the CiGri grid
middleware within Batsim. In Section 10.2 we present the desired properties of the
simulations, the design of the simulation of the middleware as well as its calibration
to better match reality. The evaluation of the simulation is performed in Section
10.3.
To test our version of CiGri with our controllers, we deploy a modified CiGri as well as an instance of OAR and compute nodes. Performing faithful evaluations would require deploying on the entire Gricad meso-center and replaying long workloads, which is unreasonable. We are thus interested in simulation techniques to reduce the experimental costs. However, one potential limitation of simulation techniques in our case is that the signals of interest might not be available, or might have different dynamics or properties than on the real system.
Simulators from the state-of-the-art are not directly equipped for simulating
middlewares such as CiGri with the need for internal sensors on the system.
10.2 BatCiGri
For the CiGri simulations to be useful from the point-of-view of Control Theory,
they must have the following properties: (i) jobs must have realistic execution
times, (ii) best-effort jobs must be killed and release resources for the normal jobs,
(iii) the killing and releasing of the resources must happen in a realistic time, and
(iv) information about the usage of the platform and about the inner state of the
scheduler must be accessible.
10.2.2 Hypotheses
We work under the following hypotheses: (i) there is only one cluster in the grid,
and (ii) the only best-effort jobs come from CiGri. Regular users of the cluster cannot
submit best-effort jobs.
Batsim is a batch scheduler simulator which allows users to test their scheduling algorithms, i.e., how the jobs are mapped to the resources. Batsim relies on Simgrid [Cas+14] for sound simulation models. The remainder of this section presents two important concepts of Batsim: platforms and workloads.
Workloads Workloads contain information about the jobs that will participate in
the simulation. There are two main components: jobs and profiles. Profiles
define the behavior of the jobs, i.e., the underlying simulation to use (delay, parallel
tasks, SMPI, etc.), execution times, SMPI trace to replay, etc. In a Batsim workload,
a job refers to a profile. Each job must have an identifier, a submission time
and a requested number of resources. Listing 10.1 shows a simple example of
Batsim workload.
The study of the CiGri jobs running on the Gricad platform presented in Chapter 2
gives a statistical description of the execution times of those jobs. This allows us to
use a delay model to represent the execution times.
The computing grid, in the context of CiGri, requires two levels of scheduling: a first level from CiGri to OAR for the best-effort jobs, and a second level from OAR to the nodes for the jobs of the normal users. Our simulation needs to capture these two levels.
To do so, we will use two Batsim schedulers: one for the CiGri jobs and one for the priority jobs. Each scheduler manages its own workload but schedules on the same platform.
As best-effort jobs must have a lower priority than the normal jobs, we need a way to kill them. The CiGri scheduler thus only sees the free resources of the cluster to perform its scheduling of best-effort jobs. On the other hand, the priority scheduler does not see the resources taken by CiGri jobs as occupied, and can decide to allocate them to priority jobs.
"res": 1,
EXECUTE_JOB
7 "subtime": 0
}, REMOVE_RESOURCES
9 {
"id": 2, KILL_JOB
11 "profile": "cigri", KILL_JOB
"res": 1,
13 "subtime": 0 JOB_KILLED
JOB_KILLED
},
15 {
RESOURCES_ADDED
"id": 3,
EXECUTE_JOB
17 "profile": "cigri",
"res": 1,
19 "subtime": 0 JOB_COMPLETED
} ADD_RESOURCES
21 ],
RESOURCES_ADDED
"nb_res": 32,
23 "profiles": { JOB_COMPLETED
"cigri": {
25 "delay": 235.0,
"type": "delay"
27 } Figure 10.1.: Sequence Diagram represent-
}
ing the killing of best-effort
29 }
jobs when a new priority job
Listing 10.1: Example of Batsim workload is submitted, as well as when
with 3 jobs belonging to the a priority job finishes making
cigri profile. Each job requests its resources idle and thus ex-
one resource and are submitted ploitable by CiGri.
at the start of the simulation.
To be as close to reality as possible, we used the same scheduling algorithms as the real system: conservative backfilling for the priority jobs, and First-Come-First-Served (FCFS) for the CiGri jobs. The CiGri scheduler is implemented with the Python interface to Batsim [bat19].
10.2.5 Broker
Batsim can only communicate with a single scheduler. However, as seen in the previous section, we have two different schedulers. To deal with this limitation, we used the work done in [Mer19], which implements a message broker between Batsim and the schedulers [Mer18; Gui23a].
The two schedulers connect to the broker, and the broker connects to Batsim. It filters and redirects the messages between the different actors. The main part of its work is to manage the adaptation of the available resources for the CiGri scheduler. When a priority job is submitted, Batsim sends a JOB_SUBMITTED message to the broker. The broker then forwards this message to the priority job scheduler. If the allocation of resources returned by the scheduler contains best-effort jobs, the broker informs the CiGri scheduler by sending a REMOVE_RESOURCES message.
In this case, the CiGri scheduler must take care of killing the concerned jobs and resubmitting them in its queue. When a priority job terminates, its resources become free and thus available to the CiGri scheduler. The broker then sends an ADD_RESOURCES message to CiGri to indicate the availability of new resources. Figure 10.1 depicts the sequence diagram of the killing of a best-effort job due to the submission of a normal job.
To represent CiGri jobs and regular jobs, we have two different schedulers.
As Batsim can only communicate with a single scheduler, we use a broker to
redirect the messages between Batsim and the correct scheduler.
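To make the message flow concrete, the following schematic Python sketch summarizes the broker's filtering logic; apart from the Batsim message names mentioned above (JOB_SUBMITTED, REMOVE_RESOURCES, ADD_RESOURCES, JOB_COMPLETED), the objects and function names are hypothetical and do not correspond to the actual broker implementation of [Mer18; Gui23a].

    def on_batsim_event(event, priority_scheduler, cigri_scheduler, broker):
        """Schematic broker logic between Batsim and the two schedulers."""
        if event["type"] == "JOB_SUBMITTED" and event["workload"] == "priority":
            # Forward to the priority scheduler and get its allocation decision.
            allocation = priority_scheduler.handle(event)
            stolen = allocation.resources & cigri_scheduler.visible_resources
            if stolen:
                # Tell the CiGri scheduler that these resources are no longer usable;
                # it will kill and resubmit the affected best-effort jobs.
                broker.send(cigri_scheduler, {"type": "REMOVE_RESOURCES", "resources": stolen})
        elif event["type"] == "JOB_COMPLETED" and event["workload"] == "priority":
            # The freed resources become usable again by CiGri.
            broker.send(cigri_scheduler, {"type": "ADD_RESOURCES", "resources": event["resources"]})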
In our case, the sensors are the number of waiting best-effort jobs and the number of resources used on the platform. The length of the waiting queue is internal information of the scheduler, whereas the number of resources used is computed indirectly. Remember that the CiGri scheduler only sees the resources that are not used by the priority scheduler. Thus, the number of resources currently used on the cluster is the total number of resources, minus the number of resources visible to CiGri, plus the number of resources used by CiGri jobs.
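A direct transcription of this computation, with hypothetical variable names:

    def used_resources(nb_total: int, nb_visible_to_cigri: int, nb_used_by_cigri: int) -> int:
        """Number of resources currently used on the cluster, as seen from CiGri."""
        return nb_total - nb_visible_to_cigri + nb_used_by_cigri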
The remainder of the CiGri cycle is relatively straightforward and is shown in Listing 10.2. All the CiGri jobs are available at the start of the simulation, which means that, in the Batsim workload, they are submitted at time 0.
The CALL_ME_LATER event of Batsim allows us to implement the submission loop, but the available sensors are limited to information about the platform.
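A schematic version of this submission loop could look as follows; the names below (state, controller, submit_best_effort_jobs, ask_call_me_later) are hypothetical and do not correspond to the actual pybatsim API nor to Listing 10.2.

    def on_requested_call(state):
        """Schematic CiGri submission loop, triggered by the CALL_ME_LATER event."""
        waiting = len(state.cigri_waiting_jobs)                    # sensor: waiting queue
        used = used_resources(state.nb_total, state.nb_visible, state.nb_cigri)
        nb_to_submit = state.controller.compute(waiting, used)     # feedback controller
        state.submit_best_effort_jobs(nb_to_submit)
        state.ask_call_me_later(state.now + 30)                    # next loop in 30 seconds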
The synchronization between the real experiments and the simulation is complex, and the simulation workload thus needs to be adjusted to match the real workload.
Starting Delay of OAR Experiments showed that OAR needs about 1 minute and 30 seconds to start the first jobs after the first submission. This delay must be taken into account in the simulation. From the point of view of the CiGri scheduler, this delay can be approximated by not starting the jobs submitted during the first 3 CiGri cycles.
Listing 10.2: Implementation of the CiGri submission loop in Batsim. It is triggered by the
CALL_ME_LATER event. At the end of each loop, we ask Batsim to notify us for
the next loop (line 18).
Killing of Best-Effort Jobs In Batsim, when priority jobs are submitted and they can be scheduled by killing best-effort jobs, the best-effort jobs are immediately
[Plot: histogram of job overheads (left, count vs. job overhead in seconds) and empirical vs. model CDF (right, proportion vs. job overhead in seconds).]
(a) Histogram of the distribution of job overhead due to the commission and decommission of resources by OAR. Most of the overhead is around 2 and 3 seconds.
(b) Comparison between the empirical cumulative distribution function (CDF) of the overhead (solid) and the CDF of the Lognormal model identified (dashed).
Figure 10.2.: Distribution of the job overheads due to OAR commissioning and decommis-
sioning the nodes of the cluster. Figure 10.2b shows the comparison between
the data and the identified model.
stopped, and the priority jobs start instantaneously. In practice, the priority jobs spend some time in the waiting queue while the best-effort jobs are being killed and the nodes cleaned and set up. This delay can be taken into account in the description of the priority jobs: their execution time in Batsim must also contain this delay.
OAR introduces variations in the execution times of the jobs due to several
factors. To improve the realism of the simulation, we model this delay and
inject it during the generation of the CiGri workload.
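As an illustration of this calibration step, the sketch below generates the CiGri part of a Batsim workload (same structure as Listing 10.1) with a Lognormal overhead added to each execution time; the distribution parameters mu and sigma are placeholders, not the values identified from Figure 10.2.

    import json
    import random

    def generate_cigri_workload(nb_jobs=500, base_delay=235.0, mu=1.0, sigma=0.3):
        """Batsim workload for the CiGri jobs, with a Lognormal OAR overhead
        added to each execution time (mu and sigma are placeholder parameters)."""
        profiles, jobs = {}, []
        for i in range(1, nb_jobs + 1):
            delay = base_delay + random.lognormvariate(mu, sigma)
            profiles[f"cigri_{i}"] = {"type": "delay", "delay": round(delay, 2)}
            jobs.append({"id": i, "profile": f"cigri_{i}", "res": 1, "subtime": 0})
        return {"jobs": jobs, "nb_res": 32, "profiles": profiles}

    with open("cigri_workload.json", "w") as f:
        json.dump(generate_cigri_workload(), f, indent=2)

Using one profile per job lets each job carry its own sampled overhead, whereas Listing 10.1 shares a single cigri profile among all jobs.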
10.3 Evaluation
Experimental Protocol For both the real system and the simulated one, we execute the same scenario. There are 500 CiGri jobs with an execution time of 235 seconds. The submission loop of CiGri is called every 30 seconds, in order to see how the system responds to delays in the control input. After 2000 seconds, a priority job is submitted and takes half of the resources of the cluster for 1800 seconds. The controller of CiGri aims to regulate the quantity w_k + r_k around the value 64 (which is double the number of resources in the cluster).
We deploy 3 nodes: one for the OAR server, one for CiGri, and one for the OAR cluster. We do not deploy 32 nodes for the cluster, but instead deploy a single node on which we define 32 OAR resources.
Execution time One of the motivations of this study is the cost in time and resources of the experiments. Real experiments require deploying 3 nodes (around 10 minutes) and then executing the scenario (around 1 hour and 20 minutes). In total, a single execution of the scenario consumes around 9 CPU hours.
Signals Comparison For the simulation of CiGri to be useful, we need the signals
of interest to have the same properties and behave the same in both simulation and
real experiments. The signals of interest are:
• the dynamic of a CiGri submission (i.e., the time it takes to see the impact of a
submission)
Figure 10.3 shows the comparison of the signals of interest between experiments of the same scenario executed in simulation (red) and deployed on real machines (blue). The signals appear to be in sync. The amplitudes do differ, as can be observed around 500 seconds. The real system is obviously more sensitive to noise. This noise can be noticed when looking at the used resources (top left graph of Figure 10.3): the cluster in the simulation is always full, whereas the cluster during the real experiments is not (e.g., at 1000, 2000, and 4500 seconds).
[Plots: signals of interest over time in seconds, with the reference value of the controller ("Ref"), for the Batsim simulation and the real deployment.]
Figure 10.3.: Comparison of the signals of interest for the same experiment executed in simulation with Batsim (red) and deployed on real machines (blue). The signals appear to be in sync, but some amplitudes differ.
Gantt Charts Comparison Figure 10.4 compares the resulting Gantt charts of the experiment for the simulation (top) and the real execution (bottom). We notice that there are “gaps” in the real schedule (e.g., at time 1500 seconds on resource 24). These gaps create a lag in the schedule which also impacts the signals.
This lag comes from the OAR scheduling algorithm: once OAR has started to compute a schedule, any job arriving during this computation will not be taken into account before the next scheduling call. Taking this lag into account in the simulation is complex, as Batsim is responsible for the management of the simulation time, and because time “stops” during the computation of the schedule.
10.4 Conclusion
Distributed experiments are complex and costly. Simulation techniques can help reduce the cost of such experiments. However, simulators rely on models that can lose information compared to the real system. In this chapter, we implemented the essential behavior of CiGri in Batsim. Real experiments with CiGri require 3 compute nodes for several hours, while the simulation lasts a few seconds on a laptop.
[Gantt charts: resources (y-axis) against time in seconds (x-axis) for the simulation (top) and the real execution (bottom); CiGri best-effort jobs and the priority job are shown in different colors.]
Figure 10.4.: Comparison of the Gantt charts for the simulation (top) and real experiment
(bottom) of the same scenario. We observe a small lag, which is due to OAR,
but both schedules are similar.
Being able to simulate the behavior of classical PFS within Simgrid would lead to a
decrease of experimental costs. Integration with Batsim would be a plus. Moreover,
having a framework to test designs of PFS would enable system administrators to
find the best PFS configuration for their system.
Integration between NixOS Compose and EnOSlib As of now, NixOS Compose does not provide its own engine to conduct experiments, and this is not our intention. We developed an integration of NixOS Compose into the Execo engine [Imb+13]. Discussions have started to support deployments with NixOS Compose in EnOSlib. This integration would provide NixOS Compose with the large collection of tools from EnOSlib, and would provide EnOSlib with fully reproducible environments.
Mixed deployments When deploying CiGri experiments, for example, we do not deploy a 1-to-1 scale cluster; instead, we define several computing resources (from the point of view of OAR) on a single machine. This allows us to reduce the number of machines to deploy.
A study of the lifetime of artifacts, meaning the duration during which an artifact still rebuilds the correct/intended software from source, for the top conferences implementing an artifact reviewing process would most likely reveal the imperfections of the usual tools, as seen in Chapter 7.
Reproducibility metric and partial reproducibility The current badge systems used during artifact reviews do not capture all the potential losses of reproducibility presented in Section 7.1.2. A system indicating the pitfalls of the work, along several dimensions, might be more insightful for the readers, e.g., “machine dependent”, “using a non-free library”. Such criteria could be evaluated by artifact reviewers with the help of a “checklist”, without having to replay any experiment, only by inspecting the source code. The filled checklist would then be attached to the paper after publication. The creation of such a checklist covering most reproducibility cases would be a great asset for reviewers, to ease the reviewing process, for readers, to know what to expect from the presented results, and for authors, to know what to improve.
The potential issue with such an approach would be the backlash from the community. Today, authors gain badges for doing more in terms of reproducibility, and this is a way for the publisher to gently invite authors to improve the quality of their submission. However, the artifact review is not always mandatory, and failing the reproducibility review does not prevent the publication of the paper. A system as presented above takes the problem from the other side and would explicitly point at the shortfalls of the work, which could be seen as “public shaming” by authors.
The current artifact reviewing system is far from perfect, and must evolve towards
something more strict and precise. Considering that it took 10 years for the artifact
Similar problems, but at smaller scales, arise from the use of functional package managers. When replaying an experiment that was packaged with an FPM, the first step is to set up the software environment. This step can be very long, as it needs to download a lot of packages and their dependencies. Moreover, if the work is not recent, or introduces variation deep into the software stack, the required packages will not be present in the binary caches, and will thus need to be recompiled from scratch. Storage is also an issue for FPMs, as the stores tend to grow very large with every variation of the packages.
We gave some detailed perspectives of this thesis in Chapters 6 and 11. In this chapter, we give a higher-level conclusion.
Regulation problems are frequent in Computer Science, especially in fields where there are runtime components or Service-Level Agreements. Scheduling and Machine Learning based strategies are very popular but have limitations. Scheduling approaches usually make assumptions about the system, the information given by the users, etc. This information is then processed by complex (cognitively and computationally) algorithms that return a schedule of the jobs. It is possible to prove performance bounds for the scheduling algorithm. Machine learning strategies are black boxes: the cost of training the model is high, and the resulting model does not have any proven guarantee other than the precision resulting from the training. Control Theory is a powerful, yet underused, technique for the runtime management of computing systems. It excels at regulating trade-offs, degradation levels, SLAs, QoS, etc. Moreover, using Control Theory tools on computing systems can foster interaction between research fields, as it requires knowledge and expertise rarely seen among computer scientists and system administrators. We tried to tackle this issue at our modest scale by creating and presenting a tutorial to introduce Control Theory to computer scientists [Gui+b].
best tools to manage reproducible software environments. We hope that NixOS Compose will help researchers in distributed systems to transition away from non-reproducible tools. In general, there is a need to educate the communities (not only computer scientists) about these reproducibility questions (e.g., [Gui+a; FPG; LHP; Leg+]), and to develop reproducibility not as an afterthought, but as a fundamental skill. The cost of experiments in distributed systems, especially when evaluating grid or cluster middlewares, is significant. Having realistic experiments implies performing large scale deployments, which is expensive. Simulations are a good alternative, but are limited by the fact that the real software is not executed. Moreover, the state of the simulation of parallel file-systems is not mature enough to perform the CiGri experiments in simulation. We believe that an interesting alternative is the use of intermediate deployment techniques such as the one presented in Chapter 9.
Three concerns of scientific research are: the explainability and the energy efficiency
of the solutions, and the reproducibility of research. This thesis attempted to take
into account these concerns, and to raise awareness about them.
Bibliography
[ÅH06] Karl Johan Åström and Tore Hägglund. Advanced PID Control. English. ISA - The
Instrumentation, Systems and Automation Society, 2006. cit. on p. 78
[And+02] David P. Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer.
“SETI@home: an experiment in public-resource computing”. en. In: Communica-
tions of the ACM 45.11 (Nov. 2002), pp. 56–61. DOI: 10.1145/581571.581573.
URL : http://portal.acm.org/citation.cfm?doid=581571.581573 (visited
on June 11, 2020). cit. on p. 11
[And04] D.P. Anderson. “BOINC: A System for Public-Resource Computing and Storage”.
en. In: Fifth IEEE/ACM International Workshop on Grid Computing. Pittsburgh,
PA, USA: IEEE, 2004, pp. 4–10. DOI: 10.1109/GRID.2004.14. URL: http://ie
eexplore.ieee.org/document/1382809/ (visited on June 11, 2020).
cit. on p. 11
[ÅW08] Karl Johan Åström and Björn Wittenmark. Adaptive Control. 2nd ed. Dover
Publications, 2008. cit. on pp. 67, 68
[ÅW13] Karl J Åström and Björn Wittenmark. Adaptive control. Courier Corporation,
2013. cit. on p. 78
[Bak16] Monya Baker. “1,500 scientists lift the lid on reproducibility”. In: Nature
533.7604 (May 2016), pp. 452–454. DOI: 10.1038/533452a. URL: http://www
.nature.com/doifinder/10.1038/533452a (visited on May 3, 2019).
cit. on p. 107
[Bal+13] Daniel Balouek, Alexandra Carpen Amarie, Ghislain Charrier, et al. “Adding
Virtualization Capabilities to the Grid’5000 Testbed”. In: Cloud Computing and
Services Science. Ed. by Ivan I. Ivanov, Marten van Sinderen, Frank Leymann, and
Tony Shan. Vol. 367. Communications in Computer and Information Science.
Springer International Publishing, 2013, pp. 3–20. DOI: 10.1007/978-3-319-0
4519-1\_1. cit. on pp. 69, 87, 110, 119, 124, 144, 165
[bat19] [SW] batsim, 2019. URL: https : / / gitlab . inria . fr / batsim / pybatsim,
SWHID : ⟨swh:1:dir:116fa04d25d22b26f6dcf110f976b681ffa9f027;origin
=https://gitlab.inria.fr/batsim/pybatsim⟩. cit. on p. 161
[BBP12] Eric B Boyer, Matthew C Broomfield, and Terrell A Perrotti. Glusterfs one storage
server to rule them all. Tech. rep. Los Alamos National Lab.(LANL), Los Alamos,
NM (United States), 2012. cit. on p. 154
[Bel05] Fabrice Bellard. “QEMU, a fast and portable dynamic translator.” In: USENIX annual technical conference, FREENIX Track. Vol. 41. 46. California, USA. 2005, pp. 10–5555. cit. on p. 119
[Bha+13] Abhinav Bhatele, Kathryn Mohror, Steven H Langer, and Katherine E Isaacs.
“There goes the neighborhood: performance degradation due to nearby jobs”.
In: Proceedings of the International Conference on High Performance Computing,
Networking, Storage and Analysis. 2013, pp. 1–12. cit. on p. 3
[BNW03] John Brevik, Daniel Nurmi, and Rich Wolski. Quantifying machine availability
in networked and desktop grid systems. Tech. rep. Technical Report CS2003-37,
Dept. of Computer Science and Engineering . . ., 2003. cit. on p. 35
[Boi+18] Francieli Zanon Boito, Eduardo C Inacio, Jean Luca Bez, et al. “A checkpoint of
research on parallel i/o for high-performance computing”. In: ACM Computing
Surveys (CSUR) 51.2 (2018), pp. 1–35. cit. on p. 144
[Bon+11] Michael Moore David Bonnie, Becky Ligon, Mike Marshall, et al. “OrangeFS:
Advancing PVFS”. In: USENIX Conference on File and Storage Technologies (FAST).
2011. cit. on p. 145
[Bor+07] Julian Borrill, Leonid Oliker, John Shalf, and Hongzhang Shan. “Investigation of
leading HPC I/O performance using a scientific-application derived benchmark”.
In: Proceedings of the 2007 ACM/IEEE conference on Supercomputing. 2007,
pp. 1–12. cit. on pp. 52, 53
[Bra+11] Grant R Brammer, Ralph W Crosby, Suzanne J Matthews, and Tiffani L Williams.
“Paper mâché: Creating dynamic reproducible science”. In: Procedia Computer
Science 4 (2011), pp. 658–667. cit. on p. 109
[Bri+19] Adam Brinckman, Kyle Chard, Niall Gaffney, et al. “Computing environments
for reproducibility: Capturing the “Whole Tale””. In: Future Generation Computer
Systems 94 (2019), pp. 854–867. cit. on p. 109
[Bru23] Samuel Brun. “Étude du phénomène Stat-Storm - Limitation des appels systèmes
pour les systèmes de fichiers distribués de type store”. MA thesis. Université
Grenoble Alpes, Sept. 2023. URL: https://inria.hal.science/hal-0419772
4. cit. on p. 170
[Bur22a] [SW] Sander van der Burg, Disnix Apr. 2022. LIC: LGPL-2.1. URL: https://gi
thub.com/svanderburg/disnix, SWHID: ⟨swh:1:dir:c4f91c498853331fc9a
1c4607d0e6d7c01aa7d6c;origin=https://github.com/svanderburg/disni
x⟩. cit. on p. 119
[Bur22b] [SW] Sander van der Burg, DisnixOS 2022. LIC: LGPL-2.1. URL: https://gith
ub.com/svanderburg/disnixos, SWHID: ⟨swh:1:dir:13bcd4e967b9abf5306
6ebd558b30d8d82920f77;origin=https://github.com/svanderburg/disn
ixos⟩. cit. on p. 119
[Cap+05] N. Capit, G. Da Costa, Y. Georgiou, et al. “A batch scheduler with high level
components”. en. In: CCGrid 2005. IEEE International Symposium on Cluster
Computing and the Grid, 2005. Cardiff, Wales, UK: IEEE, 2005, 776–783 Vol. 2.
DOI : 10.1109/CCGRID.2005.1558641. URL : http://ieeexplore.ieee.org/d
ocument/1558641/ (visited on May 25, 2020).
cit. on pp. 13, 23, 70, 125, 129, 142
[Car+09] Philip Carns, Robert Latham, Robert Ross, et al. “24/7 characterization of
petascale I/O workloads”. In: 2009 IEEE International Conference on Cluster
Computing and Workshops. IEEE. 2009, pp. 1–10. cit. on pp. 25, 150, 151
[Car+11] Philip Carns, Kevin Harms, William Allcock, et al. “Understanding and improving
computational science storage access through continuous characterization”. In:
ACM Transactions on Storage (TOS) 7.3 (2011), pp. 1–26.
cit. on pp. 25, 54, 99, 150, 151
[Cas+00] Henri Casanova, Arnaud Legrand, Dmitrii Zagorodnov, and Francine Berman.
“Heuristics for scheduling parameter sweep applications in grid environments”.
In: Proceedings 9th Heterogeneous Computing Workshop (HCW 2000)(Cat. No.
PR00556). IEEE. 2000, pp. 349–363. cit. on p. 26
[Cas+14] Henri Casanova, Arnaud Giersch, Arnaud Legrand, Martin Quinson, and Frédéric
Suter. “Versatile, Scalable, and Accurate Simulation of Distributed Applications
and Platforms”. In: Journal of Parallel and Distributed Computing 74.10 (June
2014), pp. 2899–2917. URL: http://hal.inria.fr/hal-01017319.
cit. on pp. 105, 115, 141, 159
[CB01] Walfredo Cirne and Francine Berman. “A comprehensive model of the super-
computer workload”. In: Proceedings of the Fourth Annual IEEE International
Workshop on Workload Characterization. WWC-4 (Cat. No. 01EX538). IEEE.
2001, pp. 140–148. cit. on pp. 3, 12, 95
[CDR14] Julio Cano, Gwenaël Delaval, and Eric Rutten. “Coordination of ECA rules by
verification and control”. In: Coordination Models and Languages: 16th IFIP
WG 6.1 International Conference, COORDINATION 2014, Held as Part of the
9th International Federated Conferences on Distributed Computing Techniques,
DisCoTec 2014, Berlin, Germany, June 3-5, 2014, Proceedings 16. Springer. 2014,
pp. 33–48. cit. on p. 17
[Cer+21] Sophie Cerf, Raphaël Bleuse, Valentin Reis, Swann Perarnau, and Eric Rutten.
“Sustaining performance while reducing energy consumption: a control theory
approach”. In: Euro-Par 2021: Parallel Processing: 27th International Conference
on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021,
Proceedings 27. Springer. 2021, pp. 334–349. cit. on pp. 17, 18, 101
[Cer19] Sophie Cerf. “Control Theory for Computing Systems: Application to big-data
cloud services & location privacy protection”. Theses. UNIVERSITÉ GRENOBLE
ALPES, May 2019. URL: https://hal.science/tel-02272258. cit. on p. 101
[Chr22] [SW] Juliusz Chroboczek, Polipo — a caching web proxy 2022. URL: https://ww
w.irif.fr/~jch/software/polipo/, SWHID: ⟨swh:1:dir:ce183542e1abba5a
a847fed92a700d23cf591f9f;origin=https://github.com/jech/polipo;⟩.
cit. on p. 134
[Cir+06] Walfredo Cirne, Francisco Brasileiro, Nazareno Andrade, et al. “Labs of the
world, unite!!!” In: Journal of Grid Computing 4 (2006), pp. 225–246.
cit. on p. 13
[CL12] Nicolo Cesa-Bianchi and Gábor Lugosi. “Combinatorial bandits”. In: Journal of
Computer and System Sciences 78.5 (2012), pp. 1404–1422. cit. on p. 97
[CL22] Tom Cornebize and Arnaud Legrand. “Simulation-based optimization and sensi-
bility analysis of MPI applications: Variability matters”. In: Journal of Parallel
and Distributed Computing 166 (2022), pp. 111–125. cit. on p. 38
[Cou] Ludovic Courtès. Taming the ‘stat’ storm with a loader cache. URL: https://gui
x.gnu.org/blog/2021/taming-the-stat-storm-with-a-loader-cache/.
cit. on p. 170
[Cou13] Ludovic Courtès. “Functional Package Management with Guix”. en. In: arXiv:1305.4584
[cs] (May 2013). URL: http://arxiv.org/abs/1305.4584 (visited on June 13,
2020). cit. on p. 113
[CPW15] Christian Collberg, Todd Proebsting, and Alex M Warren. “Repeatability and
Benefaction in Computer Systems Research - A Study and a Modest Proposal”.
en. In: (2015), p. 68. cit. on p. 107
[CR23] Sophie Cerf and Eric Rutten. “Combining neural networks and control: poten-
tialities, patterns and perspectives”. In: 22nd World Congress of the International
Federation of Automatic Control. 2023. cit. on p. 17
[Dar57] Donald A Darling. “The kolmogorov-smirnov, cramer-von mises tests”. In: The
Annals of Mathematical Statistics 28.4 (1957), pp. 823–838. cit. on p. 35
[DAS] DAS. DAS2. URL: https : / / www . cs . vu . nl / das5 / home . shtml (visited on
Apr. 25, 2023). cit. on p. 29
[DB21] R.C. Dorf and R.H. Bishop. Modern Control Systems. Pearson, 14th edition, 2021.
cit. on p. 78
[Dép+18] Aline Déprez, Anne Socquet, Nathalie Cotte, and Andrea Walpersdorf. “Toward
the generation of EPOS-GNSS products”. In: 19th General Assembly of WEGENER:
on Earth deformation & the study of earthquakes using geodesy and geodynamics.
2018. cit. on p. 23
[DJV04] Eelco Dolstra, Merijn de Jonge, and Eelco Visser. “Nix: A Safe and Policy-Free
System for Software Deployment”. en. In: (2004), p. 14. cit. on pp. 113, 119
[DL08] Eelco Dolstra and Andres Löh. “NixOS: A Purely Functional Linux Distribution”.
In: SIGPLAN Not. 43.9 (Sept. 2008), pp. 367–378. DOI: 10.1145/1411203.141
1255. URL: https://doi.org/10.1145/1411203.1411255. cit. on p. 119
[Doc22] [SW] Docker, Docker Compose 2022. URL: https://docs.docker.com/compos
e/, SWHID: ⟨swh:1:dir:09c1c47b2ec22e82d5218e3c1d7193b397a3224d;ori
gin=https://github.com/docker/compose;⟩. cit. on p. 124
[Dor+14] Matthieu Dorier, Gabriel Antoniu, Rob Ross, Dries Kimpe, and Shadi Ibrahim.
“CALCioM: Mitigating I/O interference in HPC systems through cross-application
coordination”. In: 2014 IEEE 28th international parallel and distributed processing
symposium. IEEE. 2014, pp. 155–164. cit. on p. 3
[Dup+19] Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, et al. “The Design and Op-
eration of CloudLab”. In: Proceedings of the USENIX Annual Technical Conference
(ATC). July 2019, pp. 1–14. URL: https://www.flux.utah.edu/paper/duply
akin-atc19. cit. on p. 139
[Dut+16] Pierre-François Dutot, Michael Mercier, Millian Poquet, and Olivier Richard.
“Batsim: a Realistic Language-Independent Resources and Jobs Management
Systems Simulator”. In: 20th Workshop on Job Scheduling Strategies for Parallel
Processing. Chicago, United States, May 2016. URL: https://hal.archives-ou
vertes.fr/hal-01333471. cit. on pp. 115, 141, 157, 158
[Eme+14] Joseph Emeras, Cristian Ruiz, Jean-Marc Vincent, and Olivier Richard. “Analysis
of the jobs resource utilization on a production system”. In: Job Scheduling
Strategies for Parallel Processing: 17th International Workshop, JSSPP 2013,
Boston, MA, USA, May 24, 2013 Revised Selected Papers 17. Springer. 2014,
pp. 1–21. cit. on p. 99
[FJ13] Michel Fliess and Cédric Join. “Model-free control”. In: International Journal of
Control 86.12 (2013), pp. 2228–2252. cit. on pp. 68, 69, 76
[For+21] Alessandra Forti, Ivan Glushkov, Lukas Heinrich, et al. “The fight against COVID-
19: Running Folding@ Home simulations on ATLAS resources”. In: EPJ Web of
Conferences. Vol. 251. EDP Sciences. 2021, p. 02003. cit. on p. 11
[FPG] Adrien Faure, Millian Poquet, and Quentin Guilloteau. Nix tutorial. URL: https:
//nix-tutorial.gitlabpages.inria.fr/nix-tutorial/index.html.
cit. on p. A1
[Fri+13] Wolfgang Frings, Dong H Ahn, Matthew LeGendre, et al. “Massively parallel
loading”. In: Proceedings of the 27th international ACM conference on Interna-
tional conference on supercomputing. 2013, pp. 389–398. cit. on p. 170
[FZ87] Domenico Ferrari and Songnian Zhou. An empirical investigation of load indices
for load balancing applications. Computer Science Division, University of Califor-
nia, 1987. cit. on pp. 39, 46
[Gam+15] Todd Gamblin, Matthew LeGendre, Michael R. Collette, et al. “The Spack pack-
age manager: bringing order to HPC software chaos”. en. In: Proceedings of the
International Conference for High Performance Computing, Networking, Storage
and Analysis. Austin Texas: ACM, Nov. 2015, pp. 1–12. DOI: 10.1145/28075
91.2807623. URL: https://dl.acm.org/doi/10.1145/2807591.2807623
(visited on Nov. 22, 2021). cit. on p. 110
[Geo+06] Yiannis Georgiou, Julien Leduc, Brice Videau, Johann Peyrard, and Olivier
Richard. “A tool for environment deployment in clusters and light grids”. In:
Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
IEEE. 2006, 8–pp. cit. on p. 124
[GKN18] Cristian Galleguillos, Zeynep Kiziltan, and Alessio Netti. “Accasim: an HPC
simulator for workload management”. In: High Performance Computing: 4th
Latin American Conference, CARLA 2017, Buenos Aires, Argentina, and Colonia del
Sacramento, Uruguay, September 20-22, 2017, Revised Selected Papers 4. Springer.
2018, pp. 169–184. cit. on p. 157
[GRC07] Yiannis Georgiou, Olivier Richard, and Nicolas Capit. “Evaluations of the
lightweight grid cigri upon the grid5000 platform”. In: Third IEEE Interna-
tional Conference on e-Science and Grid Computing (e-Science 2007). IEEE. 2007,
pp. 279–286. cit. on pp. 14, 23
[GRR22a] Quentin Guilloteau, Olivier Richard, and Éric Rutten. “Étude des applications
Bag-of-Tasks du méso-centre Gricad”. In: COMPAS 2022-Conférence d’informatique
en Parallélisme, Architecture et Système. 2022. cit. on pp. 6, 29
[GRR22b] Quentin Guilloteau, Olivier Richard, and Eric Rutten. Étude des applications
Bag-of-Tasks du méso-centre Gricad. Zenodo, July 2022. DOI: 10.5281/zenodo.8
410346. URL: https://doi.org/10.5281/zenodo.8410346. cit. on p. 29
[Gui+a] Quentin Guilloteau, Jonathan Bleuzen, Millian Poquet, and Olivier Richard.
Initiation to NixOS Compose. URL: https://nixos-compose.gitlabpages.inr
ia.fr/tuto-nxc/. cit. on pp. 7, 139, A1
[Gui+b] Quentin Guilloteau, Sophie Cerf, Eric Rutten, Raphaël Bleuse, and Bogdan Robu.
Introduction to Control Theory for Computer Scientists. URL: https://control-
for-computing.gitlabpages.inria.fr/tutorial/intro.html.
cit. on pp. 7, 39, 173
[Gui+21a] Quentin Guilloteau, Olivier Richard, Bogdan Robu, and Eric Rutten. “Controlling
the Injection of Best-Effort Tasks to Harvest Idle Computing Grid Resources”.
In: ICSTCC 2021 - 25th International Conference on System Theory, Control and
Computing. Ias, i, Romania, Oct. 2021, pp. 1–6. DOI: 10.1109/ICSTCC52150.202
1.9607292. URL: https://hal.inria.fr/hal-03363709.
cit. on pp. 6, 42, 56, 58, 60
[Gui+21b] Quentin Guilloteau, Olivier Richard, Eric Rutten, and Bogdan Robu. “Collecte
de ressources libres dans une grille en préservant le système de fichiers : une
approche autonomique”. In: COMPAS 2021 - Conférence d’informatique en Parallé
lisme, Architecture et Système. Lyon, France, July 2021, pp. 1–11. URL: https:
//hal.inria.fr/hal-03282727. cit. on p. 6
[Gui+22a] [SW] Quentin Guilloteau, Jonathan Bleuzen, Millian Poquet, and Olivier Richard,
version 1.0, 2022. URL: https://gitlab.inria.fr/nixos-compose/nixos-
compose, SWHID: ⟨swh:1:dir:6308ca57ea23fbdcd4ea84006149583a2db8f88
1;origin=https://gitlab.inria.fr/nixos-compose/nixos-compose⟩.
cit. on p. 138
[Gui+22b] Quentin Guilloteau, Jonathan Bleuzen, Millian Poquet, and Olivier Richard.
“Painless Transposition of Reproducible Distributed Environments with NixOS
Compose”. In: CLUSTER 2022 - IEEE International Conference on Cluster Comput-
ing. Vol. CLUSTER 2022 - IEEE International Conference on Cluster Computing.
Heidelberg, Germany, Sept. 2022, pp. 1–12. URL: https://hal.science/hal-
03723771. cit. on pp. 6, 87, 114, 119, 144, 165
[Gui+22c] Quentin Guilloteau, Jonathan Bleuzen, Millian Poquet, and Olivier Richard.
“Transposition d’environnements distribués reproductibles avec NixOS Com-
pose”. In: COMPAS 2022 (July 2022), pp. 1–9. URL: https://hal.science/ha
l-03696485. cit. on p. 6
[Gui+22d] Quentin Guilloteau, Bogdan Robu, Cédric Join, et al. “Model-free control for
resource harvesting in computing grids”. In: Conference on Control Technology
and Applications, CCTA 2022. Trieste, Italy: IEEE, Aug. 2022. URL: https://hal
.archives-ouvertes.fr/hal-03663273. cit. on pp. 6, 66, 69, 70
[Gui+23a] Quentin Guilloteau, Adrien Faure, Millian Poquet, and Olivier Richard. “Com-
ment rater la reproductibilité de ses expériences ?” In: Conférence francophone
en informatique (ComPAS 2023). Annecy, France, July 2023, à paraître. URL:
https://hal.science/hal-04132438. cit. on pp. 6, 106
[Gui+23b] [SW] Quentin Guilloteau, Olivier Richard, Raphaël Bleuse, and Eric Rutten,
2023. URL: https://gitlab.inria.fr/nixos-compose/hpc-io/articles/f
olding, SWHID: ⟨swh:1:dir:7b37ae5308065c18081acae7fac97d5028492948
;origin=https://gitlab.inria.fr/nixos-compose/hpc-io/articles/fo
lding⟩. cit. on p. 144
[Gui+23c] Quentin Guilloteau, Olivier Richard, Raphaël Bleuse, and Eric Rutten. Data for
the paper: "Folding a Cluster containing a Distributed File-System". Zenodo, Oct.
2023. DOI: 10.5281/zenodo.10005463. URL: https://doi.org/10.5281/zen
odo.10005463. cit. on p. 144
[Gui+23d] Quentin Guilloteau, Olivier Richard, Raphaël Bleuse, and Eric Rutten. “Folding a
Cluster containing a Distributed File-System”. working paper or preprint. 2023.
URL : https://hal.science/hal-04038000. cit. on pp. 6, 70, 141
[Gui23c] Quentin Guilloteau. Data for the paper: "Simulating a Multi-Layered Grid Middle-
ware". Zenodo, Oct. 2023. DOI: 10.5281/zenodo.10005440. URL: https://doi
.org/10.5281/zenodo.10005440. cit. on p. 165
[Han09] Jingqing Han. “From PID to Active Disturbance Rejection Control”. In: IEEE
Trans. Ind. Electron. 56.3 (Mar. 2009), pp. 900–906. DOI: 10.1109/TIE.2008.2
011621. cit. on p. 68
[Hei+17] Franz Christian Heinrich, Tom Cornebize, Augustin Degomme, et al. “Predicting
the energy-consumption of mpi applications at scale using only a single node”.
In: 2017 IEEE international conference on cluster computing (CLUSTER). IEEE.
2017, pp. 92–102. cit. on pp. 2, 92
[Hel+04] Joseph L Hellerstein, Yixin Diao, Sujay Parekh, and Dawn M Tilbury. Feedback
control of computing systems. John Wiley & Sons, 2004. cit. on pp. 46, 49
[Her] Software Heritage. Software Heritage. Accessed: 2023-03-30. URL: https://ww
w.softwareheritage.org/. cit. on p. 109
[Hin+11] Benjamin Hindman, Andy Konwinski, Matei Zaharia, et al. “Mesos: A platform
for {Fine-Grained} resource sharing in the data center”. In: 8th USENIX Sympo-
sium on Networked Systems Design and Implementation (NSDI 11). 2011.
cit. on p. 13
[IE10] Alexandru Iosup and Dick Epema. “Grid computing workloads”. In: IEEE Internet
Computing 15.2 (2010), pp. 19–26. cit. on p. 36
[Ios+08] Alexandru Iosup, Hui Li, Mathieu Jan, et al. “The grid workloads archive”. In:
Future Generation Computer Systems 24.7 (2008), pp. 672–686. cit. on p. 36
[IS96] Petros A Ioannou and Jing Sun. Robust adaptive control. Vol. 1. PTR Prentice-Hall
Upper Saddle River, NJ, 1996. cit. on p. 78
[IT18] Peter Ivie and Douglas Thain. “Reproducibility in scientific computing”. In: ACM
Computing Surveys (CSUR) 51.3 (2018), pp. 1–36. cit. on p. 108
[Jav+09] Bahman Javadi, Derrick Kondo, Jean-Marc Vincent, and David P Anderson.
“Mining for statistical models of availability in large-scale distributed systems:
An empirical study of seti@ home”. In: 2009 IEEE International Symposium on
Modeling, Analysis & Simulation of Computer and Telecommunication Systems.
IEEE. 2009, pp. 1–10. cit. on p. 35
[JPV23] Emmanuel Jeannot, Guillaume Pallez, and Nicolas Vidal. “IO-aware Job-Scheduling:
Exploiting the Impacts of Workload Characterizations to select the Mapping
Strategy”. In: The International Journal of High Performance Computing Applica-
tions (2023), p. 10943420231175854. cit. on p. 25
[KC03] Jeffrey O Kephart and David M Chess. “The vision of autonomic computing”. In:
Computer 36.1 (2003), pp. 41–50. cit. on pp. 14, 15, 25
[Kea+20] Kate Keahey, Jason Anderson, Zhuo Zhen, et al. “Lessons Learned from the
Chameleon Testbed”. In: Proceedings of the 2020 USENIX Annual Technical
Conference (USENIX ATC ’20). USENIX Association, July 2020.
cit. on pp. 110, 139
[KSS20] Dalibor Klusáček, Mehmet Soysal, and Frédéric Suter. “Alea–complex job scheduling simulator”. In: Parallel Processing and Applied Mathematics: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part II 13. Springer. 2020, pp. 217–229. cit. on p. 157
[LAC06] Yun Li, Kiam Heong Ang, and G.C.Y. Chong. “Patents, software, and hardware
for PID control: an overview and analysis of the current art”. In: IEEE Control
Systems Magazine 26.1 (2006), pp. 42–54. DOI: 10.1109/MCS.2006.1580153.
cit. on p. 78
[Lan+11] Ioan Doré Landau, Rogelio Lozano, Mohammed M’Saad, and Alireza Karimi.
Adaptive Control. Algorithms, Analysis and Applications. 2nd ed. Communications
and Control Engineering. Springer, 2011. DOI: 10.1007/978-0-85729-664-1.
cit. on p. 78
[Lar+09] Stefan M Larson, Christopher D Snow, Michael Shirts, and Vijay S Pande.
“Folding@ Home and Genome@ Home: Using distributed computing to tackle
previously intractable problems in computational biology”. In: arXiv preprint
arXiv:0901.0866 (2009). cit. on p. 11
[Leb+15] Adrien Lebre, Arnaud Legrand, Frédéric Suter, and Pierre Veyre. “Adding storage
simulation capacities to the simgrid toolkit: Concepts, models, and api”. In: 2015
15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
IEEE. 2015, pp. 251–260. cit. on p. 169
[Leg+] Arnaud Legrand, Konrad Hinsen, Christophe Pouzat, et al. Mooc Reproducible
research 2. URL: https://learninglab.gitlabpages.inria.fr/mooc-rr/mo
oc-rr2-ressources/. cit. on p. A1
[Len21] [SW] Dan Lenski, 2021. URL: https://github.com/dlenski/top500, SWHID:
⟨swh:1:dir:535da402e4285b7b26cdd294db27fee5abbf39ad;origin=https:
//github.com/dlenski/top500⟩. cit. on p. 2
[LHP] Arnaud Legrand, Konrad Hinsen, and Christophe Pouzat. Mooc Reproducible
research. URL: https://lms.fun-mooc.fr/courses/course-v1:inria+41023
+session01/info. cit. on p. A1
[Lit+17] Marin Litoiu, Mary Shaw, Gabriel Tamura, et al. “What can control theory teach
us about assurances in self-adaptive software systems?” In: Software Engineering
for Self-Adaptive Systems III. Assurances: International Seminar, Dagstuhl Castle,
Germany, December 15-19, 2013, Revised Selected and Invited Papers. Springer.
2017, pp. 90–134. cit. on p. 17
[Liu+11] Yonggang Liu, Renato Figueiredo, Dulcardo Clavijo, Yiqi Xu, and Ming Zhao.
“Towards simulation of parallel file system scheduling algorithms with PFSsim”.
In: Proceedings of the 7th IEEE International Workshop on Storage Network
Architectures and Parallel I/O (May 2011). 2011. cit. on p. 169
[Liu+20] Gang Liu, Zheng Xiao, GuangHua Tan, Kenli Li, and Anthony Theodore Chronopou-
los. “Game theory-based optimization of distributed idle computing resources
in cloud environments”. In: Theoretical Computer Science 806 (2020), pp. 468–
488. cit. on p. 12
[LLM87] Michel J Litzkow, Miron Livny, and Matt W Mutka. Condor-a hunter of idle work-
stations. Tech. rep. University of Wisconsin-Madison Department of Computer
Sciences, 1987. cit. on p. 12
[MC19] Tessema M Mengistu and Dunren Che. “Survey and taxonomy of volunteer
computing”. In: ACM Computing Surveys (CSUR) 52.3 (2019), pp. 1–35.
cit. on p. 11
[Mer+17] Michael Mercier, David Glesser, Yiannis Georgiou, and Olivier Richard. “Big
data and HPC collocation: Using HPC idle resources for Big Data analytics”.
In: 2017 IEEE International Conference on Big Data (Big Data). IEEE. 2017,
pp. 347–352. cit. on pp. 13, 81, 100
[Mer19] Michael Mercier. “Contribution to High Performance Computing and Big Data
Infrastructure Convergence”. en. PhD thesis. Université Grenoble Alpes, 2019.
cit. on p. 161
[MFR18] Michael Mercier, Adrien Faure, and Olivier Richard. “Considering the Devel-
opment Workflow to Achieve Reproducibility with Variation”. In: SC 2018-
Workshop: ResCuE-HPC. 2018, pp. 1–5. cit. on p. 109
[Mid+88] Richard H Middleton, Graham C Goodwin, David J Hill, and David Q Mayne.
“Design issues in adaptive control”. In: IEEE transactions on automatic control
33.1 (1988), pp. 50–58. cit. on p. 78
[Mon+22] Julien Monniot, François Tessier, Matthieu Robert, and Gabriel Antoniu. “StorAl-
loc: A Simulator for Job Scheduling on Heterogeneous Storage Resources”. In:
European Conference on Parallel Processing. Springer. 2022, pp. 211–222.
cit. on p. 169
[Myt+09] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F Sweeney.
“Producing wrong data without doing anything obviously wrong!” In: ACM
Sigplan Notices 44.3 (2009), pp. 265–276. cit. on p. 109
[Net+19] Alessio Netti, Micha Müller, Axel Auweter, et al. “From facility to application
sensor data: modular, continuous and holistic monitoring with DCDB”. In:
Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis. 2019, pp. 1–27. cit. on p. 53
[Nix23] [SW] NixOS, nixpkgs 2023. URL: https://github.com/nixos/nixpkgs, SWHID:
⟨swh:1:dir:3ae910dfaee09a77c726f69f7c6519488e2bd9c3;origin=https://github.com/NixOS/nixpkgs⟩.
cit. on p. 114
[oar23c] [SW] oar-team, 2023. URL: https://github.com/oar-team/colmet, SWHID:
⟨swh:1:dir:eef757f42afc1c6032921ea4f56e4999aa2fd63d;origin=https://github.com/oar-team/colmet⟩.
cit. on p. 53
[Ope13] LLC OpenStack. “OpenStack”. In: Apache Licence 2 (2013), p. 86. cit. on p. 139
[Pag23] Rosa Pagano. Controlling unused resources for digital sobriety. forthcoming. Nov.
2023. cit. on p. 66
[Paw+00] Brian Pawlowski, David Noveck, David Robinson, and Robert Thurlow. “The
NFS version 4 protocol”. In: Proceedings of the 2nd International System
Administration and Networking Conference (SANE 2000). Citeseer. 2000.
cit. on p. 145
[Pon+18] Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. “Temporal Differ-
ence Models: Model-Free Deep RL for Model-Based Control”. In: (2018). arXiv:
1802.09081 [cs.LG]. cit. on p. 68
[Poq17] Millian Poquet. “Simulation approach for resource management”. PhD thesis.
Université Grenoble Alpes, 2017. cit. on p. 101
[PRD20] Barry Porter, Roberto Rodrigues Filho, and Paul Dean. “A survey of methodol-
ogy in self-adaptive systems research”. In: 2020 IEEE International Conference
on Autonomic Computing and Self-Organizing Systems (ACSOS). IEEE. 2020,
pp. 168–177. cit. on pp. 15, 16
[Prz+22] Bartłomiej Przybylski, Maciej Pawlik, Paweł Żuk, et al. “Using unused: non-
invasive dynamic FaaS infrastructure with HPC-whisk”. In: SC22: International
Conference for High Performance Computing, Networking, Storage and Analysis.
IEEE. 2022, pp. 1–15. cit. on pp. 13, 81
[RMS17] Eric Rutten, Nicolas Marchand, and Daniel Simon. “Feedback control as MAPE-K
loop in autonomic computing”. In: Software Engineering for Self-Adaptive Systems
III. Assurances: International Seminar, Dagstuhl Castle, Germany, December 15-19,
2013, Revised Selected and Invited Papers. Springer. 2017, pp. 349–373.
cit. on p. 17
[Ros+20] Daniel Rosendo, Pedro Silva, Matthieu Simonin, Alexandru Costan, and Gabriel
Antoniu. “E2Clab: Exploring the Computing Continuum through Repeatable,
Replicable and Reproducible Edge-to-Cloud Experiments”. In: Cluster 2020 -
IEEE International Conference on Cluster Computing. Kobe, Japan, Sept. 2020,
pp. 1–11. DOI: 10.1109/CLUSTER49012.2020.00028. URL: https://hal.science/hal-02916032.
cit. on p. 109
[Rui+15] Cristian Ruiz, Salem Harrache, Michael Mercier, and Olivier Richard. “Recon-
structable Software Appliances with Kameleon”. en. In: ACM SIGOPS Operating
Systems Review 49.1 (Jan. 2015), pp. 80–89. DOI: 10.1145/2723872.2723883.
URL: https://dl.acm.org/doi/10.1145/2723872.2723883 (visited on
June 12, 2020). cit. on pp. 69, 111
[RW18] David Randall and Christopher Welser. The Irreproducibility Crisis of Modern
Science. Causes, Consequences, and the Road to Reform. en. New York: National
Association of Scholars, 2018. URL: https://www.nas.org/reports/the-irreproducibility-crisis-of-modern-science.
cit. on p. 107
[ser23] [SW] serokell, 2023. URL: https://github.com/serokell/deploy-rs, SWHID:
⟨swh:1:dir:d28ae84202e328ef087cff85d9c6748ad9b19080;origin=https://github.com/serokell/deploy-rs⟩.
cit. on p. 119
[SG19] Aadirupa Saha and Aditya Gopalan. “Combinatorial bandits with relative feed-
back”. In: Advances in Neural Information Processing Systems 32 (2019).
cit. on p. 97
[She+17] Stepan Shevtsov, Mihaly Berekmeri, Danny Weyns, and Martina Maggio. “Control-
theoretical software adaptation: A systematic literature review”. In: IEEE Trans-
actions on Software Engineering 44.8 (2017), pp. 784–810. cit. on pp. 20, 66
[Shv+10] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. “The
hadoop distributed file system”. In: 2010 IEEE 26th symposium on mass storage
systems and technologies (MSST). IEEE. 2010, pp. 1–10. cit. on p. 13
[SK05] David Skinner and William Kramer. “Understanding the causes of performance
variability in HPC workloads”. In: IEEE International. 2005 Proceedings of the
IEEE Workload Characterization Symposium, 2005. IEEE. 2005, pp. 137–149.
cit. on p. 3
[SMM18] Josef Spillner, Cristian Mateos, and David A Monge. “Faaster, better, cheaper:
The prospect of serverless scientific computing and HPC”. In: High Performance
Computing: 4th Latin American Conference, CARLA 2017, Buenos Aires, Argentina,
and Colonia del Sacramento, Uruguay, September 20-22, 2017, Revised Selected
Papers 4. Springer. 2018, pp. 154–168. cit. on p. 13
[Sou+19] Abel Souza, Mohamad Rezaei, Erwin Laure, and Johan Tordsson. “Hybrid re-
source management for HPC and data intensive workloads”. In: 2019 19th
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CC-
GRID). IEEE. 2019, pp. 399–409. cit. on p. 13
[SRI16] Supreeth Shastri, Amr Rizk, and David Irwin. “Transient guarantees: Maximizing
the value of idle cloud capacity”. In: SC’16: Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis.
IEEE. 2016, pp. 992–1002. cit. on p. 12
[ST17] Tim Shaffer and Douglas Thain. “Taming metadata storms in parallel filesystems
with metaFS”. In: Proceedings of the 2nd Joint International Workshop on Parallel
Data Storage & Data Intensive Scalable Computing Systems. 2017, pp. 25–30.
cit. on p. 170
[Sta+04] David A Stainforth, Myles R Allen, David Frame, et al. “Climateprediction.net:
a global community for research in climate physics”. In: Environmental online
communication. Springer, 2004, pp. 101–112. cit. on p. 11
[Sta+18] Emmanuel Stahl, Agustin Gabriel Yabo, Olivier Richard, et al. “Towards a
control-theory approach for minimizing unused grid resources”. In: Proceedings
of the 1st International Workshop on Autonomous Infrastructure for Science. 2018,
pp. 1–8. cit. on p. 25
[Sto+21] Mathieu Stoffel, François Broquedis, Frédéric Desprez, and Abdelhafid Mazouz.
“Phase-TA: Periodicity Detection and Characterization for HPC Applications”. In:
HPCS 2020-18th IEEE International Conference on High Performance Computing
and Simulation. IEEE. 2021, pp. 1–12. cit. on p. 100
[Sze10] Miklos Szeredi. “FUSE: Filesystem in userspace”. In: http://fuse.sourceforge.net
(2010). cit. on p. 170
[Tar+23] Ahmad Tarraf, Alexis Bandet, Francieli Boito, Guillaume Pallez, and Felix Wolf.
“FTIO: Detecting I/O Periodicity Using Frequency Techniques”. In: arXiv preprint
arXiv:2306.08601 (2023). cit. on p. 100
[Ter+17] Théophile Terraz, Alejandro Ribes, Yvan Fournier, Bertrand Iooss, and Bruno
Raffin. “Melissa: Large Scale In Transit Sensitivity Analysis Avoiding Interme-
diate Files”. In: The International Conference for High Performance Computing,
Networking, Storage and Analysis (Supercomputing). Denver, United States, Nov.
2017, pp. 1–14. URL: https://hal.inria.fr/hal-01607479. cit. on p. 127
[toh22] [SW] tohojo, Overview — Flent: The FLExible Network Tester 2022. URL:
https://flent.org/, SWHID: ⟨swh:1:dir:b31ce4d599147041ab4bb658e521620f3525a059;origin=https://github.com/tohojo/flent⟩.
cit. on p. 136
[TTL05] Douglas Thain, Todd Tannenbaum, and Miron Livny. “Distributed computing in
practice: the Condor experience”. In: Concurrency and computation: practice and
experience 17.2-4 (2005), pp. 323–356. cit. on p. 81
[Wan+21] Yawen Wang, Kapil Arya, Marios Kogias, et al. “Smartharvest: Harvesting idle
CPUs safely and efficiently in the cloud”. In: Proceedings of the Sixteenth European
Conference on Computer Systems. 2021, pp. 1–16. cit. on p. 12
[Wei+06] Sage A Weil, Scott A Brandt, Ethan L Miller, Darrell DE Long, and Carlos
Maltzahn. “Ceph: A scalable, high-performance distributed file system”. In:
Proceedings of the 7th symposium on Operating systems design and implementation.
2006, pp. 307–320. cit. on p. 154
[Yab+19] Agustin Gabriel Yabo, Bogdan Robu, Olivier Richard, Bruno Bzeznik, and Eric
Rutten. “A control-theory approach for cluster autonomic management: max-
imizing usage while avoiding overload”. In: 2019 IEEE Conference on Control
Technology and Applications (CCTA). IEEE. 2019, pp. 189–195. cit. on p. 25
[YJG03] Andy B. Yoo, Morris A. Jette, and Mark Grondona. “SLURM: Simple Linux
Utility for Resource Management”. en. In: Job Scheduling Strategies for Parallel
Processing. Ed. by Gerhard Goos, Juris Hartmanis, Jan van Leeuwen, et al.
Vol. 2862. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 44–60. DOI:
10.1007/10968987_3. URL: http://link.springer.com/10.1007/10968987_3
(visited on Nov. 22, 2021). cit. on pp. 13, 125, 129, 142
[Zak+22] Farid Zakaria, Thomas RW Scogland, Todd Gamblin, and Carlos Maltzahn.
“Mapping out the HPC dependency chaos”. In: SC22: International Conference
for High Performance Computing, Networking, Storage and Analysis. IEEE. 2022,
pp. 1–12. cit. on p. 170
List of Figures
2.5. Evolution of the proportion of Bag-of-Tasks jobs executed by CiGri on
the Gricad computing center over the years, as well as the evolution
of the work (execution time times the number of resources) of CiGri
jobs. We observe that there is often one project which has the majority
of the jobs executed during a quarter. However, this project does not
necessarily perform the most work. For example, for the year 2020,
the biggnss project has the majority of jobs executed, but not the
majority of executed work. . . . . . . . . . . . . . . . . . . . . . . . 34
2.6. Distribution of the execution times for the 10 CiGri projects with the
most jobs. We observe that most projects have a single clear mode in
their distribution. This mode can be very wide, as for f-image and
simsert, or narrow, as for biggnss and pr-mdcp. . . . . . . . . . . . 35
2.9. Relation between the number of jobs in a campaign and the mean
duration of its jobs. We observe that large campaigns have short jobs,
and small campaigns have long-lasting jobs. . . . . . . . . . . . . . . 37
3.3. Processing time (top) and the file-server load (bottom) for differ-
ent submissions in number of jobs and I/O loads. It represents the
identification phase. We vary the quantity of I/O (columns) and the
number of simultaneous write requests/jobs on the x-axis. We observe
that the loadavg sensor captures the (over)load of the file-system. . 45
5.1. Results of the identification of the system. Figure 5.1a depicts the link
between the input and the output of our system. Figure 5.1b shows
that the distribution of execution times is impacted by the commission
and decommission of the nodes by OAR, and thus must be taken into
account in the modelling. . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3. Feedback loop representing the control scheme. The reference value
(yref) is proactively changed to take into account the future availabil-
ity of the resources (dhk). . . . . . . . . . . . . . . . . . . . . . . . . 87
5.5. Control signals for a scenario with pj = 60s and a horizon of 60s.
The top plot represents the number of resources submitted by CiGri
over time. The bottom plot depicts the value of our sensor, as well
as the number of resources available to CiGri in dashed red. . . . . . 91
5.6. Gantt chart for a scenario with pj = 60s and a horizon of 60s. The
killed jobs are depicted with a thicker contour. . . . . . . . . . . . . 91
6.1. Feedback loop gathering the controllers and objectives from Chapter
3 (bottom) and Chapter 5 (top). At every iteration, we run both
controllers and take the minimum number of jobs in the submission
(uk) to satisfy the objectives. In the case of a PI controller, we would
add the error to the integral part only if the chosen submission
size comes from this controller. It would also be possible to add the
feedback loop presented in Figure 3.13. . . . . . . . . . . . . . . . . 98
7.1. Software dependencies of CiGri (Figure 7.1a) and OAR (Figure 7.1b). 106
8.3. Mechanism for the nodes to get the deployment information. For a
few nodes, the information is passed via the kernel parameters. For
a larger number of nodes, this is not possible due to the size limit
on the kernel parameters (4096 bytes). In this case, NixOS Compose
starts a light HTTP server on the frontend and passes its URL to the
nodes via the kernel parameters. The nodes then query this server to
retrieve the deployment information. . . . . . . . . . . . . . . . . . . 128
8.4. Packages present in the Nix Store of the Melissa image for the different
flavours. The colors represent the packages common to the flavours.
The smaller packages are gathered under the others-* name. The
docker flavour is omitted as it mounts the Nix Store of the host machine. 133
9.2. Architectures for a distributed file system (Figure 9.2a), and parallel
file system (Figure 9.2b). . . . . . . . . . . . . . . . . . . . . . . . . . 145
9.4. Evolution of the reading (top row) and writing (bottom row) times
based on the folding factor (x-axis) for experiments with different
cluster sizes (i.e., number of CPU nodes) and different sizes of files to
read and write (point shape). We observe that the writing perfor-
mance is not affected by the folding, but that the reading one is,
and that the degradation grows quadratically with respect to the
folding factor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.5. Linear regression modeling the reading times (y-axis) against the folding
factor (x-axis) and file size (point shape). The top row shows the fitting
of the model on the data, and the bottom row the same data but in
log scale. We can see that the model fits the data correctly for file
sizes greater than 10M. 1M files do not seem to be affected by the
folding, and their variation in performance seems to be due to noise. . 149
9.6. Figure 9.6a shows the maximum folding factor to use to have a
desired overhead on the reading times on NFS based on the file size.
Figure 9.6b shows the distribution of the number of read requests per
size on ANL-Theta. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
List of Tables
2.1. Table summarizing the dataset. texec represents the execution times
of the Bag-of-Tasks jobs from the two computing grids Gricad and DAS2. 30
8.1. Table summarizing the different flavours with their building and
deployment phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Résumé
Les systèmes de calcul haute performance (HPC) sont devenus de plus en plus complexes, et leurs performances
ainsi que leur consommation d’énergie les rendent de moins en moins prévisibles. Cette imprévisibilité nécessite
une gestion en ligne et prudente, afin de garantir une qualité de service acceptable aux utilisateurs. Un tel problème
de régulation se pose dans le contexte de l’intergiciel de grille de calcul CiGri qui vise à récolter les ressources
inutilisées d’un ensemble de grappes via l’injection de tâches faiblement prioritaires. Une stratégie de récolte
trop agressive peut conduire à la dégradation des performances pour tous les utilisateurs des grappes, tandis
qu’une récolte trop timide laissera des ressources inutilisées et donc une perte de puissance de calcul. Il existe
ainsi un compromis entre la quantité de ressources pouvant être récoltées et la dégradation des performances
pour les tâches des utilisateurs qui en résulte. Ce compromis peut évoluer au cours de l’exécution en fonction
des accords de niveau de service et de la charge du système.
Nous affirmons que de tels défis de régulation peuvent être résolus avec des outils issus de l’informatique
autonomique, et en particulier lorsqu’ils sont couplés à la théorie du contrôle. Cette thèse étudie plusieurs
problèmes de régulation dans le contexte de CiGri avec de tels outils. Nous nous concentrerons sur la régulation
de la récolte de ressources libres en fonction de la charge d’un système de fichiers distribué partagé et sur
l’amélioration de l’utilisation globale des ressources de calcul. Nous évaluerons et comparerons également la
réutilisabilité des solutions proposées dans le contexte des systèmes HPC.
Les expériences réalisées dans cette thèse nous ont par ailleurs amené à rechercher de nouveaux outils et
techniques pour améliorer le coût et la reproductibilité des expériences. Nous présenterons un outil nommé
NixOS-compose capable de générer et de déployer des environnements logiciels distribués reproductibles. Nous
étudierons de plus des techniques permettant de réduire le nombre de machines nécessaires pour expérimenter
sur des intergiciels de grappe, tels que CiGri, tout en garantissant un niveau de réalisme acceptable pour le
système final déployé.