SimGrid Tutorial


The SimGrid Framework for Research on Large-Scale Distributed Systems

Martin Quinson (Nancy University, France)
Arnaud Legrand (CNRS, Grenoble University, France)
Henri Casanova (University of Hawai‘i at Manoa, USA)
[email protected]
Large-Scale Distributed Systems Research

Large-scale distributed systems are in production today
- Grid platforms for “e-Science” applications
- Peer-to-peer file sharing
- Distributed volunteer computing
- Distributed gaming

Researchers study a broad range of systems
- Data lookup and caching algorithms
- Application scheduling algorithms
- Resource management and resource sharing strategies

They want to study several aspects of their systems’ performance
- Response time
- Throughput
- Scalability
- Robustness
- Fault-tolerance
- Fairness

Main question: comparing several solutions in relevant settings

Large-Scale Distributed Systems Science?

Requirements for a scientific approach
- Reproducible results: you can read a paper, reproduce a subset of its results, and improve on them
- Standard methodologies and tools
  - Grad students can learn to use them and become operational quickly
  - Experimental scenarios can be compared accurately

Current practice in the field is quite different
- Very few common methodologies and tools
- Experimental settings rarely detailed enough in the literature (test source code?)

Purpose of this tutorial
- Present “emerging” methodologies and tools
- Show how to use some of them in practice
- Discuss open questions and future directions

Agenda

Experiments for Large-Scale Distributed Systems Research
  Methodological Issues
  Main Methodological Approaches
    Real-world experiments
    Simulation
  Tools for Experimentation in Large-Scale Distributed Systems
    Possible designs
    Experimentation platforms: Grid’5000 and PlanetLab
    Emulators: ModelNet and MicroGrid
    Packet-level simulators: ns-2, SSFNet and GTNetS
    Ad-hoc simulators: ChicagoSim, OptorSim, GridSim, ...
    Peer-to-peer simulators
    SimGrid
Resource Models in SimGrid
  Analytic Models Underlying SimGrid
    Modeling a Single Resource
    Multi-hop Networks
    Resource Sharing
  Experimental Validation of the Simulation Models
Platform Instantiation
  Platform Catalog
  Synthetic Topologies
Using SimGrid for Practical Grid Experiments
  Overview of the SimGrid Components
  SimDag: Comparing Scheduling Heuristics for DAGs
  MSG: Comparing Heuristics for Concurrent Sequential Processes
  GRAS: Developing and Debugging Real Applications
Conclusion

Analytical or Experimental?

Analytical works?
- Some purely mathematical models exist
  (+) Allow a better understanding of principles in spite of dubious applicability: impossibility theorems, parameter influence, ...
  (-) Theoretical results are difficult to achieve
- Everyday practical issues (routing, scheduling) become NP-hard problems; most of the time, only heuristics whose performance has to be assessed are proposed
- Models are too simplistic, relying on ultimately unrealistic assumptions

⇒ One must run experiments
→ Most published research in the area is experimental

Running real-world experiments

(+) Eminently believable for demonstrating the applicability of the proposed approach
(-) Very time- and labor-intensive
- The entire application must be functional
- Parameter sweeps and design alternatives are costly
(-) Choosing the right testbed is difficult
- My own little testbed?
  (+) Well-behaved, controlled, stable
  (-) Rarely representative of production platforms
- Real production platforms?
  - Not everyone has access to them; CS experiments are disruptive for users
  - Experimental settings may change drastically during the experiment (components fail; other users load resources; administrators change configurations)
(-) Results remain limited to the testbed
- Impact of testbed specificities is hard to quantify ⇒ collection of testbeds...
- Extrapolations and explorations of “what if” scenarios are difficult (what if the network were different? what if we had a different workload?)
(-) Experiments are uncontrolled and unrepeatable
- No way to test alternatives back-to-back (even if disruption is part of the experiment)

Difficult for others to reproduce results, even though this is the basis for scientific advances!

Simulation

(+) Simulation solves these difficulties
- No need to build a real system, nor the full-fledged application
- Ability to conduct controlled and repeatable experiments
- (Almost) no limits to experimental scenarios
- Possible for anybody to reproduce results

Simulation in a nutshell
- Predict aspects of the behavior of a system using an approximate model of it
- Model: set of objects defined by a state ⊕ rules governing the state evolution
- Simulator: program computing the evolution according to the rules
- Wanted features:
  - Accuracy: correspondence between simulation and the real world
  - Scalability: actually usable by computers (fast enough)
  - Tractability: actually usable by human beings (simple enough to understand)
  - Instantiability: can actually describe real settings (no magic parameters)
  - Relevance: captures the object of interest

Simulation in Computer Science

Microprocessor design
- A few standard “cycle-accurate” simulators are used extensively
  http://www.cs.wisc.edu/~arch/www/tools.html
⇒ Possible to reproduce simulation results

Networking
- A few established “packet-level” simulators: NS-2, DaSSF, OMNeT++, GTNetS
- Well-known datasets for network topologies
- Well-known generators of synthetic topologies
- SSF standard: http://www.ssfnet.org/
⇒ Possible to reproduce simulation results

Large-Scale Distributed Systems?
- No established simulator up until a few years ago
- Most people build their own “ad-hoc” solutions

Simulation in Parallel and Distributed Computing
- Used for decades, but under drastic assumptions in most cases

Simplistic platform model
- Fixed computation and communication rates (Flop/s, Mb/s)
- Topology either fully connected or a bus (no interferences, or simple ones)
- Communication and computation are perfectly overlappable

Simplistic application model
- All computations are CPU-intensive (no disk, no memory, no user)
- Clear-cut communication and computation phases
- Computation times even ignored in the Distributed Computing community
- Communication times sometimes ignored in the HPC community

Straightforward simulation in most cases
- Fill in a Gantt chart or count messages with a computer rather than by hand
- No need for a “simulation standard”

Large-Scale Distributed Systems Simulations?

Simple models are justifiable at small scale
- Cluster computing (e.g., a matrix-multiply application on a switched, dedicated cluster)
- Small-scale distributed systems

Hardly justifiable for Large-Scale Distributed Systems
- Heterogeneity of components (hosts, links)
  - Quantitative: CPU clock, link bandwidth and latency
  - Qualitative: Ethernet vs. Myrinet vs. Quadrics; Pentium vs. Cell vs. GPU
- Dynamicity
  - Quantitative: resource sharing → availability variation
  - Qualitative: resources come and go (churn)
- Complexity
  - Hierarchical systems: grids of clusters of multi-processors made of multi-cores
  - Resource sharing: network contention, QoS, batches
  - Multi-hop networks, non-negligible latencies
  - Middleware overhead (or optimizations)
  - Interference of computation and communication (and disk, memory, etc.)

Models of Large-Scale Distributed Systems

Model = set of objects defined by a state ⊕ set of rules governing the state evolution

Model objects:
- Evaluated application: does actions, i.e., stimuli to the platform
- Resources (network, CPU, disk): constitute the platform and react to stimuli
  - The application blocks until its actions are done
  - Resources can sometimes “do actions” to represent external load

Expressing interaction rules (from more abstract to less abstract):
- Mathematical simulation: based solely on equations
- Discrete-event simulation: system = set of dependent actions & events
- Emulation: trapping and virtualization of low-level application/system actions
- Real execution: no modification

Boundaries are blurred
- Tools can combine several paradigms for different resources
- Emulators may use a simulator to compute resource availabilities

Simulation options to express rules

Network
- Macroscopic: flows in “pipes” (mathematical & coarse-grain discrete-event simulation); data sizes are “liquid amounts”, links are “pipes”
- Microscopic: packet-level simulation (fine-grain discrete-event simulation)
- Emulation: actual flows through “some” network (timing + time expansion)

CPU
- Macroscopic: flows of operations in the CPU pipelines
- Microscopic: cycle-accurate simulation (fine-grain discrete-event simulation)
- Emulation: virtualization via another CPU / virtual machine

Applications
- Macroscopic: application = analytical “flow”
- Less macroscopic: set of abstract tasks with resource needs and dependencies
  - Coarse-grain discrete-event simulation
  - Application specification or pseudo-code API
- Virtualization: emulation of actual code, trapping application-generated events

Large-Scale Distributed Systems Simulation Tools

A lot of tools exist
- Grid’5000, PlanetLab, MicroGrid, ModelNet, Emulab, DummyNet
- ns-2, GTNetS, SSFNet
- ChicagoSim, GridSim, OptorSim, SimGrid, ...
- PeerSim, P2PSim, ...

How do they compare?
- How do they work?
  - Components taken into account (CPU, network, application)
  - Options used for each component (direct execution, emulation, discrete-event simulation)
- What are their relative qualities?
  - Accuracy (correspondence between simulation and the real world)
  - Technical requirements (programming language, specific hardware)
  - Scale (tractable size of systems at reasonable speed)
  - Experimental settings configurable and repeatable, or not

Grid’5000 (consortium – INRIA)

French experimental platform
- 1500 nodes (3000 CPUs, 4000 cores) over 9 sites
- Nation-wide dedicated 10 Gb/s interconnection
- http://www.grid5000.org

Scientific tool for computer scientists
- Nodes are deployable: install your own OS image
- Allows studies at any level of the stack:
  - Network (TCP improvements)
  - Middleware (scalability, scheduling, fault-tolerance)
  - Programming (components, code coupling, GridRPC)
  - Applications
(+) Applications not modified, direct execution
(+) Environment controlled, experiments repeatable
(-) Relative scalability (“only” 1500–4000 nodes)

PlanetLab (consortium)

Open platform for developing, deploying, and accessing planetary-scale services
- Planetary scale: 852 nodes, 434 sites, >20 countries
- Distributed virtualization: each user can get a slice of the platform
- Unbundled management
  - Local behavior defined per node; network-wide behavior: services
  - Multiple competing services in parallel (shared, unprivileged interfaces)

As unstable as the real world
(+) Demonstrates the feasibility of P2P applications or middleware
(-) No reproducibility!

ModelNet (UCSD/Duke)

Applications
- Emulation and virtualization: actual code executed on “virtualized” resources
- Key tradeoff: scalability versus accuracy

Resources: system calls intercepted (gethostname, sockets)

CPU: direct execution on the CPU
(-) Slowdown not taken into account!

Network: emulation through:
- one emulator (running on FreeBSD)
- a gigabit LAN
- hosts + IP aliasing for virtual nodes
→ emulation of heterogeneous links
- Similar ideas used in other projects (Emulab, DummyNet, Panda, ...)

Amin Vahdat et al., Scalability and Accuracy in a Large-Scale Network Emulator, OSDI’02.

MicroGrid (UCSD)

Applications
- Application supported by emulation and virtualization
- Actual application code is executed on “virtualized” resources
- Accounts for CPU and network

Resources: wraps syscalls & grid tools (gethostname, sockets, GIS, MDS, NWS)

CPU: direct execution on a fraction of the CPU
- Finds the right mapping

Network: packet-level simulation
- Parallel version of MaSSF

Time: synchronizes real and virtual time
- Finds the right execution rate

[Figure: the application runs on virtual resources, which MicroGrid maps onto the physical resources]

Andrew Chien et al., The MicroGrid: a Scientific Tool for Modeling Computational Grids, SuperComputing 2002.

Packet-level simulators

ns-2: the most popular one
- Several protocols (TCP, UDP, ...), several queuing models (DropTail, RED, ...)
- Several application models (HTTP, FTP), wired and wireless networks
- Written in C++, configured using Tcl. Limited scalability (<1,000 nodes)

SSFNet: implementation of the SSF standard
- Scalable Simulation Framework: unified API for discrete-event simulation of distributed systems
- Written in Java, usable on 100,000 nodes

GTNetS: Georgia Tech Network Simulator
- Design close to the philosophy of real network protocols (stacked layers)
- C++, reported usable with 177,000 nodes

Simulation tools for the networking community
- Topic: study network behavior, routing protocols, QoS, ...
- Goal: improve network protocols
→ Microscopic simulation of packet movements
⇒ Inadequate for us (long simulation times, CPU not taken into account)

ChicagoSim, OptorSim, GridSim, ...

- Network simulators are not adapted, and emulation solutions are too heavy
- PhD students just need a simulator to plug their algorithms in
  - Data placement/replication
  - Grid economy
⇒ Many simulators. Most are home-made and short-lived; some are released

ChicSim: designed for the study of data replication (Data Grids), built on Parsec
  Ranganathan, Foster. Decoupling Computation and Data Scheduling in Distributed Data-Intensive Applications, HPDC’02.
OptorSim: developed for the European DataGrid
  DataGrid, CERN. OptorSim: Simulating data access optimization algorithms.
GridSim: focused on Grid economy
  Buyya et al. GridSim: A Toolkit for the Modeling and Simulation of Global Grids, CCPE’02.

→ Every [sub-]community seems to have its own simulator

PeerSim, P2PSim, ...

The peer-to-peer community also has its own private collection of simulators, focused on P2P protocols; the main challenge is scale.

P2PSim: multi-threaded discrete-event simulator. Constant communication time. Alpha release (April 2005).
  http://pdos.csail.mit.edu/p2psim/
PlanetSim: multi-threaded discrete-event simulator. Constant communication time. Last release (2006).
  http://planet.urv.es/trac/planetsim/wiki/PlanetSim
PeerSim: designed for epidemic protocols; processes = state machines. Two simulation modes: cycle-based (time is discrete) or event-based. Resources are not modeled. 1.0.3 release (December 2007).
  http://peersim.sourceforge.net/
OverSim: a recent one, based on OMNeT++ (April 2008).
  http://www.oversim.org/

SimGrid (Hawai‘i, Grenoble, Nancy)

History
- Created just like the other home-made simulators (only a bit earlier ;)
- Original goal: scheduling research → need for speed (parameter sweeps)
- HPC community concerned by performance → accuracy not negligible

SimGrid in a Nutshell
- Simulation ≡ communicating processes performing computations
- Key feature: blend of mathematical simulation and coarse-grain discrete-event simulation
- Resources: defined by a rate (MFlop/s or Mb/s) + a latency
  - Also allows dynamic traces and failures
- Tasks can use multiple resources explicitly or implicitly
  - Transfers over multiple links, computations using disk and CPU
- Simple API to specify a heuristic or application easily

Casanova, Legrand, Quinson. SimGrid: a Generic Framework for Large-Scale Distributed Experiments, EUROSIM’08.

Experimental tools comparison

            CPU          Disk         Network      Application    Requirement    Settings      Scale
Grid’5000   direct       direct       direct       direct         access         fixed         <5,000
PlanetLab   virtualize   virtualize   virtualize   virtualize     none           uncontrolled  hundreds
ModelNet    -            -            emulation    emulation      much hardware  controlled    dozens
MicroGrid   emulation    -            fine d.e.    emulation      none           controlled    hundreds
ns-2        -            -            fine d.e.    coarse d.e.    C++ and Tcl    controlled    <1,000
SSFNet      -            -            fine d.e.    coarse d.e.    Java           controlled    <100,000
GTNetS      -            -            fine d.e.    coarse d.e.    C++            controlled    <177,000
ChicSim     coarse d.e.  -            coarse d.e.  coarse d.e.    C              controlled    few 1,000
OptorSim    coarse d.e.  amount       coarse d.e.  coarse d.e.    Java           controlled    few 1,000
GridSim     coarse d.e.  coarse d.e.  coarse d.e.  coarse d.e.    Java           controlled    few 1,000
P2PSim      -            -            -            state machine  C++            controlled    few 1,000
PlanetSim   -            -            const. time  coarse d.e.    Java           controlled    100,000
PeerSim     -            -            -            state machine  Java           controlled    1,000,000
SimGrid     math/d.e.    (underway)   math/d.e.    d.e./emul.     C or Java      controlled    few 100,000

- Direct execution → no experimental bias (?); experimental settings fixed (between hardware upgrades), but not controllable
- Virtualization allows sandboxing, but no control over the experimental settings
- Emulation can have high overheads (but captures the overhead)
- Discrete-event simulation is slow, but hopefully accurate
- To scale, you have to trade accuracy for speed

So what simulator should I use?

It really depends on your goals and resources
- Grid’5000 experiments are very good... if you have access and plenty of time
- PlanetLab does not enable reproducible experiments
- ModelNet, ns-2, SSFNet and GTNetS are meant for networking experiments (no CPU)
  - ModelNet requires some specific hardware setup
- MicroGrid simulations take a lot of time (although they can be parallelized)
- SimGrid’s models have clear limitations (e.g., for short transfers)
  - SimGrid simulations are quite easy to set up (but a rewrite is needed)
  - SimGrid does not require that a full application be written
- Ad-hoc simulators are easy to set up, but their validity remains to be shown, i.e., the results obtained may be plainly wrong
  - Ad-hoc simulators are obviously not generic (difficult to adapt to your own needs)

Key trade-off seems to be accuracy vs. speed
- The more abstract the simulation, the faster
- The less abstract the simulation, the more accurate
Does this trade-off really hold?

Simulation Validation

Crux of simulation works
- Validation is difficult
- Almost never done convincingly
- (Not specific to CS: other sciences have the same issue)

How to validate a model (and obtain scientific results)?
- Claim that it is plausible (justification = argumentation)
- Show that it is reasonable
  - Some validation graphs in a few special cases, at best
  - Validation against another “validated” simulator
- Argue that trends are respected (absolute values may be off)
  → useful to compare algorithms/designs
- Conduct an extensive verification campaign against real-world settings

Simulation Validation: the FLASH example

FLASH project at Stanford
- Building large-scale shared-memory multiprocessors
- Went from conception, to design, to actual hardware (32 nodes)
- Used simulation heavily over 6 years

The authors compared their simulation(s) to the real world
- Error is unavoidable (in their case, 30% error was not rare)
  - Negating the impact of “we got a 1.5% improvement”
- Complex simulators do not ensure better simulation results
  - Simple simulators worked better than sophisticated ones (which were unstable)
  - Simple simulators predicted trends as well as slower, sophisticated ones
  ⇒ One should focus on simulating the important things
- Calibrating simulators on real-world settings is mandatory

For FLASH, the simple simulator was all that was needed...

Gibson, Kunz, Ofelt, Heinrich. FLASH vs. (Simulated) FLASH: Closing the Simulation Loop. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2000.

Conclusion

Large-Scale Distributed Systems research is experimental
- Analytical models are too limited
- Real-world experiments are hard and limited
⇒ Most of the literature relies on simulation

Simulation for distributed applications is still taking baby steps
- Compared, for example, to the hardware design or networking communities; yet more advanced for HPC Grids than for P2P
- Lots of home-made tools, no standard methodology
- Very few simulation projects even try to:
  - Publish their tools for others to use
  - Validate their tools
  - Support other people’s use: genericity, stability, portability, documentation, ...

Conclusion

Claim: SimGrid may prove helpful to your research
- User community much larger than the contributor group
- Used in several communities (scheduling, GridRPC, HPC infrastructure, P2P)
- Model limits known thanks to validation studies
- Easy to use, extensible, fast to execute
- Around for almost 10 years

Remainder of this talk: present SimGrid in detail
- Under the cover: models used, implementation overview, main limitations, model validity, tool performance and scalability
- Practical usage: how to use it for your research, use cases and success stories

Analytic Models underlying the SimGrid Framework

Main challenges for SimGrid’s design
- Simulation accuracy:
  - Designed for the HPC scheduling community → don’t mess with the makespan!
  - At the very least, understand the validity range
- Simulation speed:
  - Users conduct large parameter-sweep experiments over alternatives

Microscopic simulator design
- Simulate the packet movements and router algorithms
- Simulate the CPU actions (or micro-benchmark classical basic operations)
- Hopefully very accurate, but very slow (simulation time ≫ simulated time)

Going faster while remaining reasonable?
- Need to come up with macroscopic models for each kind of resource
- Main issue: resource sharing. It emerges naturally in the microscopic approach:
  - Packets of different connections are interleaved by routers
  - CPU cycles of different processes get slices of the CPU

Modeling a Single Resource

Basic model: Time = L + size/B (a worked example follows below)
- The resource works at a given rate (B, in MFlop/s or Mb/s)
- Each use has a given latency (L, in s)

Application to processing elements (CPUs/cores)
- Very widely used (latency usually neglected)
- No cache effects or other specific software/hardware interactions
- No better analytical model (reality is too complex and changing)
- Sharing is easy in steady state: fair share for each process

Application to networks
- Turns out to be “inaccurate” for TCP
- B is not constant, but depends on RTT, packet loss ratio, window size, etc.
- Several models were proposed in the literature

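As a quick sanity check of the basic model, here is a tiny worked example in Python with hypothetical numbers (a 100 MB transfer over a 100 Mb/s link with 10 ms latency):

    # Time = L + size/B, with size expressed in the unit of the rate (Mb)
    latency = 0.010           # L, in seconds
    bandwidth = 100.0         # B, in Mb/s
    size = 100 * 8            # 100 MB expressed in Mb
    print(latency + size / bandwidth)   # ~8.01 seconds
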
Modeling TCP performance (single flow, single link)

Padhye, Firoiu, Towsley, Kurose. Modeling TCP Reno Performance: A Simple Model and Its Empirical Validation. IEEE/ACM Transactions on Networking, Vol. 8, Num. 2, 2000.

    B = min( Wmax/RTT , 1 / ( RTT·sqrt(2bp/3) + T0 · min(1, 3·sqrt(3bp/8)) · p · (1 + 32p²) ) )

- Wmax: receiver advertised window
- RTT: round-trip time
- p: loss indication rate
- b: number of packets acknowledged per ACK
- T0: average TCP retransmission timeout value

Model discussion
- Captures TCP congestion control (fast retransmit and timeout mechanisms)
- Assumes steady state (no slow start)
- Accuracy shown to be good over a wide range of values
- p and b are not known in general (model hard to instantiate)

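A minimal Python sketch of this formula (a hypothetical helper, not part of SimGrid; it returns the throughput in packets per second when Wmax is given in packets):

    import math

    def padhye_bandwidth(w_max, rtt, p, b=2, t0=1.0):
        # w_max: receiver advertised window (packets); rtt: round-trip time (s)
        # p: loss indication rate; b: packets acknowledged per ACK
        # t0: average retransmission timeout (s)
        loss_term = (rtt * math.sqrt(2 * b * p / 3)
                     + t0 * min(1, 3 * math.sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2))
        return min(w_max / rtt, 1 / loss_term)
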
SimGrid model for single TCP flow, single link

Definition of the link l
- L_l: physical latency
- B_l: physical bandwidth

Time to transfer size bytes over the link:

    Time = L_l + size / B'_l

Empirical bandwidth: B'_l = min(B_l, Wmax/RTT)
- Justification: the sender emits Wmax bytes, then waits for the ACK (i.e., waits one RTT)
- Upper limit: the first member of the min in the previous model
- RTT assumed to be twice the physical latency
- Router queuing time assumed to be included in this value

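The model is simple enough to state in a few lines of Python (a sketch under the slide’s stated assumptions, not SimGrid’s actual implementation):

    def transfer_time(size, latency, bandwidth, w_max):
        # Time = L + size/B', with B' = min(B, Wmax/RTT) and RTT = 2L
        rtt = 2 * latency
        empirical_bw = min(bandwidth, w_max / rtt)
        return latency + size / empirical_bw
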
Modeling Multi-hop Networks: Store & Forward

[Figure: a source S sends a message over a route made of links l1, l2, l3]

First idea, quite natural
- Pay the price of going through link 1, then go through link 2, etc.
- Analogy: the time to go from one city to another is the sum of the times on each road

Unfortunately, things don’t work this way
- The whole message is not stored on each router
- Data is split into packets over TCP networks (surprise, surprise)
- Transfers on each link occur in parallel

Modeling Multi-hop Networks: Wormhole

[Figure: the same route, with the stream split into MTU-sized packets p_{i,j} flowing across links l1, l2, l3]

Remember networking classes?
- Links packetize the stream according to the MTU (Maximum Transmission Unit)
- Easy to simulate (SimGrid did until 2002; GridSim 4.0 & most ad-hoc tools do)

Unfortunately, things don’t work this way
- IP packet fragmentation algorithms are complex (when MTUs differ)
- TCP contention mechanisms:
  - The sender only emits cwnd packets before waiting for an ACK
  - Timeouts, fast retransmit, etc.
⇒ As slow as packet-level simulators, yet not quite as accurate

Macroscopic TCP modeling is a field

TCP bandwidth sharing has been studied by several authors
- Data streams modeled as fluids in pipes
- Same model for a single stream over multiple links or multiple streams over multiple links

[Figure: linear network with links 1..L; flow 0 crosses all links, while flow i (i = 1..L) uses only link i]

Notations
- L: set of links
- F: set of flows; a flow f is a subset of L (f ∈ P(L))
- C_l: capacity of link l (C_l > 0)
- λ_f: transfer rate of flow f
- n_l: number of flows using link l

Feasibility constraint
- Links deliver at most their capacity: ∀l ∈ L, Σ_{f∋l} λ_f ≤ C_l

Max-Min Fairness

Objective function: maximize min_{f∈F}(λ_f)
- Equilibrium reached when increasing any λ_f would decrease some λ_{f'} with λ_{f'} < λ_f
- Very reasonable goal: gives a fair share to everyone
- Optionally, one can add priorities w_i for each flow i → maximize min_{f∈F}(w_f λ_f)

Bottleneck links
- For each flow f, one of its links, l, is the limiting one (with more capacity on link l, flow f would get more bandwidth overall)
- The objective function gives that l is saturated, and f gets the biggest share on it:

    ∀f ∈ F, ∃l ∈ f, Σ_{f'∋l} λ_{f'} = C_l and λ_f = max{λ_{f'}, f' ∋ l}

L. Massoulié and J. Roberts. Bandwidth sharing: objectives and algorithms. IEEE/ACM Trans. Netw., vol. 10, no. 3, pp. 320-328, 2002.

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 41/130
Implementation of Max-Min Fairness

Bucket-filling algorithm
I Set the bandwidth of all flows to 0
I Increase the bandwidth of every flow by . And again, and again, and again.
I When one link is saturated, all flows using it are limited (; removed from set)
I Loop until all flows have found a limiting link

Efficient Algorithm  
Cl Ck
1. Search for the bottleneck link l so that: = min , k∈L
nl nk
2. ∀f ∈ l, λf = Cnll ;
Update all nl and Cl to remove these flows
3. Loop until all λf are fixed

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 41/130
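A compact Python sketch of this efficient algorithm (hypothetical data structures: links as a capacity dict, flows as sets of links; this is not SimGrid’s actual C implementation):

    def max_min_share(links, flows):
        # links: {link: capacity C_l}; flows: {flow: set of links it crosses}
        # returns {flow: allocated rate lambda_f}
        capacity = dict(links)
        remaining = {f: set(route) for f, route in flows.items()}
        rate = {}
        while remaining:
            # n_l: number of not-yet-fixed flows using each link
            usage = {}
            for route in remaining.values():
                for l in route:
                    usage[l] = usage.get(l, 0) + 1
            # bottleneck link: smallest fair share C_l / n_l
            bottleneck = min(usage, key=lambda l: capacity[l] / usage[l])
            share = capacity[bottleneck] / usage[bottleneck]
            # fix the rate of every flow crossing the bottleneck
            for f in [f for f in remaining if bottleneck in remaining[f]]:
                rate[f] = share
                for l in remaining[f]:
                    capacity[l] -= share
                del remaining[f]
        return rate

The next two slides run this computation by hand; the sketch reproduces both results.
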
Max-Min Fairness on Homogeneous Linear Network

C1 = 0 n1 = 0

111111 000000
000000
000000 111111
C2 = 0 n2 = 0

111111 flow 2
λ0 = C /2
λ1 = C /2
λ2 = C /2

I All links have the same capacity C


I Each of them is limiting. Let’s choose link 1
⇒ λ0 = C /2 and λ1 = C /2
I Remove flows 0 and 1; Update links’ capacity
I Link 2 sets λ2 = C/2

We’re done computing the bandwidth allocated to each flow

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 42/130
Max-Min Fairness on Backbone

[figure: backbone topology; flow 1 crosses links 1, 2, 3; flow 2 crosses links 0, 2, 4]
C0 = 1, n0 = 1    C1 = 1000, n1 = 1    C2 = 1000, n2 = 2    C3 = 1000, n3 = 1    C4 = 1000, n4 = 1
λ1 = ?    λ2 = ?

I The limiting link is link 0 since 1/1 = min(1/1, 1000/1, 1000/2, 1000/1, 1000/1)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 43/130
Max-Min Fairness on Backbone

[figure: same backbone]
C0 = 0, n0 = 0    C1 = 1000, n1 = 1    C2 = 999, n2 = 1    C3 = 1000, n3 = 1    C4 = 999, n4 = 0
λ1 = ?    λ2 = 1

I The limiting link is link 0 since 1/1 = min(1/1, 1000/1, 1000/2, 1000/1, 1000/1)
I This fixes λ2 = 1. Update the links

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 43/130
Max-Min Fairness on Backbone

[figure: same backbone]
C0 = 0, n0 = 0    C1 = 1000, n1 = 1    C2 = 999, n2 = 1    C3 = 1000, n3 = 1    C4 = 999, n4 = 0
λ1 = ?    λ2 = 1

I The limiting link is link 0 since 1/1 = min(1/1, 1000/1, 1000/2, 1000/1, 1000/1)
I This fixes λ2 = 1. Update the links
I The limiting link is link 2

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 43/130
Max-Min Fairness on Backbone

[figure: same backbone]
C0 = 0, n0 = 0    C1 = 1, n1 = 0    C2 = 0, n2 = 0    C3 = 1, n3 = 0    C4 = 999, n4 = 0
λ1 = 999    λ2 = 1

I The limiting link is link 0 since 1/1 = min(1/1, 1000/1, 1000/2, 1000/1, 1000/1)
I This fixes λ2 = 1. Update the links
I The limiting link is link 2
I This fixes λ1 = 999

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 43/130
Max-Min Fairness on Backbone

[figure: same backbone]
C0 = 0, n0 = 0    C1 = 1, n1 = 0    C2 = 0, n2 = 0    C3 = 1, n3 = 0    C4 = 999, n4 = 0
λ1 = 999    λ2 = 1

I The limiting link is link 0 since 1/1 = min(1/1, 1000/1, 1000/2, 1000/1, 1000/1)
I This fixes λ2 = 1. Update the links
I The limiting link is link 2
I This fixes λ1 = 999
I Done. We know λ1 and λ2

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 43/130
Side note: OptorSim 2.1 on Backbone

OptorSim (developed @CERN for DataGrid)


I One of the rare ad-hoc simulators not using wormhole

Unfortunately, “strange” resource sharing:


1. For each link, compute the share that each flow may get: C_l / n_l
2. For each flow, compute what it gets: λ_f = min_{l∈f} (C_l / n_l)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 44/130
Side note: OptorSim 2.1 on Backbone
OptorSim (developed @CERN for DataGrid)
I One of the rare ad-hoc simulators not using wormhole

Unfortunately, “strange” resource sharing:


1. For each link, compute the share that each flow may get: C_l / n_l
2. For each flow, compute what it gets: λ_f = min_{l∈f} (C_l / n_l)
[figure: same backbone as before]
C0 = 1, n0 = 1, share = ?    C1 = 1000, n1 = 1, share = ?    C2 = 1000, n2 = 2, share = ?    C3 = 1000, n3 = 1, share = ?    C4 = 1000, n4 = 1, share = ?
λ1 = ?    λ2 = ?

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 44/130
Side note: OptorSim 2.1 on Backbone
OptorSim (developed @CERN for DataGrid)
I One of the rare ad-hoc simulators not using wormhole

Unfortunately, “strange” resource sharing:


1. For each link, compute the share that each flow may get: C_l / n_l
2. For each flow, compute what it gets: λ_f = min_{l∈f} (C_l / n_l)
[figure: same backbone as before]
C0 = 1, n0 = 1, share = 1    C1 = 1000, n1 = 1, share = 1000    C2 = 1000, n2 = 2, share = 500    C3 = 1000, n3 = 1, share = 1000    C4 = 1000, n4 = 1, share = 1000
λ1 = min(1000, 500, 1000)
λ2 = min( 1 , 500, 1000)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 44/130
Side note: OptorSim 2.1 on Backbone
OptorSim (developed @CERN for DataGrid)
I One of the rare ad-hoc simulators not using wormhole

Unfortunately, “strange” resource sharing:


1. For each link, compute the share that each flow may get: C_l / n_l
2. For each flow, compute what it gets: λ_f = min_{l∈f} (C_l / n_l)
[figure: same backbone as before]
C0 = 1, n0 = 1, share = 1    C1 = 1000, n1 = 1, share = 1000    C2 = 1000, n2 = 2, share = 500    C3 = 1000, n3 = 1, share = 1000    C4 = 1000, n4 = 1, share = 1000
λ1 = min(1000, 500, 1000) = 500!!
λ2 = min( 1 , 500, 1000) = 1
λ1 limited by link 2, but 499 still unused on link 2
This “unwanted feature” is even listed in the README file...
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 44/130
Proportional Fairness

Max-Min validity limits


I MaxMin gives a fair share to everyone
I Reasonable, but TCP does not do so
I Congestion mechanism: Additive Increase, Multiplicative Decrease (AIMD)
I Complicates modeling, as shown in the literature

Proportional Fairness
I MaxMin gives more to long flows (resource-eager); TCP is known to do the opposite
I Objective function: maximize Σ_{f∈F} w_f log(λ_f)   (instead of min_{f∈F} w_f λ_f for MaxMin)
I log favors short flows
Kelly, Charging and rate control for elastic traffic, European Transactions on Telecommunications, vol. 8, 1997, pp. 33-37.

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 45/130
Implementing Proportional Fairness

Karush-Kuhn-Tucker conditions:
I The solution {λ_f}_{f∈F} is unique
I Any other feasible solution {λ′_f}_{f∈F} satisfies: Σ_{f∈F} (λ′_f − λ_f) / λ_f ≤ 0
⇒ Compute the point {λ_f} where the derivative is zero (convex optimization)
→ Use Lagrange multipliers and steepest gradient descent

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 46/130
Implementing Proportional Fairness

Karush-Kuhn-Tucker conditions:
I The solution {λ_f}_{f∈F} is unique
I Any other feasible solution {λ′_f}_{f∈F} satisfies: Σ_{f∈F} (λ′_f − λ_f) / λ_f ≤ 0
⇒ Compute the point {λ_f} where the derivative is zero (convex optimization)
→ Use Lagrange multipliers and steepest gradient descent

Proportional Fairness on Homogeneous Linear Network


[figure: flow 0 crosses links 1 . . . L; flow l uses link l only]
I The math gives: λ_0 = C/(n+1)  and  ∀l ≠ 0, λ_l = C×n/(n+1)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 46/130
Implementing Proportional Fairness

Karush-Kuhn-Tucker conditions:
I The solution {λ_f}_{f∈F} is unique
I Any other feasible solution {λ′_f}_{f∈F} satisfies: Σ_{f∈F} (λ′_f − λ_f) / λ_f ≤ 0
⇒ Compute the point {λ_f} where the derivative is zero (convex optimization)
→ Use Lagrange multipliers and steepest gradient descent

Proportional Fairness on Homogeneous Linear Network


[figure: flow 0 crosses links 1 . . . L; flow l uses link l only]
I The math gives: λ_0 = C/(n+1)  and  ∀l ≠ 0, λ_l = C×n/(n+1)
I I.e., for C=100Mb/s and n=3: λ0 = 25Mb/s, λ1 = λ2 = λ3 = 75Mb/s
I Closer to practitioner expectations
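As a sanity check, this claim can be derived directly (assuming unit weights w_f = 1 and n single-link flows):

  maximize  log λ_0 + Σ_{l=1}^{n} log λ_l    subject to  λ_0 + λ_l ≤ C  (∀l)

Every constraint is tight at the optimum, so λ_l = C − λ_0; setting the derivative 1/λ_0 − n/(C − λ_0) to zero gives λ_0 = C/(n+1), hence λ_l = C×n/(n+1).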

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 46/130
Recent TCP implementation
More protocol refinement, more model complexity
I Every agent changes its window size according to its neighbors’ one
(selfish net-utility maximization)
I Computing a distributed gradient for Lagrange multipliers ; same updates

TCP Vegas converges to a weighted proportional fairness


I Objective function: maximize Σ_{f∈F} L_f × log(λ_f)   (L_f being the latency)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 47/130
Recent TCP implementation
More protocol refinement, more model complexity
I Every agent changes its window size according to its neighbors’ one
(selfish net-utility maximization)
I Computing a distributed gradient for Lagrange multipliers ; same updates

TCP Vegas converges to a weighted proportional fairness


I Objective function: maximize Σ_{f∈F} L_f × log(λ_f)   (L_f being the latency)

TCP Reno is even worse
I Objective function: maximize Σ_{f∈F} arctan(λ_f)

Low, S.H., A Duality Model of TCP and Queue Management Algorithms, IEEE/ACM
Transactions on Networking, 2003.

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 47/130
Recent TCP implementation
More protocol refinement, more model complexity
I Every agent changes its window size according to its neighbors’ one
(selfish net-utility maximization)
I Computing a distributed gradient for Lagrange multipliers ; same updates

TCP Vegas converges to a weighted proportional fairness


I Objective function: maximize Σ_{f∈F} L_f × log(λ_f)   (L_f being the latency)

TCP Reno is even worse
I Objective function: maximize Σ_{f∈F} arctan(λ_f)

Low, S.H., A Duality Model of TCP and Queue Management Algorithms, IEEE/ACM
Transactions on Networking, 2003.

Efficient implementation: possible, but not so trivial


I Computing distributed gradient for Lagrange multipliers: useless in our setting
I Lagrange multipliers computable with efficient optimal-step gradient descent
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 47/130
So, what is the model used in SimGrid?
“--cfg=network_model” command line argument
I CM02 ; MaxMin fairness
I Vegas ; Vegas TCP fairness (Lagrange approach)
I Reno ; Reno TCP fairness (Lagrange approach)
I By default in SimGrid v3.3: CM02
I Example: ./my_simulator --cfg=network_model:Vegas

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 48/130
So, what is the model used in SimGrid?
“--cfg=network_model” command line argument
I CM02 ; MaxMin fairness
I Vegas ; Vegas TCP fairness (Lagrange approach)
I Reno ; Reno TCP fairness (Lagrange approach)
I By default in SimGrid v3.3: CM02
I Example: ./my_simulator --cfg=network_model:Vegas

CPU sharing policy


I Default MaxMin is sufficient for most cases
I cpu_model:ptask_L07 ; model specific to parallel tasks

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 48/130
So, what is the model used in SimGrid?
“--cfg=network_model” command line argument
I CM02 ; MaxMin fairness
I Vegas ; Vegas TCP fairness (Lagrange approach)
I Reno ; Reno TCP fairness (Lagrange approach)
I By default in SimGrid v3.3: CM02
I Example: ./my_simulator --cfg=network_model:Vegas

CPU sharing policy


I Default MaxMin is sufficient for most cases
I cpu_model:ptask_L07 ; model specific to parallel tasks

Want more?
I network_model:gtnets ; use the Georgia Tech Network Simulator for the network
Accuracy of a packet-level network simulator without changing your code (!)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 48/130
So, what is the model used in SimGrid?
“--cfg=network_model” command line argument
I CM02 ; MaxMin fairness
I Vegas ; Vegas TCP fairness (Lagrange approach)
I Reno ; Reno TCP fairness (Lagrange approach)
I By default in SimGrid v3.3: CM02
I Example: ./my_simulator --cfg=network_model:Vegas

CPU sharing policy


I Default MaxMin is sufficient for most cases
I cpu_model:ptask_L07 ; model specific to parallel tasks

Want more?
I network_model:gtnets ; use the Georgia Tech Network Simulator for the network
Accuracy of a packet-level network simulator without changing your code (!)
I Plug your own model in SimGrid!!
(usable as scientific instrument in TCP modeling field, too)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 48/130
How are these models used in practice?
Simulation kernel main loop
Data: set of resources with working rate

[figure: empty Gantt chart of resources over simulated time]
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 49/130
How are these models used in practice?
Simulation kernel main loop
Data: set of resources with working rate
1. Some actions get created (by application) and assigned to resources

[figure: Gantt chart; actions assigned to resources, over simulated time]
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 49/130
How are these models used in practice?
Simulation kernel main loop
Data: set of resources with working rate
1. Some actions get created (by application) and assigned to resources
2. Compute share of everyone (resource sharing algorithms)

[figure: Gantt chart; each action gets its share of the resources, over simulated time]
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 49/130
How are these models used in practice?
Simulation kernel main loop
Data: set of resources with working rate
1. Some actions get created (by application) and assigned to resources
2. Compute share of everyone (resource sharing algorithms)
3. Compute the earliest finishing action, advance simulated time to that time

[figure: Gantt chart; simulated time advanced to the earliest action end t]
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 49/130
How are these models used in practice?
Simulation kernel main loop
Data: set of resources with working rate
1. Some actions get created (by application) and assigned to resources
2. Compute share of everyone (resource sharing algorithms)
3. Compute the earliest finishing action, advance simulated time to that time
4. Remove finished actions

[figure: Gantt chart; finished actions removed at time t]
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 49/130
How are these models used in practice?
Simulation kernel main loop
Data: set of resources with working rate
1. Some actions get created (by application) and assigned to resources
2. Compute share of everyone (resource sharing algorithms)
3. Compute the earliest finishing action, advance simulated time to that time
4. Remove finished actions
5. Loop back to 2
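As a toy illustration of steps 2-5, here is a self-contained C version in which three actions fairly share a single resource of rate 1 (plain equal shares stand in for the real sharing algorithms of the previous slides):

  #include <stdio.h>

  #define NACT 3

  int main(void) {
    double remaining[NACT] = {3.0, 5.0, 9.0}; /* work left in each action */
    int done[NACT] = {0, 0, 0};
    int ndone = 0;
    double now = 0.0;                         /* simulated time */

    while (ndone < NACT) {
      /* 2. compute the share of everyone (equal sharing here) */
      double share = 1.0 / (NACT - ndone);
      /* 3. find the earliest finishing action, advance simulated time */
      double dt = -1;
      for (int i = 0; i < NACT; i++)
        if (!done[i] && (dt < 0 || remaining[i] / share < dt))
          dt = remaining[i] / share;
      now += dt;
      /* 4. update remaining work, remove finished actions */
      for (int i = 0; i < NACT; i++)
        if (!done[i]) {
          remaining[i] -= dt * share;
          if (remaining[i] <= 1e-9) {
            done[i] = 1; ndone++;
            printf("action %d finishes at t=%g\n", i, now);
          }
        }
    } /* 5. loop back to 2 */
    return 0;
  }

It prints completion dates 9, 13 and 17: exactly what fair sharing of a unit-rate resource among 3, then 2, then 1 actions yields.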

[figure: Gantt chart after one loop iteration, over simulated time]
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 49/130
Adding Dynamic Availabilities to the Picture

Trace definition
I List of discrete events where the maximal availability changes
I t0 → 100%, t1 → 50%, t2 → 80%, etc.

Adding traces doesn’t change kernel main loop


I Availability changes: simulation events, just like action ends

[figure: empty Gantt chart over simulated time]

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 50/130
Adding Dynamic Availabilities to the Picture

Trace definition
I List of discrete events where the maximal availability changes
I t0 → 100%, t1 → 50%, t2 → 80%, etc.

Adding traces doesn’t change kernel main loop


I Availability changes: simulation events, just like action ends

[figure: Gantt chart; availability changes appear as simulation events, just like action ends]

SimGrid also accepts state changes (on/off)
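For illustration, such a trace could be stored in a plain text file along these lines (a sketch: each line is an event date followed by the available fraction of nominal power; the PERIODICITY header, which makes the trace loop, is an assumption to check against the SimGrid documentation):

  PERIODICITY 86400
  0.0     1.0
  3600.0  0.5
  7200.0  0.8

The dates for t0, t1, t2 are arbitrary here; such a file is referenced from the platform description through the availability_file attribute shown later.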

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 50/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentations in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Single link
Dumbbell
Random platforms
Simulation speed
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Conclusion
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 51/130
SimGrid Validation
Quantitative comparison of SimGrid with Packet-Level Simulators
I NS2: The Network Simulator
I SSFnet: Scalable Simulation Framework 2.0 (Dartmouth)
I GTNetS: Georgia Tech Network Simulator
Methodological limits
I Packet-level simulators assumed to be accurate (comparison to the real world: future work)
I Max-Min only: other models were not part of SimGrid at that time
Challenges
I Which topology?
I Which parameters to consider? e.g., bandwidth, latency, size, all of them
I How to estimate performance? e.g., throughput, communication time
I How to estimate the simulation response time? e.g., slowdown
I How to compute the error? e.g., from the Perf_PacketLevel / Perf_SimGrid ratios

Velho, Legrand, Accuracy Study and Improvement of Network Simulation in the SimGrid Framework, to appear in Second International Conference on Simulation Tools and Techniques, SimuTools'09, Rome, Italy, March 2009.
(another publication by Velho and Legrand was submitted to SimuTools'09)
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 52/130
SimGrid Validation
Experiments assumptions
I Topology: Single Link; Dumbbell; Random topologies (several)
I Parameters: data size, #flows, #nodes, link bandwidth and latency
I Performance: communication time and bandwidth estimation
I All TCP flows start at the same time
I All TCP flows are stopped when the first flow completes
I Bandwidth estimation is based on the remaining communication amount
I Slowdown: simulation time / simulated time

Notations
I B, link nominal bandwidth ; L, link latency ; S, amount of transmitted data
I Error: ε(T_GTNetS, T_SimGrid) = log(T_GTNetS) − log(T_SimGrid)
I Symmetrical for over- and under-estimations (thanks to the logs)
I Average error: |ε| = (1/n) Σ_i |ε_i|    Max error: |ε_max| = max_i (|ε_i|)
I Computing the gain/loss in percentage: e^|ε| − 1 or e^|ε_max| − 1
  (e.g., |ε| = 0.2 amounts to e^0.2 − 1 ≈ 22%)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 53/130
Validation experiments on a single link (1/3)
Experimental settings
[figure: one TCP flow from source to sink over a single link]
I Flow throughput as a function of L and B
I Fixed size (S=100MB) and window (W=20KB)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 54/130
Validation experiments on a single link (1/3)
Experimental settings
[figure: one TCP flow from source to sink over a single link]
I Flow throughput as a function of L and B
I Fixed size (S=100MB) and window (W=20KB)
Results
Legend
[3D plot: throughput (KB/s) as a function of bandwidth (KB/s) and latency (ms)]
I Mesh: SimGrid results, i.e., S / ( S/min(B, W/(2L)) + L )
I 2: GTNetS results
I #: NS2 results
I ×: SSFNet with TCP_FAST_INTERVAL=default
I +: SSFNet with TCP_FAST_INTERVAL=0.01

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 54/130
Validation experiments on a single link (1/3)
Experimental settings
[figure: one TCP flow from source to sink over a single link]
I Flow throughput as a function of L and B
I Fixed size (S=100MB) and window (W=20KB)
Results
Legend
[3D plot: throughput (KB/s) as a function of bandwidth (KB/s) and latency (ms)]
I Mesh: SimGrid results, i.e., S / ( S/min(B, W/(2L)) + L )
I 2: GTNetS results
I #: NS2 results
I ×: SSFNet with TCP_FAST_INTERVAL=default
I +: SSFNet with TCP_FAST_INTERVAL=0.01
Conclusion
I SimGrid estimations close to packet-level simulators (when S=100MB)
I When B < W/(2L) (B=100KB/s, L=500ms): |ε_max| ≈ |ε| ≈ 1%
I When B > W/(2L) (B=100KB/s, L=10ms): |ε_max| ≈ |ε| ≈ 2%
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 54/130
Validation experiments on a single link (2/3)
Experimental settings
[figure: one TCP flow from source to sink over a single link]
I Compute the achieved bandwidth as a function of S
I Fixed L=10ms and B=100MB/s

Evaluation of the CM02 model


[plot: throughput (KB/s) vs. data size (MB), log scale, for NS2, SSFNet (0.2), SSFNet (0.01), GTNetS and SimGrid; below: |ε| vs. data size]
I Packet-level tools don't completely agree with each other
I SSFNet's TCP_FAST_INTERVAL has a bad default value
I GTNetS is equally distant from the others
I CM02 doesn't take slow start into account

S                    |ε|       |ε_max|
S < 100KB            ≈ 146%    ≈ 508%
S ∈ [100KB; 10MB]    ≈ 17%     ≈ 80%
S > 10MB             ≈ 1%      ≈ 1%

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 55/130
Validation experiments on a single link (3/3)
Experimental settings
[figure: one TCP flow from source to sink over a single link]
I Compute the achieved bandwidth as a function of S
I Fixed L=10ms and B=100MB/s

Evaluation of the LV08 model


[plot: throughput (KB/s) vs. data size (MB), log scale, for NS2, SSFNet (0.2), SSFNet (0.01), GTNetS and SimGrid; below: |ε| vs. data size]
I Statistical analysis of the GTNetS slow-start
I New SimGrid model (MaxMin based):
  I bandwidth decreased to 92% of the nominal value
  I latency changed to 10.4 × L
I This dramatically improves the validity range

S            |ε|      |ε_max|
S < 100KB    ≈ 12%    ≈ 162%
S > 100KB    ≈ 1%     ≈ 6%

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 56/130
Validation experiments on the dumbbell topology

Experimental settings
[figure: dumbbell topology; flows A and B share a B MB/s, 20ms bottleneck link; the access links run at 100MB/s, with latency 10ms except one of flow A's links at L ms]

I Comparison limited to the GTNetS packet-level simulator


I Bandwidth: linearly sampled with 16 points B ∈ [0.01, 1000] MB/s
I Latency: linearly sampled with 20 points L ∈ [0, 200] ms
I Size: S=100MB
I Compare instantaneous bandwidth share (SimGrid vs. GTNetS)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 57/130
Validation experiments on the dumbbell topology
Throughput as a function of L when B=100MB/s (limited by latency)
[figure: the dumbbell topology; plots: instantaneous bandwidth share (MB/s) of flows A and B vs. latency (ms), GTNetS (left) and SimGrid (right)]
Similar trends in both simulators
I L < 10 ms ⇒ Flow A gets less bandwidth than B
I L = 10 ms ⇒ Flow A gets as much bandwidth as B
I L > 10 ms ⇒ Flow A gets more bandwidth than B
Negligible error
I |ε_max| ≈ 0.1%
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 58/130
Validation experiments on the dumbbell topology

Throughput as a function of L when B=0.1MB/s (limited by bandwidth)
[figure: the dumbbell topology; plots: instantaneous bandwidth share (MB/s) of flows A and B vs. latency (ms), GTNetS (left) and SimGrid (right)]
Analysis

I The trend is respected (as L increases, the share of flow A increases)


I |ε| ≈ 15% ; |ε_max| ≈ 202%
I Model inaccurate or badly instantiated...

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 59/130
Data fitting on the Bandwidth ratio in GTNetS
[3D plot: Ratio = f(B, L)]
Approximation
I Ratio = (share of flow A) / (share of flow B) (in GTNetS)
I Data fitting: Ratio = Σ_i (L_i + 8775/B_i) / Σ_j (L_j + 8775/B_j)
Conclusion: MaxMin needs priorities
I CM02 uses the latency only (w_i = L_i)
I the bandwidth should be considered too

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 60/130
Data fitting on the Bandwidth ratio in GTNetS
[3D plot: Ratio = f(B, L)]
Approximation
I Ratio = (share of flow A) / (share of flow B) (in GTNetS)
I Data fitting: Ratio = Σ_i (L_i + 8775/B_i) / Σ_j (L_j + 8775/B_j)
Conclusion: MaxMin needs priorities
I CM02 uses the latency only (w_i = L_i)
I the bandwidth should be considered too
LV08 improvements
I Max-Min with priorities: w_i = Σ_{link k used by flow i} (L_k + 8775/B_k)
I Improved results: |ε| ≈ 4% ; |ε_max| ≈ 44%

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 60/130
Validation experiments on the dumbbell topology
[figure: the same dumbbell topology as before]
Summary of all experiments, for both flows
Some are bandwidth-limited, some are latency-limited
[plots: |ε| per experiment for flow A and flow B, Old vs. Improved model]

Conclusion
I SimGrid uses an accurate yet fast sharing model
I Improved model is validated against GTNetS
I Accuracy has to be evaluated against more general topologies
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 61/130
Validation experiments on random platforms

Experiments on 160 platforms


[figure: a sample generated platform of about 60 named hosts]
I Platforms generated with BRITE (Waxman model)
I Bandwidth: uniform distribution
  I Homogeneous: B ∈ [100, 128] MB/s
  I Heterogeneous: B ∈ [10, 128] MB/s
I Latency: L ∈ [0; 5] ms (Euclidean distance)
I Flow size: S=10MB
I #flows: F=150; #nodes: N ∈ [50; 200]
I Four scenarios, ten different flow instantiations

Pedro Velho, Arnaud Legrand. Accuracy Study and Improvement of Network Simulation in the SimGrid Framework. Submitted to SimuTools'09.

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 62/130
Validation experiments on random platforms

Summary of experiments
[plots: mean error |ε| and max error |ε_max| per experiment, Old vs. Improved model; below: ratio = log(|ε_Old|) − log(|ε_Improved|)]


Interpretation
I Clear improvement with the new model
I |ε| < 0.2 (i.e., ≈ 22%); |ε_max| still challenging, up to 461%

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 63/130
Simulation speed
200-nodes/200-flows network sending 1MB each
             GTNetS                               SimGrid
# of flows   Simulation time   sim./simulated     Simulation time   sim./simulated
10           0.661s            0.856              0.002s            0.002
25           1.651s            1.712              0.008s            0.010
50           3.697s            3.589              0.028s            0.028
100          7.649s            7.468              0.137s            0.140
200          15.705s           11.515             0.536s            0.396

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 64/130
Simulation speed
200-nodes/200-flows network sending 1MB each
             GTNetS                               SimGrid
# of flows   Simulation time   sim./simulated     Simulation time   sim./simulated
10           0.661s            0.856              0.002s            0.002
25           1.651s            1.712              0.008s            0.010
50           3.697s            3.589              0.028s            0.028
100          7.649s            7.468              0.137s            0.140
200          15.705s           11.515             0.536s            0.396

200-nodes/200-flows network sending 100MB each


             GTNetS                               SimGrid
# of flows   Simulation time   sim./simulated     Simulation time   sim./simulated
10           65s               0.92               0.001s            0.00002
25           163s              1.85               0.008s            0.00010
50           364s              3.89               0.028s            0.00029
100          753s              8.08               0.138s            0.00142
200          1562s             12.59              0.538s            0.00402
I GTNetS execution time linear in both data size and #flows
I SimGrid only depends on #flows
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 64/130
Conclusion
Models of “Grid” Simulators
I Most are overly simplistic (wormhole: slow and inaccurate at best)
I Some are plainly wrong (OptorSim unfortunate sharing policy)

Analytic TCP models not trivial, but possible


I Several models exist in the literature
I They can be implemented efficiently
I SimGrid implements Max-Min fairness, proportional (Vegas & Reno)

SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 65/130
Conclusion
Models of “Grid” Simulators
I Most are overly simplistic (wormhole: slow and inaccurate at best)
I Some are plainly wrong (OptorSim unfortunate sharing policy)

Analytic TCP models not trivial, but possible


I Several models exist in the literature
I They can be implemented efficiently
I SimGrid implements Max-Min fairness, proportional (Vegas & Reno)

SimGrid almost compares to Packet-Level Simulators


I Validity acceptable in many cases (|ε| ≈ 5% in most cases)
I Validity range clearly delimited
I Maximum error still unacceptable
I It is often one GTNetS flow that achieves an insignificant throughput
I Maybe SimGrid is right and GTNetS is wrong?
I SimGrid speedup ≈ 10^3, GTNetS slowdown up to 10 (ns-2, SSFNet even worse)
I SimGrid execution time depends only on #flows, not data size
I SimGrid can use GTNetS to perform network predictions (for paranoids)
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 65/130
Future Work
Towards Real-World Experiments
I Assess the various models implemented in SimGrid
I Assess Packet-Level simulators themselves
I Use even more realistic platforms: high contention scenarios
I Use more realistic applications, e.g., the NAS benchmarks

Improve the Macroscopic TCP Models in SimGrid


I Decrease maximum error
I Use LV08 by default instead of CM02

Develop New Models


I Compound models (influence of computation load over communications)
I High-speed networks such as Quadrics or Myrinet
I Model the disks (affine models λ + size/β don't seem sufficient)
I Model multicores
SimGrid for Research on Large-Scale Distributed Systems Resource Models in SimGrid 66/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentations in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Conclusion

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 67/130


Platform Instantiation

To use models, one must instantiate them


Key questions
I How can I run my tests on realistic platforms? What is a realistic platform?
I What are platform parameters? What are their values in real platforms?

Sources of platform descriptions


I Manual modeling: define the characteristics with your sysadmins
I Automatic mapping
I Synthetic platform generator

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 68/130


What is a Platform Instance Anyway?

Structural description
I Hosts list
I Links and interconnection topology

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 69/130


What is a Platform Instance Anyway?

Structural description
I Hosts list
I Links and interconnection topology

Peak Performance
I Bandwidth and Latencies
I Processing capacity

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 69/130


What is a Platform Instance Anyway?

Structural description
I Hosts list
I Links and interconnection topology

Peak Performance
I Bandwidth and Latencies
I Processing capacity

Background Conditions
I Load
I Failures

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 69/130


Platform description for SimGrid
Example of XML file
<?xml version=’1.0’?>
<!DOCTYPE platform SYSTEM "surfxml.dtd">
<platform version="2">
<host id="Jacquelin" power="137333000"/>
<host id="Boivin" power="98095000"/>

<link id="1" bandwidth="3430125" latency="0.000536941"/>


<route src="Jacquelin" dst="Boivin"><link:ctn id="1"/></route>
<route src="Boivin" dst="Jacquelin"><link:ctn id="1"/></route>
</platform>

I Declare all your hosts, with their computing power

I Declare all your links, with bandwidth and latency

I Declare routes from each host to each host (list of links)

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 70/130


Platform description for SimGrid
Example of XML file
<?xml version=’1.0’?>
<!DOCTYPE platform SYSTEM "surfxml.dtd">
<platform version="2">
<host id="Jacquelin" power="137333000"/>
<host id="Boivin" power="98095000"/>

<link id="1" bandwidth="3430125" latency="0.000536941"/>


<route src="Jacquelin" dst="Boivin"><link:ctn id="1"/></route>
<route src="Boivin" dst="Jacquelin"><link:ctn id="1"/></route>
</platform>

I Declare all your hosts, with their computing power


other attributes:
I availability_file: trace file to let the power vary
I state_file: trace file to specify whether the host is up/down

I Declare all your links, with bandwidth and latency


I bandwidth_file, latency_file, state_file: trace files
I sharing_policy ∈ {shared, fatpipe} (fatpipe ; no sharing)
I Declare routes from each host to each host (list of links)

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 70/130


Platform description for SimGrid
Example of XML file
<?xml version=’1.0’?>
<!DOCTYPE platform SYSTEM "surfxml.dtd">
<platform version="2">
<host id="Jacquelin" power="137333000"/>
<host id="Boivin" power="98095000">
<prop key="someproperty" value="somevalue"/> <!-- attach arbitrary data to hosts/links -->
</host>
<link id="1" bandwidth="3430125" latency="0.000536941"/>
<route src="Jacquelin" dst="Boivin"><link:ctn id="1"/></route>
<route src="Boivin" dst="Jacquelin"><link:ctn id="1"/></route>
</platform>

I Declare all your hosts, with their computing power


other attributes:
I availability_file: trace file to let the power vary
I state_file: trace file to specify whether the host is up/down

I Declare all your links, with bandwidth and latency


I bandwidth_file, latency_file, state_file: trace files
I sharing_policy ∈ {shared, fatpipe} (fatpipe ; no sharing)
I Declare routes from each host to each host (list of links)
I Arbitrary data can be attached to components using the <prop> tag
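Such a file is then fed to the simulator; a minimal sketch of loading it with the MSG interface (error handling omitted; platform.xml stands for the file above):

  #include <msg/msg.h>

  int main(int argc, char *argv[]) {
    MSG_global_init(&argc, argv);            /* initialize the MSG simulator */
    MSG_create_environment("platform.xml");  /* parse the platform description */
    /* ... register functions and deploy processes here ... */
    MSG_main();                              /* run the simulation main loop */
    MSG_clean();
    return 0;
  }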
SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 70/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentations in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Conclusion

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 71/130


Platform Catalog
Several Existing Platforms Modeled

I Grid’5000: 9 sites, 25 clusters, 1,528 hosts
I DAS 3: 5 clusters, 277 hosts
I GridPP: 18 clusters, 7,948 hosts
I LCG: 113 clusters, 44,184 hosts

Files available from the Platform Description Archive


http://pda.gforge.inria.fr

(+ tool to extract platform subsets)

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 72/130


Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentations in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Conclusion

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 73/130


Synthetic Topology Generation
Characterizing Platform Realism (to design a generator)
I Examine real platforms
I Discover principles
I Implement a generator

Topology of the Internet


I Subject of studies in Network Community for years
I Decentralized growth, obeying complex rules and incentives

; Could it have a mathematical structure?


; Could we then have generative models?

Three “generations” of graph generators


I Random (or flat)
I Structural
I Degree-based
SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 74/130
Random Platform Generator

Two-step generators
1. Nodes are placed on a square (of side c) following a probability law
2. Each couple (u, v) gets interconnected with a given probability

1. Node Placement

[figures: uniform vs. heavy-tailed node placement]

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 75/130


Random Platform Generator
2. Probability for (u, v) to be connected
I Uniform: uniform probability α (not realistic, but simple enough to be popular)
I Exponential: probability P(u, v) = α e^(−d/(L−d)), 0 < α ≤ 1
  I d: Euclidean distance between u and v ; L = c√2, with c the side of the placement square
  I The amount of edges increases with α
I Waxman: probability P(u, v) = α e^(−d/(βL)), 0 < α, β ≤ 1
  I The amount of edges increases with α, the edge length heterogeneity with β
  Waxman, Routing of Multipoint Connections, IEEE J. on Selected Areas in Comm., 1988.
I Locality-aware: probability P(u, v) = α if d < L × r, β if d ≥ L × r
  Zegura, Calvert, Donahoo, A quantitative comparison of graph-based models for Internet topology, IEEE/ACM Transactions on Networking, 1997.
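For concreteness, a Waxman generator fits in a few lines of C (a sketch: uniform node placement plus the edge probability above; the parameter values are arbitrary, compile with -lm):

  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define N 100        /* number of nodes */
  #define C 1000.0     /* side of the placement square */
  #define ALPHA 0.15   /* controls the amount of edges */
  #define BETA 0.20    /* controls edge length heterogeneity */

  int main(void) {
    double x[N], y[N];
    double L = C * sqrt(2.0);          /* maximal possible distance */
    srand48(42);
    for (int i = 0; i < N; i++) {      /* 1. uniform node placement */
      x[i] = drand48() * C;
      y[i] = drand48() * C;
    }
    for (int i = 0; i < N; i++)        /* 2. Waxman edge probability */
      for (int j = i + 1; j < N; j++) {
        double d = hypot(x[i] - x[j], y[i] - y[j]);
        if (drand48() < ALPHA * exp(-d / (BETA * L)))
          printf("edge %d %d (length %.1f)\n", i, j, d);
      }
    return 0;
  }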

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 76/130


Structural Topology Generators
Generate the hierarchy explicitly (Top-Down)
[figure: (1) AS-level topology of AS nodes; (2) router-level topologies; (3) edge connection method]
Transit-stub [Zegura et al.]
I Starting from a connected graph
I Replace some nodes by connected graphs
I Add some additional edges
I (GT-ITM, BRITE)
SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 77/130


Structural Topology Generators
Generate the hierarchy explicitly (Top-Down)
[figure: (1) AS-level topology of AS nodes; (2) router-level topologies; (3) edge connection method]
Transit-stub [Zegura et al.]
I Starting from a connected graph
I Replace some nodes by connected graphs
I Add some additional edges
I (GT-ITM, BRITE)

[figure: transit domains, stub domains, multi-homed stubs, stub-stub edges]
N-level [Zegura et al.]
I Iterate the previous algorithm
I (Tiers, GT-ITM)

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 77/130


Power-Law : Rank Exponent

Analysis of topology at AS level


I Rank rv of node v : its index in the order of decreasing degree
I Degree dv of node v is proportional to its rank rv to the power of constant R
d_v = r_v^R × k
[log-log plots of degree vs. rank with power-law fits: Nov 97 (R = 0.81), Apr 98 (R = 0.82), Dec 98 (R = 0.74), Routers 95 (R = 0.48)]

Faloutsos, Faloutsos, Faloutsos, On Power-law Relationships of the Internet Topology, SIGCOMM 1999, pp. 251-262.

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 78/130


Power-Law : Rank Exponent

Analysis of topology at AS level


I Rank rv of node v : its index in the order of decreasing degree
I Degree dv of node v is proportional to its rank rv to the power of constant R
d_v = r_v^R × k
[log-log plots of degree vs. rank with power-law fits: Nov 97 (R = 0.81), Apr 98 (R = 0.82), Dec 98 (R = 0.74), Routers 95 (R = 0.48)]

Seems to be a necessary condition for topology realism


Faloutsos, Faloutsos, Faloutsos, On Power-law Relationships of the Internet Topology, SIGCOMM 1999, pp. 251-262.

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 78/130


Degree-based Topology Generators
Power-laws received a lot of attention recently
I Small-World theory
I Not only in CS, but also in sociology for example

Using this idea for realistic platform generation


I Enforce the power law by construction of the platform

Barabási-Albert algorithm
I Incremental growth
I Affinity connection (preferential attachment)

Probability to connect new node v to existing node u
I Depends on d_u : P(u, v) = d_u / Σ_k d_k

Barabási and Albert, Emergence of scaling in random networks, Science, vol. 286, 1999, pp. 509-512.
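The algorithm itself is short; a C sketch (one edge per new node to keep it compact, whereas real generators such as BRITE attach several). Picking a uniformly random entry of the edge-endpoint list implements exactly the P(u, v) = d_u / Σ_k d_k rule, since each node appears in that list proportionally to its degree:

  #include <stdio.h>
  #include <stdlib.h>

  #define N 1000

  int main(void) {
    static int endpoint[2 * N]; /* every edge endpoint ever created */
    int nend = 0;
    srand48(42);

    endpoint[nend++] = 0;       /* start from two connected nodes */
    endpoint[nend++] = 1;

    for (int v = 2; v < N; v++) {          /* incremental growth */
      int u = endpoint[lrand48() % nend];  /* affinity connection */
      printf("edge %d %d\n", v, u);
      endpoint[nend++] = v;
      endpoint[nend++] = u;
    }
    return 0;
  }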
SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 79/130
Checking two Power-Laws
Out-degree vs. rank
[log-log plots of out-degree vs. rank for: Interdomain 11/97, Barabási-Albert (BRITE), Waxman, Transit-Stub (GT-ITM), GT-ITM]

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 80/130


Checking two Power-Laws
Out-degree vs. rank
[log-log plots of out-degree vs. rank for: Interdomain 11/97, Barabási-Albert (BRITE), Waxman, Transit-Stub (GT-ITM), GT-ITM]
Out-degree frequency
[log-log plots of frequency vs. out-degree for the same five topologies]

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 80/130


Checking two Power-Laws
Out-degree vs. rank
[log-log plots of out-degree vs. rank for: Interdomain 11/97, Barabási-Albert (BRITE), Waxman, Transit-Stub (GT-ITM), GT-ITM]
Out-degree frequency
[log-log plots of frequency vs. out-degree for the same five topologies]

I Laws respected by the interdomain topology ; seemingly a necessary condition
I Barabási-Albert performs the best (as expected)
I GT-ITM performs the worst
SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 80/130
Power laws discussion
Other power laws? On which measurements?
I Expansion ; Distortion ; Eigenvalue distribution
I Resilience ; Eccentricity distribution ; Set cover size, . . .

Methodological limits
I Necessary condition ≠ sufficient condition
I Laws observed by the Faloutsos brothers are correlated
I They could be irrelevant parameters
Barford, Bestavros, Byers, Crovella, On the Marginal Utility of Network Topology
Measurements, 1st ACM SIGCOMM Workshop on Internet Measurement, 2001.
I They could even be measurement bias!
Lakhina, Byers, Crovella, Xie, Sampling Biases in IP Topology Measurements, INFOCOM’03.

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 81/130


Power laws discussion
Other power laws? On which measurements?
I Expansion ; Distortion ; Eigenvalue distribution
I Resilience ; Eccentricity distribution ; Set cover size, . . .

Methodological limits
I Necessary condition ≠ sufficient condition
I Laws observed by the Faloutsos brothers are correlated
I They could be irrelevant parameters
Barford, Bestavros, Byers, Crovella, On the Marginal Utility of Network Topology
Measurements, 1st ACM SIGCOMM Workshop on Internet Measurement, 2001.
I They could even be measurement bias!
Lakhina, Byers, Crovella, Xie, Sampling Biases in IP Topology Measurements, INFOCOM’03.

Networks have Power Laws AND structure!


I Cannot afford to trash hierarchical structures just to obey power laws!
I Some projects try to combine both (GridG)
SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 81/130
So, Structural or Degree-based Topology Generator?
Observation
I AS-level and router-level have similar characteristics
I Degree-based generators better represent the large-scale properties of the Internet
I Hierarchy seems to arise from degree-based generators
Tangmunarunkit, Govindan, Jamin, Shenker, Willinger, Network topology generators: Degree-
based vs structural, SIGCOMM’02

Conclusion
I 10,000 nodes platform: Degree-based generators perform better
I 100 nodes platform
I Power-laws make no sense
I Structural generators seem more appropriate

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 82/130


So, Structural or Degree-based Topology Generator?
Observation
I AS-level and router-level have similar characteristics
I Degree-based generators better represent the large-scale properties of the Internet
I Hierarchy seems to arise from degree-based generators
Tangmunarunkit, Govindan, Jamin, Shenker, Willinger, Network topology generators: Degree-
based vs structural, SIGCOMM’02

Conclusion
I 10,000 nodes platform: Degree-based generators perform better
I 100 nodes platform
I Power-laws make no sense
I Structural generators seem more appropriate

Routing still remains to be characterized


I It is known that a multi-hop network route is not always the shortest path
Paxson, Measurements and Analysis of End-to-End Internet Dynamics, PhD Thesis UCB, 1997.
I Generators wrongly assume the opposite
SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 82/130
Network Performance (labeling graph edges)

We need more than a graph!


I Bandwidth and latency
I Sharing capacity (backplane)

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 83/130


Network Performance (labeling graph edges)

We need more than a graph!


I Bandwidth and latency
I Sharing capacity (backplane)

Model Physical Characteristics (Peak Performance+Background)


I Some “models” in topology generators (WAN/LAN/SAN)
I Need to simulate background traffic (no accepted model to generate it)
I Simulation can be very costly

Model End-to-End Performance (Usable Performance)


I Easier way to go
I Some models exist: Lee, Stepanek, On future global grid communication performance, HCW
I Use real raw measurements (NWS, . . . )

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 83/130


Computing Resources (labeling graph vertices)

Situation quite different from network resources:


I Hard to qualify usable performance
I Easy to model peak performance + background conditions

“Ad-hoc” generalization of peak performance


I Look at a real-world platform, e.g., the TeraGrid
I Generate new sites based on existing sites

Statistical modeling (as usual)


I Examine many production resources
I Identify key statistical characteristics
I Come up with a generative/predictive model

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 84/130


Synthetic Clusters

Clusters are classical resource


I What is the “typical” distribution of clusters?

Commodity Cluster synthesizer


I Examined 114 production clusters (10K+ procs)
I Came up with statistical models
I Linear fit between clock-rate and release-year within a processor family
I Quadratic fraction of processors released on a given year
I Validated model against a set of 191 clusters (10K+ procs)
I Models allow “extrapolation” for future configurations
I Models implemented in a resource generator
Kee, Casanova, Chien, Realistic Modeling and Synthesis of Resources for Computational Grids,
Supercomputing 2004.

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 85/130


Background Conditions (workload and resource availability)

Probabilistic Models
I Naive: empirical distributions of availability and unavailability intervals
I Weibull distributions:
Nurmi, Brevik, Wolski, Modeling Machine Availability in Enterprise and Wide-area
Distributed Computing Environments, EuroPar 2005.
I Models by Feitelson et al.: job inter-arrival times (Gamma), amount of work requested (Hyper-Gamma), number of processors requested (compound: 2^p, 1, ...)
Feitelson, Workload Characterization and Modeling Book, available at http://www.cs.huji.ac.il/~feit/wlmod/

Traces
I The Grid Workloads Archive (http://gwa.ewi.tudelft.nl/pmwiki/)
I Resource Prediction System Toolkit (RPS) based traces (http://www.cs.northwestern.edu/~pdinda/LoadTraces)
I Home-made traces with NWS
SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 86/130
Example Synthetic Grid Generation
Generate topology and networks
I Topology: Generate a 5,000 node graph with Tiers
I Latency: Euclidean distance (scaling to obtain the desired network diameter)
I Bandwidth: Set of end-to-end NWS measurements
Generate computational resources
I Pick 30% of the end-points
I Clusters at each end-point: Kee’s synthesizer for Year 2008
I Cluster load: Feitelson’s model (parameters picked randomly)
I Resource failures: based on the Grid Workloads Archive

All-in-one tools
I GridG
Lu and Dinda, GridG: Generating Realistic Computational Grids, Performance Evaluation Review, vol. 30, no. 4, 2003.
I Simulacrum tool
Quinson, Suter, A Platform Description Archive for Reproducible Simulation Experiments,
Submitted to SimuTools’09.

SimGrid for Research on Large-Scale Distributed Systems Platform Instanciation 87/130


Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentations in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Conclusion

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 88/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentations in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Conclusion

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 89/130
User-visible SimGrid Components
[layer diagram: SimDag (framework for DAGs of parallel tasks) | MSG (simple application-level simulator) | GRAS (framework to develop distributed applications) | AMOK (toolbox) | SMPI (library to run MPI applications on top of a virtual environment), all built on top of:]

XBT: Grounding features (logging, etc.), usual data structures (lists, sets, etc.) and portability layer

SimGrid user APIs


I SimDag: model applications as DAG of (parallel) tasks
I MSG: model applications as Concurrent Sequential Processes
I GRAS: develop real applications, studied and debugged in simulator
AMOK: set of distributed tools (bandwidth measurement, failure detector, . . . )
I SMPI: simulate MPI codes (still under development)
I XBT: grounding toolbox

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 90/130
User-visible SimGrid Components
[layer diagram: SimDag (framework for DAGs of parallel tasks) | MSG (simple application-level simulator) | GRAS (framework to develop distributed applications) | AMOK (toolbox) | SMPI (library to run MPI applications on top of a virtual environment), all built on top of:]

XBT: Grounding features (logging, etc.), usual data structures (lists, sets, etc.) and portability layer

SimGrid user APIs


I SimDag: model applications as DAG of (parallel) tasks
I MSG: model applications as Concurrent Sequential Processes
I GRAS: develop real applications, studied and debugged in simulator
AMOK: set of distributed tools (bandwidth measurement, failure detector, . . . )
I SMPI: simulate MPI codes (still under development)
I XBT: grounding toolbox
Which API should I choose?
I Your application is a DAG ; SimDag
I You have a MPI code ; SMPI
I You study concurrent processes, or distributed applications
I You need graphs about several heuristics for a paper ; MSG
I You develop a real application (or want experiments on real platform) ; GRAS
I Most popular API (for now): MSG
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 90/130
Argh! Do I really have to code in C?!
No, not necessarily
I Some bindings exist: Java bindings to the MSG interface (new in v3.3)
I More bindings planned:
I C++, Python, and any scripting language
I SimDag interface

Well, sometimes yes, but...


I SimGrid itself is written in C for speed and portability (no dependency)
I All components naturally usable from C (most of them only accessible from C)
I XBT eases some difficulties of C
I Full-featured logs (similar to log4j), Exception support (in ANSI C)
I Popular abstract data types (dynamic array, hash tables, . . . )
I Easy string manipulation, Configuration, Unit testing, . . .
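For illustration, a minimal sketch combining a few of these helpers (category name and strings invented for the example; see the XBT documentation for the full API):

#include "xbt.h"
#include "xbt/dynar.h"

XBT_LOG_NEW_DEFAULT_CATEGORY(demo, "Messages of this example");

static void xbt_demo(void) {
  xbt_dynar_t names = xbt_dynar_new(sizeof(char *), NULL); /* dynamic array */
  char *who = "world";
  xbt_dynar_push(names, &who);   /* stores a copy of the pointer */
  INFO1("Hello %s!", who);       /* log4j-like logging macro */
  xbt_dynar_free(&names);        /* release the array */
}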

What about portability?


I Regularly tested under: Linux (x86, amd64), Windows and MacOSX
I Supposed to work under any other Unix system (including AIX and Solaris)
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 91/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentation in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Conclusion

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 92/130
SimDag: Comparing Scheduling Heuristics for DAGs
[Figure: an example DAG (Root; tasks 1-6; End) and the Gantt chart obtained once the tasks are scheduled on two hosts]

Main functionalities
1. Create a DAG of tasks
I Vertices: tasks (either communication or computation)
I Edges: precedence relation
2. Schedule tasks on resources
3. Run the simulation (respecting precedences)
⇒ Compute the makespan

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 93/130
The SimDag interface
DAG creation
I Creating tasks: SD_task_create(name, data)
I Creating dependencies: SD_task_dependency_{add/remove}(src, dst)

Scheduling tasks
I SD_task_schedule(task, workstation_number, *workstation_list,
                   double *comp_amount, double *comm_amount, double rate)
I Tasks are parallel by default; simply set workstation_number to 1 if not
I Communications are regular tasks; comm_amount is a matrix
I Both computation and communication are possible within the same task
I rate: to slow down non-CPU (resp. non-network) bound applications
I SD_task_unschedule, SD_task_get_start_time
Running the simulation
I SD_simulate(double how_long) (how_long < 0 ⇒ run until the end)
I SD_task_{watch/unwatch}: simulation stops as soon as the task's state changes

Full API in the doxygen-generated documentation
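To fix ideas, a minimal sketch chaining two tasks with the calls above (the platform file name, the amounts, and the SD_workstation_get_list/SD_get_clock calls are illustrative assumptions; error handling omitted):

#include <stdio.h>
#include "simdag/simdag.h"

int main(int argc, char *argv[]) {
  SD_init(&argc, argv);
  SD_create_environment("my_platform.xml");

  SD_task_t t1 = SD_task_create("comp1", NULL);
  SD_task_t t2 = SD_task_create("comp2", NULL);
  SD_task_dependency_add(t1, t2);             /* t1 must complete before t2 */

  const SD_workstation_t *ws = SD_workstation_get_list();
  double comp = 1e9, comm = 0.;               /* 1 Gflop, no communication */
  SD_task_schedule(t1, 1, &ws[0], &comp, &comm, -1.);
  SD_task_schedule(t2, 1, &ws[0], &comp, &comm, -1.);

  SD_simulate(-1.);                           /* negative: run until the end */
  printf("Makespan: %g\n", SD_get_clock());
  SD_exit();
  return 0;
}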


SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 94/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentation in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use
Java bindings
A Glance at SimGrid Internals
Performance Results
GRAS: Developing and Debugging Real Applications
Conclusion
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 95/130
MSG: Heuristics for Concurrent Sequential Processes

(historical) Motivation
I Centralized scheduling does not scale
I SimDag (and its predecessor) not adapted to study decentralized heuristics
I MSG not strictly limited to scheduling, but particularly convenient for it

Main MSG abstractions

I Agent: some code, some private data, running on a given host
  (a set of functions + XML deployment file for arguments)
I Task: amount of work to do and of data to exchange
  I MSG_task_create(name, compute_duration, message_size, void *data)
  I Communication: MSG_task_{put,get}, MSG_task_Iprobe
  I Execution: MSG_task_execute,
    MSG_process_sleep, MSG_process_{suspend,resume}
I Host: location on which agents execute
I Mailbox: similar to MPI tags

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 96/130
The MSG master/workers example: the worker
The master has a large number of tasks to dispatch to its workers for execution

int worker(int argc, char *argv[ ]) {


m_task_t task; int errcode;
int id = atoi(argv[1]);
char mailbox[80];

sprintf(mailbox,"worker-%d",id);

while(1) {
errcode = MSG_task_receive(&task, mailbox);
xbt_assert0(errcode == MSG_OK, "MSG_task_receive failed");

if (!strcmp(MSG_task_get_name(task),"finalize")) {
MSG_task_destroy(task);
break;
}

INFO1("Processing ’%s’", MSG_task_get_name(task));


MSG_task_execute(task);
INFO1("’%s’ done", MSG_task_get_name(task));
MSG_task_destroy(task);
}

INFO0("I’m done. See you!");


return 0;
}
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 97/130
The MSG master/workers example: the master

int master(int argc, char *argv[ ]) {

  int number_of_tasks = atoi(argv[1]);   double task_comp_size = atof(argv[2]);
  double task_comm_size = atof(argv[3]); int workers_count = atoi(argv[4]);
  char mailbox[80]; char buff[64];
  m_task_t task;
  int i;

  /* Dispatching (dumb round-robin algorithm) */
  for (i = 0; i < number_of_tasks; i++) {
    sprintf(buff, "Task_%d", i);
    task = MSG_task_create(buff, task_comp_size, task_comm_size, NULL);
    sprintf(mailbox, "worker-%d", i % workers_count);
    INFO2("Sending \"%s\" to mailbox \"%s\"", MSG_task_get_name(task), mailbox);
    MSG_task_send(task, mailbox);
  }

  /* Send a finalization message to each worker (on the same mailboxes) */
  INFO0("All tasks dispatched. Let's stop workers");
  for (i = 0; i < workers_count; i++) {
    sprintf(mailbox, "worker-%d", i);
    MSG_task_send(MSG_task_create("finalize", 0, 0, NULL), mailbox);
  }

  INFO0("Goodbye now!"); return 0;
}

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 98/130
The MSG master/workers example: deployment file

Specifying which agent must be run on which host, and with which arguments

XML deployment file


<?xml version=’1.0’?>
<!DOCTYPE platform SYSTEM "surfxml.dtd">
<platform version="2">

<!-- The master process (with some arguments) -->


<process host="Tremblay" function="master">
<argument value="6"/> <!-- Number of tasks -->
<argument value="50000000"/> <!-- Computation size of tasks -->
<argument value="1000000"/> <!-- Communication size of tasks -->
<argument value="3"/> <!-- Number of workers -->
</process>

<!-- The worker process (argument: mailbox number to use) -->


<process host="Jupiter" function="worker"><argument value="0"/></process>
<process host="Fafard" function="worker"><argument value="1"/></process>
<process host="Ginette" function="worker"><argument value="2"/></process>

</platform>
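For completeness, a hedged sketch of what the companion my_platform.xml could contain (host and link values invented; see the SimGrid documentation for the exact DTD of this platform format):

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "surfxml.dtd">
<platform version="2">
  <host name="Tremblay" power="98095000"/>  <!-- compute speed in flop/s -->
  <host name="Jupiter" power="76296000"/>
  <link name="link1" bandwidth="1250000" latency="0.0001"/>
  <route src="Tremblay" dst="Jupiter"><link:ctn id="link1"/></route>
</platform>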

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 99/130
The MSG master/workers example: the main()

Putting things together

int main(int argc, char *argv[ ]) {
  MSG_global_init(&argc, argv); /* Initialize MSG (also parses SimGrid options) */

  /* Declare all existing agents, binding their name to their function */
  MSG_function_register("master", &master);
  MSG_function_register("worker", &worker);

  /* Load a platform instance */
  MSG_create_environment("my_platform.xml");
  /* Load a deployment file */
  MSG_launch_application("my_deployment.xml");

  /* Launch the simulation (until its end) */
  MSG_main();

  INFO1("Simulation took %g seconds", MSG_get_clock());
  MSG_clean();
  return 0;
}

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 100/130
The MSG master/workers example: raw output
[Tremblay:master:(1) 0.000000] [example/INFO] Got 3 workers and 6 tasks to process
[Tremblay:master:(1) 0.000000] [example/INFO] Sending ’Task_0’ to ’worker-0’
[Tremblay:master:(1) 0.147613] [example/INFO] Sending ’Task_1’ to ’worker-1’
[Jupiter:worker:(2) 0.147613] [example/INFO] Processing ’Task_0’
[Tremblay:master:(1) 0.347192] [example/INFO] Sending ’Task_2’ to ’worker-2’
[Fafard:worker:(3) 0.347192] [example/INFO] Processing ’Task_1’
[Tremblay:master:(1) 0.475692] [example/INFO] Sending ’Task_3’ to ’worker-0’
[Ginette:worker:(4) 0.475692] [example/INFO] Processing ’Task_2’
[Jupiter:worker:(2) 0.802956] [example/INFO] ’Task_0’ done
[Tremblay:master:(1) 0.950569] [example/INFO] Sending ’Task_4’ to ’worker-1’
[Jupiter:worker:(2) 0.950569] [example/INFO] Processing ’Task_3’
[Fafard:worker:(3) 1.002534] [example/INFO] ’Task_1’ done
[Tremblay:master:(1) 1.202113] [example/INFO] Sending ’Task_5’ to ’worker-2’
[Fafard:worker:(3) 1.202113] [example/INFO] Processing ’Task_4’
[Ginette:worker:(4) 1.506790] [example/INFO] ’Task_2’ done
[Jupiter:worker:(2) 1.605911] [example/INFO] ’Task_3’ done
[Tremblay:master:(1) 1.635290] [example/INFO] All tasks dispatched. Let’s stop workers.
[Ginette:worker:(4) 1.635290] [example/INFO] Processing ’Task_5’
[Jupiter:worker:(2) 1.636752] [example/INFO] I’m done. See you!
[Fafard:worker:(3) 1.857455] [example/INFO] ’Task_4’ done
[Fafard:worker:(3) 1.859431] [example/INFO] I’m done. See you!
[Ginette:worker:(4) 2.666388] [example/INFO] ’Task_5’ done
[Tremblay:master:(1) 2.667660] [example/INFO] Goodbye now!
[Ginette:worker:(4) 2.667660] [example/INFO] I’m done. See you!
[2.667660] [example/INFO] Simulation time 2.66766

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 101/130
The MSG master/workers example: colorized output
$ ./my_simulator | MSG_visualization/colorize.pl
[ 0.000][ Tremblay:master ] Got 3 workers and 6 tasks to process
[ 0.000][ Tremblay:master ] Sending ’Task_0’ to ’worker-0’
[ 0.148][ Tremblay:master ] Sending ’Task_1’ to ’worker-1’
[ 0.148][ Jupiter:worker ] Processing ’Task_0’
[ 0.347][ Tremblay:master ] Sending ’Task_2’ to ’worker-2’
[ 0.347][ Fafard:worker ] Processing ’Task_1’
[ 0.476][ Tremblay:master ] Sending ’Task_3’ to ’worker-0’
[ 0.476][ Ginette:worker ] Processing ’Task_2’
[ 0.803][ Jupiter:worker ] ’Task_0’ done
[ 0.951][ Tremblay:master ] Sending ’Task_4’ to ’worker-1’
[ 0.951][ Jupiter:worker ] Processing ’Task_3’
[ 1.003][ Fafard:worker ] ’Task_1’ done
[ 1.202][ Tremblay:master ] Sending ’Task_5’ to ’worker-2’
[ 1.202][ Fafard:worker ] Processing ’Task_4’
[ 1.507][ Ginette:worker ] ’Task_2’ done
[ 1.606][ Jupiter:worker ] ’Task_3’ done
[ 1.635][ Tremblay:master ] All tasks dispatched. Let’s stop workers.
[ 1.635][ Ginette:worker ] Processing ’Task_5’
[ 1.637][ Jupiter:worker ] I’m done. See you!
[ 1.857][ Fafard:worker ] ’Task_4’ done
[ 1.859][ Fafard:worker ] I’m done. See you!
[ 2.666][ Ginette:worker ] ’Task_5’ done
[ 2.668][ Tremblay:master ] Goodbye now!
[ 2.668][ Ginette:worker ] I’m done. See you!
[ 2.668][ ] Simulation time 2.66766

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 102/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentation in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use
Java bindings
A Glance at SimGrid Internals
Performance Results
GRAS: Developing and Debugging Real Applications
Conclusion
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 103/130
MSG bindings for Java: master/workers example
import simgrid.msg.*;
public class BasicTask extends simgrid.msg.Task {
public BasicTask(String name, double computeDuration, double messageSize)
throws JniException {
super(name, computeDuration, messageSize);
}
}
public class FinalizeTask extends simgrid.msg.Task {
public FinalizeTask() throws JniException {
super("finalize",0,0);
}
}
public class Worker extends simgrid.msg.Process {
public void main(String[ ] args) throws JniException, NativeException {
String id = args[0];

while (true) {
Task t = Task.receive("worker-" + id);
if (t instanceof FinalizeTask)
break;
BasicTask task = (BasicTask)t;
Msg.info("Processing ’" + task.getName() + "’");
task.execute();
Msg.info("’" + task.getName() + "’ done ");
}
Msg.info("Received Finalize. I’m done. See you!");
}
}

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 104/130
MSG bindings for Java: master/workers example

import simgrid.msg.*;
public class Master extends simgrid.msg.Process {
public void main(String[ ] args) throws JniException, NativeException {
int numberOfTasks = Integer.valueOf(args[0]).intValue();
double taskComputeSize = Double.valueOf(args[1]).doubleValue();
double taskCommunicateSize = Double.valueOf(args[2]).doubleValue();
int workerCount = Integer.valueOf(args[3]).intValue();

Msg.info("Got "+ workerCount + " workers and " + numberOfTasks + " tasks.");

    for (int i = 0; i < numberOfTasks; i++) {
      BasicTask task = new BasicTask("Task_" + i, taskComputeSize, taskCommunicateSize);
      task.send("worker-" + (i % workerCount));
      Msg.info("Send completed for the task " + task.getName() +
               " on the mailbox 'worker-" + (i % workerCount) + "'");
    }

    /* Tell each worker to stop, mirroring the finalize loop of the C master */
    for (int i = 0; i < workerCount; i++)
      new FinalizeTask().send("worker-" + i);

    Msg.info("Goodbye now!");
}
}

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 105/130
MSG bindings for Java: master/workers example

Rest of the story

I XML files (platform, deployment) not modified
I No need for a main() function gluing things together
  I Java introspection mechanism used for this
  I simgrid.msg.Msg contains an adapted main() function
  I Names of the XML files are passed as command-line arguments
I Output very similar too

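A hedged sketch of how such a simulation is then typically launched (jar name and the use of simgrid.msg.Msg as the entry point are assumptions based on the points above):

java -cp simgrid.jar:. simgrid.msg.Msg my_platform.xml my_deployment.xml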
What about performance loss?

#tasks              #Workers
                  100    500  1,000  5,000  10,000
1,000     native  .16    .19    .21    .42    0.74
          java    .41    .59    .94    7.6     27.
10,000    native  .48    .52    .54    .83     1.1
          java    1.6    1.9   2.38    13.     40.
100,000   native  3.7    3.8    4.0    4.4     4.5
          java    14.    13.    15.    29.     77.
1,000,000 native  36.    37.    38.    41.     40.
          java   121.   130.   134.   163.    200.

I Small platforms: ok
I Larger ones: not quite. . .

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 106/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentation in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use
Java bindings
A Glance at SimGrid Internals
Performance Results
GRAS: Developing and Debugging Real Applications
Conclusion
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 107/130
Implementation of CSPs on top of simulation kernel

Idea
I Each process is implemented in a thread
I Blocking actions (execution and communication) reported into kernel
I A maestro thread unlocks the runnable threads (when action done)

Example
I Thread A: send "toto" to B, then receive something from B
I Thread B: receive something from A, then send "blah" to A
I Maestro schedules the threads; order given by the simulation kernel
I Mutually exclusive execution (don't fear)

[Figure: sequence diagram — the maestro asks the simulation kernel "who's next?" and unlocks the threads one at a time: A sends "toto" to B, B receives from A, B sends "blah" to A, A receives from B, until both report done]
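In C-like pseudocode, the maestro loop sketched above could look as follows (purely conceptual, not the actual SimGrid source; all names are invented):

while (!simulation_over()) {
  advance_virtual_clock();                  /* kernel solves resource sharing */
  action_t *done;
  while ((done = next_completed_action()) != NULL)
    unlock_thread(issuer_of(done));         /* exactly one user thread runs at a time */
  wait_until_running_threads_block_again(); /* they re-enter the kernel on their next action */
}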
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 108/130
A Glance at SimGrid Internals

[Layered architecture: SimDag, MSG, SMPI and GRAS (the latter optionally through SMURF, a network proxy) sit on SimIX, a "POSIX-like" API on a virtual platform, which runs on SURF, the virtual platform simulator, itself grounded on XBT]

I SURF: simulation kernel, grounding the simulation
  Contains all the models (uses GTNetS when needed)
I SimIX: eases the writing of user APIs based on CSPs
  Provided semantics: threads, mutexes and conditions on top of the simulator
I SMURF: distributes the simulation over a cluster (under development)
  Not for speed but to escape the memory limit (at least for now)

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 109/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentation in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use
Java bindings
A Glance at SimGrid Internals
Performance Results
GRAS: Developing and Debugging Real Applications
Conclusion
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 110/130
Some Performance Results
Master/Workers on amd64 with 4GB

#tasks      Context     #Workers
            mechanism      100    500  1,000  5,000  10,000  25,000
1,000       ucontext      0.16   0.19   0.21   0.42    0.74    1.66
            pthread       0.15   0.18   0.19   0.35    0.55       ?
            java          0.41   0.59   0.94    7.6     27.       ?
10,000      ucontext      0.48   0.52   0.54   0.83     1.1    1.97
            pthread       0.51   0.56   0.57   0.78    0.95       ?
            java           1.6    1.9   2.38    13.     40.       ?
100,000     ucontext       3.7    3.8    4.0    4.4     4.5     5.5
            pthread        4.7    4.4    4.6    5.0    5.23       ?
            java           14.    13.    15.    29.     77.       ?
1,000,000   ucontext       36.    37.    38.    41.     40.     41.
            pthread        42.    44.    46.    48.     47.       ?
            java          121.   130.   134.   163.    200.       ?

?: #semaphores reached the system limit
   (2 semaphores per user process, system limit = 32k semaphores)

Extensibility with UNIX contexts

#tasks      Stack    #Workers
            size     25,000  50,000  100,000  200,000
1,000       128Kb       1.6       †        †        †
            12Kb        0.5     0.9      1.7      3.2
10,000      128Kb         2       †        †        †
            12Kb        0.8     1.2        2      3.5
100,000     128Kb       5.5       †        †        †
            12Kb        3.7     4.1      4.8      6.7
1,000,000   128Kb        41       †        †        †
            12Kb         33    33.6     33.7     35.5
5,000,000   128Kb       206       †        †        †
            12Kb        161     167      161      165
†: out of memory

Scalability limit of GridSim
I 1 user process = 3 java threads (code, input, output)
I System limit = 32k threads
⇒ at most 10,922 user processes
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 111/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentation in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Motivation and project goals
Functionalities
Experimental evaluation (performance and simplicity)
Conclusion and Perspectives
Conclusion
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 112/130
Goals of the GRAS project (Grid Reality And Simulation)
Ease development of large-scale distributed apps
Development of real distributed applications using a simulator
[Figure: without GRAS, the research code (simulation) must be rewritten into a development code (application); with GRAS, a single code targets one API with two implementations: GRDK, on top of SimGrid, and GRE, for in situ execution]

I Framework for Rapid Development of Distributed Infrastructure


I Develop and tune on the simulator; Deploy in situ without modification
How: One API, two implementations

I Efficient Grid Runtime Environment (result = application ≠ prototype)


I Performance concern: efficient communication of structured data
How: Efficient wire protocol (avoid data conversion)
I Portability concern: because of grid heterogeneity
How: ANSI C + autoconf + no dependency
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 113/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentation in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Motivation and project goals
Functionalities
Experimental evaluation (performance and simplicity)
Conclusion and Perspectives
Conclusion
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 114/130
Main concepts of the GRAS API
Agents (acting entities)
I Code (C function)
I Private data
I Location (hosting computer)

Sockets (communication endpoints)


I Server socket: to receive messages
I Client socket: to contact a server (and receive answers)

Messages (what gets exchanged between agents)


I Semantic: Message type
I Payload described by data type description (fixed for a given type)

Callbacks (code to execute when a message is received)


I Also possible to explicitly wait for given messages
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 115/130
Emulation and Virtualization

Same code runs without modification both in simulation and in situ


I In simulation, agents run as threads within a single process
I In situ, each agent runs within its own process
⇒ Agents are threads, which can run as separate processes

Emulation issues
I How to get the process sleeping? How to get the current time?
I System calls are virtualized: gras_os_time, gras_os_sleep
I How to report computation time into the simulator?
I Asked explicitly by user, using provided macros
I Time to report can be benchmarked automatically
I What about global data?
I Agent status placed in a specific structure, ad-hoc manipulation API
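A small hedged sketch of the virtualized calls named above (they resolve to simulated time in simulation and to system time in situ):

double t0 = gras_os_time();   /* current time, virtual or real */
gras_os_sleep(0.5);           /* sleep half a second, virtual or real */
INFO1("Elapsed: %.3f s", gras_os_time() - t0);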

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 116/130
Example of code: ping-pong (1/2)
Code common to client and server
#include "gras.h"
XBT_LOG_NEW_DEFAULT_CATEGORY(test,"Messages specific to this example" );
static void register_messages(void) {
gras_msgtype_declare("ping", gras_datadesc_by_name("int" ));
gras_msgtype_declare("pong", gras_datadesc_by_name("int" ));
}

Client code
int client(int argc,char *argv[ ]) {
gras_socket_t peer = NULL, from;
int ping=1234, pong;

gras_init(&argc, argv);
gras_os_sleep(1); /* Wait for the server startup */
peer=gras_socket_client("127.0.0.1",4000);
register_messages();

gras_msg_send(peer, "ping", &ping);


INFO3("PING(%d) -> %s:%d",ping, gras_socket_peer_name(peer), gras_socket_peer_port(peer));
gras_msg_wait(6000,"pong",&from,&pong);

gras_exit();
return 0;
}

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 117/130
Example of code: ping-pong (2/2)
Server code
typedef struct { /* Global private data */
int endcondition;
} server_data_t;
int server (int argc,char *argv[ ]) {
server_data_t *globals;
gras_init(&argc,argv);
globals = gras_userdata_new(server_data_t);
globals->endcondition=0;
gras_socket_server(4000);
register_messages();
gras_cb_register("ping", &server_cb_ping_handler);

while (!globals->endcondition) { /* Handle messages until our state change */


gras_msg_handle(600.0); /* Actually, one ping is enough for that */
}
free(globals); gras_exit(); return 0;
}
int server_cb_ping_handler(gras_msg_cb_ctx_t ctx, void *payload_data) {
server_data_t *globals = (server_data_t*)gras_userdata_get(); /* Get the globals */
globals->endcondition = 1;

int msg = *(int*) payload_data; /* What’s the content? */


gras_socket_t expeditor = gras_msg_cb_ctx_from(ctx); /* Who sent it?*/
/* Send data back as payload of a pong message to the ping’s expeditor */
gras_msg_send(expeditor, "pong", &msg);
return 0;
}
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 118/130
Exchanging structured data
GRAS wire protocol: NDR (Native Data Representation)
Avoid data conversion when possible:
I Sender writes data on socket as they are in memory
I If receiver’s architecture does match, no conversion
I Receiver able to convert from any architecture

GRAS message payload can be any valid C type


I Structure, enumeration, array, pointer, . . .
I Classical garbage collection algorithm to deep-copy it
I Cycles in pointed structures detected & recreated
Describing a data type to GRAS

Manual description (excerpt):
  gras_datadesc_type_t gras_datadesc_struct(name);
  gras_datadesc_struct_append(struct_type, name, field_type);
  gras_datadesc_struct_close(struct_type);

Automatic description of a vector:
  GRAS_DEFINE_TYPE(s_vect,
  struct s_vect {
    int cnt;
    double *data GRAS_ANNOTE(size,cnt);
  }
  );
  (the C declaration is stored into a char* variable, parsed at runtime)

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 119/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentation in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Motivation and project goals
Functionalities
Experimental evaluation (performance and simplicity)
Conclusion and Perspectives
Conclusion
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 120/130
Assessing communication performance

Only communication performance is studied, since computations are not mediated


I Experiment: timing ping-pong of structured data (a message of Pastry)
typedef struct {
  int which_row;
  int row[COLS][MAX_ROUTESET];
} row_t;

typedef struct {
  int id, row_count;
  double time_sent;
  row_t *rows;
  int leaves[MAX_LEAFSET];
} welcome_msg_t;

I Tested solutions
I GRAS
I PBIO (uses NDR)
I OmniORB (classical CORBA solution)
I MPICH (classical MPI solution)
I XML (Expat parser + handcrafted communication)
I Platform: x86, PPC, sparc (all under Linux)

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 121/130
Performance on a LAN
[Figure: 3x3 grid of log-scale bar charts of ping-pong round-trip times for every sender/receiver pair among ppc, sparc and x86 Linux hosts, comparing GRAS, MPICH, OmniORB, PBIO and XML; measured times range from 0.5ms to 55.7ms, with "n/a" where a combination is not supported]
I MPICH twice as fast as GRAS, but cannot mix little- and big-endian Linux hosts
I PBIO broken on PPC
I XML much slower (extra conversions + verbose wire encoding)
GRAS is the best compromise between performance and portability
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 122/130
Assessing API simplicity
Experiment: ran code complexity measurements on code for previous experiment

                              GRAS  MPICH  PBIO  OmniORB  XML
McCabe Cyclomatic Complexity     8     10    10       12   35
Number of lines of code         48     65    84       92  150

Results discussion
I XML complexity may be an artifact of the Expat parser (the fastest one, though)
I MPICH: manual marshaling/unmarshalling
I PBIO: automatic marshaling, but manual type description
I OmniORB: automatic marshaling, IDL as type description
I GRAS: automatic marshaling & type description (IDL is C)

Conclusion
GRAS is the least demanding solution from developer perspective
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 123/130
Conclusion: GRAS eases infrastructure development

Research & Development

[Figure: through the GRAS API, the same code runs either on GRDK (the SimGrid stack: SimDag, MSG, SMPI and GRAS over SMURF, SimIX, SURF and XBT) or on GRE, the GRAS in situ runtime]

GRDK: Grid Research & Development Kit


I API for (explicitly) distributed applications
I Study applications in the comfort of the simulator
GRE: Grid Runtime Environment
I Efficient: twice as slow as MPICH, faster than OmniORB, PBIO, XML
I Portable: Linux (11 CPU archs); Windows; Mac OS X; Solaris; IRIX; AIX
I Simple and convenient:
I API simpler than classical communication libraries (+XBT tools)
I Easy to deploy: ANSI C; no dependency; autotools; <400kb
SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 124/130
GRAS perspectives

Future work on GRAS


I Performance: type precompilation, communication taming and compression
I GRASPE (GRAS Platform Expander) for automatic deployment
I Model-checking as third mode along with simulation and in-situ execution

Ongoing applications
I Comparison of P2P protocols (Pastry, Chord, etc)
I Use emulation mode to validate SimGrid models
I Network mapper (ALNeM): capture platform descriptions for simulator
I Large scale mutual exclusion service

Future applications
I Platform monitoring tool (bandwidth and latency)
I Group communications & RPC; Application-level routing; etc.

SimGrid for Research on Large-Scale Distributed Systems Using SimGrid for Practical Grid Experiments 125/130
Agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Tools for Experimentation in Large-Scale Distributed Systems
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Experimental Validation of the Simulation Models
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
GRAS: Developing and Debugging Real Applications
Conclusion

SimGrid for Research on Large-Scale Distributed Systems Conclusion 126/130


Conclusions on Distributed Systems Research
Research on Large-Scale Distributed Systems
I Reflection on common methodologies is needed (reproducible results)
I Purely theoretical work is limited (simplistic settings; NP-complete problems)
I Real-world experiments are time- and labor-consuming, with limited representativeness
I Simulation is appealing, provided its results remain validated

Simulating Large-Scale Distributed Systems


I Packet-level simulators too slow for large-scale studies
I Large number of ad-hoc simulators, but of debatable validity
I Coarse-grain modeling of TCP flows is possible (cf. the networking community)
I Model instantiation (platform mapping or generation) remains challenging

SimGrid provides interesting models


I Implements non-trivial coarse-grain models for resources and sharing
I Validity results encouraging with regard to packet-level simulators
I Several orders of magnitude faster than packet-level simulators
I Several models available; ability to plug in new ones or use packet-level simulation

SimGrid for Research on Large-Scale Distributed Systems Conclusion 127/130


SimGrid provides several user interfaces
SimDag: Comparing Scheduling Heuristics for DAGs of (parallel) tasks
I Declare tasks and their precedences, schedule them on resources, get the makespan
MSG: Comparing Heuristics for Concurrent Sequential Processes
I Declare independent agents running a given function on an host
I Let them exchange and execute tasks
I Easy interface, rapid prototyping
I New in SimGrid v3.3: Java bindings for MSG
GRAS: Developing and Debugging Real Applications
I Develop once, run in simulation or in situ (debug; test on non-existent platforms)
I Resulting application twice as slow as MPICH, faster than OmniORB
I Highly portable and easy to deploy
Other interfaces coming
I SMPI: Simulate MPI applications
I BSP model, OpenMP?
SimGrid for Research on Large-Scale Distributed Systems Conclusion 128/130
SimGrid is an active and exciting project

Future Plans
I Improve usability (statistics tools, campaign management)
I Extreme scalability for P2P
I Model-checking of GRAS applications
I Emulation solution à la MicroGrid

[Figure: the SimGrid software stack again — SimDag, MSG, SMPI and GRAS over SMURF, SimIX ("POSIX-like" API on a virtual platform), SURF (virtual platform simulator) and XBT, with GRE for GRAS in situ]

Large community
http://gforge.inria.fr/projects/simgrid/
I 130 subscribers to the user mailing list (40 to -devel)
I 40 scientific publications using the tool for their experiments
I 15 co-signed by one of the core-team members
I 25 purely external
I LGPL, 120,000 lines of code (half for examples and regression tests)
I Examples, documentation and tutorials on the web page

Use it in your work!


SimGrid for Research on Large-Scale Distributed Systems Conclusion 129/130
Detailed agenda
Experiments for Large-Scale Distributed Systems Research
Methodological Issues
Main Methodological Approaches
Real-world experiments
Simulation
Tools for Experimentation in Large-Scale Distributed Systems
Possible designs
Experimentation platforms: Grid’5000 and PlanetLab
Emulators: ModelNet and MicroGrid
Packet-level Simulators: ns-2, SSFNet and GTNetS
Ad-hoc simulators: ChicagoSim, OptorSim, GridSim, . . .
Peer to peer simulators
SimGrid
Resource Models in SimGrid
Analytic Models Underlying SimGrid
Modeling a Single Resource
Multi-hop Networks
Resource Sharing
Experimental Validation of the Simulation Models
Single link
Dumbbell
Random platforms
Simulation speed
Platform Instantiation
Platform Catalog
Synthetic Topologies
Using SimGrid for Practical Grid Experiments
Overview of the SimGrid Components
SimDag: Comparing Scheduling Heuristics for DAGs
MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use
Java bindings
A Glance at SimGrid Internals
Performance Results
GRAS: Developing and Debugging Real Applications
Motivation and project goals
Functionalities
Experimental evaluation (performance and simplicity)
Conclusion and Perspectives
Conclusion
SimGrid for Research on Large-Scale Distributed Systems Conclusion 130/130
