SimGrid Tutorial
Martin Quinson (Nancy University, France), Arnaud Legrand (CNRS, Grenoble University, France), Henri Casanova (University of Hawai'i at Manoa, USA). Presented by: Pedro Velho (Grenoble University, France)
[email protected]
Agenda
Distributed Systems Experiments
  Methodological Issues
  Main Methodological Approaches
  Tools for Experimentations in Large-Scale Distributed Systems
Resource Models
  Analytic Models Underlying SimGrid
SimGrid Architecture and Features
  Overview of the SimGrid Components
  SimDag: Comparing Scheduling Heuristics for DAGs
  MSG: Comparing Heuristics for Concurrent Sequential Processes
  GRAS: Developing and Debugging Real Applications
  SMPI: Running MPI applications on top of SimGrid
Conclusion
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches
Real-world experiments Simulation
Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
Analytical or Experimental?
Analytical works?
Some purely mathematical models exist
They allow a better understanding of principles in spite of their dubious applicability
  (impossibility theorems, parameter influence, . . . )
Simulation
Simulation solves these difficulties:
  No need to build a real system, nor the full-fledged application
  Ability to conduct controlled and repeatable experiments
  (Almost) no limits to experimental scenarios
  Possible for anybody to reproduce results
Simulation in a nutshell:
  Predict aspects of the behavior of a system using an approximate model of it
  Model: set of objects defined by a state, plus rules governing the state evolution
  Simulator: program computing the evolution according to the rules
Wanted features:
  Accuracy: correspondence between simulation and real world
  Scalability: actually usable by computers (fast enough)
  Tractability: actually usable by human beings (simple enough to understand)
  Instantiability: can actually describe real settings (no magical parameter)
Networking
A few established packet-level simulators: NS-2, DaSSF, OMNeT++, GTNetS
Well-known datasets for network topologies
Well-known generators of synthetic topologies
SSF standard: http://www.ssfnet.org/
Possible to reproduce simulation results
Of 141 P2P simulation papers, 30% use a custom tool and 50% don't report the tool used
Dynamicity
Quantitative: resource sharing, availability variation
Qualitative: resources come and go (churn)
Complexity
Hierarchical systems: grids of clusters of multi-processors being multi-cores
Resource sharing: network contention, QoS, batches
Multi-hop networks, non-negligible latencies
Middleware overhead (or optimizations)
Interference of computation and communication (and disk, memory, etc.)
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems
Possible designs Experimentation platforms: Grid5000 and PlanetLab Emulators: ModelNet and MicroGrid Packet-level Simulators: ns-2, SSFNet and GTNetS Ad-hoc simulators: ChicagoSim, OptorSim, GridSim, . . . Peer to peer simulators SimGrid
Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
Model objects:
Evaluated application: does actions, i.e. stimuli to the platform
Resources (network, CPU, disk): constitute the platform, react to stimuli
  The application is blocked until its actions are done
  Resources can sometimes do actions themselves, to represent external load
CPU (from more to less abstract):
  Macroscopic: flows of operations in the CPU pipelines
  Microscopic: cycle-accurate simulation (fine-grain discrete-event simulation)
  Emulation: virtualization via another CPU / virtual machine
Applications
Macroscopic: application = analytical flow
Less macroscopic: set of abstract tasks with resource needs and dependencies
  (coarse-grain discrete-event simulation; application specification or pseudo-code API)
Direct execution: no experimental bias (?); experimental settings fixed (between hardware upgrades), but not controllable
Virtualization allows sandboxing, but no control over the experimental settings
Emulation can have high overheads (but captures the overhead)
Discrete-event simulation is slow, but hopefully accurate
To scale, you have to trade speed for accuracy
Grid'5000
Applications not modified, direct execution
Environment controlled, experiments repeatable
Relative scalability (only 1500-4000 nodes)
PlanetLab (consortium)
Open platform for developing, deploying, and accessing planetary-scale services
Planetary-scale: 852 nodes, 434 sites, >20 countries
Distributed virtualization: each user can get a slice of the platform
Unbundled management: local behavior defined per node; network-wide behavior provided by services; multiple competing services run in parallel (shared, unprivileged interfaces)
As unstable as the real world
Demonstrates the feasibility of P2P applications or middleware
No reproducibility!
ModelNet (UCSD/Duke)
Applications
Emulation and virtualization: actual code executed on virtualized resources
Key tradeoff: scalability versus accuracy
MicroGrid (UCSD)
Applications
Application supported by emulation and virtualization
Actual application code is executed on virtualized resources
Accounts for CPU and network
Packet-level simulators
ns-2: the most popular one
Several protocols (TCP, UDP, . . . ), several queuing models (DropTail, RED, . . . )
Several application models (HTTP, FTP), wired and wireless networks
Written in C++, configured using Tcl
Limited scalability (< 1,000 nodes)
Many simulators. Most are home-made and short-lived; some are released.
ChicSim: designed for the study of data replication (Data Grids), built on ParSec
Ranganathan, Foster, Decoupling Computation and Data Scheduling in Distributed Data-Intensive Applications, HPDC02.
PeerSim, P2PSim, . . .
The peer-to-peer community also has its own collection of simulators: focused on P2P protocols; main challenge = scale
P2PSim: multi-threaded discrete-event simulator. Constant communication time. Alpha release (April 2005)
http://pdos.csail.mit.edu/p2psim/
PlanetSim: multi-threaded discrete-event simulator. Constant communication time. Last release in 2006
http://planet.urv.es/trac/planetsim/wiki/PlanetSim
PeerSim: designed for epidemic protocols. Processes = state machines. Two simulation modes: cycle-based (time is discrete) or event-based. Resources are not modeled. 1.0.3 release (December 2007)
http://peersim.sourceforge.net/
SimGrid
History
Created just like other home-made simulators (only a bit earlier ;)
Original goal: scheduling research. Need for speed (parameter sweep), but accuracy not negligible (the HPC community is concerned by performance)
SimGrid in a Nutshell
Simulation of communicating processes performing computations
Key feature: blend of mathematical simulation and coarse-grain discrete-event simulation
Resources: defined by a rate (MFlop/s or Mb/s) + a latency
Also allows dynamic traces and failures
Simulation Validation
Crux of simulation works
Validation is difficult; almost never done convincingly (not specific to CS: other sciences have the same issue)
Argue that trends are respected (absolute values may be off): useful to compare algorithms/designs
Conduct an extensive verification campaign against real-world settings
For FLASH, the simple simulator was all that was needed. . .
Gibson, Kunz, Ofelt, Heinrich, FLASH vs. (Simulated) FLASH: Closing the Simulation Loop, Architectural Support for Programming Languages and Operating Systems, 2000
Conclusion
Large-Scale Distributed System Research is Experimental
Analytical models are too limited
Real-world experiments are hard and limited
Most of the literature relies on simulation
Conclusion
Claim: SimGrid may prove helpful to your research
User community much larger than the contributors group
Used in several communities (scheduling, GridRPC, HPC infrastructure, P2P)
Model limits known thanks to validation studies
Easy to use, extensible, fast to execute
Around for almost 10 years
Implementation overview
SimGrid architecture Features
Main limitations
Tool performance and scalability
Hands On
Scheduling algorithm experiences
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
Resource Models
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid
Modeling a Single Resource Multi-hop Networks Resource Sharing
SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
Simulation speed:
Users conduct large parameter-sweep experiments over alternatives
Resources work at a given rate (B, in MFlop/s or Mb/s); each use has a given latency (L, in s)
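With this model, an action of size S (MFlop or Mb) that has the resource to itself completes in

  T(S) = L + S / B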
Application to networks
Turns out to be inaccurate for TCP: B is not constant, but depends on RTT, packet loss ratio, window size, etc.
Several models were proposed in the literature
B = min( Wmax / RTT ,  1 / ( RTT·sqrt(2bp/3) + T0 · min(1, 3·sqrt(3bp/8)) · p · (1 + 32p²) ) )
  (Padhye et al.'s steady-state TCP throughput model)
RTT: round trip time;  Wmax: maximal window size;  p: packet loss ratio
b: number of packets acknowledged per ACK;  T0: TCP average retransmission timeout value
Model discussion
Captures TCP congestion control (fast retransmit and timeout mechanisms)
Assumes steady state (no slow start)
Accuracy shown to be good over a wide range of values
p and b not known in general (model hard to instantiate)
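For illustration, a direct C transcription of the formula above (my own helper function, not part of SimGrid; the variable names are mine):

  #include <math.h>

  /* Steady-state TCP throughput (segments per second) following the formula above.
   * rtt: round-trip time (s), p: packet loss ratio, b: packets acknowledged per ACK,
   * t0: average retransmission timeout (s), wmax: maximal window size (segments). */
  double tcp_bandwidth(double rtt, double p, double b, double t0, double wmax)
  {
    double loss_term = rtt * sqrt(2.0 * b * p / 3.0)
                     + t0 * fmin(1.0, 3.0 * sqrt(3.0 * b * p / 8.0)) * p * (1.0 + 32.0 * p * p);
    double window_limited = wmax / rtt;      /* what the window allows             */
    double loss_limited   = 1.0 / loss_term; /* what congestion control allows     */
    return fmin(window_limited, loss_limited);
  }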
[Figure: a multi-hop route p_{i,j} crossing links l1, l2 and l3]
Notations
L: set of links; C_l: capacity of link l (C_l > 0); n_l: number of flows using link l
F: set of flows; a flow f is a subset of L (f ∈ P(L)); ρ_f: transfer rate of f
Feasibility constraint
Links deliver at most their capacity:  ∀ l ∈ L,  Σ_{f ∋ l} ρ_f ≤ C_l
Max-Min Fairness
Objective function: maximize min_{f ∈ F} (ρ_f)
Equilibrium reached when increasing any ρ_f would require decreasing some ρ_f' with ρ_f' ≤ ρ_f
Very reasonable goal: gives a fair share to everyone
Optionally, one can add priorities w_f for each flow f, maximizing min_{f ∈ F} (w_f ρ_f)
Bottleneck links
For each flow f, one of its links is the limiting one, l (with more available on that link l, the flow f would get more overall)
The objective function gives that l is saturated, and that f gets the biggest share on it:
  ∀ f ∈ F, ∃ l ∈ f,  Σ_{f' ∋ l} ρ_f' = C_l  and  ρ_f = max { ρ_f' , f' ∋ l }
L. Massoulié and J. Roberts, Bandwidth sharing: objectives and algorithms, IEEE/ACM Trans. Netw., vol. 10, no. 3, pp. 320-328, 2002.
Bucket-filling algorithm
Set the bandwidth of all flows to 0
Increase the bandwidth of every flow by ε. And again, and again, and again.
When one link is saturated, all flows using it are limited (removed from the set)
Loop until all flows have found a limiting link
Efficient algorithm
1. Search for the bottleneck link l such that: C_l / n_l = min { C_k / n_k , k ∈ L }
2. For every flow f using l, set ρ_f = C_l / n_l; update all n_k and C_k to remove these flows
3. Loop until all ρ_f are fixed
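A compact C sketch of this sharing computation (my own illustration on a hypothetical 3-link, 3-flow setting, not SURF's actual lmm solver):

  #include <stdio.h>

  #define NLINKS 3
  #define NFLOWS 3

  int main(void)
  {
    double C[NLINKS] = {100.0, 100.0, 100.0};            /* remaining link capacities */
    int uses[NFLOWS][NLINKS] = {{1,1,1}, {1,0,0}, {0,0,1}}; /* uses[f][l]: f crosses l  */
    double rho[NFLOWS] = {0};
    int fixed[NFLOWS] = {0};
    int remaining = NFLOWS;

    while (remaining > 0) {
      /* 1. find the bottleneck link: smallest fair share C_l / n_l */
      int best = -1; double share = 0;
      for (int l = 0; l < NLINKS; l++) {
        int n = 0;
        for (int f = 0; f < NFLOWS; f++) if (!fixed[f] && uses[f][l]) n++;
        if (n == 0) continue;
        if (best == -1 || C[l] / n < share) { best = l; share = C[l] / n; }
      }
      if (best == -1) break;
      /* 2. fix every remaining flow crossing that link, update the capacities */
      for (int f = 0; f < NFLOWS; f++) {
        if (fixed[f] || !uses[f][best]) continue;
        rho[f] = share; fixed[f] = 1; remaining--;
        for (int l = 0; l < NLINKS; l++) if (uses[f][l]) C[l] -= share;
      }
    }
    for (int f = 0; f < NFLOWS; f++) printf("rho_%d = %g\n", f, rho[f]);
    return 0;
  }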
Example: homogeneous linear network with two links of capacity C; flow 0 crosses both links, flow 1 uses link 1 only, flow 2 uses link 2 only (n_1 = 2, n_2 = 2)
All links have the same capacity C, and each of them is limiting. Let's choose link 1
  ρ_0 = C/2 and ρ_1 = C/2
Remove flows 0 and 1; update the links' capacities
Link 2 then sets ρ_2 = C/2
We're done computing the bandwidth allocated to each flow
Example with heterogeneous links: C_0 = 1 and C_1 = C_2 = C_3 = C_4 = 1000, with n_0 = 1, n_1 = 1, n_2 = 2, n_3 = 1, n_4 = 1
The limiting link is link 0, since C_0 / n_0 = min_k (C_k / n_k); this fixes ρ_2 = 1
The limiting link is then link 2; this fixes ρ_1 = 999
Done. We know ρ_1 and ρ_2
Proportional Fairness
Max-Min validity limits
Max-Min gives a fair share to everyone. Reasonable, but TCP does not do so
Congestion mechanism: Additive Increase, Multiplicative Decrease (AIMD)
Complicates modeling, as shown in the literature
Proportional Fairness
Max-Min gives more to long flows (resource-eager); TCP is known to do the opposite
Objective function: maximize Σ_{f ∈ F} w_f log(ρ_f)
Compute the point {ρ_f} where the derivative is zero (convex optimization); use Lagrange multipliers and steepest gradient descent
On the linear network example:  ρ_0 = C / (n+1)  and  ∀ l ≠ 0, ρ_l = C·n / (n+1)
I.e., for C = 100 Mb/s and n = 3: ρ_0 = 25 Mb/s and ρ_1 = ρ_2 = ρ_3 = 75 Mb/s. Closer to practitioner expectations
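The C/(n+1) value can be re-derived directly (my own derivation, assuming unit weights and the linear network where flow 0 crosses all n links of capacity C while flow l only crosses link l):

  \text{At the optimum every link is saturated, so } \rho_l = C - \rho_0
  \text{ and the objective becomes } \max_{\rho_0}\; \log\rho_0 + n\,\log(C - \rho_0).
  \text{Setting the derivative to zero: }
  \frac{1}{\rho_0} - \frac{n}{C - \rho_0} = 0
  \;\Longrightarrow\; C - \rho_0 = n\,\rho_0
  \;\Longrightarrow\; \rho_0 = \frac{C}{n+1},\qquad \rho_l = \frac{n\,C}{n+1}.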
Other TCP flavors can be captured with other utility functions (e.g. involving arctan(ρ_f)); the same update scheme applies
Low, S.H., A Duality Model of TCP and Queue Management Algorithms, IEEE/ACM Transactions on Networking, 2003.
Want more?
network model:gtnets: use the Georgia Tech Network Simulator for the network
Accuracy of a packet-level network simulator without changing your code (!)
Plug your own model in SimGrid!
(usable as a scientific instrument in the TCP modeling field, too)
[Figure: Gantt-chart view of the actions progressing on the resources along the simulated time]
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
MSG: simple application-level simulator
GRAS: framework to develop and debug real distributed applications
AMOK: toolkit built on top of GRAS
SMPI: library to run MPI applications on top of a virtual environment
XBT: grounding features (logging, etc.), usual data structures (lists, sets, etc.) and portability layer
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
Main functionalities
1. Create a DAG of tasks
Vertices: tasks (either communication or computation); edges: precedence relations
Scheduling tasks
SD_task_schedule(task, workstation_number, *workstation_list, double *comp_amount, double *comm_amount, double rate)
Tasks are parallel by default; simply set workstation_number to 1 if not
Communications are regular tasks; comm_amount is a matrix
Both computation and communication are possible in the same task
rate: to slow down non-CPU (resp. non-network) bound applications
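A minimal sketch of how these calls fit together (API names as recalled from SimDag, so treat the exact signatures as assumptions; platform.xml is a hypothetical platform file defining at least two workstations):

  #include "simdag/simdag.h"

  int main(int argc, char **argv)
  {
    SD_init(&argc, argv);
    SD_create_environment("platform.xml");

    const SD_workstation_t *ws = SD_workstation_get_list();
    SD_task_t t1 = SD_task_create("comp1", NULL, 1e9);   /* 1 GFlop of work      */
    SD_task_t t2 = SD_task_create("comp2", NULL, 1e9);
    SD_task_dependency_add("t1->t2", NULL, t1, t2);       /* precedence edge      */

    double comp = 1e9, comm = 0.0;
    /* sequential tasks: workstation_number = 1 */
    SD_task_schedule(t1, 1, &ws[0], &comp, &comm, -1.0);
    SD_task_schedule(t2, 1, &ws[1], &comp, &comm, -1.0);

    SD_simulate(-1.0);                                    /* run until completion */
    SD_exit();
    return 0;
  }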
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use Java bindings A Glance at SimGrid Internals Performance Results
GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
sprintf(mailbox, "worker-%d", id);
while (1) {
  errcode = MSG_task_receive(&task, mailbox);
  xbt_assert0(errcode == MSG_OK, "MSG_task_get failed");
  if (!strcmp(MSG_task_get_name(task), "finalize")) {
    MSG_task_destroy(task);
    break;
  }
  INFO1("Processing %s", MSG_task_get_name(task));
  MSG_task_execute(task);
  INFO1("%s done", MSG_task_get_name(task));
  MSG_task_destroy(task);
}
INFO0("I'm done. See you!");
return 0;
}
/* Dispatching (dumb round-robin algorithm) */
for (i = 0; i < number_of_tasks; i++) {
  sprintf(sprintf_buffer, "Task_%d", i);
  task = MSG_task_create(sprintf_buffer, task_comp_size, task_comm_size, NULL);
  sprintf(mailbox, "worker-%d", i % workers_count);
  INFO2("Sending \"%s\" to mailbox \"%s\"", task->name, mailbox);
  MSG_task_send(task, mailbox);
}

/* Send finalization message to workers */
INFO0("All tasks dispatched. Let's stop the workers");
for (i = 0; i < workers_count; i++) {
  sprintf(mailbox, "worker-%d", i);
  MSG_task_send(MSG_task_create("finalize", 0, 0, 0), mailbox);
}
INFO0("Goodbye now!");
return 0;
}
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use Java bindings A Glance at SimGrid Internals Performance Results
GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
import simgrid.msg.*;

public class Master extends simgrid.msg.Process {
  public void main(String[] args) throws JniException, NativeException {
    int numberOfTasks = Integer.valueOf(args[0]).intValue();
    double taskComputeSize = Double.valueOf(args[1]).doubleValue();
    double taskCommunicateSize = Double.valueOf(args[2]).doubleValue();
    int workerCount = Integer.valueOf(args[3]).intValue();
    Msg.info("Got " + workerCount + " workers and " + numberOfTasks + " tasks.");

    for (int i = 0; i < numberOfTasks; i++) {
      BasicTask task = new BasicTask("Task_" + i, taskComputeSize, taskCommunicateSize);
      task.send("worker-" + (i % workerCount));
      Msg.info("Send completed for the task " + task.getName()
               + " on the mailbox worker-" + (i % workerCount));
    }
    Msg.info("Goodbye now!");
  }
}
What about performance loss?

tasks \ workers        100     500   1,000   5,000  10,000
1,000       native     .16     .19     .21     .42    0.74
            java       .41     .59     .94     7.6     27.
10,000      native     .48     .52     .54     .83     1.1
            java       1.6     1.9    2.38     13.     40.
100,000     native     3.7     3.8     4.0     4.4     4.5
            java       14.     13.     15.     29.     77.
1,000,000   native     36.     37.     38.     41.     40.
            java      121.    130.    134.    163.    200.
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use Java bindings A Glance at SimGrid Internals Performance Results
GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
Example
Thread A: send "toto" to B; receive something from B
Thread B: receive something from A; send "blah" to A
Maestro (simulation kernel): decides who runs next
[Figure: SimGrid layers: SimDag, MSG and GRAS on top of SimIX (POSIX-like API on a virtual platform), on top of SURF (virtual platform simulator), on top of XBT]
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use Java bindings A Glance at SimGrid Internals Performance Results
GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
10,000: 1.97;  100,000: 5.5;  1,000,000: 41.  (some entries: out of memory)
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications
Motivation and project goals Functionalities Experimental evaluation (performance and simplicity) Conclusion and Perspectives
Goals of the GRAS project (Grid Reality And Simulation): ease the development of large-scale distributed apps
Development of real distributed applications using a simulator
[Figure: without GRAS, the research code must be rewritten into the development code; with GRAS, the same code is used both for simulation and for the real application]
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications
Motivation and project goals Functionalities Experimental evaluation (performance and simplicity) Conclusion and Perspectives
Emulation issues
How to get the process sleeping? How to get the current time?
System calls are virtualized: gras_os_time, gras_os_sleep
Client code
int client(int argc, char *argv[]) {
  gras_socket_t peer = NULL, from;
  int ping = 1234, pong;

  gras_init(&argc, argv);
  gras_os_sleep(1); /* Wait for the server startup */
  peer = gras_socket_client("127.0.0.1", 4000);
  register_messages();
  gras_msg_send(peer, "ping", &ping);
  INFO3("PING(%d) -> %s:%d", ping, gras_socket_peer_name(peer), gras_socket_peer_port(peer));
  gras_msg_wait(6000, "pong", &from, &pong);
  gras_exit();
  return 0;
}
Automatic (the type description is the C declaration):

  GRAS_DEFINE_TYPE(s_vect,
    struct s_vect {
      int cnt;
      double *data GRAS_ANNOTE(size, cnt);
    }
  );

Manual (calling the data description API):

  gras_datadesc_type_t gras_datadesc_struct(name);
  gras_datadesc_struct_append(struct_type, name, field_type);
  gras_datadesc_struct_close(struct_type);
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications
Motivation and project goals Functionalities Experimental evaluation (performance and simplicity) Conclusion and Perspectives
Tested solutions
GRAS
PBIO (uses NDR)
OmniORB (classical CORBA solution)
MPICH (classical MPI solution)
XML (Expat parser + handcrafted communication)
Performance on a LAN
[Figure: message exchange times on a LAN for each sender/receiver architecture pair (ppc, sparc, x86) and each solution, on a log scale ranging from about 0.5 ms to 40 ms; some PBIO and XML entries are n/a]
MPICH twice as fast as GRAS, but cannot mix little- and big-endian Linux hosts
PBIO broken on PPC
XML much slower (extra conversions + verbose wire encoding)
Results discussion
XML complexity may be an artifact of the Expat parser (but it is the fastest one)
MPICH: manual marshaling/unmarshaling
PBIO: automatic marshaling, but manual type description
OmniORB: automatic marshaling, IDL as type description
GRAS: automatic marshaling & type description (the IDL is C)
[Figure: GRAS big picture: the user code, the GRAS API, GRDK and GRE (GRAS in situ), on top of SimIX (POSIX-like API on a virtual platform), SURF (virtual platform simulator) and XBT]
Ongoing applications
Comparison of P2P protocols (Pastry, Chord, etc.)
Use emulation mode to validate SimGrid models
Network mapper (ALNeM): capture platform descriptions for the simulator
Large-scale mutual exclusion service
Future applications
Platform monitoring tool (bandwidth and latency) Group communications & RPC; Application-level routing; etc.
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
How it works
smpicc changes MPI calls into SMPI ones (gettimeofday also intercepted)
smpirun starts a classical simulation, obeying -hostfile and -np
Runs unmodified MPI code after recompilation
Implemented calls
Isend, Irecv; Recv, Send; Wait, Waitall, Waitany
Barrier; Bcast; Reduce; Allreduce (command-line option to choose binary or flat tree)
Comm_size, Comm_rank, Comm_split; Wtime; Init, Finalize, Abort
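For instance, a toy MPI program that only uses the calls listed above, and should therefore run unmodified once compiled with smpicc and launched with smpirun (my own example, not taken from the SMPI distribution):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
    int rank, size, local, sum;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    t0 = MPI_Wtime();
    local = rank + 1;                       /* each rank contributes rank+1 */
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
      printf("sum = %d over %d processes, computed in %f s\n", sum, size, t1 - t0);

    MPI_Finalize();
    return 0;
  }

Launching would presumably follow the options quoted above, e.g. smpirun -hostfile <hosts> -np 4 ./a.out (file names hypothetical).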
Future Work
Implement the rest of the API
Test it more thoroughly
Use it to validate SimGrid at the application level (with NAS et al.)
Agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches Tools for Experimentations in Large-Scale Distributed Systems Resource Models Analytic Models Underlying SimGrid SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes GRAS: Developing and Debugging Real Applications SMPI: Running MPI applications on top of SimGrid Conclusion
Conclusion
[Figure: the SimGrid stack (GRAS; GRE: GRAS in situ; SimIX: POSIX-like API on a virtual platform; SURF: virtual platform simulator; XBT) annotated with ongoing work:]
Extreme scalability for P2P
Model-checking of GRAS applications
Emulation solution à la MicroGrid
Large community
http://gforge.inria.fr/projects/simgrid/
130 subscribers to the user mailing list (40 to -devel)
40 scientific publications using the tool for their experiments
  15 co-signed by one of the core-team members; 25 purely external
LGPL, 120,000 lines of code (half for examples and regression tests)
Examples, documentation and tutorials on the web page
Detailed agenda
Distributed Systems Experiments Methodological Issues Main Methodological Approaches
Real-world experiments Simulation
Simix
Big picture
SimGrid Architecture and Features Overview of the SimGrid Components SimDag: Comparing Scheduling Heuristics for DAGs MSG: Comparing Heuristics for Concurrent Sequential Processes
Motivations, Concepts and Example of Use Java bindings A Glance at SimGrid Internals Performance Results
Conclusion
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work
Motivation
GRAS allows one to debug an application on the simulator and deploy it when it works
Problem: when to decide that it works?
Demonstrate a theorem: conversion to C is difficult
Test: covers some cases, but may still fail on other cases
Model-checking
Given an initial situation (we have three nodes), test all possible executions (A gets the first message first, B does, C does, . . . )
Combinatorial search in the tree of possibilities
Fight combinatorial explosion: cycle detection, symmetry, abstraction
Model-checking in GRAS
First difficulty: checkpoint simulated processes (to rewind the simulation)
Induced difficulty: decide when to checkpoint processes
Second difficulty: fight against combinatorial explosion
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work
Formal methods
Proof of programs
In theory, applicable to any class of program
In practice, quite tedious to use; often limited to helping a specialist do the actual work (system state explosion)
Model-checking
Shows that a system:
(safety) never evolves to a faulty state (e.g., breaking) from a given initial state
(liveness) always evolves to the wanted state (e.g., stopping) from a given state
Less generic than proof: absence of faulty states for all initial states?
Usable by non-specialists (at least, by less-specialized users)
[Figure: state-space exploration example: two processes A and B interleave their steps from an initial state x: 0, reaching states such as x: 2, x: 5 and x: 7]
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work
No need to consider all possible instruction interleavings: massive state-space reduction is possible (but an open research question)
[Figure: the maestro (simulation kernel) picks which thread runs next; Thread A sends "toto" to B then receives from B, Thread B receives from A then sends "blah" to A, until both are done]
Checkpointing threads
Intercept memory allocation functions; use a special dynamic memory manager
#define malloc(a) my_malloc(a) /* and friends */
Reroute the heap into separate memory segments (with shm operations)
Also save/restore the stack (memcpy) and registers (setjmp) of the processes
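A toy sketch of the heap-snapshot part of this idea (my own illustration: a bump allocator standing in for the shm-backed segments; the real mechanism also saves the stack and registers as described above):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static char   heap_segment[1 << 20];  /* stand-in for the shm-backed segment   */
  static size_t heap_used = 0;

  static void *my_malloc(size_t size)   /* dumb bump allocator, never frees      */
  {
    void *p = heap_segment + heap_used;
    heap_used += (size + 15) & ~(size_t)15;
    return p;
  }

  #define malloc(a) my_malloc(a)        /* "and friends": free, realloc, calloc  */

  /* checkpoint / rewind = copying the whole segment */
  static char   snapshot[sizeof heap_segment];
  static size_t snapshot_used;
  static void checkpoint(void) { memcpy(snapshot, heap_segment, heap_used); snapshot_used = heap_used; }
  static void restore(void)    { memcpy(heap_segment, snapshot, snapshot_used); heap_used = snapshot_used; }

  int main(void)
  {
    int *x = malloc(sizeof *x);
    *x = 42;
    checkpoint();
    *x = 7;                             /* diverge after the checkpoint          */
    restore();                          /* rewind the simulated heap             */
    printf("%d\n", *x);                 /* prints 42 again                       */
    return 0;
  }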
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work
Current Work
Isolating each simulated process's address space
Separating the network into its own address space
Supporting the other SimGrid APIs
Future Work
Exploit heap symmetries (heap canonicalization)
Implement partial-order reductions
Verification of LTL properties
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work
SimGrid Internals
Some Numbers
[Figure: the SimGrid stack: SimDag, MSG, SMPI, SMURF (SimIX network proxy), GRAS (GRE: GRAS in situ), SimIX (POSIX-like API on a virtual platform), SURF, XBT]
v3.3 is 120k sloc (w/o blanks; with comments): core lib 47k, tests 69k, doc 3.5k, plus build
v3.2 was 55k sloc: core lib 30k, tests 21k, doc 3k, plus build
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work SimGrid Internals SURF
Big Picture Models How Models get used Actions and Resources Writing your own model Adding new kind of models
SURF
Big picture
Resources provide power; created through the XML platform file
Actions are consumptions of this power; created by upper layers
Designed with model extensibility and simulation efficiency in mind (object-oriented, but not in a canonical way)
Models
They express how resources get consumed by actions
Act as the class of action objects (heavy use of function pointers), and as their factory
Several kinds of models exist (one of each kind is in use during a simulation)
  CPU model: factory of compute and sleep actions
  Link model: factory of communicate actions
  Workstation model: aggregation pattern of CPU and link
  Routing: not exactly a model; provides the list of links between two hosts
[Class diagram:
  timer interface: void (*set)(double date, void *function, void *arg); int (*get)(void **function, void **arg)
  CPU model: surf_action new_compute(); surf_action new_sleep(); double get_cpu_speed(void *cpu); boolean get_cpu_state(void *cpu)
  link model: surf_action new_communicate(); get_link_bandwidth(void *link); get_link_latency(void *link); get_link_state(void *link)
  plus surf_routing_t]
[Figure: Gantt-chart view of the actions progressing on the resources along the simulated time]
/* Handle every event occurring before min */
while ((next_event_date = tmgr_history_next_date(history)) != -1.0) {
  if (next_event_date > NOW + min)
    break; /* no further event before min */
  /* apply events by updating the models */
  while ((evt = tmgr_history_get_next_event_leq(history, next_event_date, &value, &resource))) {
    if (resource->model->model_private->resource_used(resource))
      min = next_event_date - NOW; /* evt changes a resource currently used: change min */
    /* update the state of the model according to the event */
    resource->model->model_private->update_resource_state(resource, evt, value, NOW + min);
  }
}
NOW = NOW + min; /* Increase the simulation clock (NOW is returned by SURF_get_clock()) */

/* Ask models to update the state of the actions they are responsible for, according to the clock */
xbt_dynar_foreach(model_list, iter, model)
  model->model_private->update_actions_state(NOW, min);
The hard thing is sharing. Most models use a Linear MaxMin solver (lmm) to compute the sharing
Others bind to an external tool (GTNetS), or have no sharing at all (constant, timer)
Comes down to a linear system where actions are the variables and resources the constants
Disclaimer: I only partially understand the lmm internals for now
lmm actions
an lmm variable allowing the solver to return the share gained by that action
a boolean indicating whether it is currently suspended
This is done (usually during init) with surfxml_add_callback
  Ex: surfxml_add_callback(STag_surfxml_host_cb_list, &parse_host)
Attributes are accessible through globals: A_surfxml_<tag>_<attrname>
  Ex: sscanf(A_surfxml_host_power, "%lf", &host->power)
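Putting it together, a sketch of such a callback (only surfxml_add_callback, STag_surfxml_host_cb_list and A_surfxml_host_power are quoted from the slide; the header path, A_surfxml_host_id and my_cpu_new are assumptions made for the sake of the sketch):

  #include <stdlib.h>
  #include "surf/surfxml_parse.h"            /* assumed to expose the parser hooks   */

  void my_cpu_new(const char *name, double power);  /* hypothetical helper of our model */

  static void parse_host(void)
  {
    double power = atof(A_surfxml_host_power); /* attribute globals filled by the parser */
    my_cpu_new(A_surfxml_host_id, power);
  }

  void my_model_register_callbacks(void)
  {
    /* register before the platform file gets parsed (usually at model init time) */
    surfxml_add_callback(STag_surfxml_host_cb_list, &parse_host);
  }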
Parser callbacks
Plug your model into surf_config.c so that users can select it from the command line
Possibly update the DTD to add the new info you need at instantiation
Guidelines
Reusing the existing code is perfectly fine (everything is in there for lmm-based models; more to come)
Please do not duplicate code (as too often done until now), at least if you want your code integrated in the SVN
typedef struct {
  double current;
  double max;
  tmgr_trace_event_t event;
} s_surf_metric_t;

typedef struct {
  s_surf_resource_t generic_resource;
  lmm_constraint_t constraint;
  e_surf_resource_state_t state_current;
  tmgr_trace_event_t state_event;
  s_surf_metric_t power;
} s_surf_resource_lmm_t, *surf_resource_lmm_t;
surf_resource_t surf_resource_new(size_t childsize, surf_model_t model, char *name, xbt_dict_t props)
{
  surf_resource_t res = xbt_malloc0(childsize);
  res->model = [...]
  return res;
}

surf_resource_lmm_t surf_resource_lmm_new(size_t childsize, /* for superclass */
                                          surf_model_t model, char *name, xbt_dict_t props,
                                          /* our args */ [...])
{
  surf_resource_lmm_t res = (surf_resource_lmm_t) surf_resource_new(childsize, model, name, props);
  res->constraint = [...]
  return res;
}

link_CM02_t CM02_link_new([...])
{
  link_CM02_t res = (link_CM02_t) surf_resource_lmm_new(sizeof(s_link_CM02_t), [...]);
  [...]
}
We may be able to ease this task too (but is there a real need?)
Extending it
It is fairly easy to add new models for existing resource kinds
It is quite long (and somewhat difficult) to add new resource kinds
Future work
Ongoing cleanups, not completely finished
  Still some duplicated code
  Resource power not handled consistently across models
New models:
Highly scalable ones, in the USS-SimGrid project
Compound ones, where CPU load reduces communication abilities
Multi-core (pick your favorite one)
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work
Simix
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work
Global Elements
Simix state is contained in the following global data structure:

typedef struct SIMIX_Global {
  smx_context_factory_t context_factory;
  xbt_dict_t host;
  xbt_swag_t process_to_run;
  xbt_swag_t process_list;
  xbt_swag_t process_to_destroy;
  smx_process_t current_process;
  smx_process_t maestro_process;
  ...
};
Agenda
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work
Simix Process 1
The process is a central element in Simix. It is represented by the following data structure:

struct s_smx_process {
  ...
  char *name;
  smx_host_t smx_host;
  smx_context_t context;
  ex_ctx_t *exception;
  int blocked : 1;
  int suspended : 1;
  int iwannadie : 1;
  smx_mutex_t mutex;
  smx_cond_t cond;
  xbt_dict_t properties;
  void *data;
};
Simix Process 2
Simix keeps for each process:
  an execution context
  an exception container
  its running state (blocked, suspended, killed)
  pointers to the mutex or condition variables where the process is waiting
  user-level data provided by the user
Context
The Simix contexts are an abstraction of the execution state of a process plus an interface for controlling them.
Model-Checking within SimGrid Introduction to Model-Checking Adding Model-Checking to SimGrid Current Status and Future Work SimGrid Internals SURF
Big Picture Models How Models get used Actions and Resources Writing your own model Adding new kind of models
Simix
Big picture