A Runtime Framework for Parallel Programs
Joy Mukherjee
Dissertation submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Science and Applications
Dr. Srinidhi Varadarajan (Chair)
Dr. Naren Ramakrishnan
Dr. James D. Arthur
Dr. Calvin J. Ribbens
Dr. Scott F. Midkiff
August 16, 2006
Blacksburg, Virginia
Keywords: Parallel Programs, Legacy Procedural Codes, Lightweight Threads, Component Composition, Runtime Linking and Loading, Dynamic Adaptation.
Copyright 2006, Joy Mukherjee
A Runtime Framework for Parallel Programs
Joy Mukherjee
ABSTRACT
This dissertation proposes the Weaves runtime framework for the execution of large-scale
parallel programs over lightweight intra-process threads. The goal of the Weaves framework
is to help process-based legacy parallel programs exploit the scalability of threads without any
modifications. The framework separates global variables used by identical, but independent,
threads of legacy parallel programs without resorting to thread-based re-programming. At
the same time, it also facilitates low-overhead collaboration among threads of a legacy parallel
program through multi-granular selective sharing of global variables.
Applications that follow the tenets of the Weaves framework can load multiple identical, but
independent, copies of arbitrary object files within a single process. They can compose the
runtime images of these object files in graph-like ways and run intra-process threads through
them to realize various degrees of multi-granular selective sharing or separation of global
variables among the threads. Using direct runtime control over the resolution of individual
references to functions and variables, they can also manipulate program composition at fine
granularities. Most importantly, the Weaves framework does not entail any modifications to
either the source codes or the native codes of the object files. The framework is completely
transparent.
Results from experiments with a real-world process-based parallel application show that
the framework can correctly execute a thousand parallel threads containing non-threadsafe
global variables on a single machine—nearly twice as many as the traditional process-based
approach can—without any code modifications. On increasing the number of machines, the
application experiences super-linear speedup, which illustrates scalability. Results from another similar application, chosen from a different software area to emphasize the breadth
of this research, show that the framework’s facilities for low-overhead collaboration among parallel threads allow for significantly greater scales of achievable parallelism than technologies for inter-process collaboration do. Ultimately, larger scales of parallelism enable
more accurate software modeling of real-world parallel systems, such as computer networks
and multi-physics natural phenomena.
"I'd rather be a could-be if I cannot be an are;
because a could-be is a maybe who is reaching for a star.
I'd rather be a has-been than a might-have-been, by far;
for a might have-been has never been, but a has was once an
are."
--Milton Berle
To Dadun.
To my grandparents.
To my parents.
To Mona.
Acknowledgements
I take this opportunity to thank the people who helped me in various ways throughout the
time-span of this work:
• My advisor, Dr. Srinidhi Varadarajan, who has been a friend, a philosopher and a guide. He helped me at all stages of this work, from conceptualization to implementation.
• My Ph.D. committee—Dr. Naren Ramakrishnan, Dr. James D. Arthur, Dr. Calvin J. Ribbens, and Dr. Scott F. Midkiff—for their suggestions, support and encouragement.
• Dr. Naren Ramakrishnan, for helping me apply this work to various practical software problems and for guiding me while publishing related results.
• Dr. Calvin J. Ribbens, for his help with innumerable official issues and for his technical inputs.
• Dr. Godmar Back, for his technical suggestions.
• The GNU Compiler Collection (GCC) mailing lists.
• The staff at the Department of Computer Science. In particular, Ginger Clayton and Melanie Darden.
• Sara Thorne-Thomsen, for her help with reviews of this dissertation.
• My friends who made my stay a mix of more fun and less work: Omprakash Seresta, Bharath Ramesh, Arvind Kumar Sharma, Ankit Singhal, Anil Bazaz, Akbar Rizvi, Deepak Bhojwani, Sayed Ali Yawar, Navrag B. Singh, Sameer Mulani, Dhaval P. Makecha, Jon Bernard, Karthik Channakeshava, Veena Basavraj, Ayush Gupta.
• My colleagues at the Computing Systems Research Laboratory (CSRL): Bharath Ramesh, Craig Bergstrom, Patrick Liesveld, Vedvyas Duggirala, Hari Krishna Pyla, Pilsung Kang, Lee B. Smith, Joe Ruscio, Chris Knestrick.
• Special thanks to Craig, Hari and Chris for allowing me to use extracts from their work.
• My family for their support and love: (Late) Dolgobinda Mukherjee and Umarani Mukherjee; (Late) Satyendra Nath Banerjee and Sumita Banerjee; Swadhin Mukherjee and Anjali Mukherjee; Swaraj Mukherjee, Gargi Mukherjee and Puja Mukherjee; Anjan Banerjee and Mamata Banerjee; Arpan Banerjee.
• The staff at The Cellar Restaurant in downtown Blacksburg.
• Lastly, my fiancée Monalisa Chatterjee, for her patience and support during the toughest stages of this work.
Joy Mukherjee
Table of Contents

1 Introduction
  1.1 Problem Synthesis
    1.1.1 Overall Goal
    1.1.2 Research Challenges
  1.2 Solution Approach
    1.2.1 Proposed Work
    1.2.2 Usability
    1.2.3 Portability
  1.3 Lateral Technological Advances
  1.4 Organization

2 Motivating Applications
  2.1 Network Emulation
    2.1.1 Network Emulation and Threads
    2.1.2 Challenges
  2.2 Parallel Scientific Computing
    2.2.1 Computational Background
    2.2.2 PDE Solvers and Threads
    2.2.3 Challenges
  2.3 Summary

3 Related Work
  3.1 Concurrent Approaches (Linda)
  3.2 Compositional Approaches (PCOM2)
  3.3 Component-based Approaches (OOP)
  3.4 Summary

4 The Weaves Framework
  4.1 Component Definitions
  4.2 Developmental Aspects of Weaved Applications
  4.3 Implementation and Preliminary Evaluation
    4.3.1 Load and Let Link (LLL): Weaves' Runtime Loader and Linker
    4.3.2 The LLL Loader
    4.3.3 The LLL Linker
    4.3.4 Strings: Continuations and Evaluation
    4.3.5 Portability
  4.4 Properties of Weaved Applications
  4.5 Summary

5 Case Studies
  5.1 Using Weaves for Network Emulation
    5.1.1 A Simple Instance
    5.1.2 Experimental Corroboration
    5.1.3 Contextual Advances
  5.2 Using Weaves for Scientific Computing
    5.2.1 A Simple Instance
    5.2.2 Experimental Corroboration
    5.2.3 Contextual Advances
    5.2.4 Configuring Weaves for HPC
  5.3 Summary

6 Concluding Remarks
  6.1 Salient Contributions
  6.2 Other Aspects
  6.3 Summary

7 Ongoing Work
  7.1 Adaptivity of Weaved Applications
  7.2 Dynamic Code Expansion
  7.3 Dynamic Code Swapping
  7.4 Dynamic Code Overlaying
  7.5 Other Aspects of Ongoing Work

Bibliography

Vita
List of Figures

2.1 The DCEE test-bed called the Open Network Emulator (ONE). The ONE needs to model thousands of simultaneously (in parallel) running real-world network applications on a single workstation.
2.2 A simple network model with 2 telnets running over a single IP stack. The composition emulates a single virtual host.
2.3 The composition shown in Figure 2.2 modeled under (a) the process-per-virtual-node model and (b) the threads model. Neither of these models can emulate the desired real-world behavior without significant changes to telnet/IP codes and/or extra overhead.
2.4 Advanced network scenarios entail selectively sharing different independent IP stacks among different sets of application (telnet, ftp) threads.
2.5 (Above) Composite multi-physics problem with six sub-domains. (Below) A network of collaborating solvers (S) and mediators (M) to solve the composite PDE problem. Each mediator is responsible for agreement along one of the interfaces.
2.6 PDEs defined over six sub-domains of the boiling mechanism shown in Figure 2.5.
2.7 Typical solver and mediator codes. A solver takes as input a PDE structure identifying the domain, operator, right side, boundary conditions, and computes solutions. A mediator accepts values from solvers, applies relaxation formulas, and returns improved boundary condition estimates to the solvers. PDE_solve and Relax_soln routines are chosen from a PSE toolbox.
2.8 Simple instance of collaborating PDE solvers. Mediator M12 relaxes solutions from solvers S1 and S2.
4.1 Components of a Weaved application: modules, weaves, strings, and the monitor. All components are intra-process runtime entities.
4.2 (a) A generic Weaved application. (b) Bootstrap pseudo-code for setting up the tapestry. (c) The corresponding configuration file.
4.3 Development of Weaved applications.
4.4 Comparison of context switch times of threads, processes, and strings. The baseline single-process application implements a calibrated delay loop of 10^7 seconds.
4.5 A sample tapestry: essentially a complete parallel application executing as a single OS process. The figure shows the individual weaves (w), their constituent modules (m), strings (s), and their composition reflecting the structure of the application as a whole. Identical shapes imply identical copies of a module. The lines connecting the modules imply external references being resolved between them.
5.1 Modeling the simple network scenario of Figure 2.2 using the Weaves framework. (a) Weaved setup of the tapestry. (b) Bootstrap pseudo-code. (c) Configuration file.
5.2 Weaved setup of the experimental network scenario. Both clients used identical real-world codes, as did the servers. The IP stacks used identical real-world codes. The two hosts were completely independent, but ran within a single OS process.
5.3 The Open Network Emulator (ONE) models thousands of simultaneously (in parallel) running real-world network nodes and applications on a single machine.
5.4 The dONE exhibits super-linear speedup when emulating real-world network nodes and applications. This figure is reproduced from [BVB06].
5.5 The simple PDE solver scenario of Figure 2.8.
5.6 A possible Weaved realization of the Figure 5.5 scenario. S1, S2, and M12 are composed into separate weaves Wv1, Wv2 and Wv3. External references from M12 are explicitly bound to definitions within S1 and S2.
5.7 An alternate Weaved realization of the Figure 5.5 scenario. (a) Weaves Wv1 and Wv2 compose M12 with S1 and S2 respectively. (b) Typical code for S1 and S2. (c) Code for M12 if it needs all solutions. (d) Code for M12 if it uses solutions as and when needed and available.
5.8 Continuations help map a single module to different weaves. (a) The tapestry setup. (b) An imaginary partial tapestry.
5.9 Weaving unmodified agent-based codes. Solver and mediator modules are composed into different weaves, but share a single thread-based MPI emulator.
5.10 Weaved setup of the experimental PDE solver (d03edfe) scenario.
5.11 Scalability of Weaved scientific applications: experimental results indicate that the Weaves framework can help applications exploit the scalability of threads without requiring modifications to traditional procedural process-based programs. The framework effects zero-overhead encapsulation of solvers.
5.12 Weaved setup of the experiment using Sweep3D solvers.
5.13 Comparison of performance results of Weaved Sweep3D against LAM-based and MPICH-based Sweep3D. The performance of the Weaved realization matched that of the LAM-based and MPICH-based realizations as long as the number of strings/processes was less than the number of processors. When the number of strings/processes was increased beyond the number of processors (8), the Weaved realization performed much better.
5.14 Relationship between the Weaves framework and (a) a problem solving environment and (b) a performance modeling framework. Advanced configurations of the Weaves framework for scientific computing: (c) Weaved scientific codes running over MPI-SIM and (d) Weaved scientific codes over Weaved MPI implementations.
7.1 (a) Normal loading and linking. (b) Weaved application linking.
7.2 Modeling network dynamics using the Weaves framework.
7.3 Dynamic code swapping using the Weaves framework.
7.4 Dynamic code pruning in memory-constrained Weaved applications.
7.5 Automatic adjustment of Weaved applications to available software infrastructure.
List of Tables

4.1 The API of the Weaves framework. Actions, inputs, and compositional issues associated with each API. Further details are mentioned under Implementation.
Chapter 1
Introduction
Parallel computing systems are becoming pervasive. At one end are parallel and distributed
systems such as clusters and distributed supercomputers and at the other are low-cost multicore personal computers. The demand for such a range of parallel computing has fueled
the development and adoption of parallel programs at all levels. As a result, software developers are incorporating parallelism into all sorts of software applications, ranging from
compute-intensive programs such as scientific simulations to day-to-day utility programs
such as browsers.
An important effect of the increased adoption of parallel systems is the programming of
many contemporary applications explicitly for parallel and distributed platforms. Software
modeling of common physical phenomena such as the weather, real-world networks such
as the Internet, and so on comprises an inherent degree of parallelism. Weather modeling
consists of the joint effect of simultaneous, or parallel, software modeling of factors such as
winds, ocean currents, the Sun, and so forth. Modeling the Internet consists of joint effects
of simultaneous, or parallel, software modeling of network applications, protocol stacks,
and so forth. Applications that model such real-world phenomena need to account for the
parallelism inherent in the corresponding physical manifestations. It is, therefore, natural
and intuitive to program such applications with explicit support for parallelism.
1.1 Problem Synthesis
Many contemporary applications require support for large-scale parallelism. In general,
applications such as weather simulations and computer network emulation produce better
results and more accurate models when they can exploit larger scales of parallelism. To a
certain extent, the logic behind this general observation is intuitive. For instance, consider
network emulation. Larger scales of parallelism can help network emulators model a greater
number of simultaneous network entities such as computers (or nodes), network applications,
protocol stacks and so forth. These increased numbers, in turn, facilitate more accurate
characterization of network dynamics such as traffic, load and topology. Ultimately, such
characterization helps stringently test new protocols.
Such applications benefit from various mechanisms that aid large-scale parallelism. Multi-machine clusters and supercomputers facilitate large-scale parallelism through increased
hardware resources. At the same time, mechanisms for better exploitation of an individual
cluster node or low-end shared-memory multi-processor (SMP) machine can also contribute
to greater parallelism. This research focuses on the increased exploitation of an individual
multi-core computer node or a single SMP machine for larger scales of parallelism.
Another element of this research is that it deals with applications that use legacy procedural
codes. Many software applications that require support for large-scale parallelism also need to
reuse legacy procedural codes developed, validated and verified through decades of research
and usage. Network emulation and parallel scientific computing are two software areas that
consist of many such applications.
• Network Emulation: As mentioned earlier, network emulation benefits from large-scale parallel modeling of simultaneous network nodes and applications. At the same time, network emulation entails the use of codes for TCP/IP stacks, telnet, and so forth. (TCP stands for Transmission Control Protocol [Ste97]; IP stands for Internet Protocol [Kne04].)
To model the exact behavior of real-world networks, emulators must use operationally
correct versions of these codes. However, the real-world versions have evolved through
decades of regular use, and it is extremely difficult to develop, verify and validate new
versions that can comprehensively replace their real-world counterparts. Therefore,
network emulators reuse unmodified legacy real-world versions of these codes, which
are written in procedural programming languages such as C.
• Scientific Computing: Scientific modeling of natural phenomena is another instance
of large-scale parallel applications that use legacy procedural codes. Consider, for
instance, simulating a gas turbine. Mathematical modeling of such a multi-physics
scientific problem replaces the main problem by a set of smaller problems (on simple
geometries) that need to be solved simultaneously. Corresponding software modeling
instantiates a parallel solver program for each of the simpler problems and multiple
parallel mediator programs to facilitate collaboration between the solvers. A typical
problem is a complex composition of thousands of solvers and mediators. At the same
time, solvers and mediators use scientific codes from different problem solving environment (PSE) toolboxes. Typical PSE toolboxes contain a legacy of computational
routines verified and validated over decades of research. Most of these codes are written
in procedural languages such as C and FORTRAN.
These two radically different software areas emphasize the widespread use of legacy procedural codes. From the perspective of this research, the three most important properties of
legacy procedural codes are:
• They cannot be modified without serious consequences. The sheer size of some of these
codes (such as the IP stack), the tremendous investments associated with them (such
as decades of investment in scientific solvers) and the enormous software engineering
task of revalidating modified versions are some important reasons behind this property.
• They often use global variables. For instance, telnet and IP stack codes are rich in
global variables [Var02]. So are several scientific solvers [NAG06a].
• They are programmed in procedural languages such as C (telnet and IP stack [Var02])
and/or FORTRAN (scientific solvers [NAG06a]).
To summarize, the problem domain of this research consists of large-scale parallel applications
that use legacy procedural codes.
1.1.1 Overall Goal
Before proceeding, it is important to clarify certain terms that are used repeatedly throughout the rest of this document:
• The phrase Legacy Parallel Programs is used to identify the problem domain, that is,
large-scale parallel applications that use legacy procedural codes.
• The word threads is used to identify lightweight intra-process threads such as POSIX
threads [NBF96] and GNU’s user-level threads [Eng06].
• The word transparent is used to indicate “no code modifications”.
Broadly, contemporary parallel programs follow one of two execution paradigms, threads or
processes. Most legacy parallel programs use processes to realize parallel flows of execution.
However, threads are lightweight and facilitate larger scales of parallelism [NBF96] than
processes. Typical SMP workstations and multi-core processors can support more concurrent
threads than processes. Furthermore, creation and runtime management of threads require
less operating system (OS) action, which reduces overall overhead.
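As a rough illustration of this difference, the following C micro-benchmark (a generic POSIX sketch of my own, not part of this dissertation's experiments; absolute numbers vary widely with OS and hardware) times thread creation against process creation:

    /* Hypothetical micro-benchmark: pthread_create/join versus
     * fork/waitpid. Illustrative only; not from the Weaves framework.
     * Build: gcc -O2 spawn_bench.c -lpthread -o spawn_bench */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define N 1000

    static void *noop(void *arg) { return NULL; }

    static double ms(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void) {
        struct timespec t0, t1;
        pthread_t tid;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {          /* spawn and reap N threads */
            pthread_create(&tid, NULL, noop, NULL);
            pthread_join(tid, NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%d threads:   %.1f ms\n", N, ms(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {          /* spawn and reap N processes */
            pid_t pid = fork();
            if (pid == 0) _exit(0);            /* child exits immediately */
            waitpid(pid, NULL, 0);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%d processes: %.1f ms\n", N, ms(t0, t1));
        return 0;
    }

On typical GNU/Linux systems the thread loop completes noticeably faster, which is the overhead gap this paragraph refers to.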
Most importantly, almost all parallel programs require some sort of collaboration between
parallel flows of execution. For instance, a case of network emulation can require running
multiple parallel applications, such as telnet, over one IP stack (modeling multiple telnets on
one virtual node). Here, parallel flows of execution for each telnet must collaborate within
the IP stack. Also, as mentioned earlier, parallel mediators facilitate collaboration among
solvers in multi-physics scientific simulations.
Processes use inter-process communication (IPC) for such collaboration. However, IPC
requires operating system (OS) action [Ram04], which adds extra overhead, thereby limiting
the scalability of parallelism. In contrast, because threads run within a single process, inter-thread collaboration can be realized through low-overhead sharing of global variables.
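To make the contrast concrete, here is a generic C sketch (my own illustration, not Weaves code): two threads collaborate through an ordinary global variable guarded by a mutex, with no OS-mediated message passing on the exchange path:

    /* Generic sketch of inter-thread collaboration via a shared global.
     * A process-based equivalent would need pipes, sockets, or shared
     * memory plus OS-level synchronization for the same exchange. */
    #include <pthread.h>
    #include <stdio.h>

    static long shared_count = 0;                  /* shared global state */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            shared_count++;                        /* plain memory access */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("total: %ld\n", shared_count);      /* prints 200000 */
        return 0;
    }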
Therefore, threads can help legacy parallel programs exploit larger scales of parallelism over
individual nodes of a cluster and over individual SMP machines. Such exploitation leads to
better utilization of the overall resources and ultimately facilitates more accurate modeling
of large-scale parallel phenomena such as networks and multi-physics problems.
The overall goal of this research is to help “unmodified” legacy parallel programs exploit the
scalability provided by threads.
1.1.2 Research Challenges
The fundamental problem obstructing progress towards this goal is best explained by an
example. Suppose that a legacy parallel program comprises two identical, but independent,
parallel flows of execution, each with its own copy of all global variables. The processes
paradigm that is traditionally used for running such programs automatically creates virtual
machine abstractions to encapsulate each parallel flow of execution and the associated global
variables. However, under the threads paradigm, the two flows of execution (intra-process
threads) end up sharing all global variables and inadvertently interfere with each other
thereby leading to erroneous behavior. For instance, in a case of network emulation, running
two independent telnets over threads results in the inadvertent sharing of telnet’s global
variables among the two supposedly independent threads. Such sharing leads to erroneous
modeling.
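This failure mode can be reproduced with a few lines of generic C (an invented example; the names are placeholders, not actual telnet code). A routine written for the process model keeps its state in a global; run as two supposedly independent threads, both instances mutate the same variable:

    /* Sketch of the unintended-sharing problem. Each "instance" expects
     * its own sequence 0, 1, 2, ..., as it would get under the process
     * model, but both threads advance the SAME global counter. */
    #include <pthread.h>
    #include <stdio.h>

    static int next_seq = 0;    /* one copy per process, not per thread */

    static void *legacy_instance(void *name) {
        for (int i = 0; i < 3; i++)
            printf("%s: seq %d\n", (char *)name, next_seq++);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, legacy_instance, "instance-A");
        pthread_create(&t2, NULL, legacy_instance, "instance-B");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Under the process model each instance would print sequence numbers 0 through 2; under naive threading the two instances interleave over one shared counter, which is precisely the erroneous behavior described above.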
To the best of our knowledge, no technology exists that can alleviate this problem without
code modifications. However, as mentioned earlier, modification of legacy parallel programs
is not preferable. This observation leads to the first challenge encountered in this research:
the need to transparently separate global variables used by identical, but independent, threads
of a legacy parallel program.
Separation of global variables conflicts with low-overhead sharing of global variables for
collaboration, one of the main benefits of using threads in the first place. This conflict
can only be addressed through selective sharing of global variables among threads of legacy
parallel programs. Furthermore, such selective sharing needs to be exercised at arbitrary
granularities, from a single variable to entire components. For example, emulation of multiple
telnets on one virtual node requires the sharing of all global variables in an entire IP stack
component among multiple telnet threads. Again, certain parallel scientific applications
require the sharing of individual solution variables and boundary condition variables among
different solver and mediator threads. The next chapter illustrates these examples in detail.
To the best of our knowledge, at present, multi-granular selective sharing of global variables
among threads can only be realized through explicit programming according to the tenets
of the threads paradigm. Because legacy parallel programs traditionally use processes for
execution, most of their codes do not subscribe to such constructs. However, reprogramming legacy procedural codes according to thread-programming constructs is taboo in this
research’s problem domain. This observation leads to the second challenge encountered in
this research:
the need to transparently realize multi-granular selective sharing of global variables among
the threads of a legacy parallel program.
1.2 Solution Approach
Threads are runtime entities. They exist only at the runtime stage of a program’s life
cycle. Hence, this research takes an entirely runtime approach to address the challenges it
encounters. At runtime, legacy procedural codes are available as native-code objects only.
Therefore, this research approaches the encountered challenges at the level of native-code
objects. At no point does the solution approach entail any modifications, either to source
codes or to native codes. The approach is completely transparent.
1.2.1 Proposed Work
This research proposes the Weaves runtime framework for execution of unmodified legacy
parallel programs over intra-process threads. The term “Weaved applications” is used for
programs that follow the tenets of the Weaves framework. Weaved applications can load
encapsulated modules from native-code object files. They can load multiple identical, but
independent, modules within a single process without any modifications to the concerned
object file. Every module encapsulates all its global variables within itself.
Just as the compile-time linking of object files creates an executable program, the runtime
composition of a set of modules creates a weave. A weave is, therefore, an intra-process
subprogram that can support a flow of execution. Weaves composed from disjoint sets
of modules are completely independent subprograms within a single process. Additionally,
the framework allows a single module to be shared among multiple weaves, which can be
leveraged to realize arbitrary graph-like sharing of modules among different weaves. By
allowing direct runtime control over the resolution of individual references to functions and
variables in a module’s code, the Weaves framework empowers programs with the ability to
manipulate weave composition at fine granularities. When references from different weaves
resolve to a single function or a single variable, the concerned function or variable is shared
among the different weaves. Thus, direct control over the resolution of individual references
can extend selective sharing to finer granularities.
All the components of Weaved applications, including the fundamental module, are intra-process runtime entities. Also, the Weaves framework does not entail any code modifications, either to the source codes of modules or to the native codes of relocatable objects. The framework is completely transparent.
The Weaved version of a legacy parallel program can run threads through any weave, because
a weave is an intra-process subprogram that can support a flow of execution. Threads running
through independent weaves do not share any global variables. Thus, the Weaves framework
transparently separates global variables used by identical, but independent, threads of a
legacy parallel program. Furthermore, threads running through weaves that selectively share
modules or variables experience all the elements of this sharing. Therefore, the Weaves
framework transparently realizes low-overhead collaboration through multi-granular selective
sharing of global variables among threads of a legacy parallel program.
Together, the facilities of the Weaves framework help “unmodified” legacy parallel programs
exploit the scalability provided by threads. Threads allow for larger scales of parallelism
through better utilization of resources on a single multi-core node or SMP workstation. Ultimately, larger scales of parallelism aid accurate modeling of large-scale parallel phenomena
such as networks and multi-physics problems.
1.2.2 Usability
At the basic level, the Weaves framework offers its services as a library. It supports simple Application Programming Interfaces (APIs) for loading modules, composing weaves, resolving
individual references and executing threads. Users can explicitly program the composition
of Weaved applications using these APIs. The framework also provides a meta-language for
specifying the composition of a Weaved application in a configuration file and a script that
automatically creates and runs the application from the meta-description.
Essentially, configuration files encode the composition of modules in a Weaved application,
where modules are runtime images of object files. This property makes them very similar to
Makefiles, which encode the composition of object files in an executable program or a shared
library. Consequently, from the usability perspective, composing Weaved applications is
comparable to writing Makefiles.
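For flavor, the sketch below shows what an explicit bootstrap for a two-telnets-over-one-IP-stack scenario could look like. The identifiers (weaves_load_module, weaves_new_weave, weaves_new_string, and so on) are invented placeholders for this discussion, not the framework's actual API, which is detailed in Chapter 4:

    /* Hypothetical bootstrap sketch; identifiers are placeholders, not
     * the real Weaves API. It mirrors the steps the text describes:
     * load modules, compose weaves, then run strings (threads). */
    #include "weaves.h"                      /* assumed framework header */

    int main(void) {
        /* Two independent copies of the same unmodified object file. */
        module_t *telnet1 = weaves_load_module("telnet.o");
        module_t *telnet2 = weaves_load_module("telnet.o");
        /* One IP stack module, to be shared by both weaves. */
        module_t *ip = weaves_load_module("ipstack.o");

        /* Each weave composes one private telnet copy with the shared
         * IP stack: separation and selective sharing at once. */
        weave_t *w1 = weaves_new_weave(2, telnet1, ip);
        weave_t *w2 = weaves_new_weave(2, telnet2, ip);

        /* One string (intra-process thread of execution) per weave. */
        weaves_new_string(w1, "main");       /* entry point by name */
        weaves_new_string(w2, "main");

        weaves_wait_all();                   /* placeholder: join strings */
        return 0;
    }

A configuration-file description of the same tapestry would list the same modules, weaves, and strings declaratively, which is why the text compares writing one to writing a Makefile.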
1.2.3 Portability
The Weaves framework is currently implemented on GNU/Linux over three architectures:
x86, x86_64, and ia64. The implementation is heavily dependent on the Executable and Linkable Format (ELF) [TIS95], the format of native relocatable objects used by most GNU/Linux systems. Weaves’ runtime loader and linker, called Load and Let Link (LLL),
implements the core aspects of the framework, which include loading modules, composing
weaves and direct control over the resolution of individual references.
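For a flavor of the ELF-level bookkeeping involved (an independent illustration of mine, not LLL source code), the following C sketch enumerates the undefined symbols of an x86_64 relocatable object; these are exactly the references a runtime linker gets to resolve, weave by weave, at load time:

    /* Illustrative only: list undefined symbols in an x86_64 ELF
     * relocatable object (.o). Build: gcc list_undef.c -o list_undef
     * Run: ./list_undef telnet.o */
    #include <elf.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s file.o\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[1]); return 1; }
        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        Elf64_Ehdr *eh = (Elf64_Ehdr *)base;                  /* ELF header */
        Elf64_Shdr *sh = (Elf64_Shdr *)(base + eh->e_shoff);  /* sections  */

        for (int i = 0; i < eh->e_shnum; i++) {
            if (sh[i].sh_type != SHT_SYMTAB) continue;
            Elf64_Sym *sym = (Elf64_Sym *)(base + sh[i].sh_offset);
            char *str = base + sh[sh[i].sh_link].sh_offset;   /* strtab */
            int nsyms = sh[i].sh_size / sizeof(Elf64_Sym);
            for (int j = 1; j < nsyms; j++)                   /* skip null */
                if (sym[j].st_shndx == SHN_UNDEF && sym[j].st_name != 0)
                    printf("undefined: %s\n", str + sym[j].st_name);
        }
        return 0;
    }

A loader that controls how each such reference is bound, per weave, can direct two copies of telnet.o at one shared ipstack.o image, or at two separate ones, without touching the object files themselves.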
Since most Linux-based systems subscribe to ELF and related semantics, the implementation
is fairly architecture-neutral. A port to the Power PC architecture is currently underway.
Porting to different operating systems such as Windows and OS-X poses some problems,
because they do not comply with the ELF standard. However, most regular operating
systems, including OS-X and Windows, allow object-file-based decoupling of applications.
Even though they have different formats, these object files are similar in structure and content
to ELF relocatable objects. Theoretically, therefore, the Weaves framework is portable across
a wide range of operating systems and architectures.
1.3 Lateral Technological Advances
The Weaves framework institutes lateral advances in software areas that frequently encounter
legacy parallel programs. This research demonstrates these advances through experiments
with large real-world network emulations and parallel scientific applications. These experiments in two radically different domains of parallel systems emphasize the broad impact of
the Weaves framework.
The results from the experiments with network emulation show that Weaved realizations
transparently exploit overall resources for larger scales of parallelism. A Weaved emulation
can emulate a thousand virtual nodes over a single physical machine. On increasing the
number of machines, the emulation speeds up super-linearly, i.e., on increasing the number
of machines by a factor of N, the time taken to run a certain emulation drops by a factor
greater than N. In effect, such super-linear speedup exemplifies the scalability of Weaved
network emulations.
The results of the experiments with parallel scientific applications show that Weaved realizations can support nearly twice the number of unmodified and non-threadsafe parallel solvers
as the corresponding process-based realizations can. The results also show that Weaved applications can transparently exploit low-overhead collaboration among parallel solver threads
through user-level sharing of global variables, which allows for significantly greater scales of
achievable parallelism than corresponding IPC-based collaboration does.
These results together demonstrate that the Weaves framework is instrumental in helping
legacy parallel programs transparently exploit overall resources for larger scales of parallelism. Ultimately, larger scales of parallelism enable more accurate testing of network protocols and more accurate modeling of multi-physics phenomena. These lateral advances,
therefore, substantiate the efficacy of the Weaves framework.
1.4 Organization
The rest of this thesis is organized as follows. Chapter 2 details motivating applications from the areas of network emulation and parallel scientific computing.
Chapter 3 discusses related work. Chapter 4 presents the elements of the Weaves framework.
Chapter 5 illustrates Weaves-based approaches to network emulation and parallel scientific
computing through various case studies. It also discusses the experiments and the associated
results that elucidate the benefits of the Weaves framework. Chapter 6 contains concluding
remarks. Finally, Chapter 7 briefly describes ongoing work.
Chapter 2
Motivating Applications
This chapter presents in detail motivating applications from the areas of network emulation and parallel scientific computing that illustrate the specific requirements motivating
the current work. This thesis resorts to these applications at various points to elucidate
key concepts and insights. These applications, chosen for the purposes of illustration only,
provide examples of two radically different areas of parallel software systems to emphasize
the impact of this research on the broad range of parallel computing systems.
2.1 Network Emulation
Network emulation requires support for large-scale parallelism to model real-world networks
such as the Internet. Furthermore, it requires real-world network applications and protocol
codes that are typically legacy procedural codes. Therefore, many instances of network
emulation fall within the domain of legacy parallel programs that can benefit from the
Weaves framework. The following background of network emulation highlights in greater
detail its pertinence to this research.
The last several years have seen the deployment of network protocol development environments that allow users to create complex controlled experimental test-beds to verify and
validate network protocols. Software researchers broadly classify protocol development environments into (a) Network simulators and (b) Direct code execution environments (DCEE).
While simulators such as NS [NsN06, HK97, FV06], OPNET [OPN05, OPN06, Yuj01], REAL
[Kes97], x-kernel [HP88, BP96], PARSEC [Ter01] and dummynet [Riz97] offer efficient event-driven execution models, they require that the protocol under test be written in their event-driven model. Designers can refine the simulated protocol and then convert it to a real-world
implementation. The basic problem is that there is no easy way to ensure the equivalence of
the simulated protocol and its real-world version. A secondary issue in network simulators
stems from their clean-room implementation. Because the TCP/IP protocol stack in simulators is written from scratch, it does not exactly emulate real-world protocol stacks. However,
the idiosyncrasies of real-world TCP/IP can significantly affect performance [AP99, Pax99].
DCEEs such as ENTRAPID [HSK99, EII00], NIST network emulator [CS03], ModelNet
[VYW+02] and MARS [AS94] solve the verification and validation problems by directly executing unmodified real-world code in a network test-bed environment. Network emulation naturally benefits from large-scale parallel modeling of simultaneous network nodes and applications. Larger scales of parallelism can help network emulators model a greater number
of simultaneous network entities such as nodes, network applications, protocol stacks and
so forth. The increased numbers, in turn, facilitate more accurate characterization of network dynamics such as traffic, load, and topology. Ultimately, accurate characterization of
networks helps stringently test new protocols.
Figure 2.1: The DCEE test-bed called the Open Network Emulator (ONE). The ONE needs to model thousands of simultaneously (in parallel) running real-world network applications on a single workstation.
As Figure 2.1 shows, the DCEE test-bed called the Open Network Emulator (ONE) [Var02] needs to model thousands of simultaneously (in parallel) running real-world network applications on a single machine. (Varadarajan envisioned the ONE for large-scale network emulations.) The goals of the ONE project are:

1. To provide a protocol development environment that models large-scale real-world networks (hundreds of thousands of virtual network nodes).
2. To support direct execution of unmodified protocol and application code.
2.1.1 Network Emulation and Threads
These goals put the ONE squarely within the domain of large-scale parallel applications
that use legacy procedural codes. The first goal, a development environment for modeling
large-scale networks, requires support for large-scale parallelism. Larger scales of parallelism
can help the ONE model a greater number of simultaneous network entities such as nodes,
network applications, protocol stacks and so forth. The increased numbers, in turn, facilitate more accurate characterization of network dynamics such as traffic, load and topology.
Ultimately, such characterization helps stringently test new protocols.
The second goal, direct execution of unmodified protocol and application code, implies use
of traditional procedural codes. Specifically, it entails the use of codes for TCP/IP protocol
stacks, telnet, ftp, and so on. To model the exact behavior of real-world networks, the ONE
must use operationally correct versions of these codes. However, because the real-world
versions have evolved through decades of regular use, it is extremely difficult to develop, verify
and validate new versions that can comprehensively replace their real-world counterparts.
For this reason, the ONE reuses unmodified legacy real-world versions of these codes that are
written in procedural programming languages such as C.
Currently, DCEEs model all virtual network nodes and applications as parallel OS processes.
The large context switch time of processes and OS limits on the maximum number of processes (of the order of hundreds on typical workstations of today) inherently restrict the
scalability of process-based solutions.
Even on a cluster supercomputer with hundreds of nodes, the ONE needs to model thousands
of parallel network applications on a single physical machine (Figure 2.1). We argue that
intra-process threads can help the ONE model larger networks than current process-based
schemes [Var02]. Typical workstations of today can support more concurrent threads than
processes [NBF96]. Therefore, the ONE can model a greater number of simultaneous network
applications as parallel threads than as parallel processes. Furthermore, creation and runtime
management of threads incur less OS action, which reduces overall overhead.
Lastly, but most importantly, threads run within the same process. This property can
facilitate low-overhead collaboration among parallel network applications being modeled by
the ONE through sharing of global variables. Such low-overhead collaboration, in turn, can
be instrumental in lowering the overhead of modeling large-scale networks. The following
discussion illustrates this idea.
Figure 2.2 shows a network scenario where two telnet applications run over a single IP stack.
To model this scenario using processes, the ONE must link the telnet application against
the IP stack and run two instances of the resultant program. As Figure 2.3(a) shows, such
a process-based approach completely separates each instance of telnet. However, this model
is erroneous, because network traffic from various applications can interfere with each other
within the IP stack. Consider, for instance, a network node (machine) that supports both
Figure 2.2: A simple network model with 2 telnets running over a single IP stack. The
composition emulates a single virtual host.
real-time applications such as videoconferencing as well as best-effort applications such as file transfer programs (FTP). The traffic from the real-time application interferes with the best-effort FTP application within the IP stack, resulting in less than adequate performance
for the real-time application.
For this reason, the ONE cannot model the joint effect of two telnets running over the same
IP stack through this approach. While it is possible to synchronize the IP stacks on the two
telnet processes through IPC mechanisms, such an approach entails either code modifications
or extra overhead or both. Hence, IPC either violates transparent reuse of legacy codes or,
at best, limits the scalability of network emulations, because IPC consists of high-overhead
OS actions [Ram04].
Figure 2.3(b) shows a model based on threads. In this case, the ONE must link the telnet
application against the IP stack as before, and initiate two threads through the resultant
program. Both threads run through the same instance of the IP stack. Thus, the model
Figure 2.3: The composition shown in Figure 2.2 modeled under (a) process per virtual node
model and (b) threads model. Neither of these models can emulate the desired real-world
behavior without significant changes to telnet/IP codes and/or extra overhead.
benefits from low-overhead collaboration among the telnet threads within the shared IP
stack. The actual real-world semantics of the two telnets interfering within the IP stack is
closely captured without entailing any code modification or extra overhead.
2.1.2 Challenges
A major problem with the thread-based approach arises from updates to global variables.
Consider the Figure 2.3 scenario once again. Telnet contains global variables and is non-threadsafe. Because threads share all global variables, a telnet thread modifying a global
variable can inadvertently change the state of the other unrelated telnet thread causing
erroneous behavior. The ideal solution to this problem requires two copies of all the global
variables used in the telnet application. In programs that explicitly follow the threads
paradigm, the sharing of global variables is intentional. In the scenario discussed here, this
sharing is neither intentional nor necessarily desirable.
The two conflicting needs at the crux of the ONE’s problems when using threads are:
• The need to avoid sharing global variables between two telnet threads when they run
through the telnet code.
• The need to share global variables between the same two threads when they run through
the IP stack code.
Currently, there exists no technology that can fulfill the first need without modifications or
additions to the telnet codes. However, as mentioned earlier, being a DCEE test-bed that
runs real-world codes, the ONE does not prefer these modifications. Therefore, the first need
of the ONE leads to the first research challenge, the need to transparently separate global
variables used by identical, but independent, threads of a legacy parallel program.
The second need of the ONE involves sharing a single IP stack’s global variables between two otherwise independent telnet threads. Furthermore, emulating advanced network scenarios, such as the one depicted in Figure 2.4, entails selectively sharing different independent IP stacks
among different sets of application threads.
To our knowledge, selectively sharing IP stacks among application threads can only be realized by explicitly reprogramming the IP stack code following constructs of the threads
paradigm. However, since telnet and the IP stack are legacy procedural codes, such reprogramming is not preferable.

Figure 2.4: Advanced network scenarios entail selectively sharing different independent IP stacks among different sets of application (telnet, ftp) threads.

The second need of the ONE, therefore, leads to the second
research challenge, the need to transparently realize multi-granular selective sharing of global
variables among the threads of a legacy parallel program.
2.2 Parallel Scientific Computing
Parallel scientific computing requires support for large-scale parallelism to model physical
phenomena such as the weather. Furthermore, it requires using scientific solver programs
that are typically legacy procedural codes. Therefore, many instances of parallel scientific
computing fall within the domain of legacy parallel programs that can benefit from the
Weaves framework. The following background of parallel scientific applications highlights
in greater detail its pertinence to this research. Compared to network emulation, parallel
scientific computing comprises more illustrative instances of multi-granular selective sharing
of global variables among threads of legacy parallel programs.
Parallel scientific applications frequently use collaborating partial differential equation (PDE)
solver programs [DHRR99] to model heterogeneous multi-physics problems. Consider modeling physical phenomena within a gas turbine. This software application requires combining
simultaneous (parallel) models for heat flows (throughout the engine), stresses (in the moving parts), fluid flows (for gases in the combustion chamber), and combustion (in the engine
cylinder). The mathematics of the problem describes each constituent model by a PDE
with various formulations for the geometry, operator, right side, and boundary conditions.
The basic idea here is to replace the multi-physics problem by a set of smaller problems (on
simple geometries) that need to be solved simultaneously while satisfying a set of interface
conditions.
2.2.1 Computational Background
Figure 2.5 shows how a set of smaller problems can replace a multi-physics problem. The
computational basis of this idea consists of the interface relaxation approach to support a
network of interacting PDE solvers [MR92]. Computational modeling of the multi-physics
problem distinguishes between solvers and mediators. A PDE solver is instantiated for
each of the simpler problems and a mediator is instantiated for every interface, to facilitate
collaboration between the solvers. The mediators are responsible for ensuring that the
solutions of the solvers match at the interfaces. The term “match” is defined mathematically
(e.g., the solutions should join smoothly at the interface and have continuous derivatives) or
by the physics of a specific problem (e.g., conservation constraints).
[Figure 2.5 diagram: the six sub-domains (Parts 1-6) of the composite problem, and the corresponding network of solvers (S1-S6) connected by interface mediators (M12, M13, M23, M34, ...).]
Figure 2.5: (Above) Composite multi-physics problem with six sub-domains. (Below) A
network of collaborating solvers (S) and mediators (M) to solve the composite PDE problem.
Each mediator is responsible for agreement along one of the interfaces.
Since the interface relaxation approach is often confused with classical domain decomposition, it is helpful to highlight the differences. In domain decomposition, one PDE problem is
split into sub-domain problems of the same type, interconnected by the (unknown) solution values at the interface points, thereby creating a single underlying PDE for all sub-domains.
In interface relaxation, each sub-domain can have its own PDE and the interface conditions
are generally derived from the underlying physics of the problem. Schwarz splitting [Mu99]
is a meeting point between these two extremes where the sub-domains have the same PDE,
but are coupled through continuity of solution and its derivatives. Interface relaxation is,
therefore, the most general approach to modeling complex physical phenomena.
    U·Uxx + (1 + U)·Uyy + a·U·(1 + U) = b(x² + y² - 2)   (Domains 1, 4)
    Uxx/(1 + (x - y)²) + Uyy/(1 + (4x - 5y)²) + c·U/(101 + U) = 0   (Domains 2, 3)
    Uxx + Uyy - d·(Ux + Uy) + c·U = 0   (Domains 5, 6)

Figure 2.6: PDEs defined over six sub-domains of the boiling mechanism shown in Figure 2.5.
As an example, Figure 2.6 shows the model of a boiling mechanism formulated by adapting
ideas from [Ric98]. The formulation comprises six sub-domains and the following PDEs
defined over them (see Figure 2.6). Here, Uxx and Uyy denote second-order partial derivatives; Ux and Uy denote first-order derivatives of the unknown function U(x,y). The PDEs are not common
across the sub-domains. The PDEs in each sub-domain must be solved in an inner loop
of the interface relaxation routine which then applies an ‘averaging’ or ‘smoothing’ formula
to ensure that the solutions agree across sub-domain boundaries. The simplest approaches
simply exchange solution values across the boundaries, but there are more complex relaxation
formulas as well [RTV99].
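To make the solve-then-relax cycle concrete, here is a toy one-dimensional sketch of my own (not taken from a PSE toolbox or from [RTV99]): two sub-domain "solvers" handle u'' = 0 on [0, 0.5] and [0.5, 1] with u(0) = 0 and u(1) = 1, and a "mediator" relaxes the interface value g at x = 0.5 until the one-sided derivatives agree (the exact answer is g = 0.5, the straight line u = x):

    /* Toy 1-D interface relaxation (illustration only). For u'' = 0 each
     * sub-domain solution is linear, so each "solver" can report its
     * one-sided derivative at the interface in closed form. */
    #include <math.h>
    #include <stdio.h>

    static double left_slope(double g)  { return (g - 0.0) / 0.5; }  /* solver 1 */
    static double right_slope(double g) { return (1.0 - g) / 0.5; }  /* solver 2 */

    int main(void) {
        double g = 0.9;           /* poor initial interface guess */
        double omega = 0.1;       /* relaxation weight */
        for (int it = 0; it < 50; it++) {
            /* Mediator: nudge g to reduce the derivative mismatch. */
            double mismatch = right_slope(g) - left_slope(g);
            g += omega * mismatch;
            printf("iter %2d: g = %.6f  mismatch = %+.3e\n", it, g, mismatch);
            if (fabs(mismatch) < 1e-9) break;
        }
        return 0;
    }

The iteration contracts geometrically toward g = 0.5 (here g_{n+1} = 0.6·g_n + 0.2). Real mediators apply the same pattern to vectors of interface values with problem-specific relaxation formulas.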
Figure 2.7 illustrates typical solver and mediator codes. A solver takes as input a PDE
structure identifying the operator, geometry, right side and boundary conditions, and computes solutions using some computational routine (PDE_solve), such as finite difference or
finite element approximation. The PDE problem characteristics determine the choice of the
PDE_solve routine from a problem solving environment (PSE) [VR05] toolbox. Typical PSE
toolboxes contain a legacy of computational routines verified and validated over decades of
research. Even if PDE_solve routines implement different algorithms, they could use identical global symbols and signatures. Further, a certain problem instance might entail the
same solver on multiple sub-domains or require different solvers or both. After computing
solutions, a solver passes the results to mediators and waits until the mediators report back
fresh boundary conditions. Upon the receipt of new conditions, it re-computes the solutions
and repeats the whole process till a satisfactory state is reached. From the solver perspective, the semantics of interaction with the mediators are not much different from a functional
interface.
The computational definition of a mediator states that it should be “capable of accepting
values from solvers, apply relaxation routines (Relax_soln) and return improved values to
the solvers” [Ric98]. A mediator should, therefore, be able to collaborate with multiple
solvers through exchange of solution structures and boundary condition variables. As in the
case of the solvers, the problem characteristics dictate the choice of the Relax_soln routine
from legacy PSE toolboxes. Once again, multiple Relax_soln routines can expose identical
names and signatures, and a certain problem instance might entail identical or different
mediators or both. Another important property is that different mediators and associated
relaxation routines can either (1) require all solutions at once, or (2) use solutions as they
are needed and become available.
The properties of parallel scientific applications that are relevant to this research are summarized as follows:
1. They use thousands of parallel solvers and mediators to model realistic problem instances such as turbines and heat engines.
2. They reuse unmodified legacy computational routines and programs from scientific
PSE toolboxes that have been verified and validated over decades of research.
2.2.2 PDE Solvers and Threads
These properties put PDE solver programs within the domain of large-scale parallel applications that use legacy procedural codes. The first property requires support for large-scale
parallel solvers and mediators. Larger scales of parallelism can help realize a greater number
of simultaneous solvers and mediators. The increased numbers, in turn, facilitate fine-grain
decomposition of multi-physics problems. Ultimately, fine-grain decomposition aids accurate
characterization of physical phenomena. The second property implies use of traditional procedural codes written in C and FORTRAN (languages that have found traditional favor in
    /* PDE structure */
    PDE struct
    {
        domain;
        operator;
        right side;
        boundary_cond_1;
        ...
        boundary_cond_n;
        ...
    }

    /* PSE toolbox */
    PDE_solve (...) { ... }
    PDE_solve (...) { ... }
    Relax_soln (...) { ... }
    Relax_soln (...) { ... }

    /* Solver code */
    function solver (...) {
        ...
        do {
            ...                  // wait for new conditions
            PDE_solve (...);
            ...                  // report new solutions
        } while (...)
        ...
    }

    /* Mediator code */
    function mediator (...) {
        ...
        do {
            ...                  // wait for solution(s)
            Relax_soln (...);
            ...                  // report new conditions
        } while (...)
        ...
    }

Solvers pass solutions to mediators; mediators return fresh boundary conditions (PDE structs) to solvers.

Figure 2.7: Typical solver and mediator codes. A solver takes as input a PDE structure identifying the domain, operator, right side, and boundary conditions, and computes solutions. A mediator accepts values from solvers, applies relaxation formulas, and returns improved boundary condition estimates to the solvers. PDE_solve and Relax_soln routines are chosen from a PSE toolbox.
The traditional software approach to collaborating PDE solvers consists of agent technology [DHRR99] in highly distributed environments such as clusters and distributed memory
machines. Agent-based schemes model simultaneous solvers as parallel processes running
on different machines in a distributed computing system. They also model mediators in
the same manner. Agent-based solutions use message passing for exchange of solutions and
boundary conditions among solvers and mediators. In highly distributed environments, with
solvers running on different physical machines, message passing is often the only means of
exchanging solution or boundary condition information. Message passing also facilitates flexible composition of varied multi-physics problems through transparent decoupling of solvers
and mediators.
Intra-process threads can help model larger PDE solver instances than current agent-based
schemes. Typical SMP workstations of today can support more concurrent intra-process
lightweight threads than processes. A thread-based approach, therefore, can realize a greater
number of simultaneous parallel solvers and mediators than process-based agents can. Furthermore, the creation and runtime management of threads incur less OS action, which
reduces overall overhead [NBF96].
More importantly, intra-process threads run within the same process. This property can
be used to facilitate low-overhead collaboration among solvers and mediators through user-
level sharing of solution structures and boundary condition variables. Compared to message passing among process-based agents, which entails OS-level IPC actions, such user-level sharing of solutions and boundary variables incurs lower overhead.
2.2.3 Challenges
The first issue with a thread-based approach arises when PDE solver instances comprise
multiple identical solvers and mediators. Figure 2.8 illustrates this issue. It depicts a simple
parallel scientific application, comprising two solvers (S1 and S2) and one mediator (M12).
If S1 and S2 are identical, they cannot be run over threads within the same process unless
they do not contain global variables. However, many legacy scientific solvers contain global
variables and are non-threadsafe [NAG06a]. If such solvers are used, a thread-based realization leads to an inadvertent sharing of the global variables between S1 and S2, thereby resulting in erroneous behavior.

Figure 2.8: Simple instance of collaborating PDE solvers. Mediator M12 relaxes solutions from solvers S1 and S2.
Currently, no technology exists that can address this issue without modifications or additions
to the solver and mediator codes. However, modifications or additions to scientific programs are taboo due to decades-long investments in the verification and validation of legacy solver codes.
Essentially, this issue leads to the first research challenge, the need to transparently separate
global variables used by identical, but independent, threads of a legacy parallel program.
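To make this first challenge concrete, the following is a minimal sketch of a non-threadsafe legacy solver of the kind at issue; the file name, symbol names, and sizes are hypothetical, not drawn from any actual PSE toolbox:

    /* legacy_solver.c -- hypothetical non-threadsafe legacy solver */
    #include <string.h>

    #define N 1024

    double grid[N];     /* global solution grid */
    double residual;    /* global convergence measure */

    void solver_step (const double *boundary)
    {
        memcpy (grid, boundary, N * sizeof (double));
        residual = 0.0;
        /* ... relaxation sweep that updates grid[] and residual ... */
    }

If S1 and S2 were two threads running solver_step() within one process, both would read and write the single copies of grid and residual, silently corrupting each other's state. As separate processes, each would receive private copies, but only at process-level cost.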
At the same time, realistic multi-physics scenarios require exchange of solutions and boundary conditions among solvers and mediators. To exploit low-overhead collaboration, the
solver and mediator threads need to share solution data structures and boundary condition
variables. Because sharing scenarios can be arbitrary, such sharing needs to be instituted selectively, at the fine granularity of individual data structures and variables. For instance, in the
scenario depicted in Figure 2.5, solver thread S4 needs to share different boundary variables
with four possibly identical mediator threads M14, M34, M45, and M46.
Currently, fine-grain selective sharing of boundary variables among solver and mediator
threads can only be realized by explicitly reprogramming the solver and mediator codes
following the constructs of the threads paradigm (a sketch of such a rewrite appears below). However, because solvers and mediators comprise legacy procedural codes, such reprogramming is undesirable. This observation leads
to the second research challenge, the need to transparently realize multi-granular selective
sharing of global variables among the threads of a legacy parallel program.
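For reference, the sketch below illustrates the kind of explicit reprogramming that the threads paradigm would demand of the hypothetical solver shown earlier: every global variable is moved into a per-thread context structure and every access is rewritten. All names are again illustrative:

    /* Hypothetical thread-safe rewrite that this research seeks to avoid */
    #define N 1024

    struct solver_ctx {
        double grid[N];     /* formerly the global array */
        double residual;    /* formerly the global variable */
    };

    void solver_step (struct solver_ctx *ctx, const double *boundary)
    {
        /* every former global access becomes ctx->...; selective sharing
           then requires passing parts of ctx among threads by hand */
        ctx->residual = 0.0;
        /* ... relaxation sweep over ctx->grid ... */
    }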
2.3 Summary
This chapter has provided detailed descriptions of examples from the areas of network emulation and parallel scientific computing. These example applications have illustrated the
specific requirements that motivate the current work. The radically different software areas
of these motivating applications emphasize the broad impact of this work.
These descriptions have demonstrated that each of the motivating applications falls within
the problem domain of this work, large-scale parallel applications that use legacy procedural codes. Larger scales of parallelism can help network emulators model a greater number
of simultaneous network entities such as nodes, network applications, protocol stacks and
so forth. The increased numbers, in turn, facilitate more accurate characterization of network dynamics such as traffic, load and topology. Ultimately, accurate characterization of
networks helps stringently test new protocols.
In the area of parallel scientific computing, larger scales of parallelism can help realize a
greater number of simultaneous solvers and mediators. The increased numbers, in turn,
facilitate fine-grain decomposition of multi-physics problems. Ultimately, fine-grain decomposition aids accurate characterization of physical phenomena.
This chapter has illustrated the manner in which our motivating applications benefit from
the overall goal of this research. Even on a cluster supercomputer with hundreds of nodes,
large-scale network emulation needs to model thousands of parallel virtual nodes and applications on a single physical machine. Because SMP workstations of today can support
more threads than processes, thread-based approaches can help model larger networks than
current process-based schemes. Furthermore, creation and runtime management of threads
incur less OS action, which reduces overall overhead. Most importantly, intra-process threads
run within the same process. This property facilitates low-overhead collaboration through
user-level sharing of global variables among emulated parallel network applications.
Similarly, in scientific computing, thread-based approaches help realize larger parallel scientific solver instances than current agent-based schemes. Here also, threads facilitate
low-overhead collaboration among parallel solvers and mediators through sharing of global
boundary condition variables, which lowers the overhead of modeling complex multi-physics
problem scenarios.
Finally, this chapter has illustrated the manner in which the motivating applications lead to
the research challenges. Certain cases of network emulation require running multiple identical, but independent, telnet threads within a single process. Certain scientific applications
require running multiple identical, but independent, PDE solver threads within a single process. Since telnet and PDE solvers are legacy procedural codes that frequently contain global
variables, these requirements lead to the first research challenge, the need to transparently
separate global variables used by identical, but independent, threads of a legacy parallel
program.
Again, certain cases of network emulation require selectively sharing different IP stacks
among multiple independent telnet threads. At the same time, certain scientific applications
require arbitrary sharing of individual boundary variables among various independent solver
and mediator threads. These requirements together lead directly to the second research
challenge, the need to transparently realize multi-granular selective sharing of global variables
among the threads of a legacy parallel program.
Chapter 3
Related Work
This chapter presents a comparison of the Weaves framework, and its benefits for legacy
parallel programs, to related work. Currently, no technology exists that can address either
of the challenges outlined in this research without instrumenting modifications on legacy
procedural codes. At a more fundamental level, to the best of our knowledge, thread-based
execution of unmodified legacy parallel programs that contain global variables and are non-threadsafe has never been attempted. Additionally, most related attempts resort to explicit
programming based on the threads paradigm to facilitate selective sharing of global variables
among threads.
Comparing the Weaves framework to some powerful approaches to parallel programming
underscores its uniqueness and highlights its capability of running process-based legacy parallel programs over threads without any modifications. For simplicity and clarity, scientific
applications comprising collaborating PDE solvers (see Chapter 2) are used to illuminate
the strengths and weaknesses of different approaches. Finally, for conciseness, each of these
related approaches has been analyzed using a single representative work from the literature
on it.
3.1 Concurrent Approaches (Linda)
Intra-process threads naturally lead to concurrent programming techniques, where multiple
control flows execute within one process. Concurrent programming [GB82, CG89, DGTY95,
Fos96, Sat02, OAR04, BKdSH01] also offers a way around IPC overhead—global in-memory
data structures—for inter-flow collaboration. Linda [GB82, CG89] is an exemplary representative of general concurrent programming techniques, and therefore, the strengths and
weaknesses of a Linda-based approach are broadly valid across all concurrent approaches.
In a Linda-based scheme, each solver and mediator instance is a live tuple implemented as a lightweight thread inside a single process (the tuple space). Boundary conditions and solution
structures are manifested as data tuples that are written to and read from the same global
tuple space. This scheme is scalable and allows low-overhead collaboration among solvers
and mediators through multi-granular selective sharing of global variables.
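As a sketch of how such a scheme might look, the following solver loop uses Linda's standard out/in coordination primitives through a hypothetical C binding; the function names ts_out and ts_in, the tuple tags, and the sizes are illustrative:

    /* Hypothetical Linda-style solver loop over a shared tuple space */
    #define N 1024

    extern void PDE_solve (double *bc, double *soln);
    extern void ts_out (const char *tag, int id, double *data); /* Linda out */
    extern void ts_in  (const char *tag, int id, double *data); /* Linda in  */

    void solver (int id)
    {
        double bc[N], soln[N];
        for (;;) {
            PDE_solve (bc, soln);
            ts_out ("solution", id, soln);  /* publish solution tuple */
            ts_in  ("boundary", id, bc);    /* block for relaxed boundaries */
        }
    }

A mediator thread would symmetrically consume solution tuples, apply Relax_soln, and emit fresh boundary tuples.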
The biggest problem with a Linda-based approach is “lack of encapsulation in a global tuple
space” [Fos96] within a single process. Certain scientific applications comprise multiple
identical, but independent, legacy PDE solvers that contain global variables. Concurrent
techniques, as Linda exemplifies, offer no mechanism to transparently separate these solvers
in intra-process environments.
In contrast, a Weaves-based approach facilitates true encapsulation of legacy procedural
codes in intra-process environments. It allows legacy parallel programs to run multiple
identical, but completely independent, threads within a single process. Each thread has
its own independent set of global variables. Most importantly, facilitating encapsulation
does not entail any modifications to the concerned codes either at the source-level or at the
native-code level.
Another problem with Linda is its radically different programming methodology of the tuple
space, which has hindered adoption. In contrast, the Weaves framework stays within the
broad realms of procedural programming. The current work can, however, easily merge with
a tuple space model¹.
3.2 Compositional Approaches (PCOM2)
Most parallel frameworks are targeted at increasing efficiency of applications through parallelization. Darlington, Guo, To and Yang [DGTY95] use special source-level constructs
called skeletons to provide compositional parallelism and reusability of existing sequential
code. Jade [RL98], ORCA [BBH+98] and Strand (PCN) [FOT92, FT89] address issues pertaining to parallelism, sharing of program elements, and code reuse. All of them, however,
¹See Mukherjee [Muk02] for a discussion on the tuple-space aspects of the Weaves framework.
approach the problem from the source language perspective, use high-level constructs and
are ultimately based on using multiple processes.
Certain compositional parallel programming models [JL05, CCA04, Fos96, MDB03, SteJr90] address componentization issues in concurrent languages. They use coordination languages to componentize existing procedural codes and compiler-based techniques to compose a
parallel program. PCOM2 [MDB03] illustrates the general pros and cons of approaches in
this category. Consider a simple PCOM2 approach: a start component spawns solver and
mediator components that run over parallel threads and exchange information through global
data structures.
The setup is similar to a Linda-based approach, as are the issues. The problem is that
none of these compositional schemes enable true encapsulation because all components still
inhabit the same namespace. Essentially, compositional parallel programming models cannot encapsulate multiple identical, but independent, copies of traditional procedural code
components. Consequently, they cannot prevent inadvertent interference among threads
running identical, but independent, legacy PDE solvers containing global variables.
One advantage of some of these techniques is that, being compiler-based, they can automatically modify codes at compile time. However, even though automatic, such schemes
entail code modification (patching), which leads back to verification/validation issues within
the problem domain of legacy parallel programs. Moreover, developing such an automatic
patching tool that is generic across various parallel software systems is an enormous software
engineering task.
In contrast, a Weaves-based approach facilitates true encapsulation and multiple instantiation of legacy procedural codes in intra-process environments. Most importantly, in facilitating encapsulation, it does not entail any modifications to the concerned codes either at
the source-level or at the native-code level.
3.3 Component-based Approaches (OOP)
It is possible to encapsulate and modularize traditional procedural codes at runtime using
component-based technologies such as CORBA [OMG06], COM/DCOM [MS06], and Object
Oriented Programming (OOP). Technologies such as CORBA and DCOM are mainly meant
for distributed environments. OOP, on the other hand, is more pertinent to intra-process
environments. Hence, OOP is used as a general representative of other technologies in this
category.
In an OOP-based [CK01, PCB94, LM01] approach, solvers and mediators are classes that
are instantiated and composed depending on a particular problem instance. Encapsulation
and multiple instantiation primitives of OOP help separate identical solver and mediator
components along with their global variables. Multiple parallel intra-process threads are
fired off at entry functions of each solver and mediator to execute a PDE solver application.
The solvers and mediators can communicate through user-level shared data structures at
various granularities. Alternatively, mediator components (objects) can be shared between
multiple solver threads by passing them as parameters to the associated solver objects. Both
these mechanisms allow low-overhead selective access to shared global variables from within
solver and mediator components.
The problem with this approach is that solver and mediator codes must be available as
classes, a source-level construct specific to object oriented languages (OOL). To get around
this, solvers and mediators must be reprogrammed or the traditional procedural codes must
be wrapped within classes written in an OOL. However, while recoding violates the transparency requirement, wrapping is complex and potentially expensive in terms of performance.
Finally, even if solver and mediator classes are available, OOP creates a gap between the
graphical network-like description of solvers and mediators in a scientific application and its
actual runtime image. To bridge this gap, users must program the application structure
in an OOL to instantiate and compose objects from classes, a non-trivial task because automatically translating a high-level meta-description or a graphical problem representation
to OOP code requires sophisticated tools specific to collaborating PDE solver applications.
The OOP approach, therefore, lacks flexibility and domain neutrality.
In contrast, a Weaves-based approach facilitates transparent encapsulation and multiple
instantiation of legacy procedural codes in intra-process environments. It does not entail any
source-level intervention, such as wrapping. A Weaves-based approach, therefore, maintains
the simplicity and efficiency of normal procedural programming. Furthermore, a Weaves-based approach transparently facilitates multi-granular selective sharing of global variables
among parallel threads of legacy parallel programs. Compared to programming complex
program structures in OOP, writing a Weaves-based configuration file is simpler and more
intuitive.
3.4 Summary
This chapter has presented a comparison of the Weaves framework, and its benefits for
legacy parallel programs, to related work. This comparison shows that currently, no technology exists for transparent separation, or transparent multi-granular selective sharing, of
global variables among threads of a legacy procedural program. It also shows that, at a fundamental level, thread-based execution of unmodified legacy parallel programs that contain
global variables and are non-threadsafe has never been attempted. Most related attempts
resort to explicit programming based on the threads paradigm to facilitate selective sharing
of global variables among threads.
Comparing the Weaves framework to some powerful approaches to parallel programming
highlights its strengths. Existing concurrent programming techniques such as Linda [GB82, CG89] offer no mechanism to transparently separate global variables of identical, but independent, threads of legacy parallel programs. Existing compositional parallel programming models such
as PCOM2 [MDB03] suffer from similar constraints. Existing component-based technologies
such as OOP [CK01] need to wrap legacy procedural codes within classes, which adds extra
overhead.
On the other hand, a Weaves-based approach facilitates transparent runtime (intra-process)
encapsulation and multiple instantiation of legacy procedural codes. It does not entail source-
level intervention, such as wrapping, or code modification. The Weaves framework maintains
the simplicity and efficiency of normal procedural programming. Furthermore, a Weaves-based approach facilitates transparent multi-granular selective sharing of global variables among parallel threads of legacy parallel programs. Writing a configuration file for a Weaves-based parallel program is simpler and more intuitive than programming the structure of a parallel program in OOP.
Chapter 4
The Weaves Framework
This chapter provides a description of the elements of the Weaves runtime framework for parallel programs. “Weaved applications” is the terminology used for programs that subscribe
to the tenets of the Weaves framework. The chapter consists of four parts: the definition of
the components of a Weaved application, the developmental aspects of Weaved applications,
the details of the current implementation including preliminary evaluation, and a discussion
of some of the properties of Weaved applications that illustrate the manner in which the
framework helps address the research challenges outlined in Chapter 1.
Because threads are runtime entities, both the challenges involve runtime aspects. Therefore, the Weaves framework takes a runtime approach to address them. At runtime, legacy
procedural codes are available as native code objects only. Hence, the framework approaches
the challenges at the level of native code objects. To a great extent, a native code based
approach helps avoid modifications to legacy procedural codes.
To facilitate transparent separation of global variables between identical threads, the first
challenge, the framework takes a component-based approach. In effect, this approach helps
load encapsulated native code object components into the runtime environment of programs.
Furthermore, the framework treats relocatable objects (.o files) as loadable components.
Relocatable objects can be flexibly created at multiple granularities without concerns such
as referential completeness. They also maintain information on individual references (to
functions and variables) in their code. These properties of relocatable objects help the
framework facilitate transparent multi-granular selective sharing of global variables among
threads, the second challenge.
Since they are referentially incomplete, relocatable object components are not runnable by
themselves. One or more of these components must be composed (linked together) into
referentially complete program images. The Weaves framework takes a runtime approach to
the composition of relocatable object components. Along with the multi-granular decoupling
inherent in relocatable object components, this approach to runtime composition empowers
the framework with the ability to transparently realize arbitrary selective sharing of global
variables among threads of legacy parallel programs. Finally, at no point does the Weaves
framework resort to any code modifications either at the source level or at the level of
relocatable objects (native code patches). The framework is completely transparent.
4.1 Component Definitions
The Weaves framework defines the following components of a Weaved application. The
terminology that follows was chosen (a) to avoid overloading established terms; and (b) to
make it clear that the framework is not bound to any established paradigm:
❼ Module: A module is the intra-process runtime image of a native code relocatable
object (.o file) [TIS95]. It is the main unit of encapsulation in the framework. A module
in the Weaves framework corresponds to an object in OOP. It can be programmed in
any compiled language (such as C, C++, and FORTRAN). Each module defines its
own namespace and encapsulates all its global variables. Weaved applications can load
multiple identical modules from a single relocatable object file (multiple instantiation)
without requiring any modifications to the concerned object. Encapsulation allows each
copy to have its own independent namespace and global data within the address space
of a single process. The Weaves framework allows modules to be dynamically loaded
from relocatable objects without requiring referential completeness. Consequently, a
module’s code can contain undefined references to external program elements (in short,
external references). Weaved applications have direct access to most¹ global references²
contained in a module’s code. They can control the resolution of each individual global
reference.
❼ Weave: A weave is at the core of the Weaves framework. It consists of a collection of one
¹Aliased references, such as dynamically assigned pointer variables, are not handled at this time.
²“Global references” must resolve to global functions and variables. They subsume external references.
or more modules composed (linked) together at runtime. A weave can support a flow
of execution. It is an intra-process subprogram that unifies namespaces of constituent
modules. Therefore, identical modules should not be included within a single weave
(just as two copies of the same object file cannot be linked into the same executable
program). However, different weaves can comprise similar, but independent, modules.
Going a step further, the Weaves framework allows a single module to be part of
multiple weaves. This property lays the foundation for selective sharing within the
Weaves framework.
Just as the compile-time linking of object files creates an executable, the dynamic composition of modules creates a weave. Both comprise resolution of external references
among constituent objects/modules [TIS95]. However, there are certain differences:
– Weave composition is an intra-process runtime activity (where modules are intra-process runtime entities).
– Weave composition does not necessitate the resolution of all references within
constituent modules.
– Weaved applications can exercise direct control over the resolution of individual
global references (to definitions such as functions and variables) to fulfill referential
completeness or to transcend weave boundaries/limitations.
– External references from a shared module might need to resolve to different definitions in different weaves. Using late binding mechanisms, the Weaves framework
performs weave-dependent resolution of such external references at runtime.
❼ String: A string is the unit of execution in the framework. It is an intra-process thread
that can be dynamically initiated and managed. A string can be either a kernel-level
(OS-level) thread or a user-level thread. A string executing within a referentially complete weave is similar to a process running through the runtime image of an executable
program. Multiple strings can simultaneously (in parallel) run through the same or
different weaves.
❼ Co-strings: Co-strings are strings that execute within the same weave.
❼ Monitor: Because a Weaved application must load modules and compose weaves at
runtime, it requires a minimal bootstrapping module, which runs on its main process
thread. This bootstrapping module is responsible for setting up the Weaved application, that is, loading modules, composing weaves and starting strings. The framework
permits an application to customize the main process's functionality after it has started
strings. The term ‘Monitor’ is used for the main process, because it can be customized
to monitor and externally/asynchronously modify a Weaved application’s constitution/functionality at runtime. Such rewiring capabilities of Weaves are, however, not
pertinent to the central theme of this thesis. Interested readers can consult Mukherjee
and Varadarajan [MV05b] for more information.
❼ Tapestry: A tapestry is a single Weaved application comprising an entire composition
of modules, weaves, strings and a monitor. The physical manifestation of a tapestry is
a single operating system (OS) process.
Figure 4.1 depicts the components of a Weaved application. A Weaved application (tapestry),
including all strings and the monitor, runs as a single OS process.
Figure 4.1: Components of a Weaved application: modules, weaves, strings, and the monitor.
All components are intra-process runtime entities.
4.2 Developmental Aspects of Weaved Applications
At the basic level, the Weaves framework offers its services as a library. It supports five
simple Application Programming Interfaces (APIs). Table 4.1 describes actions, inputs,
and compositional issues associated with each API. Further details are given under Implementation (Section 4.3).
Users can program a bootstrap module to load modules, compose weaves and initiate strings
using these APIs. They must then compile the bootstrap module, link it to the Weaves library
and run it as a normal executable program (Figure 4.1). Figure 4.2(b) depicts bootstrap
pseudo code for the generic Weaved application shown in Figure 4.2(a). The framework
also provides a meta-language for specifying application tapestries in a configuration file and
a script that automatically generates a bootstrap module, builds it and runs the resultant
executable program. Figure 4.2(c) depicts the configuration file for the Weaved application
shown in Figure 4.2(a). One direction of current research on Weaves aims at an integrated
Graphical User Interface (GUI) for tapestry specification and automatic execution.
Figure 4.3 diagrammatically illustrates the complete process for developing a general Weaved
application. An important aside is that the actual runtime structure of a Weaved application
matches the high-level composition specification. This correspondence between execution
and specification, and the minimal application-specific information in bootstrap modules,
make automatic tapestry generation simple and generic across a diverse set of applications.
In fact, we have used the same meta-language and script to generate tapestries for various
network emulations as well as different parallel scientific applications.
The Weaves framework also provides several general purpose APIs:
❼ Query APIs provide information about a string, weave, or module. (The framework assigns a globally unique identifier to every module, weave, and string.)
Table 4.1: The API of the Weaves framework: actions, inputs, and compositional issues associated with each API. Further details are given under Implementation.

module_load
    Action: Loads a runtime module from an object file on disk.
    Inputs: Location of the object file and an identifier (32-bit unsigned integer) for the module.
    Issues: Splitting an application into appropriate object files. Providing a unique identifier for the module.

weave_compose
    Action: Composes together a set of modules into a weave at runtime.
    Inputs: The constituent module identifiers and an identifier (32-bit unsigned integer) for the weave.
    Issues: Assuring the compatibility of the constituent modules. Assuring the completeness and unambiguousness of the weave. Providing a unique weave identifier.

resolve_ref
    Action: Binds a reference to a definition.
    Inputs: The symbols for the target reference and the target definition. The identifiers of the modules that contain the reference and the definition.
    Issues: Assuring the existence and the compatibility of the target reference and the target definition.

string_init
    Action: Starts a string through a weave.
    Inputs: The entry function of the concerned weave and the required function parameters.
    Issues: Deciding on the underlying thread package. Specifying the entry function and associated parameters according to the rules of the thread package being used.

string_cont (This API provides support for 'string continuations', a specialized function whose utility is explained in Chapter 5, Use Cases.)
    Action: Continues the execution of a certain string into another weave.
    Inputs: The identifier of the other weave.
    Issues: Assuring the similarity or compatibility of the current and the other weaves. Details on string continuations are mentioned in the next section (Implementation).
    /* (a) A generic Weaved application: string 0 runs through weave 0,
       which composes module 0 (file1.o) and module 2 (file3.o); string 1
       runs through weave 1, which composes module 1 (file2.o) and
       module 2 (file3.o). */

    /* (b) Bootstrap pseudo-code */
    void bootstrap ()
    {
        ...
        /* Load modules */
        mod[0] = load_module ("file1.o");
        mod[1] = load_module ("file2.o");
        mod[2] = load_module ("file3.o");
        ...
        /* Compose weaves */
        weave[0] = compose_weave (mod[0], mod[2]);
        weave[1] = compose_weave (mod[1], mod[2]);
        ...
        /* One string through each weave */
        string_init (weave[0], foo);
        string_init (weave[1], bar);
        ...
    }

    /* (c) Configuration file */
    #modules        /* object file: module ID */
    file1.o: 0      /* object file 1 is loaded as module 0 */
    file2.o: 1      /* object file 2 is loaded as module 1 */
    file3.o: 2      /* object file 3 is loaded as module 2 */

    #weaves         /* weave ID: module ID, module ID, ... */
    0: 0, 2         /* weave 0 composes modules 0 and 2 */
    1: 1, 2         /* weave 1 composes modules 1 and 2 */

    #strings        /* string ID: weave ID, start function */
    0: 0, foo (...) // start string 0 in weave 0 at fn. foo ()
    1: 1, bar (...) // start string 1 in weave 1 at fn. bar ()

Figure 4.2: (a) A generic Weaved application. (b) Bootstrap pseudo-code for setting up the tapestry. (c) The corresponding configuration file.
Figure 4.3: Development of Weaved applications. A specification (code, meta-description, or UI) drives a bootstrap module that uses the Weaves linker/loader to load object files, compiled from ordinary source codes, as modules; to compose the modules into weaves; and to run strings through the weaves, all within a single OS process.
❼ Destruction APIs destroy and free memory allocated for modules and weaves.
❼ String APIs are based on those of the underlying thread package. They control string func-
tionality at runtime, including initialization, termination, runtime management and
scheduling. The framework provides a special API for string continuation because it
consists of framework-specific action. For initialization, termination, runtime management and scheduling of strings, the framework allows applications to directly call
underlying thread APIs.
❼ Global APIs start, stop, pause and/or restart the entire tapestry. These APIs, though
not directly related to this thesis, are useful for the dynamic configuration and com-
position of Weaved applications³.
Currently, composing Weaved applications entails writing bootstrap modules or configuration
files. Essentially, bootstrap modules or configuration files encode the composition of modules
in a tapestry, where modules are runtime images of relocatable object files. This property
makes them very similar to Makefiles, which encode the composition of object files in an
executable or a shared library. Consequently, from the usability perspective, composing
Weaved applications is comparable to writing Makefiles.
For simple tapestries, bootstrap modules and configuration files are fairly simple. However,
larger tapestries, especially those involving individual reference resolutions, can complicate
matters. These complications are comparable to those encountered while writing extensive
Makefiles. A GUI for tapestry specification, which is one of our current research directions,
will make the framework more usable.
4.3 Implementation and Preliminary Evaluation
The Weaves framework is currently implemented on GNU/Linux. The implementation is
heavily dependent on the Executable and Linkable File Format (ELF) [TIS95], the format
of native relocatable objects used by most GNU/Linux systems. Following the discussion in this section, therefore, may require the reader's familiarity with the ELF
³Mukherjee and Varadarajan have discussed the details of the dynamic aspects of Weaved applications in [MV05b].
standard.
The basic implementations of a module, a weave, a string, co-strings, the monitor, and a
tapestry are fairly obvious from their definitions. A module is the runtime image of a relocatable object file, i.e. a ‘.o’ file. A weave is a collection of runtime images of relocatable
objects dynamically composed into a subprogram. Individual reference-definition bindings
consist of virtual address resolutions according to typical binding semantics [TIS95]. Currently, strings can be implemented as either kernel-level POSIX threads (pthreads) [BB06]
or as user-level GNU Pth [Eng06]. Co-strings are simply multiple strings running through
a single weave. The monitor is implemented as the main process thread. Finally, a tapestry
is a physical OS process.
This thesis provides a description of the implementation of the core aspects of the framework, which are loading modules, composing weaves, direct control over reference-definition
bindings and string continuations. The implementations of other aspects such as remaining
string APIs, the monitor and the general purpose APIs are either trivial or not of great
significance to this thesis⁴.
The framework's runtime loader and linker, called Load and Let Link (LLL), implements three of the core aspects: loading modules, composing weaves, and direct control over reference-definition bindings. Before discussing the details of LLL, it is important to mention a few
things about the data structures used for modules, weaves and strings. The implementation
of string continuations is discussed later in this chapter.
⁴See Mukherjee [Muk02] for further details.
The data structure for a module contains an identifier (a 32-bit unsigned integer), a list for
storing global references in the module’s code (reference-list), a list for storing external
references in the module’s code (external-references) and a list for storing dependency
libraries of the module (dependency-list). The data structure for a weave contains an identifier (a 32-bit unsigned integer) and a list for storing constituent module identifiers. Among
other items, a string’s data structure maintains an identifier (a 32-bit unsigned integer), a
stack for storing weave identifiers (weave-stack), and a thread identifier (which depends on
the underlying thread package being used). The implementation also maintains three global
hash tables, one for storing pointers to the module structures, one for storing pointers to
the weave structures, and one for storing pointers to the string structures. Applications can
look up the desired data structures in these tables using module identifiers, weave identifiers
and string identifiers, respectively.
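A sketch of these bookkeeping structures follows; the type names (list_t, id_stack_t) and field names are assumptions based on the description above, not the framework's actual declarations:

    /* Hypothetical sketch of LLL's bookkeeping structures */
    #include <stdint.h>

    typedef struct list  list_t;       /* assumed opaque list type */
    typedef struct stack id_stack_t;   /* assumed opaque stack type */

    struct module {
        uint32_t  id;                  /* 32-bit unsigned identifier */
        list_t   *reference_list;      /* all global references in the code */
        list_t   *external_references; /* unresolved external references */
        list_t   *dependency_list;     /* required dependency libraries */
    };

    struct weave {
        uint32_t  id;                  /* 32-bit unsigned identifier */
        list_t   *module_ids;          /* identifiers of constituent modules */
    };

    struct string {
        uint32_t    id;                /* 32-bit unsigned identifier */
        id_stack_t *weave_stack;       /* weave IDs pushed by continuations */
        uintptr_t   tid;               /* underlying thread identifier */
    };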
4.3.1 Load and Let Link (LLL): Weaves' Runtime Loader and Linker
The Weaves framework requires extensive runtime loading and linking capabilities to flexibly
load and compose modules and randomly manipulate reference-definition bindings. Traditional native object loaders cannot support the flexibility demands of Weaves. For instance,
existing loaders cannot load multiple copies of a native code object. Again, typical loaders
do not provide an explicit interface to control the resolution of individual global references.
Therefore, Weaves provides its own tool—Load and Let Link (LLL)—for the dynamic loading
and linking of modules.
4.3.2 The LLL Loader
The LLL loader maps given relocatable object files from the disk to corresponding modules
in the memory. To load a module, Weaved applications must explicitly specify the location
of the concerned relocatable object file. Applications must also provide a unique module
identifier (a positive 32-bit unsigned integer).
If inputs, such as location and identifier, pass validity checks (existence of file and uniqueness
of identifier, respectively), the loader converts the relocatable object (.o file) to a shared
object (.so file). Most relocatable objects can be converted to respective shared objects
without any concerns⁵. Conversion of a relocatable object to a shared object does not require
referential completeness. Although it does result in static resolution of certain locally scoped
references (such as those to static variables and read-only constants), global references are
not resolved and all information on them is maintained within the resultant shared object.
If the conversion is successful, the loader verifies the content and format of the resultant
shared object for compatibility. This verification requires detailed checks on ELF encoding
and platform specific factors such as instruction set architectures [FSF05]. Once the object passes the compatibility tests, the loader maps the file from disk to memory subject
⁵To our knowledge, any relocatable object can be converted to a corresponding shared object.
to rules/commands provided in the object encoding. Sometimes, an object is dependent
on libraries for utility functions. For instance, an object that uses printf is dependent on
glibc. When loading a module, LLL records every required dependency library in the module’s dependency-list. Nevertheless, since dependency libraries are not a part of the core
application, LLL relies on the underlying OS’s support to load them.
After mapping the module and its dependencies (using the OS’s loader), the LLL loader
attempts to resolve every global reference to a definition within the module. Global references
are looked up in the module’s relocation sections. Definitions are looked up in the module’s
symbol table. During this phase, the loader records each encountered global reference and all
associated information in the module’s reference-list. Such information is instrumental in
empowering Weaved applications with direct control over the resolution of individual global
references. If a certain resolution fails, the loader adds the concerned reference and all
associated information to the module’s external-references list. An unresolved reference,
therefore, does not hamper module loading. The loader follows regular resolution semantics
stated in the ELF standard [TIS95]:
❼ For a successful resolution, the target reference is bound to the virtual memory address
(VMA) of the concerned definition.
❼ For a failed resolution, the target reference is bound to a NULL value.
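As an illustration of what the loader records, consider the following hypothetical relocatable object; the file name, symbols, and build command are illustrative:

    /* solver.c -- hypothetical; built with: gcc -c solver.c -o solver.o */
    extern double relax (double);  /* external reference: recorded in the
                                      external-references list and bound to
                                      NULL until weave composition */
    double boundary = 1.0;         /* global definition within this module */

    double solver_step (void)
    {
        /* two global references: one resolves within the module (boundary),
           one remains unresolved at load time (relax) */
        boundary = relax (boundary);
        return boundary;
    }

Loading solver.o twice under different identifiers yields two modules, each with its own copy of boundary and its own unresolved reference to relax.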
Modules are not runnable entities by themselves. Applications must compose them into
weaves, the principal runnable component (subprogram) as defined by the Weaves frame-
work, before execution. Because a weave can comprise multiple modules, cross-linking among
constituent modules has a higher precedence than linking to the dependency libraries. Hence,
the loader does not attempt to resolve a module’s references to its dependency libraries at
this stage.
The loader maps every module to a unique location in the memory. Hence, functions and
variables defined within a module remain independent of all other modules. Also, the loader
allocates and initializes an independent data structure for each module. This disjunction
among different modules enables complete encapsulation of each module. Furthermore, the
loader identifies every module by a unique 32-bit number that the user assigns to it. As long
as the user provides different identifiers, the loader has no issues in distinguishing multiple
identical modules loaded from a single object file. Therefore, in tandem with disjunctive
mapping, a number-based identification scheme enables multiple identical, but independent,
instantiations of a module. Note that the LLL loader does not entail any modifications to
relocatable object files.
4.3.3 The LLL Linker
The LLL linker binds together (composes) a set of modules to create a weave. To compose a
weave, a Weaved application must explicitly specify identifiers of all the constituent modules
and a unique identifier (a positive 32-bit unsigned integer) for the weave. Upon invocation,
the LLL linker checks the validity of the inputs. If all the module identifiers are valid, and
the weave identifier is valid as well as unique, the LLL linker composes a weave according to
the following steps:
1. It starts with the first module in the given set.
2. For each reference in the module’s list of external references, it looks up a compatible
definition in the other modules (specifically, the symbol tables of the modules) of
the set. If it finds a compatible definition, it resolves the reference to the definition
and continues searching through the rest of the modules. If it finds a second viable
definition, it signals ‘redefinition’ and terminates weave composition. If it finds one, and
only one, definition, it proceeds to the next external reference. If it does not find any
matching definition, it tries resolving to the dependency libraries of the module using
the OS’s services. If it still does not find a matching definition, it signals ‘unresolved
reference’ and proceeds to the next external reference. When all external references
are handled, it proceeds to step 3.
3. If available, it takes the next module in the set and goes to step 2. Otherwise it signals
‘successful weave composition’ and exits.
Step 2 above illustrates several interesting properties of the LLL linker:
❼ The linker does not exit successfully as soon as it finds a compatible definition in order
to stay within the traditional linking constructs of procedural programming. The
Weaves framework defines a weave as a subprogram. Unambiguousness is implied in
this definition. As a result, the linker must check for possible redefinitions.
❼ The linker might not detect multiple definitions in the absence of a corresponding
external reference. Conceptually, however, the encapsulation of modules implies that
each module privately scopes all its definitions within itself. An external reference
expands the scope of all corresponding definitions beyond their container modules,
thereby leading to potential ambiguity.
❼ Unresolved references do not lead to the termination of weave composition. Weaved
applications can exercise direct control over resolution of individual references to fulfill
referential completeness at a later time.
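In outline, steps 1 through 3 can be sketched as follows; all helper names and error codes are assumptions layered on the data structures described earlier (declarations omitted), not LLL's actual internals:

    /* Hypothetical sketch of weave composition (steps 1-3 above) */
    int compose_weave_modules (struct weave *w)
    {
        for (int i = 0; i < n_modules (w); i++) {           /* steps 1 and 3 */
            struct module *m = nth_module (w, i);
            for (int j = 0; j < n_external_refs (m); j++) { /* step 2 */
                struct ref *r = nth_external_ref (m, j);
                int n = find_definitions (w, m, r); /* search other modules */
                if (n > 1)
                    return E_REDEFINITION;          /* ambiguous: abort */
                if (n == 1)
                    bind_to_found_definition (r);
                else if (!resolve_via_dependencies (m, r))
                    report_unresolved (r);          /* non-fatal: continue */
            }
        }
        return 0;                                   /* successful composition */
    }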
Like the LLL loader, the linker follows regular resolution semantics stated in the ELF standard [TIS95] in normal cases. However, shared modules present exceptional cases. As stated
earlier, external references from a shared module (shared references, for short) might need
to resolve to different definitions in the context of different weaves. The linker resolves
these shared references to a special dynamic linker function—rt_lll—defined by LLL itself.
A string accessing a shared reference automatically invokes the rt_lll function. When invoked, the function looks up a definition for the target reference within the scope of the weave associated with the invoker string. This lookup follows steps similar to those of a normal weave composition (steps 1, 2, and 3 above), except that both redefinitions and unresolved references lead to string termination. Also, if the lookup is successful, the target reference is not resolved to the obtained definition. Instead, rt_lll dynamically directs the invoker string to the obtained definition. Details of rt_lll and its dynamic mechanisms are similar to rtld (GNU libc's runtime dynamic linker [FSF05]). However,
the implementation of rt_lll is quite complicated. Because the implementation of rt_lll is peripheral to the main theme of this thesis, for simplicity and focus, we refrain from further discussions of rt_lll in this document⁶.
The LLL linker also implements direct control over the resolution of individual global references. To invoke this service, Weaved applications must provide an identification tuple for a
target reference. The tuple contains at least two fields, the identifier of the module that contains the reference and the symbol associated with the reference. When a module contains
multiple references to a symbol, further information such as the container function, sequence
number and so forth is required. Once again, for conciseness, this thesis does not include
excessive details on this topic. Upon invocation, the linker looks up the target reference’s
entry in the associated module’s reference list. If no matching module or entry is found, the
linker signals an error. If multiple matches are found, the linker requests more information
from the application. If one and only one match is found, the linker proceeds to the next
step.
The next step consists of looking up a corresponding definition to which the target reference
is to be bound. For this purpose, Weaved applications must provide another identification
tuple. This tuple must identify a unique definition, such as a global function or a global
variable. The tuple has two fields, the identifier for a module and an associated symbol.
If the module identifier is valid, the linker looks up a definition for the given symbol in
the given module’s symbol table. If it finds a valid definition, the linker resolves the target
⁶See Mukherjee and Varadarajan [MV05a] and the Free Software Foundation [FSF05] for more details.
reference to that definition. Otherwise, it signals an error.
There are four interesting aspects of the resolution mechanism described above. Firstly, the
resolution of individual references is completely under application control. Unlike typical
linkers, and also unlike weave composition, the resolution is NOT based on symbol matching. Therefore, a reference to a symbol foo in one module can be arbitrarily resolved to
a definition bar in another. Secondly, the mechanism is not limited to a subset of global
references, such as undefined or external references. Any global reference, even one that
already links to a valid definition, can be re-linked using this mechanism. These two aspects
allow Weaved applications to arbitrarily control or modify the structure and functionality of
modules, weaves and tapestries.
Thirdly, the individual reference resolution mechanism does not check for compatibility,
such as type matching, between a reference and a definition. The main reason for this
aspect is that native code objects do not maintain high-level information on typing and
function signatures. Even traditional native object linkers cannot perform stringent checks
for type and signature compatibility between references and definitions (such checks are
usually performed at compilation). And fourthly, the direct resolution of an individual
reference overrides the previous or current resolution of that reference, even if it links to
rt_lll. The third and fourth aspects are causes for concern. Weaved applications, therefore,
should carefully ensure that they provide compatible reference-definition pairs when invoking
the individual reference resolution API.
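As a hypothetical usage sketch (the exact C signature of resolve_ref is not reproduced in this chapter, so the call syntax below is an assumption; the identification tuples follow Table 4.1):

    /* Re-wire the reference to symbol "foo" inside module 3 so that it
       points at the definition of "bar" in module 7. Per the text above,
       no type or signature compatibility check is performed, so the
       caller must ensure that foo and bar are compatible. */
    resolve_ref (3, "foo",   /* module and symbol of the target reference  */
                 7, "bar");  /* module and symbol of the target definition */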
The LLL linker composes weaves, the principal runnable subprogram as defined by the
Weaves framework. Therefore, it is the most important part of the framework’s implementation. Based on user-specified numeric identifiers, it can compose and distinguish identical
weaves from similar sets of modules. Furthermore, using dynamic linking mechanisms, the
LLL linker single-handedly enables the selective sharing of modules among multiple weaves.
Lastly, with some help from the LLL loader, it enables direct control over the resolution of
individual references to arbitrary definitions. Like the LLL loader, the LLL linker does not
entail any modifications to object files.
4.3.4 Strings: Continuations and Evaluation
Strings are based on underlying threads. As mentioned earlier, strings can be initialized,
terminated, managed and scheduled using corresponding thread APIs directly. However,
string continuation comprises framework-specific action. Users must explicitly program a call
to the string continuation API into a module. A string dynamically invokes the continuation
API when it executes through a weave that contains the concerned module. As Table 4.1
describes, a call to the continuation API switches the caller string from its current weave to a different,
but compatible, weave (the target weave).
To invoke the continuation API, a Weaved application must provide the identifier of the
target weave. Upon invocation, the continuation API checks for the validity of the given
weave identifier. If the check is successful, it traces the following steps:
1. It issues a query to obtain the weave within which the string is currently executing
(the current weave).
2. It checks the compatibility of the current and the target weaves. If they are compatible⁷, it proceeds to the next step. Otherwise, it signals an error and terminates the
continuation.
3. It pushes the current weave’s identifier onto the string’s weave-stack.
4. It looks up the next instruction following the continuation call in the current weave,
jumps to the corresponding instruction in the target weave and continues the execution
of the string from there.
A string can progressively call continuations. Each subsequent call pushes the current weave
identifier on the string’s weave-stack. A string can return to the immediately previous
weave by calling the end_continuation API. The end_continuation API works as follows:
1. It pops the immediately previous weave identifier from the caller string’s weave-stack.
2. If the weave-stack was empty, it signals an error and terminates the call to end_continuation. Otherwise, it proceeds to the next step.
3. It looks up the next instruction following the end_continuation call in the current
weave, jumps to the corresponding instruction in the previous weave and continues the
execution of the string from there.
⁷Two weaves are compatible if and only if they comprise the same or identical modules.
String continuations are helpful in efficient information exchanges among strings running
through compatible weaves (similar strings). Using continuations, a certain string can switch
to the weave of another similar string and asynchronously modify its state. The other string
notices the effects of such modifications without any explicit action.
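A hedged sketch of such an exchange follows; string_cont matches Table 4.1, while the name end_continuation, the global variable, and the surrounding function are illustrative:

    /* Hypothetical module code: a string switches into a compatible weave,
       updates that weave's copy of a global variable, and switches back. */
    #include <stdint.h>

    extern void string_cont (uint32_t weave_id); /* per Table 4.1 */
    extern void end_continuation (void);         /* return to previous weave */

    int new_boundary;   /* one copy per set of weaves sharing this module */

    void push_update (uint32_t other_weave, int value)
    {
        string_cont (other_weave);  /* jump into the similar string's weave */
        new_boundary = value;       /* asynchronously modify its state */
        end_continuation ();        /* pop back to the previous weave */
    }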
A simple experiment has confirmed that strings are indeed lightweight and have low context-
switching times. We created a baseline application that implemented a calibrated delay loop
(busy wait of 107 seconds) and then implemented thread-based, process-based and weaved
versions of the application. In each of these versions, there were n independent flows of
execution over the same code, where each flow of control executed a loop that did 1/nth the
baseline work. We measured the total time of execution in each case. Since each flow of
execution did 1/nth of the work and there were n flows, the total time taken should have
been the same as the baseline calibrated delay loop case, except for an additional context
switching cost.
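The pthread-based variant of this experiment can be sketched as follows; all names are illustrative, and the calibration of the baseline iteration count is not shown:

    /* Hypothetical sketch: n flows, each doing 1/n-th of the baseline work */
    #include <pthread.h>

    static long baseline_iters;  /* set by a calibration pass (not shown) */
    static int  n_flows;

    static void *flow (void *arg)
    {
        (void) arg;
        volatile long sink = 0;
        for (long i = 0; i < baseline_iters / n_flows; i++)
            sink += i;           /* busy-wait work */
        return NULL;
    }

    void run_experiment (int n)
    {
        pthread_t t[n];
        n_flows = n;
        for (int i = 0; i < n; i++)
            pthread_create (&t[i], NULL, flow, NULL);
        for (int i = 0; i < n; i++)
            pthread_join (t[i], NULL); /* total time ~ baseline + switching */
    }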
Figure 4.4 shows the results of the experiment on a single processor AMD Athlon workstation
running Linux. The results show the run time for six versions of the experimental application: (a) baseline calibrated delay loop version, (b) pthread-based version, (c) Pth-based
version, (d) process-based version, (e) Weaved version over pthreads, and (f) Weaved version
over Pth. The results clearly show that the weaved implementations are significantly faster
than the process-based one even in this simple case, where the copy-on-write semantics of the
fork() call are very effective. Furthermore, the run time of the weaved version over pthreads
is very close to the base run time of the pthread-based version. The marginal variation is
[Figure 4.4 plots execution time in seconds (y-axis, approximately 100 to 180 seconds) against the number of control flows (x-axis, 0 to 1000) for the Processes, Pthreads, Pth, Weaves/Pthreads, Weaves/Pth, and Baseline versions.]
Figure 4.4: Comparison of context switch times of threads, processes, and strings. The
baseline single process application implements a calibrated delay loop of 107 seconds.
due to the slightly higher weave creation cost, which is included in the run time. However,
the run time of the weaved version over Pth is higher than the base Pth-based case. This
increase in runtime occurred because, unlike pthreads, Pth is a user-level library and, hence,
suffers from timer inaccuracies inherent in user-level libraries.
4.3.5 Portability
Currently, the Weaves framework is implemented over three architectures: x86, x86_64, and
ia64. However, since most Linux-based systems subscribe to ELF and related semantics,
the implementation is fairly architecture-neutral and can easily be extended to other platforms such as Digital Alpha, Sun Solaris, and PowerPC. A port to PowerPC is currently underway.
Porting to different operating systems such as Windows and OS-X poses bigger problems,
because they do not comply with the ELF standard. However, most regular operating
systems, including OS-X and Windows, allow the object file based decoupling of applications.
Even though they have different formats, these object files are similar in structure and content
to ELF relocatable objects. Theoretically, therefore, the Weaves framework is portable across
a wide range of operating systems and architectures.
4.4 Properties of Weaved Applications
Encapsulation enforces decoupling between modules. Coupled with facilities for multiple
instantiation, the encapsulation of modules empowers Weaved applications with the ability
to have identical, but completely separate, copies of a relocatable object component within
a single process. Furthermore, the ability to load referentially incomplete relocatable object
files allows Weaved applications to exploit runtime modularization/componentization at arbitrary granularities. To avail themselves of these facilities, Weaved applications do not need
to instrument any modifications on concerned relocatable objects.
Weaved applications realize each parallel program component as a weave, because a weave
is an intra-process subprogram that can support a flow of execution. Weaves composed
from disjunctive sets of modules are completely independent parallel subprograms within a
single process. The Weaves framework allows a single module to be part of multiple weaves.
Weaved applications can exploit this facility to realize arbitrary graph-like selective sharing
of modules among various weaves (Figure 4.5):
❼ A weave can share, or not share, any number of modules with another weave.
❼ Transitively, any number of weaves can share, or not share, a single module.
❼ A weave can share, or not share, different modules with different weaves.
When references from different weaves resolve to a single definition of a function or a variable,
the concerned definition is shared among the different weaves. Therefore, Weaved applications can exercise direct control over resolution of individual global references to extend selective sharing among weaves to the fine granularity of individual functions and variables (a hypothetical sketch follows this list):
❼ A weave can selectively share any number of functions and variables with another
weave.
❼ Transitively, any number of weaves can share a single function or variable.
❼ A weave can selectively share different functions and variables with different weaves.
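The following fragment sketches what such fine-grained control could look like. All declarations here are hypothetical stand-ins: load_module() mirrors the bootstrap pseudo-code shown in Chapter 5, while resolve_reference() is an assumed name for the framework's per-reference binding facility, not a documented signature:

/* Hypothetical declarations standing in for the runtime linking API. */
typedef struct module module_t;
module_t *load_module(const char *object_file);
void resolve_reference(module_t *referrer, const char *symbol,
                       module_t *provider);

void compose_fine_grained(void)
{
    module_t *s1  = load_module("solver.o");   /* first copy            */
    module_t *s2  = load_module("solver.o");   /* second, separate copy */
    module_t *med = load_module("mediator.o"); /* has unresolved externs */

    /* Bind individual references in the mediator to definitions in
     * specific solver modules; everything else remains separate.    */
    resolve_reference(med, "solution_struct_1", s1);
    resolve_reference(med, "solution_struct_2", s2);
    resolve_reference(med, "boundary_cond_1", s1);
    resolve_reference(med, "boundary_cond_2", s2);
}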
All the components of a Weaved application, including the fundamental module, are intra-process runtime units. Also, Weaved applications need not instrument any code modifications on modules either at the source level or at the level of relocatable objects. They can
use the framework’s facilities transparently.
Legacy procedural codes are easily available as relocatable objects. The Weaved versions
of legacy parallel programs can transparently (without code modifications) load multiple
encapsulated modules of legacy procedural codes into intra-process runtime environments.
When they run strings through weaves composed from distinct sets of identical modules,
the strings do not share any global variables. This capability directly addresses the first
research challenge: the need to transparently separate global variables used by identical, but
independent, threads of a legacy parallel program.
[Figure 4.5 diagram: a sample tapestry of weaves w1 through w5 composed from modules m1 through m10, with strings s1 through s5 running through them, all within a single OS process.]
Figure 4.5: A sample tapestry, essentially a complete parallel application executing as a single OS process. The figure shows the individual weaves (w), their constituent modules (m), strings (s), and their composition reflecting the structure of the application as a whole. Identical shapes imply identical copies of a module. The lines connecting the modules imply external references being resolved between them.

Strings running through weaves experience all the elements of sharing reflected in the compositions of those weaves. Therefore, Weaved versions of legacy parallel programs can transparently realize arbitrary multi-granular selective sharing of functions and variables among
strings. Because strings are essentially intra-process threads, this selective sharing among
strings addresses the second research challenge: the need to transparently realize multi-granular selective sharing of global variables among the threads of a legacy parallel program.
Thus, Weaved versions of legacy parallel programs can run over lightweight intra-process
threads without concerns of inadvertent namespace collisions and unintentional sharing. Furthermore, they can exploit low-overhead collaboration among constituent parallel threads.
Together, these properties facilitate larger scales of parallelism for Weaved legacy parallel
programs through better utilization of resources on a single multi-core machine or SMP
workstation. Ultimately, larger scales of parallelism can facilitate more accurate modeling
of large-scale parallel phenomena such as networks and multi-physics problems.
4.5 Summary
This chapter has provided a description of elements of the Weaves runtime framework for
parallel programs. It has also illustrated the manner in which the framework addresses the
research challenges and helps reach the overall goal of helping “unmodified” legacy parallel
programs exploit the scalability provided by threads.
Weaved applications can load encapsulated runtime modules from relocatable object files.
They can load multiple independent modules from a single object file without entailing any
modifications to the concerned object file. By allowing direct runtime control over the resolution of individual references in a module's code, the Weaves framework empowers programs
with the ability to manipulate their composition at fine granularities. Through modules,
the Weaves framework supports the transparent encapsulation and multiple instantiation of
legacy procedural codes in intra-process environments.
Just as the compile-time linking of object files creates an executable program, the runtime
composition of a set of modules creates a weave. A weave is, therefore, an intra-process
subprogram that can support a flow of execution. The framework allows a single module
to be shared among multiple weaves, which can be leveraged to realize arbitrary graph-like sharing of modules among the different weaves. Direct control over the resolution of
individual references can extend this selective sharing among weaves to finer granularities.
All components of Weaved applications, including the fundamental module, are intra-process
runtime units. Also, Weaved applications need not instrument any code modifications on
modules either at the source level or at the level of relocatable objects (native code patches).
They can avail themselves of the framework’s facilities transparently.
Legacy procedural codes are easily available as relocatable objects. Hence, the Weaved
versions of legacy parallel programs can transparently (without code modifications) load
multiple encapsulated modules of legacy procedural codes into intra-process runtime environments. When they run strings through weaves composed from distinct sets of identical
modules, the strings do not share any global variables. This capability directly addresses the
first research challenge: the need to transparently separate global variables used by identical,
but independent, threads of a legacy parallel program.
Strings running through weaves experience all the elements of sharing reflected in the compositions of those weaves. Therefore, Weaved versions of legacy parallel programs can transparently realize arbitrary multi-granular selective sharing of functions and variables among
strings. Since strings are essentially intra-process threads, this selective sharing among
strings addresses the second research challenge: the need to transparently realize multi-granular selective sharing of global variables among the threads of a legacy parallel program.
Weaved versions of legacy parallel programs can run over lightweight intra-process threads
without concerns of inadvertent namespace collisions and unintentional sharing. Furthermore, they can exploit low-overhead collaboration among constituent parallel threads. Together, these properties facilitate larger scales of parallelism for Weaved legacy parallel programs through better utilization of resources on a single multi-core machine or SMP workstation. Ultimately, larger scales of parallelism can facilitate more accurate modeling of
large-scale parallel phenomena such as networks and multi-physics problems.
At the basic level, the Weaves framework offers its services as a library that supports simple
APIs. Users can explicitly program the composition of Weaved applications using these
APIs. The framework also provides a meta-language for specifying such compositions in a
configuration file and a script that automatically creates and runs corresponding Weaved
applications. Essentially, configuration files are very similar to Makefiles. Consequently,
from the usability perspective, composing Weaved applications is comparable to writing
Makefiles.
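As an illustration, a minimal configuration file for an application that loads one object twice and shares a second object between two weaves might read as follows; the format mirrors the configuration-file example shown in Chapter 5, and the object and function names are placeholders:

#modules
/* object file: module ID, module ID, ... */
app.o: 0, 1        /* app.o is loaded twice, as modules 0 and 1 */
lib.o: 2

#weaves
/* weave ID: module ID, module ID, ... */
0: 0, 2            /* weave 0 composes modules 0 and 2 */
1: 1, 2

#strings
/* string ID: weave ID, start function */
0: 0, app_main (...)
1: 1, app_main (...)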
The Weaves framework is currently implemented on GNU/Linux over three architectures:
x86, x86_64, and ia64. Weaves' runtime loader and linker, called Load and Let Link (LLL),
implements the core aspects of the framework, which include loading modules, composing
weaves and direct control over the resolution of individual references.
The implementation is heavily dependent on the Executable and Linkable Format (ELF)
[TIS95], the format of native relocatable objects used by most GNU/Linux systems. However, the implementation is fairly architecture-neutral. A port to the Power PC architecture
is currently underway. Porting to different operating systems such as Windows and OS-X
poses bigger problems. However, because most regular operating systems allow object file
based decoupling of applications, theoretically, the Weaves framework is portable across a
wide range of operating systems and architectures. Lastly, a preliminary experiment has
confirmed that strings are indeed lightweight and more scalable than processes.
Chapter 5
Case Studies
This chapter provides a demonstration of the utility of the Weaves framework in software areas that comprise legacy parallel programs by presenting various case studies from the areas
of network emulation and parallel scientific computing. Weaved instances of network emulation and parallel scientific programs are used to further illustrate the design, implementation
and development of Weaved applications. For instance, one of the case studies explains the
use of string continuations. Because these studies are for the purposes of illustration, we
have chosen them from two radically different domains of parallel systems to emphasize the
broad impact of the Weaves framework. These case studies were designed to exemplify the
manner in which Weaved instances of network emulation and parallel scientific computing
institute advances toward area-specific versions of our overall goal. Finally, this chapter also
presents the results of experiments with large real-world applications to substantiate the
effectiveness of this research.
5.1 Using Weaves for Network Emulation
This section gives a description of the Weaved instances of network emulation (see Chapter
2). First, a simple hypothetical case study is used to describe a Weaves-based approach
to network emulation. This case study exemplifies the design, implementation and development of Weaved network emulations. Next, an experiment designed to corroborate that
Weaved instances of network emulation can indeed run multiple independent threads of
unmodified legacy network applications is detailed. The experiment also shows that the
Weaves framework can facilitate user-level selective sharing of multiple real-world IP stacks
among the various application threads. Then there follows an explanation of the results of
experiments with large real-world network emulations performed using the Open Network
Emulator (ONE) [Var02].
These results herald the ONE’s ability to correctly model larger scales of parallel real-world
network applications. In these experiments, the ONE used the Weaves framework to run
unmodified real-world network applications and protocol stacks (legacy procedural codes)
over intra-process threads. Furthermore, the ONE used the framework for low-overhead user-level selective sharing of multiple TCP/IP stacks among various application threads. Such
uses of the Weaves framework were instrumental in helping the ONE transparently exploit
overall resources for larger scales of parallelism. Ultimately, larger scales of parallelism enable
more accurate testing of network protocols. The results, therefore, substantiate the efficacy
of the current research.
5.1.1 A Simple Instance
Figure 5.1(a) depicts the simple hypothetical network scenario seen in Figure 2.2. To emulate
this scenario using the Weaves framework, a user must compile the telnet code into one
relocatable object (telnet.o) and the IP stack code into another relocatable object (ipstack.o),
load two distinct modules of telnet.o (module 0 and module 1) and one module of ipstack.o
(module 2), and compose two weaves as shown in the figure. The first weave (weave 0)
comprises modules 0 and 2; the second weave (weave 1) comprises modules 1 and 2. Then,
the user must start two strings, one each at the main entry functions of the telnet modules
(module 0 and module 1).
These two strings run through separate telnet modules and do not interfere with each other.
This Weaved realization helps separate the global variables of the two encapsulated telnet
modules. It does not require any modifications to the telnet objects, which can be compiled
from legacy procedural sources of telnet. The Weaved realization effectively addresses the
network emulation version of the first research challenge: the need to transparently separate
global variables used by identical, but independent, threads of a legacy parallel program.
Since both strings run through a single IP stack, they collaborate within the IP stack even
though they comprise separate telnets. The two telnets interfere with each other within the
stack, thereby emulating correct real-world operation. Advanced Weaved emulations, such
as the one depicted in Figure 5.2, selectively share different independent IP stacks among
different sets of application threads.
[Figure 5.1(a) diagram: strings 0 and 1 start in modules 0 and 1 (telnet.o), which compose weaves 0 and 1 around a shared module 2 (ipstack.o).]

(b) Bootstrap pseudo-code:

void bootstrap ()
{
    ...
    /* load modules */
    for ID = 0 TO 1
        mod[ID] = load_module ("telnet.o");
    mod[2] = load_module ("ipstack.o");
    ...
    /* compose weaves */
    weave[0] = compose_weave (mod[0], mod[2]);
    weave[1] = compose_weave (mod[1], mod[2]);
    ...
    /* one string through each weave */
    for ID = 0 TO 1
        string_init (weave[ID], telnet_main);
    ...
}

(c) Configuration file:

#modules
/* object file: module ID, module ID, ... */
telnet.o: 0, 1    /* telnet.o is loaded as modules 0 and 1 */
ipstack.o: 2      /* ipstack.o is loaded as module 2 */

#weaves
/* weave ID: module ID, module ID, ... */
0: 0, 2           /* weave 0 composes modules 0 and 2 */
1: 1, 2           /* weave 1 composes modules 1 and 2 */

#strings
/* string ID: weave ID, start function */
0: 0, telnet_main (...)   // start string 0 in weave 0 at telnet's entry fn.
1: 1, telnet_main (...)   // start string 1 in weave 1 at telnet's entry fn.
Figure 5.1: Modeling the simple network scenario of Figure 2.2 using the Weaves framework.
(a) Weaved setup of the tapestry. (b) Bootstrap pseudo-code. (c) Configuration file.
These simple hypothetical Weaved realizations do not entail any modifications to either
the telnet object or the IP stack object, which can be compiled from corresponding legacy
procedural sources. In effect, these Weaved realizations show that the framework addresses
the network emulation version of the second research challenge: the need to transparently
realize multi-granular selective sharing of global variables among the threads of a legacy
parallel program.
5.1.2 Experimental Corroboration
Figure 5.2 depicts the experiment that was designed to check whether Weaved instances of
network emulation can indeed run multiple independent threads of unmodified legacy network applications. The experiment also checks whether the Weaves framework can facilitate user-level selective sharing of multiple real-world IP stacks among the various application
threads.
The figure shows a simple network with two hosts each running a server and a client. We
compiled real-world codes for the client, server, and the IP stack [Kne04] into relocatable
objects (client.o, server.o, and ipstack.o respectively). The codes were written in C and
contained global variables. We loaded two modules from each object (c1, c2 from client.o;
s1, s2 from server.o; and ip1, ip2 from ipstack.o) and composed four weaves as follows:
❼ weave wv1: c1 and ip1.
❼ weave wv2: s1 and ip1.
❼ weave wv3: c2 and ip2.
❼ weave wv4: s2 and ip2.
Finally, we initialized four strings, one each at the start functions of c1, c2, s1, and s2. The
entire emulation, comprising two independent virtual hosts, operated as a single OS process
on a single-processor AMD Athlon™ workstation (32-bit x86 architecture) running the
Linux OS.
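In code, the composition of this tapestry could look roughly like the following sketch, which follows the bootstrap pseudo-code conventions of Figure 5.1(b); the handle types and the client_main/server_main entry-point names are illustrative assumptions:

/* Hypothetical handle types; the calls follow Figure 5.1(b). */
typedef struct module module_t;
typedef struct weave  weave_t;

void bootstrap (void)
{
    /* Two independent copies of each application object. */
    module_t *c1  = load_module ("client.o");
    module_t *c2  = load_module ("client.o");
    module_t *s1  = load_module ("server.o");
    module_t *s2  = load_module ("server.o");
    module_t *ip1 = load_module ("ipstack.o");
    module_t *ip2 = load_module ("ipstack.o");

    /* Virtual host 1 shares ip1; virtual host 2 shares ip2. */
    weave_t *wv1 = compose_weave (c1, ip1);
    weave_t *wv2 = compose_weave (s1, ip1);
    weave_t *wv3 = compose_weave (c2, ip2);
    weave_t *wv4 = compose_weave (s2, ip2);

    /* One string at the start function of each client and server. */
    string_init (wv1, client_main);
    string_init (wv2, server_main);
    string_init (wv3, client_main);
    string_init (wv4, server_main);
}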
[Figure 5.2 diagram: virtual host 1 comprises weaves wv1 (c1 from client.o + ip1 from ipstack.o) and wv2 (s1 from server.o + ip1); virtual host 2 comprises weaves wv3 (c2 + ip2) and wv4 (s2 + ip2). The whole tapestry runs as a single OS process.]
Figure 5.2: Weaved setup of the experimental network scenario. Both clients used identical
real-world codes as did the servers. The IP stacks used identical real-world codes. The two
hosts were completely independent, but ran within a single OS process.
To test whether the setup worked correctly, we transferred two different files (f1 and f2)
between each client-server pair. Client c1 transferred file f1 to server s1 through IP stack
ip1, and client c2 transferred file f2 to server s2 through IP stack ip2. Server s1 wrote the
received data to output file o1, and server s2 wrote the received data to output file o2. This
setup did not entail any modifications to the client, server and IP stack codes.
After the transfer, we used Linux’s diff command to compare files f1 and o1. The files
were identical, as were files f2 and o2. Furthermore, the differences between f1 and f2
were identical to those between o1 and o2. These results show that the transfers took place
properly. Unless the clients, servers and IP stacks were completely separate, the file transfers
would have interfered with each other resulting in corrupted outputs. Furthermore, unless
the two IP stacks were selectively shared between the different client-server pairs, the file
transfers between the pairs would have run into errors.
The results, therefore, reasonably corroborate the correctness of the Weaved setup, because
they show that the Weaves framework transparently facilitates (1) execution of legacy parallel
programs (clients and servers) over lightweight intra-process threads without inadvertent
sharing of global variables, and (2) user-level selective sharing of global variables (within
the IP stacks) among threads of such legacy parallel programs. In effect, these results are
indicative of the progress towards the overall goal of this research: helping “unmodified”
legacy parallel programs exploit the scalability provided by threads.
5.1.3 Contextual Advances
The lateral Open Network Emulator (ONE) project [Var02] has reached prototype capability.
Over the last few months, the distributed version of the ONE (dONE [BVB06]) was used to
experiment with large real-world network emulations. In these experiments, the ONE used
the Weaves framework to run unmodified real-world network applications and protocol stacks
(legacy procedural codes) over threads. Furthermore, the dONE used the framework for low-overhead user-level selective sharing of multiple TCP/IP stacks among various application
threads.
Figure 5.3: The Open Network Emulator (ONE) models thousands of simultaneously running real-world network nodes and applications on a single machine.
These experiments were performed on the Linux 2.6.11 kernel over a cluster of eight dual-CPU Opteron machines, for a total of sixteen processors, with 2 GB of physical memory per machine. The machines were connected via a 10 Gbps Infinicon InfiniBand interface. Each machine was configured based on Red Hat Fedora Core 2 in a diskless configuration.
Figure 5.3, seen earlier in Figure 2.1, depicts the overall setup of the experiments. The
dONE used the Weaves framework to run multiple parallel sender and receiver applications
over multiple IP stacks to emulate multiple virtual nodes within a single process on a single
physical machine. Collaboration across physical nodes was facilitated using the Message
Passing Interface (MPI/LAM [PTL06]).
The codes used in the experiments consisted of senders and receivers written in the C language and a real-world TCP/IP stack extracted from the Linux 2.4 kernel [Kne04]. The
TCP/IP stack contained a large number of global variables.
The results from the experiments show that all data transfers across various senders and
receivers were without errors. Furthermore, the results confirm that the dONE can correctly
model the real-world behavior of protocols such as TCP/IP. For a certain scenario of 50
sender-receiver pairs, the results are consistent and the variance negligible. Most pertinently,
the results herald that the dONE can emulate a thousand virtual nodes over a single physical
machine. On increasing the number of machines, the emulation speeds up super-linearly; i.e., on increasing the number of machines by a factor of N, the time taken to run a certain emulation drops by a factor greater than N (Figure 5.4, reproduced from [BVB06]). This super-linear speedup attests to the scalability of the dONE.
The exact details of the experiment and further analyses of results are beyond the scope of
the current thesis. Interested readers can refer to [BVB06] for additional exposition.
These results show that the dONE can emulate large-scale networks consisting of thousands
of network applications and virtual nodes on a limited number of machines. The Weaves
[Figure 5.4 plot: speedup of a certain dONE emulation between 1 and 15 physical processors, comparing achieved speedup against ideal linear speedup.]
Figure 5.4: The dONE exhibits super-linear speedup when emulating real-world network
nodes and applications. This figure is reproduced from [BVB06].
framework is instrumental in helping network emulations transparently exploit the scalability
of threads for larger scales of parallelism, the overall goal of this research. Ultimately, larger
scales of parallelism enable more accurate testing of network protocols. The results, therefore,
substantiate the usefulness of the Weaves framework.
5.2 Using Weaves for Scientific Computing
This section provides a description of Weaved instances of parallel scientific applications that
comprise collaborating partial differential equation (PDE) solvers (see Chapter 2). First,
some simple hypothetical case studies are used to describe Weaves-based approaches to
parallel scientific computing. These case studies exemplify the design, implementation and
development of Weaved scientific applications. One of the case studies explains the use of
string continuations. Next, experiments designed to corroborate that Weaved instances of
scientific applications can indeed run multiple independent threads of unmodified legacy
PDE solvers are detailed. One of the experiments also shows that the Weaves framework can
facilitate low-overhead collaboration between legacy PDE solver threads through user-level
selective sharing of global variables. Then there follows an explanation of the results of
experiments with large parallel scientific applications.
These experiments used the Weaves framework to run unmodified real-world scientific solvers
(legacy procedural codes containing global variables) over intra-process threads. The results
of one experiment show that, under identical resource limitations, a Weaved application can
support nearly twice the number of unmodified parallel solvers as the corresponding process-based application can. The results of another experiment show that a Weaved application
can transparently facilitate low-overhead collaboration among parallel solver threads through
user-level sharing of global variables, which allows for greater scales of achievable parallelism
than traditional MPI-based communication allows [KBA92, ANLUC06].
These results together indicate that the Weaves framework is instrumental in helping parallel
scientific applications transparently exploit overall resources for larger scales of parallelism.
Ultimately, larger scales of parallelism enable more accurate modeling of multi-physics phenomena. The results, therefore, substantiate the efficacy of the current research.
5.2.1 A Simple Instance
The Weaves framework opens up three possibilities for realizing the simple hypothetical
instance of collaborating PDE solvers depicted in Figure 5.5 (seen earlier in Figure 2.8). The
first two approaches assume that solvers and mediators are designed with an awareness of
Weaves. The second approach uses string continuations for solver-mediator collaboration.
The third and most radical approach reuses unmodified solver and mediator codes from traditional agent-based implementations.

[Figure 5.5 diagram: two solvers, S1 and S2, coupled through a mediator, M12.]
Figure 5.5: The simple PDE solver scenario of Figure 2.8.
Designing for Weaves
The design of Weaves-aware collaborating PDE solvers can follow different methods depending on the extent of affordable parallelism. If the number of available processors is large,
both solvers and mediators can afford independent parallel strings of execution. Otherwise,
only solvers can own a parallel string, while mediator modules are shared among adjacent
solver strings.
Parallel Solvers and Mediators: In this method users must compile the solver and mediator instances into separate relocatable objects, solver.o and mediator.o, respectively. The
solver object must define and store the global variables for boundary conditions and solution
structures. The mediator object must make external references to boundary conditions and
solution structures.
[Figure 5.6 diagram: strings 1, 2, and 3 run through weaves Wv1 (S1), Wv2 (S2), and Wv3 (M12) within a single OS process. The solver and mediator codes are sketched below.]

/* Solver code (S1 and S2) */
...
global boundary_cond;
global solution_struct;

function solver (...) {
    ...
    do {
        ...                   // wait for new condition
        PDE_solve (...);
        ...                   // report new solution
    } while (...)
    ...
}

/* Mediator code (M12) */
...
extern solution_struct_1;
extern solution_struct_2;
extern boundary_cond_1;
extern boundary_cond_2;

function mediator (...)
{
    ...
    do {
        ...                   // wait for solution(s)
        Relax_soln (...);
        ...                   // report new conditions
    } while (...)
    ...
}
Figure 5.6: A possible Weaved realization of the Figure 5.5 scenario. S1, S2, and M12 are
composed into separate weaves Wv1, Wv2 and Wv3. External references from M12 are
explicitly bound to definitions within S1 and S2.
Figure 5.6 depicts the Weaved setup. To compose and run a complete tapestry of collaborating PDE solvers, users must load 2 modules of solver.o (S1 and S2) and one module of
mediator.o (M12). They must then compose each solver module into a solver weave (Wv1,
Wv2), the mediator module into a mediator weave (Wv3) and exercise direct control over the
resolution of M12’s references to bind them to the different definitions of the solution structures and boundary conditions within the two solvers. Lastly, users must initiate parallel
strings at the entry functions of the solvers and mediator.
The semantics of the overall application remain the same as in the traditional agent-based
implementation. Initially, the solvers read in the initial PDE structure and start the first
computations. At the end of a run, they write their solutions to their solution structures and
wait for fresh boundary conditions. The mediator performs relaxations on solver solutions
and writes new boundary conditions to the solvers. As soon as the new boundary conditions
become available, the solvers start off again and the loop is repeated till a satisfactory state
is reached.
Parallel Solvers and Shared Mediators: This method is based on the critical observation that neither the physics problem nor the computational basis behind an instance of
collaborating PDE solvers requires that mediators be parallel independent flows of execution.
The fact that they are is a direct result of the distributed agent model traditionally used to
solve such problem instances.
Here also, users must compile the solver and mediator instances into separate relocatable
objects, solver.o and mediator.o, respectively. The solver object must define the global
boundary condition variables and solution structures. Additionally, each solver component
must explicitly call external mediator functions. The mediator object must make external
references to the boundary variables and solution structures.
This time, users must load 2 modules of solver.o (S1 and S2) and one module of mediator.o
(M12). Then they must compose S1 and M12 into weave Wv1, S2 and M12 into weave
Wv2, and exercise direct control over the resolution of M12’s references to bind them to
the different definitions of the solution structures and boundary conditions within the two
solvers. Finally, they must initiate one string at the entry function of S1 and another string
at the entry function of S2. As a result, while the solvers have their own parallel flows of
execution, the mediator does not. The mediator module is a part of the solver weaves.
Figures 5.7(a), 5.7(b), 5.7(c), and 5.7(d) show the corresponding tapestry layout and pseudocodes for the solvers and the mediator. Because M12’s external references resolve to definitions in both S1 and S2, Wv1 and Wv2 together ensure referential completeness of the
tapestry, even though they are individually incomplete. Moreover, the mediator function in
M12 is invoked twice (once each from S1 and S2) during a single relaxation act. An effective
relaxation can be performed by either (c) the last invocation when both the solvers’ solutions
are available, or (d) the first invocation, which uses the solvers’ solutions as and when they
are needed and become available. In each case, Weaves’ linking semantics transparently
make sure that the invocation has access to the contexts of both Wv1 and Wv2, or S1 and
S2.
[Figure 5.7(a) diagram: strings 1 and 2 run through weaves Wv1 (S1 + M12) and Wv2 (S2 + M12), which share the mediator module M12.]

(b) Typical code for S1 and S2:

/* Solver code */
...
global boundary_cond;
global solution_struct;
extern mediator (...);

function solver (...) {
    ...
    do {
        ...                   // wait for new condition
        PDE_solve (...);
        ...                   // report new solution
        call mediator (...);
    } while (...)
    ...
}

(c) Code for M12 if it needs all solutions:

/* Mediator code */
...
extern solution_struct_1;
extern solution_struct_2;
extern boundary_cond_1;
extern boundary_cond_2;

function mediator (...) {
    if (solvers pending)
        return;
    else
        ...
        Relax_soln (...);
        ...                   // report new conditions
}

(d) Code for M12 if it uses solutions as and when needed and available: identical to (c), except that the guard is

    if (mediator active)
        return;
Figure 5.7: An alternate Weaved realization of the Figure 5.5 scenario. (a) Weaves Wv1 and
Wv2 compose M12 with S1 and S2 respectively. (b) Typical code for S1 and S2. (c) Code
for M12 if it needs all solutions. (d) Code for M12 if it uses solutions as and when needed
and available.
Special Use of String Continuations
This section presents a singular Weaves-based method to realize a special case of the Figure
5.5 scenario using the string continuation API. Assume that:
1. S1 and S2 are identical solvers.
2. M12 uses one solution at a time as needed and available.
3. M12 returns an identical boundary condition to S1 and S2.
To set up the tapestry users must follow the same steps as outlined in the description of the
‘Parallel Solvers and Shared Mediators’ method. However, in this case, M12 needs to define
only two external references, one to a solution structure and one to a boundary condition.
Figure 5.8(a) depicts the code for M12. If S1 calls the mediator function first, the Weaves framework transparently binds solution_struct in M12 to the definition in S1, based on the context of Wv1. When the mediator function needs solution information from S2, it invokes an explicit continuation into Wv2. The continuation causes a change in namespace such that solution_struct maps to the definition in S2. When the mediator function has computed a fresh boundary condition, it writes the new value to boundary_cond in S2, because the current context is Wv2. To write to S1, M12 issues an explicit call to end the continuation,
which restores the context to that of Wv1. The overall application follows the same iterative
semantics as in the other cases.
(a) The tapestry setup: strings 1 and 2 run through weaves Wv1 (S1 + M12) and Wv2 (S2 + M12). Mediator code (M12):

/* Mediator code */
...
extern solution_struct;
extern boundary_cond;

function mediator (...) {
    if (mediator active)
        return;
    else
        ...
        // use current solutions
        // for partial relaxations
        continue (other_weave);
        // complete relaxations
        ...
        // report new condition
        end_continuation ();
        // report new condition
}

(b) An imaginary partial tapestry: a single string runs through S1, which belongs to weaves Wv1 (S1 + M1) and Wv2 (S1 + M2). Solver code (S1):

/* Solver code */
...
global boundary_cond;
global solution_struct;
extern mediator (...);

function solver (...) {
    ...
    do {
        ...                   // wait for new condition
        PDE_solve (...);
        ...                   // report new solution
        call mediator (...);
        continue (other_weave);
        call mediator (...);
        end_continuation ();
    } while (...)
    ...
}
Figure 5.8: Continuations help map a single module to different weaves. (a) The tapestry
setup. (b) An imaginary partial tapestry.
Figure 5.8(b) depicts a partial tapestry where a solver, S1, is a part of two weaves comprising
identical mediators, M1 and M2. The solver (S1) consists of a single string, declares a
single external reference to the mediator function and uses string continuations to invoke the
identical, but independent, definitions of the mediator function in M1 and M2.
Reusing Agent-based Codes
The Weaves framework supports unmodified reuse of the agent-based solver and mediator
codes. An agent-based approach assumes some form of an underlying message passing library,
such as MPI [ANLUC06], that implements dependable communication primitives. The solver
and mediator agents call the functions in the library to safely exchange boundary conditions
and solution structures.
[Figure 5.9 diagram: strings 1, 2, and 3 run through weaves Wv1 (sa1), Wv2 (sa2), and Wv3 (ma12), all sharing the comm module, an MPI emulator for intra-process threads.]
Figure 5.9: Weaving unmodified agent-based codes. Solver and mediator modules are composed into different weaves, but share a single thread-based MPI emulator.
Assuming MPI for solver-mediator communication, Figure 5.9 illustrates a Weaves-based
method to set up a hypothetical tapestry that reuses the unmodified agent-based codes.
According to this method, users must compile the solver and mediator agent codes into
the relocatable objects, s_agent.o and m_agent.o, respectively. Additionally, they must program a communication component that emulates dependable message transfers between
intra-process threads using user-level global variables and must compile it into an object
communicator.o. The functional interfaces of the communication component must be identical to those of the original communication library. Users must then load modules sa1 and
sa2 from s_agent.o, module ma12 from m_agent.o, module comm from communicator.o, and
compose weaves as follows:
❼ weave wv1: sa1 and comm.
❼ weave wv2: sa2 and comm.
❼ weave wv3: ma12 and comm.
Finally, they must initialize three strings, one each at the start functions of sa1, sa2, and
ma12. The solver and mediator agents run as independent virtual machine abstractions
within a single process, unaware of Weaves.
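The heart of this method is the communication component. The following is a greatly simplified sketch of what such a component could look like: a single blocking mailbox per destination, held in user-level global state shared by all strings through the comm module. The names emu_send and emu_recv are illustrative; a real component would mirror the original library's interface (e.g., MPI_Send and MPI_Recv) and would support tags, communicators, and non-blocking operations:

#include <pthread.h>
#include <string.h>

#define MAX_RANKS 64
#define MAX_MSG   4096

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;    /* signaled when a message arrives */
    pthread_cond_t  drained;  /* signaled when the slot empties  */
    int             full;
    size_t          len;
    char            buf[MAX_MSG];
} mailbox_t;

static mailbox_t box[MAX_RANKS];  /* user-level global state */

void comm_init(void)
{
    for (int i = 0; i < MAX_RANKS; i++) {
        pthread_mutex_init(&box[i].lock, NULL);
        pthread_cond_init(&box[i].ready, NULL);
        pthread_cond_init(&box[i].drained, NULL);
        box[i].full = 0;
    }
}

/* Blocking send: copy the message into the destination's mailbox. */
void emu_send(int dest, const void *data, size_t len)
{
    mailbox_t *m = &box[dest];
    pthread_mutex_lock(&m->lock);
    while (m->full)
        pthread_cond_wait(&m->drained, &m->lock);
    memcpy(m->buf, data, len);
    m->len = len;
    m->full = 1;
    pthread_cond_signal(&m->ready);
    pthread_mutex_unlock(&m->lock);
}

/* Blocking receive: wait for a message addressed to 'self'. */
size_t emu_recv(int self, void *data)
{
    mailbox_t *m = &box[self];
    pthread_mutex_lock(&m->lock);
    while (!m->full)
        pthread_cond_wait(&m->ready, &m->lock);
    size_t len = m->len;
    memcpy(data, m->buf, len);
    m->full = 0;
    pthread_cond_signal(&m->drained);
    pthread_mutex_unlock(&m->lock);
    return len;
}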
Summary
All the different Weaved realizations of the hypothetical collaborating PDE solver application
run identical, but independent, threads of the solver programs. They transparently reuse
legacy procedural codes for solvers and mediators even though they contain global variables.
The Weaves framework helps separate the global variables of encapsulated solver threads
without any code modifications. In effect, the Weaved realizations address the scientific
computing version of the first research challenge, the need to transparently separate global
variables used by identical, but independent, threads of a legacy parallel program.
Furthermore, different Weaved realizations embody different granularities of selective sharing of global variables among parallel solver and mediator threads. While the agent-based
realization embodies selective sharing of an entire communication module, the others demonstrate selective sharing at the fine granularity of individual solution structures and boundary
condition variables. All the Weaved realizations reuse unmodified legacy scientific routines (PDE_solve and Relax_soln) from standard PSE toolboxes. Therefore, the Weaved realizations address the scientific computing version of the second research challenge, the need to
transparently realize multi-granular selective sharing of global variables among the threads
of a legacy parallel program.
However, the different Weaves-based methods exhibit a trade-off between extent of overall
transparency and the overhead of inter-thread collaboration. The most transparent method
reuses the unmodified agent-based traditional procedural codes, but incurs some extra overhead due to the copying of data during exchanges through the communication module. On
the other hand, the other methods directly share solution structures and boundary variables
for lower overhead, but require globally stored solution structures and boundary condition
variables to allow direct external access.
As an aside, in all Weaved methods, the runtime image of a tapestry closely matches the
graphical representation of the hypothetical collaborating PDE solvers application. Furthermore, the bootstrap code in any method requires minimal information about the internal
codes of solvers and mediators, which shows that the Weaves framework is fairly domain
neutral.
5.2.2 Experimental Corroboration
Figure 5.10 shows a special Weaved instance of parallel PDE solvers designed to check
whether Weaved scientific applications can transparently run multiple identical, but independent, threads of solvers that contain global variables.
The experimental setup followed the basic Weaves-based semantics depicted in Figure 5.9.
We assumed that all solvers were identical, but independent. We obtained solver codes from
The Numerical Algorithm Group (NAG) [NAG06b] website under a trial license. Specifically,
we used the Fortran Library Mark 21 for Intel Linux, compiled using GNU’s Fortran Compiler
g77. We chose the example solver program for Finite Difference Equations (Elliptic PDE,
Multigrid Technique) provided along with the library to represent generic solvers. To model
realistic long-running problems and to aid timing measurements, we wrapped the main entry function in a loop of 10000 iterations. To ensure non-threadsafe behavior, we added FORTRAN's
DATA [NAG06a] directive for dummy initialization of a few elements of the solver’s input
PDE structure. (From the programming perspective, FORTRAN variables are, largely, locally scoped. However, when associated with certain directives, some of these variables are allocated globally as a part of a program's binary image. From a threading perspective, therefore, these variables are invariably shared across all threads, thereby potentially hampering thread safety.)
Our target solver program had three relevant characteristics:
1. It comprised validated, commercial, procedural, process-based, non-threadsafe and
non-reentrant FORTRAN code.
2. It contained global variables.
3. Sources for its core code were NOT available. Only compiled object files were available
as a library (.a archive).
To test that the solver program was non-threadsafe, we ran multiple threads through a single,
otherwise correct, program. The outputs of the resultant threaded program were wrong,
thereby indicating that the threads interfered with each other and disrupted the common
global variables. This anomaly was transparently avoided by using the Weaves framework.
We compiled and linked the example solver program (d03edfe.f) with the required code
objects (d03edf*.o) from the archive to generate a relocatable object (d03edfe.o). We loaded
multiple modules of d03edfe.o, composed each module into an independent weave and started
strings through each weave at the main entry function of the solver. All the strings ran to
completion, producing results that correctly matched with normal process-based execution.
The correctness of the results reasonably corroborates the correctness of the Weaved setup
and shows that the Weaves framework transparently facilitates execution of legacy parallel
programs (non-threadsafe NAG solvers containing global variables) over lightweight intra-process threads, the overall goal of this research.

[Figure 5.10 diagram: strings 1 through n run through independent weaves Wv1 (S1), Wv2 (S2), ..., Wvn (Sn), each solver module loaded from d03edfe.o, within a single OS process.]
Figure 5.10: Weaved setup of the experimental PDE solver (d03edfe) scenario.
5.2.3 Contextual Advances
To confirm the utility of the Weaves framework, we ran two experiments with large parallel
scientific applications. Both experiments used the Weaves framework to run unmodified real-world scientific solvers, legacy procedural codes containing global variables, over lightweight
intra-process threads.
Scalability of Weaved Scientific Applications
To judge the scalability of Weaved realizations of legacy parallel scientific applications, we
instantiated an NxN grid of solver (d03edfe) weaves and strings in the manner depicted in
Figure 5.10. Each element of the grid consisted of an identical solver module loaded from
d03edfe.o and composed into an independent weave with a string started at the solver’s entry
function.
We then realized a process-based version of the same scenario. Here, a master process
spawned NxN processes using Linux's fork() call. Each child process called execvp
to switch to an independent copy of the solver program. We used the same solver codes in
both process-based and Weaves-based realizations.
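A minimal sketch of that process-based master follows; "./solver" is a placeholder path for the compiled solver program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1;
    int total = n * n;                        /* NxN grid of solvers */

    for (int i = 0; i < total; i++) {
        pid_t pid = fork();
        if (pid == 0) {                       /* child */
            char *args[] = { "./solver", NULL };
            execvp(args[0], args);            /* independent solver copy */
            perror("execvp");                 /* reached only on failure */
            _exit(1);
        } else if (pid < 0) {
            perror("fork");
            exit(1);
        }
    }
    for (int i = 0; i < total; i++)
        wait(NULL);                           /* reap all children */
    return 0;
}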
We varied N from 1 to 31 and noted the total time of execution, from the start of the first
solver to the end of the last, in every case. The outputs for both the Weaved and process-based realizations matched correctly up to N = 23. The Weaved realization correctly ran
up to nearly 1000 solvers with a perfectly linear increase in the total execution time. In
contrast, the process-based one scaled up to approximately 500 solvers only.
All runs were conducted on a 2 GHz 32-bit Athlon processor with 1 GB RAM running GNU/Linux.
All measurements were averaged over 10 runs. Figure 5.11 sketches performance data obtained from the experiment. The data show that a Weaved realization can support nearly
twice the number of unmodified non-threadsafe parallel solvers as a traditional process-based realization. These results indicate that the Weaves framework can help scientific applications exploit the scalability of threads without requiring modifications to the traditionally
process-based programs that contain global variables.
[Figure 5.11 plot: execution time (seconds) versus number of parallel solvers (0 to 1000) runnable on a single-processor Athlon32 over Linux, comparing Weaves against processes; the process-based realization hits its limit near 500 solvers.]
Figure 5.11: Scalability of Weaved scientific applications: Experimental results indicate
that the Weaves framework can help applications exploit the scalability of threads without
requiring modifications to traditional procedural process-based programs. The framework
effects zero-overhead encapsulation of solvers.
Scalable Collaboration among Weaved Parallel Solvers
The previous experiment indicated the scalability of a Weaved scientific application in terms
of the number of unmodified solver programs that can be accommodated on a single machine. Nevertheless, realistic scientific problems typically require collaboration amongst
solvers through inter-solver communication.
To judge the Weaves framework’s ability to transparently facilitate low-overhead collaboration among parallel solver threads through user-level sharing of global variables, we realized
a Weaved version of Sweep3D [KBA92], an application that uses solvers for 3-Dimensional
Discrete Ordinate Neutron Transport, on an 8-processor x86_64 SMP and compared the results with traditional MPI-based realizations. Figure 5.12 depicts the Weaves-based setup.
We developed a simple MPI emulator for user-level inter-thread communication through
shared global variables and compiled the emulator and the Sweep3D solver code into relocatable objects mpi.o and sweep3d.o, respectively. To set up a Weaved instance of Sweep3D comprising nxm solvers (or an nxm split), we loaded nxm modules of sweep3d.o and one module
of mpi.o. We then composed nxm weaves, where each weave consisted of a unique sweep3d
module and the mpi module, and initialized nxm strings, one string at the entry function of
each Sweep3D module.
We reused unmodified legacy FORTRAN codes of Sweep3D for the Weaved realization.
Sweep3D codes contain FORTRAN’s COMMON directive for global allocation of certain
variables [NAG06a]; that is, they contain global variables and are non-threadsafe. Linux's
objdump tool was used to verify that Sweep3D’s compiled object contained a number of
globally stored variables.
All the Sweep3D strings ran to completion, producing results that correctly matched with
normal process-based execution. This observation reasonably corroborates the correctness
of the Weaved setup. It also shows that the Weaves framework facilitates user-level selective
sharing of global variables within the MPI module among constituent threads of legacy
Sweep3D solvers.
[Figure 5.12 diagram: strings 1 through n run through weaves Wv 1 (sw 1), Wv 2 (sw 2), ..., Wv n (sw n), all sharing the mpi module, an MPI emulator for intra-process threads.]
Figure 5.12: Weaved setup of the experiment using Sweep3D solvers.
We used a 150-cube input file with a 2x3 split (6 strings/processes) as a starting point and increased the split to 2x4, 4x6, 6x9, and so on up to 10x15 (150 strings/processes). Figure 5.13
shows that the performance of the Weaved realization matched that of LAM-based [PTL06]
and MPICH-based [ANLUC06] realizations as long as the number of strings/processes was
less than the number of processors. When the number of strings/processes was increased
beyond the number of processors (8), the Weaved realization performed much better. Therefore, the Weaved realization clearly demonstrates scalable low-overhead collaboration. Both
the MPI-based realizations, compiled and run with shared memory flags for lowest overhead,
crashed beyond 24 processes (4x6 split).
[Figure 5.13 plot: execution time (seconds) versus number of processes/strings (6, 8, 24, 54, 96, 150) for Sweep3D 150-cube nxm splits on an 8-way x86_64 SMP, comparing lam usysv, mpich shmem, and weaves.]
Figure 5.13: Comparison of performance results of Weaved Sweep3D against LAM-based
and MPICH-based Sweep3D. The performance of the Weaved realization matched that of
the LAM-based and MPICH-based realizations as long as the number of strings/processes
was less than the number of processors. When the number of strings/processes was increased
beyond the number of processors (8), the Weaved realization performed much better.
The LAM-based and MPICH-based realizations perform poorly beyond the 2x4 split (8
processes) because they rely on OS-level shared memory schemes for inter-process communication (IPC), which do not scale beyond the number of processors. Their reliance
on OS-level IPC is a direct consequence of following the process paradigm. The Weaves
framework works around this problem by facilitating low-overhead collaboration among the
Sweep3D solver threads through a user-level sharing of global variables. In effect, these results show that the low-overhead collaboration facilitated by the Weaves framework allows
for greater scales of achievable parallelism compared to traditional MPI-based collaboration
[KBA92, ANLUC06].
Summary
Results from the first experiment show that Weaved scientific applications can support nearly
twice the number of unmodified non-threadsafe parallel solvers as corresponding process-based applications. Results from the second experiment show that Weaved applications can
transparently exploit low-overhead collaboration among parallel solver threads through a
user-level sharing of global variables, which allows for greater scales of achievable parallelism
than MPI-based collaboration allows.
Together, these results indicate that the Weaves framework is instrumental in helping legacy
parallel scientific applications transparently exploit the scalability of threads for larger scales
of parallelism, the overall goal of this research. Ultimately, larger scales of parallelism enable
more accurate modeling of multi-physics phenomena. The results, therefore, substantiate
the efficacy of the Weaves framework. (Data and codes for both experiments can be obtained from http://blandings.cs.vt.edu/joy.)
5.2.4 Configuring Weaves for HPC
This section provides an examination of various configurations of the Weaves framework for
high-performance scientific computing (HPC) on shared-memory multi-processor machines
(SMPs). Figure 5.14(a) shows one configuration invoking Weaves as part of a larger problem solving environment (PSE). Here, the PSE provides the tapestry specifications and the
module codes and uses the framework to compose and execute them. Figure 5.14(b) shows
another similar scenario with the Weaves framework operating within a larger performance
modeling framework [AS00], such as POEMS [ABB+00, BBD00]. Here, POEMS supplies
the modeling capability for performance characterization, while the Weaves framework takes
care of the low-level composition and execution of unmodified scientific codes. Thus, systems
such as POEMS can utilize the Weaves framework as a scalable and efficient substrate for
scientific modeling.
Figure 5.14(c) shows another configuration that POEMS exemplifies, the simulation of MPI-based real-world codes using MPI-SIM [PB98]. MPI-SIM uses a multi-threaded architecture to simulate MPI. However, it assumes that the linked code base is thread-safe. This
assumption might not hold for distributed programs, especially those that use global variables. However, using the Weaves framework, the real-world codes can be run as independent
virtual threads, thus presenting a thread-safe view of the application to MPI-SIM. This configuration of the framework enables a wider variety of real-world parallel codes to be used
in simulations with MPI-SIM. We used a similar configuration for experiments with PDE
solvers and Sweep3D.
[Figure 5.14 diagrams: (a) Weaves beneath a PSE; (b) Weaves beneath a performance modeling framework; (c) Weaved scientific code running over MPI-SIM; (d) Weaved scientific code over a Weaved LAM-MPI implementation. The legend marks which boxes are Weaved code.]
Figure 5.14: Relationship between the Weaves framework and (a) a problem solving environment and (b) a performance modeling framework. Advanced configurations of the Weaves
framework for scientific computing: (c) Weaved scientific codes running over MPI-SIM and
(d) Weaved scientific codes over Weaved MPI implementations.
Figure 5.14(d) depicts the most radical configuration of the Weaves framework for HPC. It
loads modules of a real-world MPI library (e.g., the LAM MPI [PTL06] implementation) and links them to application modules. This configuration has the advantage of emulating real-world execution of not just the application, but also its original communication interface, a truly
native virtual machine abstraction for MPI codes. It facilitates the study of the effects of
different MPI implementations on the performance of scientific codes. In this configuration,
the only simulated entity is the communication channel, not the operation of MPI. The Open
Network Emulator (ONE) has used a similar configuration of the framework for large scale
network emulations.
5.3 Summary
This chapter has provided a demonstration of the utility of the Weaves framework in software
areas that consist of legacy parallel programs. It has presented various case studies from
the areas of network emulation and parallel scientific computing. Certain Weaved instances
of network emulation and parallel scientific programs have further illustrated the design,
implementation, and development of Weaved applications. One of the case studies has
explained the use of string continuations. These case studies, purposely chosen from two
radically different domains of parallel software systems, emphasize the broad impact of our
work.
The experiments described in this chapter have corroborated that Weaved instances of network emulation and parallel scientific applications can transparently run multiple identical,
but independent, threads of legacy procedural programs that contain global variables. The
experiments have also indicated that:
1. Weaved instances of network emulation can selectively share unmodified real-world IP
stacks among independent telnet threads.
2. Weaved instances of scientific applications, such as Sweep3D, can selectively share
global variables among independent solver threads.
Together, these experiments show that the Weaves framework facilitates user-level selective
sharing of global variables among constituent parallel threads of legacy parallel programs.
Finally, we have explained the results of experiments with large real-world network emulations and parallel scientific applications. The results of experiments with network emulation
show that Weaved realizations of real-world network scenarios transparently exploit the scalability of threads for larger scales of parallelism. A Weaved emulation can emulate a thousand
virtual nodes over a single physical machine. On increasing the number of machines, the
emulation speeds up super-linearly; i.e., on increasing the number of machines by a factor
of N, the time taken to run a certain emulation drops by a factor greater than N. In effect,
such super-linear speedup exemplifies the scalability of Weaved network emulations.
The results of the experiments with parallel scientific applications show that Weaved realizations can support nearly twice the number of unmodified and non-threadsafe parallel solvers
as the corresponding process-based realizations. The results also show that Weaved appli-
cations can transparently exploit low-overhead collaboration among parallel solver threads
through user-level sharing of global variables, which allows for significantly greater scales of
achievable parallelism than MPI-based collaboration does.
These results together indicate that the Weaves framework is instrumental in helping “unmodified” legacy parallel programs exploit the scalability of intra-process threads for larger
scales of parallelism, the overall goal of this research. Ultimately, larger scales of parallelism
enable more accurate testing of network protocols and more accurate modeling of multi-physics phenomena. The results, therefore, substantiate the effectiveness of the Weaves
framework.
Chapter 6
Concluding Remarks
This thesis has proposed the Weaves runtime framework for parallel programs. Weaved
applications can load encapsulated runtime modules from relocatable object files, as well as
multiple independent modules from a single object file without entailing any modifications to
the concerned object file. By allowing direct runtime control over the resolution of individual
references in a module’s code, the Weaves framework empowers programs with the ability to
manipulate their composition at fine granularities. Through modules, the Weaves framework
supports the transparent encapsulation and multiple instantiation of legacy procedural codes
in intra-process environments.
Just as the compile-time linking of object files creates an executable program, the runtime
composition of a set of modules creates a weave. A weave is, therefore, an intra-process
subprogram that can support a flow of execution. The framework allows a single module to
be shared among multiple weaves, a capability that can be leveraged to realize arbitrary graph-like sharing of modules among the different weaves. Direct control over the resolution of individual references can extend this selective sharing among weaves to finer granularities.
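To make this model concrete, a composition of this kind might be programmed roughly as follows. This is only an illustrative sketch: the identifiers weaves_load_module, weaves_compose, and weaves_start_string are hypothetical placeholders, not the framework's actual API.

    /* Hypothetical API names; shown only to illustrate the model. */
    void *solver1 = weaves_load_module("solver.o");   /* first copy of the object  */
    void *solver2 = weaves_load_module("solver.o");   /* independent second copy   */
    void *comm    = weaves_load_module("comm.o");     /* module to be shared       */

    /* Each weave holds a private solver module but shares comm, so the
       solvers' globals stay separate while comm's globals are shared. */
    void *w1 = weaves_compose(2, solver1, comm);
    void *w2 = weaves_compose(2, solver2, comm);

    weaves_start_string(w1, "solver_main");           /* one string per weave */
    weaves_start_string(w2, "solver_main");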
All components of Weaved applications, including the fundamental module, are intra-process
runtime units. Also, Weaved applications need not apply any code modifications to modules, either at the source level or at the level of relocatable objects (native code patches).
They can use the framework’s facilities transparently.
Legacy procedural codes are easily available as relocatable objects. Hence, the Weaved
versions of legacy parallel programs can transparently (without code modifications) load
multiple encapsulated modules of legacy procedural codes into intra-process runtime environments. When they run strings through weaves composed from distinct sets of identical
modules, the strings do not share any global variables. This capability directly addresses the
first research challenge: the need to transparently separate global variables used by identical,
but independent, threads of a legacy parallel program.
Strings running through weaves experience all the elements of sharing reflected in the compositions of those weaves. Therefore, Weaved versions of legacy parallel programs can transparently realize arbitrary multi-granular selective sharing of functions and variables among
strings. Since strings are essentially intra-process threads, this selective sharing among
strings addresses the second research challenge: the need to transparently realize multi-granular selective sharing of global variables among the threads of a legacy parallel program.
6.1 Salient Contributions
The main contribution of the Weaves framework lies in its ability to facilitate execution
of “unmodified” legacy parallel programs over lightweight intra-process threads without inadvertent sharing of global variables among constituent threads. This contribution allows
legacy parallel programs to exploit the scalability of threads without any modifications.
A second contribution of the Weaves framework lies in its ability to transparently facilitate
low-overhead user-level sharing of global variables among parallel threads of legacy parallel programs. This contribution allows exploitation of low-overhead collaboration between
constituent parallel threads of a legacy parallel program.
Together, these contributions facilitate larger scales of parallelism for Weaved legacy parallel
programs through better utilization of resources on a single multi-core machine or SMP
workstation. Ultimately, larger scales of parallelism can facilitate more accurate modeling
of large-scale parallel phenomena such as networks and multi-physics problems.
The Weaves framework institutes lateral advances in software areas that frequently encounter
legacy parallel programs. This research has demonstrated these advances through experiments with large real-world network emulations and parallel scientific applications. These
experiments in two radically different domains of parallel systems emphasize the broad impact of the Weaves framework.
The results from the experiments with network emulation show that Weaved realizations
transparently exploit overall resources for larger scales of parallelism. A Weaved emulation
can emulate a thousand virtual nodes over a single physical machine. On increasing the
number of machines, the emulation speeds up super-linearly; i.e., on increasing the number of machines by a factor of N, the time taken to run a given emulation drops by a factor
greater than N. In effect, such super-linear speedup exemplifies the scalability of Weaved
network emulations.
The results of the experiments with parallel scientific applications show that Weaved realizations can support nearly twice the number of unmodified and non-threadsafe parallel solvers
as the corresponding process-based realizations can. The results also show that Weaved applications can transparently exploit low-overhead collaboration among parallel solver threads
through user-level sharing of global variables, which allows for significantly greater scales of
achievable parallelism than MPI-based collaboration allows.
These results together demonstrate that the Weaves framework is instrumental in helping
legacy parallel programs transparently exploit overall resources for larger scales of parallelism. Ultimately, larger scales of parallelism enable more accurate testing of network protocols and more accurate modeling of multi-physics phenomena. These lateral advances,
therefore, substantiate the efficacy of the Weaves framework.
6.2 Other Aspects
At the basic level, the Weaves framework offers its services as a library that supports simple
APIs. Users can explicitly program the composition of Weaved applications using these
APIs. The framework also provides a meta-language for specifying the composition of a
Weaved application in a configuration file and a script that automatically creates and runs
the application from the meta-description. Essentially, configuration files are very similar to
Makefiles. Consequently, from the usability perspective, composing Weaved applications is
comparable to writing Makefiles.
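For illustration, a configuration file for the composition sketched earlier might read as follows. The syntax here is hypothetical and is intended only to convey the Makefile-like flavor of the meta-language:

    # Hypothetical meta-description of a Weaved application
    module solver1: solver.o        # first copy of the object file
    module solver2: solver.o        # independent second copy
    module comm:    comm.o          # module shared by both weaves

    weave w1: solver1 comm
    weave w2: solver2 comm

    string s1: w1 solver_main       # one string per weave
    string s2: w2 solver_main

Like a Makefile, such a description names the products (modules, weaves, and strings) and the ingredients they are composed from; the accompanying script plays the role of make.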
Another noteworthy aspect of the Weaves framework is its current implementation. The
Weaves framework is currently implemented on GNU/Linux over three architectures: x86, x86_64, and ia64. Weaves' runtime loader and linker, called Load and Let Link (LLL), implements the core aspects of the framework: loading modules, composing weaves, and direct control over the resolution of individual references. The implementation is fairly architecture-neutral. A port to the PowerPC architecture is currently underway. Porting to other operating systems, such as Windows and OS X, poses some problems. However, since most regular operating systems support object-file-based decoupling of applications, the Weaves framework is, in theory, portable across a wide range of operating systems and
architectures.
6.3 Summary
Parallel computing systems are becoming pervasive. An important effect of the increased
adoption of parallel systems is that many contemporary applications are being explicitly
programmed for large-scale parallelism. These applications benefit from mechanisms that
facilitate better exploitation of an individual machine or a single shared-memory multiprocessor (SMP) for larger scales of parallelism.
This research has proposed the Weaves runtime framework for parallel programs. The Weaves
framework exploits lightweight intra-process threads and user-level low-overhead collaboration among parallel threads to facilitate larger scales of parallelism on an individual computer
node or a single SMP machine. The framework is particularly beneficial to traditionally
process-based parallel applications that use legacy procedural codes, because the framework does not entail any modifications to these codes, which have been validated and verified over decades of research and usage, even if they contain global variables and are not threadsafe. To the best of our knowledge, the execution of
unmodified legacy parallel programs over lightweight intra-process threads has never been
attempted.
Ultimately, the Weaves framework helps legacy parallel programs transparently exploit overall resources for larger scales of parallelism. Larger scales of parallelism enable more accurate
software modeling of large-scale parallel phenomena, such as real-world networks and multi-physics natural occurrences. This research has experimentally demonstrated lateral advances
instituted by the Weaves framework in such diverse software areas as network emulation and
parallel scientific computing that comprise large-scale legacy parallel applications. These
lateral advances substantiate the efficacy of the current research.
Chapter 7
Ongoing Work
Ongoing work on the Weaves framework mainly focuses on reconfigurability or adaptivity
of unmodified applications. The framework's facilities for flexible runtime loading and fine-grain dynamic linking of native code objects allow for unforeseen and arbitrary expansion,
contraction and substitution in unmodified application code-bases at runtime. Such dynamic
code mutations are useful for (a) adaptive applications that need to rewire themselves in
response to dynamic conditions, (b) code swapping for mission critical systems and (c)
automatic code overlaying for programs in constrained environments. This chapter showcases
the adaptivity of Weaved applications through some case studies. It also mentions some other
aspects of ongoing work.
7.1 Adaptivity of Weaved Applications
The extra runtime flexibility offered by a binary loading and linking framework is available to the many high-level languages, frameworks, and models layered above it. The Weaves framework's
runtime support for loading modules at arbitrary granularities is useful to programs seeking
to expand their code base in a flexible manner. The ability of Weaved applications to load
referentially incomplete modules and control their composition at the fine granularity of
individual references is useful to programs trying to reduce resource consumption through
automatic code overlaying.
Current software development processes support a fair degree of modularity and interface
standardization. For instance, procedural programmers, who use languages such as C and
FORTRAN, often compartmentalize programs into separate source files and compile them
into corresponding object files (.o). Once they have fixed the cross-linking interfaces of
these object files, the codes contained therein can be changed. The problem, as Figure
7.1(a) depicts, is that this compartmentalization is restricted to the pre-runtime domain.
To create an executable, programmers must integrate the objects into one executable file,
which obfuscates the reified structure of the application at runtime.
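For example, with the standard GNU toolchain (the file names here are illustrative), the compartmentalization survives compilation but not the final link:

    gcc -c main.c   -o main.o      # each source keeps its own object file...
    gcc -c solver.c -o solver.o
    gcc main.o solver.o -o app     # ...but the link fuses them into one runtime image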
In contrast, the Weaves framework allows applications to load runtime modules from object files at arbitrary granularities without affecting the compartmentalization. Moreover, as
seen in Figure 7.1(b), the framework maintains the decoupling among modules even after they are composed into a runnable application weave. Applications can use these facilities to
Figure 7.1: (a) Normal loading and linking: source files (.c, .f) compile to object files (.o), which the linker fuses into a single executable. (b) Weaved application linking: a bootstrap source compiles to a bootstrap executable, while the remaining object files become Weaves modules composed into a weave traversed by a string, with the process thread serving as monitor.
exploit the source file based decoupling at runtime. This runtime decoupling, in turn, aids
interface-oriented code swapping or on-the-fly component replacement.
In effect, the Weaves framework can build a minimal functional program from a set of modules, while retaining the ability for future modifications. It facilitates runtime decoupling,
runtime composition, and dynamic adaptation of applications. Using direct control over the
resolution of individual references, Weaved applications can tune dynamic composition to
their needs. They can explicitly specify different dynamic resolution handlers for different
references. Using the ‘monitor’ component and the APIs for global control over a tapestry,
Weaved applications can asynchronously modify their constitution at runtime.
Whereas adaptive programming techniques, models and frameworks help architect application reconfigurability at a higher level, an orthogonal low-level framework such as Weaves
complements them by providing the necessary runtime support. Most importantly, the
Weaves framework can facilitate unforeseen adaptivity for legacy procedural programs without any modifications to their codes. Some aspects of the framework can be more productive
when coupled with strategies for dynamic state and stack manipulation [Hef04].
7.2 Dynamic Code Expansion
This section briefly exemplifies the Weaves framework’s ability to support arbitrary expansion in application code bases through a hypothetical case of network emulation. Weaved
network emulations can model dynamic characteristics of real-world networks. For instance,
to emulate a new machine joining a network, a Weaved emulation simply needs to load a new
IP stack module. Again, as Figure 7.2 shows, to emulate a new FTP (File Transfer Protocol) application joining an existing machine, the emulation needs to load an FTP module, compose it with the concerned
IP stack and start a string at the new FTP’s start function. In Figure 7.2, part A depicts
an emulation of two network hosts. Each host initially consists of a telnet (T) running over
an IP stack (IP). The figure shows the weaves (W) and the strings (S). Part B shows the
configuration after an FTP (F) joins each host.
Figure 7.2: Modeling network dynamics using the Weaves framework. (Part A: two hosts, each a string (S1, S2) running a telnet (T1, T2) over an IP stack (IP1, IP2) within a weave (W1, W2). Part B: the same hosts after an FTP (F1, F2) joins each, adding weaves W3 and W4 with strings S3 and S4 over the shared IP stacks.)
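In terms of the hypothetical API names used in the earlier sketch, the expansion from part A to part B of Figure 7.2 amounts to a few calls per host:

    /* Hypothetical API names; illustrative only.
       ip1: handle to host 1's already loaded IP stack module. */
    void *ftp1 = weaves_load_module("ftp.o");     /* new FTP code            */
    void *w3   = weaves_compose(2, ftp1, ip1);    /* reuse host 1's IP stack */
    weaves_start_string(w3, "ftp_main");          /* string S3 in Figure 7.2 */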
We use the term ‘spatial adaptivity’ for dynamic inclusion of program elements, because it
institutes an unforeseen expansion in an application’s code space. Weaved applications can
dynamically compose or recompose old and new modules into different or additional weaves.
The Weaves framework does not entail any modifications to network application codes to facilitate spatial adaptivity.
7.3 Dynamic Code Swapping
Mission critical systems often consist of performance driven software programs that need to
run continuously over long periods of time. Such long-running programs cannot easily be upgraded without incurring costly downtime. Moreover, many high-performance programs use optimistic algorithms. In simple terms, optimistic algorithms advocate calling
a certain function among many alternatives based on a ‘best guess’ according to available
information. However, an optimistic selection can be erroneous, thereby requiring runtime
swapping out of the erroneous selection in favor of a better alternative function. The Weaves framework's support for application-controlled dynamic linking of programs at multiple granularities is useful for dynamic code swapping.
To exemplify the Weaves framework’s ability to support dynamic code swapping, we ran
a simple experiment where a program component was asynchronously replaced by a better
implementation. This experiment consisted of an application to sort integers. The application’s main routine continually generated a set of random numbers and then called the sort
routine to sort the set through a standardized interface:
void sort (int *array, int size);
As Figure 7.3 shows, we programmed the sources for the main and the sort functions in
C in two files, main.c and sort.c, respectively, with an external reference (sort) from the
former to the latter, such that an executable sorter was generated as:
gcc -o sorter main.c sort.c
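Reconstructed from Figure 7.3 (the array size and the exact generation loop are illustrative), the two sources were essentially:

    /* main.c */
    #include <stdlib.h>
    extern void sort(int *array, int size);

    int main(void)
    {
        int array[1024];
        int i;
        while (1) {
            for (i = 0; i < 1024; i++)
                array[i] = rand();      /* generate a random set      */
            sort(array, 1024);          /* external reference to sort */
        }
        return 0;
    }

    /* sort.c: default definition behind the standardized interface */
    void sort(int *array, int size)
    {
        int i, j, t;
        for (i = 0; i < size - 1; i++)          /* bubblesort */
            for (j = 0; j < size - 1 - i; j++)
                if (array[j] > array[j + 1]) {
                    t = array[j];
                    array[j] = array[j + 1];
                    array[j + 1] = t;
                }
    }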
For Weaves-based modules, we compiled the unmodified source files into objects main.o
and sort.o. We loaded one module of each object, composed them into one weave and
started a string at the entry function in main.o. The main process thread, the monitor, then
waited for user commands to perform requested modifications. Our default sort routine was
bubblesort.
Midway through execution, without stopping the application, we asked the monitor to load
an implementation of mergesort as another module and dynamically redirected the reference
to sort in main to the definition mergesort in the new module. The output showed a
seamless handover from bubblesort to mergesort. We then retraced these steps with
an implementation of quicksort. The output again showed a seamless handover from
mergesort to quicksort.
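Expressed with the hypothetical API names used earlier, the monitor's action reduces to loading the new module and retargeting one reference:

    /* Hypothetical API names; illustrative only.
       w: handle to the running weave containing main. */
    void *m = weaves_load_module("mergesort.o");   /* bring in the new code        */
    weaves_redirect(w, "sort", m, "mergesort");    /* point main's sort call at it */

The string executing main's loop never stops; its next call through the sort reference simply lands in mergesort.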
This experiment shows that if a better code component becomes available, the Weaves
framework can help an application asynchronously switch to it at runtime. This facility of the
framework can help performance driven applications evolve over time and dynamically adjust
to unforeseen performance requirements. For mission critical systems, the Weaves framework
can facilitate runtime upgrading as and when better code components become available. The
experiment also shows that the Weaves framework does not entail any modifications to the
original program’s sources to facilitate code swapping.
Figure 7.3: Dynamic code swapping using the Weaves framework. (The figure shows main, whose loop repeatedly calls sort, linked first to a bubblesort definition of sort and then, after redirection, to a mergesort definition loaded as a separate module.)
7.4 Dynamic Code Overlaying
With continuing trends in processor miniaturization, computing power is becoming more
commonplace. Embedded processors in handheld devices, cellular telephony and consumer
appliances account for the majority of the processor market. These devices have relatively
small memory footprints and experience dynamic resource constraints. In contrast, application software for these devices (user interface libraries, web browsers, email clients, and so forth) continues to increase in complexity. Using the Weaves framework's support for dynamic loading and controlled composition of codes at various granularities, large programs,
such as web browsers, can automatically prune their codes at runtime in response to dynamic
resource constraints.
To showcase the ability of the Weaves framework to support dynamic overlaying of program codes, we describe experiences with a minimal web browser called Dillo [Vik06], freely available under the GNU General Public License. We chose Dillo for three reasons:
1. Its source-code is freely available.
2. Its memory-efficient minimalism is apt for mobile and limited-resource devices.
3. Its source is in traditional procedural C.
Even though programmed in C, the code for Dillo is fairly modularized with different components programmed in different source files. For simplicity, we arbitrarily split the application
into four modules (1, 2, 3, and 4) while maintaining, as far as possible, the linking sequence
decreed by the application’s default building mechanism. Of these, module 2 consisted of
implementations for find text (searching for text strings) and select link (‘clicking’ on a
hyperlink) functionalities.
Figure 7.4: Dynamic code pruning in memory-constrained Weaved applications. (The figure shows modules 1, 3, and 4 persisting while module 2, which provides the find and select functionalities, is first unloaded and later reinstated as separate find and select modules.)
As Figure 7.4 shows, we loaded and linked the modules and started a string at the main
function of Dillo in module 3. The process thread was then used as the monitor. To exemplify
that the Weaved version of Dillo could automatically prune its code based upon unforeseen
memory constraints, we asked the monitor to asynchronously unload module 2 at runtime,
thereby reducing memory consumption at the expense of some functionality, find text and
select link. The rest of the application was left untouched and the unresolved references,
resulting from the unloading of module 2, were redirected to a single minimal function that
returned 0 (or NULL). As expected, the browser continued to run correctly and smoothly in
every respect, except that the find text and select link functionalities were not available.
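The minimal fallback definition mentioned above can be as small as the following sketch (the function name is ours):

    /* Catch-all target for references left dangling by the unloading of
       module 2; returning 0 (which serves as NULL for pointer-returning
       callers on this platform) keeps the rest of the browser on a
       benign path. */
    long null_stub(void)
    {
        return 0;
    }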
We then split the object file corresponding to module 2 into separate objects for find text
and select link while the browser was still running, dynamically loaded the two new modules (find and select), and composed them with modules 1, 3 and 4. The browser could then
be contracted, expanded, or upgraded at a finer granularity, that is, one could unload and
upgrade find text or select link individually.
Figure 7.5: Automatic adjustment of Weaved applications to available software infrastructure. (The figure contrasts the Weaved browser, composed of modules 1–4 over dependency libraries such as libpng, with the normal browser built as a single application binary; when a dependency library is absent, only the Weaved browser keeps running.)
Figure 7.5 depicts another experience that demonstrates the utility of the Weaves framework
for automatic overlaying of programs. We built Dillo into its standard executable program as
well as into a Weaves-based version. The original and the Weaved versions of the browser ran
identically with identical software infrastructure requirements, such as dependency libraries.
We then ported both of them to a different, but architecture-compatible, machine that
lacked a dependency library for portable network graphics, libpng [Roe06]. The original
executable failed to run due to the lack of a dependency library. The Weaved version,
however, promptly started up with all facilities except the portable network graphics. The
framework automatically redirected all unresolved references to libpng to a minimal NULL
returning function, thereby preventing adverse effects on the rest of the application.
Neither of the experiences described here demanded any modification to the Dillo browser’s
codes. These experiences show that the Weaves framework can be useful to applications that
run under dynamic and unforeseen resource constraints on miniature devices. The framework
can automatically contract an unmodified application according to available memory as
well as available software infrastructure. It can also dynamically reinstate pruned codes
when resource conditions are favorable. Finally, it allows runtime re-componentization of
unmodified applications at multiple granularities.
7.5 Other Aspects of Ongoing Work
This section presents various other aspects of the Weaves framework that are currently being
researched and experimented with. Brief discussions of these aspects follow:
❼ A comprehensive analysis of Weaves' loading and linking toolkit (Load and Let Link, or LLL) is being pursued. The focus is on the performance overhead of LLL and on testing LLL against standard benchmarks. Specific test applications consist of network filtering programs
and runtime program tracing utilities.
❼ Many programs comprise performance intensive core routines. These routines are
typically available as commodity libraries from various vendors. Currently, applications
link against one such library only. However, these routines are often tailored and
optimized differently by different vendors. The result is that each vendor’s product is
selectively better than others. This aspect of ongoing work seeks to use the Weaves
framework to selectively link to more than one of these similar libraries and use the
best routines from each, thereby enhancing the overall performance of an application.
❼ Interposition, or ‘call hijacking’ for code injection, has several advantages. For instance,
interposition facilitates low-overhead tracing of operational applications. Currently,
interposition is mainly restricted to calls from executables to libraries. By extending
compositionality to the finer granularity of relocatable objects, the Weaves framework
offers a way to utilize interposition for fine-grain tracing. This aspect of ongoing
work intends to demonstrate the usefulness of the framework for fine-grain interposition
on some standard benchmarks and real-world programs.
❼ Temperature and power consumption control are gaining importance in the backdrop
of increased use of powerful supercomputers and clusters. Thermal control is currently
exercised at the hardware level through various throttling mechanisms. This aspect of
ongoing work attempts to utilize the Weaves framework to asynchronously slow down or
speed up applications depending on the temperature level detected by on-board sensors.
Essentially, this work intends to use controlled dynamic composition of unmodified
programs to automatically insert halting code at function boundaries. Once the slowing
down of applications has allowed the cooling system enough time to dissipate the excess
heat, the halting code can be automatically removed and application execution can be
resumed at full speed.
❼ Finally, ongoing work also looks at advanced Weaved network emulations and Weaves-
based modeling of complex real-world scientific problems.
7.6 Summary
This chapter has presented aspects of ongoing work on the Weaves framework, which mainly
focuses on reconfigurability or adaptivity of unmodified applications. The framework’s facilities for flexible runtime loading and fine-grain dynamic linking of native code components
allow for unforeseen and arbitrary expansion, contraction, and substitution in unmodified
applications at runtime. Ongoing work endeavors to use these dynamic code mutations for
(a) adaptive applications that need to rewire themselves in response to dynamic conditions,
(b) code swapping for mission critical systems and (c) automatic code overlaying for programs in constrained environments. This chapter has showcased the adaptivity of Weaved
applications through some case studies.
Other aspects of ongoing work include (1) testing and analyses of the Weaves framework’s
runtime loader and linker (LLL), (2) using the framework to selectively link a program to
multiple commodity libraries for performance gains, (3) using control over linking of Weaved
programs for low-overhead tracing through fine-grain interposition, and (4) exploiting the
framework’s support for controlled program composition to throttle applications according
to temperature and power constraints. Finally, ongoing work also looks at advanced Weaved
network emulations and Weaves-based modeling of complex real-world scientific problems.
Bibliography
[ABB+ 00]
V. S. Adve, R. Bagrodia, J. C. Browne, E. Deelman, A. Dube, E. N. Houstis,
J. R. Rice, R. Sakellariou, D. J. Sundaram-Stukel, P. J. Teller, and M. K. Vernon.
POEMS: End-to-end performance design of large parallel adaptive computational systems. IEEE Transactions on Software Engineering, 26(11):1027–1048,
2000.
[ANLUC06] Argonne National Laboratory and University of Chicago. MPICH2, 2006. http://www-unix.mcs.anl.gov/mpi/mpich/, Last accessed on July 17 2006.
[AP99]
M. Allman and V. Paxson. On estimating end-to-end network path properties. In Proceedings of the ACM SIGCOMM ’99 Conference on Applications,
Technologies, Architectures, and Protocols for Computer Communication, pages
263–274. ACM Press, 1999.
[AS94]
C. Alaettingoglu and A. U. Shankar. Design and implementation of MARS: A
routing testbed. Journal of Internetworking: Research and Experience, 5(1):17–
41, 1994.
[AS00]
V. S. Adve and R. Sakellariou. Application representations for multiparadigm
performance modeling of large-scale parallel scientific codes. International Journal of High Performance Computing Applications, 14(4):304–316, 2000.
[BB06]
B. Barney. POSIX threads programming, 2006. http://www.llnl.gov/, Last
accessed on July 17 2006.
[BBD00]
J. C. Browne, E. Berger, and A. Dube. Compositional development of performance models in POEMS. International Journal of High Performance Computing Applications, 14(4):283–291, 2000.
[BBH+ 98]
H. E. Bal, R. Bhoedjang, R. F. H. Hofman, C. J. H. Jacobs, K. Langendoen,
and T. Rühl. Performance evaluation of the Orca shared-object system. ACM
Transactions on Computer Systems, 16(1):1–40, 1998.
[BKdSH01] M. Bhandarkar, L. V. Kale, E. de Sturler, and J. Hoeflinger. Object-based
adaptive load balancing for MPI programs. In Proceedings of the International
Conference on Computational Science, LNCS 2074, pages 108–117. Springer
Verlag, 2001.
[BP96]
L. S. Brakmo and L. L. Peterson. Experiences with network simulation. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling
of Computer Systems, pages 80–90. ACM Press, 1996.
[BVB06]
C. Bergstrom, S. Varadarajan, and G. Back. The distributed Open Network
Emulator: Using relativistic time for distributed scalable simulation. In Proceedings of the 20th IEEE/ACM/SCS Workshop on Principles of Advanced and
Distributed Simulation (PADS 2006), pages 19–28. IEEE Computer Society,
2006.
[CCA04]
The Common Component Architecture Forum. CCA: Common component architecture, 2004. http://www.cca-forum.org/, Last accessed on July 17 2006.
[CG89]
N. Carriero and D. Gelernter. Linda in context. Communications of the ACM,
32(4):444–458, 1989.
[CK01]
K. M. Chandy and C. Kesselman. Compositional C++: Compositional parallel
programming. Technical report, California Institute of Technology, Pasadena,
CA, USA, 2001.
[CS03]
M. Carson and D. Santay. NIST Net: A linux-based network emulation tool.
Computer Communication Review, 33(3):111–126, 2003.
[DGTY95] J. Darlington, Y. Guo, H. W. To, and J. Yang. Parallel skeletons for structured
composition. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 95), pages 19–28. ACM
press, 1995.
[DHRR99] T. T. Drashansky, E. N. Houstis, N. Ramakrishnan, and J. R. Rice. Networked
agents for scientific computing. Communications of the ACM, 42(3):48–54, 1999.
[EII00]
Ericsson IP Infrastructure (formerly Torrent Networks). Virtual Net: Virtual
host environment, 2000.
[Eng06]
R. S. Engelschall. GNU Pth: A user-level thread library, 2006. http://www.gnu.org/software/pth/, Last accessed on July 17 2006.
[Fos96]
I. T. Foster. Compositional parallel programming languages. ACM Transactions
on Programming Languages and Systems, 18(4):454–476, 1996.
[FOT92]
I. T. Foster, R. Olsen, and S. Tuecke. Productive parallel programming: The
PCN approach. Journal of Scientific Computing, 1(1):51–66, 1992.
[FSF05]
Free Software Foundation. GNU C library, 2005. http://www.gnu.org/, Last
accessed on July 17 2006.
[FT89]
I. T. Foster and S. Taylor. Strand: A practical parallel programming tool. In
Proceedings of the North American Conference, pages 497–512. MIT Press, 1989.
[FV06]
K. Fall and K. Varadhan. Network emulation with the NS simulator, 2006.
http://www.isi.edu/nsnam/ns/doc/index.html, Last accessed on July 17 2006.
[GB82]
D. Gelernter and A. J. Bernstein. Distributed communication via global buffer.
In Proceedings of the Symposium on Principles of Distributed Computing, pages
10–18. ACM Press, 1982.
[Hef04]
M. Heffner. A runtime framework for adaptive compositional modeling. Master’s
thesis, Virginia Tech, Blacksburg, VA, USA, 2004.
[HK97]
A. Helmy and S. Kumar. Virtual InterNetwork Testbed, 1997. http://www.isi.edu/, Last accessed on July 17 2006.
[HP88]
N. C. Hutchinson and L. L. Peterson. Design of the x-kernel. In Proceedings
of the ACM Symposium on Communications Architectures and Protocols, pages
65–75. ACM Press, 1988.
[HSK99]
X. W. Huang, R. Sharma, and S. Keshav. The ENTRAPID protocol development environment. In Proceedings of the IEEE INFOCOM ’99 Conference on
Computer Communications, pages 1107–1115. IEEE press, 1999.
[JL05]
F. H. Carvalho Jr. and R. D. Lins. The # model: Separation of concerns
for reconciling modularity, abstraction and efficiency in distributed parallel programming. In Proceedings of the 2005 ACM Symposium on Applied Computing
(SAC 05), pages 1357–1364. ACM press, 2005.
[KBA92]
K. R. Koch, R. S. Baker, and R. E. Alcouffe. Solution of the first-order form of
the 3D discrete ordinates equation on a massively parallel processor. Transactions of the American Nuclear Society, 65(198), 1992.
[Kes97]
S. Keshav. Real 5.0 overview, 1997. http://www.cs.cornell.edu/skeshav/real/,
Last accessed on July 17 2006.
[Kne04]
C. Knestrick. LUNAR: A user-level stack library for network emulation. Master’s
thesis, Virginia Tech, Blacksburg, VA, USA, 2004.
[LM01]
H. P. Langtangen and O. Munthe. Solving systems of parallel differential equations using OOP techniques with coupled heat and fluid flow as example. ACM
Transactions On Mathematical Software, 27(1):1–26, 2001.
[MDB03]
N. Mahmood, G. Deng, and J. C. Browne. Compositional development of parallel programs. In Proceedings of the 16th Workshop on Languages and Compilers
for Parallel Computing (LCPC03), pages 109–126. Springer Verlag, 2003.
[MR92]
H. S. McFaddin and J. R. Rice. Collaborating PDE solvers. Applied Numerical
Mathematics, 10:279–295, 1992.
[MS06]
Microsoft Corporation. COM: Component object model technologies, 2006.
http://www.microsoft.com/com/default.mspx, Last accessed on July 17 2006.
[Mu99]
M. Mu. Solving composite problems with interface relaxation. SIAM Journal
on Scientific Computing, 20(4):1394–1416, 1999.
[Muk02]
J. Mukherjee. A compiler directed framework for parallel compositional systems.
Master’s thesis, Virginia Tech, Blacksburg, VA, USA, 2002.
[MV05a]
J. Mukherjee and S. Varadarajan. Develop once deploy anywhere: Achieving adaptivity with a runtime linker/loader framework. In Proceedings of the
4th workshop on Reflective and adaptive middleware systems (ARM ’05). ACM
Press, 2005.
[MV05b]
J. Mukherjee and S. Varadarajan. Weaves: A framework for reconfigurable
programming. International Journal for Parallel Programming (Special Issue),
33(2):279–305, 2005.
[NAG06a]
Numerical Algorithms Group. NAG library manual: Thread safety, 2006. http://www.nag.co.uk/downloads/fldownloads.asp, Last accessed on July 17 2006.
[NAG06b]
Numerical Algorithms Group. NAG software downloads: Fortran library, 2006.
http://www.nag.co.uk/downloads/fldownloads.asp, Last accessed on July 17
2006.
[NBF96]
B. Nichols, D. Buttlar, and J. P. Farrell. Pthreads Programming: A POSIX
Standard for Better Multiprocessing. O’Reilly, 1996.
[NsN06]
NsNam. The network simulator – ns, 2006. http://nsnam.isi.edu/, Last accessed
on July 17 2006.
[OAR04]
OpenMP Architecture Review Board. Official OpenMP specifications, 2004.
http://www.openmp.org/drupal/node/view/8, Last accessed on July 17 2006.
[OMG06]
Object Management Group. CORBA basics, 2006. http://www.omg.org/, Last
accessed on July 17 2006.
[OPN05]
OPNET Technologies. OPNET modeler: Network modeling and simulation
environment, 2005. http://www.opnet.com/products/modeler/home.html, Last
accessed on July 17 2006.
[OPN06]
OPNET Technologies. OPNET, 2006. http://www.opnet.com, Last accessed on
July 17 2006.
[Pax99]
V. Paxson. End-to-end internet packet dynamics. IEEE/ACM Transactions on
Networking, 7(3):277–292, 1999.
[PB98]
S. Prakash and R. Bagrodia. MPI-SIM: Using parallel simulation to evaluate
MPI programs. In Proceedings of the 1998 Winter Simulation Conference (WSC
98), pages 467–474. ACM press, 1998.
[PCB94]
S. Parkes, J. A. Chandy, and P. Banerjee. A library-based approach to portable,
parallel, object-oriented programming: Interface, implementation, and application. In Proceedings of the 1994 ACM/IEEE Conference on Supercomputing (SC
94), pages 69–78. IEEE/ACM, 1994.
[PTL06]
Pervasive Technology Labs at Indiana University. LAM/MPI parallel computing, 2006. http://www.lam-mpi.org/, Last accessed on July 17 2006.
[Ram04]
H. Ramankutty. Inter-process communication, 2004. http://linuxgazette.net/,
Last accessed on July 27 2006.
[Ric98]
J. R. Rice. An agent-based architecture for solving partial differential equations.
SIAM News, 31(6), 1998.
[Riz97]
L. Rizzo. Dummynet: A simple approach to the evaluation of network protocols.
ACM Computer Communications Review, 27(1):31–41, 1997.
[RL98]
M. C. Rinard and M. S. Lam. The design, implementation, and evaluation of
Jade. ACM Transactions on Programming Languages and Systems, 20(3):483–
545, 1998.
[Roe06]
G. Roelofs. libpng, 2006. http://www.libpng.org/, Last accessed on July 17
2006.
[RTV99]
J. R. Rice, P. Tsompanopoulou, and E. A. Vavalis. Interface relaxation methods
for elliptic differential equations. Applied Numerical Mathematics, 32:291–245,
1999.
[Sat02]
M. Sato. OpenMP: Parallel programming API for shared memory multiprocessors and on-chip multiprocessors. In Proceedings of the 15th International
Symposium on System Synthesis (ISSS 02), pages 109–111. IEEE Computer
Society, 2002.
[SteJr90]
G. L. Steele Jr. Making asynchronous parallelism safe for the world. In Conference Record of the 17th Annual ACM Symposium on Principles of Programming
Languages, pages 218–231. ACM Press, 1990.
[Ste97]
W. R. Stevens. UNIX network programming. Prentice-Hall of India, 1997.
[Ter01]
M. Terwilliger. PARSEC: Parallel simulation environment for complex systems,
2001. http://may.cs.ucla.edu/projects/parsec, Last accessed on July 17 2006.
[TIS95]
Tools Interface Standards Committee. Executable and Linkable Format (ELF)
Specification, 1995.
[Var02]
S. Varadarajan. Weaving a code tapestry: A framework for reconfigurable programming. DOE Early Career Proposal, 2002.
[Vik06]
J. Viksell. Dillo, 2006. http://www.dillo.org/, Last accessed on July 17 2006.
[VR05]
S. Varadarajan and N. Ramakrishnan. Novel runtime systems support for adaptive compositional modeling in PSEs. Future Generation Computing Systems
(Special Issue), 21(6):878–895, 2005.
[VYW+ 02] A. Vahdat, K. Yocum, K. Walsh, P. Mahadevan, D. Kostic, J. Chase, and
D. Becker. Scalability and accuracy in a large-scale network emulator. In Proceedings of the 5th Symposium on Operating System Design and Implementation
(OSDI ’02). USENIX Association, 2002.
[Yuj01]
L. Yujih. Distributed simulation of a large-scale radio network. OPNET Technologies’ Contributed Papers, 2001.
Vita
Joy Mukherjee was born on the 16th of May 1978 at Ranchi, India. He graduated with
a Bachelor of Technology (Honors) Degree in Computer Science and Engineering from the
Indian Institute of Technology, Kharagpur, India in July 2000. He received the degree of
Master of Science in Computer Science from Virginia Tech, Blacksburg, Virginia, USA in
December 2002. His academic interests include Computer Systems, Systems Support for Programming Languages, Adaptive Software Systems, Parallel Computing, Binary Technologies,
and Compilers. As asides, he indulges in novels, advances in the natural sciences, traveling, sketching, music, and various outdoor sports. He is to join Oracle, India, for research and
development of systems software support for reliable computer clusters.