Autonomic Computing - Principles, Design and Implementation
Philippe Lalanda
Julie A. McCann
Ada Diaconescu
Series editor
Ian Mackie
Advisory board
Samson Abramsky, University of Oxford, Oxford, UK
Karin Breitman, Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro, Brazil
Chris Hankin, Imperial College London, London, UK
Dexter Kozen, Cornell University, Ithaca, USA
Andrew Pitts, University of Cambridge, Cambridge, UK
Hanne Riis Nielson, Technical University of Denmark, Kongens Lyngby, Denmark
Steven Skiena, Stony Brook University, Stony Brook, USA
Iain Stewart, University of Durham, Durham, UK
Undergraduate Topics in Computer Science (UTiCS) delivers high-quality instructional content for
undergraduates studying in all areas of computing and information science. From core foundational and
theoretical material to final-year topics and applications, UTiCS books take a fresh, concise, and modern
approach and are ideal for self-study or for a one- or two-semester course. The texts are all authored by
established experts in their fields, reviewed by an international advisory board, and contain numerous
examples and problems. Many include fully worked solutions.
Philippe Lalanda
Laboratoire Informatique de Grenoble
Université Joseph Fourier
Grenoble, France
Julie A. McCann
Department of Computing
Imperial College London
London, UK
Ada Diaconescu
Department of Computing and Networking
Télécom ParisTech
Paris, France
ISSN 1863-7310
ISBN 978-1-4471-5006-0
ISBN 978-1-4471-5007-7 (eBook)
DOI 10.1007/978-1-4471-5007-7
Springer London Heidelberg New York Dordrecht
Library of Congress Control Number: 2013936543
© Springer-Verlag London 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this
publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's
location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Foreword
The success of the initiative has already been indicated by the notable move of
autonomicity from the previously mentioned conferences and communities to a
standard topic in almost all computer- and communications-based conferences
and communities. Ironically, the final success of the initiative may be marked when it
no longer exists separately as a specialised field but becomes a standard, invisible and
integrated part of our systems and software engineering.
For autonomic systems research and development to make further leaps
and bounds, and to move convincingly into the next decade to meet the Software Crisis
2.0 and its other longer-term goals, it must move beyond the research labs
and PhD programmes to our graduate, undergraduate and CPPD (continuous
professional and personal development) courses.
This book marks the enabler for that next stage.
University of Ulster, Northern Ireland
12 December 2012
Roy Sterritt
References
Fitzgerald, B.: Software crisis 2.0. IEEE Comput. 45(4) (2012)
Horn, P.: Autonomic computing: IBM's perspective on the state of information technology. IBM
T.J. Watson Labs, New York (2001)
Preface
Philippe Lalanda
Julie A. McCann
Ada Diaconescu
Acknowledgments
Many useful discussions with colleagues and students helped in the preparation of
this book. The authors would like to thank the following people who, without
reward, reviewed and critiqued the text; their comments and suggestions have been
invaluable in ensuring the quality of this book:
Luciano Baresi, Charles Consel, Clément Escoffier, Catherine Hamon, Roman
Kolcun, Pedro Martins, Iulian Neamtiu, Simon O'Keefe, Alessandra Russo, Poonam
Yadav and Shusen Yang.
We are also grateful to Simon Rees of Springer who encouraged us to write the
book and provided invaluable assistance in the production of the final copy.
We would also like to thank our colleagues, friends and family for their constant
support, encouragement and patience. Ada thanks Mr. Smith for regularly changing
the subject. Julie thanks husband Grant and son Carter: you can now use my laptop
to watch 1950s cartoons. Philippe thanks his wife, now an expert in autonomic
computing, and his two sons, Grégoire and Arthur: experts to come!
1.1 Software Complexity
Software systems are becoming more distributed, and they are getting larger. Systems counting many millions of lines of code, and which are subjected to thousands of updates per year, are
a frequent occurrence. They arguably constitute some of the most complex artefacts
ever built by human beings. To give an order of magnitude, David A. Wheeler estimated that version 7.1 of the Red Hat Linux operating system represents about 8,000 person-years
of development time.2 As a comparison, the construction of the Empire
State Building required only 3,500 person-years!
It seems, however, that the situation is changing. New domains like Internet services,
cloud or pervasive computing are emerging and placing new demands on software
systems. Specifically, systems have to be even more distributed, more heterogeneous, more dynamic, developed more rapidly, etc. Many think we have reached a
barrier in terms of being able to overcome such complexities. Software engineers
are beginning to feel that they are unable to anticipate, design and maintain such
systems using traditional approaches.
As a result, there has been a push towards more automated approaches to help
develop and, above all, administer and maintain software systems. To refer to this
new set of practices, IBM coined the term autonomic computing.
Autonomic computing is the main topic of this book. It can be viewed as one
approach to the engineering of software systems and, as such, touches the broad
scope of sub-disciplines in the computing field, including requirements engineering, software architectures, design, development, testing, maintenance, configuration
management and quality. For these reasons, we begin this book by providing a background introducing traditional software engineering approaches. We believe it is of
major importance to be familiar with such practices in order to understand why and how
they have adapted to face new challenges. Conversely, it cannot be denied that most of
these traditional techniques are still required in the implementation of future solutions.
In this introduction, we focus on the notion of customisable software processes to guide
the development and maintenance of software systems, for this notion has had a deep and lasting
impact on software practices. Software engineering processes acknowledge that the
production of software systems can be managed and, in doing so, have encouraged the
controlled production of reliable, high-quality, cost-effective software systems.
Software processes describe the activities that are to be performed to enable
software system production. There have been a number of models that describe the
individual activities that occur during this process and how those activities interact,
characterising the methodologies that define best practice. Activities are divided
into development and administration cycles. Development activities deal substantially with the production of programmes meeting specified requirements.
Administration activities are more concerned with system deployment and its
day-to-day management and maintenance. Identifying the commonality of activities
across a number of software development initiatives, and then sharing the resulting
knowledge, artefacts, etc., has had a great impact on software practices. That is, it
has allowed the definition of the successful, repeatable techniques that are now
taught in universities and widely used by practitioners.
2. http://www.dwheeler.com/sloc/redhat71-v1/redhat71sloc.html
1.2.1 Software Development
Software engineers identified the problem of complexity early in the history of computing.
The famous so-called software crisis appeared in the late 1960s, when the term software
engineering was coined. Put simply, software engineering focuses on how complex
computing projects can be designed and then managed over the life cycle of a project.
Software engineering can be defined as a systematic discipline that aims to
improve the specification, design, implementation and maintenance of software
systems by increasing their quality and cost-effectiveness. Precise definitions of
software engineering are not readily found; however, many books introduce the
subject in detail [3, 4]. Nevertheless, its concepts are derived from the fields of
mathematics, computer science and, of course, engineering practice.
Software engineering has developed successful techniques and processes to help
build programmes and conduct projects. Many techniques rely on the principles of
modularity and separation of concerns. Application of these principles to programming,
for instance, has led to the definition of structured programming, object-oriented
programming, software componentisation and so on. Applying these
principles to the management of software projects has resulted in the definition of development processes.
Development processes have been defined in order to decompose the production
of software into a number of smaller and more controllable activities. Particularly, a
software development process specifies and organises a set of interrelated activities
that can be followed in order to properly deliver a quality software system. Because
different software systems are required for each specific situation, processes are
usually defined as models (i.e. defined in abstract terms) and then customised,
case-by-case, to meet the specific needs of each particular software project.
Development activities include requirements management, design specification,
implementation and validation of software systems. A number of process models
have been proposed to coordinate and implement these activities. The Waterfall
model [5], for instance, relies on a sequential approach: requirements first, then
design, coding and testing. However, this approach does not provide the opportunity
to revise or review the work carried out in the initial stages of the life cycle when the
project is in the development process (i.e. one cannot revisit the requirements and
design stages when working in the development and testing phases).
Such sequential processes are less popular today because of their inability to deal
with change. They are simply not suited to the way in which software is produced
in the current fast-paced, dynamic business environments. Nowadays, the approach
to the development process is incremental. As illustrated by Fig. 1.1, software systems
are built through successive increments, where at each stage, requirements, design,
coding and testing activities are carried out. This process is repeated until the software
system is ready for delivery.
Fig. 1.1 The incremental development process (requirements, design, implementation, test)

Each activity uses and produces software artefacts that are very diverse in nature.
They can include textual documents, structured texts, graphical models, source files
and binary files. These artefacts are complex because of their number and their volume.
They can be made of many interacting elements and are consequently hard to
integrate, administer and maintain. They are also characterised by a number of
traceability links. For instance, a system's architectural decisions are made to meet
certain requirements, so the pieces of code used to implement them should be
traceable back to an architectural design and, in turn, to its requirements.
These artefacts constitute the base elements of every software project and determine its success. They must be modular, with strong cohesion and weak coupling.3
That is, where the cohesion of software artefacts is strong, their readability, maintainability and reuse are maximised. Likewise, minimising the coupling between artefacts is also good for readability, maintainability and reusability. These properties
are of utmost importance when it comes to software evolution. Well-structured,
coherent and decoupled artefacts favour evolution and limit the propagation of
uncontrolled side effects. Maintaining relationships between artefacts is highly
important. Not understanding the rationale behind an artefact's internal structure
and its relationships makes it almost impossible to update a system without the risk
of causing undesirable side effects.
A simple example could be where a component is used to determine the location
of the user. This component is used to direct music to the nearest speaker to that user.
The code that ships the music around the room uses a location component as a black
box (i.e. the system is not interested in how it calculates location, just the integers that
represent location co-ordinates resulting from executing the code in the component).
This system has its functions and services represented as
components, so it has the advantage of being decoupled and well structured. Therefore,
when we find a better way to get the user's location, we can take out the old component
and plug in the new. However, side effects can still happen. Perhaps the precision of
the new location component is higher than the old one, so the numbers representing
the co-ordinates adhere to a smaller location grid space. When added to our music
system, this might have the side effect of not mapping to the locations of the speakers,
and therefore, the system will direct music elsewhere.
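To make this concrete, here is a minimal Java sketch written purely for illustration; the LocationService, CoarseLocator, PreciseLocator and MusicRouter names are hypothetical, not from any real system. The router uses the location component as a black box, and swapping in a more precise implementation compiles and runs, yet the changed co-ordinate grid silently breaks the mapping to speaker positions.

// Illustrative sketch only: a black-box location component and a music router.
import java.util.Map;

interface LocationService {
    int[] locate();                                    // {x, y} grid co-ordinates
}

class CoarseLocator implements LocationService {       // old component: 1-metre grid
    public int[] locate() { return new int[]{2, 3}; }
}

class PreciseLocator implements LocationService {      // new component: 10-cm grid
    public int[] locate() { return new int[]{20, 30}; } // same physical spot!
}

class MusicRouter {
    // Speaker positions were recorded against the old 1-metre grid.
    private final Map<String, int[]> speakers =
            Map.of("kitchen", new int[]{2, 3}, "lounge", new int[]{7, 1});
    private final LocationService locator;

    MusicRouter(LocationService locator) { this.locator = locator; }

    String nearestSpeaker() {                          // squared Euclidean distance
        int[] user = locator.locate();
        String best = null;
        long bestDist = Long.MAX_VALUE;
        for (Map.Entry<String, int[]> e : speakers.entrySet()) {
            long dx = e.getValue()[0] - user[0], dy = e.getValue()[1] - user[1];
            long d = dx * dx + dy * dy;
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }
}

public class SideEffectDemo {
    public static void main(String[] args) {
        System.out.println(new MusicRouter(new CoarseLocator()).nearestSpeaker());  // kitchen
        System.out.println(new MusicRouter(new PreciseLocator()).nearestSpeaker()); // lounge: wrong room
    }
}

Nothing in the component's interface changed, so the type system cannot catch the mismatch: the unit convention was an implicit part of the contract.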
3. Cohesion is about the functional scope of an artefact (a component, a class, a design diagram, an analysis diagram, etc.). Coupling is about the number and nature of the relationships between artefacts.
Fig. 1.2 The main deployment activities: release, install, activate and update
1.2.2 Software Deployment
Deployment starts when a software system has been duly approved for delivery.
Its purpose is to deliver a live software system to the user, ensuring that it is
deployed and running on the client's site. It handles the transfer, installation,
configuration and integration of concrete artefacts. It initiates the different
executable components of the software system and deals with subsequent updates.
Deployment is normally carried out by authorised administrators. It is traditionally decomposed into the following (sub-)activities (Fig. 1.2): release, retire, install,
uninstall, activate, deactivate, update and adapt [6].
Iterating through this list, the purpose of the release activity is to prepare the
software so that it can be transferred to the client. Simply put, it consists of packaging the constituents of the software with the information required by the deployment
processes that follow. De-release, or retirement, is the inverse activity of release.
It is carried out when a software system is obsolete or is no longer needed.
The installation activity inserts the software in the target environment and
configures it for execution. In the simplest cases, installation is about copying
files to a target execution infrastructure. Most of the time, however, it requires a
sequence of operations to be performed such as uncompressing files, selecting
locations for installation, getting appropriate permissions, configuring some
aspects of the software system and integrating the software system in the existing
computing infrastructure.
The activation activity comes after the software system installation. Its purpose is
to start the executable elements that have been previously installed. In the simplest
cases, activation consists of calling a unique binary file (i.e. a programme) with
the appropriate input parameters. In some situations, however, it requires several
programmes to be initiated, and these may be installed on different machines.
Un-installation is the inverse activity of installation. It is carried out when the
presence of the software system is no longer required. Deactivation is the opposite
activity of activation. It is done when the execution of a software system is no longer
required (or the service is no longer offered). Deactivation and de-installation are
complex activities in their own right due to code dependencies. Their implementation may imply the reconfiguration of components that have a dependency relationship with the element that is being decommissioned.
The purpose of the update and adapt activities is to change parts of (or indeed
the complete) software system that has been previously installed, which may or may
not be activated at this point. This activity is carried out as many times as necessary
during the lifetime of a software system. Updates are traditionally performed on the
client site by an administrator. Increasingly, however, updates are initiated
remotely by a third party, for example, an operating system (OS) update from an OS
vendor, who controls delivery dates and update frequencies. Periodic security
updates to the software are a good example of this. Here, code that fixes security
vulnerabilities is developed and put at the client's disposal by software providers.
A security patch can then be inserted by the client administrator or remotely pushed
by the provider, with or without prior authorisation, depending on the vendor-user
agreement.
Intuitively, the update activity would appear to be a simpler task compared
to installation and activation, but is it? Clearly, considerable configuration and
integration work has to be carried out during the early phases of the life
cycle. However, a closer look reveals that the update activity is heavily constrained
and often very complex. It has to respect the computing environment that uses or
relies on the software component so as not to introduce new problems. Also, an
update must preserve data, states, intermediary results, etc. Poorly designed
updates can introduce new problems requiring the software to be rolled back
to its previous state.
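The ordering constraints among these activities can be made explicit in code. The sketch below is ours rather than the book's (the DeploymentState and DeployedUnit names are illustrative): it models a simplified deployment life cycle as a small Java state machine, with update legal only when something is actually installed.

// A simplified deployment life cycle as a state machine (illustrative names).
enum DeploymentState { RELEASED, INSTALLED, ACTIVE, RETIRED }

class DeployedUnit {
    private DeploymentState state = DeploymentState.RELEASED;

    void install()    { require(DeploymentState.RELEASED);  state = DeploymentState.INSTALLED; }
    void activate()   { require(DeploymentState.INSTALLED); state = DeploymentState.ACTIVE; }
    void deactivate() { require(DeploymentState.ACTIVE);    state = DeploymentState.INSTALLED; }
    void uninstall()  { require(DeploymentState.INSTALLED); state = DeploymentState.RELEASED; }
    void retire()     { require(DeploymentState.RELEASED);  state = DeploymentState.RETIRED; }

    void update() {
        // Permitted whether or not the unit is active, but a live update must
        // preserve data, state and intermediary results, as discussed above.
        if (state != DeploymentState.INSTALLED && state != DeploymentState.ACTIVE)
            throw new IllegalStateException("nothing installed to update");
    }

    private void require(DeploymentState expected) {
        if (state != expected)
            throw new IllegalStateException("expected " + expected + ", was " + state);
    }
}

Real deployment tools must additionally handle dependencies between units, so that deactivating or uninstalling one element triggers the reconfiguration of its dependants, as noted above.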
Deployment is a key software activity. It is technically very challenging in the sense
that all sorts of operations are required: software artefacts have to be compressed, packaged,
transferred, uncompressed, copied, configured, integrated, started, modified, etc. In
addition, software systems can be required to stay in operation while further deployment
activities are executed, which clearly increases the level of difficulty.
Deployment has long been underestimated and is only now receiving due attention.
The recent availability of new tools, such as the Chef configuration management tool,
facilitates the work of administrators by providing higher-level languages
to automate the deployment of complex infrastructures. However, even with such
new-generation tools, the level of complexity remains high.
The complexity of the deployment activities is actually at the heart of the motivation for autonomic computing. Performing the various deployment tasks is genuinely
complex. For instance, the initial configuration stage can include hundreds of
parameters to be set. Also, software systems may have to be integrated with heterogeneous systems, whose lifetimes can be dynamic and which can be spread over local or
wide-area networks. It is necessary, in this situation, to identify these systems
and their configurations to correctly install and run the software. Over time, as software and its underlying execution platform change, some deployment activities,
including configuration, must be repeated (see next subsection). This comes
at a great cost, the majority of which lies not in the computing infrastructure but in the
time and salaries of the staff (system administrators and others) involved
in the process.
1.2.3 Software Maintenance
Maintenance starts after the software's initial installation. Its purpose is to modify
the software being used in order to fix bugs, to improve quality of service or to
address new conditions for execution. Maintenance comprises a number of activities,
ranging from the simple reconfiguration of certain parameters to more complex
operations, like the development of new pieces of code or the migration to new
running platforms.
It is important to understand that maintenance is not limited to minor changes
in operational systems. Maintenance, in fact, has to deal with changing user requirements and operating environments (as explained in Sect. 1.1) and sometimes
requires that major updates are carried out. In recognition of this, Lehman termed
the maintenance function evolutionary development [2].
Traditionally, maintenance activities are classified into four categories. Corrective
maintenance takes care of faults and errors detected in delivered software. Adaptive
maintenance is concerned with evolving the system to better match user needs and changes
in the system's environment. Perfective maintenance deals with evolutions in the
desired functions or related quality of service. Finally, preventive maintenance
targets the detection and correction of latent faults in the delivered software. Contrary
to common perception, correcting misbehaviours accounts for, on average, less
than 20 % of the total maintenance effort [7]. That means that around 80 % of the
maintenance effort is dedicated to software evolution (where adaptive maintenance
accounts for 25 %, perfective maintenance for 50 % and preventive maintenance
for only 5 %).
It is commonly accepted that maintenance is a complex, time-consuming activity
that can take far more time than the initial development of the software. In fact, it is
today acknowledged that between 50 and 80 % of effort spent on a computer system
will happen after it has been delivered to the customer. As detailed in the previous
sections, updating a software system is inherently complex. It can require changing
unstructured or badly structured code, often in situations where documentation is
lacking. As software ages, the software structure is likely to be altered by successive
updates, and, as a consequence, changes become difficult and risky to perform
(since side effects are more likely to occur).
Maintenance is carried out by one or several system administrators, often called
the sysadmin. The responsibilities of system administrators are many and vary
according to organisations. In most cases, they require high technical skills in order
to, for instance, configure databases, Web servers or networks, in accordance with
what is expected of the system. They also need in-depth expertise to be able to solve
problems affecting the behaviour of a software system: they need to understand
the purpose and nature of a software system in order to quickly determine what goes
wrong and to fix it (or to report it to interested parties).

Fig. 1.3 Maintenance activities: the systems administrator monitors the installed software and responds through deployment or through new developments (requirements, design, implementation, test)
The administration job often has to be carried out under considerable stress.
As we will see later in this chapter and the next, the complexity and stress faced
by administrators can lead to bad decision-making with, sometimes, undesirable
outcomes.
As illustrated by Fig. 1.3, we can distinguish two forms of administrative actions:
the adaptation of software already installed and the integration of newly developed
artefacts. New artefacts can correspond to expected updates or to specific answers
to requests initiated by the administrator. In the latter case, the term software patch
is often used in maintenance terminology.
Let us examine these two categories of actions. In the first case, the purpose is
essentially to monitor the software systems and to adapt the software artefacts that
are already present. There are a number of ways in which this can be achieved. First,
it can be executed through the appropriate tuning of configuration parameters,
which can be tricky in some instances. Indeed, many systems are characterised by
hundreds of parameters with tight interrelationships. Finding out the best values and
the right balance between values requires advanced skills and expertise that are not
easy to find. It may often require the intervention of a third-party expert or consultant.
Complexity is so considerable that in many cases, most parameters remain
unchanged, even at the cost of degraded performance. Adapting existing artefacts
may take other forms. It may require that some programmes or data are moved to
different computers or different middleware. That is, it can also involve the migration of some parts of the (or the complete) software to a new version of a supporting
middleware. Such an achievement may be long and difficult and may demand code
rewrites. In general, processes have been defined in order to guide such operations
and minimise trouble.
In certain cases, reconfiguring existing artefacts is not enough to enable the system
to adapt to new conditions. Here, deeper changes motivating system redevelopment
are needed. Note, new developments generally use the same tools and processes as
those used to develop the system initially (with the additional constraint that existing
software has to be accounted for). Also, system administrators use the same deployment primitives in order to install, integrate and activate updates. In this regard,
initial developments and subsequent maintenance-related developments are very
similar. Their shared purpose is to produce code meeting current requirements and
to deploy it on the client's site.
Relatively little effort has been dedicated to the deployment and maintenance
activities, compared with the relative importance and cost of these phases. For a
long time, software engineering activity was focused on the development phase.
This focus is not surprising: most research effort in the history of software
engineering has sought to improve the way we produce software systems that meet the
client's expectations, minimising the chance of misbehaviour at runtime, etc. [8].
In this context, the software administrator has a difficult and sometimes unacknowledged task. Software administrators often have to carry out delicate and
sometimes vital operations with poor tools that are neither well integrated nor sufficiently abstract to aid the job. When a problem is detected, system administrators
are often faced with a dilemma. Either they update the system without all the knowledge necessary to be certain of avoiding undesirable side effects, or they
limit their actions and only report problems, waiting for these to be fixed by
developers (termed new developments in Fig. 1.3). The latter may receive low
prioritisation from project managers and developers, and therefore this task
may take some time.
Many think that this situation has worsened in recent years. Indeed advances in
hardware and networks, combined with ever growing demands for new features and
the increased pressure on time-to-market, have deeply changed the software industry.
Customers are longing for new software-based services, and companies are striving
to supply these new services as fast as possible. This raises serious challenges since
it means more complexity and more frequent updates, which in turn equates to more
functionality and more code being required to be delivered in increasingly shrinking
delivery times. Ensuring system correctness and dependability in these conditions
becomes a real challenge.
1.3 Maintenance Challenges
Hardware performance is still increasing exponentially. Though under reassessment due to the bounds of physics, Moore's law is still in evidence: we can
still observe that storage capacity and CPU speed approximately double every
18 months. Also, all sorts of networks are spreading around us. They allow the
connection of a myriad of devices, some with a relatively small footprint but
powerful enough to host software-based functions. The pervasive and cloud computing paradigms, which either place computing into the fabric of the environment or
alternatively move storage and heavy computation into the cloud (a virtual computer
whose geographical location can be totally transparent to the user), merely reinforce this trend.
However, this constant evolution is not without serious problems. In fact, the
way software systems are currently developed and maintained is called into question. In short, the development of software systems has to become faster and more
agile, whereas maintenance has to be able to perform more functions more efficiently in order to remain in line with its environment.
To meet these demanding requirements, practitioners have adopted new development practices. First, the time when software was developed entirely for one project is
over. Instead, to reduce costs and production delays, software development
is now more like assembling external components, called COTS (components off the
shelf), which are often provided by third parties (such as corporations and
open-source communities). COTS can generally be configured before being executed
and administered during execution. But they are very heterogeneous. They frequently
come with their own configuration methods, tools and vocabulary. Thus, parts of a
single system can be configured via a specific XML-based language, for example,
while others are configured by a command line interpreter or via a Web interface. Moreover, COTS
allow the configuration of many parameters, sometimes several hundred, and this
facilitates the system being tailored to user preferences as well as helping it fit the
computing infrastructure that will support it. Also, COTS have diverse
goals and evolution cycles, so the maintenance of COTS-based systems requires
expertise in numerous technologies and tools. It also requires being able to
follow changes to components that are de facto beyond the user's control.
Also, the components that make up software applications are often distributed over
networks of different kinds, so software maintenance increasingly includes
the configuration and subsequent monitoring of a number of networks. Here again,
networks are often not under the system administrator's sole control; they evolve
according to their own strategies and schedules, not specifically following the
exclusive needs of the software systems that use them. The proliferation of networks
also brings the development of new software distribution methods.
With the arrival of cloud computing, many services are now remotely available,
for instance, office automation suites such as Microsoft Office. These new applications constitute the main business of modern corporations like Google. However,
the externalisation of services and data storage has led to stringent requirements
regarding system availability and performance. Service consumers, especially corporations, may request availability rates of 99.999 %, in the knowledge that service
interruptions imply heavy financial penalties and thus must be minimised. In 2008,
for instance, the cost of downtime for a corporation like Amazon came to tens of
thousands of dollars per minute.5

Fig. 1.4 Percentage of errors ordered by cause for three Websites in 2000 [10]
To reach such quality, in terms of performance and availability,
maintenance operations have to be performed very quickly and reliably. Currently,
however, on large heterogeneous software systems, maintenance operations do not meet
these requirements.
Thus, software applications have become heterogeneous, networked and vital for
both the economy and society overall. They are part of sophisticated ecosystems
and evolve in unstable, even unpredictable, contexts. As explained earlier, a direct
outcome is that maintenance has become increasingly complex and administrators
have to face increasing pressure.
Of course, companies are aware of this issue, and a number of countermeasures
have been taken. For instance, the heterogeneity and complexity of administration
tools have required the specialisation of administrative staff and the setting up of
specific training programmes. Cisco and Oracle, among others, provide qualification certificates that allow system administrators to demonstrate that they can control their specific systems. Nevertheless, it is clear that
software systems made of networked heterogeneous elements are still difficult to
install, configure and maintain, not to mention optimise. Administrators, as
skilled as they may be, are reaching the limit of human capability. Also, the cost of
hiring experts is not affordable beyond a certain limit.
At the same time, the human resources needed for the deployment and maintenance of software have greatly increased over the last few years. Human beings are
increasingly involved in the day-to-day operations of software systems.
However, until recently, human administration mistakes were not really taken into
account. That is, the administrator was not considered a potential source of errors
during the deployment, update and, more generally, maintenance and problem-fixing stages of the software life cycle. This assumption is no longer reasonable. In the
early 2000s, many surveys were published on the causes of errors and repair costs
in information systems, for example, [9]. Figure 1.4 displays the result of a
6-month survey conducted in 2001 concerning three anonymous medium-sized
Websites. It shows very clearly that most of the errors were caused by operations, that is, by the system administrators. Today, it is estimated that system administrators themselves cause approximately 40 % of the errors resulting in breakdowns.

5. http://news.cnet.com/8301-10784_3-9962010-7.html
It is foreseen that this situation is going to get worse. In 1991, Mark Weiser
described a world in which computers would be omnipresent and transparent to
users [11]. This vision has given birth to the pervasive computing field, which is
becoming ever more concrete around us. Soon, non-expert users will have to carry out some
form of software installation, maintenance, etc. The technical skills of such users
may be relatively low, and even if it is reasonable to think that these skills will
increase with experience, they are unlikely to reach a level sufficient for
facing the complexity of current and future computing systems. But, more than a
problem of skill sets, users are not interested in being system administrators;
they simply want the smooth operation of the software.
On the other hand, the environment in which pervasive applications evolve is
highly fluctuating and depends, for instance, on the operation of the network infrastructure, energy availability and other conditions (sound level, temperature, etc.)
that hold at each instant. An extreme example of this dynamism lies at the cyber-physical interface: the crossing point where the computer system and the environment meet. This issue pertains mostly to modern embedded computing systems,
where the extreme dynamism and interdependency between the critical components
and the physical environment are as yet not well understood. Here, the gap between
software engineering and systems engineering needs to be bridged to allow systems
to adapt to change.
1.4 Autonomic Computing
Of course, these benefits are very appealing and have fostered much research
work around the world. This endeavour has taught us one thing, however:
implementing autonomic solutions is also very challenging. That is, autonomic systems are more difficult to design, implement and validate than software systems
without the ability to self-manage. This is quite understandable; complexity cannot
just disappear. Like a computational law of thermodynamics, the complexity is simply moved from runtime to design time, as a countermeasure to the increasing complexity of runtime administration in dynamic, fluctuating environments.
Therefore, in order to unburden system administrators
and decrease ownership costs, autonomic software systems are certainly more difficult
to conceive and implement.
In fact, implementing autonomic solutions has a profound influence on most of
the software engineering activities previously presented in this chapter. Of course,
these activities still have to be used to produce autonomic software systems, but
they have to be refined to meet more ambitious goals. In particular, four major new
requirements have to be considered so that a computer system can be administered
with minimal human intervention (a minimal code sketch of these capabilities follows the list):
• A computer system must be able to monitor itself at runtime in order to know its internal situation. It also has to monitor part of its execution environment in order to follow relevant evolutions.
• A computer system must be able to keep some knowledge about its goals, its past and the current situation. It then has to integrate some type of reasoning capability to decide on corrective actions whenever needed.
• A computer system must be able to adapt itself at runtime in order to implement the corrective administrative actions that are required. Such adaptations must not endanger or corrupt ongoing operations.
• A computer system should provide a high-level interface, allowing human administrators to specify or modify system goals, tune reasoning processes and observe the system's ability to attain its objectives.
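As a rough illustration, and under the assumption of entirely hypothetical names (this is our sketch, not an API from the autonomic computing literature), the four requirements can be read as the contract of a single Java type driven by a periodic control loop:

// Hypothetical sketch of the four capabilities as a Java contract.
import java.util.Map;

interface AutonomicElement {
    Map<String, Object> monitor();                     // 1. observe self and environment
    String decide(Map<String, Object> observations);   // 2. reason over knowledge and goals
    void adapt(String action);                         // 3. safe runtime adaptation
    void setGoal(String goal);                         // 4. high-level administrator interface
}

class ControlLoop implements Runnable {
    private final AutonomicElement element;

    ControlLoop(AutonomicElement element) { this.element = element; }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Map<String, Object> obs = element.monitor();
            String action = element.decide(obs);        // null means "no correction needed"
            if (action != null) element.adapt(action);
            try { Thread.sleep(1_000); } catch (InterruptedException e) { return; }
        }
    }
}

Later chapters develop the real architectural counterpart of such loops.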
In order to achieve these demanding requirements, most software engineering
activities must be revisited. The requirements phase, for instance, must decide on the
(types of) adaptations and monitoring data that are desired so that the system can self-manage. Some high-level requirements (the administrative goals) must become
explicit and formally defined so as to be interpretable and manoeuvrable by the software
system at runtime, since these drive the system's operation. That is, the self-managed,
autonomic system adapts its behaviour to best maintain these sets of goals.
As a self-managed system must be aware of its own operation, the design phase
must decide not only on the adaptation but also on the monitoring features that
must be incorporated in the software system. This can be complex and tricky.
Since monitoring can be extremely costly, only relevant information should be collected. If possible, monitoring should also be configurable so that it can be adapted to
the current needs of the system as it strives to meet its goals. In some situations,
monitoring can even be disengaged for performance reasons. Anticipating and
allowing runtime adaptation is also very challenging. It means building appropriate architectural styles and design approaches to enhance flexibility and enable
safe runtime change. It also calls for specific mechanisms preserving ongoing
computations during code adaptation. Finally, some parts of the implementation
have to be self-described and possibly available online (e.g. from repositories) so
as to enable their automatic instantiation (deployment) or replacement depending on
runtime needs.
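As an example of what configurable, disengageable monitoring can mean at the code level, consider the following sketch (a hypothetical class using only standard JVM calls), where the sampling period can be tuned and collection switched off entirely for performance reasons:

// Illustrative monitor whose cost can be tuned or eliminated at runtime.
import java.util.concurrent.atomic.AtomicBoolean;

class ConfigurableMonitor {
    private final AtomicBoolean enabled = new AtomicBoolean(true);
    private volatile long periodMillis = 1_000;        // sampling period, adjustable

    void disengage() { enabled.set(false); }           // e.g. under heavy load
    void engage(long newPeriodMillis) { periodMillis = newPeriodMillis; enabled.set(true); }

    void sampleLoop() throws InterruptedException {
        while (true) {
            if (enabled.get()) {                       // collect only when needed
                long used = Runtime.getRuntime().totalMemory()
                          - Runtime.getRuntime().freeMemory();
                System.out.println("heap used (bytes): " + used);
            }
            Thread.sleep(periodMillis);
        }
    }
}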
To some extent, the software engineering phases are progressively pushed into
the runtime. The ultimate goal is then to extend or evolve current software engineering
practices so that they can be partially performed during runtime. That is, a software
system should be able to interpret (or even create) formal requirements (goals) at
runtime, to apply existing designs for adaptation, to create or change implementations, and to (re-)deploy and (re-)instantiate them, etc.
Clearly, there is a strong relationship between autonomic computing and
software engineering. Autonomic computing will force software engineering to
come up with new techniques and new approaches to software development and
maintenance. This is truly an exciting challenge but a really difficult one indeed.
This observation is one of the early motivations of this book. Beyond necessary
explanations about the objectives and interests of autonomic computing, it seems
important to us to go through the different software engineering techniques that are
currently available for organising and developing self-managed software systems.
However, a comprehensive study of all modern software engineering techniques
is beyond the scope of this text. Instead, in introducing the field of autonomic computing, we discuss software engineering implicitly; we introduce the elements of
software engineering that are either relevant to a particular capability necessary for
making a system self-managing or that are impacted by the move towards autonomic
computing systems. More precisely, we present the principles and methodologies
applicable to building autonomic computing architectures (Chap. 4), enabling
systems to self-monitor (Chap. 5) and self-adapt (Chap. 6) and then the methods that
systems can use in order to make adaptation decisions (Chap. 7). Finally, we provide
some pointers on how software engineering can intervene for developing evaluation
solutions for self-managed systems (Chap. 8).
Unclear or immature software engineering techniques with respect to their applicability to autonomic computing are not addressed in this book. For instance, the
problem of discovering and formally representing requirements related to autonomic needs is not covered here, since much research is still required on this subject.
However, the more futuristic aspects of this field will be touched upon in the conclusion
and the last sections of Chap. 9.
1.5 Book Structure
The structure of this book reflects the observations made in the previous section.
Specifically, this book is made of the following chapters:
Chapter 2: Autonomic Systems
The purpose of this chapter is to define the autonomic computing paradigm and
to introduce the related terminology. It discusses the main notions that are
different forms of data sources and destinations. This chapter also presents ongoing
work offering further management capabilities and aiming to progress towards
endowing the Cilia technology with fully autonomic life-cycle management
capabilities.
Chapter 10: Future of Autonomic Computing and Conclusions
The purpose of this final chapter is to recap the key points tackled in this book
and to introduce the reader to the open issues in autonomic computing. Precisely,
this chapter aims to look ahead and foresee the future of autonomic computing.
The purpose of this book is hence to clarify the software engineering techniques
used by autonomic computing. It is a practical guide to introduce the concepts of
autonomic computing to advanced students, researchers and system managers alike.
Through the combined use of examples and practical projects, the aim is to enable
the reader to rapidly understand the theories, models, design principles and challenges of this subject while building upon their current knowledge, thus reinforcing
the concepts of autonomic computing and self-management.
We hope that this book allows the advanced computing student and researcher to
consolidate their programming, artificial intelligence, systems architecture and
software engineering courses, helping them to better architect robust yet flexible
software systems capable of meeting the computing demands of today and the future.
We also hope that this book can help those responsible for the development and
maintenance of real-world systems currently in operation to understand the benefits
that the autonomic computing approach can bring. We hope that the concise nature of
this book allows them to rapidly catch up with the work that has been carried out in
this field as well as to get introduced to some fundamental aspects of self-management
that are beyond the scope of traditional computing training (e.g. control theory).
This should provide a greater grounding in the subject, and when combined with the
practical nature of the examples and projects, readers should be in a better position
to design and engineer self-management features into current systems as well as to
develop strategies for building new systems.
1.6 Key Points
• Much of software engineering research effort has been dedicated to the development phase of the software life cycle. Relatively little attention has been given so far to the maintenance phase. Simply put, system administrators observe systems at runtime, change minor things when needed and, otherwise, send a request to developers if something serious happens. However, this is no longer a suitable approach when software gets complex and its environment is ever-changing.
• Despite its inherent complexity, software has pervaded our professional and social lives, and users want more functions today, accessible anywhere and anytime. These new, demanding requirements change the way software systems are structured and managed.
• Great emphasis is now put on the runtime aspects of the software life cycle: software management is getting more complex and ambitious. Engineers are beginning to feel that they are unable to maintain new systems using traditional approaches.
• Motivated by this problem, a major initiative came from IBM. This sparked off the use of the term autonomic computing to characterise the notion of a computer system that is able to adapt to internal and external change with minimal conscious intervention from the human. In the autonomic computing vision, human administrators merely specify the computer system's high-level business goals or policies, and the software takes on the management task through self-management.
• Self-managed systems demand that we rethink most software engineering activities in order to push them into the runtime. This book is structured according to this observation. It seeks to provide the software engineering ideas that are required to understand and build autonomic systems.
References
1. Brooks, F.: No silver bullet: essence and accidents of software engineering. In: Kugler, H.J. (ed.) Information Processing 86, pp. 1069–1076. Elsevier, Amsterdam (1986). Reprinted in Computer 20(4), 10–19 (April 1987)
2. Lehman, M.M.: On understanding laws, evolution, and conservation in the large-program life cycle. J. Syst. Softw. 1, 213–221 (1980)
3. Sommerville, I.: Software Engineering, 9th edn. Addison Wesley, Boston (2010)
4. Ghezzi, C., Jazayeri, M., Mandrioli, D.: Fundamentals of Software Engineering. Prentice Hall,
Englewood Cliffs (1991)
5. Benington, H.D.: Production of large computer programs. In: Proceedings of the 9th International Conference on Software Engineering (ICSE), Monterey, CA, USA, pp. 299–310. IEEE Computer Society Press, Los Alamitos (1987)
6. Carzaniga, A., Fuggetta, A., Hall, R.S., van der Hoek, A., Heimbigner, D., Wolf, A.L.: A Characterization Framework for Software Deployment Technologies. Technical Report CU-CS-857-98, Department of Computer Science, University of Colorado. http://serl.cs.colorado.edu/~carzanig/papers/CU-CS-857-98.pdf, April 1998
7. Lientz, B.P., Swanson, E.B.: Software Maintenance Management: A Study of the Maintenance
of Computer Application Software in 487 Data Processing Organizations. Addison-Wesley,
Reading (1980)
8. Baresi, L., Ghezzi, C.: The disappearing boundary between development-time and run-time. In: FSE-18, Santa Fe, New Mexico, USA, 7–11 Nov 2010 (2010)
9. Patterson, D.A.: A simple way to estimate the cost of downtime. In: Proceedings of the 16th Systems Administration Conference (LISA), pp. 185–188. http://roc.cs.berkeley.edu/papers/Cost_Downtime_LISA.pdf (2002)
10. Patterson, D.A.: Availability and maintainability performance: new focus for a new century.
In: Key Note at Conference on File and Storage Technologies (FAST), vol. 2, Monterey,
CA (2002)
11. Weiser, M.: The computer for the 21st century. Sci. Am. 265(3), 66–75 (1991)
12. Horn, P.: Autonomic Computing: IBM's Perspective on the State of Information Technology. IBM. http://www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf (2001)
13. Boehm, B.: A view of 20th and 21st century software engineering. In: ICSE 2006: Proceedings of the 28th International Conference on Software Engineering, pp. 12–29. ACM, New York (2006)
Autonomic Systems
The purpose of this chapter is to define the notion of autonomic systems and to
introduce related terminology. It discusses the main ideas that are essential to any
autonomic computing system, including the concepts of goal, context and self-*
capabilities.
The chapter also presents the initial motivations behind the autonomic computing
initiative. It subsequently discusses the relevance of these motivations in light of
both research and real-world implementations since this initiative was launched in
the early days of the millennium.
We highlight the most important benefits that autonomic computing promises
to bring to the IT domain as well as the equally important challenges that must
be overcome before computer systems can be endowed with autonomic management capabilities. An incremental approach to autonomic computing is presented
in this context, proposing a five-step roadmap for progressively transforming
current IT systems from their current (non-autonomic) status to full autonomic
management support.
Finally, the chapter aims to position the relatively new autonomic computing
initiative with respect to similar technological fields, supported by industry, governments or academia, as well as with respect to existing computing domains. Further
relevant fields are discussed in the following chapter highlighting the inspiration
that autonomic computing has and can draw from existing domains.
2.1 Autonomic Computing

2.1.1 Definitions

Fig. 2.1 Overview of an autonomic system: the administrator provides goals and receives feedback; the autonomic system applies system updates and receives operational feedback (state) from its computing environment
Complexity also comes from the frequent runtime changes that affect both
business requirements and system implementation. Unpredictable change during
system execution is practically guaranteed to occur over the entire lifetime of any
complex system that interacts with our dynamic world. Change can be either intentional and carefully planned or unintentional (due, for instance, to external context
modification or internal failure). Swift interventions are required to ensure
correct, efficient and uninterrupted system execution. In the context of complex
systems, ensuring such timely interventions raises massive administrative challenges, incurring significant costs and risks. In these circumstances, the prime goal
of autonomic computing is to enable computing systems to autonomously deal
with (unpredictable) change, so as to fulfil the objectives they were constructed for.
Many administrative tasks are automated and carried out by autonomic processes
rather than by manual intervention.
In general, the term autonomic1 implies occurring involuntarily, unconsciously
or automatically, or resulting spontaneously from internal causes (e.g. autonomic
reflexes). The term autonomous, originating from the Ancient Greek autonomos
(from auto, 'self', and nomos, 'law'), signifies one's capability of self-governance,
of defining one's own law, also implying self-containment and self-direction. In
the context of biology, autonomic implies being a part of, related to, or controlled
by the autonomic nervous system (ANS). Accordingly, autonomicity signifies the
state of being autonomic. Finally, in the context of philosophy, the terms autonomy
[1]2 or autonomous have been used to signify one's ability to take one's own decisions, imposing one's free will and being independent of external control. In Kantian
philosophy, autonomy is also considered in relation to moral responsibility.
Automating a system's management function implies adding further system
complexity overall. Hence, paradoxically, dealing with existing system complexity
compels us to exacerbate this complexity. To escape from this apparent paradox, it
is important to note that the purpose of the autonomic computing paradigm is to
decrease system complexity as perceived by external administrators at the cost of
additional development being required to establish such a system. From this perspective, an autonomic computing system will absorb the complexity of commonly
manual administrative tasks and leave simplified, intuitive and high-level interfaces
usable by human system administrators. This approach will indeed increase internal
system complexity overall, but will do so with the added advantage of minimising
the system complexity perceived by administrators and users.
1. Definitions based on combined, adapted input from Merriam-Webster's online dictionary, 11th edition (http://www.merriam-webster.com), the American Heritage Dictionary of the English Language and Oxford Dictionaries (http://oxforddictionaries.com).
2. While a discussion on such matters would be well outside the scope of the current publication, it could raise useful considerations regarding the purposes and limitations of the autonomous systems we are going to build.
Fig. 2.2 An autonomic element and its surrounding contexts: ICT, business process, business environment and the world

2.1.2 Goals
Fig. 2.3 Partial goal decomposition for a pipe and filter system: Increase efficiency is refined (OR) into subgoals including Increase throughput, itself refined (OR) into Maximise threads or Duplicate filters
They may originate directly or indirectly from business process models that have
obvious links to the business environment in terms of service expectations and to the
world at large. For example, where a Service-Level Agreement (SLA) exists
between software providers and clients, it is usually refined into a set of Service-Level Objectives (SLOs) that can be more easily measured and checked during runtime. Hence, there is a well-defined set of required states, a time period over
which those states must hold and clear metrics that are specified.
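As a toy illustration (all numbers below are invented), a "five nines" availability objective over a 30-day window tolerates only about 26 seconds of downtime, so checking such an SLO at runtime reduces to a simple comparison:

// Toy SLO check: 99.999 % availability over a 30-day measurement window.
class SloCheck {
    public static void main(String[] args) {
        double windowMinutes = 30 * 24 * 60;           // 43,200 minutes
        double downtimeMinutes = 0.3;                  // measured (hypothetical) downtime
        double availability = 1 - downtimeMinutes / windowMinutes;
        System.out.printf("availability = %.5f%% -> %s%n",
                availability * 100,
                availability >= 0.99999 ? "SLO met" : "SLO violated");
    }
}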
Goal-based expressions can be highly complex. In fact, they express the administrators' business expertise at a high level of abstraction. This expertise guides administrators in their first steps when solving a problem. Figure 2.3 provides a partial example of such goal decomposition for a classic pipe-and-filter software system, where data are successively transformed by filters interconnected by pipes. We can see, for instance, that the goal 'Increase throughput' can be refined into either the subgoal 'Maximise threads' or 'Duplicate filters'. These represent the available choices that will affect throughput: increasing the number of threads dealing with incoming data in selected filters, duplicating some selected filters and rearranging the data flows, or implementing both actions. In all cases, parallelism is utilised to improve the performance of the pipe-and-filter system. Otherwise, administrators have to decrease the data period, as indicated in the figure, in order to avoid unsafe and inefficient pipe overflows due to memory or buffer limitations, for instance.
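To make such a decomposition concrete, the following sketch models the goal hierarchy of Fig. 2.3 as a small tree of alternative (OR) refinements. It is written in Python purely for illustration; the class and method names (Goal, leaves) are ours and do not come from any autonomic computing framework.

# Minimal sketch of an OR-refinement goal tree, mirroring Fig. 2.3.
# All class and goal names are illustrative, not a standard API.

class Goal:
    def __init__(self, name, subgoals=None):
        self.name = name
        self.subgoals = subgoals or []   # alternative (OR) refinements

    def leaves(self):
        """Return the concrete actions this goal can be refined into."""
        if not self.subgoals:
            return [self.name]
        actions = []
        for sub in self.subgoals:
            actions.extend(sub.leaves())
        return actions

# Partial decomposition from the pipe-and-filter example
increase_throughput = Goal("Increase throughput", [
    Goal("Maximise threads"),      # more threads per selected filter
    Goal("Duplicate filters"),     # replicate filters, rearrange flows
])
increase_efficiency = Goal("Increase efficiency", [
    increase_throughput,
    Goal("Decrease data period"),  # avoid pipe overflows
])

print(increase_efficiency.leaves())
# ['Maximise threads', 'Duplicate filters', 'Decrease data period']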
In simple autonomic systems, an administrator can globally specify abstract
goals that can be at different depths in the goal decomposition hierarchy, as depicted
in Fig. 2.3. However, the situation becomes significantly more complicated when
dealing with autonomic systems comprised of several autonomic elements (this
decomposition is discussed in more detail in Chap. 4). That is, the management
responsibility and function is distributed. Generally, in such cases, goals have to be
refined into different subgoals depending on the targeted autonomic elements. If we
return to the pipe and filter example, we can imagine that each filter is an autonomic
element. Then, each one may receive a different subgoal: one may have to create
new threads and carry out related administration actions, while others may have to
duplicate themselves and, again, perform the necessary adjustments that this action
entails.
If sufficiently skilled, the administrator may decide on the subgoals for each
autonomic element. Otherwise, it is up to the autonomic system itself to decide on
the subgoal distribution starting from the overall goal specified. Several solutions
may be envisaged. Centralised solutions come down to the creation of a super autonomic element that is able to decide on the goals of the other autonomic elements.
More decentralised solutions demand some sort of collaboration among autonomic
elements that consider themselves as peers. Centralisation, of course, is easier to
implement and to control, but it does not scale. Decentralisation, on the other hand,
scales up more easily but is difficult to design, follow and control, even in some of
the simplest cases.
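As an illustration of the centralised option, the sketch below shows a 'super' element mapping an overall goal onto per-filter subgoals. The alternation policy and the element names are invented for illustration; a real coordinator would decide based on monitored state.

# Sketch of centralised subgoal distribution: a coordinator ("super"
# autonomic element) assigns a subgoal to each managed filter element.
# The policy below is deliberately naive and purely illustrative.

def distribute(overall_goal, filters):
    """Map an overall goal onto per-filter subgoals."""
    assignments = {}
    for i, f in enumerate(filters):
        if overall_goal == "Increase throughput":
            # e.g. alternate between the two refinements of the goal
            assignments[f] = "Maximise threads" if i % 2 == 0 else "Duplicate filters"
        else:
            assignments[f] = overall_goal  # pass the goal through unchanged
    return assignments

print(distribute("Increase throughput", ["filter-A", "filter-B", "filter-C"]))
# {'filter-A': 'Maximise threads', 'filter-B': 'Duplicate filters',
#  'filter-C': 'Maximise threads'}

A decentralised variant would instead let each filter negotiate its subgoal with its peers, trading the single point of decision (and of failure) for coordination overhead.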
Finally, administrators can express further directives, including constraints and strategy specifications. Constraints are invariants that have to be maintained throughout all system evolutions. Strategies are specific directives set by the operator, often via a domain-specific language; they influence the way in which a goal is to be achieved by an autonomic element. More generally, administrators can even change the directives and knowledge employed by an autonomic element, as necessary, over the system's lifetime.
2.1.3 Context
The notion of context has generated much debate, and there are still many schools of thought examining this notion and its relationship with computing. One school of thought approaches the subject with an open definition of context awareness: simply the understanding of what the current situation is. Other schools of thought believe that context should be more formally defined as the situation outside of an autonomic system's management function. Yet others think that the internal states of the autonomic system are also context.
In this book, we assume a notion of context that refers to anything that is relevant to an autonomic system while remaining external to its range of action. In other words, an autonomic system's context represents the part of its execution environment that is significant to its management process. When an autonomic system wishes to adapt so as to fulfil a goal, it will typically require an understanding of both the external world (the context) and its internal self (its state); see Fig. 2.1.
Context can be defined as any information that can be used to characterise the situation of an entity that is relevant to an autonomic system [2, 4]. Such an entity can be a person, an object or a program. An autonomic system uses its understanding of an entity's state to make decisions and achieve its self-management goals. Obviously, the notion of context is very much application specific. Depending on the autonomic system at hand, different entities and different characterisations of these entities will be used. For example, an autonomic database management system will not necessarily use the same contextual information as an autonomic cellular phone.
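Following this definition, contextual information can be pictured as characterisations attached to relevant entities, as in the hypothetical sketch below; the entity kinds and attributes are application-specific examples of our own, not a prescribed model.

# Illustrative representation of context as characterised entities.
# Entity kinds and attributes are examples only; real systems define their own.

context = {
    ("person", "alice"):   {"location": "room-12", "activity": "meeting"},
    ("object", "printer"): {"status": "out-of-paper", "queue_length": 4},
    ("program", "db"):     {"load": 0.87, "replicas": 3},
}

def characterise(kind, name):
    """Return what the autonomic system currently knows about an entity."""
    return context.get((kind, name), {})

print(characterise("object", "printer"))
# {'status': 'out-of-paper', 'queue_length': 4}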
Within the subject of context, we now make the distinction between computing
context and usage context. The computing context contains the computing resources
that an autonomic system may use to its advantage and those that in some way can
impact the autonomic system's goal achievement. In the former case, the autonomic
system can request execution of the resource (e.g. via service calls). Such context
resources can include any computing entities that can be used to get information,
perform calculations or contribute to the achievement of the autonomic system's
goals. It can be, for example, a server, a sensor, an app store, a component repository, a network, a legacy application holding important business knowledge, a cloud
infrastructure offering on-demand computing power and so on. In the latter case, the
autonomic system may not have the ability to call the resource directly or to exercise
any explicit influence over it (e.g. through negotiation). Nevertheless, the activity of
an autonomic management process affects the administered system and this in turn
can have an impact on other external systems without accessing them directly.
In cases where an autonomic system consists of several autonomic elements, the
state of a certain autonomic element can be considered as part of the internal state
with respect to the global system and at the same time as part of the context with
respect to another autonomic element in the same system.
The usage context refers to the persons or external systems that interact with the autonomic system in question, to the way in which they interact with the system and to the places where the interactions take place. Human-related information is difficult to capture and to express. It is strongly dependent on the context sensors that measure the environment. Such measurements can be obtained via virtual sensing (derived from past usage of an artefact to elicit preferences, for example) or via actual sensing (using a device or service such as the user's current GPS location, for example). As such, contextual information can only be made available through the programming interfaces provided by these sensors. With the proliferation of smart devices, more and more context information sources become available. For instance, sensors can be embedded into physical spaces and provide information about a room's brightness, temperature and the movement of people therein.
Contexts can be characterised by a wide range of properties. Some contexts may be fully observable. This means that an autonomic system can obtain any piece of information at any time. Conversely, some contexts are only partially observable. This means that some information cannot be obtained or that it cannot be obtained on demand. For example, pervasive contexts are generally partially observable. This is due, in particular, to volatile components or nonresponsive devices that can be faulty or simply out of battery power. An autonomic system has to be aware of such situations and behave accordingly. For instance, an autonomic system may have to build and maintain a representation of the world in a best-effort manner.
From a different classification perspective, a context can be deterministic or stochastic. This is especially interesting when the autonomic system attempts to influence its context, indirectly, as it has no direct control access to it. For example, in a
smart home application, the autonomic systems objective may be to control the
30
Autonomic Systems
ambient temperature, even if it only has access to devices such as thermostats and
windows. In a deterministic context, an autonomic system knows perfectly the
effects of its actions given the context. The future state of the context is determined
by its current state and by the actions performed upon it. In a stochastic environment, the future state depends, of course, on the actions performed on it but also on
some unknown factors that cannot easily be predicted. Pervasive contexts can be
seen as stochastic contexts, since their computing states change regularly due to
unpredictable human-related actions, such as introducing a new smart phone into a
room, or due to physical evolutions, such as sensor failure. However, even in what would be considered a more deterministic system, there may be less predictable elements; for example, an operating system has to cope with random key presses when the user types on the keyboard or when data arrives from the Internet.
Finally, a context is also characterised by its dynamicity. Namely, a context is
said to be static when it does not change while the autonomic system is analysing
the situation. Conversely, it is said to be dynamic if it can change during the
autonomic system's reasoning (thinking) time. Pervasive contexts are, by their
embedded nature, dynamic, and an autonomic system therein is governed by this
dynamism.
The aforementioned context properties also hold for an autonomic system's internal state. Most of the time, an autonomic system has to perform in rather unstable conditions, since its external context cannot be directly controlled and its internal structures and behaviours are also shifting.
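The three axes just discussed (observability, determinism and dynamicity) can be summarised in a simple descriptor, as sketched below; the field names and the example profiles are illustrative assumptions of our own.

# Sketch: classifying a context along the three axes discussed above.
# Field names and example values are illustrative only.

from dataclasses import dataclass

@dataclass
class ContextProfile:
    fully_observable: bool   # can any information be obtained on demand?
    deterministic: bool      # do actions have fully predictable effects?
    static: bool             # does the context hold still while we reason?

# A pervasive context is typically the hardest combination:
pervasive = ContextProfile(fully_observable=False,
                           deterministic=False,
                           static=False)

# A batch data-centre scheduler may face a much tamer context:
batch_cluster = ContextProfile(fully_observable=True,
                               deterministic=True,
                               static=True)

print(pervasive, batch_cluster, sep="\n")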
2.2
Footnote: One could say that autonomic computing is a marriage of many subjects; it is therefore no surprise that many of the early proponents of the field at IBM originated in physics (e.g. Paul Horn), computer systems (David Chess) and agent-based computing (Jeffrey Kephart).
Had growth in telephone usage been sustained, by the 1980s the demand for human switchboard operators would have surpassed the available supply [6]:
Experts predicted that by 1980, every single woman in North America would have to work
as a telephone operator if growth in telephone usage continued at the same rate. (At that
time, all telephone operators were women).
In line with the general thinking of the time, IBM indicated that it was the IT
domain's turn to consider the automation of its management processes, as a
necessary step towards ensuring and sustaining its continuous, swift advancement.
In 2001, IBM pointed out that the IT domain was being increasingly challenged
by the complexity that ensued from its rapid and extensive development. As the
advantages of computing systems rendered them increasingly popular, the rate of
their development, integration and insertion into key societal domains consequently accelerated. At the same time, the management of such increasingly
complex computing systems remained a largely manual endeavour, leading to a
soaring demand for skilled and expensive system administrators. Consequently, in
the initial manifesto [5], IBM indicated that the growing complexity of the IT infrastructure 'threatens to undermine the very benefits information technology aims to provide'.
6. While decreasing numbers of farmers could also be caused by factors other than technology, such as massive food imports, the US Department of Agriculture (USDA) provides data indicating clear increases in farming productivity throughout US history. For example, data available from the USDA's National Institute of Food and Agriculture (http://www.csrees.usda.gov/qlinks/extension.html) points out that producing 100 bushels of corn necessitated around 14 labour hours and 2 acres of land in 1945, under 3 labour hours and little over 1 acre in 1987, and less than 1 acre of land in 2002. This, and a discussion of bioengineered food, are well beyond the scope of this book.
7. Alfred North Whitehead (1861–1947), English mathematician and philosopher.
In short, the IT domain was being prompted to face the complexity brought about by its own success! IBM's prediction in 2001 was that within the decade to follow, the IT domain's demand for workers would reach as high as 200 million, comparable to the entire labour force of the United States. At the time this prediction was made, hundreds of thousands of IT jobs in the United States remained unfilled. The trend at the time indicated that the existing demand would further increase by 100 % over the following years, raising significant concerns about the capability of a human task force to keep society's computer systems running.
Certainly, one decade later, the situation seems less critical than IBM predicted
in 2001 [8]. Various factors have contributed to this development, including the
economic downturn of 2007–2009 [9], outsourcing and job delocalisation, and
reluctance of CEOs to increase enterprise spending by adopting new technology
and hiring IT staff (e.g. [10]). In a possibly vicious circle, the high risks and total
cost of ownership (TCO) associated with computing systems may discourage companies from renewing or extending their technological base. Limits on the numbers of
available systems administration experts may already play a part in preventing the
development of new, more ambitious IT applications. While a deep analysis of the
exact causes behind IT development trends is outside the scope of this book, we
discuss possible reasons in the following sections, drawing from official data on US
employment statistics over the last decade.
While exact employment data in the particular domain of IT system administration is difficult to pinpoint, current statistics and predictions related to the IT domain in general do not seem to indicate an extraordinary growth in job openings in this sector. For example, according to the Bureau of Labour Statistics (US Department of Labour) [11], the number of employees in computer occupations taken together reached around 3.1 million in 2006, then increased to around 3.4 million in 2010 (6.9 % growth). A further 22.1 % increase was estimated over the next decade, predicted to reach a total of about 4.2 million employees by 2020 [11]. This places the computer occupational group as the 6th fastest-growing occupational group (out of 22 groups).
A further refinement of this data [12] highlights the progression of occupations such as IT system administration. This refinement estimates that the number of employees in Database, Network and Computer System Administrator jobs was around 458,000 in 2010 and predicts an increase to 588,500 employees by 2020 (i.e. 28.5 % growth). Similarly, the number of computer support specialists was estimated at 607,100 in 2010 and predicted to reach 717,100 by 2020 (i.e. 18.1 % growth). While these numbers point to a need for a substantial system administration task force, they remain modest in comparison to the autonomic computing manifesto's initial prediction of 200 million required employees.
Based on this data, one may assume that the expansion of the IT domain may
have already been limited by the lack of system administration support. Yet,
existing data and predictions from the US Bureau of Labour Statistics indicate quite the contrary. From the perspective of business growth and revenue output, the information industrial sector is predicted to be the fastest growing of the major sectors. At a growth in real output of 4.7 % per year, the information sector is predicted to reach $1.9 trillion in real output by 2020 [13]. This is higher than the sector's 2.3 % growth rate from 2000 to 2010, when real output rose from $950.9 billion to nearly $1.2 trillion. More refined data indicates that the expected growth in the information sector is to be mostly driven by the software publishers, data processing, hosting and related services, and computer systems design and related services industries.
Industrial growth correlated with employment statistics seems to indicate that even though the IT industry is experiencing considerable and increasing growth and development, employment in the area is progressing at a somewhat slower pace than initially thought. According to the report in [13], 'While real output in the information sector is growing faster than the overall economy, employment in the sector is growing more slowly than the overall economy.' This may be due to increased productivity, which tends to accelerate output while slowing down employment.
Hence, with respect to system administration, it may be that increasingly automated management tools are already being introduced, subsequently limiting the demand for human intervention. At the same time, the situation may also be due to IT outsourcing overseas and/or to increasing workloads on the existing task force [10].
Nonetheless, the increasing complexity of computing systems is starting to surpass the capacity of human administrators to manage them. When introducing
the autonomic computing vision, IBM was mainly concerned with enterprise systems.
As the number of interconnected, heterogeneous components and layers involved in
such systems increases, a point will be reached where human administrators will no
longer be able to react rapidly enough to ensure continuous system availability,
safety and security. At that point, or ideally before, automation should be introduced
to help or replace such manual interventions.
As emphasised in the first chapter, computing system administration challenges
are by no means confined to the enterprise domain. The recent proliferation of ever
smaller and smarter electronic devices like smart phones, tablets, mini PCs and a
variety of sensor and actuator devices, combined with the introduction of wireless
communication and mobile software technologies, has brought about the construction of a large variety of pervasive and ubiquitous applications. These target applications such as smart buildings, home supervision and healthcare assistance, smart electrical grids and ad hoc social networks, to name but a few. These new
domains introduce additional complexity factors, including low device resources;
energy becoming an extra constraint to consider; significantly higher numbers of
constituent hardware and software elements; increased dynamicity as mobile
elements join, move about and leave the system; and so on.
Hence, the inherent complexity of such systems, combined with the lack of
technical computing expertise of many of their users, reinforces the need for
autonomic management solutions.
2.3
2.3.1
In its autonomic computing manifesto [5], IBM identifies eight key characteristics
to define an autonomic system:
1. To hold self-knowledge and consist of elements which possess system identity.
2. (Re-)configure in reaction to, potentially unpredictable, environmental changes.
3. Continuously strive to optimise functioning so as to reach predefined criteria.
4. Detect and recover from component failure so as to maintain global dependability.
5. Anticipate, detect and eschew various threats so as to maintain integrity and
security.
6. Acquire knowledge of the environment and behave in a context-sensitive
manner.
7. Implement open standards so as to be able to survive in a heterogeneous
ecosystem.
8. Hide complexity by bridging the gap between business goals and underlying IT
resources.
These general properties were subsequently summarised via four fundamental objectives or features (e.g. [5, 14–20] or [21]):
1. Self-configuration: the system sets and resets its internal parameters so as to
conform to initial deployment conditions and to adapt to dynamic environmental
changes, respectively.
2. Self-healing: the system detects, isolates and repairs failed components so as to
maximise its availability.
3. Self-optimisation: the system proactively strives to optimise its operation so as to
improve efficiency with respect to predefined goals.
4. Self-protection: the system anticipates, identifies and prevents various types of
threats in order to preserve its integrity and security.
To achieve these objectives, a system must feature several essential attributes and capabilities. Hence, objectives can be described as the broad system requirements (what to achieve), while attributes and capabilities are the key features for meeting those requirements (how to achieve it). Since the autonomic computing initiative was initially launched, numerous such attributes and capabilities have
been progressively identified by researchers in the area and categorised according to
various criteria or domain-specific preoccupations. This extended list of self-managing
(sometimes referred to as self-*) considerations forms an increasingly comprehensive
set of crucial and, in some cases, redundant autonomic system properties.
The four fundamental features of autonomic systems are further discussed in Sect.
2.3.2, while the more extensive self-* capabilities list is presented in Sect. 2.3.3.
2.3.2
The four self-* features considered fundamental for any autonomic system, and therefore most cited in the autonomic computing domain, are self-configuration, self-healing, self-optimisation and self-protection, also referred to in short as 'self-chop'. This section discusses these four fundamental features.
Self-configuration: an autonomic system configures and reconfigures itself in order
to adapt to various, possibly unpredictable conditions, so as to continuously meet a
set of business objectives. This allows system administrators to merely specify
high-level policies (what is desired) without having to worry about low-level
technical details (how to achieve it). As a relevant example, an autonomic system
would deploy and set itself up, based on predefined user objectives and current
platform resources. At runtime, the system would support the dynamic addition/
removal of servers to and from its infrastructure without requiring human intervention
and without disrupting its service. Self-configuration must not be concerned with
autonomic elements in isolation but with the integrated system as a whole. Similarly to
the way a new cell is integrated into a body, a new autonomic element must be able
to integrate itself into a systems infrastructure, and the existing system must be able
to adapt to the new element. From this perspective, self-configuration becomes an
important enabler for the other self-* objectives, such as self-optimisation, self-healing
and self-protection.
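As a minimal sketch of the server addition/removal scenario above, the code below reconciles a server pool with a high-level target size; the pool abstraction and its method names are ours, assumed purely for illustration.

# Sketch: self-configuration as reconciling a server pool with a
# high-level target, without manual intervention. Illustrative only.

class ServerPool:
    def __init__(self, target_size):
        self.target_size = target_size
        self.servers = []

    def reconcile(self):
        """Add or remove servers until the pool matches its target."""
        while len(self.servers) < self.target_size:
            self.servers.append(f"server-{len(self.servers)}")  # provision
        while len(self.servers) > self.target_size:
            self.servers.pop()                                   # decommission

pool = ServerPool(target_size=3)
pool.reconcile()
print(pool.servers)          # ['server-0', 'server-1', 'server-2']
pool.target_size = 2         # administrator changes only the high-level policy
pool.reconcile()
print(pool.servers)          # ['server-0', 'server-1']

Note that the administrator touches only the high-level target; the element itself performs the low-level steps.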
Self-healing: an autonomic system detects, diagnoses and recovers from routine or extraordinary problems while trying to minimise service disruption. Consequently, fault tolerance is an important aspect of self-healing behaviour. Moreover, a system may predict potential problems and take pre-emptive action to prevent their occurrence. The purpose of self-healing is to attain overall system resiliency and robustness by being able to deal with the failure of any of the system's constituent parts. Self-healing implies that the system must first be able to detect symptoms pointing to an existing or potential future problem, for example, a bottleneck or an unresponsive system element. Second, it must be able to determine a viable solution for avoiding or recovering from the problem. Discovering the root cause(s) behind detected or predicted problems (e.g. misconfigurations, bugs or failures in software or hardware elements) may help in selecting an appropriate repair solution, while involving more complicated analysis and planning procedures. Recovery methods may include finding alternative resource usage, downloading software updates, replacing failed hardware components, restarting failed elements or simply throwing an exception to notify a human administrator. Similarly to the way a damaged brain may use unharmed areas to re-implement lost functions, an autonomic computing system may dynamically integrate redundant or underutilised components to replace failed parts and maximise its availability. At the same time, it is important to ensure that the self-healing process does not inflict further system damage (e.g. by introducing new bugs).
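The detect/diagnose/recover cycle described above can be condensed into a loop such as the following sketch; the symptom checks, thresholds and recovery actions are placeholders of our own choosing.

# Sketch of a self-healing cycle: detect symptoms, choose a recovery
# action, escalate to a human if nothing applies. Illustrative names only.

def detect_symptoms(element):
    symptoms = []
    if not element["responsive"]:
        symptoms.append("unresponsive")
    if element["queue"] > 100:
        symptoms.append("bottleneck")
    return symptoms

RECOVERY = {                      # symptom -> recovery action
    "unresponsive": "restart element",
    "bottleneck": "activate redundant component",
}

def heal(element):
    for symptom in detect_symptoms(element):
        action = RECOVERY.get(symptom)
        if action:
            print(f"{element['name']}: {symptom} -> {action}")
        else:
            print(f"{element['name']}: {symptom} -> notify administrator")

heal({"name": "filter-B", "responsive": False, "queue": 180})
# filter-B: unresponsive -> restart element
# filter-B: bottleneck -> activate redundant component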
Self-optimisation: rather than settling for the status quo, an autonomic system always seeks ways and seizes opportunities to improve its operation with respect to predefined goals.
10. TCG, the Trusted Computing Group™ (http://www.trustedcomputing.org): a non-profit organisation formed to develop and promote open, vendor-neutral standards and frameworks for supporting trusted computing technology. The goal of trusted computing technology is to render computer systems safer and less prone to viruses, malware and unauthorised access.
2.3.3
Since the launch of the autonomic computing initiative in 2001, the list of self-* properties for autonomic systems has been substantially extended. It now consists of a set of interrelated properties that a system should possess in order to achieve various degrees of autonomicity. Most of the extended self-* properties are necessary for achieving the fundamental self-chop features (Sect. 2.3.2) and can therefore be subsumed into those four key objectives.
Some of the most important self-* properties identified so far are briefly highlighted as follows:
Self-*: a system's self-management properties in general.
Self-anticipating: a system's ability to predict future events or requirements, whether with respect to the system's internal behaviour or to its external context. An anticipating system should be able to manage itself proactively.
Self-adapting: a system's ability to modify itself (self-adjust) in reaction to changes in its execution context or external environment, in order to continue to meet its business objectives despite such changes.
Self-adjusting: a system's ability to modify itself during runtime, including modifications to its internal structure, configuration or behaviour.
Self-aware: a system's ability to 'know itself', to possess knowledge of its internal elements, their current status, history, capacity and connections to external elements or systems. A system may also possess knowledge of the possible actions it may perform (self-adjustment) and of their probable consequences (self-anticipating). Such knowledge is essential for achieving the self-chop objectives.
Self-chop: the four fundamental self-* properties: self-configuration, self-healing, self-optimisation and self-protection.
Self-configuring: a system's ability to (re-)configure itself, (re-)setting its internal parameter values so as to achieve high-level policies or business goals.
Self-critical (self-evaluation): a system's ability to determine whether or not its high-level goals are being attained.
Self-defining (communication perspective): a system's ability to describe itself to other systems. A system's description should represent a subset of the system's self-knowledge, as relevant to targeted systems. A description should contain both data and metadata (data describing that data). Conversely, an autonomic system may need to understand and interpret other systems' descriptions.
Self-defining (high-level policies or goals perspective): a system's ability to determine and modify its own objectives.
2.4

2.4.1

2.4.2
attained in the presence of continual changes in the system's constituent parts and interconnections.
From an architectural perspective, the autonomic computing challenge can be
split according to the targeted managed elements and their level of granularity [25].
Self-management challenges must be addressed at the level of both individual autonomic elements and of entire autonomic systems. At a fine granularity level, innovations are required for introducing autonomic management capabilities at the level of
specific managed elements. At a higher granularity level, achieving autonomicity
requires coordinated interaction among multiple autonomic elements.
Human-computer interaction (HCI) approaches must also evolve in response to
the conceptual shift in the way in which administrators and users should interact
with autonomic systems. Multiple, possibly conflicting business objectives and user
actions, at both local and global levels, will have to be taken into consideration in
tandem and translated into coherent technical parameters that can be managed via
the self-* processes.
2.4.3
When introducing the autonomic computing paradigm, IBM promoted an incremental approach for making the transition between existing computing systems and future autonomic systems. Right from the start, IBM realised that it would be unrealistic to attempt to revolutionise the IT domain by seeking to suddenly replace all existing systems with a new generation of autonomic systems. First, the IT community is not yet fully au fait with the ideas of autonomicity. Second, customers that have invested significantly in existing IT environments would be reluctant to completely replace them overnight, especially in the absence of solid reassurances regarding the value of new self-management systems.
In contrast, an evolutionary approach addresses both challenges, as self-management
can be progressively phased in and integrated bit by bit into the continuously evolving IT system. In this context, IBM proposes an Autonomic Computing Adoption
Model [15, 16, 19, 26] describing five incremental levels of system management
(Fig. 2.4):
Level 1, Basic (Manual): represents the starting point, where skilled administrators manually control each system computing element, setting it up, monitoring it and potentially modifying or replacing it. This is the level that IBM considered IT systems to be at when initially launching the autonomic computing initiative. At this level, system management is completely non-autonomic, or manual.
Level 2, Managed (Instrumented or Monitored): employs monitoring technologies to collect data from disparate managed resources and present it intelligently for both offline and online system management. This approach improves productivity and reduces the human administrator effort required to manually collect and synthesise data.
Fig. 2.4 (Reproduction of) IBM's Autonomic Computing Adoption Model. x-axis: increasing autonomic functions; y-axis: increasing scopes over which the autonomic functions can be applied; z-axis: service flows to which autonomic functions can be applied
2.5
2.5.1
2.5.2
As autonomic computing was proposed, related fields with similar needs began emerging in parallel. Research initiatives aiming to render computing systems self-managed
can be categorised with respect to two principal approaches.
First, top-down approaches aim to enhance the self-management capabilities
of existing systems by essentially introducing various forms of control loops that
can deal with targeted system resources. These approaches require administered
systems to be based on technologies that support runtime monitoring and modification (Chaps. 5 and 6). These include, for example, technologies that employ dynamic
component models. At the same time, beyond such technological requirements and
their non-negligible impact on system design and implementation, top-down
approaches impose no radical changes in the manner in which software systems
have been traditionally architected and developed.
44
Autonomic Systems
2.5.3 Similar Initiatives

2.5.3.1 Industry
In parallel to IBM's autonomic computing initiative, several major industrial actors in the IT domain were promoting similar initiatives aimed at enhancing management support for complex computing systems [27]. The aim of these initiatives was to provide enhanced support for managing large-scale, dynamic, networked infrastructures, most notably including clouds, grids and enterprise systems. Virtualisation for data centres was a core principle driving most of these initiatives. Examples include Compaq's Adaptive Infrastructure vision (2001) and HP's subsequent Utility Data Center (UDC) (2001–2004) and Converged Infrastructure (2009) initiatives, Sun's N1 technology (2002), Microsoft's Dynamic System
Initiative (DSI)14 (2003), Cisco's Data Center 3.0 strategy15 (2007) and VMware's Virtual Data Center Operating System (VDC-OS) paradigm (2008).
From a different domain perspective, in the ubiquitous system context, the European Commission Information Society Technologies Advisory Group (ISTAG) specified the ambient intelligence (AmI)16 vision (1999), emphasising more seamless, efficient and natural system support for human interaction. Finally, as a more generic, domain-agnostic vision, Intel's Proactive Computing [28] (2000) promoted the necessity of rendering computing systems more proactive, shifting the focus from traditional, human-centred computing to more autonomous, human-supervised computing.
2.5.3.2 Military
In addition to such industrial initiatives, several self-management research projects have been launched by DARPA17 for military applications. A first set of DARPA projects was launched starting in the late 1990s, enabling a new generation of self-forming, self-repairing, self-defending and heterogeneous networks18 to provide critical advantages in unpredictable, unstable and dangerous environments. These included the Small Unit Operations Situational Awareness System (SUO-SAS) program, the Future Combat Systems Communications (FCS-C) program, the Optical RF Combined Link Experiment (ORCLE) program and the Wireless Networks after Next (WNaN) program.
A further series of DARPA programmes was subsequently launched for addressing additional autonomy issues in battery-powered wireless systems, such as unattended ground sensor (UGS) networks. These included the Connectionless Networks
and the Wolfpack programs. The main goal of these programs was to develop techniques and technologies for enabling randomly deployed or mobile sensor devices
to form highly efficient, low-power radio networks. Solutions involved support for
forming ad hoc radio networks based on dynamically discovered neighbouring sensors, adapting sensor functioning modes to detected contexts so as to maximise
battery lifetime, reconfiguring network communication in response to predicted
transmission demands and collaborating within neighbouring sensor groups to
equilibrate loads, implement coordinated strategies or track moving targets. Both
initiatives involved individual sensor adaptations and collective collaborations,
requiring adaptations at both processing and networking levels to provide fully
autonomous sensor systems.
14. Microsoft Announces Dynamic Systems Initiative, March 2003: http://www.microsoft.com/en-us/news/press/2003/mar03/03-18dynamicsystemspr.aspx
15. http://www.networkworld.com/news/2007/072407-cisco-new-data-center.html
16. Introduction to Ambient Intelligence, ERCIM News 2001: http://www.ercim.eu/publication/Ercim_News/enw47/intro.html
17. DARPA: Defence Advanced Research Projects Agency, http://www.darpa.mil
18. Henry S. Kenyon, Networks: Adapting to Uncertainty, DARPA, http://www.darpa.mil/WorkArea/DownloadAsset.aspx?id=2570
19. DARPA's Assured Arctic Awareness (AAA) program: http://www.darpa.mil/NewsEvents/Releases/2012/03/16a.aspx
20. DARPA's announcement (April 2012) of the future Robotics Challenge program (to be launched in October 2012): http://www.darpa.mil/NewsEvents/Releases/2012/04/10.aspx
21. DARPA's Urban Challenge, held in November 2007 at the former George Air Force Base in Victorville, California, USA: http://archive.darpa.mil/grandchallenge
with high social interaction capabilities that enable them to self-organise into various
structures for achieving predefined goals.
As a concrete example of an ANTS application, the Prospecting Asteroid Mission (PAM)28 aims to analyse an asteroid belt in search of materials of astro-biological relevance. PAM plans to drive a carrier spaceship into deep space and have it self-assemble and launch 1,000 small exploration spacecraft ('picocraft') that are to travel through and analyse the asteroid belt. Each spacecraft belongs to one of ten specialist classes, which include processing specialists (leaders), communication specialists (messengers) and several instrument specialists for diverse measurement types (workers). Once launched, the spacecraft opportunistically self-organise into several sub-swarms, containing specialists from all classes, and simultaneously analyse different asteroids over the several-year belt traversal. Each sub-swarm can repeatedly search for, detect and navigate towards interesting asteroid targets; measure and create 3D models of analysed asteroids; and send adequate asteroid models to a centre on Earth.
2.5.3.4 Academia
In addition to industrial, military and space exploration initiatives for developing
self-managing computing systems, several similar initiatives have been launched
from within the academic community.
These research initiatives aim to render computing systems capable of adapting
to their dynamically changing environments, of self-configuring, self-healing,
self-optimising, self-protecting and possibly self-developing via self-organisation
and self-assembly processes, in order to reach predefined business objectives. While
sharing a common goal, each initiative promotes a slightly different paradigm,
focusing on different core principles, for addressing the system autonomicity
challenge. Providing a comprehensive description of all the existing initiatives and
their intricate interrelations could constitute the subject of an entirely different
book. Here, we merely aim to exemplify some of the most relevant programmes,
highlight their core principles and challenges and show how their advancement can
contribute to progress in the autonomic computing domain.
Organic computing29 (OC) is probably the initiative most similar to autonomic computing (AC). OC is based on a vision of future information-processing systems consisting of myriad autonomous devices, equipped with sensors and actuators, aware of their execution environments and organising themselves in order to provide various business services. In this context, the controllability of the emergent
28. NASA's ANTS Prospecting Asteroid Mission (PAM), expected timeframe 2020–2025: http://ants.gsfc.nasa.gov/pam.html
29. Organic computing (OC) initiative: http://www.organic-computing.de. The OC initiative was launched by a group of researchers from three German universities (Universität Hannover, Universität Karlsruhe and Universität Augsburg). In 2012 the initiative comprised more than 70 researchers from many institutions across Germany and other European and non-European countries. OC was initially funded by the German Research Foundation (DFG) as part of the priority programme 1183 'Organic Computing' (2004–2011).
2.6 Key Points
References
1. Christman, J.: Autonomy in moral and political philosophy. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy (Spring 2011 Edition). http://plato.stanford.edu/archives/spr2011/entries/autonomy-moral
2. Dey, A.K., Abowd, G.D.: Towards a better understanding of context and context-awareness. In: CHI 2000 Workshop on the What, Who, Where, When, and How of Context-Awareness, The Hague (2000)
3. Truszkowski, W., et al.: Autonomous and autonomic systems: a paradigm for future space exploration missions. IEEE Trans. Syst. Man Cybern. Part C 36(3), 279–291 (2006)
4. McCann, J.A., Huebscher, M., Hoskins, A.: Context as autonomic intelligence in a ubiquitous computing environment. Int. J. Internet Protoc. Technol. (IJIPT), special edition on Autonomic Computing, Inderscience (2006)
5. Horn, P.: Autonomic Computing: IBM's Perspective on the State of Information Technology. IBM T.J. Watson Labs, New York. http://www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf, October 2001
6. FitzGerald, J., Dennis, A.: Chapter 1: Introduction to data communications: a brief history of communications in North America. In: Business Data Communications and Networking, 10th edn, pp. 5–7. Wiley (2009). ISBN 978-047005575-5
7. U.S. Dept. of Agriculture, Economic Research Service: A History of American Agriculture 1776–1990. Washington, DC (1993). Summaries are also available online as teaching material, such as from the Library of Congress: http://www.loc.gov/teachers/classroommaterials/connections/hist-am-west/history.html
8. Dobson, S., Sterritt, R., Nixon, P., Hinchey, M.: Fulfilling the vision of autonomic computing. Cover feature. IEEE Comput. Soc. 43(1), 35–41 (2010)
9. Sum, A., Khatiwada, I.: The Nation's underemployed in the Great Recession of 2007–09. Monthly Labor Review, Nov 2010. http://www.bls.gov/opub/mlr/2010/11/art1full.pdf
10. Schwartz, E.: Bureau of Labor Statistics reports big drop in tech jobs: almost 50,000 IT positions lost in last 12 months. InfoWorld, 6 Aug 2008. http://www.infoworld.com/d/adventures-in-it/bureau-labor-statistics-reports-big-drop-in-tech-jobs-863
11. Lockard, C.B., Wolf, M.: Employment outlook: 2010–2020. Occupational employment projections to 2020. Bureau of Labor Statistics, Monthly Labor Review, Jan 2012. http://www.bls.gov/opub/mlr/2012/01/art5full.pdf
12. U.S. Bureau of Labour Statistics: Employment by Occupation. Employment Projections, 1 Feb 2012. http://www.bls.gov/emp/ep_table_102.htm
13. Henderson, R.: Employment outlook: 2010–2020. Industry employment and output projections to 2020. Bureau of Labor Statistics, Monthly Labor Review, Jan 2012. http://www.bls.gov/opub/mlr/2012/01/art4full.pdf
14. Kephart, J.O., Chess, D.M.: The vision of autonomic computing. IEEE Comput. 36, 41–50 (2003)
15. Ganek, A.G., Corbi, T.A.: The dawning of the autonomic computing era. IBM Syst. J. 42(1), 5–18 (2003)
16. IBM: An Architectural Blueprint for Autonomic Computing, 3rd edn. IBM Whitepaper, June 2005
17. Parashar, M., Hariri, S.: Autonomic computing: an overview. In: Proceedings of the 2004 International Conference on Unconventional Programming Paradigms, pp. 257–269. Springer, Berlin (2005)
18. White, S.R., Hanson, J.E., Whalley, I., Chess, D.M., Kephart, J.O.: An architectural approach to autonomic computing. In: Proceedings of the First International Conference on Autonomic Computing, 17–19 May 2004. IEEE Computer Society, New York (2004)
19. Huebscher, M.C., McCann, J.A.: A survey of autonomic computing: degrees, models, and applications. ACM Comput. Surv. (CSUR) 40(3) (2008). ISSN: 0360-0300
20. De Wolf, T., Holvoet, T.: A taxonomy for self-* properties in decentralised autonomic computing. In: Parashar, M., Hariri, S. (eds.) Autonomic Computing: Concepts, Infrastructure, and Applications, pp. 101–120. CRC Press/Taylor & Francis Group (2007)
21. Hinchey, M.G., Sterritt, R.: Self-managing software. Computer 39(2), 107–109 (2006)
22. Dijkstra, E.W.: Self-stabilizing systems in spite of distributed control. Commun. ACM 17(11), 643–644 (1974). doi:10.1145/361179.361202
23. Smirnov, M.: Autonomic Communication: Research Agenda for a New Communications Paradigm. Technical report, Fraunhofer FOKUS (2004)
24. Dobson, S., et al.: A survey of autonomic communications. ACM Trans. Auton. Adapt. Syst. 1(2), 223–259 (2006)
25. Kephart, J.O.: Research challenges of autonomic computing. In: ACM International Conference on Software Engineering (ICSE 2005), pp. 15–21, St. Louis, MO, USA, May 2005
26. Miller, B.: The Autonomic Computing Edge: The Role of the Human in Autonomic Systems. IBM developerWorks, Nov 2005. http://www.ibm.com/developerworks/library/ac-edge7
27. Murch, R.: Autonomic Computing. IBM Press/Prentice Hall, Englewood Cliffs, New Jersey (2004) (Chapter 14, Other Vendors)
28. Tennenhouse, D.: Proactive computing. Commun. ACM 43(5), 43–50 (2000). doi:10.1145/332833.332837
29. Sterritt, R., Hinchey, M.: SPAACE IV: Self-* properties for an autonomous & autonomic computing environment, Part IV: A Newish Hope. In: 7th IEEE International Conference and Workshops on Engineering of Autonomic and Autonomous Systems (EASe 2010), 22–26 Mar 2010, University of Oxford, England
30. Riedel, J., Bhaskaran, S., Desai, S., Han, D., Kennedy, B., McElrath, T., Null, G., Ryne, M., Synnott, S., Wang, T., Werner, R.: Using autonomous navigation for interplanetary missions: the validation of Deep Space 1 AutoNav. In: International Conference on Low-Cost Planetary Missions, Laurel, MD, USA, May 2000. http://hdl.handle.net/2014/14133
31. Bajracharya, M., Maimone, M.W., Helmick, D.: Autonomy for Mars rovers: past, present, and future. IEEE Comput. 41(12), 44–50 (2008)
32. Schmeck, H., Müller-Schloer, C., Çakar, E., Mnif, M., Richter, U.: Adaptivity and self-organisation in organic computing systems. In: Müller-Schloer, C., Schmeck, H., Ungerer, T. (eds.) Organic Computing: A Paradigm Shift for Complex Systems, pp. 5–37. Springer, Basel (2011). e-ISBN 978-3-0348-0130-0, ISBN 978-3-0348-0129-4
33. Abelson, H., Allen, D., Coore, D., Hanson, C., Homsy, G., Knight Jr., T.F., Nagpal, R., Rauch, E., Sussman, G.J., Weiss, R.: Amorphous computing. Commun. ACM 43(5), 74–82 (2000). doi:10.1145/332833.332842
34. Doursat, R., Sayama, H., Michel, O.: Morphogenetic engineering: Toward Programmable Complex Systems. Series: Understanding Complex Systems. Springer, Berlin/Heidelberg (2012). ISBN 978-3-642-33901-1
[Fig. 3.1 Sources of influence and inspiration for autonomic computing. Labels: Economics, Psychology, Sociology, Chemistry, Physics, Games, Complexity, AI, Control Systems, Robotics, Multi-Agent, Pervasive, Artificial life, Software Eng., Computing Systems, Autonomic Computing]
3.1 Overview of Influences

3.1.1 Introduction
In the first two chapters of this book, we saw that software complexity and its related costs are the main motivations behind autonomic computing. We also introduced the fact that many initiatives with similar goals preceded it. Indeed, autonomic computing can be seen as an attempt at consolidating different domains [1] that have until now remained mostly isolated. In this chapter, we take a closer look at the influences behind the autonomic computing initiative and investigate some of the existing domains that can support the development of autonomic computing systems.
As indicated in Fig. 3.1, we may distinguish two important realms of influence:
the study of complex natural systems and the development of complex computing
systems. Computing systems come from diverse software and hardware domains
ranging from traditional automation systems to smart systems based on artificial
intelligence techniques. They provide algorithms, architectures, models and techniques that can be directly reused in autonomic computing. In turn, natural systems
are not software; their study can include domains as diverse as economics, biology,
chemistry or physics, but they can also provide significant contributions to autonomic computing in terms of theories and models.
systems. Namely, game theoretical principles can be employed for enabling multiple
autonomic decision-makers to collaborate or to compete, in order to achieve common optimisations or overcome conflicting objectives, respectively. Based on such
principles, system designers can determine the necessary rules to be set in place so
as to ensure that the global system state will converge towards an equilibrium point
that meets the system's business objectives. Examples of interesting game-theoretical contributions include the Nash equilibrium, Conjectural equilibrium, Best-Response strategies and the Stackelberg leadership model, which we now briefly discuss.
The Nash equilibrium1 is defined as a set of strategies, one for each decision-maker, where none of the participants can benefit from unilaterally changing their strategy. In other words, none of the players has an incentive to change its strategy provided that all other players maintain their strategies. This theory applies well to non-cooperative games, where each player attempts to maximise its own benefit, possibly to the detriment of the group benefit. In the context of autonomic computing, a Nash equilibrium would mean, for example, that each application takes the best decision it can while considering the decisions of all the other applications. However, as indicated above, it may happen that the established Nash equilibrium (a local optimum) is not a Pareto optimum2 (a global optimum). Ensuring that a multiplayer system operates at the Pareto boundary generally requires some sort of cooperation among the players. Also, adopting a Nash equilibrium strategy requires that players are fully rational and hold perfect knowledge of each other's strategies. Alternative variants have been developed for situations where such requirements are impractical or impossible to attain. Notably, the Stackelberg equilibrium strategy can be applied when one particular player holds private knowledge of all its competitors and accordingly optimises its responses. Interestingly, this strategy has been shown to improve the performance of all participating players even if they continue to behave myopically. For cases where players cannot obtain perfect knowledge of each other's strategies, Conjectural equilibrium has been proposed to replace knowledge with beliefs, which can be obtained through repeated player interactions with their environments [4].
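A minimal worked example, in our own notation: suppose two applications, A and B, each choose between a greedy (G) and a fair (F) use of a shared resource, with the illustrative payoff pairs (A's utility, B's utility) below.

% Two applications A and B each pick Greedy (G) or Fair (F).
% Payoff pairs are (A's utility, B's utility); values are illustrative.
\[
\begin{array}{c|cc}
 & B:\,G & B:\,F \\ \hline
A:\,G & (1,1) & (4,0) \\
A:\,F & (0,4) & (3,3) \\
\end{array}
\]

Here (G, G) is the unique Nash equilibrium: from (1, 1), a unilateral switch to F drops the deviator's payoff from 1 to 0. Yet (F, F), with payoffs (3, 3), is Pareto superior; reaching it requires exactly the kind of cooperation among players noted above.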
Considering the management of system complexity in general brings to the fore
the necessity for understanding fundamental complexity and systemic principles.
More precisely, to manage complexity one would need to first comprehend the key
characteristics, inner workings and resulting behaviours of the targeted (complex)
systems and to understand where the complexity comes from, how it manifests itself
in the system and how it can be influenced or controlled. Hence, from a more theoretical perspective, autonomic computing may also benefit from the developments
of general fields that have studied such issues, including systems theory, complex
1. Nash equilibrium: named after mathematician John Forbes Nash (born 1928), who invented the theory and received the Nobel Prize in Economic Sciences in 1994.
2. Pareto optimality (or efficiency): named after economist Vilfredo Pareto (1848–1923), who employed the concept in the context of economic systems. In a Pareto-efficient allocation, no individual can be made better off without rendering at least one other individual worse off. In this state, no Pareto improvements can be made.
systems and the related subfield of complex adaptive systems (CAS) [5–7]. Notably, the many research areas related to the complex (adaptive) systems domain can provide useful theories, models, algorithms and techniques for understanding and building complex autonomic computing systems. These comprise studies covering networked systems [8, 9], nonlinear dynamics and chaos theory [10], spontaneous synchronisation [11] and, finally, self-organisation, emergence, autopoiesis, adaptation and evolution [7, 12, 13]. Similarly, the field of cybernetics may provide insightful theories and studies for understanding complex, self-regulating systems [14–16]. Most existing theories in these domains have been based on, as well as applied to, scientific fields including biology, sociology, economics, physics, chemistry and engineering. Autonomic computing represents another challenging domain where such studies can find useful applications.
3.2 Biology
3.2.1 Overview
The term autonomic refers to the autonomic nervous system (ANS), which is
responsible for regulating vital functions. One might think that the complexity of
the human body, given the variety of its parts, could make it impossible to coordinate all the necessary actions to keep the human system in a stable state, which is
essential to its survival. However, in spite of this complexity, living organisms are
prime examples of the power of adaptation to new environments that has yet to be
equalled. These capabilities are possible thanks to internal coordination. It is not
possible for a unique central organ to reign without sharing: two distinct parts must agree with each other in order to control the organism and ensure its survival. The brain orders conscious, purposeful acts, while the autonomic nervous system (ANS) controls subconscious activities that are beyond wilful control. Both the
brain and the ANS are part of the human nervous system (NS).
The ANS's main purpose is to ensure homeostasis: a system's ability to maintain internal equilibrium despite changes in its external environment and its internal state. The ANS uses external and internal sensory information to regulate the activity of internal organs so as to maintain a set of vital parameters (body temperature, oxygen levels or glucose concentration) within a survivability zone.
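In computing terms, this regulation is a negative feedback loop. The sketch below keeps one 'vital parameter' inside a survivability zone; the zone bounds and corrective actions are invented for illustration.

# Sketch: homeostasis as a negative feedback loop keeping one vital
# parameter (e.g. body temperature) inside a survivability zone.
# All values and action names are invented.

LOW, HIGH = 36.0, 38.0            # survivability zone

def regulate(value):
    """Return a corrective action pushing the parameter back into range."""
    if value < LOW:
        return "increase heat production"   # e.g. shivering
    if value > HIGH:
        return "increase heat dissipation"  # e.g. sweating
    return "no action"

for reading in (35.2, 37.1, 39.0):
    print(reading, "->", regulate(reading))
# 35.2 -> increase heat production
# 37.1 -> no action
# 39.0 -> increase heat dissipation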
The ANS extends from the brain to the spinal cord via the brain stem, has branches to every gland and organ of the human body, and is composed of several subsystems that interact to ensure the functioning of our bodies without our awareness. It is thus possible to carry out intellectual activities without worrying about bodily functions. It is important to emphasise that these systems are interdependent and that the actions of one may affect the other. For example, the brain, and therefore conscious action, can impact the ANS's unconscious behaviour. For instance, a scary thought, which causes stress, may lead to changes in heart rate and increased sweating to enable one to stay and fight or run away.
Within the biological realm, the ultimate goal of an organism's autonomic nervous system is to ensure the organism's survival. Similarly, in the IT domain, the goal of an autonomic computing system is to ensure the continual provisioning of the system's functional services and associated quality of service (QoS) properties in the presence of external and internal changes. In this scenario, the role of a human administrator seems to be equivalent to that of the conscious brain. That is, human administrators merely specify a computing system's high-level objectives and then only intervene in case of system failure or for changing system objectives and behaviour. To do this, system administrators are provided with an overview and summary of the environment in order to make decisions, but a large part of the burden of administration is in turn managed by a variable number of autonomous
subordinates. They will have a very localised and detailed view of the situation of
a resource or a particular resource pool. All in all, the sum of their actions defines
the overall system behaviour.
and output PNS paths (forming reflex arcs) to complex data-processing tasks in the
brain (like pattern recognition, planning or learning). In addition to data communication, the PNS can also transform transmitted signals by amplifying their strengths
or by filtering them out at various levels.
As illustrated in Fig. 3.3, the PNS can be further broken down into several subsystems,7 in particular the sensory division and the motor division. The sensory NS
conducts impulses from receptors to the CNS. It can be viewed as a monitoring
infrastructure relaying sensory inputs towards data-processing centres in the CNS.
Input data can originate from both external receptors like eyes, ears, nose or skin
and from internal receptors like chemical concentrations, pressure or temperature
sensors in blood, lymphatic circuits, glands and viscera.
Conversely, the motor division transmits signals from the CNS towards external and internal effectors. Two types of activator pathways can be distinguished: somatic and autonomic. The somatic division typically conducts impulses from the CNS to the skeletal muscles, hence playing an essential role in voluntary motor functions. Involuntary actions pass through the autonomic division to reach internal organs, including the cardiac muscles in the heart, the smooth muscles in the stomach and the hair follicles in the skin. Hence, the autonomic division affects unconscious activities including heartbeat, the widening and narrowing of blood vessels, breathing, digestion, metabolism and pupil dilation. The autonomic NS (ANS) provides a particularly enticing source of inspiration for the design of adaptive systems, due to its capability of using sensory information to regulate internal processes while relying on a combination of relatively simple, unconscious and mostly hardwired circuits.
As illustrated in Fig. 3.4, the ANS is classically divided into two subsystems8 consisting of circuits with opposing actions. On the one hand, the sympathetic NS (SNS) is concerned with adaptations that prepare the body for stressful or emergency situations: fight or flight. On the other hand, the parasympathetic NS (PSNS) dominates in calm situations, returning the body to a resting state and conserving energy: rest and digest.
7. Various neuroscience sources promote different PNS divisions (e.g. placing part of the sensory division within or outside the ANS), yet a discussion of this topic is well outside the scope of this publication.
8. Various neuroscience sources also include the enteric nervous subsystem as part of the ANS, yet for clarity reasons we avoid presenting this detail here.
Signal transmission across one neuron relies on the formation and propagation of an action potential9 in response to some stimulation. Stimuli can be received concomitantly from different sources: sensors or other neurons. The sum of the stimuli on a neuron must cross a critical threshold in order for the neuron to fire. According to an all-or-nothing principle, once a neuron's threshold is crossed, the signals transmitted are almost identical irrespective of the intensity of the initial stimuli. Nonetheless, while signals at the neuron level are the same, significant differentiation can be achieved over neural circuits by varying neuron interconnections, synapse types and signal synchronisation.
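The all-or-nothing principle is easy to express in code. The following minimal sketch, in which all names, weights and threshold values are illustrative rather than drawn from the text, shows a neuron that sums concurrent stimuli and emits a fixed-amplitude signal only when its threshold is crossed:

```python
# Sketch of the all-or-nothing principle: a neuron fires only when the
# summed stimuli cross its threshold, and every emitted signal has the
# same amplitude regardless of input intensity.

SPIKE_AMPLITUDE = 1.0  # identical for every fired signal

def fires(stimuli, threshold=1.5, weights=None):
    """Sum concurrent stimuli (from sensors or other neurons) and
    return the emitted signal: a fixed-amplitude spike or nothing."""
    weights = weights or [1.0] * len(stimuli)
    total = sum(w * s for w, s in zip(weights, stimuli))
    return SPIKE_AMPLITUDE if total >= threshold else 0.0

print(fires([0.9, 0.7]))   # 1.0 -> crosses the threshold, fires
print(fires([5.0, 5.0]))   # 1.0 -> fires with the very same amplitude
print(fires([0.2, 0.3]))   # 0.0 -> below threshold, no signal
```

Whether the inputs barely cross the threshold or exceed it many times over, the emitted spike is the same; differentiation must therefore come from circuit structure rather than from signal amplitude.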
The most common and most diverse synapses are chemical. In a chemical synapse, the termination of a sending neuron (activated by a signal) releases neurotransmitter chemicals into the space between the sending neuron and the receiving neuron. These chemicals bind to the receptors of the receiving neuron. Here, different receptor types have different effects: excitatory effects contribute to the neuron's activation and subsequent signal propagation; inhibitory effects do the opposite. Alternatively, electrical synapses can also transfer signals between neural cells, the main difference being the increased speed of signal transmission compared to chemical synapses.
Similarly, communication between neurons and effector cell types (e.g. muscle cells) is also based on synaptic connections. As before, depending on the type of receptor in the receiving cell, the resulting effect can be excitatory, inhibitory or modulatory. For example, an excitatory effect on a muscle cell would manifest as rapid cell contractions. The NS uses over 100 types of neurotransmitters,10 with diverse effects on different kinds of receptors. Hence, a single sending neuron may have both excitatory and inhibitory effects when connected to different receiving cells with different receptor types.
While signal transmission between cells lasts only a fraction of a millisecond, longer-term effects may also occur in the synaptic connection. For example, the number of receptors in a receiving cell may be multiplied, subsequently increasing the sensitivity of the synaptic bond. Such changes may last for variable periods, such as days, weeks or longer. This type of mechanism provides an essential basis for the formation of memory traces and learning. Reward-based learning may also occur, based on the reinforcement of frequently activated neural connections and conditioned by an extra reward signal (which uses the dopamine neurotransmitter). These capabilities give the nervous system a certain plasticity, enabling it to adapt to variations in its environment.
9. Action potential (spike or impulse): the sequential polarisation and depolarisation of a neuron's membrane, caused by stimuli (in a neuron's dendrites or soma) and travelling through the neuron (soma and axon) towards its extremity (axon terminals). Importantly, only stimuli that cross a certain threshold cause the action potential to travel across the neuron, causing the neuron to fire. Once triggered, all signals have the same action potential amplitude.
10. The most common neurotransmitters include acetylcholine, dopamine, GABA, glutamate and serotonin.
The simplest nervous systems, found in animals such as jellyfish, consist of a loose nerve net, with no central brain or spine. The interlaced nerve network of such animals allows them to react to sensory inputs such as light, touch, temperature or chemical concentrations. However, such uniform, unspecialised nets provide insufficient accuracy for locating the sources of sensory inputs, hence engendering identical reactions to inputs from different locations.
To improve coordination and movement accuracy, starfish feature a different organisation of their neural nets: several radial nerves extending through each arm and a radial nerve ring connecting them all in the middle. Starfish are interesting examples of extensive self-repair capacities, being able to fully regenerate any of their arms. Even though such capacities are not directly enabled by the NS, they show how this particular NS topology renders starfish well suited to extensive self-repair.
The most complex NS organisation can be found in Bilaterian animals, which represent most of the vertebrate and invertebrate animal species (including humans). All Bilaterian animals possess a central nervous system (CNS) comprising a brain, at least one central cord and numerous nerves. Certainly, NS size varies significantly across Bilaterian species, from a few hundred highly specialised neurons and glial cells in simple worms to about a hundred billion adaptive neurons and glial cells in humans. As can easily be observed, the size and flexibility of an NS have a critical impact on the complexity and adaptability of the behaviour it generates.
Within an NS, some neural circuits are genetically preprogrammed. These most
notably include the neural circuits involved in basic survival mechanisms. At the
same time, most NSs also feature various degrees of plasticity (or neuroplasticity),
which enables them to undergo structural and/or functional changes based on input
from the external environment. Changes may occur at small scales, like the cellular changes and new synaptic connections involved in learning, or at larger scales, like extensive reorganisations of cortical mappings following brain injury.
environment12 [24]. Such processes can also play a key role in system self-optimisation (the continuous renewal of constituent parts depending on resource availability and context), self-repair (by enabling the regeneration of system parts such as the arms of starfish or the tail of gecko lizards) and even self-protection (by detaching renewable body parts to avoid capture).
12. R. Doursat, Morphogenetic engineering weds bio self-organization to human-designed systems, PerAda Magazine, May 2011; http://www.perada-magazine.eu/view.php?source=003722-2011-05-18
The mechanisms and processes behind species evolution provide great inspiration for engineering automatic solution-search algorithms and infrastructures, as has already been shown with genetic algorithms. Such processes mainly capitalise on the capacity to represent an organism's blueprint in the form of an efficiently compressible information code, the possibility of mutating and mixing different blueprint variants and the capability of selecting the most suitable blueprints from the newly resulting variants, based on the fitness evaluation of individuals.
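These three ingredients (an encoded blueprint, variation through mutation and crossover, and fitness-based selection) translate directly into a genetic algorithm. The following is a minimal sketch under a toy fitness function; the population size, rates and objective are all illustrative:

```python
# Minimal genetic algorithm: encode a "blueprint" as a bit string,
# mutate and recombine variants, and keep the fittest individuals.
import random

def fitness(blueprint):
    return sum(blueprint)  # toy objective: maximise the number of 1-bits

def mutate(blueprint, rate=0.05):
    return [b ^ (random.random() < rate) for b in blueprint]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]  # mix two blueprint variants

def evolve(pop_size=20, length=32, generations=50):
    population = [[random.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Select the most suitable blueprints based on fitness evaluation.
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

print(fitness(evolve()))  # approaches 32 as the search converges
```

Real applications replace the toy objective with a domain-specific fitness evaluation, which is usually the expensive part of the search.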
Overall, the entire biological realm seems to open a wide range of opportunities for novel, alternative design solutions suitable for sustaining the production and maintenance of complex, adaptive IT systems. Certainly, caution must be taken when drawing inspiration from the domain of living systems, as the underlying environmental restrictions and the targeted objectives can prove quite dissimilar for biological systems and engineered IT systems. Hence, engineers should seek merely to extract inspiration from the biological realm rather than blindly attempting to copy its mechanisms without a thorough understanding of their primary constraints and motivations; see, for example, [25].
Finally, engineers should take considerable care when considering the long-term impact of the autonomic IT systems they develop and deploy into the real world. Indeed, the very advantages of an autonomous system, notably the absence of any need for human intervention, could turn into notable disadvantages should the system no longer serve our purposes. From this perspective, the context and lifespan of an autonomic system must be considered during its design, and suitable solutions included into its very structure and function. Within this context, the Apoptotic Computing project has been defined with the goal of developing 'programmed death by default' for computer-based systems [26]. This project aims to introduce a self-destruct property into autonomic computing systems in order to help prevent catastrophic scenarios. The approach is inspired by the apoptosis mechanism in biological systems, where programmed cell death intervenes as part of immune responses in multicellular organisms to eliminate damaged or diseased cells in a controlled and non-toxic manner.
3.3 Control Systems
Automatic control systems have long been used to regulate industrial processes such as power stations and chemical plants. As in autonomic computing systems, the feedback loop is central to such systems. A control system consists of an interconnection of components specifically configured to achieve a desired purpose. Modern control engineering practice includes the use of control design strategies for improving manufacturing processes, the efficiency of energy use and advanced automobile control, among others. It is these processes (plant or components) that are under the management of the control system.
Fig. 3.5 Feedback loop
3.3.1 Introduction
The traditional goal of automatic control systems is to maintain parameters within certain thresholds without human intervention. These parameters are usually physical measurements or quantities such as speed and temperature. Control techniques have been developed to meet these needs [27], which include the use of feedback loops. As illustrated in Fig. 3.5, a feedback loop is composed of three elements: the system that one wishes to regulate, a set of sensors and a controller. Starting from reference values set by the user, the controller's role is to observe the system through the sensors and make changes to ensure both system stability and compliance with the user's reference values. To achieve this, it is necessary to have an accurate representation of the system. That is why regulation techniques use mathematical models of the environment that define the system state and the possible values of inputs and outputs. This concept of a feedback loop has been present in autonomic computing from the beginning [28].
For completeness, we describe a general process for designing a control system. This section provides only a brief and simplified introduction to the subject, enough to understand some of the main concepts that tend to affect autonomic systems. To this end, much of the calculus has been removed; we therefore direct the reader to the many existing introductions to control theory for a fuller review.
Typically, a representation of the control system is modelled, and there is a gap between the complex physical system under investigation and the model used in the control system synthesis (see Chap. 7, Knowledge, for a more thorough discussion). Through the use of the feedback loop, the iterative nature of the control process allows us to handle this design gap effectively while trading off complexity, performance and cost in order to meet the system's goals.
(Figure: a water tank regulator; the labelled parts include the water and steam levels, a float, a valve and the measured difference.)
(Block diagram: the reference R(t) and the measured output form the error e(t); the controller C(s) produces the system input i(t); the system P(s) produces the output o(t), which is fed back through the sensor F(s).)
In this model of a control system, we have the system being controlled, a sensor that monitors the outputs from the system and a controller that regulates the system. We assume the sensor in this system does not change the output value in any way, that is, F(s) = 1; it has no gain, which is typical, for example, of computer-controlled systems converting analogue signals to digital signals.
The error at a given time, e(t), is calculated as the difference between what was expected from the system (given the initial reference value r(t)) and the actual output value read by the sensor. The error is a measure of how well the system is performing at any instant. If the error is large, the measured output is not matching the desired output. Here, the controller must adjust the input value to reduce this error; typically, if the difference is large, then the control action (also known as the gain) is large.
P(s) is the transfer function from the input i(t) to the output o(t). The advantage of the feedback controller is that only simple knowledge about the managed component, or system, is required, and not about the environment. The controller's goal is to minimise the error. Because the system measures the effective error by subtracting the measured output from the reference, it can directly react to the previous system output.
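To make the loop concrete, here is a minimal simulation of this arrangement, assuming a toy system in which the control action simply accumulates into the output and a unity-gain sensor; the gain and reference values are illustrative:

```python
# Minimal closed-loop sketch: a purely proportional controller driving
# a toy system whose output accumulates the control action.

def simulate(reference=10.0, kp=0.5, steps=20):
    output = 0.0  # o(t): system output, measured by a unity-gain sensor
    for t in range(steps):
        error = reference - output  # e(t) = r(t) - measured output
        control = kp * error        # proportional control action
        output += control           # toy system: input accumulates
        print(f"t={t:2d}  e(t)={error:6.3f}  o(t)={output:6.3f}")

simulate()  # o(t) converges towards the reference as e(t) shrinks
```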
(Block diagram of a PID controller: the setpoint and the measured output, fed back through the sensor F(s), form the error e(t); the weighted terms, including the proportional term Kp e(t) and the derivative term Kd de(t)/dt, are summed to drive the system P(s), producing the output o(t).)
Proportional (P): The proportional term determines how much the system should be adjusted in order to respond to an input error; it acts, for example, like an amplifier. Here, Kp is the proportional gain. Higher gains will make the system adjust more quickly. However, setting the proportional gain too high can cause the correction value to overshoot (Fig. 3.9), which in turn can produce large errors that need compensating for. Again, if the gain is large, it will overcompensate and essentially make the system oscillate more or even become unstable. Proportional control alone will usually not make the system arrive at its set-point value, but will only approximate it.
Integral (I): This controller integrates the error signal and so provides a response to the past behaviour of the system. It provides a control signal that attempts to correct the errors that should have been corrected previously, that is, it aims to eliminate bias. The accumulated error is multiplied by an integral gain to weight the contribution of this component to the controller. Here, Ki is the integral gain. The integral term accelerates the system towards the desired set point and, when used in conjunction with the proportional controller, eliminates the constant error. The gain needs to be carefully tuned to avoid the system over-responding to previous error and overshooting the desired value. Setting the integral gain higher will eliminate the error more quickly, but with the risk of a larger overshoot, as every negative error integrated needs to be compensated for by positive error in order for the system to reach a steady state; see Fig. 3.10.
Derivative (D): The derivative component of the controller, as its name implies, responds to the derivative of the error over time. Kd is the derivative gain; it reduces the signal amplitude when overshooting and so can be very useful for decreasing the overshoot produced by the PI controller, while retaining the latter's high speed of adjustment. However, setting the derivative gain too high can significantly slow the response.
The PID controller, having all of these components weighted, then produces the final control signal by adding all their contributions together:

$$u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d\, \frac{de(t)}{dt}$$
80
overshoot
steady state
setpoint
oscillations
rise time
time
There are many methods for tuning a PID controller to ensure that the system behaves optimally under the desired circumstances. Manual methods usually start by adjusting the proportional term until the system starts to oscillate, and then adjust the integral and derivative terms to make sure the system adapts quickly without overshooting too much or becoming unstable. This controller is particularly useful in cases where knowledge of how the system will respond to the control signal is very incomplete. In cases where more complete knowledge of the system is available, the PID controller can be combined with feedforward mechanisms to provide more suitable control.
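The equation above maps directly onto code. The following sketch is illustrative rather than production-grade; the gains, the time step and the toy first-order plant are all assumptions of ours:

```python
# Illustrative PID controller: the three weighted terms from the
# equation above are summed into a single control signal.

class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # accumulated past error (I term)
        self.prev_error = 0.0    # for the error derivative (D term)

    def update(self, setpoint, measured):
        error = setpoint - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error            # react to the present error
                + self.ki * self.integral  # correct accumulated bias
                + self.kd * derivative)    # damp the rate of change

# Toy first-order plant: the output drifts towards the control input.
pid, output = PID(kp=1.2, ki=0.4, kd=0.05, dt=0.1), 0.0
for _ in range(100):
    output += 0.1 * (pid.update(setpoint=1.0, measured=output) - output)
print(round(output, 3))  # settles near the setpoint of 1.0
```

Tuning then amounts to adjusting kp, ki and kd, exactly as the manual procedure above describes.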
Fig. 3.11 Showing the effect of damping on overshoot (percentage overshoot plotted against the damping ratio)
An oscillating system may never settle to a steady state, and in an autonomic context this can be manifested, for example, as the system binding to different components, unbinding and rebinding again. This obviously causes performance problems, as the overheads of carrying out the autonomic management essentially dominate.
Damping is an effect that tends to reduce the amplitude of oscillations (e.g. friction can be described as a damping force). Figure 3.11 depicts the relationship between the damping ratio and the percentage overshoot. Obviously, the more we damp a state change, the less overshoot will occur. However, more damping of the error correction leads to a less responsive system and can increase the time taken for the system to reach stability.
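For the canonical second-order system underlying plots such as Fig. 3.11, this relationship has a standard closed form (a classical control-theory result; the text itself does not state it explicitly):

$$\%OS = 100 \cdot e^{-\zeta \pi / \sqrt{1 - \zeta^2}}, \qquad 0 \le \zeta < 1$$

where $\zeta$ is the damping ratio. For example, $\zeta = 0.1$ gives roughly 73 % overshoot, while $\zeta = 0.5$ gives about 16 %, matching the steep drop visible in the figure.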
The stability of a system relates to its response to inputs or disturbances. A stable system can be described as one that remains in a constant state unless affected by an external action and that returns to a constant state when that action is removed. Therefore, a system is stable if its error-correction response approaches zero as time approaches infinity. Control theory is concerned not only with the stability of a system but also with its degree of stability, which can be described as marginal when the system is stable for a time and then, when a disturbance is injected, becomes unstable for a time. The degree of stability, therefore, is an indicator of how close the system is to instability and of how much disturbances will affect the system's ability to return to its expected output.
3.4 Artificial Intelligence
The more complicated and unpredictable the goals and the conditions in which they must be met, the more intelligence the system must possess. The actual manner in which problem-solving capabilities are designed and implemented in computing machines depends greatly on the adopted view of the concept of intelligence (briefly discussed below). Irrespective of the adopted perspective, intelligence cannot develop or perform in complete isolation from the world in which it must operate. Hence, perception of and action on the environment, as well as communication and coordination with other intelligent systems, become key capabilities. Such capabilities are also essential in autonomic computing.
As highlighted by Russell and Norvig [29], intelligence has been studied from several perspectives, and this has generated different lines of work. First, while some concentrate on the thinking processes that seem to enable intelligence, others focus on the observable behaviour that intelligence renders possible (or that generates the illusion of intelligence). Second, various schools differ on whether intelligence is considered strictly with respect to human capabilities or with respect to an ideal form (also referred to as rationality14).
Cognitive science views intelligence from the perspective of human-like thinking and consequently focuses on the study of the mind and its processes. This approach to AI is relevant to autonomic computing only insofar as insight into human cognition and its physiological support can be used to create knowledge representations, reasoning models or neural networks that can be implemented in a computing program to help render it autonomous.
The rational thinking approach to AI emphasises processes based on logic. Here,
relevant facts and their interrelations are formally represented, and different types of
reasoning are used for inferring new conclusions from existing facts. Based on this
approach, a computer program should, in principle, be able to solve any solvable
problem that is expressed in logical notation. However, serious limitations are
encountered when attempting to develop programs that can solve real problems.
These stem from the difficulty of formally representing knowledge about a complex
and often uncertain environment and of actually executing logical inference within
reasonable timeframes and with available resources. Within autonomic computing,
logic-based approaches can be employed to solve clearly defined problems, such as
straightforward analysis and planning (see Chap. 7).
Adopting a human-behaviour perspective, identifying intelligence becomes a matter of comparing artificial system behaviours to human conduct. Based on this perspective, Alan Turing15 proposed a test, known today as the Turing test, offering an operational definition of intelligence. In order for an entity to be considered
14. A system is rational if it 'does the right thing', given what it knows [29].
15. Alan Turing: English mathematician, logician, cryptanalyst and computer scientist. He can be considered one of the key predecessors of artificial intelligence (AI), as he defined a vision of AI in a 1950 article called 'Computing Machinery and Intelligence', where he introduced the Turing test, genetic algorithms, machine learning and reinforcement learning. He also introduced some fundamental AI concepts in a less-known article submitted in 1948 and entitled 'Intelligent Machinery', which remained unpublished during Turing's lifetime.
autonomic systems, which must also be evaluated with respect to their ability to reach
predefined goals.
While the concepts and paradigms related to intelligent agents directly apply to
autonomic computing systems, the actual design and implementation of autonomic
computing elements and systems proves just as difficult as the design and implementation of agents and multi-agent systems. Indeed, just as with complicated
agent systems, the global architecture of large-scale, distributed, multi-objective,
dynamic autonomic systems can quickly become fuzzy and hard to implement and
maintain.
18. IBM's Deep Blue computer program managed to defeat the chess champion Garry Kasparov in May 1997 (http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepblue).
19. IBM's Watson computing system challenged and beat Jeopardy! champions in February 2011 (IBM Jeopardy Challenge: http://techcrunch.com/tag/watson).
and image recognition, for example, identifying objects from video footage; heuristic
classification and decision-making, for example, advising on whether to accept
credit card purchases; or natural language skills, for example, predicting the past
tense of English verbs.20
Symbolic and connectionist approaches feature specific advantages and limitations21 [33]. Symbolic approaches facilitate rich expressiveness, explicit architecture
and procedural versatility, which render them suitable for goal-based reasoning.
They facilitate the conception of processes that use complex knowledge representations
to perform systematic search explorations, parsing and recursive procedures.
Explicit knowledge representation and architecture enable various parts to be reused,
rearranged and modified independently. As a main limitation, symbolic approaches
are highly sensitive to incomplete or incorrect data and perform rather poorly at
common sense reasoning tasks, where analogies and approximations are more
suitable than precise formal procedures.
Conversely, connectionist approaches can inherently handle fuzziness and adapt knowledge fragments to specific contexts. They prove particularly well suited to addressing ill-defined problems and weakly linked facts, such as those involved in pattern recognition, clustering, categorisation, optimisation and knowledge retrieval. Their main limitations are essentially due to the rigid, uniform and flat structure imposed by neural networks. Indeed, the lack of a larger-grain architecture makes it impossible to isolate a part of the network as a reusable piece of reasoning; to express, extract, share or reuse acquired knowledge; to address complicated situations by problem decomposition and integration of partial solutions; to learn to perform new tasks once trained; or to perform several tasks in parallel.
20. A well-known connectionist experiment, conducted by David Rumelhart and James McClelland at the University of California at San Diego and published in 1986, consisted in training a network of 920 artificial neurons (organised in two layers of 460 neurons) to form the past tenses of English verbs.
21. The cognitive sciences studying the human mind are similarly split into different communities. Cognitive psychology takes a top-down, knowledge-oriented approach, focusing on internal mental processes and states, including beliefs, desires, knowledge, ideas and motivations. Conversely, cognitive neuroscience takes a bottom-up approach by studying the biological substrates, or the brain's neural network, that underlie and enable cognition.
A radically different approach to building intelligence challenged existing AI communities by proposing an exclusively behavioural approach to robotic systems [34]. Namely, Rodney A. Brooks argued that intelligent behaviour can be achieved by relying exclusively on collections of simple, well-integrated reflexes. This approach eliminates intelligence as a necessary, explicit element that mediates between perception and action and instead defines it as a virtual concept induced in the mind of external observers. To help build complicated robotic systems based on this vision, Brooks proposed the subsumption architecture, which organises reactive reflexes into multiple, interdependent layers representing different abstraction levels and goal complexities. This is not unlike some of the processes found in the natural ANS. However, in robotics, goals pursued by reactions in the highest layers, such as searching for food, must rely on and subsume reactions aiming to achieve simpler
goals in the lower layers, like avoiding obstacles. Successful examples of this approach include the first autonomous spacecraft, Deep Space One, developed as part of NASA's Remote Agent program ([29], p. 27).
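The layering idea can be sketched in a few lines of code. The following illustrative fragment (our own simplification, not Brooks' actual formulation) gives the survival reflex priority over the higher-level goal, so that goal-directed behaviour only proceeds when no reflex needs control:

```python
# Subsumption-style layering: simple reflexes and goal-directed
# behaviour arranged in layers, with the survival reflex taking
# precedence in this simplified rendition.

def avoid_obstacles(percepts):
    # Lowest layer: simple survival reflex.
    if percepts.get("obstacle_ahead"):
        return "turn_away"
    return None

def seek_food(percepts):
    # Higher layer: a more complex goal, active when reflexes are idle.
    if percepts.get("food_direction") is not None:
        return f"move_{percepts['food_direction']}"
    return "wander"

LAYERS = [avoid_obstacles, seek_food]  # priority order

def act(percepts):
    """The first layer that produces an action wins, so reflexes
    override goal-directed behaviour when they fire."""
    for layer in LAYERS:
        action = layer(percepts)
        if action is not None:
            return action

print(act({"obstacle_ahead": True, "food_direction": "north"}))   # turn_away
print(act({"obstacle_ahead": False, "food_direction": "north"}))  # move_north
```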
A more recent AI approach, called evolutionary AI, uses bio-inspired evolutionary concepts to develop solutions to predefined problems. An interesting application example consists in modelling the evolution or growth of a business within a simulated marketplace. Notably, evolutionary AI has been used to model artificial life forms within the artificial life (A-Life) domain. Among other centres of interest, artificial life studies the self-organisation processes that lead to swarm intelligence, as can be observed in the simple flocking patterns of birds, in the movement synchronisation of fish schools or in the more complex constructions of anthills, honeybee combs and human embryos.
While the debate on the merits of each of these approaches persists in the AI domain, some AI researchers propose hybrid solutions that can capitalise on the advantages of both designs. Notably, Marvin Minsky argues that AI must employ, and be able to integrate, many heterogeneous approaches, each one specialised in handling a different type of knowledge representation [33]. This view is further developed in Minsky's SOM theory22 [35], where human intelligence is modelled as a collection of simple agents, each specialised in performing a specific type of task. The agent interactions lead to the formation of an agent society, or society of mind, capable of performing complex intellectual tasks.
Last, but not least, learning was proposed and developed by the AI community as a particularly potent element in creating and maintaining intelligence. Rather than having external programmers carefully design and implement intelligence all at once, learning enables intelligence to develop progressively and adaptively by automatically modifying a base of existing artefacts, in order to better achieve problem-solving capabilities within the current environment and with respect to the present goals. Learning can be applied to create, disable and tune reflexes in purely reactive entities, to enrich and update knowledge in more sophisticated designs and, finally, to identify suitable goals or even to improve the inherent learning methods themselves.
In addition to initial development, learning constitutes a powerful and essential adaptation enabler. Most importantly, it allows generic designs and implementations to be reused within a wide range of specific execution contexts and for attaining a large spectrum of goals. For example, an intelligent entity (or agent) can be designed so as to possess merely an initial reflex-based behaviour, enhanced with learning capabilities that progressively enable it to develop more sophisticated, knowledge-based behaviours. This renders a common design reusable, enabling any individual agent to develop from a generic but basic set of capabilities that match all foreseeable environments into an efficient and sophisticated behaviour specialised for a certain environment.
22. Society of mind (SOM): a conceptual theory about the workings of the mind and thinking, initiated by Marvin Minsky with Seymour Papert in the 1970s and later developed and published by Minsky in The Society of Mind book, published in 1988.
3.5 Complex Systems
Complex systems are systems that consist of numerous interconnected parts and
that exhibit properties and behaviours that are not necessarily obvious from studying the properties and behaviours of the individual parts. The study of complex
systems focuses on the way in which interactions among parts give rise to overall
system behaviours and relations to the system environment. Complex system examples include social systems, involving interrelated humans; weather dynamics, driven by differences in temperature and moisture densities; climate systems, based on
long-term interactions among atmosphere, hydrosphere, cryosphere, land surface
and biosphere; or chemical systems, where interactions among chemical elements
can give rise to cyclical or oscillating reactions. Complexity Theory23 focuses on the
23. Here we refer to Complexity Theory as studied in relation to complex systems. This is not to be mistaken for the field of Computational Complexity Theory, a branch of the Theory of Computation (from theoretical computer science and mathematics) that aims to classify computational problems according to their difficulty and to relate the identified classes of problems to each other.
study of interactions, iterations, emergence and pattern formation, all of which may
prove relevant to advancements in autonomic computing.
Complex adaptive systems (CAS) represent a special category of complex
systems, where in addition to being complex, systems are also adaptive, in the sense
that they modify themselves in response to changes in their environments and
potentially learn based on experience. CAS examples include ecosystems, social
insects and ant colonies, brains and immune systems, cells and developing embryos,
financial systems and stock markets, and political and cultural communities.
It is important to note that the term complexity is employed across different research communities in software engineering and computer science with quite diverse meanings. For example, complexity may imply that a system is extremely complicated, that the composition of its parts is nonlinear or that the resulting overall behaviour is unpredictable. These differences aside, it remains clear
that available CAS research can provide useful concepts and models for designing
complex self-managing computer systems. Some of the most notable CAS concepts
to be considered include self-organisation and emergence.
Moreover, several specific research fields that have emerged from the general
CAS domain may prove particularly relevant with respect to autonomic computing
systems. These include the study of networked systems (small-world and scale-free
networks, dynamic and adaptive networks, graph theory, scalability and robustness
properties), pattern formation (cellular automata, reaction-diffusion systems, self-replication and differentiation), nonlinear dynamics (attractors, chaos and stability
analysis), evolution and adaptation (genetic algorithms, artificial life, evolutionary
computing, artificial neural networks, machine learning, co-evolution, goal-oriented
behaviour) or collective behaviour (ant colony optimisations, synchronisation,
swarms or phase transitions).
Cybernetics (defined by Norbert Wiener24 in [36], for instance) is another important interdisciplinary field specialised in the study of complex systems. Cybernetics
focuses on the understanding and specification of the self-regulatory aspects of
complex systems, where closed signal loops play an essential role. It is fundamentally
concerned with principles such as coordination, communication, information, feedback, control and regulation, which can be employed to explain and predict possible
system behaviours and functions. Such principles apply across a wide variety of
complex self-regulatory systems, from IT to physical and social systems. They are
definitely relevant to autonomic computing systems.
A noteworthy example of cybernetics' relevance to autonomic computing consists in W. Ross Ashby's25 brain studies within this domain [15, 16]. Ashby regards the brain as a physiochemical system that reacts to its environment and learns from its experience to adapt its behaviour. The brain becomes a key adaptation enabler
24. Norbert Wiener (1894-1964): American mathematician, considered the main originator of cybernetics.
25. W. Ross Ashby (1903-1972): English psychiatrist who carried out pioneering work in the cybernetics domain.
3.6 Key Points
References
1. Kephart, J.O.: Research challenges of autonomic computing. In: ACM International Conference on Software Engineering (ICSE 2005), pp. 15–21, St. Louis, MO, USA, May 2005
2. Hansen, J.G., Christiansen, E., Jul, E.: The Laundromat Model for autonomic cluster computing. In: IEEE International Conference on Autonomic Computing (ICAC '06), pp. 114–123, Dublin, 13–16 June 2006. doi: 10.1109/ICAC.2006.1662389. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1662389&isnumber=34794
3. Kephart, J.O., Greenwald, A.R.: Shopbot economics. Autonom. Agent Multi-Agent Syst. 5(3), 255–287 (2002). doi:10.1023/A:1015552306471. http://dx.doi.org/10.1023/A:1015552306471
4. Su, Y., van der Schaar, M.: Conjectural equilibrium in water-filling games. In: IEEE Global Telecommunications Conference (GLOBECOM 2009), pp. 1–7, 30 Nov–4 Dec 2009. doi: 10.1109/GLOCOM.2009.5425333. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5425333&isnumber=5425208
5. Waldrop, M.M.: Complexity: The Emerging Science at the Edge of Order and Chaos. Simon
& Schuster, New York (1992). ISBN 13: 9780671767891
6. Holland, J.: Hidden Order: How Adaptation Builds Complexity, 1st edn. Basic Books, New
York (1996). ISBN 13: 9780201442304
7. Kauffman, S.: At Home in the Universe: The Search for the Laws of Self-Organization and
Complexity. Oxford University Press, Oxford (1996). ISBN 13: 9780195111309
8. Watts, D.: Six Degrees: The Science of a Connected Age. W. W. Norton & Company, New
York (2004). ISBN 13: 9780393325423
9. Barabasi, A.-L.: Linked: How Everything Is Connected to Everything Else and What It Means.
Plume, New York (2003). ISBN 13: 9780452284395
10. Strogatz, S.H.: Nonlinear Dynamics and Chaos: With Applications to Physics, Biology,
Chemistry, and Engineering. Studies in Nonlinearity, 1st edn. Westview Press, Cambridge
(2001). ISBN 13: 9780738204536
11. Strogatz, S.H.: Sync: The Emerging Science of Spontaneous Order, 1st edn. Hyperion Book,
New York (2003). ISBN 13: 9780786868445
12. Holland, J.: Emergence: From Chaos to Order. Oxford University Press (Sd), Oxford, UK
(2000). ISBN 13: 9780192862112
13. Maturana, H.R., Varela, F.J.: Autopoiesis and Cognition: The Realization of the Living.
Boston Studies in the Philosophy of Science, vol. 42, 1st edn. D. Reidel Publishing Company,
Dordrecht (1980). ISBN 13: 9789027710161
14. Wiener, N.: Cybernetics, or the Control and Communication in the Animal and the Machine,
2nd edn. MIT Press, Cambridge (1965) (1st edn published by The Technology Press/Wiley,
New York, 1948). ISBN 13: 9780262730099
15. Ashby, W.R.: Introduction to Cybernetics. Chapman and Hall Ltd., London (1956)
16. Ashby, W.R.: Design for a Brain: The Origin of Adaptive Behaviour, 2nd edn. Chapman and
Hall Ltd., London (1960) (1st edition published in 1952). ISBN 13: 9780412200908
17. Nervous System. The Columbia Encyclopaedia. Columbia: Columbia University Press. 6th
edn. (2004) (entry available from Questia online encyclopaedia: http://www.questia.com/
library/encyclopedia/nervous_system.jsp)
18. Leong, S.K.: An Introduction to the Human Nervous System. Singapore University Press, Kent Ridge (1986) (Reflexes, pp. 155–161; The autonomous nervous system and visceral afferents, pp. 500–543). ISBN 9971-69-107-8
19. Gray, H.: Chapter IX: Neurology. In: Anatomy of the Human Body (Gray's Anatomy). Lea and Febiger, Philadelphia (1918). ASIN: B000TW11G6. Available online from Bartleby.com: http://www.bartleby.com/107
20. Macaulay, D.: The Way We Work: Getting to Know the Amazing Human Body. Houghton
Mifflin/Walter Lorraine Books, Boston (2008). ASIN: B004TE780I. ISBN 10: 0618233784
21. Ritzmann, R.E., Tobias, M.L., Fourtner, C.R.: Flight activity initiated via giant interneurons of the cockroach: evidence for bifunctional trigger interneurons. Science 210(4468), 443–445 (1980). doi:10.1126/science.210.4468.443. http://www.sciencemag.org/content/210/4468/443
22. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393, 440–442 (1998). doi:10.1038/30918
23. van den Heuvel, M.P., Stam, C.J., Boersma, M., Hulshoff Pol, H.E.: Small-world and scale-free organization of voxel-based resting-state functional connectivity in the human brain. NeuroImage 43(3), 528–539 (2008)
24. Doursat, R., Sayama, H., Michel, O.: Morphogenetic Engineering: Toward Programmable Complex Systems. Understanding Complex Systems Series. Springer, Berlin/Heidelberg (2012). ISBN 978-3-642-33901-1
25. Hinchey, M.G., Sterritt, R.: 99% (Biological) inspiration.... In: Proceedings of the Fourth IEEE International Workshop on Engineering of Autonomic and Autonomous Systems (EASE '07), 26–29 March 2007. IEEE Computer Society, Tucson
26. Sterritt, R.: Apoptotic computing: programmed death by default for computer-based systems. IEEE Comput. 44(1), 59–65 (2011). doi:10.1109/MC.2011.5. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5688151&isnumber=5688134
27. Golnaraghi, F., Kuo, B.C.: Automatic Control Systems. Wiley, New York (2008). ISBN 13:
9780470048962
28. Kephart, J.O., Chess, D.M.: The vision of autonomic computing. IEEE Comput. 36(1), 41–50 (2003)
29. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Prentice Hall,
Englewood Cliffs (2009). ISBN 10: 0136042597, 13: 9780136042594
30. Nilsson, N.J.: Artificial Intelligence: A New Synthesis. The Morgan Kaufmann Series in
Artificial Intelligence. Morgan Kaufmann Publishers, San Francisco (1998). ISBN 13:
9781558604674
31. Kephart, J.O., Walsh, W.E.: An artificial intelligence perspective on autonomic computing policies. In: Proceedings of the 5th IEEE International Workshop on Policies for Distributed Systems and Networks (POLICY 2004), 7–9 June 2004, pp. 3–12. IBM Thomas J Watson Research Center, Yorktown Heights, New York (2004). doi: 10.1109/POLICY.2004.1309145. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1309145&isnumber=29053
32. Horling, B., Lesser, V.: A survey of multi-agent organizational paradigms. Knowl. Eng. Rev. 19(4), 281–316 (2004)
33. Minsky, M.: Logical vs. analogical, or symbolic vs. connectionist, or neat vs. scruffy. In:
Winston, P.H. (ed.) Artificial Intelligence at MIT, Expanding Frontiers, vol. 1. MIT Press,
Cambridge (1990) (Reprinted in AI Magazine, 1991, http://web.media.mit.edu/~minsky/
papers/SymbolicVs.Connectionist.html)
34. Brooks, R.A.: Cambrian Intelligence: The Early History of the New AI, 1st edn. A Bradford
Book, Cambridge (1999). ISBN 13: 9780262522632
35. Minsky, M.: The Society of Mind. Pages Bent edition. Simon & Schuster, New York (1988). ISBN 13: 978-0671657130
36. Wiener, N.: Cybernetics: Or Control and Communication in the Animal and the Machine, 1st
edn. The Technology Press/Wiley, Cambridge/New York (1948). ASIN: B000RJDZXI
4.1 Autonomic Elements
Fig. 4.1 Autonomic software: an autonomic element, driven by goals and feedback, operating within its usage context and computing context
Note: a Trojan horse program can also be seen as an autonomic element, but this is not in the spirit of autonomic computing!
Fig. 4.2 An autonomic system: several autonomic elements, driven by goals and feedback, cooperating with each other and with unmanaged artefacts within the usage and computing contexts
context, internal and external, and the usage context. Recall that the computing context contains the computing resources that the autonomic element can use and those that influence its actions. The usage context refers to the persons or systems interacting with the autonomic software. The behaviour of an autonomic element at any point in time can depend on these two forms of context.
An autonomic element can form a complete software system. In this case, the autonomic system and the autonomic element are the very same, single software entity. In most cases, however, an autonomic element is simply a part of the whole autonomic system. Here, the autonomic element is really a sort of smart island, taking its own decisions in conformance with high-level directions. Outside of the island, administrators regain control of software artefact management in the traditional way, not simply via high-level directives. Unmanaged parts of a system represent any software or hardware resource that cannot be self-administered. The reason may be that they are sufficiently stable with regard to the system administration goals and so do not require runtime management intervention. Another reason may be that the data and/or the means to self-administer these elements are simply lacking. Unmanaged elements can, however, provide valuable information to an autonomic element and clearly belong to the computing context.
Most of the time, though, the situation is much more complex (see Fig. 4.2). For scaling and scoping purposes, autonomic systems include a number of autonomic elements managing well-defined software regions or providing well-delimited functions. These autonomic elements may cooperate, as they would with any other software artefacts, whether within or beyond the autonomic system they belong to.
Building an autonomic system comes down to specifying the features that have to be self-managed and defining the autonomic elements, usually an iterative process. Depending on their goals and constraints, for example, the time available to react or the number of events to consider, autonomic elements build upon different techniques and formalisms. Similarly, many interaction patterns can be employed to coordinate them.
Fig. 4.3 Autonomic element: an autonomic manager administering managed artefacts through sensors and effectors
4.2
4.2.1
Autonomic elements are structured according to a simple, widely accepted architectural model introduced by IBM [3]. This architecture, depicted in Fig. 4.3, is considered by many to be the reference architecture. It clearly defines two distinct types of modules: managed resources and the autonomic manager. Managed resources (or managed artefacts) are the software or hardware entities that are automatically administered in an autonomic element. The autonomic manager is the entity in charge of the runtime administration of the managed resources. IBM's reference architecture should be seen more as a logical architecture, identifying the main types of entities involved, defined via their roles, functions and interactions. For example, in certain cases, the autonomic manager and the managed resources may be more intertwined and less clearly separated than shown in this conceptual reference architecture.
A managed resource represents any software or hardware resource that is endowed with autonomic behaviour by coupling it with an autonomic manager. A managed resource can be a Web or database server, a specific software component, the operating system or a component therein, a cluster of machines in a grid environment, a rack of hard drives, a network, a CPU, a printer, etc. However, all managed resources share a common feature: they need to be adapted at runtime as a function of internal or external change, where that change impacts their goals. Adaptations are needed in order to provide a better quality of service, to take into account a new element in the context, to better satisfy user expectations, etc. They can be triggered by anything from someone with a mobile device moving to a new room, to a change in a device's availability, a component not performing as expected or a change in the size of the environmental data to be processed.
Managed resources provide specific interfaces, called control points or touch points, for monitoring and adaptation. Two types of control points have to be differentiated: sensors and effectors. Sensors, often called probes or gauges, provide information about the managed resources. This could be some information regarding the element's state or some idea of its current performance. For a Web server, for example, this could include the response time to client requests, network and disk usage figures or CPU and memory utilisation. Effectors provide facilities to adjust the managed resources and, as a consequence, change their behaviour. For instance, this could be the modification of a configuration file, the instantiation of new objects, the replacement of some outdated elements, etc.
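As an illustration, the following sketch shows what such touch points might look like for a Web server, assuming hypothetical server methods (avg_response_time, cpu_load, set_worker_threads); the interface names are ours, not prescribed by the reference architecture:

```python
# Illustrative touch points for a hypothetical web-server resource.
from abc import ABC, abstractmethod

class Sensor(ABC):
    @abstractmethod
    def read(self) -> dict:
        """Return current measurements about the managed resource."""

class Effector(ABC):
    @abstractmethod
    def apply(self, change: dict) -> None:
        """Adjust the managed resource in a controlled way."""

class WebServerSensor(Sensor):
    def __init__(self, server):
        self.server = server
    def read(self):
        # Expose state without revealing the server's internals.
        return {"response_time_ms": self.server.avg_response_time(),
                "cpu_percent": self.server.cpu_load()}

class WebServerEffector(Effector):
    def __init__(self, server):
        self.server = server
    def apply(self, change):
        # Only sanctioned adjustments are possible through the effector.
        if "worker_threads" in change:
            self.server.set_worker_threads(change["worker_threads"])
```

Keeping the sensor and effector interfaces narrow is precisely what preserves the encapsulation discussed next.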
The explicit definition of sensors and effectors is the proper means to encapsulate managed resources. In this manner, the internal structures and states of the managed resources can be kept private, and an autonomic manager can only access them via the interfaces that are provided. This is a way for the managed resources to keep some level of control. It is also a way for managed resources to remain directly administrable by a human operator. This is important since, in practice, it is very hard to foresee all runtime situations, and it is likely that, under some unexpected conditions, an autonomic manager would not be able to provide an adequate solution. The autonomic manager may also fail or be deactivated. In such situations, an expert human administrator has to be able to take over. Note that it is not expected that a regular administrator would take over here, as deep, intimate knowledge of the autonomic system is generally necessary.
The autonomic manager implements the autonomic loop(s). It perceives the current situation, internal state and external context, determines the desired management actions and pilots their execution. As previously mentioned, it is driven by administrative goals, generally expressed in rather abstract terms. The purpose of the autonomic manager is then to transform these high-level directives into precise, sometimes obscure, actions on the managed resources. A goal can be, for instance, 'be highly secure'. In this case, the autonomic manager has to create the expected security conditions through parameter settings, component creation, specific protocol usage and so on. Usually, the precise security approaches, technologies and techniques used to secure such a software element are not known by most software administrators.
As explained previously, the autonomic manager also provides regular feedback to the administrators so that they can remain in the decision loop. Here again, the feedback should be presented in such a way that it remains intelligible to non-expert administrators. It is the duty of the autonomic manager to aggregate data, perform
4.2.2 Sensors
Sensors are code or components that measure a physical or abstract quantity concerning the managed resource and convert it into a signal for the autonomic manager. Examples of such data would be system performance characteristics, user context or even server temperature. Such data are presented in a timely fashion via the sensor interfaces and contain all the information about a running system that is needed by an autonomic manager. Depending on the target autonomic properties, different types of data and different forms of presentation may be expected. Determining the appropriate data to be collected and implementing the corresponding sensors is a difficult activity in itself. One of the first steps when building an autonomic system is therefore to define the data that are needed, their nature and the way they should be collected and presented.
The data of interest may come from the external context (see Chap. 2) and from many parts of the system, including, for instance, components, connections, classes, operations and parameters. Some of these elements are business oriented, while others are more concerned with the supporting infrastructure, like database management systems (DBMS) or middleware. However, many supporting elements take the form of components off the shelf (COTS) and, as such, it is difficult to obtain internal data due to their black-box nature.
A distinction therefore has to be made between desired and accessible data. In green-field situations, this can be a rather straightforward mapping, since the interfaces to the desired data are inherently accessible as they are part of the system design. In other words, new systems are built in such a way that autonomic-related data can be naturally provided, granted that no major requirement is violated in doing so. However, dealing with a legacy system or COTS can be a much more complicated task. The code of the managed artefacts can be partially hidden, unreachable or just not modifiable. This can be the case, for instance, when old, non-instrumented libraries not designed to support monitoring or adaptation are used. Also, some component code may be too large to be completely monitored. In such complicated situations, compromises have to be reached in order to balance the targeted autonomic properties against the complexity of instrumenting the system. In some cases, certain levels of autonomy simply cannot be reached because of a lack of available sensed data, and here the point of an autonomic management solution clearly has to be questioned.
Therefore, given the complexity and cost of instrumenting a system, the goal is not to collect just any information that can be obtained about a system but, rather, to get appropriate data that can be used to carry out autonomic actions. 'Appropriate' has different meanings here. First, it means that collected data have to be in line with the autonomic properties that are sought. For instance, when it comes to performance management, a number of precise measures are needed to characterise system performance, which can include high-level information like memory consumption or disk usage.
Also, more obscure sensor data may be extracted and reasoned about, such as the mean execution time of a software component, the time spent in specific parts of the software, the number of threads and recurrent event patterns. Developers of autonomic systems have to realise that presenting useless information to an autonomic manager does not come without impact: it degrades the performance of the managed system without improving its autonomic capabilities. Instrumenting a system always has a cost in terms of code size, memory consumption and overall efficiency. This issue is exacerbated for embedded systems, for instance, where any addition of code has to be carefully justified due to the lack of resources. The trade-off between autonomy and the engendered loss of efficiency, increased code size and memory footprint must always be considered. Finally, collecting data is a dynamic process, and the data that are deemed appropriate may change over the managed system's lifetime.
Likewise, the data to be collected can be of very different kinds: raw or elaborated, simple or structured, and functional or non-functional. Some data are raw in the sense that they correspond to values directly captured in the managed artefacts. This can be, for instance, the memory consumption at a given time, the available disk space or any business-related value. In this case, there is no difference between what is collected and what is presented to the autonomic manager. Other data are more elaborate in the sense that they are the result of operations applied to a number of raw data items. This can be, for instance, the mean memory consumption over a given period. Making up aggregated information is necessary in many situations, for efficiency reasons or because of network contingencies. Again, it is clearly not desirable to communicate a huge amount of low-level information
4.2.3 Effectors
Effectors are code or components that effect change and are provided by the managed elements. The purpose of effectors is to allow the autonomic manager, or any other authorised entity, to trigger modifications to the managed artefacts in a synchronised fashion; that is, the timing and order of the changes make sense and the system's integrity is maintained. As with sensors, determining the required effectors and then implementing them can be a challenging task. It demands, in particular, the anticipation of possible changes and the provisioning of the technical means to realise them, especially at the execution-platform level.
Management actions can impact the different architectural elements, including components, connections, classes and operations. Since the autonomic manager may have to act upon several of them, synchronisation mechanisms have to be installed to ensure the change makes sense. As a matter of fact, modifying a software element is always tricky. In our case, it can have impacts on the other managed artefacts and also within the autonomic manager itself. For instance, the way information is collected may have to be modified while a change is ongoing.
Effectors carry out changes to the managed elements in order to modify their behaviour. Changes can be related to the element's functionality or to the quality of
4.2.4 Autonomic Manager
Fig. 4.4 The collect/decide/act loop of an autonomic manager: data is collected from the usage and computing contexts, decisions are taken, and actions modify the managed artefacts or trigger cooperation with other elements
the software in order to meet high-level management goals. These tasks, which demand some level of intelligence, therefore seek to assist or replace human administrators in their management work.
The purpose of an autonomic manager is to apply domain-specific knowledge in order to gracefully adapt a set of software artefacts at runtime when internal or external changes are detected. It is structured around a collect/decide/act control loop, as summarised by Fig. 4.4. As indicated, the autonomic manager makes use of monitored data and combines it with its internal knowledge of the system to plan and implement management tasks, that is, to execute the low-level actions that are necessary to achieve the aforementioned goals. Diverse changes can be triggered on the managed artefacts and, in some limited cases, on the computing context.
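A skeletal version of this loop, reusing the hypothetical sensor and effector touch points sketched earlier and adding equally hypothetical knowledge helpers (record, current_threads), might look as follows:

```python
# Skeletal collect/decide/act loop; the policy shown is illustrative.
import time

def autonomic_manager(sensor, effector, goals, knowledge):
    while True:
        # Collect: observe the usage and computing contexts.
        readings = sensor.read()
        # Decide: combine monitored data with internal knowledge
        # to derive actions that serve the high-level goals.
        actions = decide(readings, goals, knowledge)
        # Act: modify the managed artefacts through the effector.
        for action in actions:
            effector.apply(action)
        knowledge.record(readings, actions)  # feeds future decisions
        time.sleep(1.0)  # monitoring period

def decide(readings, goals, knowledge):
    # Toy policy: add a worker thread when response time misses its goal.
    if readings["response_time_ms"] > goals["max_response_time_ms"]:
        return [{"worker_threads": knowledge.current_threads() + 1}]
    return []
```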
The knowledge handled by an autonomic manager concerns the techniques that are used to structure and implement software, and the rationale under which these decisions are made. The ontology used by an autonomic manager includes objects like components, data structures, events, files, libraries and operating systems, together with functions relating these different objects at given points in time. But it also contains concepts like requirements, traceability links and deployment strategies, pertaining to the initial phases of software development. A major challenge when building an autonomic manager is certainly to relate information pertaining to the different phases of software development and use in order to make decisions.
The decision part of the autonomic manager has to reason about the present,
perhaps the past and possibly even the future. Indeed, in order to correctly comprehend a present situation, an autonomic manager may benefit from being able to
reason about previous experiences or historic events, to identify trends and recurring
patterns, to better put the current situation into context and to take more appropriate
action. A more sophisticated autonomic manager may be able to foresee the consequences of its various actions on the managed resources (e.g. via self-simulation)
and accordingly decide on the most appropriate ones to take at any one time. Some
level of future prediction may also enable an autonomic manager to take pre-emptive actions.

4.2.5
At the architectural level, autonomic managers are defined through their capabilities,
that is to say their functions, and their properties. Even if the functions provided
are approximately the same for all autonomic managers, properties may vary a lot.
In particular, an autonomic manager can be optional or mandatory, changeable or
fixed, fully or partially autonomic.
far as the ECA rules have been derived from the goals. Stateless means that such
managers do not maintain any history about past events and actions. They only
consider recent events, occurring in a defined period of time. Stateless approaches
minimise complexity, are quite lightweight and, in short, are very effective in
uncomplicated situations requiring rapid decisions. They are also very limited, that
is, the autonomic manager keeps no information regarding the state of the managed
element and relies solely on the current sensor data readings to decide whether to
enact an adaptation plan. Some managers may use simple or more complicated learning processes to change their reflexes over time. This may be beneficial, for example, to render reflexes better suited to the most frequently occurring situations or to temporarily block a reflex after it has just been triggered in order to avoid state oscillations (i.e. switching from state to state and not getting on with the
purpose of the system). However, in reflex-based managers, it is not possible to
perform advanced reasoning about the situation and, as a consequence, it is hard to
deal with complex situations, especially when the nature of these situations is bound
to evolve over time. In particular, in reflex-oriented approaches, it is very difficult to
realise that past actions have failed.
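As an illustration of the blocking idea mentioned above, the sketch below (with invented rule and threshold) gives a stateless reflex a single cooldown timer: after firing, the rule is temporarily disabled to damp oscillations, but no other history is kept.

```python
import time


class ReflexRule:
    """An ECA-style reflex with a cooldown to damp state oscillations."""

    def __init__(self, condition, action, cooldown_s=30.0):
        self.condition = condition    # predicate over current readings
        self.action = action          # effector invocation
        self.cooldown_s = cooldown_s  # blocking period after firing
        self._last_fired = float("-inf")

    def react(self, readings):
        now = time.monotonic()
        if now - self._last_fired < self.cooldown_s:
            return  # recently fired: temporarily blocked
        if self.condition(readings):
            self._last_fired = now
            self.action()


# Hypothetical rule: scale up when CPU load is high.
rule = ReflexRule(
    condition=lambda r: r["cpu"] > 0.9,
    action=lambda: print("scaling up"),
    cooldown_s=60.0,
)
rule.react({"cpu": 0.95})  # fires
rule.react({"cpu": 0.95})  # blocked by the cooldown
```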
Alternatively deliberative managers keep and reason about state information
regarding the managed element. This information can be updated progressively and
dynamically through fresh sensor readings. Accumulated state information allows
the manager to improve any existing knowledge it may have, based, for example, on
trend analysis, pattern detection and so forth. Better knowledge allows the manager
to carry out more complex reasoning and analysis of candidate solutions to identified
problems, subsequently increasing its chances of taking the most appropriate action.
In addition, recorded state information allows the system to be either more sensitive
or less sensitive to the sensor readings, in order to avoid the phenomenon of oscillating forward and backwards between states. This undesirable phenomenon is also
known as state flapping that occurs particularly in complex systems such as networks
(we describe this phenomenon in more detail later on). Such stateful approaches also
permit some self-learning, as previously introduced. A deliberative manager can
learn from its previous decisions and actions by continuously evaluating and potentially modifying itself. Such a capability is, of course, more costly, as it requires more resources than reflex-based solutions. This renders deliberative approaches inapplicable to certain systems, such as real-time or embedded systems.
Mixed approaches have also been studied, essentially in artificial intelligence.
Applied in the autonomic world, such approaches define an autonomic manager as
being composed of two complementary parts: a reactive, reflex-based one, in close
interaction with the environment, and a proactive, deliberative one, supervising and
adjusting the reflex-based functions. The reactive part is in charge of implementing
rapid actions in response to some well-defined conditions in the environment. The
proactive part, which can be executed on remote, more suitable resources, deals
with state conservation and complex reasoning. Based on its findings, this deliberative part can update the reactive part. For instance, it can change some parameters,
change reflex rules or even add or remove rules. This approach retains advantages
of both worlds but is difficult to realise.
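One possible reading of this two-layer arrangement, sketched below with invented names and thresholds, is a fast reactive layer whose rule parameters are periodically revised by a slower, state-keeping deliberative layer.

```python
class ReactiveLayer:
    """Fast reflexes whose threshold the deliberative layer may retune."""

    def __init__(self, cpu_threshold=0.9):
        self.cpu_threshold = cpu_threshold

    def react(self, reading):
        if reading > self.cpu_threshold:
            print("reflex: shedding load")


class DeliberativeLayer:
    """Slow supervisor: keeps history and adjusts the reflex rules."""

    def __init__(self, reactive):
        self.reactive = reactive
        self.history = []

    def observe(self, reading):
        self.history.append(reading)

    def revise(self):
        # If readings hover near the threshold, raise it slightly to
        # avoid constant triggering (a crude form of rule adjustment).
        if self.history and sum(self.history) / len(self.history) > 0.85:
            self.reactive.cpu_threshold = min(
                0.99, self.reactive.cpu_threshold + 0.02)


reactive = ReactiveLayer()
supervisor = DeliberativeLayer(reactive)
for r in (0.88, 0.92, 0.91):
    reactive.react(r)      # immediate, reflex response
    supervisor.observe(r)  # state kept for later reasoning
supervisor.revise()        # deliberative update of the reflex rules
```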
4.3

4.3.1

Fig. 4.5 The MAPE-K autonomic element: Monitor, Analyse, Plan and Execute activities, built around shared knowledge and goals, interact with the managed artefacts through sensors and effectors
that would necessitate corrective action. In such a case, it may represent the desired
states as a model of the managed artefacts. Once again, as in any model, such a
model is a focused, simplified representation of the desired state of the managed
artefacts. Detected problems and any associated analysis models are sent to the planner,
potentially decorated with different attributes, such as the relative importance or the
urgency of the problem.
The planning activity logically comes after analysis. Its purpose is to determine
a set of management actions allowing the passage from a current state to a desired
state, as defined by the monitoring and analysis activities. Action sets are partially
ordered and can handle several failures or malfunctions at a time since problems can
be intertwined. The planning activity is carried out with some assumptions about
the context and the managed artefacts. This has a profound impact on the feasibility
of the adaptation plans.
The execution activity, finally, has to carry out the plans, instantiating partially
ordered management actions. This activity directly interacts with the effectors provided by the managed artefacts.
The MAPE-K logical architecture has profoundly impacted the autonomic field,
providing a structuring framework to start with when building an autonomic system.
It is a modular architecture that makes sense to practitioners and combines properties
like the separation of concerns or scalability. The different activities, defined in
rather abstract terms, take care of focused, well-defined, complementary aspects.
Standardising communication interfaces of these activities, as advocated by IBM,
would even allow easier integration of various techniques developed by different
providers. The architecture is also scalable, since activities can be executed on different machines, provided that network latency is taken into account and does not affect reactivity.
It is important to note that the MAPE-K loop represents a logical architecture
and is not intended to be literally implemented as is in all autonomic systems.
Rather, its purpose is to indicate the main functions an autonomic manager must
support for administering a system and the main interdependencies between these
functions, that is, analysis depending on monitoring, planning on analysis, execution
on planning and all on knowledge. It also shows the manner in which the autonomic
manager interacts with managed resources (via sensor and effector touchpoints).
Various concrete designs and implementations are possible to instantiate this reference architecture.
Indeed, the MAPE-K proposal is not always directly applicable. Going through the well-defined, standardised interfaces of the four activities has a performance cost that cannot always be afforded. Timeliness is extremely important in autonomic computing and sometimes constitutes one of the first requirements to be met. A corrective action, in order to make sense, may have to be carried out within some time limit, and this cannot be negotiated. Thus, in some cases, grouping together some management activities, like analysis and planning, for instance,
can be required to meet a given deadline. Hence, the logical division advocated by
IBM is a high-level model; its merit is to provide high-level guidance facilitating
the design of an autonomic manager. But it is only a first step in building a real autonomic manager for a particular managed system.

Fig. 4.6 The monitoring activity: following monitoring directives, the Monitor observes the managed artefacts and their context through sensors and provides the current state to the other activities (Analyse, Plan, Execute)
Certainly, the major limit of this model is that it does not address the behavioural
dimension of an autonomic manager. Most of the time, an autonomic manager
does not implement a direct, straightforward monitor/analyse/plan/execute loop.
Interactions between the different activities are much more complex than that.
Backtracks are often needed when, for instance, a task needs additional information to
perform its computation. Breaks are also needed when, for instance, a task has to wait
for more data to be obtained or for some effects of adaptation to be measured. Also,
synchronisations are necessary when knowledge is shared by several activities.
4.3.2 Monitoring
Agarwala et al. [15] propose QMON, an autonomic monitor that adapts the monitoring frequency, and therefore the volume of monitoring data, to minimise the overhead of continuous monitoring while maximising the utility of the performance data. It could be described as an autonomic monitor for autonomic systems.
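The general idea of frequency adaptation can be illustrated with the toy sketch below. This is not QMON's actual algorithm; the stability tolerance and period-doubling rule are invented for the example.

```python
class AdaptiveMonitor:
    """Toy monitor that lowers its sampling rate when readings are stable."""

    def __init__(self, read, min_period=1.0, max_period=60.0, tolerance=0.05):
        self.read = read                # callable returning a reading
        self.period = min_period        # current sampling period (seconds)
        self.min_period = min_period
        self.max_period = max_period
        self.tolerance = tolerance      # change regarded as "stable"
        self.last = None

    def step(self):
        value = self.read()
        if self.last is not None and abs(value - self.last) < self.tolerance:
            # Stable readings: halve the overhead by doubling the period.
            self.period = min(self.period * 2, self.max_period)
        else:
            # Volatile readings: sample as fast as allowed.
            self.period = self.min_period
        self.last = value
        return value, self.period


readings = iter([0.50, 0.51, 0.51, 0.90, 0.91])
monitor = AdaptiveMonitor(read=lambda: next(readings))
for _ in range(5):
    print(monitor.step())  # period grows while stable, resets on change
```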
Also, data to be collected depend on the goals and on the state of the solving
process. Goals set by the administrators clearly affect the way the monitoring should
be done. Emphasis evolves as a function of the interest of the human administrators.
Similarly, intermediary results about the situation of the managed artefacts with regard to the goals can influence the data to be monitored and the way to collect them.
The monitoring phase provides information to the other management activities,
building a representation, or a model, of the present context and managed artefacts.
Once again, this model depends on the current goals and is very application dependent. To create it, the monitoring activity transforms the information collected into
an appropriate format, that is, a format that can be manipulated by the other activities. Such transformations can be complicated: they may involve a number of time-dependent operations like filtering, analysing and aggregating. In general, temporal windows
will have to be explicitly defined as some information makes sense only when
observed during given time periods. Knowing the right period can however be
difficult in some situations.
The output model can take different forms: a list of facts or observations, a graph
of objects, a state machine, a software architecture, etc. For instance, when handling
the battery life of a laptop, a three-state machine can be created and updated on
event occurrences. These three distinct states characterise the level of charge: under
1 % of charge, between 1 and 20 % and over 20 % of charge. Changing state can
provoke actions from the autonomic manager (analyse/plan/execute loop).
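The laptop-battery example could be coded as follows; the threshold values come from the text, while the state names and event handling are our own illustration.

```python
def battery_state(charge_pct):
    """Map a battery charge percentage to one of three states."""
    if charge_pct < 1:
        return "critical"   # under 1 % of charge
    if charge_pct <= 20:
        return "low"        # between 1 and 20 %
    return "normal"         # over 20 % of charge


previous = "normal"
for charge in (50, 15, 0.5):
    state = battery_state(charge)
    if state != previous:
        # A state change can trigger the analyse/plan/execute loop.
        print(f"state change: {previous} -> {state}")
        previous = state
```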
4.3.3 Analysis
We now focus on the analysis component of the MAPE-K loop. Analysing involves
evaluating the current state of the context and of the managed artefacts and specifying a target state if problems are identified. To do so, analysis relies on application
specific knowledge that can be hard to obtain. Recall that problems are defined here as failures in the managed artefacts or as suboptimal behaviour. Also, we use the term state in its most general sense, readily acknowledging that various formalisms can be used to define a state.
It is not the purpose of the analysis aspect to provide details about the identified
shortcomings but rather to specify the desirable states. It is then the job of the planning phase to come up with the best way to reach a desirable state (Fig. 4.7).
Analysis thus deals with the ability to understand the current context and to
determine a better state for the managed artefacts.
A wide variety of algorithms and techniques can be used in order to detect
misbehaviours and shortcomings, establish correlations, anticipate situations,
diagnose problems and define more desirable, and reachable, situations. This can
be anything from a model providing an evaluation of the situation to classification systems that identify whether or not a constraint or goal has been breached.

Fig. 4.7 The analysis activity: from the current state provided by the Monitor, the Analyse activity determines a target state that is handed to the Plan activity
Prediction systems typically monitor trends and also identify if a constraint or
goal will be broken in the near or further future. This can be implemented using
anything from simple regression analysis of a window of historical probe data to
using hidden Markov models that represent temporal states of the system and can
be used to model the outcomes of a plan.
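As a minimal example of the regression-based prediction mentioned above, the sketch below fits a least-squares line to a window of probe data and estimates when a response-time goal would be breached; all of the numbers are hypothetical.

```python
def predict_breach(samples, limit):
    """Fit y = a*t + b to (t, y) samples; return the time t at which
    the trend crosses `limit`, or None if it never does."""
    n = len(samples)
    ts = [t for t, _ in samples]
    ys = [y for _, y in samples]
    t_mean, y_mean = sum(ts) / n, sum(ys) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    a = sum((t - t_mean) * (y - y_mean) for t, y in samples) / denom
    b = y_mean - a * t_mean
    if a <= 0:
        return None  # flat or improving trend: no breach predicted
    return (limit - b) / a


# Response-time probes (seconds) over the last five time steps:
window = [(0, 2.0), (1, 2.2), (2, 2.5), (3, 2.7), (4, 2.9)]
print(predict_breach(window, limit=3.0))  # ~4.35: breach expected soon
```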
The second purpose of analysis is to determine an improved situation for the
managed artefacts. This can be achieved using anything from a set of high-level
goals, expressed in a symbolic way, to more sophisticated models such as target
software architectures or property graphs. But, whatever the formalism, desired
situations have to be expressed in a focused, synthesised way. This is the very purpose of the MAPE-K loop, which defines specialised autonomic components that intercommunicate. In the case of the analysis component, the goal is to feed the planning phase with an abstract description of the situation to be reached from the current one; this data is also focused and abstracted.
In the autonomic field, three kinds of policies have been heavily used to implement the analysis expertise: event-condition-action (ECA) policies, utility function policies and goal policies [16].
Event-condition-action rules are a clear and straightforward way to express domain expertise. ECA policies can take the form:
when event occurs and condition holds, then execute action.
That is, when 95 % of a Web server's response times begin to exceed 3 s and there are available resources, then increase the number of communication ports. In this example, the action is the definition of a state to be reached, that is, a state where the number of communication ports better suits the needs of the managed system. ECA rules and policy-driven adaptation have been intensely studied for the autonomic management of distributed systems. However, a difficulty with ECA policies is that when a number of policies are specified, conflicts between policies can arise that are hard to
detect. For example, when different tiers of a multi-tier system (e.g. Web and
application server tiers) require an increased amount of resources, but the available
resources cannot fulfil the requests of all tiers, a conflict arises. In such a case, it is
unclear how the system should react, and so in many cases an additional conflict-resolution mechanism is necessary, one that would, for example, give higher priority to the Web server. As a result, a considerable amount of research on conflict resolution has
arisen; the real challenge here is that conflict may only become apparent at runtime. A
pragmatic conclusion is that ECA rules are very effective when dealing with a small
number of policies or when concerns are orthogonal. When too many conflicts arise,
other formalisms have to be examined in order to avoid a debugging nightmare.
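A minimal ECA interpreter along the lines of the rule form above might look as follows. This is illustrative only: the event name, the condition over the state and the action are stand-ins, not a real policy language.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EcaRule:
    event: str                          # event name to match
    condition: Callable[[dict], bool]   # predicate over system state
    action: Callable[[dict], None]      # effector invocation


rules = [
    EcaRule(
        event="slow_response",
        # when 95 % of response times exceed 3 s and ports remain free...
        condition=lambda s: s["p95_response_s"] > 3.0 and s["free_ports"] > 0,
        # ...then increase the number of communication ports.
        action=lambda s: print("opening an extra communication port"),
    ),
]


def dispatch(event, state):
    for rule in rules:
        if rule.event == event and rule.condition(state):
            rule.action(state)


dispatch("slow_response", {"p95_response_s": 3.4, "free_ports": 2})
```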
Utility functions rely on the definition of a quantitative level of desirability of a
given system state and any subsequent actions upon that state. This measure of utility is expressed as a function and takes as input a number of parameters and outputs
a desirability rating of this state. Thus, as an example, the utility function could take
as input the current or predicted response time for a set of Web and application servers available to choose from, returning the relative utility of each combination
of Web and application server response times. This way, when insufficient resources
are available, the most desirable combination of available resources among Web and
application servers can be found. The major problem with utility functions is that
they can be extremely hard to define, as every aspect that influences the decision
must be quantified and combined into a single figure. Nevertheless, utility functions have been found to be very useful and have been employed in automatic resource allocation [17] and in the adaptation of data streams to network conditions [18], to name two examples. They are also very useful in highly dynamic environments where devices, for instance, come and go; utility functions have been used in intelligent homes to allow the autonomic manager to decide whether or not to select a given device to run a media stream [19].
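To illustrate, a utility function over candidate resource allocations might be sketched as below; the weights and the predicted response times are invented for the example.

```python
def utility(web_rt, app_rt, w_web=0.6, w_app=0.4):
    """Desirability of a configuration, given the predicted response
    times (seconds) of the Web and application tiers. Lower response
    times yield higher utility; weights encode relative importance."""
    return w_web / (1.0 + web_rt) + w_app / (1.0 + app_rt)


# Candidate allocations: predicted (web_rt, app_rt) pairs.
candidates = {
    "2 web / 1 app servers": (1.2, 2.5),
    "1 web / 2 app servers": (2.0, 1.0),
}
best = max(candidates, key=lambda k: utility(*candidates[k]))
print(best)  # the most desirable combination under this utility
```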
Goal policies require planning on the part of the autonomic manager and are thus more resource intensive than ECA policies. However, they still suffer from the problem that all states are classified as either desirable or undesirable. Thus, when a desirable state cannot be reached, the system does not know which among the undesirable states is least bad.
4.3.4 Planning
Let us now take a look at the planning aspect of the MAPE-K autonomic loop. In its
broadest sense, planning involves making a decision regarding the changes and
adaptations to assemble and implement on the managed artefacts in order to move
from a current to a desired state. To do so, a planner relies on a set of actions that
can be performed on the managed artefacts. Once again, we see the importance of
the link between the autonomic manager and the managed artefacts: action plans
depend on the effectors and effectors are put in place to carry out some desired
action plans. The planner should not consider the implementation details of the
actions. It is the purpose of the execution component to implement these actions, which are often realised in a dynamic computing environment (Fig. 4.8).
Fig. 4.8 The planning activity: based on the analysis results, the Plan activity produces an action plan that is handed to the Execute activity
A plan of action can be static or more dynamic. One plan could consist of a static
set of steps that must be carried out when a particular condition has occurred. Let us
take the example of a large grid system. If the monitoring side of an autonomic
manager detects that a node has died, the simple steps may be to inform the user and
then automatically reboot the node or if that does not work, then inform the workload
dispatcher to reroute the load away from this node. More sophisticated and dynamic
solutions may model the behaviour of the components that compose the managed
artefact and then choose a plan (from the number of plans that already exist) on the
fly or even generate a plan in real time by iterating through the model of different paths or scenarios and choosing the best. To do so, a planner makes hypotheses about the effects of the scheduled actions on the managed artefacts.
In autonomic computing, we also often make the assumption that the autonomic
manager is the only entity acting on the managed artefacts. This is not always true.
In pervasive applications, in particular, many things impacting the managed artefacts happen in the computing and usage contexts. The autonomic manager has to
regularly check its predictions about the effect of its actions and about the state of
the managed artefacts in order to verify the adequacy of the plan. In extreme cases,
it can be necessary to redo a complete MAPE-K loop in order to determine a new
objective and a way to reach it.
In any case, a planner has to anticipate the future and predict the effect of a given
course of actions and, furthermore, the data that should be monitored in order to
verify its predictions.
We then have two approaches to planning in autonomic computing: domain
specific or generic. Domain-specific approaches rely heavily on the administrators' expertise. Typically, the planning module is made of a number of rules taking the form when a target is set and a condition holds, then create plan, where the plan is entirely specified, or instantiated, in the rule. An example of such a rule would be when the number of communication ports has to be increased and there are available resources, then consume all the available resources and distribute them optimally. In this example, the action is the opening of new communication ports for each of the Web servers. These rules are typically written by system administrators and
derived from system and business goals. Writing adaptation policies is fairly
straightforward but can become a tedious task for larger complex systems, especially when conflicts have to be dealt with.
Generic approaches are much more ambitious. The idea here is to formally
express the problem, that is, the notion of state, and to define action operators acting
on the states. Operators are generally defined with preconditions and effects on the
state. Planning then comes down to determining a sequence of operators allowing
the passage from the current state to the target state. It generally takes the form of a
graph search, either with or without heuristics.
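A toy version of this operator-based planning is given below: states are sets of facts, operators carry preconditions and effects, and a breadth-first search finds an operator sequence from the current state to the target state. All facts and operators here are invented.

```python
from collections import deque

# Each operator: name, preconditions, facts added, facts removed.
OPERATORS = [
    ("start_server", {"server_stopped"}, {"server_running"}, {"server_stopped"}),
    ("open_port",    {"server_running"}, {"port_open"},      set()),
]


def plan(current, goal):
    """Breadth-first search for a sequence of operators turning the
    current set of facts into one that satisfies the goal facts."""
    start = frozenset(current)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for name, pre, add, rem in OPERATORS:
            if pre <= state:
                nxt = frozenset((state - rem) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None  # target state unreachable with these operators


print(plan({"server_stopped"}, {"port_open"}))
# -> ['start_server', 'open_port']
```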
An architectural model of either a focused part of, or indeed the entire managed
system, is often used to formalise the current state. This architectural model reflects
the system's behaviour, its requirements and the system states required to reach its
goals. The model may also include aspects of the operating environment in which
the managed elements are deployed. Here, the model is updated through sensor data
and used to reason about the managed system to plan valid and appropriate adaptations. A great advantage of the architectural model-based approach to planning is
that under the assumption that the model correctly mirrors the managed system, the
architectural model can be used to verify that system integrity is preserved when
applying an adaptation. That is, we can guarantee that the system will continue to
operate correctly after the planned adaptation has been executed. This is because
changes are planned and applied to the model first, which will show the resulting
system's state, including any violations of system constraints or requirements present in the model. If the new system state is acceptable, the plan can then be executed
on the managed system.
Building a model of the system in question is, however, a non-trivial task. It assumes that the architect understands the components, their interactions and their behaviours, to ensure accuracy. Further, the model needs to be able to run through the different adaptation scenarios to check that an update is both useful and safe. Given the number of states and each state's interactions, searching all interactions is a highly complex problem of exponential proportions. This may mean that the model and the system are highly decoupled; for example, the model may run on a different machine so as not to impact the managed system's operation. This processing can potentially incur heavy execution costs. The timeliness of the adaptation is important, so heuristics may be used to speed up the time the model takes to reach an optimal solution, although these may add error to the model.
4.3.5 Execution
Let us now examine the fourth and last activity of the MAPE-K loop. The purpose
of the execution activity is to implement the management actions determined by
the planning activity. Management actions essentially concern the managed artefacts, not the computing context. The purpose of an autonomic element, indeed, is
not to modify the environment but to react to its evolutions when they affect its
behaviour.
Fig. 4.9 The execution activity: the Execute activity carries out the action plan on the managed artefacts through effectors, observing the results and the context
Planning and execution are complementary activities. Planning focuses on high-level actions to be undertaken, on their logical dependencies and, possibly, on their
order. Execution is much more concrete; it has to schedule the implementation of the
plans as they directly affect the artefacts currently running. It also has to examine in
real time the effects of its own actions in order to perform some adjustments if necessary (Fig. 4.9).
Dissociating planning and execution has been heavily investigated in artificial
intelligence but also in more traditional domains, like manufacturing execution
systems (MES), for instance, in order to deal with complex environments. Separating
out these activities is an efficient way to handle dynamic, stochastic or poorly
observable computing contexts. The principle adopted here is to work at two complementary levels of abstraction. A plan, for instance, could specify a set of parameters to be changed, with no ordering constraints. The execution activity, then, has
to determine how and when the parameters have to be changed. It is a matter of
timeliness and synchronisation where a number of functional and non-functional
dependencies have to be considered. Usually, a parameter can be changed only if some
conditions hold. When several parameters have to be modified, ordering constraints
usually have to be respected.
Planning has to make simplifying assumptions about the dynamicity and
predictability of the world in order to be able to produce plans at a reasonable
cost. It is then the purpose of the execution module to get back to reality and carry
out the plans in the real world. That means that it has to transform more or less abstract directives into concrete interventions in the real world. It may also make use
of available sensors to get feedback about its actions. In this way, the execution
module implements a control loop of its own. Of course, the purpose here is not to
replace the global MAPE-K loop but rather to make sure that the management
actions decided by the planner are carried out as expected. If not, corrections have
to be made, perhaps re-engaging the planner or even executing the whole MAPE-K loop again.
Clearly, time management is at the heart of the execution activity. In order to
avoid unstable or incorrect situations, management actions have to be undertaken at
the right time and in the right order. This is made more complicated in dynamic
environments where the computing environment has to be constantly surveyed
while corrective actions are under execution. For example, in pervasive environments, the way some management actions are realised may have to be changed
because of context evolution. Some actions may even have to be cancelled when
important changes occur in the environment. One way to deal with this is to use a
finite-state machine to orchestrate plan execution in dynamic environments. Such
machines allow for explicit and efficient coordination between corrective actions,
and also with ongoing operations. This approach has been used successfully for the
self-administration of industrial devices.
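A finite-state machine for orchestrating plan execution might be sketched as follows; the states and events are hypothetical. The point is that each action is launched only when the machine is in a state where it is safe to do so, and context events can divert or cancel the plan.

```python
# Transition table: (state, event) -> next state.
TRANSITIONS = {
    ("idle",        "plan_received"):   "preparing",
    ("preparing",   "ready"):           "executing",
    ("executing",   "action_done"):     "executing",   # next action
    ("executing",   "plan_done"):       "idle",
    ("executing",   "context_changed"): "reassessing",  # survey environment
    ("reassessing", "still_valid"):     "executing",
    ("reassessing", "invalid"):         "cancelled",    # abort the plan
}


class PlanExecutor:
    def __init__(self):
        self.state = "idle"

    def on_event(self, event):
        key = (self.state, event)
        if key in TRANSITIONS:
            self.state = TRANSITIONS[key]
        return self.state


ex = PlanExecutor()
for ev in ("plan_received", "ready", "action_done",
           "context_changed", "invalid"):
    print(ev, "->", ex.on_event(ev))
```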
Another way for the execution module to deal with dynamicity and uncertainty
is to demand flexible plans from the planner so that it can recover from unexpected events. An analogy can be drawn here with the field of autonomous robots immersed in uncertain environments, where static plans of actions turned out not to be exploitable. Robots are instead loaded with partial plans that are
completed at runtime depending on the situations encountered. Reactive plans
have also been introduced after the failure of early static approaches like STRIPS
[20]. Reactive plans include branches in order to deal with uncertain events or
events that can only be known at runtime. Reactive plans can also be combined.
Thus, the directives sent to the execution module can take the form of a number
of reactive plans, including runtime contingencies.
By definition, the execution module interacts with the effectors of the managed
elements and should have no real control over external entities. But, it may also
interact with some other accessible entities in the computing environment. The
interactions can go from an authorised modification, through a simple setter interface, for instance, to some complex negotiations. Thus, to meet its self-management
objectives, the autonomic manager may have to request modifications from other entities. This can include other autonomic managers controlling some other parts of the software at hand. For example, some managed artefacts using threads may need additional threads to improve their performance; threads, however, are typically a global concern and are generally managed by the operating system.
A final point is that effectors, like any other computing entities, can fail. This
means that the actions requested by the autonomic manager are not carried out. This
can happen because of a bug, local to the effectors, or because of the global situation
of the managed artefacts. Ironically, such a failure can be very well related to the
issue that the manager tries to solve. For example, if the artefact was performing poorly due to a lack of available memory, this same lack may prevent the autonomic manager from running a new process to effect the change.
Thus, the implementation of the execution activity often turns out to be very
complex and tricky. For all these reasons, current solutions are essentially domain
specific. Of course, they can rely on generic mechanisms, usually provided by the
underlying execution machine, in order to control the artefacts' life cycle. But complex timing and synchronisation issues are generally handled case by case.
4.3.6 Summary
The MAPE-K model has had a big impact on the autonomic computing field and is
still very relevant. It has to be understood, though, as a logical architecture defining
the main architectural blocks needed when building an autonomic manager.
From that, depending on the specifics of each application, different implementations of the MAPE-K model are possible: from a monolithic approach to widely
distributed ones.
In any case, the MAPE-K model gives no indication about the way the aforementioned tasks should be implemented nor on the way they should be organised and
controlled. Similarly, this model does not address the way the knowledge is represented and shared between the different tasks.
Let us consider knowledge first. The way knowledge is shared among the different
activities is not specified in the MAPE-K loop, and it leaves open many solutions,
including a global shared database or a completely distributed solution based on the
exchange of events. So knowledge is essentially represented in the models that
correspond to the following: the managed element and its interactions; the classification and feature extraction systems; the effectors or actions that have to be
performed (and when they are performed); the plans, etc. Knowledge lies also in
recording the changes to the system that occurred when the system was adapted, and some systems may wish to close the loop in this respect to allow for further, more sophisticated analysis and planning strategies that can improve their operation
based on past experiences. Information is required to flow through the system.
This means that the managed element must be able to export interfaces to allow the
flow of attributes that represent both functional behaviours and control procedures.
As can be seen from the examples above, autonomic management operates at many
levels of abstraction. Therefore, the MAPE-K loop can be a combination of loops
and loops of loops, as we shall see later in this chapter. The knowledge in an
autonomic system can come from sources as diverse as the human expert (in static
policy-based systems [21]) to logs that accumulate data from probes charting the
day-to-day operation of a system to observe its behaviour, which is then used to
train predictive models [22, 23].
Control is clearly a hard point. Complex control strategies are often needed to
allow the right coordination and an effective synchronisation between the monitor/
analyse/plan/execute activities. These strategies depend on the application domain.
They can go from a simple state machine controlling the activation of the activities
to an AI-based controller allowing opportunistic activation of these activities [24].
As we can see, using the MAPE-K model is a good starting point to define one's
autonomic manager. However, its structure and design are not prescriptive, and
therefore, the designer may find that they tailor the MAPE-K concepts to best fit the
system they are designing; the rest of this chapter gives some examples.
4.4

4.4.1 Introduction

Fig. 4.10 An autonomic system composed of autonomic elements that receive goals and provide feedback

Fig. 4.11 A hierarchical autonomic system: a top-level autonomic element passes goals to, and receives feedback from, subordinate autonomic elements
Two architectural patterns, heavily studied in the multi-agent field, are presented in the next sections with examples from the autonomic field.
4.4.2

Fig. 4.12 A decentralised autonomic system: autonomic elements receive goals and provide feedback, and cooperate directly with one another
124
knowledge to understand that a fire has broken out, and it wishes to relay the message to the appropriate people. This may require it to quickly negotiate a wireless
route to the sink (away from the fire) via all the other sensing devices by appropriating their respective transceivers. To do this, they must communicate directly
and negotiate with all the neighbourhood autonomic managers and agree to serve
this data route.
As autonomic management solutions become more decentralised and less deterministic, we may begin to observe emergent features. That is, even though the interactions between autonomic components are simple, the system's complexity
increases considerably and we begin to observe the evolution of patterns that can
either be desirable or not. To harness this, there have been moves to engineer this
emergence. Taking the bio-inspiration of autonomic systems further than the original definition, the idea of engineered emergence can be described as the purposeful
design of interaction protocols so that a predictable, desired outcome or set of outcomes is achieved at a higher level [29]. It is a highly distributed approach to
building autonomicity into systems. The main benefits are that of scale, robustness
and stability. This is because there is no central management function; each managed component acts on its own behalf. Therefore, if one managed component dies,
then the system should be able to cope gracefully. Likewise if a new, better component
arrives, the system should evolve to make use of this. The other benefit of such
approaches is that they do not require precise knowledge of lower-level activity or
configuration. In such systems the solution emerges at the level of systems or applications, while at lower levels the specific behaviour of individual components is
unpredictable.
Here, typically, a small set of rules operates on limited amounts of locally available information concerning the component's execution context and its local environment. This differs from the traditional algorithmic design of distributed applications
that typically focuses on strict protocols, message acknowledgments and event
ordering. In traditional systems each message and event is considered important and
randomness is generally undesirable, imposing sequenced or synchronised behaviour, which is generally deterministic. However, natural biological systems are fundamentally non-deterministic, and there are many examples of large-scale systems
that are stable and robust at a global level, the most commonly cited examples being
drawn from cellular systems and insect colonies [29]. The main drawback to such
systems is that they need regular messaging between components to be able to comprehend their state and adapt if necessary. Furthermore, guarantees are not as solid
in such emergent systems, and convergence to a desired state may take time.
4.4.3
its utility as a measure of things like the amount of resources available to the user
(or user application programs) and the quality, reliability or accuracy of that
resource. For example, in an event processing system allocating hardware resources
to users wishing to run transactions, the utility will be a function of allocated rate,
allowable latency and number of consumers [30]. Another example is
in a resource provisioning system where the utility is derived from the cost of redistribution of workloads once allocated or the power consumption as a portion of
operating cost [31, 32].
The autonomic networked services (ANS) protocol was designed to allow fully distributed autonomic decision-making where there is more than one alternative state to adapt to. In ANS, a computing system is completely composed of services (not unlike services in service-oriented architectures; see Chap. 6).
Each service is called a context and is able to say what it does and to what degree it
is able to do this. This ability is termed its quality of context.
The protocol was designed for pervasive computing devices and wireless sensor
networks. Constrained resources pose the greatest challenge to the use of such systems. Power availability is severely constrained due to the capacity of current battery technology. Sending and receiving wireless communication is the greatest
consumer of power in these systems, and so must be limited as much as possible.
The applications also impose other constraints such as robustness, scalability, stability despite configuration change and low communication latency.
So the application may wish to have location information. This can be obtained
from a number of devices. The notion of quality of context (QoC) is used by ANS
to choose which of these devices will be used when a request is made by a requester
node. A process called tendering is used to select the node. The information
requester node will broadcast a request command containing the name of a sensing service (like location) and preferences for the QoC attributes (which is essentially the degree of precision the application can tolerate). Every node within range
and able to fulfil the service must respond. Each node uses a utility function to calculate its ability to fulfil the request if it matches the requested QoC. The result from the utility function is a signed integer called closeness and is used as the node's
response to the information request. The sensor node with the QoC closest to that
requested wins the tender and becomes the sensor supplying its data to the information requester. Frequent retendering allows the requester nodes to autonomously
adapt to the best source available, discover new devices in the network and recover
from node failure. Figure 4.13 shows message diagrams illustrating how the protocol fulfils the typical autonomic aspects.
The figure also shows the system that requires a service being turned on; its first task is to broadcast a tender message. Any sensors (service providers) that are on and able to service that request reply with a can_provide message; the service with the best utility in terms of quality of context is selected (if there is more than one). For self-healing, we can see that the requestor carries out periodic tenders. If the node that it
is currently bound to says that it can no longer carry out the task to the requested
level of service, or it dies (i.e. no reply is received from that node), then reconfiguration should happen. This happens by default: since all nodes in the system have the opportunity to reply to the request and indicate their performance levels, the requestor can simply select the next best service. Likewise, if the requestor receives a better offer when carrying out its periodic tender process, it will take it. This sort of protocol is highly decentralised and, using lightweight tender messages, can ensure autonomic properties.

Fig. 4.13 ANS message diagrams for self-configuring, self-healing and self-optimisation: a temperature service requestor broadcasts tender messages; temperature sensors of high or medium precision reply with can_provide messages; the requestor answers the best offer with select and receives ok; it re-tenders periodically, reconfiguring when the current provider replies not OK, fails to reply or is outbid by a better offer
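The tendering exchange can be mimicked in a few lines. The closeness computation below is a stand-in (the text only says it is a signed integer derived from a utility function), and node behaviour is simplified to a single broadcast round.

```python
# Each node advertises a service and its quality of context (QoC),
# here reduced to a single "precision" figure.
NODES = [
    {"id": "sensor-A", "service": "temperature", "precision": 0.5},
    {"id": "sensor-B", "service": "temperature", "precision": 0.1},
]


def closeness(node, requested_precision):
    """Stand-in utility: a signed integer, larger meaning closer to
    the requested QoC."""
    return -round(abs(node["precision"] - requested_precision) * 100)


def tender(service, requested_precision):
    """Broadcast a request; every capable node replies; select the
    node whose QoC is closest to that requested."""
    replies = [(closeness(n, requested_precision), n["id"])
               for n in NODES if n["service"] == service]
    if not replies:
        return None
    return max(replies)[1]  # winner of the tender


print(tender("temperature", requested_precision=0.1))  # sensor-B
# Re-running tender() periodically lets the requestor adapt to new
# or recovered nodes and route around failed ones.
```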
4.5 Key Points

References
1. Garlan, D., Perry, D.E.: Introduction to the special issue on software architecture. IEEE Trans. Softw. Eng. 21(4), 269–274 (1995)
2. Dey, A.K.: Understanding and using context. Pers. Ubiquit. Comput. 5(1), 4–7 (2001)
3. IBM: An Architectural Blueprint for Autonomic Computing, 3rd edn. IBM Whitepaper, June 2005
4. Garlan, D., Schmerl, B.: Model-based adaptation for self-healing systems. In: Proceedings of the First Workshop on Self-Healing Systems. ACM Press, Charleston, SC (2002)
5. Sterritt, R., Smyth, B., Bradley, M.: PACT: personal autonomic computing tools. In: Proceedings of the 12th IEEE International Conference and Workshops on the Engineering of Computer-Based Systems (ECBS), pp. 519–527. IEEE Computer Society, Washington, DC, USA (2005)
6. Bigus, J.P., Schlosnagle, D.A., Pilgrim III, J.R., Mills, W.N., Diao, Y.: ABLE: a toolkit for building multiagent autonomic systems. IBM Syst. J. 41(3), 350–371 (2002)
7. Maurel, Y., Lalanda, P., Diaconescu, A.: Towards a service-oriented component model for autonomic management. In: IEEE International Conference on Services Computing (SCC 2011), 4–9 July 2011. IEEE Computer Society, Washington, DC, USA (2011)
8. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)
9. Littman, M.L., Ravi, N., Fenson, E., Howard, R.: Reinforcement learning for autonomic network repair. In: ICAC: Proceedings of the First International Conference on Autonomic Computing, pp. 284–285, Washington, DC (2004)
10. Dowling, J., Curran, E., Cunningham, R., Cahill, V.: Building autonomic systems using collaborative reinforcement learning. Knowl. Eng. Rev. 21, 231–238 (2006). Journal special issue on Autonomic Computing, Cambridge University Press
11. Tesauro, G., Das, R., Jong, N., Bennani, M.: A hybrid reinforcement learning approach to autonomic resource allocation. In: Proceedings of the 3rd IEEE International Conference on Autonomic Computing (ICAC), pp. 65–73, Dublin, Ireland (2006)
12. Whiteson, S., Stone, P.: Evolutionary function approximation for reinforcement learning. J. Mach. Learn. Res. 7, 877–917 (2006)
13. Kephart, J.O., Chess, D.M.: The vision of autonomic computing. Computer 36(1), 41–50 (2003)
14. Zhang, J., Figueiredo, R.: Autonomic feature selection for application classification. In: Proceedings of the International Conference on Autonomic Computing (ICAC), Dublin (2006)
15. Agarwala, S., Chen, Y., Milojicic, D., Schwan, K.: QMON: QoS- and utility-aware monitoring in enterprise systems. In: Proceedings of the 3rd IEEE International Conference on Autonomic Computing (ICAC), Dublin, Ireland (2006)
16. Kephart, J.O., Walsh, W.E.: An artificial intelligence perspective on autonomic computing policies. In: Proceedings of the 5th IEEE International Workshop on Policies for Distributed Systems and Networks (POLICY), 7–9 June 2004, pp. 3–12. IBM Thomas J Watson Research Center, Yorktown Heights, New York (2004). doi:10.1109/POLICY.2004.1309145. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1309145&isnumber=29053
17. Walsh, W.E., Tesauro, G., Kephart, J.O., Das, R.: Utility functions in autonomic systems. In: Proceedings of the First International Conference on Autonomic Computing, 17–19 May 2004. IEEE Computer Society, New York (2004)
18. Bhatti, S.N., Knight, G.: Enabling QoS adaptation decisions for internet applications. Comput. Netw. 31(7), 669–692 (1999)
19. Bourcier, J., Diaconescu, A., Lalanda, P., McCann, J.: AutoHome: an autonomic management framework for pervasive home applications. ACM Trans. Auton. Adapt. Syst. 6(1) (2011)
20. Fikes, R., Nilsson, N.: STRIPS: a new approach to the application of theorem proving to problem solving. Artif. Intell. 2, 189–208 (1971). doi:10.1016/0004-3702(71)90010-5
21. Bougaev, A.A.: Pattern recognition based tools enabling autonomic computing. In: Proceedings of the 2nd IEEE International Conference on Autonomic Computing, 13–16 June 2005, pp. 313–314. IEEE Computer Society, Seattle (2005)
22. Manoel, E., Nielsen, M.J., Salahshour, A., Sampath, S.: Problem Determination Using Self-Managing Autonomic Technology. IBM Redbooks, San Jose (2005). ISBN 073849111X
23. Shivam, P., Babu, S., Chase, J.: Learning application models for utility resource planning. In: Proceedings of the 3rd IEEE International Conference on Autonomic Computing (ICAC), pp. 255–264, Dublin, Ireland (2006)
24. Maurel, Y., Lalanda, P., Diaconescu, A.: Towards a Service-Oriented Component Model for Autonomic Management. IEEE SCC, Washington, DC, USA (2011)
25. Wise, A., Cass, A.G., Lerner, B.S., Call, E.K.M., Osterweil, L.J., Sutton Jr., S.M.: Using Little-JIL to coordinate agents in software engineering. In: Automated Software Engineering Conference, 11–15 September 2000. IEEE Computer Society, Grenoble (2000)
26. Jennings, N.R.: On agent-based software engineering. Artif. Intell. 117(2), 277–296 (2000)
27. Gleizes, M.-P., Link-Pezet, J., Glize, P.: An adaptive multi-agent tool for electronic commerce. In: Proceedings of the IEEE 9th International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, 14–16 June 2000. NIST, USA, IEEE Computer Society (2000)
28. Kumar, S., Cohen, P.R.: Towards a fault-tolerant multi-agent system architecture. In: Proceedings of the Fourth International Conference on Autonomous Agents. ACM Press, Barcelona (2000)
29. Anthony, R.: Emergent graph colouring. In: Engineering Emergence for Autonomic Systems (EEAS), First Annual International Workshop at the Third International Conference on Autonomic Computing (ICAC), June 2006, pp. 2–13. IEEE Computer Society, Dublin (2006)
30. Bhola, S., Astley, M., Saccone, R., Ward, M.: Utility-aware resource allocation in an event processing system. In: Proceedings of the 3rd IEEE International Conference on Autonomic Computing (ICAC), pp. 55–64, Dublin, Ireland (2006)
31. Osogami, T., Harchol-Balter, M., Scheller-Wolf, A.: Analysis of cycle stealing with switching times and thresholds. Perform. Eval. 61(4), 347–369 (2005)
32. Sharma, V., Thomas, A., Abdelzaher, T., Skadron, K., Lu, Z.: Power-aware QoS management in web servers. In: RTSS '03: Proceedings of the 24th IEEE International Real-Time Systems Symposium, p. 63. IEEE Computer Society, Washington, DC, USA (2003)
Monitoring can be seen as putting the self into self-management. Just as in psychology the self is the representation of one's experience or one's identity, in autonomic computing the data obtained from monitoring contributes to the representation of the system's experience or current state: self-knowledge, if you like. Knowing the system state, from both a functional and a non-functional perspective, is fundamental to being able to perform the operations necessary to achieve system goals at the desired level.
To maintain the analogy, just as a human can become self-conscious, that is, excessively conscious of one's appearance or manner, leading to suboptimal functioning, so too can an autonomic system. This occurs where there is too much monitored data, or where the understanding of that data is erroneous or unclear, meaning that the system is trying to change but does not know how to. Therefore, there have been a number of approaches to the monitoring of autonomic computing systems, the aim being to minimise the intrusiveness of the monitoring function while ensuring sufficient system self-awareness to optimise decision-making.
This section will focus on the monitoring function. To this end, we focus on the establishment of absolute, measurable technical metrics that represent the performance or state of the system. This data can then be processed and the conclusions used to determine whether or not a system is meeting its quality levels or fulfilling a contractual obligation at much higher levels of abstraction.
5.1 Introduction to Monitoring
To monitor (vb): to watch, keep track of, or check usually for a special purpose (Merriam-Webster online dictionary, http://www.merriam-webster.com/dictionary/monitoring); to watch and check a situation carefully for a period of time in order to discover something about it (Cambridge Advanced Learner's dictionary, http://dictionary.cambridge.org/dictionary/british/monitor_5).
Considering the actual architecture of an autonomic manager, the monitoring function requires at least two types of components: sensor touchpoints for extracting the
actual data and monitor components for integrating the data into self-management
control loops. Chapter 4 introduced essential sensing and monitoring concepts and
showed how these functionalities integrate into the logical MAPE-K architecture.
An important remaining question relates to the manner in which raw monitoring
data is actually transformed into information and knowledge that are relevant to the
manager's reasoning processes. Depending on the constructed model types and the
reasoning processes that use them, information may be required from the monitoring
function in different formats or levels of abstraction. For example, managing the
performance of a computer cluster may require fine-grained measurements of each
server's resource consumption as well as aggregated measurements of the cluster's
overall load. Moreover, such information may be provided either in a system-specific
format, such as the concrete CPU consumption, or as a domain-specific indicator, like
a critical state signal. In such cases, the monitoring function must additionally process
collected data so as to deliver it to the decision processes with the expected format and
semantics. Similarly, correlating causes and effects in order to form an efficient feedback loop may require the monitoring function to associate data of different natures
and over various periods, for example, it could confirm that an action has been successfully executed at a certain instant and observe the outcome effects at a later time.
Hence, in addition to collecting and delivering raw information, the monitoring
function may initiate various data-processing and scheduling operations, including
aggregation, filtering and synchronisation, for obtaining more abstract, domain
specific indicators. Here, the boundaries between the monitoring and analysis components of the logical MAPE-K architecture may become blurred, depending on the
concrete architectural solution for each particular system. In some cases, monitoring and analysis can remain as clearly separated as indicated in the MAPE-K design.
Other systems may employ data-mediation solutions to simultaneously perform
monitoring and analysis operations and provide state information from different
perspectives and at various abstraction levels. Chapter 9 provides an example data
mediation framework, Cilia mediation, that can be employed for such purposes.
Finally, certain solutions may prefer to use external analysis services [1],2 for processing large amounts of collected information. Indeed, efficiently extracting and managing logging information can represent a research topic in itself [2].
Monitoring represents a vast topic encapsulating various subjects and raising
requirements that differ quite significantly depending on the administered system
and targeted objectives. In this chapter, we generally concentrate on performance
monitoring, as an important QoS concern commonly specified in Service-Level
Agreements (SLAs). This provides an illustrative example of the main concerns
raised by system monitoring.
2
An increasing number of log management services are becoming available to deal with the
increasingly large volumes of system monitoring data. These include open-source solutions,
such as GrayLog2 (http://graylog2.org), LogStash (http://logstash.net) or Sentry (http://sentry.
readthedocs.org/en/latest), and commercial services, including LogEntries (https://logentries.com),
Sumologic (http://www.sumologic.com), Loggly (http://loggly.com) or Splunk Storm (https://
www.splunkstorm.com).
5.2 Performance Monitoring
Computer performance monitoring is an activity that has been carried out since the
invention of computing and, put simply, is the measurement of how well the system
is doing what it was designed to do. Traditionally performance data was available to
those interested via system performance logs, and a large range of performance
analysis tools were developed to allow the user to make sense of those logs, mostly
using statistical analysis techniques. In the early days of autonomic computing,
these logs were also used to help drive autonomic management, for example, by IBM's Log Trace Analyzer (LTA), which was part of the autonomic computing toolkit [3].
Many computer performance metrics have been established to allow us to
communicate the system's non-functional state, and benchmarks have been developed
to allow us to compare different systems or versions of a given system or algorithms.
Initially, processing speed was deemed the most important performance metric,
but as more components were added to the computing infrastructure, new metrics that
better represent their performance, or some notion of end-to-end performance, have
since been derived. The major metrics are those that provide a notion of throughput
(work done over time), component utilisation, or the time to carry out a particular task
(response time is one example). The most popular general performance metrics are:
Instructions per second (IPS): This is essentially a measure of processing speed; this metric represents the number of predefined (usually artificial) instruction sequences that can be run over a given unit of time for a particular CPU. This metric is useful for comparing CPUs; however, it has been criticised because the instruction sequences may not be representative of the workload that the CPU will undertake. Further, other components that interact with the processor may have a large impact on that processor's performance, and therefore decoupling CPU performance measurements from the performance metrics of other components may mean IPS becomes meaningless. Perhaps the most used metric in this class is MIPS, millions of instructions per second, which has been used to describe the capacity of a machine since its inception in the 1970s.
Floating-point operations per second (FLOPS): This metric is similar to IPS, above, as it also represents a notion of processing throughput. It was primarily designed to make comparisons between machines designed to carry out the heavy mathematics required by scientific applications, hence the focus on floating-point operations in particular. As with IPS, a predefined set of instructions composes a benchmark that, in some cases, has been standardised to permit fair comparisons. Interestingly, this metric is used to compare supercomputers in the yearly Top500 competition. Currently the top machines are running in the petaFLOPS (10^15 FLOPS) region of performance.
Response time: This metric represents the length of time a system takes to carry out a functional unit of processing. It is a measurement of the time taken to react to a given input and is predominantly used in interactive systems. One example would be the time at which the first character of the response appears on the computer screen after the user hits return to initiate some processing; a computer game would become unplayable if the user were
to experience response-time lag. Responsiveness is also a metric used in measuring real-time systems; here, the elapsed time from the moment the real-time task (or thread) is ready to execute until the moment it finishes is measured.
Latency: This is a measure of the time delay experienced in a system and is generally used when describing the elements of the computer that concern communication. For example, to establish an idea of network performance, one could measure the round-trip delay, whereby the time from when a packet is sent until it is received back at the same machine is measured. Latency takes into account not only the CPU (or CPUs) that processes the packet but, more importantly, the queuing delays incurred during the packet's trip.
Utilisation and load: These metrics are intertwined and primarily used to understand the resource management function. Utilisation metrics provide a notion of how well a given system component is being used and are described as a percentage. For example, a CPU running at 90 % utilisation means that over a given window of time, the CPU was used for 90 % of that time. Load, too, is a measure of work performed by the system and is usually reported as a load average for a given period of time. The load metric is typically calculated as a weighted average of the processes in the operating system's run queue over time; so it does not only measure the work carried out by the CPU but gives an indicator of how well that CPU is serving the jobs scheduled for it. For example, one could have a CPU that is 99 % utilised with no tasks ready and waiting to run; another could also be 99 % utilised but have a large number of tasks in the run queue, which represents a much heavier load.
There are many more performance metrics that can be used. Some may focus on costs, such as transactions per unit cost, or on a function of reliability (the length of time the system has been in operation without crashing) or of availability, indicating that the system is ready to be used when needed. Others may focus on the size or weight of the system, which indicates how portable the device is; smartphones or laptops are compared using such metrics.
With the recent interest in green computing, other cost metrics are coming to the fore in the form of measures of energy efficiency. In particular, performance per watt is being measured and represents the rate of computation delivered by a machine for each watt of power consumed. The computation rate is typically the number of FLOPS or MIPS achievable per watt. However, given the cost of cooling larger systems, the total energy cost may also be included. Correspondingly, the heat generated by machine components is also being used as a performance metric. As with other performance indices, a benchmark may be developed for comparative purposes. One such benchmark is the GCPI (Green Computing Performance Index), which has evolved from the high-performance computing community's industry-standard performance benchmarks. These performance metrics mostly focus on the performance of larger server systems and systems housed in data centres; however, at the mobile and laptop end of the computing field, energy consumption has been a focus for some time due to such devices being battery powered.
There are many raw performance metrics that enable us to better understand the non-functional state of the system or provide a measurable means to detect an event that has occurred in the system and changed its non-functional behaviour. However, single metrics measured in isolation may not be enough to understand the system; instead, performance may need to be derived from combinations of such metrics. An example would be an autonomic system required to balance the conflicting goals of maximising performance while minimising energy consumption. This system may have a rule that says: if the CPU is saturated (exhibits a high utilisation), start a new virtual machine by bringing up a new physical machine, move some of the jobs to the new machine and then carry out load balancing between the two. If the system sees that its CPU is 99 % utilised, it may conclude that the CPU is saturated. However, if on examining the load average it is found to be low, bringing up a new physical machine in this instance would be costly in terms of energy consumption (and performance); CPU saturation in isolation is therefore not a good indication of current performance. Combinations of metrics are thus required to better understand what is going on in the system and to make more informed decisions.
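As a concrete illustration, the following is a minimal sketch of such a combined-metric rule; the class name and the two thresholds are illustrative assumptions, not taken from any particular system:

// Minimal sketch: scale out only when utilisation AND load agree.
// The names and thresholds are illustrative assumptions.
class ScalingRule {
    private static final double UTIL_THRESHOLD = 0.95; // 95 % utilisation
    private static final double LOAD_THRESHOLD = 2.0;  // 1 min load average

    /** Decide whether to bring up a new machine. */
    boolean shouldScaleOut(double cpuUtilisation, double loadAverage) {
        // A busy CPU with an empty run queue is not saturation: both
        // indicators must agree before paying the energy cost of
        // starting a new physical machine.
        return cpuUtilisation >= UTIL_THRESHOLD
            && loadAverage >= LOAD_THRESHOLD;
    }
}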
5.3
One starting point is to understand the goals of the autonomic manager. If its job is to ensure the system will run with 90 % uptime, then one would start by identifying the possible causes of failure. This means that we need a detailed understanding of the system's components and their interactions. Some of those components may indeed be legacy systems themselves, and some may be under the jurisdiction of another business concern and provide a service to the system under focus. Therefore the software that has the potential to fail is embedded throughout potentially many subsystems.
It is therefore paramount that some form of model of the system is established so that this problem can be tackled through a divide-and-conquer approach. Hierarchy-based performance monitoring and diagnosis is a concept that has been around for some time [4, 5]; as far back as 1991, architectural models were being used to simplify this process. Hierarchical models that can be used today include state transition or other architecture models [6, 7] and use cases [8]. Both represent the flow of logic, data and process activations, capturing both the structure and behaviour of the system. From this, one can identify where potential failures may occur. However, in larger systems, it is a more difficult task to identify the source(s) of error even when using such modelling tools. It is nevertheless valuable to be able to identify failure early so that amends can be made.
An example of a hierarchical approach comes from Haydarlou et al. (see Fig. 5.1). They use a combination of abstraction and use cases to establish what to monitor and when [8]. They divide the system into levels, which helps with large and/or complex systems. At the highest level, the application is described as being composed of a number of communicating subsystems, which they call runnables. At the next, lower level, each subsystem is composed of a number of components, and finally, each component is composed of a number of classes and methods. The use cases then represent the flow of logic or interactions between runnables. An example [8] is a secure portal authentication request, described at the application level of abstraction below.
In the example in Fig. 5.1, a portal application is accessed by business users via a Web browser; they provide their certificates to the Access Manager subsystem using HTTPS negotiation. When the certificate has been received, the Access Manager verifies it and passes it over a JRMI connection to the Business Integrator subsystem, which in turn communicates with the Database Manager subsystem via JDBC. In the Business Integrator component, the user's identity is extracted and matched against the user's password to produce login information, which is sent to the Business Manager subsystem (a legacy back-end system reached over a SOAP connection). The Business Manager authenticates the user and returns the result of the authentication to the Business Integrator, which passes the result back through the Access Manager to the browser.
In this example, the architecture is modelled at a level of abstraction that shows the interactions between components; see Fig. 5.1. Haydarlou et al. then move down a level of abstraction in the hierarchy to focus on these interactions, because this is where the potential for failure resides, for example, a broken connection, an incorrect start-up sequence of runnables or excessive heap usage. In this example, the monitoring task will therefore be positioned to report on connections between these runnables; in this case, sensor code is automatically generated to monitor the output state of the invocations between the components. Another example that is supported by this approach sits at a lower level of abstraction still. Here the occurrence of a given event (e.g. a NullPointerException) can be monitored, in a similar way to how we carry out exception handling. The self-monitoring engine again generates sensor code that gathers information (such as time-stamp, stack trace, name and line number) about the exception, which in turn is passed to the analysis element of the MAPE-K loop. To monitor state changes, the system may choose to sense a value before and after the instantiation of a component for comparative purposes. The system that Haydarlou et al. propose automatically places sensors within the managed elements' (i.e. the runnables') code; they also permit complementary user-specified sensor placement.
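To make the exception-sensing idea concrete, the sketch below shows the kind of code such a generated sensor might amount to. It is a minimal hand-written illustration, not Haydarlou et al.'s actual generated code; the Monitor interface and reporting format are assumptions:

// Minimal sketch of an exception sensor: it wraps a monitored call,
// gathers time-stamp, exception name and origin (class and line number),
// and forwards them to the analysis element of the MAPE-K loop.
interface Monitor { void report(String event); }

class ExceptionSensor {
    private final Monitor monitor;

    ExceptionSensor(Monitor monitor) { this.monitor = monitor; }

    void observe(Runnable monitoredCode) {
        try {
            monitoredCode.run();
        } catch (RuntimeException e) {   // e.g. a NullPointerException
            StackTraceElement origin = e.getStackTrace()[0];
            monitor.report(System.currentTimeMillis() + " "
                    + e.getClass().getName() + " at "
                    + origin.getClassName() + ":" + origin.getLineNumber());
            throw e;  // the sensor observes; it does not handle
        }
    }
}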
In addition to varying the level of abstraction, the reporting period and the amount of sensor data measured can vary. As mentioned previously, the amount of data monitored can not only directly affect the autonomic manager's ability to understand what is going on in the system but also affect its ability to compute conclusions and take action in a timely manner. To improve this, one could reduce the amount of data measured; however, this may increase the likelihood of missing an event or cause the adaptivity to lag, meaning the system is not agile enough to adapt quickly to change. Moreover, given that sensors are lines of code that run with the system components (usually within the system components), they too consume system resources, thus directly affecting the performance of managed elements. This is in terms of the resources being used to run the sensing code but also in terms of the dynamic scheduling decisions the operating system may make when deciding to run that code. This situation can be overcome by adapting the monitoring function to the context; that is, depending on the situation, more monitoring data may be gathered and in other cases less. Further, the number of components monitored can likewise be adapted.
The data values read from a sensor must obviously be augmented with a mechanism to identify that sensor. Data regarding the moment the observation occurred is also required; therefore quantitative temporal information, usually in the form of a time-stamp, is added to the message communicating the sensed value.
In a distributed system, this method of identifying the time of the observation may be problematic: different hardware clocks may not be synchronised, resulting in the same observation moment being represented by different times.
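A minimal sketch of such a message follows; the field names are illustrative assumptions:

// Minimal sketch of a sensed-value message: the raw reading is
// augmented with the sensor's identity and a time-stamp so that the
// monitor can tell where and when the observation was made.
class SensorReading {
    final String sensorId;    // identifies the reporting sensor
    final long   timestamp;   // observation moment (ms since epoch)
    final double value;       // the sensed value itself

    SensorReading(String sensorId, long timestamp, double value) {
        this.sensorId = sensorId;
        this.timestamp = timestamp;
        this.value = value;
    }
}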
5.4
Profiling
There are a number of self-awareness data gathering techniques that can be combined to obtain a view, or profile, of the system. These we call profiling tools, and their primary aim is to obtain runtime information to characterise the behaviour of the system. There are a number of approaches to this, summarised below from [13]:
Manual instrumentation: One of the most direct ways to obtain information about the runtime behaviour of a program is to manually inject monitoring code into it. However, this approach can be complex and potentially error-prone, intrusive and impractical in complex applications. This is not only because knowledge of both the managed element and its interfaces is required, but also because altering code can have side effects that affect both the monitored element and other elements that rely on it.
Compiler-based instrumentation: Profile code may be introduced automatically by some compilers. This obviously occurs at compile time, which assumes one has access to the source code in order to compile it. Nevertheless, this approach has the potential to be less error-prone than manual injection of profile points, assuming the compiler's profiling action has been well reasoned in advance by those who supply the compiler. Though compiler-based instrumentation can ensure that the profiling code is syntactically correct, one is limited to only the features that come with that compiler. Furthermore, semantic errors can occur even here, or temporal side effects, as, again, extra code is being added to the managed element. Once the program has been compiled with profiling features enabled, extraction tools can be used to capture the profiled data for analysis. This is the approach used by GNU gprof.
Interception-based instrumentation: Languages like Java or Python, which can be interpreted or executed in virtual environments, can also provide instrumentation at the virtual machine level. In many cases, hooks are used, that is, predefined points in the execution where monitoring code can be attached to report performance data or generate event notifications. This technique is dynamic and minimises intrusiveness in that it does not require changes to the managed element's code. It also permits the development of the monitored program code to be decoupled and abstracted from the monitoring code. For many of these systems, the types of hooks are predefined by the virtual environment, for example, JVMTI (JVM Tools Interface), .NET and the Python profiling module. More recently, aspect-oriented approaches have been deployed [9]. Here the programmer can specify the points in the execution (join points) at which monitoring code is to be woven into the application.
5.5
Monitoring Overheads
Each process that is either using the CPU (i.e. it is in the run queue) or waiting for the CPU (in the ready queue) increases the load number by 1. The average load itself is calculated using an exponentially damped/weighted moving average of this load number. The load averages therefore reflect the system utilisation that occurred over the past 1, 5 and 15 min of system operation.
The load figure is usually obtained by periodically sampling the state of the scheduler. An alternative would be to carry out the calculations at a finer grain, that is, whenever the scheduler enforces a change of state. However, this would be impractical to realise, as the scheduler is central to the operating system's function, and so its efficiency impacts significantly on overall system performance. Note that in the scheduler, events are plentiful and frequent; processes can move between run, ready and wait states every 100 ms. There is a slight disadvantage to the periodic sampling approach: it may not exactly reflect actual system behaviour. Let's drill down a little further as an example. On Linux systems, the clock tick determines when the load calculation function runs: the load code runs every 5 s [11].
As the reader can hopefully see, just understanding the load of a single CPU, albeit not even a 100 % accurate representation of load, costs the system at least the following: reading the current load (active_tasks), reading the older averages from the three time period arrays (EXP_n), carrying out the updating calculations of the three moving averages (avenrun) and then writing the updated values back to the three time period arrays. This has to happen every 5 s! And this is a trivial example; imagine how much the monitoring functions alone would cost in terms of a full autonomic system's overheads.
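For illustration, the following is a minimal floating-point sketch of such exponentially damped moving averages; the real Linux code uses fixed-point arithmetic, and the constants below simply assume the 5 s sampling period described above:

// Minimal sketch of the exponentially damped moving averages behind
// UNIX-style load figures. EXP_1, EXP_5 and EXP_15 are the damping
// factors e^(-5/60), e^(-5/300) and e^(-5/900) for the 1, 5 and
// 15 min windows respectively, assuming one sample every 5 s.
public class LoadAverage {
    private static final double EXP_1  = Math.exp(-5.0 / 60.0);
    private static final double EXP_5  = Math.exp(-5.0 / 300.0);
    private static final double EXP_15 = Math.exp(-5.0 / 900.0);

    private final double[] avenrun = new double[3]; // 1, 5 and 15 min averages

    /** Called every 5 s with the number of runnable/waiting tasks. */
    public void update(int activeTasks) {
        avenrun[0] = avenrun[0] * EXP_1  + activeTasks * (1 - EXP_1);
        avenrun[1] = avenrun[1] * EXP_5  + activeTasks * (1 - EXP_5);
        avenrun[2] = avenrun[2] * EXP_15 + activeTasks * (1 - EXP_15);
    }
}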
5.6
The monitoring function can be seen as active or passive. Active monitoring involves the placement of probes into the system to monitor its function; we discuss this more in the next section. Passive monitoring, alternatively, is where state information is captured by an external monitoring element or service and this data is provided to the autonomic manager. The code to do this can be developed as part of the autonomic system architecture. However, there exist a number of tools that carry out this function, and some of these come with the operating system, for free.
We describe the monitoring function as being for free in this example because typically the autonomic system is placed on one or many machines that run an operating system, and all modern operating systems (bar some embedded systems) come fully instrumented and very able to provide performance statistics concerning system operation. The load average example presented in the last section is an example of statistics being recorded automatically. Since the operating system has to constantly probe its performance anyway, the autonomic manager may as well make use of that information.
Typically the autonomic manager will wish to monitor the health of the system. Runtime information such as CPU utilisation, memory and network usage can be obtained directly from the operating system. The length of time the system has been running is useful for understanding how reliable the system is and is also a metric that can be obtained from the operating system.
Not only can runtime and accumulative information be obtained for free, but other, relatively more static, data such as the names of the nodes in the system, the processor capacity of each, memory sizes, communications bandwidth and disk mean time before failure may be available. Tables 5.1 and 5.2 show a very small sample of such statistics.
Table 5.1 vmstat and top performance reporting tools that report operating system and virtual memory statistics; the data is obtained from the UNIX virtual file system /proc
Sample UNIX performance statistics
Vmstat: memory statistics
Procs: the number of processes waiting for run time or in uninterruptible sleep
Memory: the amount of virtual memory used, idle memory left, memory used as buffers or cache, and the amount of inactive and active memory
Swap: the amount of memory swapped in from disk or out to disk
IO: the rate of data blocks received and sent to/from the block device
System: the number of interrupts per second (which includes the clock) and the number of context switches per second
CPU: the percentage of CPU time spent running non-kernel code, kernel code, idle, waiting for IO and stolen from a virtual machine
Disk: successful reads/writes and time spent reading/writing
Win32_PerfFormattedData_PerfOS_Memory class: memory statistics
AvailableMBytes; CacheBytes; CacheBytesPeak; CacheFaultsPerSec; PageFaultsPerSec; PageReadsPerSec; PagesInputPerSec; PagesOutputPerSec
5.7
Building Probes
So far we have described the trade-off between information richness and accurate autonomic problem determination. Essentially the ideal is an optimisation that minimises the number of sensors or probes required to sample system state and report it somewhere (a task that perhaps includes expensive disk writes) and yet maximises the autonomic manager's notion of current system operation. To this end, the developer must establish what sensor types are required and the scope that they cover. Some of these come from the free monitoring systems described in the last section. At the other extreme is the purposeful placement of code that records system state values or sets a trace on values that compose a set of operations. For example, while system load can be obtained from the operating system, the response time of a particular transaction or complex task within the system will require some code embedded at the start and end of that transaction to establish time differences (see the sketch below). This not only impacts the performance of the system in terms of the overheads now attributed to the monitoring process but also contributes to the complexity of the system. Intuitively one would wish to develop simple sensors that cover a well-defined, minimal scope so that debugging both the autonomic system's interface to the managed system and the autonomic system's function is less complex. However, to establish a richer understanding of the managed system, we would require a larger number of sensors, and this means that the system is required to have more probes or to carry out more probing of the managed system. As stated before, a balance is required.
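A minimal sketch of such an embedded response-time probe follows; the probe placement and the reporting destination (here, standard output) are illustrative assumptions:

// Minimal sketch of a response-time probe embedded around a transaction:
// one reading at the start, one at the end, and the difference reported.
class ResponseTimeProbe {
    void timed(String transactionName, Runnable transaction) {
        long start = System.nanoTime();          // probe at the start
        try {
            transaction.run();                   // the monitored work
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(transactionName + " took " + elapsedMs + " ms");
        }
    }
}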
Once we have established the sensors that are required, the observation points need to be identified; obviously the two are closely coupled. From this, the probes can be placed.
At this point, raw data extracted by probes must be integrated into the autonomic management process. Two main functions are generally required in this regard: information communication and, potentially, information preprocessing. Communication is compulsory, as data extracted by sensors must sooner or later be transmitted to the autonomic manager's monitor components. Preprocessing is optional and may be executed by sensors (before communication), by monitors (after communication) or by both. On the sensor side, some analysis code can be added to enable the preprocessing of the information gathered by the probes so as to reduce sensing communication costs. For example, the sensing function may obtain response times every 5 s but only report the average response time per minute to the autonomic manager. This essentially trades processing time against the volume of data communicated: some form of preprocessing of the sensed data is carried out and only the result is sent to the autonomic manager. On the autonomic manager's monitor side, similar analysis code can be introduced for preprocessing raw data into more relevant information, such as higher-level domain-specific indicators. As previously indicated (in Sect. 5.1), preprocessing functions on the autonomic manager's side can be attributed to both monitoring and/or analysis components with respect to the generic MAPE-K architecture.
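The sensor-side averaging just described might look like the following minimal sketch, using the sampling and reporting periods assumed in the text:

// Minimal sketch of sensor-side preprocessing: samples arrive every 5 s,
// but only a one-minute average is reported to the autonomic manager.
class AveragingSensor {
    private double sum = 0;
    private int count = 0;

    /** Called every 5 s with a raw response-time sample (ms). */
    void sample(double responseTimeMs) {
        sum += responseTimeMs;
        count++;
        if (count == 12) {               // 12 samples * 5 s = 1 min
            report(sum / count);         // one message instead of twelve
            sum = 0;
            count = 0;
        }
    }

    void report(double averageMs) {
        // Communication to the manager's monitor component goes here.
        System.out.println("avg response time: " + averageMs + " ms");
    }
}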
So probes generally consist of code that instruments the managed software system and its execution environment to provide information about system and context state. In addition, physical probes can be introduced to monitor hardware resources or environmental parameters, such as temperature sensors in enterprise clusters or pervasive systems. Collected state data may be periodically communicated to the autonomic manager's reasoning processes, which then make decisions about how to change the system and implement self-* properties (improving system performance, for example). Certainly, in addition to providing input for the manager's decision-making logic, monitoring information can be used for knowledge-acquisition purposes. In all cases, when data aggregation or filtering is enabled, only preprocessed data or events are sent to the autonomic manager's information processing logic. That is, the probe touchpoints, monitoring and analysis components of the MAPE-K loop may be more closely coupled.
Alternatively, the monitoring functions may contain event-reaction rules or policies that, for example, not only trigger an event of interest to the autonomic manager but also indicate when to send the monitored data. The decision of when to communicate this data is important, however. One example could be where the state value reaches (or exceeds) a threshold, and this triggers an action to pass on this information to the autonomic manager; this would be like the autonomic manager saying 'I only want performance data when the average response time is less than 10 ms for the last minute', for example. This communication may or may not contain the raw data that caused the event to be triggered. In most cases, however, it will contain temporal data saying either at what time the event was triggered or after which instructions within a sequence it occurred. The time lag between the probe's measurement and its subsequent communication may also be important; if the lag is large, the autonomic manager could end up acting on out-of-date information. Further, it is important to know where the event was triggered within the architecture of the system, both in terms of the software architecture and the hardware itself. For example, a probe might report a response-time violation event if it read and calculated that the response time for a given task was greater than a given threshold. It would report this violation to the autonomic manager together with information about the degree of violation in terms of time, at what time it occurred, what task (or tasks) was involved and on what node in the system it was running, etc.
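A minimal sketch of such a threshold-triggered event rule follows; the event fields mirror those just listed, and all names are illustrative assumptions:

// Minimal sketch of an event-reaction rule inside a monitoring function:
// a violation event, carrying contextual data, is emitted only when the
// measured response time crosses a threshold.
class ThresholdRule {
    private final double thresholdMs;

    ThresholdRule(double thresholdMs) { this.thresholdMs = thresholdMs; }

    void check(String task, String node, double responseTimeMs) {
        if (responseTimeMs > thresholdMs) {
            // Only now is data communicated to the autonomic manager.
            System.out.printf(
                "violation: task=%s node=%s at=%d degree=%.1f ms%n",
                task, node, System.currentTimeMillis(),
                responseTimeMs - thresholdMs);   // degree of violation
        }
    }
}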
The monitoring infrastructure, at both the probe and monitor component levels, may be hierarchical, in that data probed from lower layers in the system is passed to higher-level layers for correlation and analysis. This is typically the case when the autonomic manager system is itself organised in a hierarchical fashion. As discussed previously, the intermediate layers analyse and, perhaps, filter data, which is then passed to the autonomic manager functions at each layer of the hierarchy; it may be sent to the layer above to take action upon the managed element. Correlation or corroboration is an important part of this process. This may compare readings from probes to ensure that a reading is accurate or reasonable within bounds. To do this, either moving windows of historical readings are maintained and trends observed, or probe readings from similar processes running in the same environment are compared to see if the current value is an outlier. The system may then decide to ignore the erroneous result or store it for later; higher levels may be informed when a number of these unusual results occur. Beyond error detection, correlation and corroboration can be used to reduce the amount of data communicated from probes to monitoring and analysis functions and finally to the autonomic manager's decision logic.
The movement of monitoring data may not necessarily be hierarchical. In more complicated systems, where raw probe data is aggregated and passed around the system, a publish-subscribe mechanism may be used. Here, sensors supply, or publish, their results to a pool of such data. The autonomic managers that are interested in this data then subscribe to receive notifications of changes in it. Other sensors or monitoring components that have analytical capabilities may also subscribe, to use the data for comparative or correlation purposes. An example of this is where some autonomic managers in the system react to specific aspects of system performance (e.g. transaction throughput), and therefore they subscribe to be notified regarding the throughput of a particular component; such a manager would register its interest as, say, 'current throughput of Web server 4 on node 9'. As one can imagine, in a large enterprise-wide autonomic system, the amount of raw probe data can be immense. Therefore abstraction may be used to categorise notifications into themes, and the autonomic manager can then narrow the field of performance data it must process.
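A minimal publish-subscribe sketch along these lines follows; the broker design and topic naming are illustrative assumptions:

// Minimal sketch of publish-subscribe distribution of monitoring data:
// sensors publish readings under a topic; interested managers subscribe.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class MonitoringBroker {
    private final Map<String, List<Consumer<Double>>> subscribers = new HashMap<>();

    /** A manager (or analytical sensor) registers interest in a topic. */
    void subscribe(String topic, Consumer<Double> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    /** A sensor publishes a reading; all subscribers are notified. */
    void publish(String topic, double value) {
        subscribers.getOrDefault(topic, List.of())
                   .forEach(handler -> handler.accept(value));
    }
}

// Usage: a manager subscribes to the throughput of one component, e.g.
// broker.subscribe("throughput/webserver4/node9",
//                  v -> System.out.println("throughput: " + v));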
Fig. 5.2 (a) Autonomic function resides on the same machine as the managed artefacts. (b) Managed
artefacts and autonomic functions reside on different physical machines. (c) A large-scale autonomic
system consisting of many managed artefacts residing on numerous computers or devices
However, if economic costs are not prohibitive, there is nothing to prevent the system's architecture being composed of nodes dedicated solely to the monitoring and management of the system, while the remaining nodes run the monitored system (Fig. 5.2b). In such systems, a separate CPU and associated memory hierarchy essentially carry out the autonomic function. This kind of architecture is more common in systems that require either very complex analysis of the monitored data, for which a dedicated CPU is required, or the complex monitoring of very large-scale systems (thousands of nodes).
An example of the former case, where the analysis is complex, is where Kalman filters are used to establish a model of the monitored system's state. This technique produces estimates of monitored values based on a series of measurements observed in the past, ensuring that noise, for example, has been filtered out. From this, the planning phase can better understand potential future state and better make decisions regarding how and when to adapt the system. Such an approach can require a large number of inputs, depending on the monitored system's size, and as the processing consists of recursively iterating over this potentially noisy data, it is correspondingly complex, possibly consuming more CPU and other resources than the actual managed system could tolerate. Hence, it is best to place this analysis on another machine, perhaps even a supercomputer!
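For a flavour of the recursive estimation involved, here is a minimal one-dimensional Kalman filter sketch for smoothing a single noisy metric; the noise variances q and r are illustrative assumptions that would need tuning per system:

// Minimal one-dimensional Kalman filter for smoothing a noisy
// monitored metric (e.g. response time).
public class ScalarKalmanFilter {
    private double estimate;   // current state estimate
    private double errorCov;   // estimate error covariance
    private final double q;    // process noise variance (assumed)
    private final double r;    // measurement noise variance (assumed)

    public ScalarKalmanFilter(double initial, double q, double r) {
        this.estimate = initial;
        this.errorCov = 1.0;
        this.q = q;
        this.r = r;
    }

    /** Fold one noisy observation into the estimate and return it. */
    public double update(double measurement) {
        errorCov += q;                                  // predict
        double gain = errorCov / (errorCov + r);        // Kalman gain
        estimate += gain * (measurement - estimate);    // correct
        errorCov *= (1 - gain);
        return estimate;
    }
}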
An example of the latter, the large-scale system, is InfoScope, which continuously monitors planet-wide systems and consists of distributed monitoring facilities for thousands of nodes. Here, mechanisms are used that improve the performance of the monitoring system by both tailoring and compressing the monitored data to best suit the current states of the system and the requirements of the autonomic function at any given time. The probes remain with the managed system and communicate state and performance data to the separate computing entities that analyse the data. However, in an extremely large system, this too impacts the performance of the system, as the bottleneck becomes the communications infrastructure and the machine, or set of machines, that carry out the analysis; see Fig. 5.2c. Schemes to reduce the amount of monitored data that is communicated are therefore favoured.
5.8
In the previous sections, we saw that the monitoring functions of autonomic systems rely both on sensors embedded into relevant artefacts and on monitor components included in the autonomic manager's MAPE-K loops. Communication functions are required to feed the necessary information into the autonomic management process. Additionally, analysis functions, such as aggregation, filtering and scheduling, can be mixed with monitoring functions and placed within either sensors or monitors to provide preprocessed data to the autonomic manager's reasoning logic. Various architectural and technological solutions are possible for achieving such monitoring and analysis functions, depending on the specific requirements of each autonomic system. In this section, we illustrate a few examples of available solutions in order to provide an overall flavour of this vast domain.
First, numerous monitoring utilities are available for collecting different data types from various managed resources and context artefacts. For instance, for monitoring Java applications, the JVM Tool Interface9 (JVMTI) provides a means for both extracting information from and controlling applications running in a Java virtual machine (JVM). Targeting the performance management of component-based applications based on Java EE, the COMPAS10 open-source framework was developed to support portable, extensible and adaptable EJB component instrumentation and monitoring. Also for enterprise systems, QMON [12] provides utility-aware QoS monitoring for different Service-Level Agreement (SLA) classes. For the performance management of distributed systems at the resource level, the CLIF11 open-source project provides a testing platform which includes both load-injection facilities and a wide range of resource monitoring probes (including CPU, memory and network bandwidth). For larger-scale systems like cloud and grid applications, the Ganglia12 open-source project relies on a hierarchical design to offer a scalable distributed monitoring system. Additional freeware monitoring facilities providing support for hierarchical organisation in grid and cloud environments include the Clumon13 and Supermon14 systems. On the industrial side, similar examples include Paramon cluster monitoring, Big Brother15 Web-based network monitoring and IBM's Tivoli Monitoring software16 for managing operating systems, databases and servers.
Most of these tools represent mature, scalable and efficient monitoring solutions for the specific system types they were designed for. In addition, they are often bundled with complementary visualisation, analysis and control facilities providing rich support for domain-specific system management.
9 JVMTI homepage: http://docs.oracle.com/javase/1.5.0/docs/guide/jvmti; JVMTI replaces previous utilities that provided similar functions, namely, the Java Virtual Machine Profiler Interface (JVMPI) and the Java Virtual Machine Debug Interface (JVMDI).
10 COMPAS project: http://compas.sourceforge.net
11 CLIF project: http://clif.ow2.org
12 Ganglia project: http://ganglia.sourceforge.net
13 Clumon project: http://clumon.ncsa.illinois.edu
14 Supermon project: http://supermon.sourceforge.net
15 Big Brother Software homepage: http://bb4.com
16 Tivoli Monitoring software: http://www-01.ibm.com/software/tivoli/products/monitor
17 The Composite Probes and CLIF (http://clif.ow2.org) projects were developed at Orange Labs, France, and are based on the Fractal component technology (http://fractal.ow2.org).
18 The Cilia mediation project (https://github.com/AdeleResearchGroup/Cilia) was developed by the Adele team at the University of Grenoble in collaboration with Orange Labs, France, and is based on a dynamic service-oriented component technology, iPOJO/OSGi (www.ipojo.org), discussed in Chap. 9.
19 The Eclipse Test and Performance Tools Platform (TPTP) Project: http://www.eclipse.org/tptp
5.9
remain constant until they are discovered to be not so useful, at which point they may well be updated manually. This approach is clean and simple but has obvious implications for the autonomic manager's ability to react to the system as it changes over time, that is, as its environment changes or its purpose changes. To better future-proof the system, alternatives to this hardwiring approach are coming to the fore. Here, a feedback loop dedicated to understanding how useful the monitored parameters are is added, and the autonomic manager can then change the parameters to ensure that the autonomic system as a whole better meets its goals.
Other, more abstract, approaches do not require an understanding of the monitored system but view the autonomic system in terms of its feedback loops, the aim being to maintain stability in the system via the adjustment of these feedback loops. One example of this is AdaptGuard [16]. This system builds adaptation graphs by essentially sniffing the monitored values flowing from the probes to the autonomic manager. It is able to use this data to detect whether or not the system is likely to remain stable. If not, stability recovery policies are put into action, which may involve asking the user to intervene. These, like many of the systems presented in this subsection, are essentially autonomic, autonomic monitors!
5.10
Key Points
This chapter has presented the many different methods and techniques that can be used to achieve the monitoring function in autonomic computing.
Here we see that the monitoring function can have an important impact on the autonomic system's ability to understand its non-functional behaviours and its ability to perform, as well as impacting on the monitored element's performance. This can be in terms of probe code, and perhaps even analysis code, competing for the same resources as the monitored element itself!
The number and placement of probes is therefore an important issue.
We learned that the monitoring function of the autonomic system may double up as the performance reporting function for some systems. More commonly, operating-system-produced statistics are used to report performance and feed the autonomic monitoring function.
Non-performance data, measuring aspects such as the number of times a piece of code runs, or obtained via specialist probes that determine some form of qualitative measure of the managed element's behaviour, may also be used.
We learned that there may be large volumes of such data produced and that, depending on the kind of data gathered, performance data may be relayed to users in different ways. This may take the form of a call graph, detailing dependencies between method calls; statistical summaries, detailing resource consumption per method or unit of computation; or an execution trace, detailing the sequence of method calls and performance-related information, for example.
References
1. Biyani, V.: Log management as a service. What & why: Log management in cloud. Cloudspring, Nov. 2012. http://cloudspring.com/log-management-as-a-service
2. Chuvakin, A.A., Schmidt, K.J.: Logging and Log Management: The Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management, 1st edn. Syngress, Waltham (2012). 460 p. ISBN 1597496359
3. IBM: Autonomic computing toolkit: Developer's guide. Technical Report SC30-4083-02, IBM. Available at http://www-128.ibm.com/developerworks/autonomic/books/fpy0mst.htm. Aug 2004
4. Mozetic, I.: Hierarchical model-based diagnosis. Int. J. Man-Mach. Stud. 35(3), 329-362 (1991)
5. Garlan, D., Schmerl, B.: Model-based adaptation for self-healing systems. In: WOSS '02: Proceedings of the 1st Workshop on Self-Healing Systems, pp. 27-32, New York, 2002
6. Cheng, S.-W., Huang, A.-C., Garlan, D., Schmerl, B.R., Steenkiste, P.: Rainbow: architecture-based self-adaptation with reusable infrastructure. In: Proceedings of the 1st IEEE International Conference on Autonomic Computing (ICAC), pp. 276-277, New York, 2004
7. Foster, H., Uchitel, S., Magee, J., Kramer, J.: LTSA-WS: a tool for model-based verification of web service compositions and choreography. In: ICSE 2006, pp. 771-774, Shanghai, China (2006)
8. Haydarlou, A.R., Oey, M.A., Overeinder, B.J., Brazier, F.M.T.: Use case driven approach to self-monitoring in autonomic systems. In: Proceedings of the Third International Conference on Autonomic and Autonomous Systems (ICAS'07), IEEE Computer Society Press, Athens, Greece, 2007
9. Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C., Lopes, C., Loingtier, J.M., Irwin, J.: Aspect-oriented programming. In: ECOOP'97: Object-Oriented Programming, pp. 220-242. Springer, Jyväskylä (1997)
10. Drongowski, P.J., AMD CodeAnalyst Team, Boston Design Center: An introduction to analysis and optimization with AMD CodeAnalyst Performance Analyzer. Advanced Micro Devices, Inc., Sunnyvale (2008)
11. Hughes, P., Navratilova, V.: Linux for Dummies Quick Reference, 3rd edn. IDG Books Worldwide, Foster City (2000). 256 p. ISBN 0764507605
12. Agarwala, S., Chen, Y., Milojicic, D.S., Schwan, K.: QMON: QoS- and utility-aware monitoring in enterprise systems. In: Proceedings of the 3rd IEEE International Conference on Autonomic Computing (ICAC'06), Dublin, Ireland, June 2006
13. Maurel, Y.: CEYLON: a framework for creating extensible and dynamic autonomic managers, or CEYLAN: Un canevas pour la création de gestionnaires autonomiques extensibles et dynamiques. PhD thesis, University of Grenoble (2010)
14. Avouac, P.A., Lalanda, P., Nigay, L.: Autonomic management of multimodal interaction: DynaMo in action. In: Proceedings of the 4th International Conference on Engineering Interactive Computing Systems, EICS 2012, June 25-28, pp. 35-44. ACM, Copenhagen (2012)
15. Avouac, P.A., Lalanda, P., Nigay, L.: Service-oriented autonomic multimodal interaction in a pervasive environment. In: Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI 2011, 14-18 November 2011, Alicante, Spain, ACM, pp. 369-376 (2011)
16. Heo, J., Abdelzaher, T.: AdaptGuard: guarding adaptive systems from instability. In: The 6th International Conference on Autonomic Computing and Communications (ICAC '09), Barcelona, Spain, 15-19 June 2009
6.1
Software Adaptation
[Fig. 6.1: a software system, represented as binary code together with local data]
If the binary code is under your jurisdiction, you can change it (if you are the initial developer) or update it whenever new versions or patches are available and whenever you decide to; things are a bit different with external resources. At best, you can change some configuration parameters or switch to different services. But there is usually no way to change the internals of a service or the pace of its evolution, since a third party controls it.
Adaptability is an essential requirement of software systems. This is due not only to the evolving nature of both requirements and execution environments but also to the difficulties we encounter in building correct software systems. By correct, we mean software systems providing the services or capabilities demanded by users along with expected qualities like performance, resilience and availability. Since many systems are not perfectly correct in this sense at installation time, adaptations are obviously necessary.
By nature, an adaptation is carried out while the system is in operation. This means that the software system has already been running for some time when something requires it to adapt. Let us refine that 'something'. First, it can refer to a change in the computing environment. In this case, the software system has to be updated in order to integrate changes regarding its resources. The purpose is to maintain the system's functionality while remaining in sync with its environment. Adaptation may also be needed because the context makes it timely. For instance, for an embedded system, when the battery is low it may be necessary to operate in a low-power mode that might, in turn, decrease computational precision.
Adaptation can also be triggered to provide functional evolutions. The goal here is to bring new services or capabilities, or to modify existing ones, in order to better satisfy users or to take into account new running conditions (and related opportunities). Similarly, adaptation can bring non-functional evolutions. The purpose here is to modify the properties attached to the provided services or capabilities (again, to better satisfy users or to exploit new possibilities). For instance, changes can be brought in to improve performance or security. Last, but not least, fixing a bug is a good reason to perform an adaptation.
Software adaptation is a real challenge. As a subject, it has been studied for decades in the realm of software engineering, but for the most part it still relies too often on ad hoc solutions. The bottom line is that most systems are not conceived to be easily adaptable. They are merely designed and coded to meet the requirements at hand, with little projection into the future. Some development methodologies, like agile approaches, even argue that explicitly preparing a software system for evolution is counterproductive, resulting in fat and slow code for uncertain, illusory gains. The consequence, and problem, is that adapting a software system can turn into a scary, uncontrolled process requiring much expertise and huge effort.
Software adaptation is thus an open issue in many regards. It is however at the
heart of autonomic computing. And the requirements here are very high! Adaptations
have to be carried out on software that is already in operation, and well-defined
support has to be explicitly provided so that autonomic managers can trigger safe and
controlled adaptations in programmatic ways.
6.2
Code Adaptation
6.2.1
Upgrading Code
[Fig. 6.2: upgrading code: source code passes through compilation, link editing, configuration and deployment to become the running software system's binary code and local data]
This approach is very convenient, since upgraded code is obtained through simple recompilation. It is to be noted, however, that all possible runtime conditions have to be anticipated in the source code through conditional directives. The range of possible adaptations is therefore de facto limited to expected situations. Also, the source code has to be accessible to the autonomic managers so that they can recompile it.
Following the same principles, existing binary code can be customised through the modification of configuration parameters. Those parameters can be defined and assigned in a text file. They can also be specified (and evaluated) when launching an application; in this latter case, restarting some modules is necessary in order to update the running code. This approach is often used today in pervasive computing applications, for instance, where a process is transferred between devices (e.g. from a phone to a PC with a TV screen). In this situation, the binary code is not really impacted (where programming languages and CPUs permit). It remains the same; only a few variables impacting the behaviour are changed, for example, to adjust the text rendering to better suit the output device.
In order to ease adaptation and avoid side-effect issues, the source code can be prepared for evolution in advance. This is the approach advocated by the software product line community, which promotes the production of collections of similar software systems from shared assets, where software reuse is proactively engineered for.1 Here, software artefacts that are likely to be context aware (i.e. able to adjust to their environment) are designed and implemented with adaptation in mind. Specifically, variation points are explicitly introduced in the source code in order to make room for late modifications. Variability is a property of software artefacts allowing them to be extended, modified, personalised or customised in order to meet some specific needs. A variation point embodies a delayed design decision. It can be complemented by possible choices, called variants. In the product-line approach, development is seen as a two-phase process. The purpose of the first phase is to develop the software architecture and a number of software artefacts with reuse and adaptation in mind; facilities can be provided in order to ease the creation of adaptable artefacts. The purpose of the second phase is to reuse and adapt the artefacts produced during the first phase in order to obtain a software system in line with the requirements at hand. Here also, tools are generally provided to bring about adaptations based on the variation points.
This approach can be used advantageously in autonomic computing. That is, it can be useful to determine and implement variation points in the code and let autonomic managers act on them when necessary (see the sketch below). If variants are associated with variation points, modifications triggered by autonomic managers are safer and more controlled. Also, relationships between variation points can be established, meaning that a modification of a given point has to be followed by modifications of other points in order to leave the system in a coherent state.
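The following minimal sketch illustrates a runtime variation point that an autonomic manager could act on; all names (CompressionPolicy, TransferModule) are illustrative assumptions, not a standard API:

// Minimal sketch of a variation point with variants: the choice of
// compression policy is a delayed design decision that an autonomic
// manager can rebind at runtime through a programmatic interface.
interface CompressionPolicy { byte[] compress(byte[] data); }

class NoCompression implements CompressionPolicy {
    public byte[] compress(byte[] d) { return d; }
}

class TransferModule {
    // The variation point itself: bound late, switchable at runtime.
    private volatile CompressionPolicy policy = new NoCompression();

    /** Programmatic interface offered to the autonomic manager. */
    void setCompressionPolicy(CompressionPolicy variant) {
        this.policy = variant;
    }

    void send(byte[] payload) {
        byte[] out = policy.compress(payload);
        // ... transmit 'out' over the network ...
    }
}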
As explained, autonomic managers can change configuration parameters, recompile some code or modify source code. In any case, however, programmatic interfaces have to be provided and some variability has to be integrated, like options in a compilation, for instance. This is especially true for modifications to the source code; an autonomic manager needs tools and interfaces to adapt the variation points. The bottom line here is that, whatever the approach, adaptations have to be prepared during the initial development phase. Otherwise, only a (human) developer can go through the code, at whatever abstraction level, and introduce the necessary modifications.
1 http://www.sei.cmu.edu/productlines
6.2.2
Integrating Code
Once new binary code better suited to the current runtime conditions has been produced by an autonomic manager, or released by the maintenance organisation, it has to be installed and activated in the computing environment. To do so, it has to be integrated with the existing code: this means that new and old code must interact safely, preserving correctness.
Updates can be static or dynamic. A static update means that the software to be adapted is stopped and then restarted after modification. This is the simplest and most traditional method of integration. Here, the autonomic manager has to decide on the best moment to stop and restart the software. This is exactly what happens when your computer downloads the latest version of your preferred Internet browser. In some cases, however, the software system can decide by itself to reboot; this is, for instance, the case for major operating system upgrades. A static update is rather easy to implement, but it can be costly, since it decreases the availability of the software. This can have very negative impacts in domains, such as pervasive computing, where continuous services are needed. In addition, we are moving to a world where updates, easily made available on a Website, are more and more frequent. It is unacceptable for the system to halt for every update.
When interruptions are not permitted, the adaptation is said to be dynamic. A dynamic adaptation is always a delicate operation and must be conducted with care. The operation is complex for the simple reason that the software system to be modified is still running! This raises a number of specific issues related to code integration per se but also to software correctness and to the preservation of data and control flows, that is, state.
Indeed, the fact that the software has already been running before this new activation completely changes the situation. A major issue is that internal computational states are often lost between two activations. In addition, the mere definition of state may differ between two activations, even incompatibly (e.g. perhaps due to the use of new programming structures). Similarly, advanced transaction and isolation mechanisms may be needed to maintain state with regard to consistent interaction with other external software that has a dependency relationship with the software being integrated. This does not, however, prevent these other software systems being placed in a pending state, which may imply that they suffer downtime or loss of quality in some way as a result.
Thus, for technical and business reasons, a number of software systems today require dynamic upgrades without service interruption. However, dynamic updates should not endanger the correctness and robustness of software [2]. Correctness is an absolute. An updated software system often has to meet new or modified requirements. Regression tests and tests accounting for new features must then somehow be combined to make sure that the newly specified requirements are met. Test strategies are clearly impacted by the fact that the system continues to run. Verifications have to be conducted while the system is active and executing and, in general, cannot be performed as in traditional design-time testing.
Further, the ability to carry out dynamic updates should not alter the quality of a software system. Crashes, loss of data, inefficiency and so on are obviously not acceptable during the course of an update. The problem is that such bugs are very likely to occur if dynamic modifications are not perfectly controlled. A number of recurrent issues must be handled with care, especially those related to the preservation of the computational state (once again!). In particular, data and control flows must not be lost or altered after an update. Taking the classic example of a merchant Website, the active carts have to remain unchanged when a software update is carried out. Regarding control flows, connections between structures (e.g. class instances) must remain valid. That means, in particular, that a structure cannot be suppressed if another is using it (i.e. connections in an active control thread must not be invalidated). On the other hand, a structure not involved in the main control flow is said to be in a quiescent state and can be safely updated.
In general, it is not possible to upgrade active code. Active code is code that is being run or is referenced in the execution stack. At best, the system can wait for the code to become inactive, which unfortunately is not always possible; for instance, some parts of the code may always be referenced, and the window allowing an update may never open. A common technique is to implement a quiescence protocol. The principle here is to intercept calls to the code to be changed and to wait for that code to become inactive. When this happens, the object is upgraded and the blocked callers are resumed. The main issue here is that dependencies between the updated objects can result in very complex situations (such as deadlock, livelock and starvation).
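A minimal sketch of such a quiescence protocol follows, using a read-write lock to intercept calls; the proxy-based design and the names are illustrative assumptions:

// Minimal sketch of a quiescence protocol: callers go through a proxy;
// an upgrade blocks new calls, waits for in-flight calls to drain
// (quiescence), swaps the implementation and then resumes the callers.
import java.util.concurrent.locks.ReentrantReadWriteLock;

class QuiescentProxy<T> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile T target;

    QuiescentProxy(T initial) { this.target = initial; }

    /** Every intercepted call runs under a read lock (many may overlap). */
    <R> R invoke(java.util.function.Function<T, R> call) {
        lock.readLock().lock();
        try { return call.apply(target); }
        finally { lock.readLock().unlock(); }
    }

    /** The upgrade takes the write lock: it proceeds only once no call
        is active, i.e. once the target has become quiescent. */
    void upgrade(T newVersion) {
        lock.writeLock().lock();
        try { target = newVersion; }  // blocked callers then resume on it
        finally { lock.writeLock().unlock(); }
    }
}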
If possible, a safe way to proceed is to carry out non-destructive updates, meaning that the structure of the software that existed before adaptation is not lost or, worse, rendered unrecoverable. For instance, some programming languages allow the cohabitation of different versions of the same structure and so provide an elegant solution to this so-called versioning problem. Destructive updates are easier to implement but, of course, can be rather hazardous. Let us take the example of Web services. Here, the services provided by the system are regularly updated, including at the interface level. A non-destructive approach is to maintain the old interface for some time when creating new versions; clients can thus progressively adopt the new interfaces. In a more destructive philosophy, old interfaces would be immediately suppressed! This is easier to manage, of course, but the risk of compatibility problems on the client side is higher.
A very popular approach to implementing dynamic updates, known as rolling upgrades, is to use hardware redundancy, especially in mission-critical, software-intensive or enterprise systems. Essentially, the principle is to deploy the upgraded version of a software system on a new or alternative machine and to activate it. In the meantime, the software to be updated continues to run and to provide full services. A load balancer is used to redirect all the requests that were sent to the original software to its new version. At some point in time, the original software becomes idle and the swap between old and new versions can be made. This approach is costly, since additional hardware has to be used and maintained. In addition, it demands the synchronisation of state between the old and new versions of the software. It is nevertheless an approach used today in cloud computing (coupled with virtualisation techniques), where resources are added on demand as the load gets too high.
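The switch-over step just described can be sketched as follows; Backend, LoadBalancer and their methods are illustrative assumptions:

// Minimal sketch of the switch-over step in a rolling upgrade: the
// balancer first directs new requests to the new version, then waits
// for the old version to become idle before retiring it.
class RollingUpgrade {
    void switchOver(Backend oldVersion, Backend newVersion, LoadBalancer lb)
            throws InterruptedException {
        lb.route(newVersion);                    // new requests go to the new version
        while (oldVersion.activeRequests() > 0) {
            Thread.sleep(100);                   // drain in-flight requests
        }
        oldVersion.shutdown();                   // the old version is now idle
    }
}

interface Backend { int activeRequests(); void shutdown(); }
interface LoadBalancer { void route(Backend target); }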
6.3
Due to its complexity, it has become necessary to provide support that controls the dynamic adaptation of software applications. Too many parameters have to be considered (coupling, side effects, timeliness, etc.), and the requirements are simply too high to adapt a system in a dynamic fashion without tools that abstract away some of the complexity.
In this section, we give an overview of approaches that are commonly used in particular domains and that make sense in the context of autonomic computing. To do so, we consider three different levels of abstraction:
The operating system level, where the resources and services used by a software system can be adjusted
The program level, where algorithms or quality of service can be modified
The component level, where the high-level structure of software systems can be changed
The bottom line is that, in most autonomic systems, adaptations have to be made at different levels. Complementary techniques are needed to intervene at different levels of abstraction and in different places. Thus, some modifications are made at the operating system level, other changes modify an instruction within a program and, in other cases, it is more appropriate to update a whole chunk of code (a component).
6.3.1
OS-Level Adaptation
The operating system (OS) community has studied the issue of dynamic adaptation for a long time. Robust and relevant techniques have been developed and can be reused in the context of autonomic computing.
[Fig. 6.3: OS-level adaptation: an autonomic manager updates the software system (binary code and local data) through the operating system]
6.3.2
Program-Level Adaptation
[Fig. 6.4: program-level adaptation: an autonomic manager directly updates the program's binary code at runtime, above the operating system]
6.3.2.3 C Language
In the C language, modules that can be linked dynamically are packaged in specific
libraries. Their implementation depends on the operating system. These specific
libraries are called dynamic-link libraries (.dll) in Windows and shared libraries
(.so) in Linux.
The following example shows how to load a library with the dlopen() function,
how to get the address of a symbol defined in a shared (dynamic) library with the
dlsym() function and, finally, how to unload the shared library with the dlclose()
function:
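The original example is reconstructed below as a minimal sketch; the library name libadapt.so and the symbol new_behaviour are illustrative assumptions:

/* Minimal sketch of dynamic loading in C with dlopen/dlsym/dlclose. */
#include <stdio.h>
#include <dlfcn.h>

int main(void) {
    /* Load the shared library at runtime */
    void *handle = dlopen("libadapt.so", RTLD_LAZY);
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* Resolve the address of a symbol defined in the shared library */
    void (*new_behaviour)(void) =
        (void (*)(void)) dlsym(handle, "new_behaviour");
    if (new_behaviour) new_behaviour();

    /* Unload the shared library when it is no longer needed */
    dlclose(handle);
    return 0;
}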
In Java, dynamic loading is performed by entities called class loaders. The purpose of a class loader is to resolve external references. To do so, it has to locate the libraries containing the appropriate classes in the system resources and load them into the virtual machine. Several class loaders can be used in the same virtual machine. Their use is based on the following rules:
Every class loader but the initial one (bootstrap) has a parent.
Every class loader delegates the task of class loading to its parent before doing so itself.
By default, a Java virtual machine possesses three hierarchical class loaders:
The initial class loader, whose purpose is to load the standard Java classes (rt.jar)
The extension class loader, which loads the classes of the extension directory (jre/lib/ext)
The application class loader, which loads the archives defined by the CLASSPATH
More class loaders can be added to load specific aspects in a modular way. Each class loader then has its own name scope. This is a powerful approach, applying the separation-of-concerns principle to dynamic class loading. In particular, it allows the loading of two implementations of the same class as long as they are loaded by two different class loaders. Such an approach brings flexibility, since two versions of a class can be used by different parts of a system. In addition, backtracking to a previous state is made possible. However, the class loader concept is not always mastered by programmers. This can result in tricky, buggy situations where unexpected classes are used in a program.
In contrast to the C approach, verification is done before loading a Java library.
In particular, type system compatibility is checked. The following example shows
how to dynamically load a Java class:
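A minimal sketch, assuming a hypothetical archive plugins/compressor.jar containing a class org.example.Compressor:

import java.net.URL;
import java.net.URLClassLoader;

public class DynamicLoading {
    public static void main(String[] args) throws Exception {
        // A dedicated class loader whose scope is the given archive.
        URL[] urls = { new URL("file:plugins/compressor.jar") };
        try (URLClassLoader loader = new URLClassLoader(urls)) {
            // Loading triggers verification (type system compatibility).
            Class<?> clazz = loader.loadClass("org.example.Compressor");
            Object instance = clazz.getDeclaredConstructor().newInstance();
            System.out.println("Loaded " + instance.getClass().getName());
        }
    }
}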
[Fig. 6.5 Component-level adaptation: the autonomic manager manipulates the architecture of the software system rather than its binary code or local data]
Program-level adaptation techniques often rely on dedicated tools rather than a regular compiler. This raises the usual issue of conformity and maintenance of these often ad hoc tools.
6.3.3 Component-Level Adaptation
The introduction of software components in the late 1990s aimed at providing a new
level of abstraction and new facilities to software developers [4]. Developing large
software systems solely using fine-grained programming concepts like objects or
functions does not scale very well. This is because the granularity is too fine and a number of global aids to development and maintenance are not supported. In most programming languages, for instance, there is no provision for non-functional qualities or code dependency management. That is, there is no global view of applications, which is a sticking point when implementing adaptations. In many cases, code upgrade is limited to small sets of instructions because of this lack of global perspective [5].
Components provide a very useful level of abstraction for self-managed systems
[6]. As a matter of fact, a number of self-managed systems are today based on the
component granularity for adaptation. As illustrated by Fig. 6.5, autonomic managers
directly manipulate components to adapt the system and are not aware of low-level
details. This leads of course to coarse-grained adaptations.
A software component is a unit of composition that can be independently deployed
and executed. A component model defines a common structure for components
and rules to assemble them. Many component models are based on the notion of
provided and required interfaces expressing what a component can do and what it
needs in order to be executed. Many component models also introduce a set of
non-functional properties that are used to characterise the overall behaviour of a
component.
[Figure: an assembly of components (A, B, C) bound through their provided and required interfaces]
[Figure: the service-oriented interaction pattern: a service provider publishes its description in a service registry (1. Publication), and a service consumer queries the registry (2. Discovery) before binding to the provider]
6.3.4 Software Services
[Fig. 6.8 Resources exposed as services ("resource as a service") and dynamically integrated into the software system by the autonomic manager]
Service substitution may be triggered by the appearance of a new, perhaps better, service in the architecture (in the registry) or because the current service does not provide the expected functionality or the required quality level (this quality-level requirement can be expressed in a service contract).
A number of implementations of the SOA concept have been proposed, sometimes for different purposes. Web services (www.w3c.org), for instance, represent a solution of choice for software integration. UPnP (www.upnp.org) and DPWS (Devices Profile for Web Services) are heavily used in pervasive applications in order to expose volatile devices.
As illustrated by Fig. 6.8, these technologies are very useful to implement and dynamically integrate resources. Service orientation is therefore of great interest for autonomic computing. The loose coupling between service providers and consumers facilitates architectural evolution. Architecture is an important word here: services are large-grained artefacts, and they target large-grained evolutions. Changing services means changing big chunks of code. Service orientation is thus well suited to adaptation at the architecture level, not at the instruction level.
Service orientation is so promising that it has been extended to handle more than
resources. Cervantes and Hall introduced the notion of service-oriented component
in 2004 [8]. Their main motivation was to combine the advantages of two different
paradigms into a single programming model, that is, the architectural dimension of
software components and the inherent flexibility of services. The essential
innovation of service-oriented components resides in the way bindings are established. In traditional component-oriented composition, components are selected
and bound at design time. Afterwards, reconfiguration is driven by an administrator
or by a global autonomic manager in a centralised fashion. By contrast, the selection
process for a service-oriented composition occurs at runtime as component instances
are created. It is the purpose of the execution framework to bind the service-oriented
components together. To do so, service-oriented components come with a description of their provided and required functions. Depending on the available components, the execution framework binds the appropriate service-oriented components
together (i.e. it binds provided services to required services).
Building a software system using a service-oriented component approach comes
down to decomposing the system into a collection of modular interacting services.
These services are described in a specific file, separately from any implementation.
By doing so, it is possible to develop service-oriented components independently of
each other. It is also possible of course to provide variant implementations for the
services that can be interchanged, even at runtime. Variant implementations can be
used, for example, to support different non-functional requirements.
The execution of the software system based on service-oriented components
starts when all the dependencies are satisfied. The final topology of a system
depends of course on the available components. Components can be pushed
down in the execution platform or pulled from repositories by the platform. A
composition can be seen as an abstract architectural description that could be
used by an autonomic manager to deploy components that satisfy the service
specifications required by the composition. The resulting system may vary
dynamically at runtime.
A service-oriented component is characterised by the following information:
A set of provided service interfaces.
A set of required service interfaces, declared by the component and handled by the
execution platform (i.e. the dependencies are resolved by the execution platform).
Management interfaces. They allow direct management of the service-oriented
component and are of major importance when it comes to dynamism.
Required and exported resources. These are references to (code) resources that
have to be provided by other components or used by other components.
This approach facilitates the work of programmers in many respects, especially the management of dynamism. In a way, service-oriented components infuse autonomic computing concepts into traditional component models. Reconfiguration is not decided by a global manager but, locally, by autonomic managers attached to components. That is, depending on the available components, bindings can be regularly re-evaluated and possibly changed.
Several implementations of service-oriented components have been proposed, including OSGi (www.osgi.org) and iPOJO (http://felix.apache.org/site/apache-felix-ipojo.html), which is built on top of OSGi. These two technologies are today used to aid the building of dynamic (sometimes autonomic) systems through their
ability to support dynamism at the architecture level and are therefore presented in more detail in later sections of this chapter.
6.4 OSGi
6.4.1 Modularity
Firstly, OSGi defines a form of modularity for Java, beyond the modularity provided
by classes and objects. It allows developers to modularise their applications.
Modules are called bundles in OSGi (the two terms, modules and bundles, are often used interchangeably by practitioners).
The notion of a bundle is pivotal to OSGi; specifically, a bundle is a Java archive.
It can contain, in addition to Java classes, a number of resources including .gif and
.png files, properties files, containers like .jar or .zip files and libraries of native code
such as .dll or .so files. In other words, a bundle contains all the files that are required to implement a module, where a module can be defined as a set of coherent, collaborating classes grouped together. The purpose is to organise Java applications into a set of loosely coupled, highly coherent interacting modules.
A bundle is both a deployment unit and a composition unit. Regarding deployment, bundles are used to package classes and resources so that they can be deployed
on one or more execution platforms. Bundles are thus tangible artefacts that can be
copied or transferred by software administrators.
But bundles are also used as composition units at the application level. That
means that they are used as building blocks to form modular Java applications. Note
that this double role played by bundles often leads to confusion since differences
between the notions of deployment and composition are not always well understood
by programmers. Regarding the compositional aspect, bundles allow the definition
of what can be shared and what is private to the bundle. This aspect is defined in a
metadata file included in a bundle.
OSGi uses the Java manifest facility to specify metadata (manifest.mf in Fig. 6.9). A manifest is a file defining high-level properties called metadata. By default, Java defines a set of metadata, such as the vendor name and the version of the associated archive. Most metadata depends on the execution context or on the nature of the archive. The OSGi standard, however, defines a complete list of metadata (see http://www.osgi.org/download/r4v43/r4.core.pdf). Two entries are especially important:
Export-Package defines the packages of the bundle that are exported (made available to the other parts of an application).
Import-Package defines the packages required by the bundle for its execution.
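As an illustration, the manifest of a bundle might include the following entries (the bundle and package names are hypothetical):

Manifest-Version: 1.0
Bundle-ManifestVersion: 2
Bundle-SymbolicName: org.example.logger
Bundle-Version: 1.0.0
Export-Package: org.example.logger.api
Import-Package: org.osgi.framework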
One of the main assets of OSGi is its modularity management, which allows the installation and uninstallation of Java modules without interrupting services. This capability is made possible by the advanced use of Java class loaders. Specifically, a class loader is defined for each bundle. OSGi then introduces visibility rules between bundles via the notions of public and private packages. Public packages can be imported or exported, as specified through the metadata. A bundle then has access to the classes of its own packages and to the classes of the imported public packages belonging to other bundles. This defines the bundle Class Space. Having a class loader per bundle allows several versions of a class to coexist
in the same program. The only constraint is that a given bundle can only access a
single version.
Bundles have a life cycle of their own. Specifically, a bundle can be:
Installed. The bundle is valid, and the running platform has assigned it a unique identifier. Installation is an atomic and persistent operation. A bundle object is created and is used for every upcoming administration operation.
Uninstalled. The physical representation of the bundle has been deleted, and its different resources have been correctly released or discharged from the platform.
Resolved. All the dependencies of the bundle (packages, capabilities, etc.) have been satisfied.
Starting. The bundle is being initialised through a call to its start method. A notification is sent upon bundle activation.
Activated. The bundle has been successfully activated and is running.
Stopping. The bundle is being deactivated: all services and resources in use have to be released, all the threads of the bundle are stopped and all services provided by the bundle are deleted from the platform.
Thus a bundle goes through different states from its installation up to its retirement. This is illustrated by Fig. 6.10, which summarises the different states and transitions.
[Fig. 6.10 The life cycle of an OSGi bundle: the Installed, Resolved, Starting, Activated, Stopping and Uninstalled states, linked by install, resolve, start, stop, update and uninstall transitions]
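These transitions are driven programmatically through the OSGi Bundle API. A minimal sketch, assuming a BundleContext (as received by a bundle activator) and a hypothetical bundle archive:

// Drive a bundle through its life cycle from a management agent.
Bundle bundle = context.installBundle("file:bundles/org.example.logger.jar");
bundle.start();     // Installed -> Resolved -> Starting -> Activated
bundle.update();    // re-reads the (possibly new) physical archive
bundle.stop();      // Activated -> Stopping -> Resolved
bundle.uninstall(); // Resolved -> Uninstalled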
6.4.2 Service
OSGi allows the dynamic management of deployment and composition units, also known as bundles, meaning that the execution platform does not have to be rebooted for a change to its architecture to take effect. This dynamicity, however, only concerns classes: the dynamic management of bundles does not imply the dynamic management of applications. To achieve the latter, a bundle exposes its functions (services) to
the other bundles and, conversely, is able to use functions (services) offered by the
other bundles. These functions are concerned with the instance level: they correspond to running classes.
As introduced previously, a major aspect of service-oriented computing is the
notion of a contract. This notion defines what is expected (service specification) and
what is effectively used (concrete service). Clients use service specifications in
order to select a service provider and invoke a concrete service. This two-phase
protocol also gives clients the ability to change concrete services when the currently
used ones are not satisfactory (for whatever reason).
OSGi relies on a service registry containing the services available on the platform at a given time. Services correspond to running class instances that belong to a bundle and whose interfaces are explicitly exported. Bundles are thus concerned with instances of classes that can be shared by all other bundles. A bundle therefore contains a number of service consumers and providers.
Regarding service provision, a bundle has to provide the following elements to
the registry:
A description of the provided service (Java interface)
The invocation point of the provided services (a reference to the implementation
class)
The non-functional properties
When a bundle registers a service, it receives a reference to the registration record (ServiceRegistration), which is used to administer the service and, in particular, to deregister it. A bundle has to deregister its declared services when it is deactivated, although some OSGi implementations automate this aspect.
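A minimal sketch of service provision, assuming a hypothetical LogService interface implemented by LogServiceImpl:

// Register the service with a description (the interface), an invocation
// point (the implementation object) and non-functional properties.
Dictionary<String, Object> props = new Hashtable<>();
props.put("vendor", "example");
ServiceRegistration<LogService> registration =
    context.registerService(LogService.class, new LogServiceImpl(), props);
// ... later, typically when the bundle is deactivated:
registration.unregister();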
To use a service, a consumer has to look it up. Two modes are available for this: an active mode and a passive one. In active mode, the potential consumer explicitly queries the registry to get one or several references to services running at that moment, as in the following:
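A minimal sketch, again assuming the hypothetical LogService interface:

// Active mode: query the service registry explicitly.
ServiceReference<LogService> ref = context.getServiceReference(LogService.class);
if (ref != null) {
    LogService log = context.getService(ref); // bind to the concrete service
    // ... use the service ...
    context.ungetService(ref);                // release it when done
}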
6.4.3 Conclusion
6.5 iPOJO
[Figure: the iPOJO component model: each POJO is embedded in a container equipped with handlers, and binding handlers connect component instances]
The same result can be obtained through the use of a metadata file:
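A sketch of such a metadata file, assuming a hypothetical component class org.example.AudioPlayer that requires a log service and provides its own service:

<ipojo>
  <!-- Component type: the POJO class and its handlers -->
  <component classname="org.example.AudioPlayer">
    <requires field="log"/>                             <!-- @Requires -->
    <provides/>                                         <!-- @Provides -->
    <callback transition="validate" method="start"/>    <!-- @Validate -->
    <callback transition="invalidate" method="stop"/>   <!-- @Invalidate -->
  </component>
  <!-- An instance created from the component type above -->
  <instance component="org.example.AudioPlayer"/>
</ipojo>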
As said earlier, the specification includes the required services (@Requires), the provided services (@Provides) and optional life-cycle-related callbacks (@Validate, @Invalidate). These elements are then interpreted during execution: component instances are created, service dependencies are dynamically injected and callbacks are invoked depending on the instance state.
Finally, a useful aspect of iPOJO is the possibility to define hierarchical compositions. Here, instances can be grouped into separate name spaces called composites.
This notion of composite allows the isolation of services in an execution platform.
iPOJO composites can be created in a declarative way in a description file,
just like iPOJO instances. The following example illustrates the creation of such a
composite:
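A sketch of such a description, assuming hypothetical component and service names; the sub-services below are instantiated and isolated in the composite's own name space:

<ipojo>
  <composite name="audio.chain">
    <!-- Instantiate a sub-service inside the composite's name space -->
    <subservice action="instantiate" specification="org.example.LogService"/>
    <instance component="org.example.AudioPlayer"/>
  </composite>
  <!-- Create an instance of the composite itself -->
  <instance component="audio.chain"/>
</ipojo>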
6.6 Conclusion
The techniques presented in this chapter add complexity to systems that self-manage some of their parts. Of course, this additional complexity is the price to pay to reach the desired level of autonomy.
Good sense (and the use of strong software engineering principles!) can hide this additional complexity as much as possible. The adaptation code, whatever technique it uses, should be encapsulated and only changed when required by an expert.
6.7 Key Points
References
1. Lin, D.-Y., Neamtiu, I.: Collateral evolution of applications and databases. In: ERCIM Workshop on Software Evolution/International Workshop on Principles of Software Evolution (IWPSE-Evol '09), Amsterdam, Aug 2009
2. Neamtiu, I.: Practical dynamic software updating. Ph.D. dissertation, University of Maryland, Aug 2008
3. Kiczales, G.: The Art of the Metaobject Protocol. MIT Press, Cambridge, MA (1991)
4. Szyperski, C.: Component Software: Beyond Object-Oriented Programming. Addison-Wesley/Longman Publishing Co., Inc., Boston (1997)
5. Krakowiak, S.: Middleware architecture with patterns and frameworks. http://sardes.inrialpes.fr/~krakowia/MW-Book/ (2007)
6. Kramer, J., Magee, J.: Self-managed systems: an architectural challenge. In: Future of Software Engineering, pp. 259–268. IEEE Computer Society, Washington, DC (2007)
7. Papazoglou, M.: Service-oriented computing: concepts, characteristics and directions. In: Proceedings of Web Information Systems Engineering, Los Alamitos, CA (2003)
8. Cervantes, H., Hall, R.: Autonomous adaptation to dynamic availability in a service-oriented component model. In: Proceedings of the 26th International Conference on Software Engineering, pp. 614–623. IEEE Computer Society, Washington, DC (2004)
9. Hall, R., Pauls, K., McCulloch, S., Savage, D.: OSGi in Action: Creating Modular Applications in Java. Manning Publications, Greenwich (2011)
10. Escoffier, C.: iPOJO: a flexible service-oriented component model. Ph.D. dissertation, University Joseph Fourier, Dec 2008
7.1 Introduction to Knowledge
7.1.1 Definition
Knowledge is a central notion in autonomic computing. Indeed, in order to exhibit
self-administration properties, autonomic systems must rely on some form of knowledge about themselves, about the computing environment and about ways to solve
problems. The more sophisticated the autonomic capacities required, the more advanced the knowledge needed.
We have seen that autonomic systems are made of a number of interacting autonomic managers. From a logical point of view, these managers are organised around administrative tasks, the MAPE tasks, which are used to monitor the managed artefacts, analyse the situation, plan countermeasures when necessary and finally execute courses of action. As illustrated by Fig. 7.1, the MAPE tasks strongly rely on knowledge; hence the K in the MAPE-K pattern.
The general notion of knowledge is very complex, to such an extent that its study
gave birth to a philosophical domain of its own called epistemology.1 The mere definition of knowledge is still a matter of intense debate, and, in fact, there is no single
agreed definition today. The classical definition of knowledge traces back to antiquity: Socrates2 stated that knowledge is true belief that has been justified.3 This
theory means that someone knows something if:
He/she believes it.
This something is true.
It is justified in some way.
[Fig. 7.1 Knowledge supporting the MAPE tasks: design rationale and constraints, domain knowledge, problem-solving knowledge, knowledge of the computing environment and knowledge of the running artefacts]
1. From the Greek epistēmē, meaning knowledge, and logos, meaning study.
2. Socrates (469 BC–399 BC) was one of the classical Greek philosophers who laid the foundation of Western philosophy. His work was transcribed by Plato, his student (428/427 BC–348/347 BC).
3. In the Theaetetus, one of Plato's dialogues about the nature of knowledge.
This classical definition is sufficient for this book. In our scope, it puts forward
the fact that an autonomic system must rely on justified beliefs about itself, its environments and the effects of its actions. These beliefs have to be true; otherwise,
autonomic actions may be inappropriate.
The notion of truth has also been the subject of intense debates and numerous
research works. Once again, for the purpose of this book, we retain the classical
definition, also tracing back to the Greek philosophers, which defines truth as the
real states of things. A statement about something is true if it reflects the state of this
something at some level of abstraction.
It is then of utmost importance for the autonomic managers to align the problems to be solved with the knowledge at their disposal (see Chap. 5 about dynamic monitoring). An autonomic manager has to be proactive and fetch the information that is necessary to solve the problems at hand. It is also necessary to evaluate autonomic actions to make sure of their appropriateness. Chapter 8 is dedicated to this thorny problem, which requires that the system has knowledge about the possible effects of an action and is able to log the different actions taken so far.
This of course also holds for autonomic systems. A major issue when building a
self-managed system is then to find out how to represent the different forms of knowledge (and how to acquire them). In most cases, different formalisms are needed to
represent the different pieces of information to be expressed. Intuitively, it may seem
natural that prescriptive knowledge will often be at the heart of the MAPE tasks and
that, by contrast, descriptive knowledge will often be used to express information
related to design and domain constraints. But it is not so simple. For instance, propositional knowledge can be used to automate analysis or planning, and, by contrast,
procedural knowledge may be needed to express complex constraint verifications.
7.2
7.2.1 Introduction
As said earlier, knowledge and reasoning are very much related. Depending on the expected adaptations, different forms of knowledge will be required in autonomic systems. Various representations, used to handle different aspects, can coexist in the same system. For instance, different knowledge representations may be used to reason about and implement self-repair and self-configuration.
The issue, for each self-property targeted by a system, is then to identify and represent the different forms of knowledge that are needed: the acquaintance knowledge that is captured via touchpoints on the managed artefacts, the innate knowledge that is engraved in the heart of autonomic managers and that captures the domain expertise and, in some cases, the descriptive knowledge collected by third parties.
Many knowledge representations used in current autonomic systems found their
inspiration in artificial intelligence (see Chap. 3). Indeed, knowledge representation
has always been central to AI. In order to act as a person, and incidentally to pass
the Turing test [1], an intelligent machine has to be able to interact with its environment, to acquire and store knowledge, to conduct reasoning based on that knowledge and to learn.
There have been numerous debates in the AI community about the way knowledge and reasoning should be implemented. In short, the question is: should a machine think like a human or not? (Granted, answering this presupposes that we know how a human thinks, which is another source of debate.) Answering this question leads to different ways to represent knowledge. As explained in Chap. 3, proponents of the "thinking like a human" approach explored domains as diverse as step-by-step logical inference, neural networks, psychology, etc. Opponents of this theory sought to leverage the machine's outstanding processing power and proposed solutions based, for instance, on searches of the solution space. In all cases, the division between reasoning and knowledge is quite fuzzy, as one would expect. The same holds for the division between planning (reasoning) and the knowledge used to effect adaptation in the autonomic MAPE-K loop.
In this section, we use a classification commonly adopted in AI to distinguish between intelligent systems [2] depending on the adopted knowledge representation and associated reasoning:
Rule-based systems implement knowledge through simple event–condition–action rules and are capable of quick, simple reflex adaptations.
Model-based systems maintain models of the managed artefacts and of the computing environment in order to produce more thoughtful actions.
Goal-based systems introduce and use an explicit definition of goals in order to guide reasoning.
Utility-based systems introduce utility functions in order to compare and rank states satisfying goals.
[Fig. 7.2 Rule-based autonomic systems: the autonomic manager (AM) embeds a rule-based problem solver, receives administration policies and interacts with the managed artefacts through sensors and effectors]
Here we have a system that delivers audio data at differing quality-of-service levels. When the perceived network bandwidth is plentiful, the quality of the data can be increased. Conversely, when bandwidth is scarce, the data is compressed and sent at a lower quality. The aim is to constantly deliver audio at the highest quality possible. This example shows two policies. The first one calculates the time at which the client data buffer will be empty (meaning the client cannot play the audio clip, something the system wants to avoid) and, if the threshold indicates this, tells the server to compress the data. The second one takes a bandwidth measurement from the server's point of view and, if it sees that the bandwidth is getting better, increases the quality of the audio file. The conflict here is that networks are somewhat asymmetrical; the bandwidth available at the client and at the server can therefore differ, resulting in the client wanting compressed data while the server wants to increase the quality at the same time.
In the broadest sense, reasoning in autonomic systems involves making a decision regarding the changes and adaptations to assemble and implement on the managed element, taking monitoring data as input. In the simplest case, we could define event–condition–action (ECA) rules that directly produce adaptation plans from specific event combinations. This approach can be applied in a stateless manner: the autonomic manager does not have to keep any information regarding the state of the managed element but relies solely on the current sensor data readings to decide whether to trigger an adaptation plan. While this minimises complexity and is quite lightweight, it is also very limiting.
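As an illustration, the two audio policies above could be coded as ECA rules. A minimal sketch in Java; the event sources, threshold and effector methods (compressAudio(), increaseQuality()) are hypothetical:

// Policy 1: client side, avoid buffer starvation.
void onBufferReport(double secondsUntilEmpty) {        // event
    if (secondsUntilEmpty < BUFFER_THRESHOLD) {        // condition
        server.compressAudio();                        // action
    }
}

// Policy 2: server side, exploit improving bandwidth.
void onBandwidthReport(double bandwidth) {             // event
    if (bandwidth > lastBandwidth) {                   // condition
        server.increaseQuality();                      // action
    }
    lastBandwidth = bandwidth;
}

As the text notes, nothing in these stateless rules prevents them from firing simultaneously and working against each other.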
Models of the same phenomenon can be built to serve different purposes (estimating the duration of a trip, forecasting the weather, predicting long-term climate evolution, etc.), and no single model can capture all the information needed to solve all sorts of problems. The variety of goals then leads to a variety of models.
Several models of the same thing, but handling different aspects, can be used jointly to support problem solving. This is a way to separate concerns and get simpler and more focused models based on different representation ontologies. Synchronising the different models at runtime may be an issue. One way to do so is to use a central model and to relate all the other models to it.
Models have always played a major role in science. They are in fact an essential means (if not the only one in certain cases) to reason about complex phenomena that cannot be observed in detail or totally embraced by the human mind. In spite of considerable computing power, the situation is the same in computer science. Software systems have to use models in many situations because the problem domain is too complex or impossible to capture completely. For instance, models have been heavily used to represent the different structures of software systems (see Chap. 1) like their topology, behaviour and deployment units.
Numerous model representation languages have been devised in AI but also in
software engineering. This is still a subject of intense research. Today, there is no
agreement on a single general representation. Current research efforts concentrate on the definition of domain- or aspect-specific modelling languages (DSLs). Later, we will illustrate representations currently used to model pervasive applications and their associated environment.
In autonomic computing, models can be used to represent knowledge by acquaintance, which is acquired from the touchpoints, but also innate knowledge, like a reference architecture devised at design time for the managed artefacts. The expertise to solve problems, on the other hand, is encoded in a dedicated module, called the problem solver in Fig. 7.3, which is kept separate from the models. Of course, the reasoning techniques employed by the problem solver heavily depend on the nature of the models.
Although models are very useful, it should always be remembered that they are incomplete knowledge. Regardless of the representation language, the very purpose of models is to present a partial or abstract (or both) view of the managed artefacts, because these are only partially observable or because more detailed views would not be usable. Decisions based on such models have to be backed up by human beings, as anticipated in all definitions of autonomic computing.
[Fig. 7.3 Model-based autonomic systems: explicit models complement the problem solver within the autonomic manager]
[Fig. 7.4 Goal-based autonomic systems: an explicit goal complements the models and the problem solver]
Administration policies are generally business oriented and remain rather abstract. The detailed goals pursued by the system, in terms of expected states of the managed artefacts, are not made explicit. They are in fact embedded in the problem solver and, as a consequence, cannot be explicitly manipulated. Changing such detailed goals requires changing the code of the problem solver, which is always a daring task.
In goal-based autonomic systems, some part of the knowledge supporting management decisions is made explicit. Goals that are pursued by the system are represented separately from the problem solver. As said earlier, the notion of a goal has to be understood as the expected state of the managed artefacts, not as high-level administrative directives.
Making explicit this notion of a goal allows the definition of more flexible and
more generic problem solvers. In this approach, a problem solver takes as inputs the
current state of the managed artefacts, a goal (that can be expressed as target state
of the managed artefacts) and the set of actions that can be triggered through the
effectors to achieve the target state. It can then rely on generic, reusable algorithms
to handle the current issue since the problem is now to align two state representations using a number of available operations. Search-based algorithms, forward and
backward, have been very much used in this context.
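A minimal sketch of such a generic, forward search-based problem solver; states and actions are opaque, application-defined types:

import java.util.*;

/** Breadth-first forward search from the current state to the goal state. */
final class Planner<S> {
    interface Action<S> { S apply(S state); }

    /** Returns a sequence of effector actions reaching the goal, or null. */
    List<Action<S>> plan(S current, S goal, List<Action<S>> actions) {
        Queue<S> frontier = new ArrayDeque<>();
        Map<S, List<Action<S>>> paths = new HashMap<>();
        frontier.add(current);
        paths.put(current, new ArrayList<>());
        while (!frontier.isEmpty()) {
            S state = frontier.remove();
            if (state.equals(goal)) return paths.get(state);  // plan found
            for (Action<S> a : actions) {
                S next = a.apply(state);
                if (!paths.containsKey(next)) {               // unvisited state
                    List<Action<S>> path = new ArrayList<>(paths.get(state));
                    path.add(a);
                    paths.put(next, path);
                    frontier.add(next);
                }
            }
        }
        return null;  // no sequence of actions reaches the goal state
    }
}

The same solver can be reused for any goal expressed as a target state, which is precisely the flexibility discussed here.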
Flexibility is a major advantage. We have seen throughout this book how important dynamism is in modern computing. This of course applies to autonomic systems. Being able to finely tune the target system states in a dynamic way is certainly a major asset for many systems.
[Fig. 7.5 Utility-based autonomic systems: a utility function complements the models and goal of the autonomic manager to rank the states satisfying the goals]
[Figure: learning autonomic systems: a learning capability is added on top of the models, goal and utility function of the autonomic manager]
7.3 Model-Driven Autonomicity
7.3.1 Introduction
Let us now focus on the notion of model, which is arguably central to advanced
autonomic systems (i.e. those not solely based on reflex rules). An important question
arises when it comes to models: what should be explicitly represented with models?
(And, conversely, what should be kept in the problem solver?)
Potentially, all the knowledge obtained by acquaintance and part of the innate
knowledge could be made available in explicit models. Concretely, the currently
preferred strategy of researchers and practitioners is to build a number of separate
models, including a model of the running software architecture, a model of the computing environment and models focusing on relevant non-functional properties like
security and performance. To properly implement the principle of separation of concerns, these models are kept distinct, but they are often closely linked. For instance,
security concerns can be traced to architectural elements [6]. The explicit separation
of models favours their independent evolution. Depending on the situation and the
problems to be solved, some models can be refined more than others. However, the
models to be represented, their level of abstraction, their formalisation, etc. are still
defined on a case-by-case basis today.
Generally speaking, the more information is made explicit in models, the better
for autonomic software evolution. This allows the MAPE tasks to focus on the
computation (the know-how) and not on the data representation and collection.
Furthermore, several tasks can use the same data representation. This makes the
code leaner, more focused and easier to change.
Making models a key concern in autonomic computing is today a strong trend. This is also a major trend in software engineering, where an ambitious research field exclusively dedicated to the study of models has recently been established.
Specifically, the model-driven engineering (MDE) community advocates the creation and exploitation of models to entirely drive the development and maintenance
of software systems. Initially, models were essentially used to drive software development. The main principle was to bridge the gap between the specification of a
problem and its solution through successive model transformations (representing the same system at different levels of abstraction). This was to hide the technological
complexity of the implementation and allow a better communication between the
different actors involved in software development.
Models used in MDE are said to be productive in the sense that they lead to an
implementation in a programming language after a certain number of transformations. Otherwise, models are said to be contemplative and are solely used to improve
communication between stakeholders and drive development informally.
Models are now seen as strategic artefacts that can be used all along software life
cycle and not only at design time. France and Rumpe [7] introduced the idea of
model at runtime (model@runtime) to abstractly capture runtime phenomena. Such a model is kept in synchronisation with an operational system and can be used to get synthetic information about the system's operation. This is clearly in line with
autonomic purposes, and it is no surprise that the model orientation brings together scientific communities working on autonomic computing and software engineering at runtime (as outlined in the introductory chapter).
[Figure: models at runtime: instances conform to models, which conform to a meta-model, and links relate the models and their instances]
Repair strategies for the architecture model may be specified as ECA rules, for example, where an event is generated when the model is invalidated by sensor updates, and an appropriate rule specifies the actions necessary to return the model to a valid state. In practice, however, there is always a delay between the time when a change occurs in the managed system and the time when this change is applied to the model. Indeed, if the delay is sufficiently long and the system changes frequently, an adaptation plan may be created and sent for execution under the belief that the actual system was in a particular state, for example, a Web server overloaded, when in fact the environment has already changed in the meantime and the system no longer requires this adaptation (or it requires a different adaptation plan) [11]. To overcome this, in many model-driven adaptation systems, the model is stored and executed on a separate machine from the computers that host the managed elements, and the resulting parallelism improves the processing of the model.
Architectural models tend to share the same basic idea of the model being a graph of components and connectors. The components represent units of concurrent computation, whereas the connectors represent the communication between components. Usually, there is no restriction as to the level of granularity of a component: it could be a complete Web server, an application on a Web server or a component of an application. The architectural model does not describe a precise configuration of components and connectors that the managed element must conform to. Instead, it sets a number of constraints and properties on the components and connectors, so that it can be determined when the managed element violates the model and needs adaptation. Let us now continue our description of architectural model-based planning in the MAPE-K loop by taking a look at some of the most notable architectural description languages (ADLs), which can be used to specify an architectural model of a managed system.
Let us start with Darwin, one of the first ADLs, which resulted from seminal work by Magee et al. [12]. In Darwin, the architectural model is a directed graph in which nodes represent component instances and arcs specify bindings between a service required by one component and the service provided by another. Further, the Alloy object modelling notation [13] has been applied to Darwin components to be able to specify constraints on the components [14]. For instance, consider the scenario where there are a number of server components offering services and a number of client components requiring services. Each service of a component is typed so that different services offered by a server or requested by a client can be distinguished and properly matched. In this scenario, the architectural model can guarantee that there are enough servers to service the clients. Should that not be the case, new server components must be started that offer the unavailable service types in order to return the model to a valid state. In this approach, each component keeps a copy of the architectural model. In other words, each component in the architectural model is an autonomic element with a managed element and an autonomic manager that holds the architectural model of the entire system. This approach avoids the presence of a central architectural model management service,
which would otherwise introduce the problem of detecting and handling the failure
of this central component. Where such a decentralised approach is taken, there is
however the problem of keeping the architectural model up to date and consistent
across all copies in the autonomic managers. This can be achieved with fully
ordered atomic broadcasts, which work as long as no communication partitions
occur between the components.
Figure 7.8 shows how a component is represented in Darwin. To initially construct and subsequently change systems, we need a set of operations on components: typically, create, delete, bind a component to a port, unbind, and set the mode to a value. A system constructed in this way will have a configuration, or management state, consisting precisely of the set of component instances, the set of connections between components and the set of component mode values.
Other architectural models have since been developed; we show only two of the many, as they summarise several of the styles that are available. The Acme adaptation
framework [11, 15, 16] is a software architecture that uses an architectural model
for monitoring and detecting the need for adaptation in a system. The components
and connectors of their architectural model can be annotated with a property list and
constraints for detecting the need for adaptation. A first-order predicate language
(called Armani) is used in Acme to analyse the architectural model and detect violations in the executing system. An imperative language is then used to describe repair
strategies, much like the policy-based approach. The difference lies in how the need
for adaptation is detected and the appropriate adaptation rule selected. Whereas in
policies it is explicitly described in the rules, with an architectural model, the need
for adaptation implicitly emerges when the running system violates constraints
imposed by the architectural model.
Similarly in C2/xADL [17, 18], an important contribution lies in starting with an
old architectural model and a new one based on recent monitoring data and then
computing the difference between the two in order to create a repair plan. Given the
architecture model of the system, the repair plan is analysed to ascertain that the
change is valid (at least at the architectural description level). The repair plan is then
executed on the running system without restarting it.
[Fig. 7.8 A Darwin component, with its provided services (ports), its required services (ports) and a mode]
7.4 Reasoning Techniques
[Fig. 7.9 Epistemological relationships between models and reasoning]
[Fig. 7.10 Example of a search space]
Search-based reasoning typically starts from an initial idea of where a potential solution to the problem lies, and then this initial idea is iteratively refined
until no better solution can be found. Where the search space is large or unwieldy, or worse, impossible to cover, heuristics are used to reduce the area that must be searched. Heuristics are methods and rules that help guide the search, and they can originate from human knowledge about the problem. The ultimate aim is to reduce the search space and reach the solution quickly, using techniques that have minimal impact on the optimality of the solution.
Evolutionary approaches to optimisation, which can sometimes be described as bio-inspired, carry out the search for the optimal solution by representing the space of potential solutions as organisms. In, for example, genetic algorithms, the organisms mutate and recombine, and only those organisms that are described as fit, that is, those that bring the search closer to the solution, survive to the next round or iteration of the algorithm. Another example of a bio-inspired approach is one that exploits swarm intelligence (e.g. network routing with stigmergy, as mentioned earlier). Here, storage areas are used to mark good solutions or to reinforce good routes, so that packets will be sent down the better routes available to them at a given time.
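A minimal sketch of the genetic loop just described, using a toy fitness function over bit-string organisms (all names and parameters are invented for the example):

import java.util.*;

/** Toy genetic algorithm: evolve bit strings towards maximum fitness. */
public class GeneticSearch {
    static final Random RND = new Random();

    // Hypothetical fitness: here, simply the number of set bits.
    static double fitness(boolean[] organism) {
        int ones = 0;
        for (boolean bit : organism) if (bit) ones++;
        return ones;
    }

    public static void main(String[] args) {
        int popSize = 20, length = 32, generations = 100;
        List<boolean[]> population = new ArrayList<>();
        for (int i = 0; i < popSize; i++) {
            boolean[] s = new boolean[length];
            for (int j = 0; j < length; j++) s[j] = RND.nextBoolean();
            population.add(s);
        }
        for (int g = 0; g < generations; g++) {
            // Selection: only the fitter half survives to the next round.
            population.sort(
                Comparator.comparingDouble(GeneticSearch::fitness).reversed());
            List<boolean[]> next =
                new ArrayList<>(population.subList(0, popSize / 2));
            // Recombination and mutation refill the population.
            while (next.size() < popSize) {
                boolean[] a = next.get(RND.nextInt(popSize / 2));
                boolean[] b = next.get(RND.nextInt(popSize / 2));
                boolean[] child = new boolean[length];
                for (int j = 0; j < length; j++)
                    child[j] = (RND.nextBoolean() ? a : b)[j]; // crossover
                child[RND.nextInt(length)] ^= true;            // mutation
                next.add(child);
            }
            population = next;
        }
        population.sort(
            Comparator.comparingDouble(GeneticSearch::fitness).reversed());
        System.out.println("Best fitness: " + fitness(population.get(0)));
    }
}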
In logic systems, knowledge and inference are separate, which allows inference to be domain independent. This is undoubtedly a major property. It means that once axioms and rules are properly defined, you just have to feed the system with facts about the running systems, and the actions to be undertaken can then be inferred. One issue is that even a simple logic system, like propositional logic or first-order logic, can, for some problems, exhibit complexities that demand significant resources (satisfiability in propositional logic is already NP-complete).
Logic systems can also represent facts, or fuzzy representations of facts, where a statement is allocated a value between 0 and 1, or a probability, rather than simply being true or false.
There are in fact many formal logic approaches. Propositional logic is a simple declarative language allowing the definition of sentences from proposition symbols (facts) and operators. A sentence combines symbols (and other sentences, recursively) with negation, conjunction, disjunction, equivalence and implication operators. Propositional logic has sufficient expressiveness to deal with partial information, using disjunction and negation. First-order logic is more expressive than propositional logic. It extends the syntax of propositional logic with universal and existential quantifiers in order to express sentences about some or all the objects of the world. The expressiveness of first-order logic has, however, a cost in terms of inference complexity. In simple cases, problems expressed in first-order logic can be reduced to propositional logic problems when the domain of discourse is finite or can be discretised. More complex problems require complex algorithms to be used, which may not be compatible with the usual resources of an autonomic system.
Finally, let us mention constraint logic programming, which can be applied to autonomic computing [19]. Constraint logic programming aims to solve a problem by satisfying constraints; that is, the solution lies in an answer that satisfies all the constraints. It is used for systems with many constraints, such as timetabling and air traffic control-type problems.
Most of these techniques require a model of the world to be represented in the problem space. However, this is extremely difficult to do, and to do accurately. Therefore, techniques inspired by economics and probability theory have been devised, which are applied when we have either incomplete or uncertain information. One class of reasoning method that is popular in autonomic systems is Bayesian networks, which have been used for inference, learning, optimisation and decision-making. Hidden Markov models and Kalman filters are other examples of probabilistic approaches used to perceive processes over time, ignoring useless data (noise) and aiding the prediction of future events.
Using logic to express models is always a challenge. It requires defining symbols representing the world (the managed artefacts and computing environment in our case), axioms and inference rules in order to capture some domain expertise. The transition from informal know-how to a logic-based formalism is rarely straightforward. Also, inference algorithms can be extremely costly and largely exceed the available resources. Indeed, many attempts to use formal systems in computer science fell short because of the excessive complexity of inference algorithms.
7.5
The purpose of this section is to provide an example of using logic for knowledge representation and reasoning. We have chosen an example based on Bayesian techniques, as they are very popular today for dealing with uncertain environments. Bayesian networks are grounded in probability theory. In this section, we explain the basics of Bayesian networks by way of an example that can be applied to many systems that use utility as a means of decision-making in autonomic computing architectures. The example used is an amended version of the work on autonomic middleware from [20].
[Figure: states and actions; each state is characterised by a throughput and a consumption value, and actions move the system between states]
In this example, we have an intelligent home that has a number of sensors that determine what the person in the home is doing. Each sensor provides data regarding the person, such as the person's location or whether they are sitting or standing. This we call context, and each sensor is therefore a Context Provider (CP). Each sensor is also able to provide a quantitative value indicating how well it can measure that context; this we call the probability of correctness (poc).
To be able to derive the user's activity in this example, we need to combine data from different sources of the same type of context. That is, there may be more than one sensor (CP) that can provide us with the user's activity (e.g. a pressure pad on the floor and/or a video camera in the room). Combining sensor data will increase the chances of correctly reporting the user's context. In our system, we wish the combination to deliver an output with a probability of correctness that takes into account the level of agreement between the different CPs, given their individually advertised probabilities of correctness. From this, the autonomic manager can select the best combination of CPs.
In this example, the goal is to take the context of all the Context Provider services for the context type activity (in the sense of standing, sitting, etc.) and output a final context value that takes into account the value of all context providers for this type and the probability of correctness of each. This approach is applicable to any context with a finite set of discrete context values. The final output is based on probabilities; therefore, we use a probabilistic reasoning technique to solve this problem.
Bayesian networks are one approach to this problem and are frequently used in
reasoning about autonomic management and decision-making. Further, efficient
algorithms exist to perform inference and learning in Bayesian networks, adding to their usefulness. Bayesian networks model sequences of variables and allow us to
represent the relationship between the activity, the output of the CPs sensing the
activity and the final output as a Bayesian network.
Figure 7.11 shows the Bayesian network of what we have just described. It is essentially a directed acyclic graph, where the nodes of the graph represent random variables, while the directed links represent the influential relationship that the parent node has on the child node.
direct influence on the Context Providers that are trying to determine the activity
through their sensors. Further, the output of the CPs has a direct influence on the
final output. To this end, we must define for each node its conditional probability given its parents. Thus, we define the probabilities P(A), P(CP1|A), P(CP2|A) and P(O|CP1, CP2). For simplicity, let us assume that there are only two possible values of activity: sitting (si) and standing (st). The prior probability of either sitting or standing, P(A), can be estimated by observing the frequency of the different final outputs (numbers of si or st representing sitting or standing, respectively). Initially, we can assume that all outcomes are equiprobable; therefore, we set the probability that the activity is si equal to the probability that it is st:

P(A = si) = P(A = st) = 0.5
Let's also assume that we only have two Context Providers. The probability of the first Context Provider given the activity, P(CP1|A), is the probability of correctness as advertised by CP1. If there is only one advertised probability of correctness, then we use that value for all possible context values (si and st in our example). However, we assume that, in reality, different context values have different probabilities of being correct, and therefore a single probability of correctness value is an approximation of the probability of correctness of a specific context value. If, on the other hand, we have a specific probability of correctness for each context value, then we can use these values for P(CP1|A), that is, in our example:

P(CP1 = si | A = si) = poc(si)
P(CP1 = st | A = si) = 1 − poc(si)

and

P(CP1 = st | A = st) = poc(st)
P(CP1 = si | A = st) = 1 − poc(st)
Now, the same can be applied to P(CPi|A) for any CP of this context type. The definition chosen for P(O|CP1, CP2) describes the strategy we use to combine the Context Providers to produce the final output, as we then select the output value with maximum probability P(O|CP1, CP2). Finally, given P(O|CP1, CP2), we can use the Bayesian network to find the most likely activity given the final output, together with its probability:
P(A = x | O = x) = α P(A = x) Σ_{c1, c2 ∈ C} P(O = x | CP1 = c1, CP2 = c2) P(CP1 = c1 | A = x) P(CP2 = c2 | A = x)

where C represents the set of all possible context values (x ∈ C) and α is a scaling factor that normalises the resulting probabilities P(A|O) such that they sum up to 1. This result is obtained by applying the general form of Bayes' rule with normalisation: P(Y|X) = α P(X|Y) P(Y).
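Under these definitions, the computation can be sketched directly. The toy method below (names are invented) evaluates this posterior for two CPs over a finite set of context values:

// P(A = x | O = o) for a network with two Context Providers.
// pA[a]         : prior P(A = a)
// pCP1[c][a]    : P(CP1 = c | A = a), derived from CP1's poc values
// pCP2[c][a]    : P(CP2 = c | A = a)
// pO[o][c1][c2] : P(O = o | CP1 = c1, CP2 = c2), the combination strategy
static double posterior(int x, int o, double[] pA,
                        double[][] pCP1, double[][] pCP2, double[][][] pO) {
    int n = pA.length;
    double[] unnormalised = new double[n];
    double total = 0;
    for (int a = 0; a < n; a++) {
        double sum = 0;
        for (int c1 = 0; c1 < n; c1++)
            for (int c2 = 0; c2 < n; c2++)
                sum += pO[o][c1][c2] * pCP1[c1][a] * pCP2[c2][a];
        unnormalised[a] = pA[a] * sum;
        total += unnormalised[a];  // alpha = 1 / total
    }
    return unnormalised[x] / total;  // normalised posterior P(A = x | O = o)
}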
Table 7.1 Conditional probability table for the example, with |CP| = 4 and M = {st, si, ld}

         A = st        A = si        A = ld
         st  si  ld    st  si  ld    st  si  ld
  CP1     1   0   0     0   1   0     0   0   1
  CP2     1   0   0     0   1   0     0   0   1
  CP3     1   0   0     0   1   0     0   0   1
  CP4     1   0   0     0   1   0     0   0   1
Table 7.2 Extract of the final output probabilities obtained with the conditional probabilities of Table 7.1

  CP1  CP2  CP3  CP4   O    P(O | CP1, ..., CP4)
  st   st   st   st    st   1
  st   st   st   st    si   0
  st   st   st   st    ld   0
  si   si   si   si    st   0
  si   si   si   si    si   1
  si   si   si   si    ld   0
  ld   ld   ld   ld    st   0
  ld   ld   ld   ld    si   0
  ld   ld   ld   ld    ld   1
Table 7.3 Conditional probability tables for the CPs in the more complicated example; for each CP, the highest value (0.8) gives its most likely output for each real activity

         A = st           A = si           A = ld
         st   si   ld     st   si   ld     st   si   ld
  CP1    0.8  0.1  0.1    0.1  0.8  0.1    0.1  0.1  0.8
  CP2    0.1  0.8  0.1    0.1  0.8  0.1    0.1  0.8  0.1
  CP3    0.1  0.1  0.8    0.1  0.8  0.1    0.1  0.1  0.8
  CP4    0.8  0.1  0.1    0.1  0.8  0.1    0.1  0.1  0.8
Table 7.4 Extract of the final output probabilities obtained with the conditional probabilities of Table 7.3

  CP1  CP2  CP3  CP4   O    P(O | CP1, ..., CP4)
  st   st   st   st    st   0.9995
  st   st   st   st    si   0.0002
  st   st   st   st    ld   0.0002
  st   si   ld   st    st   0.9827
  st   si   ld   st    si   0.0154
  st   si   ld   st    ld   0.0019
  si   st   st   st    st   0.8
  si   st   st   st    si   0.1
  si   st   st   st    ld   0.1
  ld   ld   st   st    st   0.4961
  ld   ld   st   st    si   0.0078
  ld   ld   st   st    ld   0.4961
  ld   ld   ld   ld    st   0.0002
  ld   ld   ld   ld    si   0.0002
  ld   ld   ld   ld    ld   0.9995
Note also that in the case where the CPs output (ld, ld, st, st), that is, half the CPs output one context value (ld) and half another (st), these two values are equally likely to be output, whereas si, while possible, is extremely unlikely. This is also the desired behaviour.
Consider now a more complicated example. Table 7.3 shows the CPTs for the CPs in this example. The highest numbers (the 0.8 values) represent the probability of a CP's most likely output given the real context. Table 7.4 shows an extract of the output when we apply the definition of P(O | CP1, ..., CP4).
Here, given the combination of outputs from the CPs (st, si, ld, st), the most likely final output is O = st. This reflects the fact that, in Table 7.3, given an activity A = st, the most likely value of each CP is (st, si, ld, st). This behaviour is also obtained in the cases (si, si, si, si) and (ld, si, ld, ld) for the activities A = si and A = ld, respectively, as expected. Further, the probability of correctness of the final output is greater than the poc of any single Context Provider, so taking all four Context Providers into account does give us an advantage over using a single CP.
In summary, this example shows that through using a probabilistic reasoning
approach, we are able to combine outputs from Context Providers with different
values and probability of correctness and produce an output that has its own
measure of uncertainty, which is determined by the uncertainty in the Context
Providers. Further, taking multiple Context Providers into account produces better
results than taking only a single CP.
7.6 Key Points
These models reflect the system's structure and behaviour, its requirements and the system states required to match its goals.
A great advantage of the architectural model-based approach to planning is that,
under the assumption that the model correctly mirrors the managed system, the
architectural model can be used to verify that system integrity is preserved when
applying an adaptation.
Models are built to support reasoning. There are a number of reasoning techniques well suited to autonomic systems, including programming languages, search-based reasoning and logic-based reasoning, which are discussed in this chapter.
References
1. Turing, A.: Computing machinery and intelligence. Mind LIX(236), 433–460 (1950)
2. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs (2010)
3. Osogami, T., Harchol-Balter, M., Scheller-Wolf, A.: Analysis of cycle stealing with switching times and thresholds. Perform. Eval. 61(4), 347–369 (2005)
4. Sharma, V., Thomas, A., Abdelzaher, T., Skadron, K., Lu, Z.: Power-aware QoS management in web servers. In: RTSS '03: Proceedings of the 24th IEEE International Real-Time Systems Symposium, p. 63. IEEE Computer Society, Washington, DC (2003)
5. Dorigo, M., Blum, C.: Ant colony optimization theory: a survey. Theor. Comput. Sci. 344(2–3), 243–278 (2005). doi:10.1016/j.tcs.2005.05.020
6. Chollet, S., Lalanda, P.: An extensible abstract service orchestration framework. In: Proceedings of the IEEE 7th International Conference on Web Services (ICWS '09), Los Angeles, CA, 6 July 2009
7. France, R., Rumpe, B.: Model-driven development of complex software: a research roadmap. In: FOSE '07: 2007 Future of Software Engineering, pp. 37–54. IEEE Computer Society, Washington, DC (2007)
8. OMG: Unified Modeling Language (UML). http://www.omg.org/technology/documents/modeling_spec_catalog.htm#UML. Feb 2009
9. OMG: Meta-Object Facility (MOF) specification, version 1.4. http://www.omg.org/cgi-bin/doc?formal/2002-04-03. Apr 2002
10. Herrmann, C., Krahn, H., Rumpe, B., Schindler, M., Völkel, S.: An algebraic view on the semantics of model composition. In: Model Driven Architecture – Foundations and Applications. Lecture Notes in Computer Science, vol. 4530, pp. 99–113. Springer, Berlin/Heidelberg (2007)
11. Garlan, D., Schmerl, B., Chang, J.: Using gauges for architecture-based monitoring and adaptation. In: Working Conference on Complex and Dynamic Systems Architecture, Brisbane, Australia (2001)
12. Magee, J., Dulay, N., Eisenbach, S., Kramer, J.: Specifying distributed software architectures. In: Proceedings of the 5th European Software Engineering Conference (ESEC '95), Sitges. LNCS 989, pp. 137–153. Springer, Berlin/Heidelberg (1995)
13. Jackson, D.: Alloy: a lightweight object modelling notation. Softw. Eng. Methodol. 11(2), 256–290 (2002)
14. Georgiadis, I., Magee, J., Kramer, J.: Self-organising software architectures for distributed systems. In: Proceedings of the First Workshop on Self-Healing Systems, Charleston, South Carolina, USA (2002)
15. Garlan, D., Schmerl, B.: Exploiting architectural design knowledge to support self-repairing systems. In: Proceedings of the 14th International Conference on Software Engineering and Knowledge Engineering, 15–19 July, Ischia Island, Italy (2002)
16. Garlan, D., Schmerl, B.: Model-based adaptation for self-healing systems. In: Proceedings of the First Workshop on Self-Healing Systems, Charleston, South Carolina, USA (2002)
17. Oreizy, P., Medvidovic, N., Taylor, R.N.: Architecture-based runtime software evolution. In: ICSE '98: Proceedings of the 20th International Conference on Software Engineering, pp. 177–186. IEEE Computer Society, Washington, DC (1998)
18. Dashofy, E.M., van der Hoek, A., Taylor, R.N.: Towards architecture-based self-healing systems. In: Proceedings of the First Workshop on Self-Healing Systems, Charleston, South Carolina, USA (2002)
19. Dearle, A., Kirby, G.N.C., McCarthy, A.J.: A framework for constraint-based development and autonomic management of distributed applications. In: Proceedings of the International Conference on Autonomic Computing, pp. 300–301, 17–18 May 2004
20. McCann, J.A., Huebscher, M., Hoskins, A.: Context as autonomic intelligence in a ubiquitous computing environment. Int. J. Internet Protocol Technol. (IJIPT), special edition on Autonomic Computing, 2(1), 30–39. Inderscience Publishers, Geneva, Switzerland
Evaluation Issues
Computer scientists, and the computing industries, rely on the ability to build systems and iteratively evaluate the design and implementation decisions made during that process. As we have seen in previous chapters, an autonomic computing system can take many forms, and as a consequence its evaluation, and moreover its comparison with other systems, can be difficult. The very nature of systems whose solutions emerge adds further complexity to their evaluation. This chapter presents the challenges of evaluating an autonomic system, what to look out for and what others have attempted to do to aid this activity.
The chapter's aim is to enable the reader to design tests and metrics that can be used to evaluate autonomic computing systems, with a particular focus on the aspects that make an autonomic system different from those without self-management features. As you will see, there is no single definitive metric that can be used in assessing the mechanisms of all autonomic computing systems.
8.1 Evaluation Issues
8.2 Evaluation Elements
8.2.1 Quality of Service
8.2.2 Cost
Autonomicity has a cost, and the degree of this cost and its measurement are not clear-cut. Currently, most performance studies of autonomic systems have measured the system's ability to reach its goal. More appropriately, however, the amount of communication, the actions performed and the cost of the actions required to reach that goal should also be noted.
For many commercial systems, the aim is to reduce the cost of running an infrastructure, which primarily includes people costs in terms of system administrators and maintenance. This means that the reduction in cost for such systems cannot be measured immediately but only over time, as the system becomes more and more self-managing. Therefore, measuring such costs, and in turn savings, is complex. However, using standard capacity planning techniques, there may be ways to estimate these savings to give a relative figure that can be used to compare approaches.
Cost comparison is further complicated by the fact that adding autonomicity means adding intelligence, monitors and adaptation mechanisms, and these cost not only processing time but also storage and memory (and all the maintenance costs typically associated with a computing system). For example, a Web server could have autonomic features added to allow it to cope with fluctuating and suddenly high demand (flash crowds) without lowering the user experience of the Web service. We would not only be interested in how well the system was able to cope with demand but would also want a measure of the cost of adding these particular features (and a measure to allow us to compare approaches). It may be the case, as in [2], that the costs of adding both monitors to observe incoming Web traffic, and the mechanisms to analyse the resulting data and effect change, outweigh their benefits under normal operation. As so-called normal operation is the majority of the time and is fairly predictable, it would appear that adding autonomicity is hardly worthwhile. However, there could be a case where loss of service under extreme conditions, for example, for disaster recovery servers, would be so damaging that the cost was justifiable. So in some cases, the addition of autonomic features might even impact negatively on the system under normal conditions; yet under duress, the system would simply fail without the autonomic features, and it is there that the real benefit lies. Therefore, a measure of the added functionality that would otherwise not be achieved in a non-autonomic system would be useful. In the example above, the added functionality is obvious and is also the actual goal of the autonomic system. However, other added benefits might not be obvious and may be found in a serendipitous fashion, so it could be difficult to predict what to test for in advance in such cases.
The system's physical architecture also has an obvious impact on the cost of a self-managing system. For example, most solutions consist of a service that has autonomic features added as separate components interfaced to the managed element. For many of these, the analysis and planning is either hierarchical or even centralised; that is, the monitors or gauges are external to what they are measuring, and the decision to adapt, and its supervision, is external to the managed element. Here the question is: is it fair to compare systems that use external computing hardware to run the autonomic services with those that run the autonomic services on the same system? With the former, costs could be in terms of the extra hardware and communications to that hardware. The saving is that the AI processing and autonomic data reside on a separate system and therefore do not impede the running system being managed. Extra processors dedicated to the autonomic services mean that they could be more intelligent, for example, checking the validity of a given reconfiguration in advance of that reconfiguration, or providing open intelligence where the autonomic decisions themselves are adaptive. So these benefits, and their future potential, versus their extra costs could be considered.
In more emergent, decentralised or agent-based autonomic systems, the intelligence can be tightly coupled with the functional logic of the main managed element and usually contained within the component or agent itself. The self-management overhead is then perhaps indistinguishable from the agent's core function, making it more difficult to separate out the costs of autonomicity, if that is sensible at all.
8.2.3 Adaptivity
To discuss this, we separate the act of adaptation from the monitoring and intelligence that cause the system to adapt. Adaptivity can be something as simple as a parameter being changed, for example, changing the buffer size thresholds in a self-configuring system. Here the adaptation does not impact the system as much as a component-based architectural reconfiguration does. In the latter, a component may need to be hot-swapped: state is saved, the new component located and then bound into the system. Some systems are designed to continue execution while reconfiguring, while others cannot. Furthermore, the location of such components again impacts the performance of the adaptation process. That is, a component that is currently local to the system versus one (such as a printer driver) retrieved over the Internet will perform significantly differently. Perhaps future systems will have the equivalent of a prefetch, where components that are likely to be of use are preloaded to speed up the reconfiguration process. Intuitively, standard metrics that measure responsiveness should be able to highlight whether time is being spent adapting the system, because the system becomes less responsive.
Adaptability can also be seen as a measure of how well the system can reconfigure to cope with policy or goal changes after initial deployment. For example, over time, the goals of the business can change subtly, and this has a direct impact on the autonomic system's requirements. This, in turn, has implications for the rules and policies derived from these changes. Not only does the autonomic system need to allow such change, but those changes will also have an impact on the behaviour of the autonomic managers. How these managers enable, and furthermore cope with, this change is something that can be evaluated. What one should be interested in here is how those changes affect the system's behaviour and whether this behaviour is both expected and welcome. This is a relatively straightforward process if the changes can be predicted in any way. However, there are many classes of system whose goal or usage may not be so predictable, and so testing the system for its ability to adapt in a meaningful way is non-trivial.
8.2.4 Time to Adapt and Reaction Time
Related to cost and sensitivity are measurements concerned with system reconfiguration and adaptation. The time to adapt is a measurement of the time a system takes to adapt to a change in the environment, that is, the time lag from the identification that a change is required until the change has been effected safely and the system has moved to a ready state. Reaction time can be seen to partly envelop the adaptation time. This is the time from when an environmental element changes until the system recognises that change, decides what reconfiguration is necessary to react to the environmental change and gets itself ready to adapt. Further, the reaction time affects the sensitivity of the autonomic system to its environment (see next section).
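As a concrete illustration, the sketch below timestamps the three events just described; the class and method names are ours, purely illustrative.

// Illustrative timing of the phases described above (names are ours).
public final class AdaptationTimer {
    private long changeObserved;   // an environmental change is detected
    private long decisionReady;    // reconfiguration decided; system ready to adapt
    private long adaptationDone;   // change effected safely; system in a ready state

    public void onChangeObserved() { changeObserved = System.nanoTime(); }
    public void onDecisionReady()  { decisionReady  = System.nanoTime(); }
    public void onAdaptationDone() { adaptationDone = System.nanoTime(); }

    // Reaction time: from the observed change to being ready to adapt.
    public long reactionNanos()    { return decisionReady - changeObserved; }

    // Time to adapt: from the observed change until the adaptation completes.
    public long timeToAdaptNanos() { return adaptationDone - changeObserved; }
}

A monitor would call the three callbacks as the corresponding events occur; the two durations can then be reported alongside the quality of service metrics.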
8.2.5 Sensitivity
Fig. 8.1 Audio example showing how reducing samples and sensitivity equates to fewer state changes, and highlighting that this equates to less bandwidth utilisation
environment (i.e. bandwidth measurements) and has low deviation thresholds such that the system tries to track the bandwidth and maximise the overall quality of sound delivered to the user over time. The green line represents a less sensitive version of the autonomic audio player. Here the sampling frequencies are lower and the thresholds that indicate when to adapt are looser; therefore, the system reconfigures less often over time. The trade-off is that sound quality is not optimal, but the cost of the autonomic system is significantly lower.
8.2.6 Stabilisation
Another metric related to sensitivity is stabilisation, that is, the time taken for the system to learn its environment and stabilise its operation. This is particularly interesting for open adaptive systems that learn how best to reconfigure the system. For closed autonomic systems, the stabilisation would be a product of the static rule/constraint set and the stability of the underlying environment the system must adapt to. It is the time to reach homeostasis that is important here. To test this, one would ensure that the test environment contains parameters that cover edge conditions, values beyond the normal expected values of the system. From this, the tester can observe the behaviour of the system as it tries to adapt to best maintain its goals under these extreme conditions. What the tester wishes to observe is whether or not the system can get back to a steady state and how long this takes (see Fig. 8.2). As with every living organism, as the system ages, it may lose its ability to maintain homeostatic balance.
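One simple way to operationalise "return to steady state" is a tolerance band around the goal value, as in control-theoretic settling time. The sketch below is illustrative; the band width and window length are assumptions a tester would tune.

// Declares the system stable once the observed metric stays within a
// tolerance band around the goal for a whole observation window.
// The band width and window length are illustrative assumptions.
public final class StabilisationDetector {
    private final double goal;
    private final double tolerance;   // absolute tolerance band around the goal
    private final int window;         // consecutive in-band samples required
    private int inBand = 0;

    public StabilisationDetector(double goal, double tolerance, int window) {
        this.goal = goal;
        this.tolerance = tolerance;
        this.window = window;
    }

    // Feed one sample; returns true once homeostasis is (re)established.
    public boolean sample(double value) {
        inBand = Math.abs(value - goal) <= tolerance ? inBand + 1 : 0;
        return inBand >= window;
    }
}

The stabilisation time is then the number of samples (or the elapsed time) between the disruption and the first call that returns true.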
Fig. 8.2 Illustration of stability, a disruption and the return to a stable state. Compare this with the control example in Chap. 3
8.2.7
Typically, many autonomic systems are designed to avoid failure at some level. Many are designed to cope with hardware failure, such as a node in a cluster that is no longer responding. Some avoid such a failure by stopping and rebooting; others seek an alternative, perhaps retrieving a missing component and installing it. Either way, the predictability of failure is an aspect that can require consideration when comparing autonomic systems. Some systems will be designed for their ability to cope with predicted failure but will be unable to handle failures that they are not programmed to identify or rectify. Systems that refine policies from goals are typically excellent at coping with predictable failure, as the methods to overcome it are programmed into the policies' associated actions. For example, a goal may be to ensure that transactions do not take more than one second. Here the resulting policy could have the condition "if node utilisation reaches >= 70 %", with the action being to bring up a new server node. However, such mechanisms typically fail in unpredictable cases; when they come across an unforeseen situation (assuming they even recognise the situation at all), they can only resort to some default action, such as informing the user. Other systems are designed to cope with unpredictability. These tend to be systems that embed the autonomic features at the finer-grained lower levels, and here the logic is typically highly distributed. Examples of such systems are those that perform routing functions in the Internet. Here nodes in the system maintain routing tables that enable them to route around a node failure. Note that the notion of unpredictability is relative here; even the most autonomous system requires some programming of what to do when failure is detected.
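The utilisation policy above can be rendered as a simple condition-action rule. The sketch below is our own illustration of the idea, not code from any particular policy engine.

import java.util.function.DoublePredicate;

// The utilisation policy above as a condition-action rule (illustrative only).
public final class ScaleOutPolicy {
    private final DoublePredicate condition = u -> u >= 0.70;  // utilisation >= 70 %
    private final Runnable action;                             // e.g. bring up a new server node

    public ScaleOutPolicy(Runnable bringUpNode) { this.action = bringUpNode; }

    public void evaluate(double nodeUtilisation) {
        if (condition.test(nodeUtilisation)) {
            action.run();
        }
        // Unpredicted situations fall through to a default elsewhere,
        // e.g. notifying a human operator, as discussed above.
    }
}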
Let us illustrate this by returning to our audio-server example. Recall that the purpose of this system was to adapt its audio encoding depending on how it perceived the link between the audio server and the user at a given moment in time. The overall goal is that there should never be a moment of audio silence during playback. To test how well the system is able to achieve this, we placed it in a controlled environment. Here we would artificially mimic the communications link between the user and the server and vary the bandwidth available over the link. We would also change how quickly the bandwidth varied. This would test its ability to avoid periods of silence under differing environmental circumstances. The intuition is that the system would most probably be able to cope in situations where the bandwidth only varied slightly or in a predictable way. That is, the changes in bandwidth would fit obvious trends, and the variation in bandwidth would be small enough to allow the system time to identify the trend and quickly reconfigure to ensure perfect playback. One would expect the system to adapt more gracefully than in a more bursty network. We then increased the parameters that reflected the environment to more stressful levels; in this case, the environment represented bandwidth fluctuation with extreme variation between high and low values. This experiment showed us that the system continued to operate correctly but was adapting up and down the codecs constantly, sometimes even missing an opportunity to adapt because it did not notice an instance of environmental change while it was still handling the previous adaptation [3]!
Therefore, one may wish to see how well the system is able to cope with less predictable situations. Tests would be designed with this in mind. One could choreograph a situation where nodes are switched off to mimic failure or a workload is ramped up to extreme heights and injected into the system. The use of randomness or distributions of values can be stretched beyond expected limits to enable the tester to examine pathological cases. The measurement of the system's ability to cope could simply be in terms of how well certain quality of service metrics are met, which is obviously close to the application domain.
Related to this is the ability to compare how autonomous a system is. This would be a measure not only of how well the system can cope in less predictable situations but of how much it relies on the outside world to do so. For example, the NASA Pathfinder must cope with unpredicted problems and learn to overcome them without direct external help. Decreasing the degree of predictability in the environment and seeing how the system copes could measure this. Lower predictability could even mean the system having to cope with things it was not designed for. A degree of proactivity could also be compared through these features. The notion of autonomy is also related to the discussion about how centralised and fully decentralised approaches to autonomic computing differ, which we have later in this chapter.
8.2.8
Most autonomic systems in existence at the time of writing are of a partially autonomic type, also described as the basic, managed and predictive levels of the Autonomic Maturity Model [4] (see Chap. 2.4.3). This means that some form of user interaction remains in the MAPE-K loop. Therefore, one needs to evaluate the system's ability to communicate with external entities such as the technical support
8.2.9
properties of the system (rather than the microscopic local or node-based properties). Metrics would now be required to compare the system's, or the structure's, ability to quickly and seamlessly adapt, for example, to add or delete a node to/from the system.
The common example regularly used to illustrate this is found in autonomic communications systems. A communications network is designed to route data packets from a source node to a sink node (similar to that found in Internet routing or sensor network routing). The elements of the network are nodes (computers and routers) and arcs (the communications infrastructure: wired links or wireless radio links). The network communication is multi-staged, whereby data hops over the arcs to the nodes. Each node has local knowledge regarding what it should do with a packet, but its logic is such that a structure (the routing tree) emerges to ensure reliable delivery of the data packet and that the packet is delivered over the shortest or quickest route to its destination. Here a node does not need to know about all the nodes in the network, just its neighbours. If each node knows the best neighbour to send the packet to, then one can imagine that the packet will be sent over all the best links in the network to the sink. Therefore, the optimised delivery of that packet, that is, the route, emerges.
Evaluating such a system would involve its ability to cope with disturbances, malicious or not and temporary or permanent. In this example, there could be a node failure, which means that the current fastest known route is no longer viable and the data must be rerouted around the dead node to ensure delivery. Metrics to measure this would be throughput and latency based. Latency will highlight that the data had to be rerouted away from the shortest path and therefore the extra hops involved, which incur temporal costs. Likewise, if a new node is added to the system, this too will affect the structure in that many of the shortest routes may need recalculating; otherwise, the new node will be underutilised. The adaptation is to the structure of the system as a whole and not necessarily a change to the behaviour of the local nodes themselves, though they need to have logic that knows to look for new nodes and to route around nodes they no longer have contact with. For a fully distributed system, this would mean each node sending identification messages to each neighbour node to see who is there and then collectively building up routes. This is obviously an expensive activity, increasing the number of control messages and limiting the ability to route the actual data. Therefore, when measuring such a distributed system, it is necessary to measure the goodput relative to the throughput. This highlights how much data is being sent over the network and how much is overhead.
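The goodput-to-throughput ratio is straightforward to compute; the figures in the sketch below are illustrative only.

// Goodput relative to throughput: the fraction of bytes moved that were
// application data rather than control-message overhead (figures illustrative).
public final class OverheadMetric {
    public static double goodputRatio(long applicationBytes, long totalBytes) {
        return totalBytes == 0 ? 0.0 : (double) applicationBytes / totalBytes;
    }

    public static void main(String[] args) {
        // 8 MB of data delivered for 10 MB sent in total: 20 % of the
        // traffic was routing and identification overhead.
        System.out.println(goodputRatio(8_000_000L, 10_000_000L));  // 0.8
    }
}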
8.2.10 Granularity/Flexibility
Similar to sensitivity, the granularity of autonomicity is an important issue when comparing autonomic systems. Fine-grained components with specific adaptation rules will be highly flexible and may adapt to situations better; however, this may cause more overhead in terms of the global system. That is, if we assume that each finer-grained component requires environmental data and provides some form of feedback on its performance, then potentially there is more monitoring data, or at least environmental information, flowing around the global system. Of course, this may not be the case in systems where the intelligence is more centralised or the monitored data is stored in a shared repository.
Granularity is important; take the example in [2]. Here the authors found that unbinding, loading and rebinding a component took a few seconds. These few seconds could be tolerable in a thick-grained component-based architecture, where the overheads can be hidden in the system's overall operation and where change is not that regular. However, in finer-grained architectures, such as an operating system or ubiquitous computing, where change is more regular or the components smaller, the hot-swap time is potentially too much.
One question we may ask is: can systems that provide the same service be compared with each other if the granularity of autonomicity is different? Perhaps at a high level, yes. But let us unpick this a little further. If both approaches provide the same quality of service, the same ability to reduce costs, the same capacity to satisfy users, etc., is there a further cost? This further cost lies in the system's ability to be maintained; an autonomic system and its autonomic features require maintenance too, just like traditional systems. If the granularity is fine grained, it usually means that there is tight coupling between the managed element and the management software. This adds an extra burden in terms of debugging, updating and improving the overall system and therefore should be accounted for.
8.3
Emergent behaviours are those that arise from a number of (simple) processes cooperating to achieve a goal. Emergence has proven, in nature, to be an agile way to solve problems that involve many components, as it has the ability and flexibility to adapt to situations that were unplanned. A natural example of emergent behaviour is found when birds flock; the flock is an entity that serves to transport numbers of birds for migration or to make the unit appear larger so that the individuals can be defended. Another advantage of emergence is that the rule base for the individuals is typically quite simple. It is for these reasons that the autonomic computing community has developed and adapted emergent algorithms to make systems more robust to failure and change. One example of this is where gossip algorithms are used to move heartbeats around a network of components. Here each node, component or entity in the network sends a message with its heartbeat to a neighbour, which then passes it on. When the network converges, it has a common understanding of the system's heartbeats as a result. Emergent algorithms' behaviour also needs evaluating, and there are a number of metrics that can be used to do this; we list some below.
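Before turning to those metrics, the following minimal sketch illustrates the gossip-style heartbeat exchange just described; the class is ours and purely illustrative.

import java.util.HashMap;
import java.util.Map;

// Minimal gossip sketch (class is ours): each node repeatedly merges its
// heartbeat table with a neighbour's until the network converges on a
// common view of the system's heartbeats.
public final class GossipNode {
    final Map<String, Long> heartbeats = new HashMap<>();  // nodeId -> latest heartbeat

    GossipNode(String id) {
        heartbeats.put(id, System.currentTimeMillis());    // our own heartbeat
    }

    // One gossip round with a neighbour: both sides keep the newest entries.
    void gossipWith(GossipNode other) {
        heartbeats.forEach((node, ts) -> other.heartbeats.merge(node, ts, Math::max));
        other.heartbeats.forEach((node, ts) -> heartbeats.merge(node, ts, Math::max));
    }
}

Repeated pairwise exchanges of this kind are what drive the network towards the common view; the number of rounds to convergence can itself serve as a metric.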
8.3.1
8.3.2 Equilibrium
8.4 Benchmarking
Finally, it may become necessary to bring these metrics together to form some sort of benchmark. There are two approaches this can take: either we can derive new autonomic systems benchmarks, or we can augment current benchmarks to incorporate metrics that measure autonomic characteristics.
Benchmarking is the process whereby a system under test is subjected to a synthetic (controllable) workload and its performance is measured under that workload. Therefore, the components of a benchmarking process are the system under test, the workload, the performance metrics, the component that measures the performance of the operational benchmark and the test results, as seen in Fig. 8.3.
The design of the workload is central to benchmarking, as it is this that tests the system. It can represent what the system was designed to do, and thus the results from the benchmarking process will show how well it is able to do that job. Here the workload will be designed to represent the typical use of the system. The workload can also test aspects such as how well the system can scale and work under stress. Here, the workload would be beyond the expected use of the system (as known at design time or under current usage), and the results will show where the system could potentially fail in the future.
An example of a general benchmark is the Standard Performance Evaluation Corporation's (SPEC) benchmarking suites. SPEC is a not-for-profit organisation that produces standardised sets of performance benchmarks to evaluate computer systems. The results of running such benchmarks are sometimes referred to as
Fig. 8.3 The components of a benchmarking process: a workload (built from traces, goals, disruptions and the environment) applied to the system under test and its autonomic manager, with performance results compared against standard benchmarks
8.5
There has been little progress in defining a definitive benchmark for autonomic systems, bar a small set of publications that very much come to the same inconclusive conclusions. In this section, we discuss the Autonomic Computing Benchmark, as this is perhaps the most mature approach available at the time of writing.
To match the autonomic computing investment by IBM, a benchmark, aptly named the Autonomic Computing Benchmark, was released in 2008. This benchmark is described as using a fault injection methodology and five categories of faults or disturbances. Though its authors describe the word "fault" as implying an invalid operation and the word "disturbance" as having the broader meaning covering invalid operations, intrusions, interruptions, events, etc., anything that alters system state, they use the words interchangeably. Two metrics are used to evaluate the system: the throughput index and the maturity index. The latter is a measure of autonomicity (as described in Chap. 2.4.3) indicating the degree of human intervention required in the task. Throughput measures the impact on quality of service expectations due to the injected disturbances.
As recommended in our early paper on autonomic system evaluation [6], the Autonomic Computing Benchmark essentially extends current representative benchmarking suites with mechanisms that pertain to autonomic features. To this end, the benchmark suite mimics a typical B2B (business-to-business) application and is designed essentially to wrap around current business application benchmarks such as the SPECjAppServer2004 Performance Benchmark, a popular J2EE performance benchmark from the SPEC organisation [7]. It covers a multicomponent application architecture and also takes administrative duties into account. The benchmark has three states: baseline, test and check. Essentially, the system must ramp up to a steady state to represent the baseline performance of the application under the given conditions. From this, disturbances are injected in a predefined sequence and the system is thus tested. Finally, the check state double-checks that the changes the system made to maintain stability have had no other negative impacts on the system, such as lost transactions or updates and missing data. In between disturbance sequences, the system is allowed to recover to a steady state to enable the user to trace the cause and effects of the disturbances more clearly.
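The overall shape of such a run can be sketched as follows; this is our own rendering of the baseline/test/check cycle, not the benchmark's actual code.

import java.util.List;

// Skeleton of the baseline/test/check cycle described above
// (our own rendering, not the benchmark's actual code).
public final class DisturbanceBenchmark {
    interface Disturbance { void inject(); }

    public void run(List<Disturbance> sequence) {
        rampUpToSteadyState();                 // baseline: record reference performance
        double baseline = measureThroughput();

        for (Disturbance d : sequence) {       // test: inject disturbances in order
            d.inject();
            report(baseline, measureThroughput());
            waitForSteadyState();              // recover before the next disturbance
        }
        verifyNoLostTransactions();            // check: nothing lost or corrupted
    }

    private void rampUpToSteadyState()      { /* workload-specific */ }
    private void waitForSteadyState()       { /* poll until performance stabilises */ }
    private double measureThroughput()      { return 0.0; /* workload-specific */ }
    private void verifyNoLostTransactions() { /* consistency checks */ }
    private void report(double base, double now) {
        System.out.printf("throughput index: %.2f%n", base == 0 ? 0 : now / base);
    }
}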
Among the measurements that determine the notion of autonomic maturity is a quantification of the quality of a self-healing action. Some self-healing actions can consist of mechanisms to avert the problem caused by a disturbance by routing around it or creating ways to bypass that process. After that, the system may instigate mechanisms to heal the problem, and the repaired element is reintegrated back into the system under test. Now, given that self-healing systems have existed since before the term "autonomic" was used, and many systems have some sort of self-healing nature to them, one may not notice any change to the system because a component was able to continue operation with the fault injected into it. The example given is that of a RAID redundant disk array that is able to continue operation after a disk failure. That component is still operational, albeit in a reduced way that might cause problems in the future. This is what they mean by quality of repair: the system has not failed and is continuing to deliver data, but it is not operating in an ideal or optimal way. To represent these phenomena, the Autonomic Computing Benchmark judges that bypassing a problem does not constitute a full recovery, attributing a value to this and a score to any repair action taken. The intuition is that if no repair occurs (even if the system is running and stable), the resources available to the system are reduced. This allows those aiming to evaluate the system to get a better idea of the capability that the system has when facing any subsequent changes upon it. This metric was derived by the benchmark's authors, inspired by their work with autonomic systems. It is a metric they found to be valuable, and it highlights both the complexity of deriving metrics for autonomic systems and the fact that metrics can be important to some users but not necessarily to all.
8.6 Key Points
References
1. Kaddoum, E., Gleizes, M.-P., Georgé, J.-P., Picard, G.: Characterizing and evaluating problem solving self-* systems. In: Proceedings of the 2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns, pp. 137–145. IEEE Computer Society, Washington, DC (2009)
2. McCann, J.A., Jawaheer, G.: Experiences in building the Patia autonomic webserver. In: 1st International Workshop on Autonomic Computing Systems, DEXA 2003, 1–5 September 2003, Prague, Czech Republic (2003)
3. McCann, J.A., Howlett, P., Crane, J.S.: Kendra: adaptive Internet system. J. Syst. Softw. 55(1), 3–17 (2000). Elsevier Science
4. IBM Data Governance Council Maturity Model. http://www-935.ibm.com/services/uk/cio/pdf/leverage_wp_data_gov_council_maturity_model.pdf (2007)
5. Brown, A.B., Hellerstein, J.L.: An approach to benchmarking configuration complexity. IBM Research Report RC23146 (W0403-071), 10 Mar 2004
6. McCann, J.A., Huebscher, M.C.: Evaluation issues in autonomic computing. In: International Workshop on Agents and Autonomic Computing and Grid Enabled Virtual Organizations (AAC-GEVO'04), 3rd International Conference on Grid and Cooperative Computing, Wuhan, China, 21–24 Oct 2004. Springer-Verlag, Heidelberg
7. http://www.spec.org/jAppServer2004/, URL dated 26 Sept 2011
9.1 Software Integration
9.2 Cilia
Fig. 9.4 Internal structure of a mediator: a scheduler, a processor and a dispatcher, with input and output ports and an administration interface
Fig. 9.5 A mediation chain: an adapter (in), mediators connected by bindings and an adapter (out)
is to feed the mediators (and the destination resources) with data in the appropriate format and with the appropriate timing. Mediators constitute the heart of the chain since they implement the effective mediation operations. This is illustrated in Fig. 9.5.
Mediators (and adapters) are connected via bindings. A binding describes a connection between an output port and an input port. At execution time, a binding is realised by a communication protocol transferring data from one mediator (or adapter) to another. This protocol can be specified at deployment time but also at development time. Cilia supports local and remote communication protocols, including several message-oriented protocols.
A mediation chain is data-flow driven. Specifically, the computation model is the following: an adapter gets data from an application or a resource (using an appropriate protocol) and sends the collected data to one or several mediators. These mediators apply their operations to the received data as soon as the triggering conditions provided by their scheduler are met. Results are put in the output ports by the dispatcher and propagated to the next mediators. At the end, an adapter feeds an application or a resource (using an appropriate protocol) with the transformed data.
The DSCilia language permits the straightforward definition of mediation chains. Specifically, defining a mediation chain consists in specifying mediators and the bindings between those mediators. This is done via domain-specific terms that are familiar to domain developers (integrators). DSCilia also permits the easy definition of the most commonly used enterprise integration patterns [6], which are part of the baseline knowledge of every integration engineer. Indeed, these patterns are often based on synchronisation and dispatching functions.
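To give a feel for the "mediators plus bindings" style of definition, here is a hypothetical sketch; the builder API below is entirely ours and is not the actual DSCilia or Cilia Java DSL syntax.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "a chain = mediators + bindings"; this builder
// API is ours and is not the actual DSCilia or Cilia Java DSL.
public final class ChainSketch {
    record Binding(String from, String to) {}          // output port -> input port

    final String name;
    final List<String> elements = new ArrayList<>();
    final List<Binding> bindings = new ArrayList<>();

    ChainSketch(String name) { this.name = name; }

    ChainSketch element(String id)           { elements.add(id); return this; }
    ChainSketch bind(String from, String to) { bindings.add(new Binding(from, to)); return this; }

    public static void main(String[] args) {
        ChainSketch chain = new ChainSketch("integration-chain")
            .element("in-adapter")     // collects data from the source application
            .element("translator")     // a mediation operation
            .element("aggregator")     // a synchronisation/dispatching pattern
            .element("out-adapter")    // feeds the destination resource
            .bind("in-adapter", "translator")
            .bind("translator", "aggregator")
            .bind("aggregator", "out-adapter");
        System.out.println(chain.name + ": " + chain.bindings.size() + " bindings");
    }
}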
Fig. 9.6 Creating a mediation chain: a DSCilia file is passed to the Cilia runtime, which runs on top of OSGi/iPOJO/RoSe and connects the applications
The Cilia execution framework is built on top of OSGi and iPOJO (see Chap. 6 about adaptation). It also includes RoSe, an open-source communication middleware that is able to dynamically import and export services.
A mediation chain is created in the following manner (Fig. 9.6): a specification file, based on DSCilia, is transmitted to the Cilia runtime. These specifications are transformed into a number of iPOJO component definitions. At least five iPOJO components are created for each mediator: one for the scheduler, one for the processor, one for the dispatcher and two for the in and out communication ports (more components are created if different protocols are used by different ports). The defined iPOJO components are then instantiated and executed. From this point, the mediation chain is operational (and the desired integration is achieved).
Let us recall here that iPOJO relies on bytecode manipulation to create extensible containers encapsulating the execution of Java classes. A container can host a number of handlers implementing non-functional aspects (handlers are triggered before or after a method call). As we will explain in detail, this feature is particularly convenient for implementing dynamic monitoring functions attached to a component.
To sum up, Cilia is a recent framework meeting the stringent requirements of software integration and well adapted to the implementation of commonly accepted integration patterns. The use of domain-specific concepts simplifies the creation and understanding of mediation chains. However, adapting Cilia chains to new runtime conditions still depends on skilled administrators and generally requires some downtime. In many domains, administrators are not available or service interruption is not an option. Autonomic approaches are therefore needed. This is why the Cilia framework has been partly redesigned with autonomicity in mind.
In the rest of this chapter, we concentrate on the autonomic features exhibited by Cilia, as needed in the context of this book. Above all, we focus on the way in which Cilia has been implemented in order to illustrate some of the techniques and
9.3 Autonomic Cilia
9.3.1 Overview
Specifically, support for the following adaptations was demanded during Cilia's requirement elicitation phase:
– A mediation chain can be dynamically added or removed.
– Configuration parameters of a mediation chain can be dynamically updated.
– A mediator can be dynamically removed from a running chain.
– A mediator can be dynamically added to a running chain.
– A mediator can be dynamically replaced within a running chain (hot-swapping).
– Configuration parameters of a mediator can be dynamically updated.
– Configuration parameters of the execution machine can be dynamically updated.
– An adapter can be dynamically replaced.
– Configuration parameters of an adapter can be dynamically changed.
These adaptations are required to address functional evolutions or to fix non-optimal behaviours. Triggering an adaptation, however, requires good knowledge of the running chains, both in terms of specification and of runtime behaviour. It also requires facilities to implement structural and behavioural modifications without data loss or broken control flows.
As illustrated in Fig. 9.7, the Cilia framework now provides a set of touchpoints to dynamically monitor and adapt the mediation chains under execution and some aspects of the supporting execution platform (essentially the service discovery functions). It also allows the construction of a configurable knowledge base storing design and runtime information about the mediation chains and the platform under operation. This knowledge base provides a model of runtime phenomena, with trends and past data, and is intended for use by autonomic managers. The model is causal in the sense that modifications made on the model representation are reflected on the Cilia runtime and vice versa. Using this knowledge module is a very convenient way for domain engineers to create autonomic managers. Managers use the high-level APIs provided by the knowledge module to get relevant information and trigger adaptations. Such an approach does not demand familiarity with the intricacies of Cilia; domain-specific mediation knowledge suffices to manage Cilia-based systems.
Many of the Cilia features are flexible and configurable. Monitoring, in particular, can be controlled in a dynamic way. This means that Cilia monitoring can be activated or deactivated globally. It also means that the elements to be monitored, and the way they are monitored, can be configured without interruption of service. Similarly, the knowledge base can be loaded or not, used or not, executed on a different machine (through REST interfaces) or not, etc. This allows developers and administrators to use Cilia features in accordance with their needs and objectives.
Fig. 9.7 Autonomic Cilia: an autonomic manager (MAPE), developed by domain engineers, exchanges monitoring and adaptation directives with the knowledge module and the touchpoints exposed by the Cilia runtime (OSGi/iPOJO/RoSe), which hosts the applications
Expectations can obviously vary according to the runtime situation and to the problems that may arise, which is a typical property of administration systems.
It is important to remember that Cilia is a framework, which means that domain developers are in charge of the development of the mediation chains. To do so, they create mediators, bindings, chains, etc., in order to meet their requirements. But domain developers are also responsible for the development of the autonomic managers. In this context, the purpose of the framework is to provide facilities (touchpoints, design and runtime knowledge) to ease the work of domain engineers.
As said earlier, the Cilia runtime, and the mediation chains, may be distributed across several machines. However, autonomic decisions are centralised in the sense that a unique autonomic manager is responsible for the management of the running mediation chains. We will see later in this chapter that more decentralised solutions involving multiple autonomic managers are also being investigated.
To implement the touchpoints and the knowledge base, the Cilia framework uses many of the techniques presented in this book (Chaps. 4, 5, 6 and 7). This is presented in the next sections.
9.3.2 Cilia Touchpoints
Let us now see the interfaces that the Cilia framework provides to monitor the mediation chains and to adapt them dynamically. The Cilia framework offers a unique entry point called the CiliaContext. This interface is a façade, as defined in [7]. It provides general information about the Cilia framework and gives access to the Builder and Application Runtime objects.
For creation and modification purposes, the CiliaContext provides a Builder object. Builder is another well-known design pattern defined in [7], often used in object-oriented frameworks for the creation of complex heterogeneous structures.
In our case, the Builder object allows the construction and update of mediation chains and any of their constituents. Such creations and modifications can be made through a DSCilia file or directly in Java via what is usually called a Java DSL (where domain-specific concepts are exposed through regular Java interfaces).
The Application Runtime object allows the management of the monitoring function for the mediation chains and any of their constituents. It provides methods to dynamically define the elements to be monitored, the information to be collected or received and the way to do so (monitoring policies).
An extract of the CiliaContext interface is provided hereafter in Listing 9.1.
Monitoring relies on the notion of state variables, which are used to model the dynamics of the running chains. This approach draws its inspiration from control theory, as presented in Chap. 3. State variables are attached to global mediation chains but also to their constituents (mediators, adapters and bindings). Their values, called measures, are kept in circular lists in order to keep records of the past. The size of these lists is configurable and can be changed at runtime.
Measures can be kept in the knowledge base. Several policies are available to do so. For instance, values can be regularly sent to the knowledge module or simply provided on demand. Also, warnings and alarms can be defined on the state variables. When a measure exceeds a low or high threshold, a warning is emitted. When a measure exceeds a very low or very high threshold, an alarm is emitted by the Cilia runtime.
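A state variable of this kind can be sketched as a bounded history with two threshold levels. The class below is ours, a minimal illustration rather than Cilia's actual implementation.

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a state variable: a bounded list of past measures plus
// warning/alarm thresholds (the class and its API are ours, not Cilia's).
public final class StateVariable {
    public enum Level { OK, WARNING, ALARM }

    private final Deque<Double> measures = new ArrayDeque<>();
    private int capacity;                      // configurable, even at runtime
    private final double veryLow, low, high, veryHigh;

    public StateVariable(int capacity, double veryLow, double low,
                         double high, double veryHigh) {
        this.capacity = capacity;
        this.veryLow = veryLow; this.low = low;
        this.high = high; this.veryHigh = veryHigh;
    }

    public synchronized Level record(double measure) {
        if (measures.size() == capacity) measures.removeFirst();  // circular history
        measures.addLast(measure);
        if (measure < veryLow || measure > veryHigh) return Level.ALARM;
        if (measure < low || measure > high) return Level.WARNING;
        return Level.OK;
    }

    public synchronized void resize(int newCapacity) {  // change list size at runtime
        while (measures.size() > newCapacity) measures.removeFirst();
        capacity = newCapacity;
    }
}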
Specifically, the state variables attached to each mediator are the following:
– Scheduler start time
– Scheduler incoming data
– Processor start time
– Processor incoming data
– Processor outgoing data
– Processor end time
9.3.3
Fig. 9.8 The two-level Cilia runtime: a meta-level holding domain concepts (Java, fed by DSCilia files) above a base level
support. The meta-level allows the dynamic definition of monitoring and adaptation strategies in domain-specific terms, which is essential when it comes to complex administration problem solving.
The base level thus contains the iPOJO components implementing the schedulers, the processors, the dispatchers, the bindings, etc. These low-level objects are never exposed by Cilia through public interfaces. Yet these are the objects holding the runtime information that is needed to administer the Cilia framework. They are also the objects to be manipulated when it comes to modifying a running chain.
The Cilia framework ensures synchronisation between the two levels. The base level follows what is specified in the meta-level, and the meta-level includes information coming from the objects at the base level. The implementation relies on the observer pattern, as defined in [7], and on the notion of controller. As illustrated in Fig. 9.9, the specification of a mediator leads to the creation of three components:
– SpecificationModelManager stores the specification of a mediator at the meta-level.
– MediatorManager, at the base level, creates and manages the iPOJO components implementing a mediator, that is, the scheduler (S in Fig. 9.9), the processor (P), the dispatcher (D) and the communication components (C).
– MediatorControler handles the causal relation between the model in the SpecificationModelManager and the implementation controlled by the MediatorManager.
MediatorControler is an observer of SpecificationModelManager. When registering as an observer, it provides SpecificationModelManager with a callback method to be called when the mediator specification is changed.
Fig. 9.9 Synchronisation between the two levels: at the meta-level, the SpecificationModelManager holds the mediator specifications; the MediatorControler observes it (observation/notification) and drives the MediatorManager, which handles life-cycle management at the base level
MediatorControler also transforms mediator specifications into management directives intended for the MediatorManager. The latter can be seen as a factory [7]: it creates all the necessary components and manages their life cycle.
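The observer-based causal link can be sketched as follows. The three roles and their names come from the text above, but the method signatures are our own assumptions, not Cilia's actual interfaces.

// Sketch of the causal link described above; the three roles follow the
// text, but the method names are our own assumptions, not Cilia's interfaces.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class SpecificationModelManager {                     // meta-level model
    private final List<Consumer<String>> observers = new ArrayList<>();

    void register(Consumer<String> callback) { observers.add(callback); }

    void changeSpecification(String newSpec) {        // e.g. swap the processor
        observers.forEach(o -> o.accept(newSpec));    // notify registered observers
    }
}

class MediatorManager {                               // base level: factory and life cycle
    void apply(String directive) {
        System.out.println("(re)creating iPOJO components for: " + directive);
    }
}

class MediatorControler {                             // keeps the two levels in sync
    MediatorControler(SpecificationModelManager spec, MediatorManager impl) {
        // Callback invoked whenever the mediator specification changes:
        spec.register(newSpec -> impl.apply(toDirective(newSpec)));
    }
    private String toDirective(String spec) { return spec; }  // translation step
}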
9.3.4
The monitoring functions are implemented at the base level and controlled by the meta-level. That is, the meta-level is in charge of activating/deactivating the monitoring activity, selecting the state variables that are needed, collecting the values of those state variables and deciding on the storage policy.
The base level implements the monitoring functions per se. It tracks the relevant state variables and makes their values (measures) available to the meta-level. The implementation is based on the following principles (see Figs. 9.10 and 9.11):
– Each iPOJO component is augmented with a specific administration handler.
– Components of type MediatorManager have a monitoring API and pass monitoring directives down to the iPOJO components they manage.
Fig. 9.10 An iPOJO component hosting an administration (Admin) handler alongside the POJO, with its provided and required interfaces
Fig. 9.11 Monitoring implementation: at the meta-level, a RuntimeModelManager complements the SpecificationModelManager holding the mediator specifications; the MediatorControler receives monitoring events and measures (observation/notification) from the MediatorManager, which performs life-cycle management and monitoring at the base level
9.3.5
Regarding adaptations, the most ambitious goal of Cilia is to allow the dynamic modification of a chain's topology. This requires preserving control flows and saving the data being processed, the messages in our case. Just like monitoring, adaptation
Fig. 9.12 Message management in Cilia schedulers: incoming messages are saved in a DataPersistency component before being passed to the scheduler POJO
directives come from the meta-level. Concretely, the meta-level decides on the mediators to be modified, added, suppressed or swapped. Decisions are implemented by modifying specifications in the meta-level, using a domain-specific language (the DSCilia language or the Java DSL). Directives are then transmitted down to the base level, which realises the modifications in the code.
The base level implements the adaptation functions per se. To do so, it implements a quiescence protocol, as discussed in Chap. 6, allowing the safe adaptation or replacement of mediators. Precisely, the implementation is based on the following principles:
– All messages entering a mediator can be saved outside of the mediator.
– Components of type MediatorManager can be locked or unlocked.
– SpecificationModelManager provides a life-cycle management API, including start, stop, resume and remove directives.
Let us now detail these different implementation aspects. The administration handler of the iPOJO components that implement schedulers has been modified in order to store the data received by the schedulers outside of the components. Specifically, incoming messages (data) are first saved in a dedicated component called DataPersistency and then transmitted to the scheduler. This function is configurable: it can be activated or not. It is illustrated by Fig. 9.12.
In addition, a mediator can be locked. This operation is offered at the MediatorManager level. When a mediator is locked, incoming messages are redirected to the DataPersistency component but are not transmitted to the scheduler POJO. This mechanism is used to put a mediator in a quiet state, that is, a state where no computation is going on. When a mediator is quiet, that is, when it is locked and all the started computations are done, it can be removed, and the stored messages can be sent to another mediator (a brand new one or an existing one). This prevents data loss during mediator hot-swapping or chain topology change operations. A sketch of this mechanism follows.
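This is a minimal sketch of the lock-and-buffer quiescence idea; the names echo the text above, but the code is ours and simplifies Cilia's actual implementation (for instance, a real mediator would also wait for in-flight computations to finish).

import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch of the lock-and-buffer quiescence idea (names are ours;
// Cilia's actual implementation differs in detail).
public final class QuiescentMediator {
    private final Queue<Object> dataPersistency = new ArrayDeque<>();
    private boolean locked = false;

    public synchronized void receive(Object message) {
        dataPersistency.add(message);         // incoming data saved outside the mediator
        if (!locked) {
            process(dataPersistency.poll());  // forwarded to the scheduler when unlocked
        }                                     // when locked: redirected to storage only
    }

    public synchronized void lock() { locked = true; }

    // Quiet state: locked; a real implementation would also wait for any
    // already-started computation to complete.
    public synchronized boolean isQuiet() { return locked; }

    // Hand buffered messages to a replacement mediator: no data loss during
    // hot-swapping or chain topology changes.
    public synchronized void drainTo(QuiescentMediator replacement) {
        if (!isQuiet()) throw new IllegalStateException("mediator not quiet");
        Object m;
        while ((m = dataPersistency.poll()) != null) replacement.receive(m);
    }

    private void process(Object message) { /* the mediation operation proper */ }
}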
Once again, implementing such a mechanism has a cost: messages transmitted between mediators are intercepted, stored, managed, etc. However, this is the price to pay to be able to update a mediation chain without data loss or broken control flows.
9.3.6 Knowledge Module
The purpose of the knowledge module in Cilia is to describe the running Cilia artefacts (chains, mediators, adapters, bindings, etc.) in terms of specification and runtime models. It can be seen as the K in the MAPE-K architecture presented in Chap. 4. This K is built directly by the Cilia framework and is configurable (it can even be disengaged if desired). It contains a selection of measures and events characterising a mediation solution over time, corresponding to past and present runtime situations. For instance, it may record topological changes of the mediation chains, significant past values of specified state variables and current values of specified state variables. The definition of the information to be kept can be specified by autonomic managers through dedicated APIs. The autonomic managers can also directly use the touchpoints provided by the Cilia framework, but in that case, they have to build and maintain their own knowledge base.
Interactions between the knowledge base and the runtime are bidirectional. The knowledge base can explicitly fetch data through the monitoring touchpoints and subscribe to events originating in the Cilia runtime. Also, the knowledge base can trigger adaptations on the runtime and get feedback.
As previously indicated, the model provided by the knowledge base, the K, is causal. Modifications made on that model are automatically propagated down to the running chains and vice versa. For instance, suppressing a mediator in the model implies suppressing the corresponding implementation under execution in the Cilia runtime. Of course, a delay is necessary for an adaptation to be completed and for the results to be observable back in the K model. This has to be taken into account when measuring the effects of an adaptation, as explained in Chap. 8.
As illustrated by Fig. 9.13, the role of domain engineers is then to implement the mediation chains and the autonomic managers providing self-management features. If they decide to use the knowledge module for monitoring and adaptation purposes, they can concentrate on decision-making algorithms (see Chap. 7). Simply put, they just have to dynamically configure the monitoring function, get runtime information and apply adaptations through the knowledge module.
9.4
9.4.1
Life-cycle management comprises all operations necessary for getting an application from its specification stage to its execution stage. Often, life-cycle management must extend into the runtime for performing maintenance operations including updates, optimisations, repairs and extensions (see Chap. 1).
Fig. 9.13 The developer specifies and codes the mediation chains and implements the autonomic features, which use monitoring and adaptation directives, the knowledge module and the Cilia runtime (OSGi/iPOJO/RoSe)
In its current form,
the Cilia framework supports the autonomic life-cycle management of mediation chains. It provides monitoring and adaptation touchpoints at two abstraction levels. First, mediation domain concepts such as chains, mediators and bindings can be manipulated through a configurable knowledge base. Second, the concepts of the underlying implementation technology (iPOJO) can be manipulated through the Cilia runtime. Autonomic managers can be built on top of these touchpoints.
However, in some situations, developing autonomic managers may require significant effort. Indeed, when mediation chains and their execution environment become complex, the corresponding life-cycle management logic is also complex, requiring expertise in both the autonomic computing and mediation domains. Advanced solutions could facilitate the development and maintenance of autonomic management systems in the mediation domain. The main challenges to address in providing such solutions stem from key questions such as:
1. How to express the business-level objectives of autonomic life-cycle managers at a high level of abstraction? (See the discussion on goals in Chap. 2.)
2. How to develop the system management logic that automatically attains the objectives?
3. How to develop the decision logic that uses monitoring information and enforces adaptation operations in order to attain the objectives in the presence of runtime change? (See Chap. 7.)
4. How to ensure that the decision logic can handle a large spectrum of changes?
5. How to render autonomic managers extensible in order to easily add new objectives and decision functions able to pursue them?
6. How to ensure the scalability of autonomic managers with the size, number and
distribution of mediation chains and with the frequency of dynamic changes to
adapt to?
7. How to ensure the life-cycle management of the autonomic managers so that
they can follow the deployment of mediation chains and survive failures in the
underlying platforms?
9.4.2 Model-Based Solutions
Fig. 9.14 Model-based management: an autonomic manager guided by a reference architecture supervises applications running on the Cilia runtime (OSGi/iPOJO/RoSe)
9.4.3
The Cube project is developed by the Adele team at the University of Grenoble in collaboration with the S3 team at Telecom ParisTech (Cube homepage: http://cube.imag.fr).
Fig. 9.15 Cube agents: each agent combines the shared archetype, local knowledge and an application part, and runs on its own platform
9.4.3.1 Example
Let us consider an illustrative example of a Cube archetype and its life-cycle management process, in the context of a theoretical mediation application. Here, data collected from any source (Cilia adapter in) must first be aggregated (via a Cilia mediator) and then transmitted to a destination (Cilia adapter out). To improve performance, an aggregator may only accept data from a maximum of three sources. As sources join the system dynamically, the mediation chains must adjust their composition accordingly in order to integrate them. Figure 9.16 lists this example's archetype.
The archetype defines three types of mediator components (lines 6–8): S (source), A (aggregator) and D (destination). It also defines a number of constraints on these types. Namely, any data source must be connected to an aggregator (line 18), and any aggregator must be connected to a destination (line 19). When attempting to acquire an aggregator for a certain source, the agents in charge must first try to find an existing one (line 21) and then, if none is available, create one (line 23). The same policy is specified for acquiring destination components. A final constraint limits to three the number of input connections that each aggregator may accept (line 26).
Figure 9.17 depicts the runtime model of the mediation chains created by the agents after the dynamic insertion of eight data sources. The sources were added manually, triggering the agents to instantiate and interconnect the necessary
Cube's graphical interface shown here is based on the Prefuse visualisation toolkit: http://prefuse.org.
9.4.3.2 Discussion
By design, the Cube approach promises several advantages over related life-cycle
management solutions. Conversely, the same design features that provide these
advantages can also introduce certain drawbacks. In short, abstract reference models (archetypes) allow more runtime flexibility but provide fewer statically verifiable guarantees over the automatically determined solutions. The right level of abstraction must be found for each system in order to ensure the necessary balance between adaptability and control.
Similarly, decentralising the life-cycle management process and fragmenting the
runtime model can provide clear scalability and robustness advantages. At the same
time, decentralised control makes it difficult or sometimes impossible to obtain
management solutions that are globally optimal. Furthermore, communication
overheads required by agent coordination may impact global system performance;
system development must consider additional challenges including convergence
and stability. As before, the right compromise between complete, centralised control and long-term scalability and survivability must be determined depending on
each managed system. Generally, Cube is mainly applicable to cases where the
robustness of mediation chains in the long term is more important than instant system
performance, even though a certain baseline performance level may be ensured.
Conversely, the Cube approach should be avoided in cases where strong guarantees are required and can be provided via centralised control.
Concerning positioning with respect to related work, the Cube approach finds
itself at the intersection of several research fields and subfields, of which we only
mention a few at this point. For example, from a purely autonomic computing perspective, Cube can be seen as a solution for the autonomic life-cycle management
for large-scale, distributed and highly dynamic mediation systems [18, 20]. With
respect to the model-based software engineering field, Cube introduces a particular
combination of design features, including model abstraction and control decentralisation, for providing increased runtime flexibility and scalability. Also, Cube is
similar to the problem-centred approach presented, for example, in [24] or [25], where specifying a Cube archetype corresponds to posing a problem. From a bio-inspired system engineering perspective, Cube can be related to morphogenesis, the development process of biological organisms [21]. From this standpoint, Cube's archetype can be viewed as the equivalent of a biological genotype and the resulting application instances as the equivalent of biological phenotypes. From a
self-organising and self-adaptive system perspective, Cube can be seen as a compromise between top-down and bottom-up approaches, where the archetype provides
the means of controlling the global result of decentralised self-organising processes
[19]. From a constraint programming perspective, Cube can be considered as a
constraint-oriented solution, where the constraints are defined via the archetype and
the constraint resolver is decentralised, allowing partial solutions to combine into
globally conformant resolutions.
As a final note, it is important to point out that most of the features presented for the Cube approach represent envisaged capabilities associated with the generic Cube proposal, which developments so far have not yet fully implemented or validated.
9.5 Key Points
References
1. Wiederhold, G.: Mediators in the architecture of future information systems. Computer 25(3), 38–49 (1992)
2. Wiederhold, G., Genesereth, M.: The conceptual basis for mediation services. IEEE Expert 12(5), 38–47 (1997)
3. Lalanda, P., Bellissard, L., Balter, R.: Asynchronous mediation for integrating business and operational processes. IEEE Internet Comput. 10(1), 56–64 (2006)
4. Garcia, I., Pedraza, G., Debbabi, B., Lalanda, P., Hamon, C.: Towards a service mediation framework for dynamic applications. In: Proceedings of the IEEE 2010 Asia-Pacific Services Computing Conference, Hangzhou, China, 6 Dec 2010
5. Morand, D., Garcia, I., Lalanda, P.: Autonomic enterprise service bus. In: Proceedings of the Service Oriented Architectures in Converging Networked Environments (SOCNE), Toulouse, France, 5 Sept 2011
6. Hohpe, G., Woolf, B.: Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley, Boston (2003)
7. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading (1995)
8. Garcia, I., Morand, D., Debbabi, B., Lalanda, P., Bourret, P.: A reflective framework for mediation applications. In: Proceedings of the 10th International Middleware Workshop on Adaptive and Reflective Middleware, Lisbon, Portugal, 12 Dec 2011
9. OMG: Deployment and configuration of component-based distributed applications specification. http://www.omg.org/spec/DEPL. Apr 2006
10. Cazzola, W., Savigni, A., Sosio, A., Tisato, F.: Architectural reflection: bridging the gap between a running system and its architectural specification. In: Proceedings of the 6th IEEE Reengineering Forum (REF'98), pp. 12-1–12-6, Firenze, Italia, 8–11 Mar 1998
11. Cazzola, W., Savigni, A., Sosio, A., Tisato, F.: Architectural reflection: concepts, design, and evaluation. Technical Report RI-DSI 234-99, DSI, Università degli Studi di Milano, May 1999
12. IEEE Comput. 42(10), Special issue on Models@Run.Time (2009)
13. France, R., Rumpe, B.: Model-driven development of complex software: a research roadmap. In: Future of Software Engineering, pp. 259–268. IEEE Computer Society, Washington, DC, USA (2007)
14. Georgiadis, I., Magee, J., Kramer, J.: Self-organising software architectures for distributed systems. In: Workshop on Self-Healing Systems, pp. 33–38, Charleston, SC, 2002
15. Sykes, D., Magee, J., Kramer, J.: FlashMob: distributed adaptive self-assembly. In: Proceedings of the 6th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS), pp. 100–109, Honolulu, 2011
16. Nafz, F., Seebach, H., Steghöfer, J.-P., Anders, G., Reif, W.: Constraining self-organisation through corridors of correct behaviour: the restore invariant approach. In: Organic Computing – A Paradigm Shift for Complex Systems. Autonomic Systems, vol. 1, Part 1, pp. 79–93. Springer, Basel (2011)
17. Ulieru, M., Doursat, R.: Emergent engineering: a radical paradigm shift. Int. J. Auton. Adapt. Commun. Syst. (IJAACS) 4(1), 39–60 (2011)
18. Diaconescu, A., Lalanda, P.: Self-growing applications from abstract architectures: an application to data-mediation systems. In: IEEE Symposium Series on Computational Intelligence (SSCI 2011), IEEE Workshop on Organic Computing (OC 2011), Paris, France, 11–15 Apr 2011
19. Debbabi, B., Diaconescu, A., Lalanda, P.: Controlling self-organising software applications with archetypes. In: 6th IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO 2012), Lyon, France, 10–14 Sept 2012
20. Diaconescu, A., Lalanda, P.: A decentralised, architecture-based framework for self-growing applications. In: Proceedings of the 6th ACM/IEEE International Conference on Autonomic Computing and Communications (ICAC 2009), Barcelona, Spain, 15–19 June 2009
21. Diaconescu, A., Debbabi, B., Lalanda, P.: Self-growing software from architectural blueprints. In: 3rd Morphogenetic Engineering Workshop (MEW 2011), satellite of the 20th European Conference on Artificial Life (ECAL 2011), Paris, France, 8–12 Aug 2011
22. Anthony, R.J.: Emergence: a paradigm for robust and scalable distributed applications. In: International Conference on Autonomic Computing (ICAC), New York, 2004
23. Jelasity, M., Montresor, A., Babaoglu, O.: Gossip-based aggregation in large dynamic networks. ACM Trans. Comput. Syst. 23(3), 219–252 (2005)
24. Landauer, C., Bellman, K.L.: Knowledge-based integration infrastructure for complex systems. Int. J. Intell. Control Syst. 1(1), 133–153 (1996)
25. Landauer, C.: Problem posing as a software engineering paradigm. In: Proceedings of the 21st International Conference on Systems Engineering (ICSEng'11), Las Vegas, USA, 16–18 Aug 2011, pp. 346–351. http://dx.doi.org/10.1109/ICSEng.2011.69
10
The purpose of this last chapter is twofold. First, it draws together the lessons we have
learned about autonomic computing and the techniques that are used, at the time of
writing, to design and implement self-managed software systems. Our purpose is
clearly to help readers to understand, develop and maintain autonomic systems.
The second objective of this concluding chapter is to look ahead and foresee the
future of autonomic computing, while also attempting to point out some of the most
important challenges to address in order to attain the full autonomic computing
vision. To carry out this risky exercise, we view the topic from the perspective of how autonomic systems will be engineered and how assurances regarding their behaviours can be made. We acknowledge that targeting system-level autonomy will presumably necessitate integrated solutions, incorporating multiple autonomic elements, each dealing with different management concerns and operating at various granularity levels. In this context, we provide some examples of the more specialised fields of autonomic networking and autonomic machines. We also discuss the next-generation software engineering techniques, approaches and tools that would be required to meet future computing system requirements.
10.1
We believe that autonomic computing is bound to change the way software systems
are developed. This new field is addressing some of the issues resulting from the
ever-increasing complexity of software administration and the growing difficulty
encountered by software administrators in performing their job effectively.
Undeniably, properly managing software systems is a considerable challenge for
society and has to be handled urgently. The lack of appropriate responses to this
issue could force us to reconsider our reliance on software to support our businesses
and daily activities.
Autonomic computing can rely on advances in several scientific fields, sometimes very different from each other. In particular, autonomic computing is rooted
in several (surprisingly) complementary fields, most notably including control theory
and biology. To some extent, these two fields have paved the way to the classical
autonomic architectures where a number of cooperating context-aware control
loops can bring about a wide range of adaptations. Autonomic computing also relies
heavily on software engineering best practices to dynamically structure, monitor
and change software systems. In some way, software engineering provides the necessary techniques to implement the multiple control loop vision, and this can be said
to be derived from biology in some instances. Further, autonomic computing is
highly dependent on knowledge representation, reasoning techniques and learning
as defined in various computer science fields like artificial intelligence or more
distant domains like economics and psychology. From a certain perspective, these
latter fields provide the means to reason about particular situations and make decisions regarding the possible courses of administrative action to be undertaken.
Yet, despite these influences, we believe that autonomic computing is a field in its own right. Indeed, the runtime administration of software systems requires the definition of specific methods and techniques, still to be improved and explored. This is a formidable challenge, exacerbated by the complexity and variety of today's software systems and their execution environments, that will demand intensive research
for the years to come.
In fact, because they seek to unburden system administrators and improve overall
effectiveness, autonomic software systems are certainly more difficult to conceive
and implement than systems without autonomic capabilities. Autonomic systems
are expected to absorb the complexity of usually manual administrative tasks and
provide intuitive, high-level interfaces for human administrators. Meeting this
requirement leads to increased system complexity overall, albeit with the added
advantage of minimising perceived system complexity for administrators and users.
This observation is certainly the main motivation of this book. Indeed, along
with necessary explanations about the goals and origins of autonomic computing,
we believe that it is of major importance to present, in a coherent way, the different
techniques that are available today to design and implement autonomic software
systems. Of course, given the large number of techniques that exist, as well as the
high-quality projects in the autonomic computing area, we had to make tough
choices regarding what we include and what we cannot possibly cover in one book.
Decisions were guided by the rapid applicability of the techniques. In particular, we
dedicated quite an important part of this book to reviewing software engineering techniques and principles for architecting, monitoring, reasoning about and adapting systems, for they are key to software production in general. The rest of this chapter is concerned with longer-term lines of work.
10.2
In parallel to the general focus on autonomic computing, a number of groups have isolated sections of the general computing field and focused on self-management within them. Such work is more specific to the nature of the fields themselves but also overlaps somewhat with the general notion of autonomic computing that we present in this book. Two specific areas are autonomic communications and autonomic machines. The former examines the subject from a more topological point of view, where the network is a graph that
self-heals. The latter examines how one can imagine autonomicity right down to
the very metal of the machine itself. To be autonomic, a large-scale distributed
system will most likely have to integrate aspects of several fields, including autonomic communications, autonomic machines and autonomic software.
when the computer moves the address may change and I do not want to have to deal with that. Another example is that I would prefer to view the data in the network as an entity (e.g. a particular video stream rather than just a set of packets).
Essentially, SDN decouples network control (routing and forwarding decisions)
from the network topology (nodes and how they are linked). This means that these
different aspects can be implemented using different distribution models whereby the
control elements can become more sophisticated and can even be run on a different
platform from the traditionally low-powered switch or router technologies of the past.
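As a conceptual illustration of this decoupling, and without modelling any real SDN controller API, the sketch below separates a controller that computes forwarding rules from switches that merely apply them; all class and method names are invented:

import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of SDN-style control/forwarding separation; all names invented.
public class SdnSketch {

    // Data plane: a switch only matches packets against rules installed from outside.
    static class Switch {
        private final Map<String, String> flowTable = new HashMap<>(); // destination -> output port

        void installRule(String destination, String outPort) {
            flowTable.put(destination, outPort);
        }

        String forward(String destination) {
            // A real switch would escalate unknown destinations to the controller.
            return flowTable.getOrDefault(destination, "ask-controller");
        }
    }

    // Control plane: routing decisions live here and can run on a separate,
    // more powerful platform than the switches themselves.
    static class Controller {
        void program(Switch sw) {
            sw.installRule("10.0.0.1", "port-1");
            sw.installRule("10.0.0.2", "port-2");
        }
    }

    public static void main(String[] args) {
        Switch sw = new Switch();
        new Controller().program(sw);
        System.out.println(sw.forward("10.0.0.1")); // port-1
        System.out.println(sw.forward("10.0.0.9")); // ask-controller
    }
}

Because the controller is ordinary software, it can itself become the managed resource of an autonomic loop, which is one reason SDN sits naturally within the autonomic communications agenda.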
Finally, cloud computing, with its dynamics and complexity, brings increased
network resource demands in terms of fast reconfigurations and flexible resource
deployments brought about by the introduction of machine virtualization. Therefore, it is not surprising that the vast majority of research on autonomic computing concentrates on cloud and large data centre computing.
The topic of autonomic communications is very briefly introduced here; for
fuller discussion of the topic, we direct the reader to comprehensive surveys of the
subject [1, 3] or [2].
Their assumption is that this system will run on energy-efficient multicore computers scalable to thousands of cores. Here they have taken a factored approach to both the hardware and the low-level software and operating systems. For example, they use embedded memories consisting of low-voltage SRAMs capable of greater voltage and frequency scaling to significantly save energy. Dynamic cache-coherency schemes that can grow with the system have been developed and are again energy efficient, being able to adapt to system usage patterns. They describe this as a 4D approach to cache-coherency, which combines policy support and optimisations that depend on the operating context of the system at runtime. SEFOS is the self-aware operating system specially designed for such systems composed of 1,000+ cores. Given this assumption, they also provide support for helper threads, which assist the application's main threads of computation.
Using this factored hardware and systems software, the SEEC system relies on a goal-oriented computational model that abstracts traditional procedural programming into goals that are actuated in the self-aware, factored multicore system. SEEC explicitly incorporates energy and resiliency into the hardware, operating system, compiler and languages. In this way, the programmer defines goals such as "correlate the weather and room temperature streams burning less than 10 W", and the system should follow. Using methods based on machine learning and control theory, they are already able to show that their approach yields architectures that are orders of magnitude more energy efficient and dependable.
Key to this is the Angstrom support that exposes sensors and adaptations that
traditionally would have been managed independently by hardware. This allows
SEEC to control and coordinate hardware behaviours with actions specified by
other parts of the system, allowing the SEEC runtime system to meet application
goals while reducing costs (e.g. power consumption).
SEEC forms an observe-decide-act loop, much like the MAPE-K loop discussed in previous chapters. Here it continuously monitors its goals and resources, using intelligence to map resources to goals given the current system state. Every component of the Angstrom system, from applications to hardware, is designed to be autonomic in that all contribute to the specification of the observe-decide-act loop via an interface to specify goals and separate interfaces to specify actions (e.g. allocating processing cores or cache).
To make this tractable, they simplify the system's monitoring function in terms of three application-specified areas (goals): performance, accuracy and power. Performance is defined in terms of a target heart rate or latency between heartbeats. Accuracy is a measure of distortion from an application-defined nominal value over a given set of heartbeats. Power and energy are then specified as a target average power for a given heart rate or between heartbeats. Actuation then consists of actions that can take place as low as the system software and even the hardware level, as the associated interfaces are exposed. The most interesting part of this system is its decision-making
capacity. It is required to make decisions about actions with which it has had no
prior experience and yet be able to react quickly, at runtime, to dynamic changes in
application loads and resource fluctuation.
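To make the shape of this loop concrete, the following is a minimal, hypothetical sketch of a heartbeat-driven observe-decide-act cycle; the interfaces and names are invented for illustration and do not reflect SEEC's actual APIs:

// Hypothetical sketch of a heartbeat-driven observe-decide-act loop; not SEEC's actual API.
public class ObserveDecideActLoop {

    private final double targetHeartRate;   // performance goal: heartbeats per second
    private final double maxAveragePowerW;  // power goal: average power budget in watts
    private int allocatedCores = 1;         // actuator state

    public ObserveDecideActLoop(double targetHeartRate, double maxAveragePowerW) {
        this.targetHeartRate = targetHeartRate;
        this.maxAveragePowerW = maxAveragePowerW;
    }

    public void run(Sensors sensors, Actuators actuators) throws InterruptedException {
        while (true) {
            // Observe: read the application heart rate and the measured power.
            double heartRate = sensors.currentHeartRate();
            double power = sensors.averagePowerWatts();

            // Decide: compare observations against the declared goals.
            if (heartRate < targetHeartRate && power < maxAveragePowerW) {
                allocatedCores++;        // behind on performance and within the power budget
            } else if (power > maxAveragePowerW && allocatedCores > 1) {
                allocatedCores--;        // over the power budget: shed resources
            }

            // Act: apply the decision through the exposed actuation interface.
            actuators.setCoreAllocation(allocatedCores);

            Thread.sleep(100);           // loop period, chosen arbitrarily for the sketch
        }
    }

    // Stand-ins for the sensing and actuation interfaces exposed by the platform.
    public interface Sensors {
        double currentHeartRate();
        double averagePowerWatts();
    }

    public interface Actuators {
        void setCoreAllocation(int cores);
    }
}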
As stated in this book, we do not get autonomic computing for free; monitoring and decision-making are additional to the main computational load of the system and must either consume the same set of resources or be off-loaded to additional
computing resources. The Angstrom approach is to exploit the large number of
processors in the system and combine this with its ability to control the power that
those cores use. That is, to help reduce the costs of runtime decision-making, it pairs
each main processor with a specialised, low-power core called the partner core. The
partner core can inspect and manipulate state (e.g. performance counters) within the
main core and has access to the event queues fed by event probes, and thus the autonomic decision-making is off-loaded.
In a similar fashion, the SpiNNaker project from Manchester University makes
use of bio-inspired techniques, mainly from the human brain, to structure and organise billions of simple computing elements.4 Their aim is to build a highly scalable
parallel processing engine that is energy efficient. As multiprocessor and multicore
systems have become the norm, we can imagine these approaches becoming mainstream in the future.
10.3
Prediction is very difficult, especially about the future.5 While this statement applies
to any subject, it is especially true if that subject is fast moving like computing or if
it has yet to be fully defined, as with autonomic computing. The danger with autonomic computing has always been that it might be seen as a fad and fade away to be
remembered as something that got lots of funding around the turn of the millennium.
The question that has to be asked is, has autonomic computing made an impact? Was
the focus too broad (or too narrow)? Was the term overused or abused in some way?
ICT soothsayers describe a future where technology is highly pervasive. Sensors, actuators, RFID tags, etc., will be embedded in smart objects, people and their surrounding space. Networks will envelop these devices, creating a decentralised cyber-physical world of systems of systems. All is dynamic, heterogeneous and complex, yet tasked with one thing: to deliver reliable, efficient services. This
world is much too complex for humans to manage. Automated system management
is exactly what autonomic computing is about; therefore, it looks like there is a
healthy outlook for this subject.
To examine what the future of autonomic computing will look like and what will
impact the subject, we view the topic from the perspective of how such systems will
be engineered and how assurances regarding their behaviours can be made. We also
predict that the more specialist notions of autonomic communications and low-level
systems, as discussed in the previous sections, will converge in a more tightly coupled way, producing a much more complex yet agile set of systems.
4 http://rsif.royalsocietypublishing.org/content/4/13/193.full.pdf
5 Niels Bohr, Danish physicist (1885–1962).
Autonomicity has the capacity to bring about a further step change. Here, concerning security, one can imagine a third party falsifying the parameters fed into the autonomic manager to make it adapt or change its operation in an inappropriate way. For example, if we had an autonomic audio player that operated like the one that we present in this book (see Chap. 4), there could be malicious third-party software sitting on the client machine, purposefully slowing down the packets received by the audio client software, giving it the perception that the bandwidth was quite low. The system would then adapt by switching to a lower-quality compression codec. In turn, this lessens the quality of the audio playback but also frees up bandwidth, which the malicious software could make use of.
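A minimal sketch of the kind of naive adaptation rule that enables this attack is shown below; the audio client and its codec levels are hypothetical, and the point is only that the perceived bandwidth is trusted without any cross-checking:

// Hypothetical sketch of a naive codec-adaptation rule. The vulnerability is that
// observedBandwidthKbps comes from unauthenticated local measurements, so any
// process able to delay packets can steer the adaptation.
public class NaiveAudioAdapter {

    enum Codec { HIGH_QUALITY, MEDIUM_QUALITY, LOW_QUALITY }

    private Codec current = Codec.HIGH_QUALITY;

    public Codec adapt(double observedBandwidthKbps) {
        // Decision based solely on perceived bandwidth, with no sanity check against
        // independent evidence (e.g. server-side measurements).
        if (observedBandwidthKbps < 64) {
            current = Codec.LOW_QUALITY;
        } else if (observedBandwidthKbps < 128) {
            current = Codec.MEDIUM_QUALITY;
        } else {
            current = Codec.HIGH_QUALITY;
        }
        return current;
    }
}

Cross-checking monitored values against independent sources, or authenticating the monitoring path, is an obvious first hardening step.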
This example shows that because the autonomic system is fed with environmental data and data from the managed resources, this open point is a place where
vulnerabilities lie. Environmental complexity and system dynamism can render an
autonomic system vulnerable even in the absence of explicit malicious behaviour.
For example, even in a closed-loop system, if the autonomic system cannot perceive the environment correctly, it will not behave well. So any self-managing system that is embedded in a dynamic environment has to deal with uncertainty. This is especially so if that environment is the physical world we live in. For example, this would especially apply to sensor networks embedded in a building or in a vineyard, or even just to a system that has humans in the loop. Physical environments are by definition unpredictable; hence, at any point in time, there could be a
mismatch between the models of the environment understood by the autonomic
system and the actual environment. Furthermore, an autonomic system may have
no control over other processes that influence its environment. Therefore, there is
a movement to look at self-organising systems that exploit emergence to improve
their ability to remain robust to dynamic operating conditions. Exploiting these
principles is a promising direction to deal with uncertainty in decentralised autonomic systems.
This leads us to the question of what monitors the behaviour of the self-monitoring system itself. As we mentioned earlier, there is a movement, inspired by the emerging field of distributed control theory, that encourages the decoupling of the autonomic system into either hierarchies or collaborations of distributed systems. Yet, even in a distributed, collaborative setting, sharing complete knowledge
among decentralised adaptation managers constrains the scalability of the system.
The alternative is to not share complete knowledge, but this means that each of the
decoupled components only has partial knowledge of the system. That is, they are
only interested in, and able to control, the bits they are responsible for. This limits
the types of decision-making techniques that can be used to implement the knowledge component of the MAPE-K loop. For example, nonlinear programming and
queuing network models rely on the availability of system-wide knowledge. With
distribution, the lack of complete knowledge forces each self-adaptive unit towards solutions that are potentially suboptimal from the system-wide view.
Nevertheless, we are beginning to see the development of algorithms that converge
to optimal (or near optimal) solutions. However, in practice, engineering
10.4 Conclusion
To return to our question of whether autonomic computing was a fad of the new millennium: given that the data centres and systems that compose cloud computing infrastructures are already entirely instrumented and many of the management functions are now automated, we can say that there has certainly been an impact. At the
software level, most component and service-oriented models, frameworks and
technologies developed today provide inherent support for dynamic monitoring and
adaptation, including hot-deployment, hot-swapping, dynamic bindings and configurations. These represent basic touchpoints which are essential for enabling
the autonomic management of applications that rely on such platforms. Additionally,
platforms provide an increasing variety of basic autonomic capabilities including
automatic configuration, connectivity management, instance replication or downsizing. Indeed, as previously exemplified, several technologies had already started
providing automatic management functions before the autonomic computing domain
was explicitly defined. This only strengthens the position of autonomic computing,
showing the progressive emergence of self-management issues in our ICT systems
and the necessity to recognise and address them as first-order concerns in a dedicated domain. So, like most things that make sense, we can conclude that autonomic
computing is subtly being added, as a natural solution, without celebration or pomp.
It is here to stay and has a strong future [3].
How this future manifests is a product of the work that we present here in this
book, combined with the growing body of work either described as self-managing,
self-optimising, context-aware, self-adaptive or even simply autonomic. Further, as
technologies and computer science as a whole grow, these new ideas can be influenced by and inspire the autonomic computing area. For example, we have seen in this book that the degree of adaptability, agility and intelligence an autonomic system has is closely tied to the improvements, heuristics and speed of computation of artificial intelligence systems. As machine learning gets more sophisticated, faster to run, smarter, etc., we will see more online adaptation, and this will also become more
sophisticated. More system intelligence, combined with improved system awareness
of administrative objectives and human values, will bring about more predictable
and safe behaviours, which in turn will breed trust and reliability.
As another example, a complementary approach to traditional artificial intelligence
is one that exploits self-organisation and emergent behaviour. A better understanding of, and capacity to govern, this type of phenomenon can equally provide a means of
ensuring predictable results at the system level (even in the presence of unpredictability at the finer-grain levels). Hence, progress in these research areas can also
benefit autonomic computing and enable the construction of dependable and trustworthy autonomic systems. After all, trust and reliability are core to the uptake of
autonomic computing as we gradually take the human out of the loop.
Surely, as emphasised throughout this book, it is essential that humans can
remain in control of their autonomic systems. System autonomy should allow them
to do so, even if only by modifying high-level objectives or management policies,
rather than repetitively intervening to fix low-level technical issues. Hence, there
will also be a necessity to rethink human interactions with autonomic computing
systems. Novel interfaces will have to be designed to reflect the autonomic system's capacity to follow higher-level management directives and to provide insights into its success status and reasoning process.
Significantly more research is needed to achieve the full vision of autonomic management in our increasingly complex computing systems [3]. Comprehensive solutions
will require the integration of results from several research domains, investigating both
natural and artificial systems. The necessity for cross-domain research provides a great
opportunity for computer science advancements. Notably, it can help enrich software
engineering with novel paradigms, algorithms and architectures inspired from other
disciplines, which have already been confronted with the management of complexity
and unpredictability (e.g. biology, ecosystems, economics, artificial intelligence or
cybernetics). As indicated in the defining motivation of autonomic computing, the
ability to introduce self-management capabilities in ICT systems is essential if we are
to pursue the current trend of system development and computer embodiment within
our society. Software engineering must evolve accordingly in order to provide the
means to reason about, develop and maintain autonomic computing systems.
10.5 Key Points
References
1. Dobson, S., Denazis, S., Fernández, A., Gaïti, D., Gelenbe, E., Massacci, F., Nixon, P., Saffre, F., Schmidt, N., Zambonelli, F.: A survey of autonomic communications. ACM Trans. Auton. Adapt. Syst. (TAAS) 1(2), 223–259 (2006)
2. Sestini, F.: Situated and autonomic communications: an EC FET European initiative. ACM Comput. Commun. Rev. 36(2), 14–17 (2006)
3. Kephart, J.: Autonomic computing: the first decade. In: Keynote at the 8th International Conference on Autonomic Computing (ICAC), Germany, 2011
4. An architectural blueprint for autonomic computing. IBM Whitepaper, June 2005
5. Dardenne, A., van Lamsweerde, A., Fickas, S.: Goal-directed requirements acquisition. Sci. Comput. Program. 20(1–2), 3–50 (1993)
6. Kramer, J., Magee, J.: Self-managed systems: an architectural challenge. In: Future of Software Engineering, pp. 259–268. IEEE Computer Society, Washington, DC, USA (2007)
7. Brooks, R.A.: A robust layered control system for a mobile robot. In: Cambrian Intelligence: The Early History of the New AI, pp. 3–26. MIT Press, Cambridge (1999). ISBN-10: 0262024683
8. Åström, K., Wittenmark, B.: Adaptive Control, 2nd edn. Addison-Wesley, Reading (1995)
9. Söderström, T., Stoica, P.: System Identification. Prentice-Hall, Englewood Cliffs (1989)
10. Bourcier, J., Diaconescu, A., Lalanda, P., McCann, J.A.: AutoHome: an autonomic management framework for pervasive home applications. ACM Trans. Auton. Adapt. Syst. (TAAS) 6(1), 8:1–8:10 (2011)
11. Maurel, Y., Lalanda, P., Diaconescu, A.: Towards a service-oriented component model for autonomic management. In: 8th IEEE International Conference on Services Computing (SCC 2011), Washington, DC, USA, 4–9 July 2011
12. Frey, S., Diaconescu, A., Demeure, I.: Architectural integration patterns for autonomic management systems. In: 9th IEEE International Conference and Workshops on the Engineering of Autonomic and Autonomous Systems (EASe 2012), Novi Sad, Serbia, 11–13 Apr 2012
We have stated numerous times throughout this book that autonomic computing is a fast-growing and ever-changing field. Therefore, in true component-based software engineering fashion, we have abstracted out most of the dynamic components of the book that refer to the learning environment and exercises and placed these in an environment better suited to managing this dynamism: that is, they can be found on a Web page! We hope this Web environment will grow with the book, learning from the feedback that we receive from practitioners and students alike.
1 This is a software engineering practice where the component coupling, or object coupling in this case, occurs at runtime via the assembler object and is not known at compile time.
iPOJO IDE
Learning OSGi/iPOJO technologies may take some time, even for good Java developers. Students need to get familiar with new concepts like components or services, but they also have to learn new development environments (including XML configuration files and annotations).
In order to allow students to focus more rapidly on autonomic concepts, we have developed an iPOJO IDE (integrated development environment) allowing the rapid and simplified development of iPOJO applications. This environment provides a set of facilities to assist the developer in the creation and deployment of iPOJO components. In particular, a number of classes and files are (partially) generated. Also, deployment can be fully automated. In that context, we had to make tough choices for the sake of simplicity. However, the IDE keeps all the key iPOJO concepts, and the projects managed by the environment are standard OSGi projects. Developers are free to access and edit them directly, making the tool an ideal stepping stone to writing more complex OSGi applications.
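To give a flavour of what the IDE manipulates, here is a minimal iPOJO component of the kind it helps generate; the service interface and class are hypothetical, while the annotations are standard iPOJO ones:

import org.apache.felix.ipojo.annotations.Component;
import org.apache.felix.ipojo.annotations.Invalidate;
import org.apache.felix.ipojo.annotations.Provides;
import org.apache.felix.ipojo.annotations.Validate;

// Hypothetical service interface; in a real project it would live in a separate bundle.
interface GreetingService {
    String greet(String name);
}

// A minimal iPOJO component: @Component declares the component type and
// @Provides publishes the implemented interface as an OSGi service.
@Component(name = "greeting.provider")
@Provides
public class GreetingProvider implements GreetingService {

    public String greet(String name) {
        return "Hello, " + name;
    }

    @Validate
    public void start() {
        // Called by iPOJO once all required dependencies are resolved.
        System.out.println("GreetingService is now available");
    }

    @Invalidate
    public void stop() {
        // Called by iPOJO when the component becomes invalid (e.g. a dependency leaves).
        System.out.println("GreetingService is going away");
    }
}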
The IDE is provided as an Eclipse plug-in. Eclipse is a very popular standard IDE for developing Java applications. Eclipse comes with many features supporting development through the use of plug-ins. For instance, it is possible to run applications on an embedded OSGi platform within Eclipse. In this way, it is natural and easy to use the Eclipse debugger.
The IDE assists developers in the different development phases:
- At design time, for defining iPOJO components, their configurations and dependencies. Several wizards are provided to specify component types, provided and required services, service properties, configurations, etc. The iPOJO configuration files are also generated automatically.
- At implementation time, for implementing components and the provided services. The IDE can generate template implementation classes to facilitate coding. At any time, consistency between component specifications and their implementations can be verified. Finally, the IDE is able to reflect changes in the component specification onto the implementations without impacting the existing code.
- At compilation time, when building the OSGi bundles and managing project dependencies. The IDE automatically manages most library and Eclipse project dependencies. In particular, it knows the iCASA dependencies and can import them automatically.
- At configuration time, for configuring each component instance. The use of a wizard ensures that the configured properties have indeed been declared in the component definition. This prevents a common problem when using OSGi, where properties are identified by a single String that is disseminated across the code and configuration files, causing a lot of typos.
- At deployment time, by making deployment a one-click process. The user can choose to deploy the application in an OSGi platform embedded within Eclipse or in a remote platform. If an application has already been deployed, the IDE performs the necessary update. This makes application testing more straightforward.
(iPOJO homepage: http://felix.apache.org/site/apache-felix-ipojo.html)
Index
A
Active monitoring, 111, 112, 140
Adaptive maintenance, 9
Adaptive monitoring, 148–149
Administrator, 7–16, 20, 23–28, 31–33, 35, 36, 40, 41, 52, 63, 83, 85, 89, 96, 97, 99, 104, 112, 116, 169, 172, 173, 191, 195, 208, 219, 235, 241, 242, 255, 263, 264, 266, 280
Adoption model, 41, 42, 253, 268
Agents, 26, 57, 104, 196, 220, 255, 273
Amorphous computing, 50, 51, 275
Analysis (in MAPE-K), 113–115
Angstrom project, 268
Apoptotic computing, 74, 275
Architecture, 2, 4, 14, 17–19, 44–46, 48, 50, 53, 58, 61, 62, 71–73, 81, 85–87, 91, 95–126, 130, 131, 135, 136, 140–144, 147, 153, 158, 165, 169–173, 188, 194, 199, 202, 203, 208, 218, 220, 227, 232, 235–237, 240, 249, 251, 253, 264–268, 272, 273, 275, 277, 278, 281
Architecture definition language, 202
Arpanet, 265
Artificial intelligence, 18, 19, 57, 58, 61, 62, 82–89, 92, 105, 108, 117, 185, 190, 264, 276, 277
Autonomic communications, 39, 226, 265–268, 270
Autonomic computing
  benchmark, 231–232
  influences, 58–62
Autonomic manager, 107

B
Bayesian techniques, 208
Benchmarking, 229–232
Binary code, 154–159, 163, 164, 167, 169
Biology, 18, 25, 51, 57, 59, 61–74, 91, 264, 277, 278

C
Central nervous system (CNS), 65, 66, 71, 72
Cilia, 18, 19, 131, 147, 235–260, 272
Classifiers, 52, 208
CNS. See Central nervous system (CNS)
Code
  integration, 159
  upgrade, 168
Components off the shelf (COTS), 12, 100, 111
Computing context, 28, 53, 96, 97, 104, 106, 117, 129
Connectionist, 86, 87
Context, 3, 23, 59, 96, 129, 153, 189, 217, 238, 263
Control
  feedback loop, 18, 50, 77
  systems, 50, 57, 61, 68, 70, 73–82, 91, 92, 104, 273
Corrective maintenance, 9
Cost, 3–5, 8, 10–13, 15, 16, 24, 25, 31, 35, 39, 53, 58, 75, 100, 101, 103, 104, 106, 108, 110, 112, 117, 118, 123, 124, 133, 134, 138, 140, 142, 144, 153, 154, 159, 160, 169, 182, 196, 201, 205–207, 218–222, 227, 228, 231, 233, 236, 238, 245, 250, 265, 269
COTS. See Components off the shelf (COTS)
Coupling, 6, 98, 103, 156, 161, 171, 228, 280
Cube, 254–260

D
Decentralized autonomic managers, 27, 38, 43, 72, 85, 121–125, 198, 202, 208, 220, 225–228, 243, 254, 255, 258–260, 265–267, 270, 273, 274
Deductive reasoning, 188
Deliberative managers, 108
Descriptive knowledge, 187, 190
Dynamic adaptation, 159, 161, 163, 166, 167, 249–251, 277
Dynamic libraries, 165
Dynamic linking, 163–164, 166

E
Effectors, 66, 68–71, 98, 99, 102–104, 109, 110, 115, 119, 120, 122, 126, 161, 196, 205
Equilibrium, 60, 63, 91, 196, 229
Evolutionary computation, 52
Execution (in MAPE-K), 117–119

G
Game theory, 59, 60, 196
Goal-based reasoning, 87
Goals, 12, 23, 57, 95, 129, 153, 187, 217, 249, 264

H
Heuristics, 87, 89, 116, 117, 121, 205, 276
Hierarchical, 85, 97, 121–124, 126, 135, 140, 143, 144, 146, 166, 180, 220, 249, 267, 272, 273
Hierarchical based monitoring, 135
Homeostasis, 63, 91, 218, 221, 223

I
IBM
  manifesto, 14, 30, 31, 33, 40
  reference architecture, 97–100
Inductive reasoning, 188
Intelligence, 18, 19, 45, 57, 58, 61, 62, 82–89, 91, 92, 104, 105, 108, 117, 185, 190, 206, 218–220, 227, 264, 269, 276, 277
iPOJO, 147, 173, 177–182, 241, 245–250, 252, 259, 279–283

J
Java class loader, 174

K
Knowledge, 4, 28, 60, 95, 129, 185, 226, 237, 263
  by acquaintance, 187, 194, 213
  by description, 187, 213
  representation, 83, 87–89, 185, 189–191, 208, 264

L
Learning, 36, 48, 52, 61, 62, 66, 68, 69, 83, 88–90, 92, 105, 107, 108, 149, 197, 198, 207–209, 264, 269, 279–283
Linker, 156, 163
Logic-based reasoning, 206–207, 214

M
Managed artefact, 98, 100–106, 109, 111–117, 119, 121, 126, 130, 145, 185, 187, 188, 190, 191, 194–196, 205, 207, 213
MAPE-K model, 108–110, 119, 120, 126
Meta-model, 200, 201
Meta object protocol, 167, 181, 245
Models, 4, 26, 58, 97, 130, 168, 191, 218, 238
Monitoring
  in MAPE-K, 110–113
  overhead, 134–140
  probes, 146
  tools, 111, 134, 146–148
Moore's law, 11

N
NASA autonomic projects, 46–49
Nervous system, 25, 57, 62–74, 91

O
OC. See Organic computing (OC)
Ontology, 89, 104, 189, 194, 238
Open services gateway initiative (OSGi), 147, 173–178, 182, 241, 249, 280, 282, 283
Operating system (OS), 3, 8, 29, 44, 98, 104, 111, 119, 133, 136, 138–142, 147, 149, 154, 157, 159, 161–165, 181, 225, 227, 268, 269
Organic computing (OC), 49–51
OS. See Operating system (OS)
OSGi. See Open services gateway initiative (OSGi)
OSGi bundles, 173, 178, 282

P
Partial knowledge, 188, 274
Passive monitoring, 111, 112, 140
Perfective maintenance, 9
Performance metrics, 112, 132, 133, 229, 231
Performance monitoring, 131–135
Peripheral nervous system (PNS), 65, 66
PID controller, 78–80
Planning (in MAPE-K), 115–117
PNS. See Peripheral nervous system (PNS)
Prescriptive knowledge, 187, 190
Preventive maintenance, 9
Programming languages, 111, 156–158, 160, 163, 164, 166–168, 199, 203–204, 214
Propositional logic, 207

Q
Quality of service, 9, 35, 39, 63, 98, 101, 102, 105, 161, 170, 192, 218–219, 225, 228, 232, 233, 238
Quiescence, 160, 250, 259

R
Reaction time, 221, 233
Reflex arc, 66, 70, 73
Reflex-based managers, 107
Rolling upgrades, 160
Rules, 26, 51, 52, 59, 60, 107, 108, 114, 116, 124, 134, 143, 149, 166, 168, 191–193, 195, 199, 201–208, 211, 213, 221, 223, 227, 228, 254, 266

S
Search based reasoning, 204–206, 214
SEEC, 268, 269
Self-*, 17, 23, 33–39, 41, 44, 50, 53, 95, 126, 266, 279
Self-configuration, 34–37, 43, 53, 190, 220
Self-healing, 34–38, 43, 49, 53, 125, 138, 232
Self-optimisation, 34–39, 53, 74, 130
Self-protection, 34–37, 53, 71, 74
Sensitivity, 69, 221–223, 227, 233
Sensors, 28, 63, 95, 130, 193, 269
Service level agreements (SLAs), 26, 41, 42, 102, 131, 146, 218
Service level objectives (SLOs), 26, 218
Service-oriented components, 147, 172, 173, 177, 182, 245, 259, 280
SLAs. See Service level agreements (SLAs)
SLOs. See Service level objectives (SLOs)
Software
  adaptation, 153–155, 182, 280
  artefacts, 2, 6, 8, 10, 19, 96, 97, 104, 112, 126, 158
  complexity, 1–4, 14, 24, 58
  component, 5, 8, 98, 101, 106, 111, 130, 168–170, 172, 268
  deployment, 6–8
  development, 1, 2, 4–6, 12, 17, 19, 105, 181, 199
  engineering, 1–20, 57, 62, 65, 72, 90, 103, 111, 112, 142, 154, 155, 157, 170, 182, 194, 199, 200, 236, 258, 263, 264, 266, 271, 273, 277, 279, 280
  evolution, 6, 9, 199
  intangibility, 2, 19
  integration, 171, 235–239, 241
  maintenance, 9–12
  mediation, 18, 65, 106, 131, 147, 235–260, 272
  monitoring, 147
  process, 4
  system, 1–13, 15–20, 24, 26, 43, 52, 68, 82, 96, 103, 104, 106, 143, 153–156, 158–164, 167, 168, 170, 172, 177, 181, 182, 187, 194, 199, 236, 245, 254, 263, 264, 269, 270, 273, 277
  update, 19, 35, 160
Stabilisation, 222–223
Stability, 75, 80–81, 90, 123, 125, 149, 162, 219, 222–224, 229, 232, 233, 258
Stigmergy, 51, 198, 206
Symbolic, 86, 87, 102, 114, 163, 189
System profiling, 137–138

T
Touchpoint (Cilia), 242–245, 251, 252
Touchpoints, 77, 99, 110, 126, 130, 143, 187–191, 194, 205, 213, 276
Trust, 36, 266, 267, 273–277

U
Ultra stable systems, 91, 275
Usage context, 28, 29, 96, 106, 116
Utility functions, 114, 115, 124, 125, 191, 196–198, 213, 273

V
Variability, 103, 104, 158