AMF User Guide
USER GUIDE
Disclaimer
The contents of this document are subject to revision without notice due to
continued progress in methodology, design and manufacturing. Ericsson shall
have no liability for any error or damage of any kind resulting from the use of
this document.
Trademark List
All trademarks mentioned herein are the property of their respective owners.
These are shown in the document Trademark Information.
Contents
1 Introduction
1.1 Prerequisites
2 Basic Concepts
2.1 Availability Management Framework
2.2 Application
2.3 Cluster and Node
2.4 Component and Service Unit
2.5 Health Monitoring
2.6 Workload
2.7 Assignment
2.8 Failover and Switchover
2.9 Error Detection, Recovery, Repair, and Escalation
2.10 Information Model
2.11 Redundancy Model
2.12 Administrative Operations
6 General Concerns
6.1 Daemonizing
6.2 Logging
6.3 Error Handling
6.4 Standards Compliance
6.5 File System Layout
6.6 User Management
Reference List
1 Introduction
Scope
This document is a simplified version of the AMF specification but also contains
information related to the AMF system environment and other concerns.
Target Groups
This document is intended for application designers and developers.
1.1 Prerequisites
It is assumed that the reader is familiar with the SA-Forum system architecture
and concepts. For more information, refer to www.saforum.org.
The reader is advised to have a copy of the AMF specification (Reference [1]) at
hand when reading this document, as many references are made to it. In
particular, some of its figures complement this document.
2 Basic Concepts
— Send alarms.
2.2 Application
In the AMF context, an application usually means the server part of a
client-server application. There are many types of servers, such as web servers,
database servers, and gaming servers.
Greenfield applications are written from scratch, possibly with AMF integration
in mind. If so, they can freely use the AMF concepts, depending on their ambition
level, to provide service availability and become Service Availability-aware
(SA-aware).
An AMF application can consist of only a single operating system process but this
gives quite a bit of overhead because of the AMF modeling requirements. It is,
however, a good starting point when there are plans to make the application High
Availability (HA) or distributed, or both.
Components are grouped into Service Units (SUs). An SU is a logical entity
completely associated with one AMF node: all components in an SU execute on
the same AMF node.
— Passive
— External active
— Internal active
With active monitoring, latent faults, such as a program that loops and no
longer responds, can be detected, which is not the case with passive monitoring.
When active monitoring is used, it is also possible to validate the data received
from the monitored service. For example, if system uptime is requested from an
SNMP agent as part of actively monitoring it, the result can be checked to see
if it is reasonable. This kind of validation is service-specific and out of the
scope of the AMF and of this document. If used, it gives even higher service
availability because another class of errors can be detected.
The recovery action taken by the AMF when a fault has been detected is
configurable; it can, for example, be COMPONENT_RESTART, in which case a monitored
process that dies is restarted by the AMF. A recommended recovery action can also
be specified in the API used to report errors.
As operating system features are used, the component is not actively involved
in the monitoring and its code is not instrumented, hence the name passive
monitoring.
To use passive monitoring for other types of components (or for a subprocess),
it must be started using function saAmfPmStart() and stopped using function
saAmfPmStop().
AM_START starts a monitor process that periodically assesses the health of the
monitored application by making a simple service request to it. The AMF is not
involved in the actual monitoring; that is the responsibility of the monitor
process.
When the monitor detects a health problem with its monitored service, it
is to call function saAmfComponentErrorReport(). This implies that the
monitor itself is written in C/C++, or that a helper command exists that wraps
saAmfComponentErrorReport() so that it can be called by a monitor implemented
as a script.
In this case no one monitors the monitor, but as the monitor is simple and small
it can probably be considered fault free by review. If this is not appropriate, the
monitor can be implemented as an AMF SA-aware component to which the AM
commands send monitoring requests.
For more information about this feature, refer to Sections 4.8–4.10 in Reference
[1].
As the code is instrumented, this type of monitoring is normally only used for
SA-aware components.
A health check can be triggered by the component itself or by the AMF. When
triggered by the AMF, health check requests are sent periodically to the component
with a certain configurable period. The AMF expects a response within a certain
configurable time called the duration. The duration is always shorter than the
period.
A component can have several health checks active at the same time. Each health
check is identified by a key – a name. Some reasoning for this: depending on the
check performed, the impact on the service provided varies. A normal service
request has little impact and can be run with a shorter period. More detailed
component audits can have more service impact and are to be run with a longer
period.
Configuration of the period (and duration) must be done with high load in mind.
It is a trade-off between fast detection of true errors and avoidance of false
error detection. A longer period helps to avoid false error detection, but it
then takes longer to detect latent faults. A health check period is normally in
the range of seconds or even tens of seconds; it is most likely not less than
one second. The health check duration most likely must be longer than the
callback time-out, typically twice as long. It depends on the AMF implementation
whether two supervision timers run at the same time or whether health checks are
skipped while some other supervision, for example, a callback time-out, is
active.
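The rules of thumb above can be written down as a small configuration check. The following is a minimal C sketch, not part of the AMF API: the structure, the function name, and the use of milliseconds are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for health check timing; not an AMF type. */
struct hc_timing {
    unsigned long period_ms;           /* interval between health check requests */
    unsigned long duration_ms;         /* maximum time allowed for a response */
    unsigned long callback_timeout_ms; /* time-out used for ordinary callbacks */
};

/* Returns true if the timing follows the rules of thumb described above. */
bool hc_timing_is_plausible(const struct hc_timing *t)
{
    assert(t != NULL);
    if (t->period_ms < 1000)                      /* period below one second is unlikely */
        return false;
    if (t->duration_ms >= t->period_ms)           /* duration is shorter than the period */
        return false;
    if (t->duration_ms <= t->callback_timeout_ms) /* duration exceeds the callback time-out */
        return false;
    return true;
}
```

With a period of 10 s, a duration of 4 s, and a callback time-out of 2 s the check passes; raising the duration to 10 s makes it fail.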
Errors are reported to the AMF in two ways. When AMF-invoked health checks
are used, a negative response is given using function saAmfResponse(). When
component-invoked health checks are used, the component gives a negative
response using function saAmfHealthcheckConfirm().
2.6 Workload
A normal non-AMF-aware program provides service directly when started. There
is no distinction between the program and the service it provides. However, if the
service or work the program performs can be categorized and quantified, it can
also be modeled and managed. This categorized and quantified work/service is
what the AMF means by workload. Workload is a core concept used by the AMF to
enable high availability and is important to understand. When an application uses
the workload concept, the AMF can apply sophisticated redundancy schemes.
A simple example can be a web server that starts and initializes but does not bind
to port 80 until assigned the corresponding active workload. On another node,
the same program can be running as standby waiting to be activated if the other
instance goes down. This is an example of a simple 2N redundancy scheme.
In AMF terms, the workload is called a Service Instance (SI), and SIs are
assigned to SUs. An SI is further broken down into Component Service Instances
(CSIs), which are assigned to components (processes) and are visible in the API
to the program designer.
2.7 Assignment
The AMF assigns a workload in active or standby state to an application. This
means that the application upon receiving the assignment is to start providing
service according to the state of the assignment, and the amount and type of
service as described by the workload.
After an error has been detected and reported, the AMF tries to recover the
application provided service from the error. Recovery is performed automatically
by the AMF to ensure that all assignments are reassigned to a non-erroneous
component. If the AMF cannot reassign the workload, it sends the alarm
"workload unassigned", which means that a service is not available at all.
If the SU is restarted too many times during the SU probation time, the recovery
action is escalated to failover.
An application can also use the IMM to store its specific configuration data, thus
making it possible to configure and manage the application in the way intended by
SA-Forum.
— 2N
— N+M
— N-way
— N-way active
— NoRedundancy
To represent resources under its control, the AMF uses an abstract system model
consisting of various logical entities. This model is needed to describe the
system in a way the AMF understands. The AMF cannot manage an application
unless a corresponding model has been configured.
Most of the AMF logical entities are software entities. This means that they are
used to describe the instances of software execution under the AMF control
and the management policies and relationships between them. For example,
components represent executing programs while the SU describes relationships
(containment and dependencies) between components and the recovery policy
used when an error has been detected.
For an overview of the logical entities, refer to Figure 1 and Section 3.1 in
Reference [1].
Similar software entities are generalized into a versioned entity type. These are
of a certain base entity type. A base entity type can be visualized as an empty
base class, only needed to host versioned entity types. It does not contain any
configuration attributes.
[Figure: concepts and example, with the relationships "Is Of" and "Realizes"]
Types simplify configuration because common attributes can be gathered in the
versioned entity type. Imagine a system with many instances of the same
component: the less information that needs to be duplicated, the better.
All software entities are of a certain versioned entity type. This relationship
is defined by an attribute in the software entity. For example, an instance of
the SaAmfComponent class uses attribute saAmfCompType to state which
versioned entity type it is of.
The AMF B.04 system model can at first glance look overwhelming with all its
classes, but only 10 out of 33 classes are directly used when modeling an
application. The remaining 23 classes are entity types, runtime classes, and
non-software entities (such as nodes).
For the AMF instance view with relationships, refer to Figure 29 in Reference [1].
A component is the smallest entity that error detection, recovery, and repair are
performed on. Components have a state model where specifically the presence
state reflects the life cycle.
Components can either be integrated with the AMF (SA-aware component) or not
(non-SA-aware component).
Components integrated with the AMF use the API and are designed for the
workload concept. For a code example of such a component, refer to Appendix X
in Reference [1].
[Figure: Service Unit containing several components]
The AMF manages redundant SUs to ensure service availability if there are
failures. A Service Group (SG) is a logical entity that groups several SUs, see
Figure 3. The SG protects one or more SIs. An SG has a corresponding redundancy
model that defines how the SUs are used to provide service availability. SUs are
hosted on different nodes in the cluster.
Figure 3 Service Group
The application entity groups one or more SGs to provide a higher-level service,
see Figure 4.
Figure 4 Application
A CSI is quantified and categorized by its name and by extra modeling objects
of class SaAmfCSIAttribute. Attributes are name=value pairs that describe the
workload in a way a component can understand.
When a component is assigned a CSI through the callback, the configured
attributes are passed along.
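How a component might look up one of these name=value pairs can be illustrated with a minimal C sketch. The structure and function below are hypothetical simplifications made for this example; the real API passes the attributes in its own data types, described in Reference [1].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified name=value pair; not the type used by the AMF API. */
struct csi_attribute {
    const char *name;
    const char *value;
};

/* Returns the value of the named attribute, or NULL if it is not present. */
const char *csi_attribute_get(const struct csi_attribute *attrs, size_t count,
                              const char *name)
{
    assert(name != NULL);
    assert(attrs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(attrs[i].name, name) == 0)
            return attrs[i].value;
    }
    return NULL;
}
```

A web server component could, for example, read a "port" attribute this way when it receives its assignment, and fall back to a default if the attribute is absent.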
One or more CSIs are grouped into a Service Instance (SI), see Figure 5.
Figure 5 Service Instance
The set of nodes defines the AMF cluster. During the life span of a system, the
cluster membership changes as nodes join and leave the cluster. Reasons for a
changing membership can be as follows:
The AMF operates on a single cluster. The number of nodes can vary from at least
one to many. The middleware is responsible for managing these objects.
SAF also specifies the Cluster Membership (CLM) service, which has its own
cluster and nodes. The AMF nodes are mapped to CLM nodes. Describing this any
further is out of the scope of this document. For more information, refer to
Section 3.1.1.1 in Reference [1].
For information about the relationships for node and cluster, refer to Figure 27
in Reference [1].
The middleware probably comes with a few node groups predefined. Applications
can also create their own node groups to simplify tasks such as a complex upgrade
scenario.
LOCKED-INSTANTIATION
The entity is not allowed to be started (instantiated).
SHUTTING-DOWN
A transitional state where the service is gracefully shut
down; when done, the state becomes LOCKED.
LOCK_INSTANTIATION
An order to terminate the affected components and
transition to LOCKED-INSTANTIATION administrative
state.
UNLOCK_INSTANTIATION
An order to instantiate the affected components and
transition to LOCKED administrative state. Has no effect
on non-SA-aware components.
For more information, refer to Section 3.2 and Section 9.4 in Reference [1].
The operational state reflects the ability of a logical entity to provide service. The
state can be seen as the entity's error status. If no error exists that prevents the
entity from providing service, its operational state is ENABLED.
If any error exists that makes it impossible for the entity to provide service, the
operational state is DISABLED. For example, if a node is rebooted, all SUs mapped
to the node are DISABLED while the node is down.
The operational state is not related to the administrative state. The operational
state can be DISABLED while the administrative state is UNLOCKED. This is the case
if a node goes down because of a hardware error.
UNINSTANTIATED
The component is not started.
INSTANTIATION-FAILED
Failed state when instantiation has failed.
TERMINATION-FAILED
Failed state when termination has failed.
When a component enters the FAILED state, an alarm is produced by the AMF.
3.4.4 HA State
3.5 Dependencies
3.5.1 Workload
A CSI can depend on other CSIs in the same SI. The dependencies that a
particular CSI has on other CSIs are configured with the multi-value attribute
saAmfCSIDependencies in the CSI configuration object. Contrary to what is
specified, this attribute is not a list (which implies an order); it is an
unordered set.
3.5.2 Components
The AMF allows configuring an instantiation level for components to model
dependencies between components in the same SU. The AMF instantiates and
terminates components according to this level.
3.6 Ranking
SUs and SIs can be ranked. A rank is a positive integer (>0); the lower the
value, the higher the rank. For example, an SU with rank=1 is ranked higher than
another SU with rank=2.
A higher rank (lower integer value) for an SU means that it is assigned before
other SUs. A higher rank for an SI means that it is selected for assignment before
other SIs.
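The ordering rule can be illustrated with a minimal C sketch that sorts entities so that the highest-ranked one comes first. The structure and function names are hypothetical and only model the rule "lower value, higher rank"; they are not how the AMF itself is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified view of a ranked entity (an SU or an SI). */
struct ranked_entity {
    const char *name;
    unsigned int rank; /* positive integer; a lower value means a higher rank */
};

/* qsort() comparison: the entity with the lower rank value sorts first. */
static int compare_by_rank(const void *a, const void *b)
{
    const struct ranked_entity *x = a;
    const struct ranked_entity *y = b;
    return (x->rank > y->rank) - (x->rank < y->rank);
}

/* Orders the entities in the sequence in which they would be considered. */
void sort_by_rank(struct ranked_entity *entities, size_t count)
{
    assert(entities != NULL || count == 0);
    if (count > 1)
        qsort(entities, count, sizeof entities[0], compare_by_rank);
}
```

After sorting SUs with ranks 3, 1, and 2, the SU with rank=1 is first in the array and is therefore the first candidate for assignment.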
SA-aware components are gracefully terminated by the AMF using the terminate
callback. The cleanup script is run afterwards to clean up temporary files, such
as Process ID (PID) files created when starting the component. It is also run
when an error has been detected, such as a failed termination.
As non-SA-aware components by definition do not use the AMF API, they are
terminated gracefully by command TERMINATE.
The script and its arguments are specified in the component instance or in the
component type (as they are common between instances).
The script must be able to control a process, for example, stop it. It is recommended
to use a PID file for that purpose. The component process is to create the PID file
when it has started successfully. If the AMF wants to clean up a component, it
calls the script, the PID file is read, and a KILL signal is sent to the process.
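The PID file handling described above is normally done in a script, but the logic can be sketched in C for clarity. This is a minimal sketch under assumptions: the function names are hypothetical, and the PID file is assumed to contain a single decimal number.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Reads a PID from a PID file; returns -1 if the file is missing or malformed. */
pid_t read_pid_file(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    long pid = -1;
    int matched = fscanf(f, "%ld", &pid);
    fclose(f);
    if (matched != 1 || pid <= 1) /* never signal PID 0, PID 1, or process groups */
        return -1;
    return (pid_t)pid;
}

/* Cleanup step: send KILL to the process named in the PID file, then remove
 * the file. Returns 0 on success and -1 if nothing could be done. */
int cleanup_component(const char *pid_file)
{
    pid_t pid = read_pid_file(pid_file);
    if (pid < 0)
        return -1;
    int rc = kill(pid, SIGKILL);
    if (rc != 0 && errno == ESRCH) /* process already gone: still a clean state */
        rc = 0;
    remove(pid_file);
    return rc;
}
```

Rejecting values of 1 or less is a defensive choice in this sketch: a corrupt PID file must not cause a signal to be sent to init or to a whole process group.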
The AMF API is simple but it requires knowledge of the model and concepts. The
API is mainly relevant for SA-aware components integrated with the AMF,
but parts of the API are also useful for small commands and tools.
Basically, a component does some up-front initialization and after that waits
for events on a file descriptor provided by the AMF. When such an event occurs,
it is dispatched and the requests are serviced as callbacks.
The main use of the API is best described with some high-level pseudo code (no
error handling):
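main()
{
    // Initialize my service
    myservice_initialize()
    // Initialize with AMF
    callbacks.healthcheck = my_healthcheck_cb
    callbacks.csiset = my_csiset_cb
    callbacks.csiremove = my_csiremove_cb
    callbacks.terminate = my_terminate_cb
    handle = saAmfInitialize(callbacks)
    // Wait for AMF events and dispatch them as callbacks
    fd = saAmfSelectionObjectGet(handle)
    forever {
        wait_for_event(fd)
        saAmfDispatch(handle)
    }
}
my_healthcheck_cb()
{
    saAmfResponse(OK)
}
my_csiset_cb()
{
    // Start providing service according to the assignment
    saAmfResponse(OK)
}
my_csiremove_cb()
{
    saAmfResponse(OK)
}
my_terminate_cb()
{
    myservice_shutdown()
    exit(SUCCESS)
}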
— Assignments and changes in them are received from the AMF as callbacks as
a consequence of calling saAmfDispatch().
— When using other SA-Forum-defined services like the IMM, it fits nicely into
this program structure because the callback mechanism is the same between
most SAF services.
— The process loops forever in an event loop, listening for events on a file
descriptor. This is a common design pattern for a server program.
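The event loop pattern mentioned in the list above can be sketched in plain C using poll(). This is a minimal, self-contained sketch that does not link against the AMF library: the dispatch function type, the event_counter structure, and consume_event() are hypothetical stand-ins. In a real component the file descriptor comes from the AMF, and saAmfDispatch() plays the role of the dispatch function.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical dispatch function type, standing in for saAmfDispatch(). */
typedef void (*dispatch_fn)(void *context);

/* Hypothetical context used by the example dispatch function below. */
struct event_counter {
    int fd;      /* descriptor the events arrive on */
    int handled; /* number of events serviced so far */
};

/* Example dispatch function: consumes one byte and counts it as one event. */
void consume_event(void *context)
{
    struct event_counter *c = context;
    char byte;
    if (read(c->fd, &byte, 1) == 1)
        c->handled++;
}

/* Waits up to timeout_ms for an event on fd and dispatches it.
 * Returns 1 if an event was dispatched, 0 on time-out, -1 on error
 * (including interruption by a signal, which a caller may retry). */
int wait_and_dispatch(int fd, int timeout_ms, dispatch_fn dispatch, void *context)
{
    assert(dispatch != NULL);
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    if (pfd.revents & POLLIN) {
        dispatch(context);
        return 1;
    }
    return -1; /* POLLERR, POLLHUP, or POLLNVAL */
}
```

A server would call wait_and_dispatch() in an endless loop with a negative time-out (wait indefinitely). Adding descriptors of other services to the same poll() call is what makes the structure fit several SAF services at once.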
— If the legacy software is internal property, its code can be modified. If done in
an elegant way, it should be possible to use the same application in both an AMF
system environment and its original system environment.
— Use a proxy component to manage the legacy software, which in this case is a
separate "proxied" AMF component. The proxy solution is appropriate when
the redundancy model of the legacy software differs from that of the proxy entity.
This category contains simple programs not integrated with any middleware. Such
a program provides service directly when started. Either one program instance
provides the complete service or many instances provide the same service with
more capacity. An example can be a web server. Instances can run on many nodes
as long as they all have access to the same file system. Adding an instance only
means that more service requests per second can be serviced (a bit simplified).
5.3 Recommendations
The wrapper component integration approach is recommended for the "simple"
type of application. The reason is that the wrapper logic is much simpler than the
proxy/proxied variant. The AMF model is also simpler, with only one component
that models both the wrapper and the "wrapped" component.
6 General Concerns
6.1 Daemonizing
The AMF components are usually long-lived processes, at least the SA-aware
ones. They are started when the system is bootstrapped and must behave like
other daemons in the Unix® world.
More information can be found on the Internet but consider the following:
— Drop privileges.
— Close all open files including standard streams (stdout, and so on).
— Create a PID file (also known as lock file) for use by the controlling script.
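Two of the steps listed above, detaching the standard streams and creating the PID file, can be sketched in C as follows. This is a minimal sketch under assumptions, not a complete daemonization recipe: the function names are hypothetical, the standard streams are redirected to /dev/null (a common variant of closing them), and privilege dropping is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes the PID of the calling process to path. Returns 0 on success. */
int write_pid_file(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%ld\n", (long)getpid()) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Detaches from the starting process and terminal, then writes the PID file.
 * Returns 0 in the daemon process; the original parent process exits. */
int daemonize(const char *pid_file)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid > 0)
        _exit(EXIT_SUCCESS);  /* parent: hand control back to the starting script */
    if (setsid() < 0)         /* new session, no controlling terminal */
        return -1;
    if (chdir("/") != 0)      /* do not keep any mounted directory busy */
        return -1;
    umask(027);
    int devnull = open("/dev/null", O_RDWR);
    if (devnull < 0)
        return -1;
    if (dup2(devnull, STDIN_FILENO) < 0 || dup2(devnull, STDOUT_FILENO) < 0 ||
        dup2(devnull, STDERR_FILENO) < 0)
        return -1;
    if (devnull > STDERR_FILENO)
        close(devnull);
    return write_pid_file(pid_file); /* written last, once start-up has succeeded */
}
```

The PID file is written as the final step so that the controlling script only sees it once the process has started successfully.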
6.2 Logging
A daemon process is not to use printf-type output to a file. One reason is that
such files normally contain no time stamp or a non-standard one. Log rotation is
also required in a long-running system.
High-level application logging can, for example, go to the SAF Log service and
more detailed, processor-local logging to syslog.
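For the processor-local part, a thin wrapper around the standard syslog(3) interface is usually enough. The following is a minimal C sketch; the severity enumeration and the function names are hypothetical, and only the syslog calls themselves are standard.

```c
#include <assert.h>
#include <syslog.h>

/* Hypothetical application-level severities. */
enum app_severity { APP_DEBUG, APP_INFO, APP_WARNING, APP_ERROR };

/* Maps an application severity to a syslog priority. */
int to_syslog_priority(enum app_severity severity)
{
    switch (severity) {
    case APP_DEBUG:   return LOG_DEBUG;
    case APP_INFO:    return LOG_INFO;
    case APP_WARNING: return LOG_WARNING;
    case APP_ERROR:   return LOG_ERR;
    }
    assert(!"unknown severity");
    return LOG_ERR;
}

/* Opens the log once at start-up; ident is typically the component name. */
void app_log_open(const char *ident)
{
    openlog(ident, LOG_PID | LOG_NDELAY, LOG_DAEMON);
}

/* Logs one message; syslog adds the time stamp and the PID. */
void app_log(enum app_severity severity, const char *message)
{
    syslog(to_syslog_priority(severity), "%s", message);
}
```

Passing the message through a fixed "%s" format avoids interpreting user-supplied text as a format string. Time stamps and rotation are then handled by the system logger instead of the application.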
— Report the error to the AMF and choose either of the following:
When the application process is started, it is to change group and user accordingly
– drop its privileges.
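Dropping privileges can be sketched in C as follows. This is a minimal sketch under assumptions: the function name is hypothetical, the process is assumed to start as root, and the user and group are assumed to have been created when the package was installed. It uses initgroups(), which is widely available on Linux but is not part of POSIX.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <grp.h>
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Switches the process to the given user and group.
 * Returns 0 on success and -1 on failure. */
int drop_privileges(const char *user, const char *group)
{
    assert(user != NULL && group != NULL);
    const struct passwd *pw = getpwnam(user);
    if (pw == NULL)
        return -1;
    uid_t uid = pw->pw_uid; /* copy now: later lookups may reuse the buffer */
    const struct group *gr = getgrnam(group);
    if (gr == NULL)
        return -1;
    gid_t gid = gr->gr_gid;

    /* Order matters: change groups first and the user last, because after
     * setuid() the process no longer has the right to change its groups. */
    if (initgroups(user, gid) != 0)
        return -1;
    if (setgid(gid) != 0)
        return -1;
    if (setuid(uid) != 0)
        return -1;
    /* Verify that root privileges cannot be regained. */
    if (uid != 0 && setuid(0) == 0)
        return -1;
    return 0;
}
```

A component would call this early in its start-up, after any step that genuinely needs root (for example, binding to a privileged port) and before it starts serving requests.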
Unix groups and users are normally not deleted when a package is removed; they
are deleted manually by an operator. In the case of BaseMW, that corresponds to a
remove campaign.
7.1 Building
Building an AMF application is simple. From a C/C++ source file, include the AMF
header file saAmf.h and, when linking the program, link against the SaAmf library.
7.2 Packaging
An AMF application must be packaged in the native packaging format as
supported by the underlying Linux distribution, for example, RPMs. It is important
to remember that the AMF controls the life cycle of the program, not the Linux
init process.
Upgrade and remove are also normally done using SMF campaigns and it is
again out of scope of this document to describe how this is done. For upgrade
campaigns, tools exist that can create such a campaign based on the current
configuration and the wanted configuration.
It is clearly possible to bypass the SMF and use the IMM directly for configuring
the application model in the IMM. However, it is not the official way of doing it
and is only mentioned here for completeness.
The file is specified in XML and is included in the software bundle (package).
The ETF.xml file contains information needed by offline tools to generate upgrade
campaigns.
Reference List