AEServices Consolidated High Availability White Paper
Version 1.0
AVAYA 1|Page
HA White Paper for AE Services
Version 1.0
Table of Contents
1 Introduction
2 Geo Redundant High Availability (GRHA)
2.1 Overview
2.2 Key features
2.2.1 Controlled failover of AE Services to standby datacenter AE Services VM
2.2.2 Automatic activation of AE Services on standby datacenter
2.2.3 Automatic recovery from split brain condition
2.3 Benefits of GRHA
2.4 Effect of a controlled/uncontrolled failover on AE Services clients
2.5 Limitations:
3 Machine Preserving High Availability (MPHA)
3.1 Overview
3.2 Key features
3.2.1 Seamless failover in the event of controlled failover requests
3.2.2 Almost seamless failover in the event of sudden failures for MPHA protected VMs
3.2.3 Automatic recovery from split brain condition
3.2.4 Adaptive check-pointing
3.2.5 No limit on CPUs allocated to Application Virtual Machine
3.3 What does MPHA provide?
3.4 Power requirements for System Platform HA systems
3.5 Effect of uncontrolled failover on Application VM clients
3.6 Effect of Uncontrolled failover on Application Enablement (AE) Services Clients
4 Fast Reboot High Availability (FRHA)
4.1 Overview
5 The AE Services and High Availability
5.1 Device, Media and Call Control Service (DMCC)
5.2 TSAPI, CVLAN, DLG and Transport Services
1 Introduction
This white paper is intended for those responsible for architecting, designing, and/or deploying
an application or an Avaya Aura® Application Enablement Services server in a High Availability
(HA) configuration.
Uninterrupted telephony is important for many enterprises, especially for mission critical
applications. Avaya Aura® Application Enablement (AE) Services on System Platform (SP)
supports a high availability (HA) cluster of two nodes. The active server node automatically fails
over to the standby node in the event of a hardware failure. Client applications are able to
reestablish communication with the AE Services cluster when the failover is complete.
The AE Services HA solution is not supported on the AE Services Software-Only and Bundled
offers.
GRHA is not a state-preserving HA solution in the AE Services 6.3.1 release. Starting with the
AE Services 6.3.3 release, however, GRHA is a partially state-preserving HA solution: the state
associated with the DMCC service is preserved. When a controlled failover occurs, AE Services
are stopped on the current active VM and started on the new active VM (previously the
standby VM). During this phase the DMCC service loads its preserved state data. GRHA
allows two AE Services Virtual Machines (VMs) to be placed in two datacenters that are
separated by a LAN/WAN. The VM host can be either Avaya Aura System Platform or VMware.
MPHA is a state-preserving HA solution based on checkpointing a running VM at frequent
intervals (e.g., every 50 milliseconds). At each checkpoint, the memory (including CPU registers)
and disk state of a protected VM are synchronized with the standby server. In the case of a
failover, the standby server becomes active and in the process activates the replicated VM
(synchronized at the latest checkpoint).
FRHA is a partially state-preserving HA solution, similar to GRHA as offered in the AE Services
6.3.3 release. MPHA and FRHA are offered via System Platform, where the active and standby
System Platform servers hosting the AE Services servers are connected via a crossover cable.
This white paper also focuses on the AE Services interactions with Avaya Aura® Communication
Manager (CM) 6.0 (or later) in a HA configuration for the survivable core server (also known as
Enterprise Survivable Server or ESS) and the survivable remote server (also known as Local
Survivable Processor or LSP).
With the Machine Preserving High Availability (MPHA) and Fast Reboot High Availability (FRHA)
solutions that are offered via Avaya Aura System Platform, active and standby AE Services
servers are connected via a crossover cable. As per the CAT5 and CAT6 Ethernet cable
specification, the cable between the servers should not be longer than 100 meters. The GRHA
solution removes this limitation.
GRHA allows two AE Services Virtual Machines (VMs) to be placed in two datacenters that are
separated by a LAN/WAN with a round trip time within 100 milliseconds. To ensure that the AE
Services server does not fail over due to hardware failure, the GRHA offer, when used with
System Platform, can leverage the MPHA technology to provide hardware protection for the
AE Services VM in each datacenter. Note: MPHA is not supported on VMware.
For more information on MPHA, please refer to the MPHA section in this white paper.
[Figure: GRHA deployment across Datacenter-1 and Datacenter-2, linked by a LAN/WAN (< 100 ms RTT) and the PSTN. Each datacenter shows an AE Services server on an active System Platform with an MPHA hot standby System Platform, AE client applications, and a G650/G450 media gateway. Datacenter-1 shows Communication Manager; Datacenter-2 shows Communication Manager (ESS).]
Note: MPHA provides hardware protection in each datacenter
In this document the term “controlled failover” refers to a failover requested by either an
administrator or by software logic when it detects degradation in the state of health of the
current active server. The term “uncontrolled failover” refers to a failover which occurs because
the current active server is not reachable from the current standby server.
GRHA is not a state-preserving HA solution in the AE Services 6.3.1 release. In AE Services 6.3.3
and later releases, GRHA is a partially state-preserving HA solution for the DMCC service only.
When a controlled failover occurs, AE Services are stopped on the current active VM and AE
Services are started on the new active VM (previously the standby VM). In case of an
uncontrolled failover, AE Services are started on the new active (previously standby) VM.
Depending on the reason for the uncontrolled failover, the previous active VM could be in an
isolated network or it could be in the shutdown state.
Controlled failover requests can be made by the system administrator (from the AE Services
Management Console) or can be requested by the Arbiter (running on each AE Services server)
if it detects health deterioration on the active server. The Arbiter daemon will not request a
failover if the standby server health is the same as the active server or worse.
Note that the current active AE Services VM will remain active in any of the following scenarios,
even if the current standby AE Services VM is administered as the preferred node:
• Three levels of GRHA licenses: SMALL, MEDIUM and LARGE. Please refer to the Avaya
Aura® Application Enablement Services 6.3.1 Administration and Maintenance Guide,
section “Administering the Geo Redundant High Availability feature” for more
information.
After the application connects using a new session, it must re-establish all monitors and register
all the endpoints as if the AE Services server came out of a reboot.
The time it takes for the application to start receiving service depends on the total time
associated with the following activities:
• Time it takes for the standby to detect a failure of the active AE Services VM. This depends
on administered “failure detection” interval and applies only in case of uncontrolled
failover. For a controlled failover, this time is almost 0.
• Time it takes for AE Services to be activated on the new active server. Currently this time is
approximately 1 minute.
• Time it takes for the application to connect to the new AE Services server and to recreate
all its monitors/registrations/associations. This time depends on the number of devices the
application is trying to monitor/register.
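The three contributions listed above can be added up to get a rough recovery estimate. The following is a minimal sketch, not an Avaya tool: the function and parameter names are hypothetical, the per-device setup cost is assumed to scale linearly with the number of devices, and the 60 second activation default simply reflects the "approximately 1 minute" figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class FailoverEstimate:
    detection_s: float   # time for the standby to detect the failure
    activation_s: float  # time for AE Services to be activated on the new active server
    reconnect_s: float   # time for the application to recreate monitors/registrations

    @property
    def total_s(self) -> float:
        return self.detection_s + self.activation_s + self.reconnect_s


def estimate_failover(controlled: bool,
                      failure_detection_interval_s: float,
                      device_count: int,
                      per_device_setup_s: float,
                      activation_s: float = 60.0) -> FailoverEstimate:
    """Rough estimate of the time until an application receives service again.

    per_device_setup_s is a hypothetical, application-specific figure and the
    linear scaling with device_count is an assumption made for illustration.
    """
    # For a controlled failover the detection time is almost 0.
    detection = 0.0 if controlled else failure_detection_interval_s
    reconnect = device_count * per_device_setup_s
    return FailoverEstimate(detection, activation_s, reconnect)
```

For example, estimate_failover(False, 10, 100, 0.05).total_s combines a 10 second detection interval, the 1 minute activation time, and 5 seconds of re-registration work.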
2.5 Limitations:
• GRHA is not supported on the Software-Only and Bundled AE Services offers.
• GRHA is not supported on System Platform if the AE Services VM is not protected using
MPHA technology for the AE Services 6.3.1 release. Note that the use of MPHA with GRHA is
optional for AE Services 6.3.3 and later releases.
• GRHA is supported only on IPv4 networks.
• GRHA does not protect against AE Services software failures.
• Fast Reboot HA (FRHA): when enabled for a VM, the VM is rebooted on the new server
every time a failover occurs, for both controlled and uncontrolled failovers.
• Live Migration HA (LMHA): when enabled for a VM, the VM is Live Migrated to the new
server when a controlled failover occurs. For uncontrolled failovers the VM will be
rebooted on the new server (previously standby).
• Machine Preserving HA (MPHA): when enabled for a VM, the VM is activated on the new
(previously standby) server for both controlled and uncontrolled failovers.
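The behavior of the three HA modes listed above can be summarized as a small lookup. This is an illustrative sketch only; the enum and function names are hypothetical and are not part of System Platform.

```python
from enum import Enum


class HaMode(Enum):
    FRHA = "Fast Reboot HA"
    LMHA = "Live Migration HA"
    MPHA = "Machine Preserving HA"


class Action(Enum):
    REBOOT = "VM is rebooted on the new server"
    LIVE_MIGRATE = "VM is live migrated to the new server"
    ACTIVATE = "VM is activated on the new server"


def vm_action(mode: HaMode, controlled: bool) -> Action:
    """Return what happens to a protected VM when a failover occurs."""
    if mode is HaMode.FRHA:
        # FRHA reboots the VM for both controlled and uncontrolled failovers.
        return Action.REBOOT
    if mode is HaMode.LMHA:
        # LMHA live migrates only on a controlled failover.
        return Action.LIVE_MIGRATE if controlled else Action.REBOOT
    # MPHA activates the VM on the new server in both cases.
    return Action.ACTIVATE
```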
This section focuses on the MPHA feature. For more information on LRHA configurations please
see “Installing and Configuring Avaya Aura™ System Platform” and “Administering
Avaya Aura™ System Platform” documents available at http://support.avaya.com.
In System Platform 6.2.1 and later, if MPHA is selected for a VM, the remaining VMs are
automatically set for LMHA. LMHA is only available in the context of a VM enabled with MPHA.
In this document the term “controlled” failover refers to a failover requested by either an
administrator or by software logic, when it detects degradation in the state of health of the
current active server. The term “uncontrolled” failover refers to a failover which occurs because
the current active server is not reachable from the current standby server.
Memory replication is based, at its core, on the Xen hypervisor and its live migration
technology. Disk replication is based on the open source application Distributed Replicated
Block Device (DRBD).
[Figure: Disk replication using DRBD over the crossover (X-over) link]
• If faulty hardware is detected by the State of Health daemon. The State of Health
daemon checks the health of the following hardware components:
– Fans on the motherboard
– Motherboard temperature sensors
– Motherboard voltage sensors
– Motherboard current sensors
– Presence of power source
– Health of hard drives
– RAID controller status
– RAID controller battery status
• The State of Health daemon also checks whether the “dom0” (aka host) root file system is
read-write enabled and whether “dom0” has at least 6% of free memory. A controlled failover
is initiated if the root file system becomes read-only or if the active dom0 has less than
6% of free memory.
• If “dom0” cannot reach an administered network destination (via an ICMP ping). The
default frequency and the network destination can be changed when HA is enabled. By
default it takes about 10 seconds to detect network failure. The 10 second delay in
detecting the network failure may cause sockets to drop or lose messages over the
network. Applications running on the guest VM are responsible for recovering from
such data loss.
• For controlled failover, CDOM and the Services VM (if configured) will be migrated to
the new active server.
• On a controlled failover, a memory protected VM will be available for service on the
new server within 500 milliseconds.
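The health conditions described above that lead to a controlled failover can be expressed as a simple check. This is a hedged sketch of the decision logic only: the Dom0Health structure and its field names are hypothetical, and the actual State of Health daemon is not documented here. Only the 6% free memory threshold comes from the text.

```python
from dataclasses import dataclass
from typing import List

MIN_FREE_MEMORY_FRACTION = 0.06  # dom0 must have at least 6% of free memory


@dataclass
class Dom0Health:
    hardware_ok: bool            # fans, sensors, power source, drives, RAID all healthy
    root_fs_read_write: bool     # dom0 root file system is read-write enabled
    free_memory_fraction: float  # free dom0 memory as a fraction between 0.0 and 1.0
    ping_target_reachable: bool  # administered network destination answers ICMP ping


def controlled_failover_reasons(health: Dom0Health) -> List[str]:
    """Return the reasons (possibly none) for requesting a controlled failover."""
    reasons = []
    if not health.hardware_ok:
        reasons.append("faulty hardware detected")
    if not health.root_fs_read_write:
        reasons.append("dom0 root file system is read-only")
    if health.free_memory_fraction < MIN_FREE_MEMORY_FRACTION:
        reasons.append("dom0 free memory below 6%")
    if not health.ping_target_reachable:
        reasons.append("administered network destination unreachable")
    return reasons
```

An empty list means the active server is considered healthy and no controlled failover is requested.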
Health objects monitored by the State of Health daemon are listed in the following table.
3.2.2 Almost seamless failover in the event of sudden failures for MPHA protected VMs
Uncontrolled failover can happen if the current active server fails suddenly or is not reachable
over the crossover link or via the switched IP network. The previously check-pointed (50-100ms
old) VM is then activated on the new active server. The VM may have to recover from its lost
state/sockets.
• Replication will disengage temporarily if replication interferes with the VM’s ability to
provide service or if the VM is used over its capacity.
• In some cases certain requests related to memory replication may time out, causing
replication to be abandoned temporarily. Memory replication will re-engage automatically
within 20 seconds. An error will be logged when this happens.
• If an uncontrolled failover occurs when replication is not fully engaged, the protected
VM will be rebooted on the new active server.
       Server1           Server2
PM1    PS1 (with UPS)    PS1 (with UPS)
PM2    PS2 (with UPS)    PS2 (with UPS)

       Server1           Server2
PM1    PS1 (with UPS)    PS1 (with UPS)
PM2    PS1 (with UPS)    PS1 (with UPS)

       Server1           Server2
PM1    PS1               PS1
PM2    PS2               PS2
PM1/PM2: Power module 1 and 2 respectively, representing the two power modules installed on each server.
PS1/PS2: Power source 1 and 2 respectively. Two power sources mean each power source is connected to a different power grid
or generator.
UPS: Uninterruptible power supply. The expectation is that the UPS kicks in when needed so that power to the
server is continuous, without a glitch. The UPS does not have to be provided on a per-server basis. A UPS servicing an entire
datacenter will suffice as long as it meets the “without a glitch” requirement.
Note that the above scenario could happen only in case of “Uncontrolled” failover. The
following are some of the conditions where an uncontrolled failover may occur:
• Clients and Communication Manager are ~100 milliseconds ahead of time with respect
to the newly active AE Services VM.
• If there was any TCP traffic during the last incomplete check point (~100 milliseconds)
the TCP sockets between the AE Services VM and its clients may drop.
• Clients may lose some events.
• Data written to the disk since the last checkpoint may be lost.
• To preserve the Transport link, Communication Manager must have CM 6.2 Service Pack
2 (or newer) installed.
• DMCC clients can re-connect to the AE Services VM and resume the session. DMCC
clients may lose events that were generated during failover. For first party call control,
Communication Manager will refresh the current state of the display and lamp state
associated with various buttons.
• For TR87 clients, the SIP dialogs can survive a socket drop. Therefore all associations
created will remain intact after failover. SIP dialogs that were in a transient state in the
past 100 milliseconds could be affected.
• If TSAPI, CVLAN and DLG client sockets drop, the client applications have to reestablish
all associations.
• In the future, the TSAPI and JTAPI SDKs will re-establish these sockets upon a socket
failure and preserve the previous session. It will also launch an audit and recover from
the lost state in the AE Services VM.
[Figure: AE client applications, AE Services server, Communication Manager (shown twice), and G650/G450 media gateway]
In addition to the AE Services server failover feature, the DMCC service provides recovery from
a software fault or a shutdown that does not allow the DMCC Java Virtual Machine (JVM)
process to exit normally. The DMCC Service Recovery feature is available on all AE Services
offers: Software-Only, Bundled, VMware and System Platform (SP). When the DMCC JVM
process is restarted after an abnormal exit, the DMCC service is initialized from persisted state
information on the hard disk. This persisted state information is saved during normal operation,
and represents the last known state of the DMCC service prior to a JVM abnormal exit. The
state information includes session, device, device/call monitoring and H.323 registration data.
Following the restart of the DMCC JVM, the persisted information is used to re-create the
sessions, device IDs, monitors and H.323 registrations that existed just prior to the software
fault. Note that only H.323 registrations that use the Time-To-Service feature on
Communication Manager can be recovered.
From a client application’s point of view, the DMCC recovery appears as a temporary network
interruption that requires the client to reestablish any disconnected sessions. When the client
application reestablishes the session, the DMCC service will send events to the client for any
resources that could not be recovered. These events may include “monitor stopped” and/or
“terminal unregistered” event messages, and will enable the client to determine what, if
anything, needs to be restored (using new service requests). Otherwise, the client application
may continue to operate as usual.
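The client-side handling described above, where the events received after reestablishing the session tell the application what needs to be restored, might be sketched as follows. This is not the DMCC SDK: events are modeled as plain (event type, device ID) tuples, and the string labels and function name are hypothetical stand-ins for the “monitor stopped” and “terminal unregistered” event messages.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical labels standing in for the DMCC event messages named in the text.
MONITOR_STOPPED = "monitor stopped"
TERMINAL_UNREGISTERED = "terminal unregistered"


def plan_restoration(events: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group device IDs by the new service request needed to restore them.

    Each event is an (event_type, device_id) pair received after the client
    has reestablished its session. Resources with no such event were
    recovered by the DMCC service and need no action.
    """
    plan: Dict[str, List[str]] = {"restart_monitor": [], "re_register": []}
    for event_type, device_id in events:
        if event_type == MONITOR_STOPPED:
            plan["restart_monitor"].append(device_id)
        elif event_type == TERMINAL_UNREGISTERED:
            plan["re_register"].append(device_id)
    return plan
```

If both lists come back empty, the application can continue to operate as usual.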
Note that an AE Services server failover or a restart of the DMCC JVM results in the teardown
and re-creation of the H.323 endpoint registrations, but is limited only to the AE Services
server. The Communication Manager is unaware that this is taking place, and sees the
endpoints as still registered. Be aware that this may have an effect on any calls in progress for
these endpoints. If the client application specified “server-media” dependency mode for a call,
then the call (and any recording on the call) will be terminated. Alternatively, if the client
application specified “client-media” dependency mode for the call, then the call should survive
the failover (or restart). However, it is possible that some state changes for the endpoint and
its call may have been missed by the AE Services server during the failover. For example, if the
far-end hangs up the call at the same moment that a failover occurs, then it is possible that any,
or all, of the resultant “HookswitchEvent”, “MediaStopEvent”, “DisplayUpdatedEvent” and
“LampModeEvent” messages could be lost.
5.3 Recommendations
The following items are recommended:
• Communication Manager should be configured for H.323 registration using the Time-
To-Service feature.
• AE Services 6.1 and later should use the PE interface – even in survivable server
environments. PE connections offer reduced complexity:
• A local HA cluster of AE Services 6.1 and later on System Platform or VMware servers is
used.
• An application that uses the Device, Media and Call Control (DMCC) service should keep
trying to reestablish the DMCC session when it loses its socket communication link to
the AE Services server because the DMCC runtime state is preserved during a failover.
This applies to all AE Services configurations.
• An application that uses the CallVisor Local Area Network (CVLAN), Definity Local Area
Network Gateway (DLG) or Telephony Service Application Programming Interface
(TSAPI) service should reestablish its socket connections and its monitors/associations
if it loses the socket connection to the service on the AE Services server because no
runtime state is preserved for these services. TSAPI applications also need to
reestablish route registrations.
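The recommendation above to keep trying to reestablish the session after a lost socket can be implemented with a retry loop. The following is a generic sketch, not Avaya SDK code: connect is a hypothetical caller-supplied function that opens the session and raises an OSError-derived exception while the AE Services server is unreachable, and the attempt count and delays are illustrative defaults.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def reconnect_with_retry(connect: Callable[[], T],
                         max_attempts: int = 10,
                         initial_delay_s: float = 1.0,
                         max_delay_s: float = 30.0,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds, doubling the wait between attempts."""
    delay = initial_delay_s
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError as exc:  # socket-level failures derive from OSError
            last_error = exc
            if attempt < max_attempts:
                sleep(delay)
                delay = min(delay * 2, max_delay_s)
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

For DMCC the application would then resume its preserved session. For TSAPI, CVLAN and DLG it would go on to reestablish monitors, associations and route registrations, since no runtime state is preserved for those services.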
Avaya recommends that all applications in a survivable server configuration connect to a local
AE Services 6.1 (or later) server that, in turn, is connected to either the media server at the
main site or a media gateway/survivable server at the remote site. In this configuration, the
applications and associated AE Services servers at the remote sites are always active and are
supplying functionality for the local resources at the remote site. This type of configuration
ensures the most seamless survivability in an enterprise survivable configuration.
Since the AE Services 6.1 and Communication Manager 6.0 release, switch connections on both
the Control Local Area Network interface cards (CLANs) and Processor Ethernet (PE)
connections are fully supported in all survivable server configurations. Additionally, any Device,
Media and Call Control (DMCC) endpoints registered to the primary switch using the Time-To-
Service feature (TTS) will automatically re-register to a survivable server. DMCC endpoint
registrations not using the Time-To-Service feature will be unregistered when the
Communication Manager fails over to a survivable server.
AE Services 6.1 and later allows the Communication Manager survivable server nodes to be
administered within a switch connection, in a priority order, along with the PE IP address of
those nodes. When used in conjunction with Communication Manager 6.0 or later, it provides
the means to deterministically control the AE Services server connectivity in failover situations.
For more information about Communication Manager’s survivability deployment, see: Avaya
Aura® Solution Deployment, available on the Avaya Support Web site,
http://support.avaya.com
Second, the port network controls to which Communication Manager node AE Services is
connected via a CLAN. In disaster recovery scenarios, AE Services has no control over which
Communication Manager node will provide service to that port network, and therefore has no
control over which Communication Manager node will provide service to AE Services (over that
CLAN AEP connection). In this case, Communication Manager’s administration controls under
what conditions and to which Communication Manager node the port network will connect.
Conversely, PE connections are made directly to Communication Manager nodes, and AE
Services can therefore directly control (via its own administration) from which Communication
Manager node it requests service. Depending on the system topology and disaster recovery
requirements, either type of connectivity can be successfully utilized.
1 A workaround is provided for AE Services 6.1 with switches that are not CM 6.0 or later when PE is used with a single
ESS or LSP node. See PSN 3156u - PE support for ESS and LSP scenarios for more details.
feature when connectivity is lost to the primary Communication Manager media server.
There is a slight possibility that an endpoint using the Time-To-Service feature will fail
to re-register, in which case an unregistered event for that endpoint will also be sent to
the client application.
At this point, the application should begin attempts to re-register the DMCC endpoints
(that failed the automatic re-registration) with the same IP address(es) it was using
before. Note that it takes a little over 3 minutes for the media gateway to connect to a
survivable server. For this reason, it is recommended that the application keep trying
to register with the same media gateway (through the AE Services server) for that
amount of time before it tries to register with a survivable remote server (if one exists).
When the media gateways connect with the survivable server, the registration
attempts will begin to succeed. After the application has successfully re-registered all
DMCC endpoints, it should reestablish its previous state and resume operation.
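The re-registration strategy described above (keep trying the same media gateway for a little over 3 minutes, then try a survivable remote server if one exists) might look like the sketch below. The names are hypothetical, try_register stands in for whatever registration call the application uses through the AE Services server, and the 200 second window and 10 second retry interval are assumptions chosen to approximate "a little over 3 minutes".

```python
import time
from typing import Callable, Optional

GATEWAY_FAILOVER_WINDOW_S = 200.0  # assumption: "a little over 3 minutes"


def register_endpoint(try_register: Callable[[str], bool],
                      primary_target: str,
                      fallback_target: Optional[str] = None,
                      window_s: float = GATEWAY_FAILOVER_WINDOW_S,
                      retry_interval_s: float = 10.0,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Return the target the endpoint registered with, or None on failure."""
    deadline = clock() + window_s
    while True:
        # Keep using the same IP address as before the outage.
        if try_register(primary_target):
            return primary_target
        if clock() >= deadline:
            break
        sleep(retry_interval_s)
    # Only after the window has elapsed, try the survivable remote server.
    if fallback_target is not None and try_register(fallback_target):
        return fallback_target
    return None
```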
b. CallInformation Services within DMCC, Call Control Services within DMCC, and all
other CTI services
The CallInformation and Call Control services within DMCC and all other CTI Services
(TSAPI, CVLAN, DLG and JTAPI) use the Transport (AEP) link to communicate with
Communication Manager. The transport links (Switch Connections) on each AE Services
server should be administered to communicate only with PEs for Communication
Manager media servers that are local to the AE Services server’s site. If the system is
configured in this fashion, the application/AE Services server will not have to take any
unusual action to recover in the event that a gateway loses connectivity to the primary
Communication Manager node and transitions to a survivable server.
If a media gateway loses connectivity to the primary server for an extended period of
time (configurable on Communication Manager), it will register with the local
survivable server. Within 5 seconds of that registration, that survivable server will
inform AE Services 6.1 (or later) that it has transitioned from idle to active. AE Services
6.1 (or later) will re-evaluate its current session. If this survivable server node has a
higher precedence than the current Communication Manager node in use (or if there is
no current session), it will be used. If an AE Services server changes Communication
Manager nodes, it will notify any connected applications of this event via a
LinkDownEvent, for DMCC CallInformationServices, or a CTI link down indication, for
CTI services. For Call Control Services within DMCC, Avaya recommends that
applications add a CallInformationListener and look for a LinkDownEvent for indication
that connectivity to the primary site is down. (In future releases, Call Control Services
clients will receive a MonitorStop request for all call control monitors if the link is lost
to the main site.) Depending on the CTI application programming interface (API),
clients will receive an appropriate event when the connectivity to the primary site is
down. CVLAN clients will receive an “abort” for each association. TSAPI clients will
receive a CSTAMonitorEnded event if the client is monitoring a device and/or a
CSTASysStatEvent with a link down indication if the client is monitoring system status.
TSAPI clients will also receive a CSTARouteEnd event for any active routing dialogs, and
a CSTARouteRegisterAbort event for any registered routing devices. Avaya JTAPI 5.2
and later clients will receive a PROVIDER_OUT_OF_SERVICE event if the client has
ProviderListeners. Otherwise, a ProvOutOfServiceEv event will be received if the client
has ProviderObservers. DLG clients will receive a link status event with a link down
indication and a cause value.
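The precedence rule described above, where a newly active survivable server is used only if it has higher precedence than the Communication Manager node in use or if there is no current session, can be sketched as follows. This illustrates the rule only and is not AE Services code. The function name is hypothetical, and the handling of nodes missing from the administered list is an assumption.

```python
from typing import Optional, Sequence


def evaluate_session(priority_order: Sequence[str],
                     current: Optional[str],
                     newly_active: str) -> Optional[str]:
    """Return the node to use after newly_active reports idle-to-active.

    priority_order lists the administered nodes, highest precedence first.
    """
    if newly_active not in priority_order:
        # Assumption: a node that is not administered is never selected.
        return current
    if current is None or current not in priority_order:
        # No current session (or an unknown node): use the newly active node.
        return newly_active
    if priority_order.index(newly_active) < priority_order.index(current):
        return newly_active
    return current
```

When the returned node differs from a previously used one, the application should expect the link down notification for its API, as described above.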
The AE Services server will then automatically notify the application that the CTI link is
back up, and the application can begin to resume normal operations. Since there is no
run-time state preserved on a transition from a primary Communication Manager
media server to a survivable server (as there is with an interchange on a duplicated
Communication Manager media server pair) all application state must be reestablished.
Note that, from the AE Services server’s and application’s perspectives, the failure
scenario and recovery actions appear exactly the same as a long network outage
between the AE Services server and the gateways.
connectivity is lost from the local media gateways (like G650) to the primary
Communication Manager media server. There is a slight possibility that an endpoint
using the Time-To-Service feature will fail to re-register, in which case an unregistered
event for that endpoint will also be sent to the client application.
At this point, the application should begin attempts to re-register the DMCC endpoints
(that failed the automatic re-registration) with the same IP address(es) it was using
before. Note that it takes a little over 3 minutes for the media gateway (like G650) to
connect to a survivable core server (ESS). For this reason, it is recommended that the
application keep trying to register with the same CLAN (through the AE Services server)
for that amount of time before it tries to register with a survivable remote server (if
one exists). When the media gateways (like G650) connect with the survivable core
server (ESS), the registration attempts will begin to succeed. After the application has
successfully re-registered all DMCC endpoints, it should reestablish its previous state
and resume operation.
b. CallInformation Services within DMCC, Call Control Services within DMCC, and all
other CTI services
The CallInformation and Call Control services within DMCC and all other CTI Services
(TSAPI, CVLAN, DLG and JTAPI) use the Transport (AEP) link to communicate with
Communication Manager. The transport links (Switch Connections) on each AE Services
server should be administered to communicate only with CLANs in the media gateways
that are local to the AE Services server’s site. If the system is configured in this fashion,
the application/AE Services server will not have to take any unusual action to recover in
the event that a gateway loses connectivity to the primary Communication Manager
media server and transitions to a survivable core server (ESS).
an “abort” for each association. TSAPI clients will receive a CSTAMonitorEnded event if
the client is monitoring a device and/or a CSTASysStatEvent with a link down indication
if the client is monitoring system status. TSAPI clients will also receive a CSTARouteEnd
event for any active routing dialogs, and a CSTARouteRegisterAbort event for any
registered routing devices. Avaya JTAPI 5.2 and later clients will receive a
PROVIDER_OUT_OF_SERVICE event if the client has ProviderListeners. Otherwise, a
ProvOutOfServiceEv event will be received if the client has ProviderObservers. DLG
clients will receive a link status event with a link down indication and a cause value.
The AE Services server will then automatically attempt to reestablish the AEP links.
Note that it takes a little over 3 minutes for the media gateway (like G650) to connect
to a survivable core server (ESS). Once the media gateway has registered with the
survivable core server (ESS), the AE Services server will succeed in establishing its AEP
links very soon thereafter (after around 30 seconds). As soon as an AEP link is
established, the application will be notified that the CTI link is back up, and the
application can begin to resume normal operations. Since there is no run-time state
preserved on a transition from a primary Communication Manager media server to a
survivable core server (as there is with an interchange on a duplicated Communication
Manager media server pair) all application state must be reestablished. Note that,
from the AE Services server’s and application’s perspectives, the failure scenario and
recovery actions appear exactly the same as a long network outage between the AE
Services server and the gateways.
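The recovery sequence above (link down, wait for the gateway and the AEP link, then rebuild all state) can be sketched from the application's side as follows. This is an illustrative model only: `CtiSession`, `on_link_down`, `on_link_up`, and `start_monitor` are hypothetical names, not TSAPI, JTAPI, or DLG calls, and the timing constants simply restate the approximate figures given in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

GATEWAY_TO_ESS_S = 180  # "a little over 3 minutes" in the text; rounded down here
AEP_LINK_SETUP_S = 30   # "around 30 seconds" in the text


def approx_recovery_seconds() -> int:
    """Rough lower estimate of the time before the CTI link is back up."""
    return GATEWAY_TO_ESS_S + AEP_LINK_SETUP_S


@dataclass
class CtiSession:
    """Tracks which device monitors the application wants and which are live."""
    wanted_monitors: Set[str] = field(default_factory=set)
    active_monitors: Set[str] = field(default_factory=set)
    link_up: bool = True

    def on_link_down(self) -> None:
        # No run-time state survives the transition to a survivable core
        # server, so every live monitor is treated as gone.
        self.link_up = False
        self.active_monitors.clear()

    def on_link_up(self, start_monitor: Callable[[str], bool]) -> List[str]:
        # Re-establish every monitor the application wants; return failures.
        self.link_up = True
        failed = []
        for device in sorted(self.wanted_monitors):
            if start_monitor(device):
                self.active_monitors.add(device)
            else:
                failed.append(device)
        return failed
```

Keeping the wanted state separate from the live state means the application can rebuild everything from `wanted_monitors` after any outage, whether it was caused by a transition to a survivable core server or by a long network interruption.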
communication paths (i.e., previously established communication path that has been lost as a
result of an outage).
[Figure: four sites connected by LAN/WAN and PSTN. Each site has AE client applications and an
AES server on an active/standby System Platform pair (labeled MPHA or FRHA). Site labels:
G650 media gateways with Communication Manager; G450 media gateway with Communication Manager
(LSP); and two sites with G650 media gateways and Communication Manager (ESS).]
Services server at the main site is connected via CLANs to Communication Manager. Satellite
site A (e.g., branch office) has a G450 media gateway with a survivable remote server (LSP).
The AE Services server at satellite site A is connected to both the primary Communication
Manager server and the survivable remote server (LSP) via PE connections. Remote site B has
an S8800 survivable core server (ESS) and G650 media gateways. The AE Services server at
remote site B is connected via CLANs to Communication Manager. Remote site C has an S8800
survivable core server (ESS) and G650 media gateways. The AE Services server at remote site C
is connected to the primary Communication Manager and the survivable core servers at remote
sites B and C via PE connections. Furthermore, all AE Services servers are configured to stay on
the survivable server as long as they are providing service.
Avaya recommends that all applications have a local AE Services server. In this configuration,
the applications and associated AE Services server at the remote sites are always active and are
supplying functionality for the local resources at the remote site. As described in this
document, this type of configuration ensures the most seamless service in a survivable
configuration.
[Figure: four sites connected by LAN/WAN and PSTN. Each site has AE client applications and an
AES server on an active/standby System Platform pair (labeled MPHA or FRHA). Site labels:
G650 media gateways with Communication Manager; G450 media gateway with Communication Manager
(LSP); and two sites with G650 media gateways and Communication Manager (ESS).]
recommended to configure the primary search list of the G450 media gateway such that it
contains CLANs (or PE) of only one site (i.e. headquarters in this case). The secondary search
list should contain the survivable remote server (LSP) at the local site. The AE Services server
will detect connectivity failure with the primary site (headquarters) and will automatically start
using the survivable remote server (LSP) to provide service.
The G650 media gateways at remote site B will connect to the local survivable core server (ESS)
in case of a WAN outage. The AE Services server at this site will automatically connect with the
survivable core server (ESS) through the G650 media gateways. All of this will be transparent to
the AE Services server and its applications except for what will appear to be a brief network
outage.
The G650 media gateways at remote site C will connect to the local survivable core server (ESS)
in case of a WAN outage. The AE Services server at this site will detect connectivity failure with
the primary site (headquarters) and will automatically start using the survivable core server
(ESS) to provide service.
The site at the headquarters will continue to function as it did previously in case of a WAN
outage.
Note: The remote sites and the headquarters site will not be able to access each other’s
resources during a WAN outage. Also, at each of the remote sites, this will be transparent to
the AE Services applications except for what will appear to be a brief network outage (described
in detail in section 6.4.4).
[Figure: four sites connected by LAN/WAN and PSTN. Each site has AE client applications and an
AES server on an active/standby System Platform pair (labeled MPHA or FRHA). Site labels:
G650 media gateways with Communication Manager; G450 media gateway with Communication Manager
(LSP); and two sites with G650 media gateways and Communication Manager (ESS).]
[Figure: four sites connected by LAN/WAN and PSTN. Each site has AE client applications and an
AES server on an active/standby System Platform pair (labeled MPHA or FRHA). Site labels:
G650 media gateways with Communication Manager; G450 media gateway with Communication Manager
(LSP); and two sites with G650 media gateways and Communication Manager (ESS).]
At this point, the application should begin attempts to re-register the DMCC endpoints
(that failed the automatic re-registration). After the application has successfully re-
registered all DMCC endpoints, it should reestablish its previous state and resume
operation.
b. CallInformation Services within DMCC, Call Control Services within DMCC, and all
other CTI services
The CallInformation and Call Control services within DMCC and all other CTI Services
(TSAPI, CVLAN, DLG and JTAPI) use the Transport (AEP) link to communicate with
Communication Manager.
be received if the client has ProviderObservers. DLG clients will receive a link status
event with a link down indication and a cause value.
Services server to be a geographically redundant standby server. In this case, the session will
become active as soon as the survivable server node becomes active.
Survivable servers are always ready to provide service, and wait for port networks or media
gateways to register to them. Within 5 seconds of that registration, that survivable server will
inform AE Services 6.1 (or later) that it has transitioned from idle to active, which allows AE
Services 6.1 (or later) to use that node if necessary. Likewise, as soon as the last port network
or media gateway unregisters, it will inform AE Services 6.1 (or later) that it has transitioned
back to idle, which will stop AE Services 6.1 (or later) from using it.[2]
If PE connectivity is used to connect to multiple pre-6.0 Communication Manager nodes, then
AE Services 6.1 (or later) cannot tell which nodes are actually active, and therefore it
may choose (based on the priority administered on the survivability hierarchy OAM screen) to
use an idle Communication Manager node, which would not be able to provide any reasonable
service to the end user applications. This is not an issue with CLAN connectivity. Since CLANs
reside in port networks, any Communication Manager node to which AE Services 6.1 (or later)
connects via CLAN is active by definition (since the port network must be registered to that
Communication Manager node to provide the connectivity).
[2] Communication Manager maintenance can take 30-60 seconds to decide that the last media gateway has
unregistered after it actually has unregistered, so AE Services will be notified within 5 seconds after that delay.
There is no such maintenance delay for port networks.
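The node-selection behavior described above can be illustrated with a small model. This is a sketch under stated assumptions, not the AE Services implementation: `CmNode`, `NodeState`, and `select_node` are hypothetical names, and the convention that a lower priority number is preferred is assumed for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class NodeState(Enum):
    ACTIVE = "active"
    IDLE = "idle"
    UNKNOWN = "unknown"  # node does not report its state (the pre-6.0 PE case)


@dataclass(frozen=True)
class CmNode:
    name: str
    priority: int  # assumed convention: lower number is preferred
    state: NodeState


def select_node(nodes: List[CmNode]) -> Optional[CmNode]:
    """Pick the highest-priority node that is not known to be idle."""
    candidates = [n for n in nodes if n.state is not NodeState.IDLE]
    return min(candidates, key=lambda n: n.priority, default=None)


# Nodes that report state: the idle main node is skipped.
reporting = [
    CmNode("main", 1, NodeState.IDLE),
    CmNode("ess-b", 2, NodeState.ACTIVE),
]
chosen_reporting = select_node(reporting)

# Nodes that do not report state: priority alone decides, so a node that is
# really idle can still be chosen.
legacy = [
    CmNode("main", 1, NodeState.UNKNOWN),
    CmNode("ess-b", 2, NodeState.UNKNOWN),
]
chosen_legacy = select_node(legacy)
```

In the second case the model has no way to exclude an idle node, which mirrors the limitation described for PE connectivity to nodes that cannot report an idle or active state.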
[3] Note: A Communication Manager node can be either a simplex media server or a duplicated media server pair.
Both servers on a duplicated system have the same cluster ID/MID.
CM Communication Manager
HA High Availability
MID Module ID
SP System Platform
TTS Time-To-Service