Alarm Management - Ammonia Plants Experience in Improving Alarm Systems With Focus On Process and Human Factors
Understanding and using the geometric relationship between an operating envelope and
its approximating hypercube eliminates many false alarms. This substantially improves the credibility of
the alarm system to the operator and allows earlier annunciation with more time for the operator to
respond. The new alarms give the operator earlier and positive warning of deviation from whatever
combination of business, environmental and process performance objectives has been chosen as the
operating window's objective, thus contributing to the economic performance of the plant and earning the
alarm system a share of the business case for further investment.
There are two major alarm systems in a process plant. The first is the Safety Alarm System responsible
for taking control and shutting down the process in extreme process excursions which both the process
control system and the operator have been unable to prevent. Its role is to prevent an extreme excursion
from turning into a disaster with liabilities and costs that can run into hundreds of millions of dollars. Its
costs are viewed as an insurance premium against a disaster that most plants will never experience.
The second is the Operator Alarm system intended to draw the process operator’s attention to a situation
beyond the capability of the process control system to prevent and requiring application of the operator’s
considerably greater human intelligence to resolve and correct before the safety system intervenes and
shuts down the plant. Automatic plant shutdowns are expensive in lost production time and possible
consequential plant damage. Operator alarms give the operator time to intervene and correct the situation
to avoid a shutdown.
Most plants accept that ‘Normal’ operation refers to the Operating Envelope within which desired
economic results are achieved, as illustrated in Figure 1, and place the operator alarm limits where they
imagine the boundary to be.
2. The economic cost of violating an alarm limit is the delta cost between the material produced and operating
costs of desired and undesired operation.
Figure 1. Operator alarm limits at the boundary of where the process normally operates
But the first practical problem behind the advice to ‘put the operator alarms on the boundary of where the
plant normally operates’ is that there has been no way to determine the location of that boundary when
the operating objective is to meet all KPIs, including those that cannot be measured in real time, at all
times.
The consequence of this is that Figure 2 is a representation of alarm limits as they really are in practice.
Some alarm limits are set in the orange recovery space where they will, at best, annunciate late, giving
the process disturbance more time to grow and requiring a larger corrective action, or in many cases are
set so wide that they can never annunciate. Other alarm limits are set inside the green ‘normal operation’
space where they will annunciate unnecessarily some of the time creating false alarms and leading to
their being labelled as ‘bad actors’.
Without knowledge of the location of the boundary, or of how alarms relate to each other, there is little
that can be done to cure a bad actor other than to push the alarm limit ‘outwards’ towards or past the
guessed position of the boundary.
Figure 2. Operator alarm limits as they usually are since the boundary of normal operation is unknown
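The mismatch between an operating envelope and independent high/low limits can be illustrated with a toy two-variable case. The sketch below is not the authors' method: the diagonal-band envelope, the two sets of limits and all function names are assumptions chosen only to show how a box of per-variable limits either misses real deviations (limits set wide) or raises false alarms (limits pushed inside normal operation).

```python
def in_envelope(x, y):
    """Hypothetical 'normal operation' region: two correlated variables in a diagonal band."""
    return 0.0 <= x <= 10.0 and abs(y - x) <= 1.0


def in_box(x, y, x_lim, y_lim):
    """True when both variables sit inside their independent low/high alarm limits."""
    return x_lim[0] <= x <= x_lim[1] and y_lim[0] <= y <= y_lim[1]


# Assumed limits: the smallest box that contains the band, and a tighter box inside it.
BOUNDING_BOX = ((0.0, 10.0), (-1.0, 11.0))
TIGHT_BOX = ((2.0, 8.0), (2.0, 8.0))


def classify(x, y, box):
    """Compare what the box limits annunciate with what the envelope says is normal."""
    normal = in_envelope(x, y)
    alarmed = not in_box(x, y, *box)
    if alarmed and normal:
        return "false alarm"
    if not alarmed and not normal:
        return "missed deviation"
    return "true alarm" if alarmed else "quiet"


# A corner of the bounding box is far from the band, yet no limit is violated.
print(classify(9.0, 1.0, BOUNDING_BOX))
# Tightening the box catches that point but now alarms on normal operation.
print(classify(9.0, 1.0, TIGHT_BOX))
print(classify(9.5, 9.5, TIGHT_BOX))
```

With more than two variables the same effect grows, because the corners of the approximating hypercube occupy an increasing share of its volume.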
Effective alarm management is particularly important in the process industries given the potential Major
Accident Hazard (MAH) implications of their operations. In the UK, onshore high hazard sites are
regulated by the Health and Safety Executive (HSE) under the COMAH (Control of Major Accident
Hazard) Regulations (HSE, 2006). A core requirement of these regulations is for operators to submit a
Safety Report which demonstrates to the regulator that their activities are, as far as is reasonably
practicable, safe and that MAH events are suitably controlled. One aspect of this is the need to
demonstrate that alarm systems have been both properly conceived during plant design, and are subject
to ongoing management and review to ensure that alarms continue to support safe and reliable
operations.
There are two aspects to this: firstly, duty holders must ensure that their alarm system is safe, and that it
offers reliable protection against MAH events. Secondly, they must provide evidence to the regulator that
the system is designed in accordance with best practice and that there is a verification process in place
which ensures that the system fully supports effective operator response. This includes providing a
demonstration that best practice standards for alarm design are being applied on site.
Improving alarm systems – the challenge
It is essential that high hazard sites operate with confidence that, when a high-criticality alarm arises,
those charged with the task of responding to the alarm can indeed do so. This confidence is particularly
important at times of high workload or elevated alarm levels, for example during a serious plant upset.
Failure to ascertain whether a reliable operator response is probable undermines the foundations upon
which the entire alarm system is based.
However, providing this verification can be difficult. Firstly, modern process plants are complex, with
distributed systems to maintain process control across extensive networks. Secondly, the number of
variables associated with the effective design and presentation of an alarm can be significant. In short,
there are many alarms to assess and many factors to consider for each alarm.
Many MAH sites with complex alarm systems utilise alarm management software as part of their
assurance strategy. Such software provides data for alarm system performance which can be used to
judge the overall adequacy of the system (for example average alarm rate, number of alarms following an
upset, number and distribution of alarms by priority). This information is important from the perspective of
performance monitoring and for developing alarm rationalisation strategies to reduce alarm load and
improve system performance. Alarm metrics can also be interrogated at a deeper level to examine, for
example, response times to particular alarms. Alarm management software is therefore often viewed as
an important tool in the quest to improve alarm systems.
However, such software often provides little insight into how the operator interacts with the DCS to
respond to an alarm and whether, and where, the operator encounters any difficulty in doing so. With the
exception of drawing conclusions about the overall alarm load, such software rarely provides much
analysis regarding which specific features of the alarm system present problems to the operator and the
aspects of system design that need to be addressed to improve alarm reliability.
Therefore, in the context of achieving reliable verification that operators will respond to an alarm, the
limitations of tools that measure overall alarm load as the sole means to achieve this should be
recognised.
Where sites utilise this method as the only means of alarm system analysis it could be argued that the
reliability of response, at times of highest need during a serious plant upset, may often be based upon
little more than assumption.
Given the complexity of the task facing many MAH operators, a pragmatic solution is therefore required to
provide the verification which they, and the Regulator, require: that their alarm system is safe and that a
reliable response to the most critical process alarms is possible.
For example, in the EEMUA 191 guide, specific information relating to individual alarm design is often incorporated
within wider guidance relating to organisational arrangements to support alarm systems. Moreover,
information relating to how alarms should be presented to facilitate prompt and effective identification by
operators is distributed throughout the document, rather than being collated in one discrete, easy-to-
interpret section.
Unless significant time is spent reviewing the guidance, it may be difficult to identify the key information
against which alarms should be assessed to determine that a specific alarm adheres to the various
requirements of the guidance. The extensive nature of EEMUA 191 means that this approach, when
coupled with the number of potential alarms to be reviewed at any given site, may appear an
overwhelming challenge.
An alternative approach is to carry out full task and failure analyses of the highest criticality alarms in the
context of response tasks (see, for example, Energy Institute, 2011). While this would represent a
thorough approach it may present its own challenges. For example, whilst such analyses should give a
fully-rounded analysis of the task in the operating context, these analyses can be complex and potentially
time consuming, and will often require external HF support. In addition, whilst such approaches provide
an excellent framework for identifying potential failures for the full range of different task types, they do
not necessarily provide specific support for assessing the cognitive aspects of alarm response (e.g.
diagnosing the causes of alarms and deciding upon appropriate responses). Finally, EEMUA 191 outlines
a substantial number of specific design expectations and it is uncertain whether a traditional failure
analysis approach would reliably identify all of these factors.
The Alarm Review Tool, or ART, provides a means for MAH operators to reliably and rapidly analyse
critical alarms and their associated management systems against the alarm system design principles
described in EEMUA 191. It distils the key guidance from EEMUA 191 into related sections, meaning that
the user can be confident that they have considered all of the relevant information for a specific alarm
from the guidance without having to hunt through the document.
The process has been designed to provide a comprehensive analysis of the alarm system, and currently
comprises four core elements:
Critical alarm screening: This is a facility for alarm filtering to determine whether alarms which are
currently assigned highest criticality within the system justify that categorisation. This screening
helps, in the first instance, identify alarms which have been wrongly prioritised. This ensures that
time spent analysing alarms is initially focused on those alarms which are most important. Such
high level screening can also assist with rationalisation by identifying alarms which are not truly
critical.
Individual alarm review: This element facilitates a quick but thorough review of individual safety-
critical alarms against the usability principles outlined in EEMUA 191. This examines all HF aspects
of alarm response from signal presentation, availability of DCS information for diagnosis, to
execution of response. This depth of analysis provides the necessary verification that alarm design
is optimised.
Alarm management system review: This is an in-depth assessment of the management system
which supports the alarm system. It examines the adequacy of organisational arrangements for the
ongoing maintenance, development and review of the alarm system. It is envisaged that this review
would take place periodically – for example by undertaking an initial management system review
then possibly only re-reviewing at a later date if significant organisational changes have occurred
which affect the management of the alarm system.
Alarm performance metrics: This provides a facility for recording and trending alarm metrics in
relation to ongoing rationalisation carried out via the alarm review tool. It charts alarm system
improvements in relation to any changes made to problem alarms.
The analysis can be completed as a paper analysis. However, a software tool has also been developed to
speed up the assessment process and facilitate the aggregation of multiple analyses. This is still in
development; however, screenshots from a prototype of this software are included in this article to
illustrate the process.
Figure 3. Example statements in the ‘Maintain Salience’ phase of the critical alarm review process
Figure 4. Example summary report for one critical alarm
The ease with which alarms can be added carries the risk of alarm overload, and the ease with which
alarms can be modified or suppressed can, in the absence of proper change-management protocols, lead
to serious degradation of an alarm system’s reliability.
With plant safety, environmental safety, regulatory compliance and bottom-line success depending on
effective alarming, it has become clear to industry that proper alarm management is an essential part of
overall best practices. [Ref 3]
In the earlier days of pneumatic controls, adding a new control room alarm was quite involved and
required significant effort. A present-day DCS makes it very easy to add new alarms to process variables.
However, once an alarm has been added, change management procedures can make its removal quite
involved. Additionally, almost inevitably, a significant number of plant incident investigation reports will
recommend adding new alarms.
Alarm Management has been defined in literature as the “Process by which alarms are engineered,
monitored, and managed to ensure safe, reliable operations”. A key misconception is that Alarm
Management is only about reducing the number of alarms. The objective is rather to improve the quality
of alarm rates during both normal and abnormal operations.
Recognizing that an Alarm Management program would improve operator workload, improve plant
reliability and avoid unplanned outages as well as avoid possible safety and environmental incidents,
CNC & N2000 embarked on such a project.
Project Execution
The objective of the project was to develop an alarm management philosophy and a
rationalized alarm system in alignment with the principles outlined in the EEMUA3 Standard (Publication
191:1999) and the industry’s best practices.
The hardware used allowed data to be collected from the DCS and securely broadcast through a
dedicated server on the business LAN. The software selected had two key elements.
Data Collection – This application collects and stores alarm and event history for long term archive and
data analysis. The software’s analysis and monitoring capabilities instantly identify problematic alarms
and help to immediately reduce operator alerts and resulting alarms. Operators can also select the real-
time viewer alarms to view additional documentation on the alarms. This documentation provides the
operator with guidance on resolving the abnormal situation. Operators can access and verify information
regarding the cause of each alarm, its priority, the appropriate response and the consequences of not
responding.
Management of Change – This application serves as the master alarm database during the rationalization
process. After the alarm rationalization process is completed, it then becomes the means of managing
change to any and all alarm configurations. The master alarm database is designed to record and log all
alarm changes: what changes were made and when, the new configuration settings (and the old ones),
who made the changes, the reasons behind the changes and, finally, how each change was authorized.
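The fields that the master alarm database is said to log can be pictured as a simple change record. The following is a minimal sketch only: the class names, field names and the example tag `PT-101` are hypothetical and do not describe the schema of the software used on the project.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AlarmChange:
    """One logged change to an alarm configuration (hypothetical field names)."""
    tag: str              # alarm identifier in the DCS
    setting: str          # what was changed, e.g. "high limit" or "priority"
    old_value: str        # configuration before the change
    new_value: str        # configuration after the change
    changed_by: str       # who made the change
    reason: str           # why the change was made
    authorized_by: str    # how the change was authorized
    changed_at: datetime  # when the change was made


class MasterAlarmLog:
    """Append-only log that refuses changes lacking a reason or an authorization."""

    def __init__(self):
        self._changes = []

    def record(self, change):
        if not change.reason or not change.authorized_by:
            raise ValueError("a change needs both a reason and an authorization")
        self._changes.append(change)

    def history(self, tag):
        """All recorded changes for one alarm, in the order they were logged."""
        return [c for c in self._changes if c.tag == tag]


log = MasterAlarmLog()
log.record(AlarmChange(
    tag="PT-101", setting="high limit", old_value="12.0", new_value="12.5",
    changed_by="process engineer", reason="nuisance alarm during start-up",
    authorized_by="MOC approval", changed_at=datetime(2010, 1, 15, 9, 30),
))
print(len(log.history("PT-101")))
```

Rejecting incomplete records at the point of entry is one simple way to keep the audit trail usable; a real system would also control who may write to the log.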
Phase 2 – Alarm Philosophy
The second phase was to review the facility’s existing alarm philosophy and make the necessary
recommendations and subsequent changes to improve it. However, no written alarm philosophy
document could be located, so a new alarm philosophy document had to be developed.
This document establishes rules for configuring the DCS to help improve the alarm system.
The purpose is to reduce the number of alarms, eliminate redundant and nuisance alarms, and properly
prioritize alarms.
This alarm philosophy document allows CNC/N2000 to have a set of guidelines to assess the need for
alarms and the corrective response required by the operators. Through the use of this document, the
following objectives are expected to be achieved:
The fourth phase was implementation of the new system with the requisite training of personnel to
manage and manipulate the system as required.
Phases 3 and 4 were essentially combined: the selected vendor demonstrated the alarm
rationalization process to a team of personnel from CNC & N2000 comprising Process & Electrical
Engineers as well as Senior Operators. The specialist contractor makes quarterly visits to the plants to
review progress and observe the sessions to maintain effectiveness.
The DCS architecture is such that the plant is subdivided into approximately 26 “areas”.
Teams were then set up and a schedule was developed for the rationalization exercises, with a target
completion within 12 months.
On completion of each area rationalization, an MOC is done and circulated for management approval.
Before implementation, the Operations team members who were part of the rationalization exercises have
to visit all shifts and review each MOC with them in detail for maximum understanding and buy-in.
Examples are taken from the CNC Plant over a four-month period, November 2009 to February 2010. It
should be noted that the plant was shut down for a turnaround and returned to production on 30 October
2009. This means that in early November, there would have been a transient period of plant optimization.
Additionally, the plant had to be shut down on 22 November 2009 for another issue. This means that of
the four-month period chosen, November 2009 represented an abnormal month, with December 2009 and
January and February 2010 representing steady state months. For this reason, where applicable, key
performance indicators (KPIs) are shown separately for the two different states within the chosen period.
Average number of alarms per hour
This is the ratio of the total number of alarms annunciated to an operator during the analysis period to the
total number of hours during the same period. It is a measure of an average level of load imposed on the
operator by the alarm system. The target by best practices is less than 12 alarms per hour.
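As a rough illustration of this ratio, the sketch below counts alarm timestamps in an analysis period and divides by the period length in hours. The function name and the sample data are assumptions made for the example; they are not taken from the plant's alarm history.

```python
from datetime import datetime, timedelta


def average_alarms_per_hour(alarm_times, period_start, period_end):
    """Alarms annunciated in [period_start, period_end) divided by the period length in hours."""
    if period_end <= period_start:
        raise ValueError("period_end must be after period_start")
    hours = (period_end - period_start).total_seconds() / 3600.0
    count = sum(1 for t in alarm_times if period_start <= t < period_end)
    return count / hours


# Hypothetical data: one alarm every 10 minutes over a 6-hour period (36 alarms).
start = datetime(2010, 1, 1, 0, 0)
end = start + timedelta(hours=6)
alarms = [start + timedelta(minutes=10 * i) for i in range(36)]

rate = average_alarms_per_hour(alarms, start, end)
print(f"{rate:.1f} alarms/hour, within the 12 per hour target: {rate < 12}")
```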
Referring to Figure 5, it can be seen that the November events roughly doubled the average when
compared to the steady state period. What can be seen here is that there is room for improvement by
reducing the average through alarm rationalization exercises.
Figure 5. Alarm Distribution Over Time
The yellow zone on Figure 6 illustrates CNC’s target zone for alarm system performance.
The target dot is the EEMUA 191 best-practice performance level. The yellow zone is based on Industrial
experience and EEMUA 191 figures 42 & 454.
Figure 6. Alarm Performance Assessment Chart
These KPIs would be evaluated after the results of each rationalized DCS area have been implemented
to ascertain that the KPIs are approaching the target criteria.
Maximum number of alarms per hour
This is the measure of the worst-case load on an operator during any ten-minute time period.
The target by best practices is less than 15 alarms per 10 minutes or 90 alarms per hour.
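One way to find the worst-case ten-minute load is a sliding window over the sorted alarm timestamps. This is a sketch under assumptions: the function name and the sample burst are hypothetical, and alarm management software may instead count alarms in fixed ten-minute bins.

```python
from datetime import datetime, timedelta


def peak_alarms_in_window(alarm_times, window=timedelta(minutes=10)):
    """Largest number of alarms that fall within any interval shorter than `window`."""
    times = sorted(alarm_times)
    peak = 0
    left = 0
    for right, t in enumerate(times):
        # Drop alarms from the left until the span from times[left] to t is under the window.
        while t - times[left] >= window:
            left += 1
        peak = max(peak, right - left + 1)
    return peak


# Hypothetical data: a burst of 20 alarms 15 seconds apart, plus one alarm per hour afterwards.
start = datetime(2009, 11, 22, 8, 0)
burst = [start + timedelta(seconds=15 * i) for i in range(20)]
background = [start + timedelta(hours=k) for k in range(1, 6)]

peak = peak_alarms_in_window(burst + background)
print(f"peak of {peak} alarms in 10 minutes, within the target of fewer than 15: {peak < 15}")
```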
Referring to Figure 5, it is clear that the peak hourly alarm rate is significantly greater than target and
significant benefit can be gained through alarm rationalization.
Percentage of hours with alarms greater than 30
This is a measure of the proportion of time the alarm system was in an upset state. Such a
performance indicator judges the reasonable level of manageability of the alarm system (1 alarm per 2
minutes). The target by best practices is less than 2% of operating time.
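This indicator can be computed by counting alarms in each clock hour of the analysis period and reporting the share of hours whose count exceeds 30. The sketch below assumes hourly bins measured from the start of the period; the function name and data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def percent_hours_above(alarm_times, period_start, n_hours, threshold=30):
    """Percentage of the n_hours hourly bins containing more than `threshold` alarms."""
    if n_hours <= 0:
        raise ValueError("n_hours must be positive")
    counts = Counter()
    for t in alarm_times:
        idx = int((t - period_start).total_seconds() // 3600)
        if 0 <= idx < n_hours:  # ignore alarms outside the analysis period
            counts[idx] += 1
    upset_hours = sum(1 for c in counts.values() if c > threshold)
    return 100.0 * upset_hours / n_hours


# Hypothetical data: 100 hours with one alarm each, plus 40 extra alarms during hour 5.
start = datetime(2010, 2, 1, 0, 0)
alarms = [start + timedelta(hours=h, minutes=30) for h in range(100)]
alarms += [start + timedelta(hours=5, minutes=m) for m in range(40)]

pct = percent_hours_above(alarms, start, n_hours=100)
print(f"{pct:.1f}% of hours above 30 alarms, within the 2% target: {pct < 2}")
```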
The Alarm Management project has been implemented and rationalization of the different DCS areas is in
progress on a phased basis.
Industry guidelines place the CNC alarm system in what is defined as a “stable” mode of operation and
the plant is working to further improve the performance of the system.