Relationship Between Availability and Reliability
Availability is defined as the probability that the system is operating properly when it is
requested for use. In other words, availability is the probability that a system is not failed or
undergoing a repair action when it needs to be used. At first glance, it might seem that if a
system has a high availability then it should also have a high reliability. However, this is not
necessarily the case. This article will explore the relationship between availability and
reliability and will also present some of the specified classifications of availability.
As stated earlier, availability represents the probability that the system is capable of
conducting its required function when it is called upon given that it is not failed or
undergoing a repair action. Therefore, not only is availability a function of reliability, but it is
also a function of maintainability. Table 1 below displays the relationship between reliability,
maintainability and availability. Please note that in this table, an increase in maintainability
implies a decrease in the time it takes to perform maintenance actions.
As you can see from the table, if the reliability is held constant, even at a high value, this
does not directly imply a high availability. As the time to repair increases, the availability
decreases. Even a system with a low reliability could have a high availability if the time to
repair is short.
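As a minimal illustration of this trade-off (a sketch with assumed values, not taken from the source), the steady-state availability of a repairable system can be approximated as MTBF / (MTBF + MTTR):

```python
# A minimal sketch (values assumed, not from the source) comparing two systems:
# one fails rarely but repairs slowly, the other fails often but repairs quickly.
def steady_state_availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)

print(f"High reliability, slow repair: {steady_state_availability(1000.0, 100.0):.3f}")  # ~0.909
print(f"Low reliability, fast repair:  {steady_state_availability(10.0, 0.1):.3f}")      # ~0.990
```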
Availability Classifications
The definition of availability is somewhat flexible, depending on what types of downtimes
are considered in the analysis. As a result, there are a number of different classifications of
availability. In BlockSim, the following availabilities can be obtained directly from simulation
or can be indirectly calculated with values returned from analysis:
Point Availability
Point, or instantaneous, availability is the probability that a system (or component) will be
operational at any random time, t. This is very similar to the reliability function in that it
gives a probability that a system will function at the given time, t. Unlike reliability, the
instantaneous availability measure incorporates maintainability information. At any given
time t, the system will be operational if one of the following conditions is met: either it has functioned properly from time 0 to t without failure, or it failed at some earlier point but was restored and has functioned properly since the most recent repair, completed at some time u with 0 < u < t.
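Expressed in the usual renewal-theory form (stated here for reference; this is the standard formulation rather than a quotation from the source), the point availability is

A(t) = R(t) + \int_0^t R(t - u)\, m(u)\, du

where R(t) is the reliability function and m(u) is the renewal density, i.e., the rate of repair completions at time u.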
Mean Availability
The mean availability is the proportion of time during a mission or time period that the
system is available for use. It represents the mean value of the instantaneous availability
function over the period (0, T):

\bar{A}(T) = \frac{1}{T} \int_0^T A(t)\, dt
Operational Availability
Operational availability is a measure of availability that includes all experienced sources of
downtime, such as administrative downtime, logistic downtime, etc. The equation for
operational availability is:

A_O = \frac{\text{Uptime}}{\text{Operating Cycle}}    (1)
where the operating cycle is the overall time period of operation being investigated and
uptime is the total time the system was functioning during the operating cycle.
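For example (illustrative figures, not from the source), a system that accumulates 950 hours of uptime over a 1,000-hour operating cycle has

A_O = \frac{950}{1000} = 0.95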
When there is no logistic downtime or preventive maintenance specified, Eqn. (1) returns
the mean availability of the system. The system's availability measure returned in BlockSim
approaches the operational availability as more sources of downtime are specified, such as
crew logistic downtime, spares logistic downtime, restock logistic downtime, etc. In all other
cases, the availability measure is the mean availability. A separate availability measure, the
point availability, is also returned by BlockSim.
Note that the operational availability is the availability that the customer actually
experiences. It is essentially the a posteriori availability based on actual events that
happened to the system. The previous availability definitions are a priori estimations based
on models of the system failure and downtime distributions. In many cases, operational
availability cannot be controlled by the manufacturer due to variation in location, resources
and other factors that are the sole province of the end user of the product.
Reliability, Availability, and Maintainability
Definition: Reliability, Availability, and Maintainability (RAM or RMA) are system design attributes that have
significant impacts on the sustainment or total Life Cycle Costs (LCC) of a developed system. Additionally, the RAM
attributes impact the ability to perform the intended mission and affect overall mission success. The standard
definition of Reliability is the probability of zero failures over a defined time interval (or mission), whereas
Availability is defined as the percentage of time a system is considered ready to use when tasked. Maintainability is
a measure of the ease and rapidity with which a system or equipment can be restored to operational status following
a failure.
MITRE SE Roles & Expectations: MITRE systems engineers (SEs) are expected to understand the
purpose and role of Reliability, Availability, and Maintainability (RAM) in the acquisition
process, where it occurs in systems development, and the benefits of employing it. MITRE
systems engineers are also expected to understand and recommend when RAM is
appropriate to a situation and if the process can be tailored to meet program needs. They
are expected to understand the technical requirements for RAM as well as strategies and
processes that encourage and facilitate active participation of end users and other
stakeholders in the RAM process. They are expected to monitor and evaluate contractor
RAM technical efforts and the acquisition program's overall RAM processes and recommend
changes when warranted.
Background
Reliability is the wellspring for the other RAM system attributes of Availability and
Maintainability. Reliability was first practiced in the early start-up days for the National
Aeronautics and Space Administration (NASA) when Robert Lusser, working with Dr.
Wernher von Braun's rocketry program, developed what is known as "Lusser's Law" [1].
Lusser's Law states that the reliability of any system is equal to the product of the
reliability of its components, or the so-called weakest link concept.
The term "reliability" is often used as an overarching concept that includes availability and
maintainability. Reliability in its purest form is more concerned with the probability of a
failure occurring over a specified time interval, whereas availability is a measure of
something being in a state (mission capable) ready to be tasked (i.e., available).
Maintainability is the parameter concerned with how the system in use can be restored after
a failure, while also considering concepts like preventive maintenance and Built-In-Test
(BIT), required maintainer skill level, and support equipment. When dealing with the
availability requirement, the maintainability requirement must also be invoked as some level
of repair and restoration to a mission-capable state must be included. One can see how
logistics and logistic support strategies would also be closely related and be dependent
variables at play in the availability requirement. This would take the form of sparing
strategies, maintainer training, maintenance manuals, and identification of required support
equipment. The linkage of RAM requirements and the dependencies associated with logistics
support illustrates how the RAM requirements have a direct impact on sustainment and
overall LCC. In simple terms, RAM requirements are considered the upper level overarching
requirements that are specified at the overall system level. It is often necessary to
decompose these upper level requirements into lower level design-related quantitative
requirements such as Mean Time Between Failure/Critical Failure (MTBF or MTBCF) and
Mean Time To Repair (MTTR). These lower level requirements are specified at the system
level; however, they can be allocated to subsystems and assemblies. The most common
allocation is made to the Line Replaceable Unit (LRU), which is the lowest level of repair at
the field (often called organic) level of maintenance.
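As a hedged illustration of this decomposition (the figures are assumed, not taken from the text), suppose the system-level requirement is an inherent availability of 0.95 and the design target for MTTR is 2 hours. The steady-state relationship A_i = \mathrm{MTBF}/(\mathrm{MTBF} + \mathrm{MTTR}) then implies

\mathrm{MTBF} \ge \frac{A_i \cdot \mathrm{MTTR}}{1 - A_i} = \frac{0.95 \times 2\ \text{h}}{0.05} = 38\ \text{h}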
Much of this discussion has focused on hardware, but the complex systems used today are
integrated solutions consisting of hardware and software. Because software performance
affects the system RAM performance requirements, software must be addressed in the
overall RAM requirements for the system. The wear or accumulated stress mechanisms that
characterize hardware failures do not cause software failures. Instead, software exhibits
behaviors that operators perceive as a failure. It is critical that users, program offices, the
test community, and contractors agree early on what constitutes a software failure. For
example, software "malfunctions" are often recoverable with a reboot, and the time allowed
for a reboot may be bounded before a software failure is declared. Another issue to consider
is the frequency of occurrence: even when a reboot recovers the software within the defined
time window, repeated reboots indicate poor stability. User perception of what
constitutes a software failure will surely be influenced by both the need to reboot and the
frequency of "glitches" in the operating software.
A novel,
quantitative software readiness criteria model [2] has recently been developed to support
objective and effective decision making at product shipment. The model has been
"socialized" in various forums and is being introduced to MITRE work programs for
consideration and use on contractor software development processes for assessing
maturity. Using this approach with development test data, the growth or maturity of a
software system can be measured along the following five dimensions:
1. Software Functionality
2. Operational Quality
3. Known Remaining Defects (defect density)
4. Testing Scope and Stability
5. Reliability
As evidenced above, the strongest and most recent government support for increased focus
on reliability comes from the DoD, which now requires most programs to integrate reliability
engineering with the systems engineering process and to institute reliability growth as part
of the design and development phase [4]. The scope of reliability involvement is further
expanded by directing that reliability be addressed during the Analysis of Alternatives (AoA)
process to map reliability impacts to system LCC outcomes [5]. The strongest policy
directives have come from the Chairman of the Joint Chiefs of Staff (CJCS) where a RAM-
related sustainment Key Performance Parameter (KPP) and supporting Key System
Attributes (KSAs) have been mandated for most DoD programs [6]. Elevation of these RAM
requirements to a KPP and supporting KSAs will bring greater focus and oversight, with
programs that fail to meet these requirements subject to reassessment, reevaluation, and
program modification.
Consistent RAM requirements. The upper level RAM requirements should be consistent with
the lower level RAM input variables, which are typically design related and called out in
technical and performance specifications. A review of user requirements and flow down of
requirements to a contractual specification document released with a Request For Proposal
(RFP) package must be completed. If requirements are inconsistent or unrealistic, the
program is placed at risk for RAM performance before contract award.
Ensure persistent, active engagement of all stakeholders. RAM is not a stand-alone specialty
called on to answer the mail in a crisis, but rather a key participant in the acquisition
process. The RAM discipline should be involved early in the trade studies where
performance, cost, and RAM should be part of any trade-space activity. The RAM SME needs
to be part of requirements development with the user, drawing on a defined Concept of
Operations (CONOPS) to establish realistic RAM goals for the program. The
RAM SME must be a core member of several Integrated Product Teams (IPTs) during
system design and development to establish insight and a collaborative relationship with the
contractor team(s): RAM IPT, Systems Engineering IPT, and Logistics Support IPT.
Additionally, the RAM specialty should be part of the test and evaluation IPT to address RAM
test strategies (Reliability Growth, Qualification tests, Environmental testing, BIT testing,
and Maintainability Demonstrations) while interfacing with the contractor test teams and the
government operational test community.
Remember that RAM is a risk reduction activity. RAM activities and engineering processes are
risk mitigation measures used to ensure that performance needs are achieved for mission
success and that the LCC are bounded and predictable. A system that performs as required
can be employed per the CONOPS, and sustainment costs can be budgeted with a low risk
of cost overruns. Establish reliability Technical Performance Measures (TPMs) that are
reported on during Program Management Reviews (PMRs) throughout the design,
development, and test phases of the program, and use these TPMs to manage risk and
mitigation activities.
Institute the Reliability Program Plan. The Reliability (or RAM) Program Plan (RAMPP) is
used to define the scope of RAM processes and activities to be used during the program. A
program office RAMPP can be developed to help guide the contractor RAM process. The
program-level RAMPP will form the basis for the detailed contractor RAMPP, which ties RAM
activities and deliverables to the Integrated Master Schedule (IMS).
Employ reliability prediction and modeling. Use reliability prediction and modeling to assess
the risk in meeting RAM requirements early in the program when a hardware/software
architecture is formulated. Augment and refine the model later in the acquisition cycle, with
design and test data during those program phases.
Reliability testing. Be creative and use any test phase to gather data on reliability
performance. Ensure that the contractor has planned for a Failure Review Board (FRB) and
uses a robust Failure Reporting And Corrective Action System (FRACAS). When planning a
reliability growth test, realize that the actual calendar time will be 50–100% more than the
actual test time to allow for root cause analysis and corrective action on discovered failure
modes.
Don't forget the maintainability part of RAM. Use maintainability analysis to assess the
design for ease of maintenance, and collaborate with Human Factors Engineering (HFE)
SMEs to assess impacts to maintainers. Engage with the Integrated Logistics Support (ILS)
IPT to help craft the maintenance strategy, and discuss levels of repair and sparing. Look
for opportunities to gather maintainability and testability data during all test phases. Look at
Fault Detection and Fault Isolation (FD/FI) coverage and impact on repair time lines. Also
consider and address software maintenance activity in the field as patches, upgrades, and
new software revisions are deployed. Be aware that the ability to maintain the software
depends on the maintainer's software and IT skill set and on the capability built into the
maintenance facility for software performance monitoring tools. A complete maintenance
picture includes defining scheduled maintenance tasks (preventive maintenance) and
assessing impacts to system availability.
Reliability, Availability, and Maintainability
From SEBoK
Reliability, availability, and maintainability (RAM) are three system attributes that are of
tremendous interest to systems engineers, logisticians, and users. Collectively, they affect
economic life-cycle costs of a system and its utility.
This article focuses primarily on the reliability of physical system elements. Software
reliability is a separate discipline. Readers interested in software reliability should refer to
the IEEE Std 1633 (IEEE 2008).
Contents
· 1 Probability Models for Populations
· 2 Data Issues
· 3 Design Issues
· 4 Post-Production Management Systems
· 5 Models
· 6 System Metrics
· 7 System Models
· 8 Software Tools
· 9 References
o 9.1 Works Cited
o 9.2 Primary References
o 9.3 Additional References
Probability Models for Populations
Let T be a random time to failure. Reliability can be thought of as the complement of the
cumulative distribution function (CDF) for T for a given set of environmental conditions e:

R(t \mid e) = 1 - F(t \mid e) = \Pr(T > t \mid e)
Maintainability is defined as the probability that a system or system element can be
repaired in a defined environment within a specified period of time. Increased
maintainability implies shorter repair times (ASQ 2011).
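In the same spirit as the reliability definition, maintainability is conventionally written as the distribution of the repair time T_r (a standard formulation, stated here for reference rather than quoted from ASQ 2011):

M(t) = \Pr(T_r \le t)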
Maintainability models present some interesting challenges. The time to repair an item is
the sum of the time required for evacuation, diagnosis, assembly of resources (parts, bays,
tools, and mechanics), repair, inspection, and return. Administrative delay (such as holidays)
can also affect repair times. Often these sub-processes have a minimum time to complete
that is not zero, resulting in the distribution used to model maintainability having a
threshold parameter.
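A minimal sketch of such a threshold model, with assumed parameter values (a fixed minimum repair time plus lognormal variation):

```python
import numpy as np

# A minimal sketch (assumed parameter values) of a repair-time model with a threshold:
# a fixed minimum time to evacuate, diagnose, and stage resources, plus lognormal variation.
rng = np.random.default_rng(42)
threshold_h = 0.5                                               # assumed minimum repair time, hours
variable_h = rng.lognormal(mean=0.0, sigma=0.6, size=100_000)   # assumed variable portion, hours

repair_times = threshold_h + variable_h
print(f"Mean time to repair: {repair_times.mean():.2f} h")
print(f"Share of repairs finished within 1 h: {(repair_times <= 1.0).mean():.3f}")
```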
Data Issues
True RAM models for a system are generally never known. Data on a given system is
assumed or collected, used to select a distribution for a model, and then used to fit the
parameters of the distribution. This process differs significantly from the one usually taught
in an introductory statistics course.
First, the normal distribution is seldom used as a life distribution, since it assigns
probability to negative times, which cannot occur for a lifetime. Second, and more
importantly, reliability data is different from classic
experimental data. Reliability data is often censored, biased, observational, and missing
information about covariates such as environmental conditions. Data from testing is often
expensive, resulting in small sample sizes. These problems with reliability data require
sophisticated strategies and processes to mitigate them.
One consequence of these issues is that estimates based on limited data can be very
imprecise.
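A short sketch of this imprecision, assuming an exponential life model and a failure-truncated test with invented data:

```python
from scipy.stats import chi2

# A hedged sketch: a two-sided confidence interval for MTBF, assuming an exponential
# life model and a failure-truncated test. The test data below are invented for illustration.
total_time_h = 5_000.0   # cumulative unit-hours on test
failures = 3             # observed failures
alpha = 0.10             # 90% confidence

mtbf_point = total_time_h / failures
lower = 2 * total_time_h / chi2.ppf(1 - alpha / 2, 2 * failures)
upper = 2 * total_time_h / chi2.ppf(alpha / 2, 2 * failures)
print(f"MTBF point estimate: {mtbf_point:.0f} h, 90% CI: ({lower:.0f} h, {upper:.0f} h)")
# Roughly (794 h, 6114 h): a very wide interval from only three failures.
```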
Design Issues
System requirements should include specifications for reliability, maintainability, and
availability, and each should be conditioned on the projected operating environments.
If a proposed design does not meet the preliminary RAM specifications, it can be adjusted.
Critical failures are mitigated so that the overall risk is reduced to acceptable levels. This
can be done in several ways:
1. Fault tolerance is a strategy that seeks to make the system robust against the failure of
a component. This can be done by introducing redundancy. Redundant units can
operate in a stand-by mode. A second tolerance strategy is to have the redundant
components share the load, so that even if one or more of them fail the system
continues to operate. There are modeling issues associated with redundancy, including
switching between components, warm-up, and increased failure rates for surviving units
under increased load when another load-sharing unit fails. Redundancy can be an
expensive strategy as there are cost, weight, volume, and power penalties associated
with stand-by components.
2. Fault avoidance seeks to improve individual components so that they are more reliable.
This can also be an expensive strategy, but it avoids the switching issues, power,
weight, and volume penalties associated with using redundant components.
3. A third strategy is to repair or replace a component following a preventive maintenance
schedule. This requires the assumption that the repair returns the component to “good
as new” status, or possibly to an earlier age-equivalent. These assumptions can cause
difficulties; for example, an oil change on a vehicle does not return the engine to ‘good
as new’ status. Scheduled replacement can return a unit to good as new, but at the cost
of wasting potential life for the replaced unit. As a result, the selection of a replacement
period is a non-linear optimization problem that minimizes total expected life-cycle costs.
These costs are the sum of the expected costs of planned and unplanned maintenance
actions.
4. A fourth strategy is to control the environment so that a system is not operated under
conditions that accelerate the aging of its components.
Any or all of the above strategies (fault tolerance, fault avoidance, preventive maintenance,
and environmental control) may be applied to improve the designed reliability of a system.
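As a minimal sketch of the fault-tolerance strategy above, the calculation below compares a single unit with a 1-out-of-2 active-redundant pair, assuming independent, identical units with exponential lives and illustrative parameter values.

```python
import math

# A minimal sketch of fault tolerance through redundancy: 1-out-of-2 active redundancy,
# assuming independent, identical units with exponential lives (values are illustrative).
mission_h = 100.0
mtbf_h = 400.0

r_single = math.exp(-mission_h / mtbf_h)      # a single unit survives the mission (~0.779)
r_parallel = 1 - (1 - r_single) ** 2          # at least one of two units survives (~0.951)

print(f"Single unit:       R = {r_single:.3f}")
print(f"1-out-of-2 active: R = {r_parallel:.3f}")
```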
During design, development, and fielding, failures and the corrective actions taken against
them should be tracked systematically. One such tracking system is generically known as a
FRACAS (Failure Reporting and Corrective Action System). Such a system captures data on
failures and improvements to correct failures. This database is separate from a warranty
database, which is typically run
by the financial function of an organization and tracks costs only.
A FRACAS for an organization is a system, and itself should be designed following systems
engineering principles. In particular, a FRACAS system supports later analyses, and those
analyses impose data requirements. Unfortunately, the lack of careful consideration of the
backward flow from decision to analysis to model to required data too often leads to
inadequate data collection systems and missing essential information. Proper prior planning
prevents this poor performance.
Of particular importance is a plan to track data on units that have not failed. Units whose
precise times of failure are unknown are referred to as censored units. Inexperienced
analysts frequently do not know how to analyze censored data, and they omit the censored
units as a result. This can bias an analysis.
An organization should have an integrated data system that allows reliability data to be
considered with logistical data, such as parts, personnel, tools, bays, transportation and
evacuation, queues, and costs, allowing a total awareness of the interplay of logistical and
RAM issues. These issues in turn must be integrated with management and operational
systems to allow the organization to reap the benefits that can occur from complete
situational awareness with respect to RAM.
Models
There are a wide range of models that estimate and predict reliability (Meeker and Escobar
1998). Simple models, such as the exponential distribution, can be useful for 'back of the
envelope’ calculations.
There are more sophisticated probability models used for life data analysis. These are best
characterized by their failure rate behavior, which is defined as the probability that a unit
fails in the next small interval of time, given it has lived until the beginning of the interval,
divided by the length of the interval.
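In symbols, this failure rate (hazard function) is conventionally written, assuming a density f and reliability function R, as

h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t < T \le t + \Delta t \mid T > t)}{\Delta t} = \frac{f(t)}{R(t)}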
Models can be considered for a fixed environmental condition. They can also be extended to
include the effect of environmental conditions on system life. Such extended models can in
turn be used for accelerated life testing (ALT), where a system is deliberately and carefully
overstressed to induce failures more quickly. The data is then extrapolated to usual use
conditions. This is often the only way to obtain estimates of the life of highly reliable
products in a reasonable amount of time (Nelson 1990).
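One widely used model of this kind (offered here purely as an illustration, not necessarily the form intended by the text) is the Arrhenius relationship for temperature-accelerated testing, which gives the acceleration factor between a use temperature T_u and a stress temperature T_s (both in kelvin):

AF = \exp\!\left[\frac{E_a}{k}\left(\frac{1}{T_u} - \frac{1}{T_s}\right)\right]

where E_a is the activation energy and k is Boltzmann's constant.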
Also useful are degradation models, where some characteristic of the system is
associated with the propensity of the unit to fail (Nelson 1990). As that characteristic
degrades, we can estimate times of failure before they occur.
The initial developmental units of a system often do not meet their RAM specifications.
Reliability growth models allow estimation of resources (particularly testing time)
necessary before a system will mature to meet those goals (Meeker and Escobar 1998).
Maintainability models describe the time necessary to return a failed repairable system to
service. They are usually the sum of a set of models describing different aspects of the
maintenance process (e.g., diagnosis, repair, inspection, reporting, and evacuation). These
models often have threshold parameters, which are minimum times until an event can
occur.
Logistical support models attempt to describe flows through a logistics system and
quantify the interaction between maintenance activities and the resources available to
support those activities. Queue delays, in particular, are a major source of down time for a
repairable system. A logistical support model allows one to explore the trade space between
resources and availability.
All these models are abstractions of reality and are, at best, approximations to reality. To the
extent they provide useful insights, they are still very valuable. The more complicated the
model, the more data necessary to estimate it precisely. The greater the extrapolation
required for a prediction, the greater the imprecision.
Extrapolation is often unavoidable, because high reliability equipment typically can have
long life and the amount of time required to observe failures may exceed test times. This
requires that strong assumptions be made about future life (such as the absence of masked
failure modes), and these assumptions increase the uncertainty of predictions. The
uncertainty introduced by strong model assumptions is often not quantified and presents an
unavoidable risk to the system engineer.
System Metrics
Probabilistic metrics describe system performance for RAM. Quantiles, means, and modes of
the distributions used to model RAM are also useful.
Availability has some additional definitions, characterizing what downtime is counted against
a system. For inherent availability, only downtime associated with corrective maintenance
counts against the system. For achieved availability, downtime associated with both
corrective and preventive maintenance counts against a system. Finally, operational
availability counts all sources of downtime, including logistical and administrative, against
a system.
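In steady state these three measures are commonly summarized as follows (standard textbook forms, stated for reference rather than quoted from the source):

A_i = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}, \qquad A_a = \frac{\mathrm{MTBM}}{\mathrm{MTBM} + \bar{M}}, \qquad A_o = \frac{\mathrm{MTBM}}{\mathrm{MTBM} + \mathrm{MDT}}

where MTBM is the mean time between maintenance actions (corrective and preventive), \bar{M} is the mean active maintenance time, and MDT is the mean downtime, including logistical and administrative delays.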
Criticality is the product of a component’s reliability, the consequences of a component
failure, and the frequency with which a component failure results in a system failure.
Criticality is a guide to prioritizing reliability improvement efforts.
Many of these metrics cannot be calculated directly because the integrals involved are
intractable. They are usually estimated using simulation.
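A minimal Monte Carlo sketch of this kind of estimation, with assumed failure and repair distributions and parameter values:

```python
import numpy as np

# A hedged sketch: estimating mean availability over a mission by simulating an
# alternating renewal process (exponential failures, lognormal repairs with a threshold).
# All parameter values are assumed for illustration.
rng = np.random.default_rng(0)
mtbf_h, mission_h, n_runs = 250.0, 1_000.0, 5_000
uptime = np.zeros(n_runs)

for i in range(n_runs):
    t = 0.0
    while t < mission_h:
        ttf = rng.exponential(mtbf_h)              # time to the next failure
        uptime[i] += min(ttf, mission_h - t)       # credit only uptime inside the mission
        t += ttf
        if t >= mission_h:
            break
        t += 0.5 + rng.lognormal(0.0, 0.6)         # downtime: 0.5 h threshold plus repair

print(f"Estimated mean availability: {uptime.mean() / mission_h:.3f}")
```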
System Models
There are many ways to characterize the reliability of a system, including fault trees,
reliability block diagrams (RBDs), and failure mode effects analysis.
Figure 1. Fault Tree. (SEBoK Original)
RBDs are often nested, with one RBD serving as a component in a higher level model. These
hierarchical models allow the analyst to have the appropriate resolution of detail while still
permitting abstraction.
RBDs depict paths that lead to success, while fault trees depict paths that lead to failure.
Figure 2. Simple Reliability Block Diagram. (SEBoK Original)
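To make the RBD arithmetic concrete, here is a minimal sketch; the series-parallel structure and block reliabilities are assumed for illustration and are not those of Figure 2.

```python
# A minimal sketch of the arithmetic behind an RBD: two blocks in series feeding a
# 1-out-of-2 parallel pair. Structure and block reliabilities are assumed for
# illustration and are not taken from Figure 2.
def series(*rs: float) -> float:
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(*rs: float) -> float:
    q = 1.0
    for r in rs:
        q *= 1.0 - r
    return 1.0 - q

r_system = series(0.99, 0.95, parallel(0.90, 0.90))
print(f"System reliability: {r_system:.4f}")   # 0.99 * 0.95 * 0.99 = 0.9311
```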
A Failure Mode Effects Analysis is a table that lists the possible failure modes for a
system, their likelihood, and the effects of the failure. A Failure Modes Effects Criticality
Analysis scores the effects by the magnitude of the product of the consequence and
likelihood, allowing ranking of the severity of failure modes (Kececioglu 1991).
System models require even more data to fit them well. “Garbage in, garbage out” (GIGO)
particularly applies in the case of system models.
Software Tools
The specialized analyses required for RAM drive the need for specialized software. While
general purpose statistical languages or spreadsheets can, with sufficient effort, be used for
reliability analysis, almost every serious practitioner uses specialized software.
Minitab (versions 13 and later) includes functions for life data analysis. Win Smith is a
specialized package that fits reliability models to life data and can be extended for reliability
growth analysis and other analyses. Relex has an extensive historical database of
component reliability data and is useful for estimating system reliability in the design phase.
There is also a suite of products from ReliaSoft (2007) that is useful in specialized analyses.
Weibull++ fits life models to life data. ALTA fits accelerated life models to accelerated life
test data. BlockSim models system reliability, given component data.
Difference between Availability and Reliability
People often confuse reliability and availability. Simply put, availability is a measure of the percentage of time
the equipment is in an operable state, while reliability is a measure of how long the item performs its
intended function. We can refine these definitions by considering the desired performance standards.
Availability is an Operations parameter as, presumably, if the equipment is available 85% of the time,
we are producing at 85% of the equipment’s technical limit. This usually equates to the financial
performance of the asset. Of course, quality and machine speed also need to be considered in order to have
a proper representation of how close we are to this technical limit; the combined measure is called Overall
Equipment Effectiveness (OEE).
Reliability is a measure of the probability that an item will perform its intended function for a specified
interval under stated conditions. There are two commonly used measures of reliability:
* Mean Time Between Failure (MTBF), which is defined as: total time in service / number of failures
* Failure Rate (λ), which is defined as: number of failures / total time in service.
A piece of equipment can be available but not reliable. For example, suppose the machine is down 6 minutes
every hour. This translates into an availability of 90% but a reliability (MTBF) of less than 1 hour. That may be
okay in some circumstances, but what if this is a paper machine? It will take at least 30 minutes of run
time to get to the point that we are producing good paper.
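A brief sketch of the arithmetic in this example (the 30-minute stabilization figure is taken from the text; the rest is straightforward arithmetic):

```python
# A hedged sketch of the paper-machine example: one 6-minute stop in every hour.
downtime_min = 6.0
uptime_min = 60.0 - downtime_min

availability = uptime_min / 60.0    # 0.90 -> the "90% available" figure in the text
mtbf_hours = uptime_min / 60.0      # one failure per ~54 minutes of run time, i.e., MTBF < 1 h

# With ~30 minutes of run time needed to stabilize, only about 54 - 30 = 24 minutes
# of each hour would yield good paper, despite the 90% availability.
print(f"Availability: {availability:.0%}, MTBF: {mtbf_hours:.2f} h")
```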
Generally speaking, a reliable machine has high availability, but an available machine may or may not
be very reliable.