Relationship Between Availability and Reliability


Availability is defined as the probability that the system is operating properly when it is
requested for use. In other words, availability is the probability that a system is not failed or
undergoing a repair action when it needs to be used. At first glance, it might seem that if a
system has a high availability then it should also have a high reliability. However, this is not
necessarily the case. This article will explore the relationship between availability and
reliability and will also present some of the specified classifications of availability.

Availability and Reliability


Reliability represents the probability that components, parts and systems will perform their required functions for a desired period of time without failure, in specified environments and with a desired confidence. Reliability, in itself, does not account for any repair actions that may take place. It describes how long a component, part or system can be expected to operate before failing; it does not reflect how long it will take to get the failed unit back into working condition.

As stated earlier, availability represents the probability that the system is capable of
conducting its required function when it is called upon given that it is not failed or
undergoing a repair action. Therefore, not only is availability a function of reliability, but it is
also a function of maintainability. Table 1 below displays the relationship between reliability,
maintainability and availability. Please note that in this table, an increase in maintainability
implies a decrease in the time it takes to perform maintenance actions.

Table 1: Relationship between reliability, maintainability and availability.

As you can see from the table, if the reliability is held constant, even at a high value, this
does not directly imply a high availability. As the time to repair increases, the availability
decreases. Even a system with a low reliability could have a high availability if the time to
repair is short.
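To make this concrete, the steady-state relationship is often written as A = MTBF / (MTBF + MTTR), where MTBF reflects reliability and MTTR reflects maintainability. The short Python sketch below (hypothetical numbers, not taken from Table 1) shows how a less reliable unit with fast repairs can be more available than a highly reliable unit with slow repairs.

# Steady-state availability: A = MTBF / (MTBF + MTTR)
# The numbers below are hypothetical, for illustration only.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Lower reliability, fast repair
print(availability(mtbf_hours=100, mttr_hours=0.5))   # ~0.995

# Higher reliability, slow repair
print(availability(mtbf_hours=1000, mttr_hours=50))   # ~0.952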

Availability Classifications
The definition of availability is somewhat flexible, depending on what types of downtimes
are considered in the analysis. As a result, there are a number of different classifications of
availability. In BlockSim, the following availabilities can be obtained directly from simulation
or can be indirectly calculated with values returned from analysis:

· Point (instantaneous) availability
· Average up-time availability (mean availability)
· Steady state availability
· Operational availability

Point Availability
Point, or instantaneous, availability is the probability that a system (or component) will be
operational at any random time, t. This is very similar to the reliability function in that it
gives a probability that a system will function at the given time, t. Unlike reliability, the
instantaneous availability measure incorporates maintainability information. At any given
time t, the system will be operational if the following conditions are met:

1. It functioned properly during time t with probability R(t), or,
2. It functioned properly since the last repair at time u, 0 < u < t, with probability:

$\int_0^t R(t-u)\, m(u)\, du$

with m(u) being the renewal density function of the system.

The point availability is the summation of these two probabilities, or:

$A(t) = R(t) + \int_0^t R(t-u)\, m(u)\, du$
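As a hedged illustration of the point availability concept, the Python sketch below evaluates A(t) for the special case of exponential failure (rate λ) and exponential repair (rate μ), for which the renewal argument reduces to the well-known closed form A(t) = μ/(λ+μ) + λ/(λ+μ)·exp(−(λ+μ)t); the rate values are hypothetical.

import math

def point_availability_exponential(t, failure_rate, repair_rate):
    # Closed-form A(t) for exponential failure and repair times:
    # A(t) = mu/(lam + mu) + lam/(lam + mu) * exp(-(lam + mu) * t)
    lam, mu = failure_rate, repair_rate
    return mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)

lam = 0.01   # failures per hour (MTBF = 100 h), hypothetical
mu = 0.5     # repairs per hour (MTTR = 2 h), hypothetical
for t in (0, 10, 50, 200):
    print(f"A({t:>3}) = {point_availability_exponential(t, lam, mu):.4f}")
# A(t) starts at 1.0 and decays toward the steady-state value mu/(lam + mu), about 0.9804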

Mean Availability
The mean availability is the proportion of time during a mission or time period that the system is available for use. It represents the mean value of the instantaneous availability function over the period (0, T):

$\bar{A}(T) = \frac{1}{T}\int_0^T A(t)\, dt$
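As a small numerical sketch (reusing the same exponential special case and hypothetical rates as above), the mean availability over (0, T) can be approximated by averaging the point availability with a simple trapezoidal rule:

import math

def point_availability(t, lam=0.01, mu=0.5):
    # Exponential-failure/exponential-repair special case, illustrative parameters only.
    return mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)

def mean_availability(T, n=10_000):
    # Approximate (1/T) * integral of A(t) from 0 to T with the trapezoidal rule.
    dt = T / n
    total = 0.5 * (point_availability(0.0) + point_availability(T))
    total += sum(point_availability(i * dt) for i in range(1, n))
    return total * dt / T

print(mean_availability(100.0))   # mean availability over a 100-hour mission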

Steady State Availability


The steady state availability of the system is the limit of the instantaneous availability function as time approaches infinity. The instantaneous availability function approaches the steady state value very closely at a time approximately equal to four times the MTBF:

$A(\infty) = \lim_{t \to \infty} A(t)$

Operational Availability
Operational availability is a measure of availability that includes all experienced sources of
downtime, such as administrative downtime, logistic downtime, etc. The equation for
operational availability is:

$A_o = \dfrac{\text{Uptime}}{\text{Operating Cycle}} \qquad (1)$

where the operating cycle is the overall time period of operation being investigated and
uptime is the total time the system was functioning during the operating cycle.
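A minimal arithmetic sketch of Eqn. (1), using hypothetical times for a one-month operating cycle:

# Operational availability: A_o = uptime / operating cycle (all values hypothetical)
uptime_h = 650.0
corrective_maintenance_h = 30.0
logistic_delay_h = 25.0          # waiting for spares, crew, transport, etc.
administrative_delay_h = 15.0

operating_cycle_h = (uptime_h + corrective_maintenance_h
                     + logistic_delay_h + administrative_delay_h)
A_o = uptime_h / operating_cycle_h
print(f"Operational availability = {A_o:.3f}")   # ~0.903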

When there is no logistic downtime or preventive maintenance specified, Eqn. (1) returns
the mean availability of the system. The system's availability measure returned in BlockSim
approaches the operational availability as more sources of downtime are specified, such as
crew logistic downtime, spares logistic downtime, restock logistic downtime, etc. In all other
cases, the availability measure is the mean availability. A separate availability measure, the
point availability, is also returned by BlockSim.

Note that the operational availability is the availability that the customer actually
experiences. It is essentially the a posteriori availability based on actual events that
happened to the system. The previous availability definitions are a priori estimations based
on models of the system failure and downtime distributions. In many cases, operational
availability cannot be controlled by the manufacturer due to variation in location, resources
and other factors that are the sole province of the end user of the product.

Reliability, Availability, and Maintainability
Definition: Reliability, Availability, and Maintainability (RAM or RMA) are system design attributes that have
significant impacts on the sustainment or total Life Cycle Costs (LCC) of a developed system. Additionally, the RAM
attributes impact the ability to perform the intended mission and affect overall mission success. The standard
definition of Reliability is the probability of zero failures over a defined time interval (or mission), whereas
Availability is defined as the percentage of time a system is considered ready to use when tasked. Maintainability is
a measure of the ease and rapidity with which a system or equipment can be restored to operational status following
a failure.

Keywords: availability, maintainability, RAM, reliability, RMA

MITRE SE Roles & Expectations: MITRE systems engineers (SEs) are expected to understand the
purpose and role of Reliability, Availability, and Maintainability (RAM) in the acquisition
process, where it occurs in systems development, and the benefits of employing it. MITRE
systems engineers are also expected to understand and recommend when RAM is
appropriate to a situation and if the process can be tailored to meet program needs. They
are expected to understand the technical requirements for RAM as well as strategies and
processes that encourage and facilitate active participation of end users and other
stakeholders in the RAM process. They are expected to monitor and evaluate contractor
RAM technical efforts and the acquisition program's overall RAM processes and recommend
changes when warranted.

Background
Reliability is the wellspring for the other RAM system attributes of Availability and
Maintainability. Reliability was first practiced in the early start-up days for the National
Aeronautics and Space Administration (NASA) when Robert Lusser, working with Dr.
Wernher von Braun's rocketry program, developed what is known as "Lusser's Law" [1].
Lusser's Law states that the reliability of any system is equal to the product of the
reliability of its components, or the so-called weakest link concept.

The term "reliability" is often used as an overarching concept that includes availability and
maintainability. Reliability in its purest form is more concerned with the probability of a
failure occurring over a specified time interval, whereas availability is a measure of
something being in a state (mission capable) ready to be tasked (i.e., available).
Maintainability is the parameter concerned with how the system in use can be restored after
a failure, while also considering concepts like preventive maintenance and Built-In-Test
(BIT), required maintainer skill level, and support equipment. When dealing with the
availability requirement, the maintainability requirement must also be invoked as some level
of repair and restoration to a mission-capable state must be included. One can see how
logistics and logistic support strategies would also be closely related and be dependent
variables at play in the availability requirement. This would take the form of sparing
strategies, maintainer training, maintenance manuals, and identification of required support
equipment. The linkage of RAM requirements and the dependencies associated with logistics
support illustrates how the RAM requirements have a direct impact on sustainment and
overall LCC. In simple terms, RAM requirements are considered the upper level overarching
requirements that are specified at the overall system level. It is often necessary to
decompose these upper level requirements into lower level design-related quantitative
requirements such as Mean Time Between Failure/Critical Failure (MTBF or MTBCF) and
Mean Time To Repair (MTTR). These lower level requirements are specified at the system
level; however, they can be allocated to subsystems and assemblies. The most common
allocation is made to the Line Replaceable Unit (LRU), the unit at the lowest level of repair at the field (often called organic) level of maintenance.

Much of this discussion has focused on hardware, but the complex systems used today are
integrated solutions consisting of hardware and software. Because software performance
affects the system RAM performance requirements, software must be addressed in the
overall RAM requirements for the system. The wear or accumulated stress mechanisms that
characterize hardware failures do not cause software failures. Instead, software exhibits
behaviors that operators perceive as failures. It is critical that users, program offices, the test community, and contractors agree early on what constitutes a software failure. For example, software "malfunctions" are often recoverable with a reboot, and the time for reboot may be bounded before a software failure is declared. Another issue to consider is the frequency of occurrence, even if the software recovers within the defined reboot window, because frequency gives an indication of the software's stability. User perception of what
constitutes a software failure will surely be influenced by both the need to reboot and the
frequency of "glitches" in the operating software.

One approach to assessing software "fitness" is to use a comprehensive model to determine


the current readiness of the software (at shipment) to meet customer requirements. Such a
model needs to address quantitative parameters (not just process elements). In addition,
the method should organize and streamline existing quality and reliability data into a simple
metric and visualization that are applicable across products and releases. A novel,
quantitative software readiness criteria model [2] has recently been developed to support
objective and effective decision making at product shipment. The model has been
"socialized" in various forums and is being introduced to MITRE work programs for
consideration and use on contractor software development processes for assessing
maturity. The model offers:

· An easy-to-understand composite index
· The ability to set quantitative "pass" criteria from product requirements
· Easy calculation from existing data
· A meaningful, insightful visualization
· Release-to-release comparisons
· Product-to-product comparisons
· A complete solution, incorporating almost all aspects of software development activities

Using this approach with development test data can measure the growth or maturity of a
software system along the following five dimensions:

1. Software Functionality
2. Operational Quality
3. Known Remaining Defects (defect density)
4. Testing Scope and Stability
5. Reliability

For greater detail, see ref. [2].

Government Interest and Use


Many U.S. government acquisition programs have recently put greater emphasis on
reliability. The Defense Science Board (DSB) performed a study on Developmental Test and
Evaluation (DT&E) in May 2008 and published findings [3] that linked test suitability failures
to a lack of a disciplined systems engineering approach that included reliability engineering.
The Department of Defense (DoD) has been the initial proponent of systematic policy
changes to address these findings, but similar emphasis has been seen in the Department of
Homeland Security (DHS) as many government agencies leverage DoD policies and
processes in the execution of their acquisition programs.

As evidenced above, the strongest and most recent government support for increased focus
on reliability comes from the DoD, which now requires most programs to integrate reliability
engineering with the systems engineering process and to institute reliability growth as part
of the design and development phase [4]. The scope of reliability involvement is further
expanded by directing that reliability be addressed during the Analysis of Alternatives (AoA)
process to map reliability impacts to system LCC outcomes [5]. The strongest policy
directives have come from the Chairman of the Joint Chiefs of Staff (CJCS) where a RAM-
related sustainment Key Performance Parameter (KPP) and supporting Key System
Attributes (KSAs) have been mandated for most DoD programs [6]. Elevation of these RAM
requirements to a KPP and supporting KSAs will bring greater focus and oversight, with
programs not meeting these requirements prone to reassessment and reevaluation and
program modification.

Best Practices and Lessons Learned [7] [8]


Subject Matter Expertise matters. Acquisition program offices that employ RAM Subject
Matter Experts (SMEs) tend to produce more consistent RAM requirements and better
oversight of contractor RAM processes and activities. The MITRE systems engineer has the
opportunity to "reach back" to bring MITRE to bear by strategically engaging MITRE-based
RAM SMEs early in programs.

Consistent RAM requirements. The upper level RAM requirements should be consistent with
the lower level RAM input variables, which are typically design related and called out in
technical and performance specifications. A review of user requirements and flow down of
requirements to a contractual specification document released with a Request For Proposal
(RFP) package must be completed. If requirements are inconsistent or unrealistic, the
program is placed at risk for RAM performance before contract award.

Ensure persistent, active engagement of all stakeholders. RAM is not a stand-alone specialty
called on to answer the mail in a crisis, but rather a key participant in the acquisition
process. The RAM discipline should be involved early in the trade studies where
performance, cost, and RAM should be part of any trade-space activity. The RAM SME needs
to be part of requirements development with the user that draws on a defined Concept of
Operations (CONOPS) and what realistic RAM goals can be established for the program. The
RAM SME must be a core member of several Integrated Product Teams (IPTs) during
system design and development to establish insight and a collaborative relationship with the
contractor team(s): RAM IPT, Systems Engineering IPT, and Logistics Support IPT.
Additionally, the RAM specialty should be part of the test and evaluation IPT to address RAM
test strategies (Reliability Growth, Qualification tests, Environmental testing, BIT testing,
and Maintainability Demonstrations) while interfacing with the contractor test teams and the
government operational test community.

Remember—RAM is a risk reduction activity. RAM activities and engineering processes are a
risk mitigation activity used to ensure that performance needs are achieved for mission
success and that the LCC are bounded and predictable. A system that performs as required
can be employed per the CONOPS, and sustainment costs can be budgeted with a low risk
of cost overruns. Establish reliability Technical Performance Measures (TPMs) that are
reported on during Program Management Reviews (PMRs) throughout the design,
development, and test phases of the program, and use these TPMs to manage risk and
mitigation activities.

Institute the Reliability Program Plan. The Reliability (or RAM) Program Plan (RAMPP) is
used to define the scope of RAM processes and activities to be used during the program. A
program office RAMPP can be developed to help guide the contractor RAM process. The
program-level RAMPP will form the basis for the detailed contractor RAMPP, which ties RAM
activities and deliverables to the Integrated Master Schedule (IMS).

Employ reliability prediction and modeling. Use reliability prediction and modeling to assess
the risk in meeting RAM requirements early in the program when a hardware/software
architecture is formulated. Augment and refine the model later in the acquisition cycle, with
design and test data during those program phases.

Reliability testing. Be creative and use any test phase to gather data on reliability
performance. Ensure that the contractor has planned for a Failure Review Board (FRB) and
uses a robust Failure Reporting And Corrective Action System (FRACAS). When planning a
reliability growth test, realize that the actual calendar time will be 50–100% more than the
actual test time to allow for root cause analysis and corrective action on discovered failure
modes.

Don't forget the maintainability part of RAM. Use maintainability analysis to assess the
design for ease of maintenance, and collaborate with Human Factors Engineering (HFE)
SMEs to assess impacts to maintainers. Engage with the Integrated Logistics Support (ILS)
IPT to help craft the maintenance strategy, and discuss levels of repair and sparing. Look
for opportunities to gather maintainability and testability data during all test phases. Look at
Fault Detection and Fault Isolation (FD/FI) coverage and impact on repair time lines. Also
consider and address software maintenance activity in the field as patches, upgrades, and
new software revisions are deployed. Be aware that the ability to maintain the software
depends on the maintainer's software and IT skill set and on the capability built into the
maintenance facility for software performance monitoring tools. A complete maintenance
picture includes defining scheduled maintenance tasks (preventive maintenance) and
assessing impacts to system availability.

Understand reliability implications when using COTS. Understand the operational environment and the COTS hardware design envelopes and impact on reliability
performance. Use Failure Modes Effects Analysis (FMEA) techniques to assess integration
risk and characterize system behavior during failure events.

References & Resources


1. Military Handbook 338, Electronic Reliability Design Handbook, October 1998.
2. Asthana, A. and J. Olivieri, "Quantifying Software Reliability and Readiness," IEEE Communications Quality and Reliability (CQR) Proceedings, 2009.
3. Report of the Defense Science Board Task Force on Developmental Test and
Evaluation, May 2008.
4. Department of Defense Instruction, Number 5000.02, Operation of the Defense Acquisition System, December 2008.
5. Department of Defense Reliability, Availability, Maintainability, and Cost Rationale
Report Manual, June 2009.
6. Manual for the Operation of the Joint Capabilities Integration and Development System,
January 2011.
7. DoD Guide For Achieving Reliability, Availability, and Maintainability, August 2005.
8. Reliability Information Analysis Center, Reliability Toolkit: Commercial Practices Edition.

Reliability, Availability, and Maintainability
From SEBoK

Reliability, Availability, and Maintainability

Reliability, availability, and maintainability (RAM) are three system attributes that are of
tremendous interest to systems engineers, logisticians, and users. Collectively, they affect
economic life-cycle costs of a system and its utility.

This article focuses primarily on the reliability of physical system elements. Software
reliability is a separate discipline. Readers interested in software reliability should refer to
the IEEE Std 1633 (IEEE 2008).

Contents
· 1 Probability Models for Populations
· 2 Data Issues
· 3 Design Issues
· 4 Post-Production Management Systems
· 5 Models
· 6 System Metrics
· 7 System Models
· 8 Software Tools
· 9 References
o 9.1 Works Cited
o 9.2 Primary References
o 9.3 Additional References
· 10 SEBoK Discussion

Probability Models for Populations


Reliability is defined as the probability of a system or system element performing its
intended function under stated conditions without failure for a given period of time (ASQ
2011). A precise definition must include a detailed description of the function, the
environment, the time scale, and what constitutes a failure. Each can be surprisingly
difficult to define as precisely as one might wish. Different failure mechanisms are referred
to as failure modes and can be modeled separately or aggregated into a single failure
model.

Let T be a random time to failure. Reliability can be thought of as the complement of the
cumulative distribution function (CDF) for T for a given set of environmental conditions e:

$R(t \mid e) = P(T > t \mid e) = 1 - F(t \mid e)$
Maintainability is defined as the probability that a system or system element can be
repaired in a defined environment within a specified period of time. Increased
maintainability implies shorter repair times (ASQ 2011).

Availability is the probability that a repairable system or system element is operational at a given point in time under a given set of environmental conditions. Availability depends on
reliability and maintainability and is discussed in detail later in this topic (ASQ 2011).

Each of these probability models is usually specified by a continuous, non-negative distribution. Typical distributions used in practice include exponential (possibly with a
threshold parameter), Weibull (possibly with a threshold parameter), log-normal, and
generalized gamma.

Maintainability models present some interesting challenges. The time to repair an item is
the sum of the time required for evacuation, diagnosis, assembly of resources (parts, bays, tools, and mechanics), repair, inspection, and return. Administrative delay (such as holidays)
can also affect repair times. Often these sub-processes have a minimum time to complete
that is not zero, resulting in the distribution used to model maintainability having a
threshold parameter.

A threshold parameter is defined as the minimum probable time to repair. Estimation of maintainability can be further complicated by queuing effects, resulting in times to repair
that are not independent. This dependency frequently makes analytical solution of problems
involving maintainability intractable and promotes the use of simulation to support analysis.
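To illustrate why simulation is attractive here, the sketch below (all distribution parameters hypothetical) alternates Weibull operating periods with repair periods modeled as a fixed threshold plus a lognormal delay, then estimates availability from the accumulated up and down time.

import random

def simulate_availability(mission_hours=100_000.0, seed=1):
    # Crude Monte Carlo estimate of availability with a threshold repair-time model.
    rng = random.Random(seed)
    up, down, t = 0.0, 0.0, 0.0
    while t < mission_hours:
        ttf = rng.weibullvariate(alpha=500.0, beta=1.5)     # time to failure, hours
        ttr = 0.5 + rng.lognormvariate(mu=1.0, sigma=0.6)   # 0.5 h threshold plus a random delay
        up += ttf
        down += ttr
        t += ttf + ttr
    return up / (up + down)

print(f"Simulated availability ~ {simulate_availability():.4f}")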

Data Issues
True RAM models for a system are generally never known. Data on a given system is
assumed or collected, used to select a distribution for a model, and then used to fit the
parameters of the distribution. This process differs significantly from the one usually taught
in an introductory statistics course.

First, the normal distribution is seldom used as a life distribution, since it assigns probability to negative times. Second, and more importantly, reliability data is different from classic
experimental data. Reliability data is often censored, biased, observational, and missing
information about covariates such as environmental conditions. Data from testing is often
expensive, resulting in small sample sizes. These problems with reliability data require
sophisticated strategies and processes to mitigate them.

One consequence of these issues is that estimates based on limited data can be very
imprecise.
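As a sketch of one common way to handle censored life data (assumed, illustrative test times; not a prescribed method), the following fits a Weibull model by maximum likelihood, with failed units contributing the log-density and unfailed (right-censored) units contributing the log-survival function:

import numpy as np
from scipy.optimize import minimize

# Hypothetical test data in hours; censored = 1 means the unit had NOT failed when the test ended
times    = np.array([ 80., 150., 210., 270., 400., 500., 500., 500.])
censored = np.array([  0,    0,    0,    0,    0,    1,    1,    1 ])

def neg_log_lik(params):
    log_shape, log_scale = params              # optimize on the log scale to keep both positive
    k, lam = np.exp(log_shape), np.exp(log_scale)
    z = times / lam
    log_pdf = np.log(k / lam) + (k - 1.0) * np.log(z) - z**k   # Weibull log-density
    log_surv = -z**k                                           # Weibull log-survival
    return -np.sum(np.where(censored == 1, log_surv, log_pdf))

res = minimize(neg_log_lik, x0=[0.0, np.log(times.mean())], method="Nelder-Mead")
shape, scale = np.exp(res.x)
print(f"Weibull shape ~ {shape:.2f}, characteristic life ~ {scale:.0f} h")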

Design Issues
System requirements should include specifications for reliability, maintainability, and
availability, and each should be conditioned on the projected operating environments.

A proposed design should be analyzed prior to development to estimate whether or not it will meet those specifications. This is usually done by assuming historical data on actual or
similar components represents the future performance of the components for the proposed
system. If no data is available, conservative engineering judgment is often applied. The
system dependency on the reliability of its components can be captured in several ways,
including reliability block diagrams, fault trees, and failure mode effects and criticality
analyses (FMECA) (Kececioglu 1991).

If a proposed design does not meet the preliminary RAM specifications, it can be adjusted.
Critical failures are mitigated so that the overall risk is reduced to acceptable levels. This
can be done in several ways:

1. Fault tolerance is a strategy that seeks to make the system robust against the failure of
a component. This can be done by introducing redundancy. Redundant units can
operate in a stand-by mode. A second tolerance strategy is to have the redundant
components share the load, so that even if one or more of them fail the system
continues to operate. There are modeling issues associated with redundancy, including
switching between components, warm-up, and increased failure rates for surviving units
under increased load when another load-sharing unit fails. Redundancy can be an
expensive strategy as there are cost, weight, volume, and power penalties associated
with stand-by components.
2. Fault avoidance seeks to improve individual components so that they are more reliable.
This can also be an expensive strategy, but it avoids the switching issues, power,
weight, and volume penalties associated with using redundant components.
3. A third strategy is to repair or replace a component following a preventive maintenance
schedule. This requires the assumption that the repair returns the component to “good
as new” status, or possibly to an earlier age-equivalent. These assumptions can cause
difficulties; for example, an oil change on a vehicle does not return the engine to ‘good
as new’ status. Scheduled replacement can return a unit to good as new, but at the cost
of wasting potential life for the replaced unit. As a result, the selection of a replacement
period is a non-linear optimization problem that minimizes total expected life-cycle costs.
These costs are the sum of the expected costs of planned and unplanned maintenance
actions.
4. A fourth strategy is to control the environment so that a system is not operated under
conditions that accelerate the aging of its components.

Any or all of the above strategies (fault tolerance, fault avoidance, preventive maintenance,
and environmental control) may be applied to improve the designed reliability of a system.
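Relating to the third strategy above, one common formulation of the replacement-period trade-off (a sketch with hypothetical costs and an assumed Weibull wear-out model, not the only approach) chooses the planned replacement age T that minimizes the long-run cost rate, i.e., expected cost per renewal cycle divided by expected cycle length:

import numpy as np

# Age-replacement policy for a wearing-out component (Weibull shape > 1).
# Hypothetical costs: a planned replacement is much cheaper than an in-service failure.
c_planned, c_failure = 1.0, 10.0
shape, scale = 2.5, 1000.0            # assumed Weibull parameters, hours

def reliability(t):
    return np.exp(-(t / scale) ** shape)

def cost_rate(T, n=2000):
    # Long-run cost per hour: [c_p*R(T) + c_f*(1 - R(T))] / integral of R(t) from 0 to T
    t = np.linspace(0.0, T, n)
    r = reliability(t)
    expected_cycle = np.sum((r[:-1] + r[1:]) * np.diff(t)) / 2.0   # trapezoidal rule
    return (c_planned * reliability(T) + c_failure * (1.0 - reliability(T))) / expected_cycle

candidates = np.linspace(50.0, 3000.0, 400)
best_T = min(candidates, key=cost_rate)
print(f"Replace at about {best_T:.0f} h (cost rate {cost_rate(best_T):.5f} per hour)")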

Post-Production Management Systems


Once a system is fielded, its reliability and availability should be tracked. Doing so allows
the producer / owner to verify that the design has met its RAM objectives, to identify
unexpected failure modes, to record fixes, to assess the utilization of maintenance
resources, and to assess the operating environment.

One such tracking system is generically known as a FRACAS system (Failure Reporting and
Corrective Action System). Such a system captures data on failures and improvements to
correct failures. This database is separate from a warranty data base, which is typically run
by the financial function of an organization and tracks costs only.

A FRACAS for an organization is a system, and itself should be designed following systems
engineering principles. In particular, a FRACAS system supports later analyses, and those
analyses impose data requirements. Unfortunately, the lack of careful consideration of the
backward flow from decision to analysis to model to required data too often leads to
inadequate data collection systems and missing essential information. Proper prior planning
prevents this poor performance.

Of particular importance is a plan to track data on units that have not failed. Units whose
precise times of failure are unknown are referred to as censored units. Inexperienced
analysts frequently do not know how to analyze censored data, and they omit the censored
units as a result. This can bias an analysis.

An organization should have an integrated data system that allows reliability data to be
considered with logistical data, such as parts, personnel, tools, bays, transportation and
evacuation, queues, and costs, allowing a total awareness of the interplay of logistical and
RAM issues. These issues in turn must be integrated with management and operational
systems to allow the organization to reap the benefits that can occur from complete
situational awareness with respect to RAM.

Models
There is a wide range of models that estimate and predict reliability (Meeker and Escobar 1998). Simple models, such as the exponential distribution, can be useful for ‘back of the
envelope’ calculations.

There are more sophisticated probability models used for life data analysis. These are best characterized by their failure rate behavior, which is defined as the probability that a unit fails in the next small interval of time, given that it has survived to the beginning of the interval, divided by the length of the interval.
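In symbols, this is the standard hazard (failure rate) function, consistent with the prose definition above; written in LaTeX:

h(t) = \lim_{\Delta t \to 0} \frac{\Pr\{\, t < T \le t + \Delta t \mid T > t \,\}}{\Delta t} = \frac{f(t)}{R(t)},
\qquad \text{where } f \text{ is the density of the life } T \text{ and } R(t) = \Pr\{T > t\}.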

Models can be considered for a fixed environmental condition. They can also be extended to
include the effect of environmental conditions on system life. Such extended models can in
turn be used for accelerated life testing (ALT), where a system is deliberately and carefully
overstressed to induce failures more quickly. The data is then extrapolated to usual use
conditions. This is often the only way to obtain estimates of the life of highly reliable
products in a reasonable amount of time (Nelson 1990).

Also useful are degradation models, where some characteristic of the system is
associated with the propensity of the unit to fail (Nelson 1990). As that characteristic
degrades, we can estimate times of failure before they occur.

The initial developmental units of a system often do not meet their RAM specifications.
Reliability growth models allow estimation of resources (particularly testing time)
necessary before a system will mature to meet those goals (Meeker and Escobar 1998).

Maintainability models describe the time necessary to return a failed repairable system to
service. They are usually the sum of a set of models describing different aspects of the
maintenance process (e.g., diagnosis, repair, inspection, reporting, and evacuation). These
models often have threshold parameters, which are minimum times until an event can
occur.

Logistical support models attempt to describe flows through a logistics system and
quantify the interaction between maintenance activities and the resources available to
support those activities. Queue delays, in particular, are a major source of down time for a
repairable system. A logistical support model allows one to explore the trade space between
resources and availability.

All these models are abstractions of reality, and so are at best approximations to it. To the
extent they provide useful insights, they are still very valuable. The more complicated the
model, the more data necessary to estimate it precisely. The greater the extrapolation
required for a prediction, the greater the imprecision.

Extrapolation is often unavoidable, because high-reliability equipment typically has a long life and the amount of time required to observe failures may exceed test times. This requires that strong assumptions be made about future life (such as the absence of masked failure modes), and these assumptions increase uncertainty about predictions. The
uncertainty introduced by strong model assumptions is often not quantified and presents an
unavoidable risk to the system engineer.

System Metrics
Probabilistic metrics describe system performance for RAM. Quantiles, means, and modes of
the distributions used to model RAM are also useful.

Availability has some additional definitions, characterizing what downtime is counted against
a system. For inherent availability, only downtime associated with corrective maintenance
counts against the system. For achieved availability, downtime associated with both
corrective and preventive maintenance counts against a system. Finally, operational
availability counts all sources of downtime, including logistical and administrative, against
a system.
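A small numerical sketch of how the three definitions differ (hypothetical downtimes, expressed per failure for simplicity; the MTBF/MTTR ratio form is the commonly used steady-state approximation, not a formula quoted from this source):

# Hypothetical steady-state figures, hours
mtbf, mttr = 400.0, 4.0                 # corrective maintenance only
pm_downtime_per_failure = 1.0           # preventive maintenance, prorated per failure
logistics_admin_per_failure = 6.0       # spares, crew, and administrative delays

A_inherent    = mtbf / (mtbf + mttr)
A_achieved    = mtbf / (mtbf + mttr + pm_downtime_per_failure)
A_operational = mtbf / (mtbf + mttr + pm_downtime_per_failure + logistics_admin_per_failure)

print(f"Inherent    availability ~ {A_inherent:.4f}")     # ~0.9901
print(f"Achieved    availability ~ {A_achieved:.4f}")     # ~0.9877
print(f"Operational availability ~ {A_operational:.4f}")  # ~0.9732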

Availability can also be calculated instantaneously, averaged over an interval, or reported as an asymptotic value. Asymptotic availability can be calculated easily, but care must be taken to analyze whether a system settles down or settles up to the asymptotic value, as well as how long it takes for the system to approach that value.

Reliability importance measures the effect on the system reliability of a small improvement in a component’s reliability. It is defined as the partial derivative of the
system reliability with respect to the reliability of a component.
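In symbols, this is the quantity often called the Birnbaum importance measure; as a simple worked case it specializes for a series system as follows (LaTeX):

I_i = \frac{\partial R_{\text{sys}}}{\partial R_i},
\qquad \text{e.g., for a series system } R_{\text{sys}} = \prod_{j=1}^{n} R_j, \quad I_i = \prod_{j \neq i} R_j .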

Criticality is the product of a component’s reliability, the consequences of a component
failure, and the frequency with which a component failure results in a system failure.
Criticality is a guide to prioritizing reliability improvement efforts.

Many of these metrics cannot be calculated directly because the integrals involved are
intractable. They are usually estimated using simulation.

System Models
There are many ways to characterize the reliability of a system, including fault trees,
reliability block diagrams, and failure mode effects analysis.

A Fault Tree (Kececioglu 1991) is a graphical representation of the failure modes of a system. It is constructed using logical gates, with AND, OR, NOT, and K of N gates
predominating. Fault trees can be complete or partial; a partial fault tree focuses on a
failure mode or modes of interest. They allow 'drill down' to see the dependencies of
systems on nested systems and system elements. Fault trees were pioneered by Bell Labs
in the 1960s.

Figure 1. Fault Tree. (SEBoK Original)

A Reliability Block Diagram (RBD) is a graphical representation of the reliability dependence of a system on its components. It is a directed, acyclic graph. Each path
through the graph represents a subset of system components. As long as the components in
that path are operational, the system is operational. Component lives are usually assumed
to be independent in a RBD. Simple topologies include a series system, a parallel system, a
k of n system, and combinations of these.

RBDs are often nested, with one RBD serving as a component in a higher level model. These
hierarchical models allow the analyst to have the appropriate resolution of detail while still
permitting abstraction.

RBDs depict paths that lead to success, while fault trees depict paths that lead to failure.
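For independent components, the simple RBD topologies mentioned above reduce to elementary probability formulas. The Python sketch below (hypothetical component reliabilities) evaluates a series block, a parallel block, and a k-of-n block of identical components:

from math import comb, prod

def series(rels):
    # Every component in the path must work.
    return prod(rels)

def parallel(rels):
    # The block works if at least one component works.
    return 1.0 - prod(1.0 - r for r in rels)

def k_of_n(k, n, r):
    # At least k of n identical, independent components must work.
    return sum(comb(n, i) * r**i * (1.0 - r)**(n - i) for i in range(k, n + 1))

print(series([0.95, 0.99, 0.98]))   # ~0.921
print(parallel([0.90, 0.90]))       # 0.990
print(k_of_n(2, 3, 0.95))           # ~0.993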

Figure 2. Simple Reliability Block Diagram. (SEBoK Original)

A Failure Mode Effects Analysis is a table that lists the possible failure modes for a
system, their likelihood, and the effects of the failure. A Failure Modes Effects Criticality
Analysis scores the effects by the magnitude of the product of the consequence and
likelihood, allowing ranking of the severity of failure modes (Kececioglu 1991).

System models require even more data to fit them well. “Garbage in, garbage out” (GIGO)
particularly applies in the case of system models.

Software Tools
The specialized analyses required for RAM drive the need for specialized software. While
general purpose statistical languages or spreadsheets can, with sufficient effort, be used for
reliability analysis, almost every serious practitioner uses specialized software.

Minitab (versions 13 and later) includes functions for life data analysis. Win Smith is a
specialized package that fits reliability models to life data and can be extended for reliability
growth analysis and other analyses. Relex has an extensive historical database of
component reliability data and is useful for estimating system reliability in the design phase.

There is also a suite of products from ReliaSoft (2007) that is useful in specialized analyses.
Weibull++ fits life models to life data. ALTA fits accelerated life models to accelerated life
test data. BlockSim models system reliability, given component data.

Difference between Availability and Reliability

People often confuse reliability and availability. Simply put, availability is a measure of the percentage of time the equipment is in an operable state, while reliability is a measure of how long the item performs its intended function. We can refine these definitions by considering the desired performance standards.

Availability is an operations parameter: presumably, if the equipment is available 85% of the time, we are producing at 85% of the equipment’s technical limit, which usually equates to the financial performance of the asset. Of course, quality and machine speed also need to be considered to get a proper representation of how close we are to this technical limit; that combined measure is called Overall Equipment Effectiveness (OEE).

Availability can be measured as: Uptime / Total time (Uptime + Downtime)

Reliability is a measure of the probability that an item will perform its intended function for a specified
interval under stated conditions. There are two commonly used measures of reliability:

* Mean Time Between Failure (MTBF), which is defined as: total time in service / number of failures
* Failure Rate (λ), which is defined as: number of failures / total time in service.
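A short worked example using the measures above (hypothetical log data for one machine over one month):

# Hypothetical monthly log
total_time_h = 720.0        # calendar hours in the period
downtime_h   = 36.0         # total repair downtime
failures     = 6

uptime_h     = total_time_h - downtime_h
availability = uptime_h / total_time_h           # uptime / (uptime + downtime)
mtbf_h       = uptime_h / failures               # total time in service / number of failures
failure_rate = failures / uptime_h               # failures per operating hour

print(f"Availability = {availability:.1%}")      # 95.0%
print(f"MTBF         = {mtbf_h:.0f} h")          # 114 h
print(f"Failure rate = {failure_rate:.4f} /h")   # 0.0088 /h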

A piece of equipment can be available but not reliable. For example, a machine that is down 6 minutes every hour has an availability of 90% but an MTBF of less than one hour. That may be acceptable in some circumstances, but what if this is a paper machine? It will take at least 30 minutes of run time to get to the point where we are producing good paper.

Generally speaking, a reliable machine has high availability, but an available machine may or may not be very reliable.
