Reliability Tools


Reliability Tools:

Reliability tools exist by the dozens: what are the tools, why use the tools, when should
I use the tools, and where should I use the tools? The tools listed below are described in
the sections that follow.

Accelerated Testing, Availability, Bathtub Curves, Block Diagram Models, Capability,
Configuration Control, Contract For Reliability, Cost Of Unreliability, Critical Items
List, Data, Decision Trees, Dependability, Design Review, Effectiveness, Electronic
Components, ESS Testing, Events/Incidents, Exponential Distribution, Failure, Failure
Forecast, Failure Rates, Fault Tree Analysis, FMEA, FRACAS Systems, HALT, HASS,
Life Cycle Cost, Life Units, Load-Strength, Lognormal, Maintainability, Maintenance,
Maintenance Engineering, Management's Role, Mean Time, Mechanical Component
Interactions, Monte Carlo, Normal Distribution, OEE, Pareto Distribution, Poisson
Distribution, Probability Plots, Process Reliability, QFD, RBDs, Reliability, Reliability
Audits, RBDs, Reliability-Centered Maintenance, Reliability Engineering, Reliability
Growth, Reliability Policies, Reliability Testing, Simultaneous Testing, Software
Reliability, Sudden Death Testing, TPM, Weibayes Estimates, Weibull Analysis,
Weibull Corrective Action, Weibull Database
The details about these tools will be brief as books are written about each item. Think of
the presentations below as hors d’oeuvres (a little snack food or starters)—not the main
course.

The most important reliability tool is a Pareto distribution based on
money—specifically based on the cost of unreliability, which directs attention to work
on the most important money problem first. No magic bullet exists for reliability issues;
don't waste your time looking for a single magic tool—none exists!

Accelerated Testing-
What: A test method of increasing loads to quickly produce age-to-failure data with
only a few data points; the results are then scaled to reflect normal loads.
Why: The benefit of accelerated testing is to save time and money while quantifying
the relationships between stress and performance along with identifying
design and manufacturing deficiencies to get useful data quickly and at low
cost.
When: Usually performed during the development of devices, components, or
systems. Also applies to items that have been in service to obtain a metric
needed to show how the item is performing under heavy loads. Accelerated
testing is a useful method for solving old, nagging problems within a
production process.
Where: Used for correlating test results with real life conditions.

Availability-
What: A tool for measuring the percent of time an item or system is in a state of
readiness where it is operable and can be committed to use when called upon.
Availability ceases because of a downing event that causes the item/system to
become unavailable to initiate a mission when called upon. In the simplest
view the metric is availability = uptime/(uptime + downtime). For many other
definitions see MIL-HDBK-338, section 5.
Why: The measure is important for knowing the commitment of time for performing
the mission and it usually only involves the use of arithmetic.
When: Often the measurement tool is based on past experiences and the complement
of the measurement tool addresses unavailability to perform the task.
Where: In design of a system it is a calculated value and in operation of a system it is
a performance index that is often easy to use and provides an index that is
understandable to the average person. Today there is a great tendency to
“Enronize” availability metrics by using uptime metrics that present data in
the best light (an issue of data integrity) to maximize managerial bonuses by
excusing (deducting) downtime from the calculations to put “lipstick on the
pig”. Use the KISS principle. Think of availability in terms of the investor’s
typical year of 8760 hours. The no-excuse annual metric in hours is
availability = uptime/8760. Suddenly you’ll find a metric of great interest to
investors that can be benchmarked as a financial issue, and thus motivate the
management team to solve real issues of importance to the business. Please
note, you can have high availability but many failures and thus low reliability
as availability ≠ reliability. Likewise, you can have high availability, but little
output so team the metric with effectiveness to get the complete story.
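As a minimal sketch in Python (all hours are hypothetical illustration values), the two
availability views above compare like this:

    # Hypothetical year: 8,200 h up, 320 h of counted downtime, and
    # 240 h of "excused" time that never enters the simple metric.
    uptime_hours = 8200.0
    downtime_hours = 320.0

    # Simple view: availability = uptime/(uptime + downtime)
    simple = uptime_hours / (uptime_hours + downtime_hours)

    # No-excuse investor view: availability = uptime/8760
    no_excuse = uptime_hours / 8760.0

    print(f"uptime/(uptime + downtime) = {simple:.4f}")     # ~0.9624
    print(f"uptime/8760                = {no_excuse:.4f}")  # ~0.9361

The gap between the two numbers is exactly the downtime that excuse-laden metrics
quietly deduct.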

Bathtub Curves-
What: The concept is derived from the human life experience involving infant
mortality, chance failures, plus a wearout period of life since data for births
and deaths is accumulated by government agencies. Most equipment lacks the
birth/death recording by government agencies and most non-human systems
can be regenerated to live/die many times before relegation to the scrap heap.
Why: Failure rates are different for both people and equipment at different phases of
operation, and the medicine to be applied to both humans and equipment needs
to be considered for effectively treating the roots of the problem.
When: The concept is useful during design, operation, and maintenance of equipment
and systems to understand the failure mechanisms.
Where: It explains the human experiences to the ordinary person to relate
equipment/system failures to those experienced in real life so as to coordinate
the design, operation and maintenance of equipment. For other definitions see
MIL-HDBK-338, section 9.

Block Diagram Model (same as Reliability Block Diagram Models)-


What: Reliability block diagram (RBD) models are graphical representations of a
calculation methodology for reliability systems.
Why: The RBD models allow calculation of system reliability based on
knowing/assuming failure details of the components, starting with the smallest
component and growing the model to the greatest system to predict
performance from the elements.
When: RBDs are used in upfront designs as a performance parameter and after the
system is constructed to ferret out poor performing blocks that limit the
system performance.
Where: Frequently used as a trade-off tool to search for the lowest long-term cost of
ownership and to help sell alternative courses of action for moderating the
effects of reliability issues or overcoming poor performance with alternative
designs. The results can be calculated before building the system, and the
calculations provide knowledge about availability, maintenance interventions
required for failures, and the number of spare parts required to sustain
operations. For other definitions see MIL-HDBK-338, sections 4 and 6.
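A minimal sketch of the block arithmetic, assuming independent blocks with
hypothetical mission reliabilities: series blocks multiply reliabilities, and parallel
(redundant) blocks multiply unreliabilities.

    def series(*blocks):
        result = 1.0
        for r in blocks:
            result *= r
        return result

    def parallel(*blocks):
        all_fail = 1.0
        for r in blocks:
            all_fail *= (1.0 - r)   # probability every redundant leg fails
        return 1.0 - all_fail

    pump, spare_pump, valve = 0.95, 0.95, 0.99

    # Two pumps in parallel feeding one valve in series:
    system = series(parallel(pump, spare_pump), valve)
    print(f"system mission reliability = {system:.4f}")   # ~0.9875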

Capability-
What: A measure of how well the product performance meets objectives. In short,
how well are the outputs actually accomplished against a standard?
Capability is frequently the product of efficiency * utilization.
Why: Capability is a component of the effectiveness equation and usually under the
control of production.
When: Data for this metric is frequently produced by the accounting department each
month as a segment of the financial reports for the purpose of handling
variances against the standards.
Where: Frequently in the effectiveness measure it is a weak point (as a measure of
how well the production process does the job for which it was purchased)
requiring substantial improvement that cannot be solved by the usual
reliability and maintainability (RAM) tools. However, this metric may be
deficient from the original design (an issue of design effectiveness) of the
system or from the way the system is operated (an issue of use effectiveness).

Configuration Control-
What: Configuration control is involved with the management of change by
providing traceability of failures back into the design standard. If the design
details are not specified, the design will not contain the requirements and thus
implementation of the project will be hit or miss for achieving the desired end
results, beginning with the conceptual design and resulting in the operating
facility.
Why: With active configuration control you know where items are used and
contained, where and why they were installed, where signals originate, what
items are used where and in what environments, what drawing revisions have
occurred, whether the product conforms to the drawings and specifications,
what alternate materials/components have been used, and what test
reports/certifications are available as original documents for review.
When: Configuration control begins after the first design review to build an unbroken
chain of traceability to aid in avoiding surprises in the field which would
destroy the designed-in criteria for availability, reliability, maintainability, and
cost effectiveness established as a portion of the original design criteria.
Where: Frequently these documentation details are assembled into a dossier with third
party witnessing for use in validating conformance to the design requirements
and provided to the owner of the equipment as witness documents.

Contracting For Reliability-


What: Tell your vendors what you want, and want what you say. Provide
explanations of the objectives in written contracts in terms the vendors will
understand.
Why: If you can’t clearly spell out the requirements for availability, reliability, and
maintainability the contractors cannot make these issues features of the
design. Thus, it is important to be specific in the features the design must
manifest. Explanations such as: “You know what I want and what I need, just
do it quickly” are self-defeating expressions of vague generalities that lead to
inferior designs and constant arguments. Be specific about requirements for
building reliability block diagrams, using quality function deployment,
performing failure mode and effects analysis, conducting fault tree analysis,
and finally, conducting design reviews for reliability.
When: Write the specifications before procurement begins. Plan to spend time with
your own purchasing department to explain the details and sell the team on the
financial advantages for including reliability requirements into the
specifications. Likewise, spend time selling your vendors on the requirements
and why they are stated.
Where: These are up front decisions to avoid replication of previous problems that
were built into previous designs and never corrected.

Cost Of Unreliability-
What: The cost of unreliability is a big-picture view of system failure costs,
described in annual terms, for a manufacturing plant as if the key elements
were reduced to a series block diagram for simplicity. It looks at the
production system and reduces the complexity to a simple series system where
failure of a single item/equipment/system/processing-complex causes the loss
of productive output along with the total cost incurred for the failure. If the
system IS sold out, then the cost of unreliability must include all appropriate
business costs such as lost gross margin plus repair costs, scrap incurred, etc.
If the system is NOT sold out, and make-up time is available in the financial
year, then lost gross margin for the failure cannot be counted. The cost of
unreliability is a management concern connected to management’s two
favorite metrics: time and money.
Why: In private enterprise, failures must be considered from a financial viewpoint
and not a "gear-head" approach of simply counting the number of failures;
you must also speak the language of the enterprise, which describes events by
monetary measures over a period of time. The annual cost for failures is
usually not stated in a clear-cut manner, nor are failure costs summarized by
system/sub-system to identify the weak links in a monetary fashion. Build a
clear Pareto distribution to attack the vital (high cost) areas with an action
plan that reduces failures (unreliability) and thus reduces the annual cost of
unreliability.
When: For a new plant, this can be a design criterion to limit costs of unreliability
for competitive reasons in the marketplace. You must make the hidden costs
of failures obvious as a portion of the strategic plan. For an existing plant,
this can be an exercise in defining the cost of unreliability and building a long-
term plan to reduce the cost of failures as a portion of the tactical plan.
Where: This activity is best performed with high-level involvement of the
management team to provide fundamental understanding of the size of the
icebergs about to rip out the underbelly of the plant and to involve the
organization in a plan to reduce the costs so that profits are pushed upward
because of the improvements. If the cost of unreliability cannot be reduced,
then the costs become extra weight for the saddlebags in the race for survival.
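A hedged arithmetic sketch of the sold-out rule above, with hypothetical annual figures:

    downtime_hours = 120.0            # annual downing events
    gross_margin_per_hour = 5_000.0   # margin lost while down
    repair_costs = 180_000.0
    scrap_costs = 40_000.0

    sold_out = True   # no make-up time available in the financial year

    cost_of_unreliability = repair_costs + scrap_costs
    if sold_out:
        # Lost gross margin counts only when output cannot be made up.
        cost_of_unreliability += downtime_hours * gross_margin_per_hour

    print(f"annual cost of unreliability = ${cost_of_unreliability:,.0f}")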

Critical Items List-


What: The critical items list is a top-level summary of problems/cost used for
discussions with management about key reliability issues. The summary list
converts technical details to a summary of costs and time while placing the
issues into a Pareto distribution explained in terms of money and the vital few
problems to be solved for competitive reasons.
Why: The purpose of the critical items list is to focus management’s attention on
items that need to be resolved during the design phase as a corrective action
loop for influencing the lifetime costs.
When: The list starts with the first design review as issues are disclosed in design
reviews for reliability.
Where: The critical items list is presented to top-level management as issues to be
accepted or resolved before paper plans become steel and concrete.

Data-
What: Data is the informational energy that runs the reliability improvement
machine. Data is acquired at great cost. Data needs to be retained and used to
prevent future failure events. Proper use of data provides an understanding of
failure mechanisms and prevents reoccurrence of bad events that cause safety
or high-cost failures to occur. Reliability data requires definition of a failure.
Failures can be catastrophic failures or slow degradation—you decide by
defining the failures. The units of the measure for the data must be in units of
the degradation—sometimes it is hours, sometimes it is miles, and so forth—
in short, whatever motivates the failure. Reliability always ceases with a
failure or a removal from service in some aged condition that then generates a
category of data called a suspension or censored data. Data is information in
the form of facts, figures, or engineering databases that is obtained from
engineering tests, experiments, or actual operating conditions. Reliability data
is often incomplete as the exact times to failure are rarely known or recorded
with much precision so that only partial information is available for analysis.
Reliability data comes in two forms: 1) age-to-failure data, and 2)
censored/suspended data such as occurs when unfailed items are removed
from service or when they fail due to a different failure mode than we are
studying—this is useful information and part of the data set. Some data is
better than no data for resolving reliability issues.
Why: Data is the information that, when used in an informed manner, helps prevent
repetition of bad history and allows an enlightened approach to rationally
solving a reliability issue using facts and figures. Intelligent use of data for
reliability issues provides the objective evidence needed for helping to solve
the root cause of failures.
When: Databases of reliability information from past experience are very helpful for
predicting future failure events. The data is helpful when failure rates, or the
reciprocal of failure rates expressed as mean times to failure, reduce the
information to an average failure rate or average time to failure. The
reliability data is particularly valuable if retained for components as a Weibull
database with shape factor beta and scale factor eta.
Where: The data is useful for understanding failure modes, for predicting future
failures for a population of equipment during the design stage, and for
predicting future failures with subsequent increases in the aging of equipment.
The role of the reliability engineer is to acquire the failure data and convert
the data into useful information for both current and future use.
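For example, a minimal sketch of the two-parameter Weibull reliability function that a
component database of beta and eta values supports (the parameter values below are
hypothetical):

    import math

    def weibull_reliability(t, beta, eta):
        """Two-parameter Weibull: R(t) = exp(-(t/eta)**beta)."""
        return math.exp(-((t / eta) ** beta))

    beta, eta = 2.4, 18_000.0   # shape and scale in hours, illustration only
    for age in (5_000, 10_000, 18_000):
        print(age, round(weibull_reliability(age, beta, eta), 4))
    # At age = eta, R = 1/e ~ 0.368 regardless of beta.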

Decision Trees-
What: Most business decisions have considerable uncertainty, which implies at least
two outcomes if you choose a course of action. Making decisions in the face
of uncertainty requires the costs for taking action and the probability along
with the cost for not taking action and the probability of the occurrence. In
most cases the probabilities are not well known (maybe to one significant
digit) and the costs are not well known (maybe to $10000). The quantitative
assessment is called risk assessment. The issue is to take these not-well
identified issues and devise a strategy that can minimize exposure to risk for
the business. Decision trees are graphical representations of a methodology to
reach the expected values for the decision so as to take or not take action.
Why: Most business decisions have no exact answers, i.e., no black and white
answers but rather shades of gray. The use of the tool is to help decide which
course of action may be to the advantage of the business given the best
estimates that can be made.
When: Decisive details will only be known in the future and decisions have to be
made today; decision trees are tools to help wisely span from today into the
future with the wisest decisions that can be made from sketchy data.
Where: If you have absolute data, use it. Most decisions must be made with
indecisive information that requires decisions about the odds for a given
event, usually based on estimates—the wiser the estimate the better the
decision, taking into account the probabilities of the outcomes and the money
involved in the decision. Use this tool when few details are available and you
must be the pioneer to cut through the forest to reach the promised land of
opportunity and profitable ventures.
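A minimal expected-value sketch of a two-branch decision, using the kind of
one-significant-digit probabilities and rough costs described above (all numbers
hypothetical):

    p_failure = 0.3               # rough odds of the bad outcome
    cost_of_failure = 250_000.0   # consequence if no action is taken
    cost_of_action = 60_000.0     # certain cost of acting now

    ev_do_nothing = p_failure * cost_of_failure   # expected loss: $75,000
    ev_take_action = cost_of_action

    print(f"E[do nothing]  = ${ev_do_nothing:,.0f}")
    print(f"E[take action] = ${ev_take_action:,.0f}")
    # Choose the branch with the smaller expected cost.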

Dependability-
What: The International Electrotechnical Commission (IEC) defines dependability as
“Dependability describes the availability performance and its influencing
factors: reliability performance, maintainability performance and maintenance
support performance.” MIL-HDBK-338 defines dependability differently, as
a measure of the degree to which an item is operable and capable of
performing its required function at any (random) time during a specified
mission profile, given that the item is available at mission start. (Item state
during a mission includes the combined effects of the mission-related system
R&M parameters but excludes non-mission time; see availability.)
Dependability is related to reliability with the intention that dependability
would be a more general concept than the measurable issues of reliability,
maintainability, and maintenance.
Why: The key dependability issue is to make equipment and processes work as
advertised, which is, without failure. Dependability aims at facilitating
cooperation by all parties concerned (supplier, organization, and customer) by
fostering an understanding of the dependability needs and value to achieve the
overall dependability objectives, so it involves harmonizing conflicting
issues. Dependability is better judged from the viewpoint of the end user of the
equipment or system than from the designer’s viewpoint or the maintainer’s
viewpoint. From a system-effectiveness viewpoint, reliability and
maintainability provide system availability and dependability.
When: You cannot repair yourself to happiness with a failure prone system as the
failure-prone system will be viewed as lacking dependability to function as
required when you need it. Thus, dependability is viewed over the longer
term and not in convenient snapshots, and dependability also involves
lifecycle cost issues.
Where: Reliability contributes directly to uptime by avoiding failures whereas
maintainability contributes directly to reducing downtime by faster repairs.
Thus, reliability and maintainability jointly provide impact on dependability
of the system. Dependable systems must be ready to function, in an operable
state, to produce the desired output, upon demand by the end user, at the
specified quantity and quality of output.

Design Reviews For Reliability-


What: Specific questions to ask of design engineers during a review specifically for
reliability using failure data from operations and maintenance are: 1) Show
the calculated availability for the system based on a RAM model, 2) Show the
calculated number of failures during the specified mission time between
turnarounds based on a reliability and maintainability (RAM) model, 3) Show
details of FMEA studies, 4) Show details of FTA calculations, 5) Show the
calculated mean times between downing events, 6) Show the calculated
mean time between cutbacks from full production capability and losses thus
incurred, 7) Show the QFD matrix and details, and 8) Show the calculated
cost of unreliability.
Why: Design reviews should demonstrate by calculation or through the use of
models and reliability tools that the system is capable of achieving the design
objectives rather than making a giant leap of faith that all will be well and good.
Problems found in the design review for reliability are corrected less
expensively on paper than when corrections must be made in the field with
hardware.
When: Design reviews for reliability should be a part of the design process starting
with conceptual designs and ending when the drawings are revised for the as-
built system.
Where: This is a logical extension of the design process to show, rather than tell, how
the system will function. This is performed as a portion of the up-front design
by the numbers process.

Effectiveness-
What: The potential or actual probability of a system to perform a mission for a
given level of performance under specified operating conditions is defined as
the product of reliability*availability*maintainability*capability
(dependability is often defined as reliability*maintainability) and all values of
the product are between 0 and 1. Many variants of the effectiveness equation
exist, e.g., OEE and others. See a parallel comparison with system
effectiveness based on the productive output results of process reliability
calculations.
Why: The effectiveness equation defines the ability of a product, operating under
specified conditions, to meet operational demands when called upon. This is a
practical measure of how well the system is performing—not how well we
want it to perform, but how it is actually doing. Since all the elements are
measured between 0 and 1, the elements of the equation quickly draw the eye
to where opportunities exist for making improvements.
When: The effectiveness equation is useful in trade-off studies of various
alternatives when plotted on an X-Y scale of effectiveness vs. net present
value (NPV) for showing improvement alternatives. For the elements:
reliability defines the probability of a failure-free interval (or the complement,
unreliability, which describes the probability of failure);
availability defines the probability of the system being up and alive to handle
the demand (or the complement, unavailability, which describes the probability
of the system being down);
maintainability defines the probability of making repairs within the allowed
repair standard;
capability defines the probability of production achieving the desired
production results (a measure of how well the product performs compared to
the standard). Frequently it is described as the product of efficiency *
utilization, where
efficiency is an output/input relationship such as (output achieved)/(the
standard required) and
utilization is how time is used such as (direct labor)/(direct labor + labor
lost). [In the old days, if this index decreased to as low as 80% we went
berserk—today you can't get this high because of wasted time when noses are
not to the grindstone!!!]
Where: It is used to describe the performance of both new systems and old systems.
Consider this example for effectiveness: If we are comparing a heavy-duty
truck versus a sports car for transportation, the truck may be more effective
for heavy loads whereas the sports car may be more effective for acceleration
and high speeds—neither are defined by the effectiveness equation until the
mission is defined. The effectiveness index is converted into output quantities
by use of the process reliability technique for quantifying the productive plant
and the non-productive hidden plant based on a pragmatic definition of
nameplate capacity for the plant.
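A minimal sketch of the effectiveness product, with hypothetical element values each
between 0 and 1:

    reliability     = 0.90   # probability of a failure-free interval
    availability    = 0.95   # probability the system is up on demand
    maintainability = 0.85   # probability repairs meet the standard
    efficiency      = 0.92
    utilization     = 0.88
    capability      = efficiency * utilization   # ~0.81

    effectiveness = (reliability * availability
                     * maintainability * capability)
    print(f"capability    = {capability:.3f}")
    print(f"effectiveness = {effectiveness:.3f}")   # ~0.588
    # The smallest factor shows where the improvement opportunity lies.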

Electronic Components-
What: Electronic components are everywhere, and they are getting smaller and more
complex by the year! They are becoming a larger part of modern society
every day. As a class, they are particularly susceptible to increased failures
from temperature, vibration, and shock loading which destroys reliability.
Why: Most electronic devices are small and delicate. Inherent failure rates are often
built into the device by the manufacturing process (similar to building in
human genetic defects), and you cannot find the inherent defects until the
components are stressed. The best remedy for electronic devices to achieve
high reliability is to start with high-quality, durable devices built on a
failure-free process, load the devices only to moderate levels, and carefully
control the environment to suit the needs of the electronic component.
When: Burn-in tests, of different degrees of severity, following assembly of the
system are imposed to weed out the inherent defects by adding stresses due to
temperature, vibration, and shock loadings to cause the weak units to fail.
Other accelerated tests for electronic devices include ESS, HALT, and HASS.
Where: The usual failure rate distribution for electronic systems is considered to be
the exponential distribution, although some electronic devices such as SCRs
often display a decreasing failure rate described as infant mortality failure
modes by Weibull analysis, and some electronic devices have an increasing
failure rate described as a wearout failure mode for devices such as
electrolytic capacitors and EPROMs. Many electronic failure rates and
electronic models are available in MIL-HDBK-217 and its successor PRISM.

Environmental Stress Screening (ESS)-


What: A series of screens are conducted under environmental stresses to disclose
weak parts and workmanship defects that require corrections; this requires
an understanding of burn-in testing and ESS, both of which identify weak
points and eliminate them by motivating early failures. Burn-in is usually a
long process of operating under load(s) at a fixed temperature (in short, a
special case of ESS), or it can be operated at varying loads and accelerated
temperatures to achieve a shorter burn-in period, whereas ESS is a
scientifically planned and conducted test, usually run under accelerated loads
to produce the same test/use results in a shorter period of time by increasing
the stress on the components or assemblies. The objective of these screens is
to produce a failure-free product when released into operations. ESS is not
intended as a test to validate compliance to a design; rather, it is intended to
force latent defects to surface before the end user finds them in day-to-day
usage.
Why: The extremes of operating conditions such as high power levels, high
temperatures, high vibration levels, etc. produce failures not anticipated from
testing at nominal conditions. Generally, ESS is directly applicable and
interpreted to be applicable to electrical/electronic equipment, however the
same issues/concepts apply to mechanical equipment when the stressing
conditions are loads/pressures/temperatures/vibrations/thermal shocks/etc. So
as for all reliability issues—think broadly!
When: When acquiring data, the tests are done up front of production. When
controlling early failures that would be discovered by the end user, these tests
are done as a portion of the production process to eliminate weak units,
control warranty costs, and improve customer satisfaction.
Where: Some tests are conducted in the laboratory for quick results and then the data
is used to control product testing/release for the purpose of limiting costs and
preventing the loss of customers from unsatisfactory performance in the field.

Events/Incidents-
What: Events/incidents are single events or occurrences, especially particularly
significant ones, that result in a failure from a non-aging mechanism
for reliability purposes. Usually the event/incident results in a serious
consequence: the loss of functional life of a component or system. The
death of the device must be recorded as censored (suspended) data.
Why: For reliability purposes, the component, device, subassembly, or system was
a success up to the point in life where a failure from a non-aging event took
place. This means the event-age was a success (up to the point it was killed
by an event/incident), and inclusion of the data is required as
censored/suspended data—this is important data.
When: Include the suspended/censored data in every analysis. Young
suspensions/censored data have little impact on the results of an analysis, but
old suspensions have a major effect on the analysis.
Where: The data is used for MTBF/MTTF analysis and particularly for Weibull
analysis.

Exponential Distribution-
What: Describes the probability of survival and of failure of components or
equipment under the condition of chance failure, which means a constant
instantaneous failure rate where the die-off rate is the same for any surviving
(unfailed) population. An old part is as good as a new part. For any survivors
in this memory-less system that have survived to time t, a certain percent of the
survivors will die in a specified interval of time such as 2*t. The reliability of
the system is often described by the exponential distribution because many
times a system is made up of mixed failure modes that in the aggregate will
function like a constant failure rate system. The reliability of exponential
distributions are described mathematically as R(t) = e^(-λt) = e^(-t/Θ) where t
is the mission time, λ is the failure rate, and Θ is the mean time, given that
λ=1/Θ. The exponential distribution is frequently used as a first
approximation to describe reliability based on a simple failure rate or a simple
mean time to failure—particularly if the system or component has multiple
failure modes.
Why: The constant hazard rate, λ, is usually a result of combining many failure rates
into a single number.
When: The exponential distribution is frequently used for reliability calculations as a
first cut, based on its simplicity, to generate the first estimate of reliability
when more details about failure modes are not described.
Where: In electronic systems (which can have many different types of failure modes,
especially since any electrical/electronic system is an amalgam of many
different components) the simple assumption is that the electrical/electronic
package will have a constant failure rate system defined by the exponential
distribution. When in doubt about the failure mechanisms, it is common to
assume use of the exponential distribution with its constant failure rate for
simplicity.
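A short sketch of R(t) = e^(-λt) with a hypothetical failure rate:

    import math

    lam = 1.0 / 5_000.0            # failures/hour, so theta = 5,000 hours
    for t in (100, 1_000, 5_000):  # mission times in hours
        print(f"R({t}) = {math.exp(-lam * t):.4f}")
    # R(5000) = e**-1 ~ 0.3679: only ~37% survive one mean time to failure.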

Failure-
What: Failure is the loss of function when you needed the function to occur. Failures
for reliability purposes must be precisely defined so they are recorded
correctly. Much life data is incomplete because failures are mixed up with
censored/suspended data where aged items may not have failed or they
represent removals from service before failure, or they have not yet failed for
the mode of failure under study—in short, these censored/suspended items
represent successes and are a portion of the data set for study.
Why: We study failed items for the same reason we do autopsies on humans—we
want the data and we want it categorized correctly for making important
decisions. Failures require: 1) a time origin that must be unambiguously
defined, 2) a scale for measuring the passage of time/starts/stops/etc. which
motivates failure, and 3) the meaning of failure must be entirely clear for
recording the event.
When: Failure data must be recorded as it occurs to prevent loss of information.
Failure causes involve design issues, manufacturing issues, assembly issues,
installation issues, or use issues that consume life and motivate failures by
misuse, inherent weakness, or consumption of life by means of a wearout
failure issue.
Failure modes describe the effects under which a failure is observed including
early failures where failure rates decline with usage (infant mortality), where
failure rates are constant with usage (chance failures describe the usual mid
life constant failure rate mortality), and increasing failure rates with usage
(wearout failure rates).
Failure mechanisms describe the physical, chemical, metallurgical, or other
processes which motivate the failures.
Failure criteria are the basis for registering the gravity of a failure and
sometimes temporary changes in the failure state, including duration of the
failure, have an important bearing on how a failure is recorded with the two
largest classifications of failure as complete failure (can’t complete the
intended function) or partial failure (not a complete failure but deficient in
providing all features of the intended function to a level that is noticeable and
undesirable).
Failure onset can be gradual (monitoring is intended to anticipate detection of
pending failure), intermittent (failure occurs in some magnitude but recovers
to complete the intended function), and sudden failure (surprise events that
cannot be anticipated with prior examination or monitoring).
Failure consequences can also be categorized such as critical failures
(significant damage occurs and/or injury to people occurs), major failures
(less severe than a critical failure but of such a magnitude as to substantially
reduce the required function), minor failures (reduces the performance of the
asset but only causes minor consequences for the entire system), and benign
failures (failures known and observed by an expert but not detected by a
novice).
Where: The CMMS system is frequently where most data resides but usually in crude
fashion. The failure data is often transferred into the FRACAS system for
converting the symptoms of the failure into the root causes of failure. The
failure data must be converted into action items for making management
decisions about future failures and the corrective action needed.

Failure Forecast-
What: Failure forecasting is a projection of failures into the future based on assumed
or documented failure details. It is also known as risk analysis of future
failures. For a constant failure mode system this is very straightforward.
However, for complicated failure modes where the failure rate increases with
time (wearout failure modes) or where failure rates decrease with time (infant-
mortality failure modes), this becomes a more complicated analysis as
described by the Abernethy Risk which is described in The New Weibull
Handbook and implemented in the software package SuperSMITH Weibull
for predicting future failures. Likewise, reliability block diagrams are useful
for predicting future failures when the authentic failure details are supplied to
the Monte Carlo models.
Please note manufacturers follow two general strategies for their equipment:
1) build the equipment to avoid failures even though this increases the
original capital costs, or
2) build equipment and sell the original equipment at a low cost (or even a
break-even cost), expecting to make profits with the sale of replacement parts.
Thus for end users of the procured equipment, it is important to know the
forecasted failures in the face of supplier protests that "our equipment never
fails"—in that case, ask to see the sales of spare parts for similar equipment
and an estimate of the number of units working to get a crude estimate of the
strategy employed by the equipment supplier.
A failure is an event that renders equipment non-useful for the intended or
specified purpose during a designated time interval. The failure can be
sudden, partial, intermittent, gradual, complete, or catastrophic. The degree
of failure can range from gradual degradation to sudden, one-shot events
arising from weakness, from imperfections, from misuse, and so forth.
A failure mechanism includes a variety of physical processes that result in
failure from chemical, electrical, thermal, or other insults.
Why: Future failures cost money and frequently increase the risk for safety or
environmental problems. For manufacturers, the forecasted failures predict
impending high costs for warranty expenses which can make/break a
company. With good failure forecasts, you can anticipate expected failures
now (after x-usage), future failures when failed units are not replaced, and
future failures when failed units are replaced either with the same failure
modes or with differently designed components with different failure details.
When: This analysis is wisely performed during the design of the equipment,
however many surprises arise from different failure modes built into the
assembled product or incurred by unanticipated usage in operations.
Where: Generally this analysis is made during the up-front design effort—with much
disbelief the products could be “this bad”. Follow-up analysis occurs when
unexpected failure modes arise during operation of the equipment, which
causes loss of service of the equipment and high costs for the end users.
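This is not the Abernethy risk method itself; as the simplest hedged sketch, a
constant-failure-rate fleet forecast (all values hypothetical) looks like this:

    n_units = 200          # fleet size
    lam = 0.002            # failures per unit-hour (constant-rate assumption)
    horizon_hours = 720    # forecast window, about one month of service

    expected_failures = n_units * lam * horizon_hours
    print(f"expected fleet failures in {horizon_hours} h: "
          f"{expected_failures:.0f}")
    # With a constant rate the count is Poisson-distributed about this mean;
    # wearout or infant-mortality modes need the Weibull-based risk methods.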

Failure Rates-
What: Failure rates, in the simplest form, are Σ(number of failures)/Σ(time in use),
i.e., the reciprocal of mean time to/between failure. For more sophisticated
failure databases such as Weibull databases, the failure rates can be disclosed
without giving away proprietary data such as the shape factor, beta, which
tells the failure mode for the equipment.
Why: Simple failure rates are a precursor of maintenance events and production
interruptions that will occur into the future, which drive up costs and cause
chaos.
When: Failure rates derive from the history of operation or from well-known data
sources such as OREDA, IEEE 500, IEEE 493, EPRI, and other sources
listed in reading lists for reliability, including Weibull databases.
Where: The failure rates are used as an awareness criterion for the average person,
just as you use automobile fuel consumption rates for understanding the
health of your automobile and anticipating your weekly/monthly/annual
out-of-pocket expenditures for gasoline or diesel fuel. The failure rates drive
the maintenance interventions, spare parts, and maintenance cost for the
maintenance department. Similarly they predict the interruptions to the
process, lead to misses on promised deliveries, and result in negative
variances for production costs. In short, failure rates are precursors for the
misery expected for the organization.

Fault Tree Analysis-


What: Fault tree analysis (FTA) is a top-down process of defining the top-level
problems and, through a deductive approach using parallel and series
combinations of possible malfunctions, finding the root of the problem and
correcting it before the failure occurs. The reliability tool can be used as a
qualitative or quantitative method.
Why: The tool aids the design process, shows weak links that cause failures, and in
the critical legs of the trees, helps to define maintenance strategies for which
pieces of equipment and processes should be defended with the greatest
maintenance vigor to prevent “Murphy” from shutting down the process or
causing serious safety issues. The technique provides a graphical aid for the
analysis and it allows many failure modes including common-cause failures.
Results from an FTA are usually more pessimistic than those of other analysis
tools such as RBDs, as you can see from a study of the Space Shuttle
reliability analysis where each system is studied by multiple reliability tools
because of the high cost/profile of failures.
When: FTA is widely used in the design phase of nuclear power plants, subsea
control and distribution systems, and for oversight studies in layers of
protection studies for process safety and loss control in chemical plants and
refineries so as to prevent accidents and control the costs of risks. The
technique is helpful for identifying critical fault paths, observing vague failure
combinations before they occur in reality, comparing alternate designs for
safety, and setting a methodology to provide management with a tool to
evaluate the overall hazards in a system and avoid single sources of critical
failures. Finally when thinking top-down about failures and where/how they
can occur, the methodology gives a diagram for setting maintenance strategies
for protecting key pieces of equipment/processes to prevent failures.
Where: FTA is helpful for defining potential event sequences and potential incidents,
evaluating the incident consequences of outcomes, and estimating the risks of
events occurring. FTAs work in the design room and on the operating floor
where firsthand knowledge has been gained for preventing failures.
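A minimal quantitative sketch, assuming independent basic events with hypothetical
probabilities:

    def gate_or(*probabilities):
        """Probability at least one independent input event occurs."""
        none_occur = 1.0
        for p in probabilities:
            none_occur *= (1.0 - p)
        return 1.0 - none_occur

    def gate_and(*probabilities):
        """Probability all independent input events occur together."""
        result = 1.0
        for p in probabilities:
            result *= p
        return result

    p_pump_trip, p_valve_stuck = 0.01, 0.005
    p_operator_miss, p_alarm_fail = 0.1, 0.02

    # Top event: flow is lost AND the protection fails to respond.
    p_flow_lost = gate_or(p_pump_trip, p_valve_stuck)
    p_protection_fails = gate_and(p_operator_miss, p_alarm_fail)
    print(f"top event ~ {gate_and(p_flow_lost, p_protection_fails):.6f}")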

FMEA-
What: Failure mode and effect analysis (FMEA) is the study of potential failures that
might occur in any part of a system to determine the probable effect of each
failure on all other parts of the system and on probable operations success.
When criticality analysis is added for sophisticated studies, the method is
known as FMECA. In the automotive world, where FMEA is a required portion
of the quality systems, it is frequently known as PFMEA for potential failure
mode and effect analysis. The basic thrust of the analysis tool is to prevent
failures using a simple and cost-effective analysis that draws on the collective
information of the team to find problems and resolve them before they occur.
Why: The analysis is known as a bottom-up (inductive) approach to finding each
potential mode of failure and preventing failures that might occur for every
component of a system. It is also used for determining the probable effects of
each failure mode on system operation and, in turn, on probable operational
success, the results of which are ranked in order of seriousness. FMEA can be
performed from different viewpoints such as safety, mission success,
availability, repair costs, failure modes, reliability reputation, production
processes, follow-on service, and so forth.
When: The FMEA is most productive when performed during the design process to
eliminate potential failures. It can also be performed on existing systems
where operations personnel and maintainers are made team members to add
real-life experiences to educate the team in a problem-solving forum that is
constructive to eliminating existing problems.
Where: The analysis can be conducted in the design room or on the shop floor, and it
is an excellent tool for sharing experiences to make the team aware of details
that are known to one person but seldom shared with the team. It is also an
extremely productive tool for educating young engineers, young maintainers,
and young operators in details they should be aware can kill the system.

FRACAS Systems-
What: A failure reporting and corrective action system (FRACAS) is an organized
database for aiding in solving reliability problems using a common-sense
approach by systematically and permanently removing failure mechanisms.
Good historical data from this system can populate a Weibull database.
Why: Use data to solve problems by attacking root causes to reduce failures and
make reliability grow. Fixing failures requires data—not opinions—so use
the data acquisition system in a closed loop to record, analyze, correct, and
verify that improvements have been achieved. The first data reported is
usually a symptom of a failure; with a failure investigation, the symptom can
be converted into a root cause, which requires the system to be editable to
correctly report failures.
When: The maintenance repair order system usually generates evidence of a failure.
Failures with significant costs (repair costs + collateral damage + lost margin
from the failure + other appropriate business costs) must be investigated and
evaluated to reduce failures and to reduce failure costs. Little is to be gained
by spending big money to investigate trivial failures.
Where: This is an engineering tool requiring clerical effort to input the data and build
the Pareto distributions for identifying significant events requiring corrective
action and thus it also becomes a management tool for controlling costs.

HALT-
What: Highly accelerated life test (HALT) is an offspring of older environmental
stress screening (ESS) tests and is a testing process for ruggedization of pre-
production products by heavily stressing the product to identify failure modes
quickly and to verify weak links in the system.
Why: HALT tests are intended to quickly find failures and accelerate the
improvement program so that when products are delivered to end users, they
will be mature products by elimination of potential failure modes that would
normally generate a reliability growth program. Usually the HALT programs
reduce time, cost, and delays experienced in new products by recalls, warranty
costs, etc. HALT is similar to HASS but the stresses are more severe. In the
HALT process, design and process flaws are found, root causes identified, and
corrective actions implemented quickly.
When: HALT is used during the development program to get engineers to
acknowledge and correct fatal problems in designs by adding loads (generally
temperature, vibrations, pressures, physical stresses, etc.) by rapidly changing
the load conditions over and above normal operating loads.
Where: HALT is frequently used for electronic systems but also applicable to
mechanical systems where thermal shocks are used to validate designs for
extreme conditions of loads. The tests are performed in the laboratory for
engineering evaluation.

HASS-
What: Highly accelerated stress screen (HASS) uses the same stresses as HALT, but
at a lower stress level. Compared to HALT testing, temperature and voltage
extremes may be reduced by 10%-15%, vibration levels reduced 50%, etc.,
depending upon the design although all the stresses may be above rated
product specifications with the motivation to produce test results quickly for
verifying product compliance.
Why: HASS testing is used to verify product performance is on target and has not
shifted toward inferior performance in the manufacturing process. Note that
higher stresses often produce accelerated failures out of proportion to the
increased stress applied.
When: Products are periodically screened by HASS to verify no shifts have occurred
in the manufacturing process.
Where: HASS tests are performed as a quality assurance test in manufacturing
facilities to learn what you don’t know about each product as it is faster than a
simple burn-in test. If 100% of the finished goods do not receive HASS, as
when only a percentage of the product is screened by HASS, this is called a
highly accelerated stress audit (HASA).

Life Cycle Cost-


What: Life cycle costs (LCC) are all costs associated with the acquisition and
ownership of a system over its full life. The usual figure of merit is net
present value (NPV). Projects are considered most favorable for large
positive NPVs. However, for many individual cost cases, decisions are made
for the least negative NPV. In all cases, the default position for accounting is
to know the NPV for making no change, and this is usually the last alternative
for most people associated with change.
Why: The first cost for capital equipment (acquisition) is between ½ and 1/20 of the
total lifetime cost! The first cost, acquisition cost, is usually definable by a
firm quotation and sustaining costs must be estimated and put into the
appropriate time slots for discounting to obtain the NPV for the project life.
Typical values used in industry for LCC are: discount rate = 12%, tax rate =
38%, and project life is usually between 10 and 20 years.
When: Life cycle cost is usually calculated as an up-front decision-making effort,
either for projects or for cost-reduction efforts. It does not work well when
the analysis is done after the project is underway.
Where: LCC is the business of investing money to make changes occur. The NPV
values add the voice of investments to technical decisions to work for the
lowest long-term cost of ownership.
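A minimal NPV sketch in Python using the typical 12% discount rate quoted above
(the cash flows are hypothetical):

    def npv(rate, cash_flows):
        """cash_flows[0] is year 0 (acquisition); later years follow."""
        return sum(cf / (1.0 + rate) ** year
                   for year, cf in enumerate(cash_flows))

    discount_rate = 0.12
    acquisition = -100_000.0
    annual_sustaining = -15_000.0   # maintenance, energy, failures, etc.
    project_life_years = 10

    flows = [acquisition] + [annual_sustaining] * project_life_years
    print(f"NPV of ownership = ${npv(discount_rate, flows):,.0f}")
    # Compare alternatives and prefer the largest (least negative) NPV.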

Life Units-
What: A measure of use duration applicable to an item. For example, the life units
may be starts-stops, run hours, hot-cold cycles, distances traveled, emergency
starts or starts, shelf life, and other measurements that motivate failures.
Why: Life is consumed by usage of life units. Some life units occur as a sum of the
different cases, for example on a gas-turbine aircraft engine, take-offs
consume more life than landings or enroute conditions which requires a
synthetic value for how life is consumed on a mission. For a land-based,
heavy-duty gas turbine used in the generation of electrical power the number
of starts is not equivalent to hours of operation as other wear mechanisms are
involved; however, 1 trip cycle = 8 normal shutdown cycles and thus
decreases the time between required maintenance actions.
When: Development of a life-consuming profile may be more important than the
literal measurement of an elapsed time to adequately measure consumption of
life that in the end will result in a failure.
Where: Life units have different measures and must be considered to obtain the proper
“common denominator” for calculations.

Load-Strength Interactions-
What: For reliability successes, loads must always be less than strengths. When
loads are greater than strengths, failures occur. The issue is determining the
probability of load-strength interference, which is a joint probability of when
loads exceed strengths. The loads should include expected conditions plus the
foolishness of people to violate rules and overload equipment, plus the
vagaries of Mother Nature to impose unexpected static and dynamic loads
from hurricanes, tornadoes, earthquakes, wildfires, and so forth.
Why: Neither loads nor strengths are unmovable point estimates, although most
designers use point values. Failures occur and reliability terminates when
loads exceed strengths.
When: Loads usually increase over time (e.g., airplanes, like people, gain weight
over time from accumulation of dirt and extra equipment), and strengths
usually decrease over time (small fatigue cracks appear with many cycles and
load-bearing strengths decline).
Where: Bridges have finite lives because of load-strength interactions, wings break off
of airplanes from fatigue, etc. A few failures are dramatic but most failures
sneak up from the unknown in a variety of ways to cause loss of reliability.
To prevent loss of the system requires many physical inspections to learn what
you don’t know!
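A Monte Carlo sketch of the interference probability, assuming normally distributed
load and strength with hypothetical values:

    import random

    random.seed(42)
    trials = 100_000
    failures = 0
    for _ in range(trials):
        load = random.gauss(60.0, 8.0)       # applied stress, e.g., MPa
        strength = random.gauss(90.0, 10.0)  # component strength, MPa
        if load > strength:
            failures += 1

    print(f"P(load > strength) ~ {failures / trials:.4f}")
    # For two normals the exact interference probability here is ~0.0096.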

Lognormal-
What: Lognormal distributions are continuous life functions that have long tails to
the right (display positive skewness) in time or usage. A lognormal
distribution plotted on semi-log papers would appear as a normal curve.
Why: The lognormal distribution is a common competitor to the Weibull
distribution for life data. It is also adequate for 85%-95% of all repair times.
When: Lognormal distributions are motivated by multiplicative (or proportional)
events that grow with time, like crack growth, molecular diffusion, and some
wearout problems.
Where: In the days when plots had to be made by hand, it was the first widely used
transform to convert plotted data into straight lines. Today it is simply one of
an arsenal of probability tools used to obtain good curve fits to data with
multiplicative type events.

Maintainability-
What: The measure of the ability of an item to be retained in or restored to a
specified condition when maintenance is performed by personnel having
specified skill levels, using prescribed procedures and resources.
Why: Maintainability measures the percent of maintenance jobs completed to a
standard time for the repair, with repair times for the task usually plotted on a
lognormal probability plot.
When: First you set a standard repair time for the task, second you set a skills level,
third you measure how you’re doing against the standard.
Where: Applies to major tasks where many repetitions are expected and where
considerable time is required.

Maintenance-
What: All actions necessary, both technical and administrative, for retaining an item
in or restoring it to a specified condition so it can perform a required function.
The actions include servicing, repair, modification, overhaul, inspection,
reclamation, and restored condition determination.
Why: Equipment deteriorates because of entropy changes, because of errors both
overt and covert, and because of the use of incorrect procedures.
When: Maintenance is generally routine and recurring.
Where: The effort includes fault location, diagnosis, repair, test, adjustment,
replacement, administration, and overhauls wherever equipment is located.

Maintenance Engineering-
What: A tactical job for rapidly repairing equipment to operable conditions by
studying operating and repair manuals. Acquires failure data and prepares
maintenance plans for restoring equipment to operable condition in a minimum
amount of time. Prepares general diagrams, charts, drawings, and spare parts
requirements for maintenance planners. Makes recommendations for
improving the repair cycle. Provides manning-level forecasts for supervisors
and estimates the duration of outages. Determines the cost advantages of
alternatives for developing action plans to comply with internal/external
customer demands for timely repairs of processes/equipment. The purpose of
these activities is to restore equipment to service in a timely manner.
Why: Facilitates speedy repairs by providing maintenance technology above the
craftsman level and up to, but not including, reliability engineering principles.
When: Provides expertise for more complicated maintenance tasks or when
organization and oversight is required and time is of the essence for fast
repairs.
Where: Provides on-site expertise to aid craftsmen to solve non-standard repairs
without hands-on tool contact. Maintenance engineers serve as liaisons with
reliability engineers.

Management’s Role For Reliability-


What: Management must display leadership for setting a course for reliability under
their watch. Too little reliability results in many breakdowns, high
maintenance costs, missed production schedules, and unhappy customers.
Too much reliability results in high equipment cost, complicated and
expensive redundancies, excessive procedures, and excessive operating costs
along with happy customers for product delivery but unhappy customers
because of high cost products. You’ve got to get it right for your particular
situation. No 4th quartile producer has demonstrated high reliability
production systems. Many 1st and 2nd quartile producers have demonstrated
high reliability production systems.
Why: Management gets what management wants. Management must say what they
want and want what they say. Management must be consistent. Their talk
must match their walk to achieve failure free processes which take into
account the cost of unreliability throughout the entire system. Management
usually expresses their overriding desires and philosophy with policy
statements as a method of widely communicating intent to the workforce and
making the direction a part of the organization's culture. Management cannot
merely espouse a reliability culture while talking only about fixing things faster or
grumbling about maintenance costs; they must work to correct the root
of the failures and develop a culture of failure prevention.
When: Management can adopt the reliability culture role at any time. The program
has to be sold to the organization—telling won’t implement an initiative for
reliability. As a working example, follow the methodology used for
implementing strategies and policies for safety, quality, and environment as
role models.
Where: Management’s role for reliability starts at the top as a strategy issue. It cannot
begin at the bottom of the organization.
Return to top

Mean Time-
What: A figure-of-merit metric often referred to as the average or expected
value. In the simplest form it appears as the arithmetic Σ(time) / Σ(events); in
complicated situations it is a statistical metric. It applies to mean life (ML),
mean down time (MDT), mean maintenance time (MMT), mean time
between failures (MTBF for repairable items), mean time to failures (MTTF
for replacement items), mean time between maintenance (MTBM), mean time
between maintenance scheduled (MTBMs), mean maintenance time
unscheduled (MMTu), mean maintenance time scheduled (MMTs), mean
time between overhauls (MTBO), mean time between unscheduled
removals (MTBRu), mean time to restore (MTR), mean time between
downing events (MTBDE), and so forth. The units will be time/metric, e.g.,
hours/failure. The reciprocal of the metric provides an incident rate, e.g.,
failures/hour.
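As a minimal Python sketch of the arithmetic (the hours below are made-up illustration values, not data from this page):

    # MTBF from operating history and its reciprocal failure rate.
    operating_hours = [2200.0, 1850.0, 2400.0, 1975.0]  # uptime between failures
    failures = len(operating_hours)

    mtbf = sum(operating_hours) / failures  # hours/failure
    failure_rate = 1.0 / mtbf               # failures/hour

    print(f"MTBF = {mtbf:.1f} hours/failure")
    print(f"Failure rate = {failure_rate:.6f} failures/hour")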
Why: The metric provides an awareness factor for deciding central tendency
numbers and for the expected number of events that will occur into the future
based on historical situations. The arithmetic simplicity of mean time is a
reason to establish the metric and listen to the information derived from it to
gain insight. The arithmetic provides immediate answers to categorize facts
for starting continuous improvement rather than postponing a metric while
searching for delayed perfection!
When: The metrics are used as criteria of performance, and variations from the central
tendency numbers are expected; however, over the long term the variations
must be controlled to prevent distortion of the measurement.
Where: The metrics are used from the shop floor to the management levels as criteria
for “How are we doing?”.
Return to top

Mechanical Components Interaction-


What: Mechanical components suffer from interactions and degradations: overloads,
strength deterioration, wear, corrosion, process variations during fabrication,
effects of special processes (where the procedures must be controlled because
verifying the end result would destroy the component), and removal of safety
factors by increasing loads.
Why: The naïve expectation is that, individually, the impact of a single insult will
not destroy reliability of the component. However, you frequently have
multiple insults occurring, which results in failures that are not predicted up
front but can be perfectly explained after the components have failed.
When: The multiple destructive events are more predominant in complex devices and
highly stressed devices which too often have small safety factors that cannot
cope with the overload conditions and thus failures occur.
Where: The foolishness of humans adds further insults to the interactions of many
different failure mechanisms which demands many more maintenance
interventions and frequent inspections. Of course the solution to many of
these cases where failures occur is to increase safety factors by adding extra
material (when possible), but this adds extra weight and extra costs.
Return to top

Monte Carlo Simulation-


What: Monte Carlo simulation (modeling) is a method of solving engineering problems
by random sampling. The method applies to such things as system reliability
and availability modeling by simulating random processes such as life-to-
failure and repair times.
Why: The technique is used when: 1) many variables are present and their
interrelationships are unclear, 2) the system can't be analyzed by direct and
formal methods, 3) building analytical models would be time consuming,
complex, and just too hard, 4) you cannot do direct experiments, 5) the
input details, such as equipment life and repair times, are not discrete but
vary according to a distribution, and 6) you need to do some
tweaking of the system to understand where opportunities lie for improving
uptime, reliability, and costs.
When: Build models before you commit systems to bricks and mortar so you know
their performance on paper. Revise the models after they are in operation to
help improve the unknown weaknesses and improve costs for future cases.
Where: Monte Carlo models are used for gaining insight about how things work;
data collection from the model occurs at an accelerated rate compared to real
life.
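As a minimal sketch of the idea, a Python simulation of availability, assuming (purely for illustration) exponential lives and lognormal repairs with hypothetical parameter values:

    import random

    random.seed(42)
    MTBF_HOURS = 2000.0                  # assumed mean life (exponential)
    REPAIR_MU, REPAIR_SIGMA = 2.0, 0.8   # assumed lognormal repair parameters
    MISSION_HOURS = 87_600.0             # ten years of simulated operation
    TRIALS = 1000

    results = []
    for _ in range(TRIALS):
        clock = uptime = 0.0
        while clock < MISSION_HOURS:
            life = random.expovariate(1.0 / MTBF_HOURS)              # hours to failure
            repair = random.lognormvariate(REPAIR_MU, REPAIR_SIGMA)  # repair hours
            uptime += min(life, MISSION_HOURS - clock)               # credit uptime only
            clock += life + repair
        results.append(uptime / MISSION_HOURS)

    print(f"Simulated availability = {sum(results) / TRIALS:.4f}")

Each trial walks one failure/repair history through the mission; averaging many trials estimates the availability that the closed-form arithmetic cannot easily supply when the inputs are distributions.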
Return to top

Normal Distribution-
What: A fundamental frequency distribution that produces a symmetrical bell-shaped
diagram based on the Gaussian distribution to form a normal law of errors.
Why: The distribution is easily described with two statistics, the mean (X-bar, which
is a location parameter) and the standard deviation (sigma, which is a scale
parameter carrying the units of the location parameter), as these are parameters
of the population.
When: The distribution is widely used for quality issues where errors are frequently
symmetrically distributed and for a few cases of reliability problems where
life data is also symmetrically distributed. For symmetrical life data, the
normal data makes a good Weibull plot, whereas Weibull data usually makes
a poor normal plot—thus, Weibull plots have almost displaced normal plots
for reliability data.
Where: The distribution is used where the statistics simplify descriptions of the
distribution, so it is easy to describe and explain.
Return to top

OEE-
What: Overall equipment effectiveness (OEE) is a manufacturing index to reduce
complexity of discrete systems for problem solving and benchmarking. In
many ways, it is a subset of effectiveness.
OEE=availability*performance*quality where availability = (operating
time)/(planned production time), performance = (ideal cycle time)/(operating
time/total pieces), and quality = (good pieces)/(total pieces); and OEE is best
suited to discrete manufacturing. The index is larger than the effectiveness
index and allows for acceptance of downtime without having a hard measure for
utilization losses in the capability (although it does have a performance index
which takes elements from both efficiency and utilization), and it accepts
planned downtime as OK in the availability index. The effectiveness index
looks at the system from the perspective of the investor, whereas OEE looks at
the system from the perspective of operations management, which excuses
many losses such as planned outages, etc., and has the propensity for the
indices to be “Enronized” so they look good when, in fact, from the investor's
viewpoint the results are not good, a violation of the principle of Esse
Quam Videri (to be, rather than to seem).
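A minimal Python sketch of the OEE arithmetic above (one shift of made-up numbers, not data from this page):

    planned_production_time = 480.0  # minutes in the shift
    operating_time = 420.0           # minutes after downtime is removed
    ideal_cycle_time = 0.5           # minutes per piece at nameplate speed
    total_pieces = 700
    good_pieces = 665

    availability = operating_time / planned_production_time
    performance = ideal_cycle_time / (operating_time / total_pieces)
    quality = good_pieces / total_pieces
    oee = availability * performance * quality

    print(f"Availability={availability:.3f}  Performance={performance:.3f}  "
          f"Quality={quality:.3f}  OEE={oee:.3f}")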
Why: It’s a simple and easy-to-use index for the big-picture summary of
performance in industry and it can be benchmarked against similar industries.
When: Use for a quick assessment and approximation of the effectiveness equation.
Where: Widely used for a first cut at improving manufacturing operations in lieu of
the more stringent and complete effectiveness equation.
Return to top

Pareto Distribution-
What: Vilfredo Pareto was an Italian economist in the late 1800s who described the
unequal distribution of wealth in the world. The concept was improved and
brought to the factory floor by Joseph M. Juran (December 24, 1904-February
28, 2008) for manufacturing operations. Juran said it was a methodology for
separating the vital few problems from the trivial many problems. The Pareto
principle, as explained by Juran, when applied to quality issues said: It’s the
80-20 rule where 80% of the problems come from 20% of the causes and
management should concentrate on the 20% (the vital few causes). The same
concept works for money issues—you must separate the vital few issues from
the trivial many issues.
When the Pareto distribution is listed in order of money lost (including the
risk for money lost) it becomes a work priority for attacking business
problems that have the greatest impact on the enterprise. Winners in the
organization work on the vital few important items (the 20%), as they put their
reputations at stake, while the losers in the organization work on the trivial
many problems (the 80% of the problem list), which, if solved, would have
little financial impact on the enterprise.
The gear-head approach is to build the Pareto list based on numbers of
failures. This is usually not very productive. Would you really prefer to solve
90% of:
1) 1,000 failures that cost a total of $1,000, or
2) 1 failure that costs $1,000,000?
The gear-head approach says to go for the 1,000 small problems. However, the
business approach says to go for the big $ items in the list—in the end, it's all
about the money!
The business approach is to build the Pareto list based on the total amount of
money spent or at risk (maintenance costs + gross margin lost + rework costs
+ scrap costs + warranty costs + …, etc., to include all appropriate
business costs) rather than working on the trivial money and love affairs that
keep people busy but do not generate financial returns for the business.
The most important reliability tool is a Pareto distribution
based on $’s to set work priorities for attacking the vital few
problems as a method of separating important issues from the
trivial many issues.
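A minimal Python sketch of building the money-based Pareto list (problem names and dollar figures are hypothetical):

    problems = {
        "Compressor seal failures": 1_250_000,   # $/year at risk, made-up values
        "Pump bearing failures": 480_000,
        "Instrument air dryer trips": 310_000,
        "Conveyor belt mistracking": 45_000,
        "Lighting ballast replacements": 8_000,
    }

    total = sum(problems.values())
    cumulative = 0
    # Sort by annual dollars, largest first: the top of the list is the work priority.
    for name, cost in sorted(problems.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += cost
        print(f"{name:<32} ${cost:>12,}   cumulative {cumulative / total:6.1%}")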
Why: The Pareto distribution, based on $’s, sets work priorities, and assuming a
one-year payback period, describes how much money can be spent to resolve
the issues. Most reliability engineers need to be working on the top 5 or 6
items, based on $’s, all the time as data and solutions are developed slowly
and the key items always need to be on the mind for active consideration. The
mentality is to think like a bank robber—go for where the big money is
located and get it back—and get it back fast.
When: At least quarterly reviews of the Pareto distribution are important for
accountability of who has solved what problems and to define what new
targets have come over the horizon that require immediate attention.
Where: Pareto distributions are used throughout the organization to keep attention on
the vital few $ issues. They are highly favored by management when
engineers employ Pareto distributions based on money. Pareto distributions
help set work priorities and avoid focusing on love affairs with equipment or
process, which often occurs to the detriment of the business. Pareto
distributions explain why some work orders always get maintenance priority
while other tasks are relegated to the category of “whenever we get time to
solve the problem.”
Return to top

Poisson Distribution-
What: The Poisson is a discrete distribution and the simplest stochastic
process: Poisson events occur randomly in time at a stable
average rate of occurrence of counted events. The Poisson is frequently used
as a first approximation to describe failures expected with time. The
calculations are driven by an average value, e.g., failures/year, defects/meter²,
hurricanes/year, etc. Answers from the Poisson will come as probabilities for
1 failure, 2 failures, etc., or the probability for 1 hurricane in a year or 2
hurricanes in a year, etc. The average value is obtained from a constant*time-
interval that is usually explained as λ*t. Frequently charts are used to obtain
solutions to the Poisson equation such as the Thorndike Chart from Bell Labs
or the Abernethy-Weber chart from The New Weibull Handbook. The
equation is often described in two formats: 1) probability = (np)^r · e^(−np) / r!,
where n = number of trials, r = number of occurrences, and p = probability of an
occurrence, or 2) probability = Z^C · e^(−Z) / C!, where Z = expected number of
events (i.e., the mean) and C = the count of events whose probability is sought.
Of course, for the two formats np = Z and r = C. When n is large and p (or 1−p) is small,
the Poisson is an excellent approximation to the binomial distribution.
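A minimal Python sketch of the second format above (the average Z is a made-up value):

    from math import exp, factorial

    def poisson_probability(c, z):
        """Probability of exactly c counted events when z are expected: Z^C e^-Z / C!"""
        return z**c * exp(-z) / factorial(c)

    z = 2.0  # e.g., an assumed average of 2 failures/year
    for c in range(5):
        print(f"P({c} failures in a year) = {poisson_probability(c, z):.4f}")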
Why: Simplicity is the major reason for use of the Poisson distribution.
When: Use the Poisson when an answer is needed quickly and the answer deals with
counting terms.
Where: When you know the average number of events the Poisson is easy to use to
find the probability of 1, 2, 3,…events occurring.
Return to top

Probability Plots-
What: Probability plots make sense of the chaos of failure data on an X-Y plot. Each
type of plot is divided differently on the X and Y axes based on the
fundamental mathematics for a given distribution. The decision on which
type of graph paper to use is based on: 1) a simple pragmatic approach (use
the one that gives the best curve fit to the data), and 2) the physics of failure or
the mechanism driving the data for non-failures. For reliability data, 85% to
95% of the data will adequately fit a Weibull distribution. For repair data,
85% to 95% of the data will adequately fit a lognormal distribution. Often
Weibull plots or lognormal plots compete as to which distribution best fits the
failure data.
Why: The acquired data is plotted in rank order on the X-axis of a probability plot,
in the units acquired. The Y-axis position in most cases
is determined using Benard's median rank approximation to provide the
probability percentage. The result is often a straight line on the properly
divided X-Y graph paper. Please note, over the years many different plotting
positions have been tried, with Benard's plotting position being the strongest
survivor for tailed (i.e., non-normal) data.
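A minimal Python sketch of Benard's median rank plotting positions (the ages-to-failure are made-up hours):

    def benard_median_ranks(n):
        """Y-axis plotting positions for n ordered failures: (i - 0.3)/(n + 0.4)."""
        return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

    ages = sorted([410.0, 960.0, 1320.0, 1810.0, 2550.0])  # hours, in rank order
    for age, rank in zip(ages, benard_median_ranks(len(ages))):
        print(f"{age:7.0f} hours -> {rank:.1%} cumulative probability")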
When: Use when you have failure data or repair data. They work best when age-
failure plots are made by individual failure modes or individual repair modes.
They also will handle high-level failure data and repair times where the data
represent how the system is behaving.
Where: Use probability plots to get complicated data summarized onto one side of one
sheet of paper. When the plots have the cumulative distribution plotted on the
Y-axis, it tells what percent of the population will have a life (or repair time)
less than the corresponding X-value.
Return to top

Process Reliability-
What: Reliability of a production process is defined as the point, in percentage terms,
where output consistency is lost, as determined by a Weibull plot of daily
production output.
Reliability losses are the sum of production gaps between what should have
been demonstrated (the demonstrated production line) and what was actually
achieved—these are losses due to special causes. Special cause losses occur
from things you can put your finger on and can be solved by process
engineers, maintenance engineers, and reliability engineers.
Nameplate lines (or entitlement line) define the possible daily output.
Nameplate lines lie to the right of the demonstrated production line on a
Weibull probability plot. The gap between the nameplate line and the
demonstrated line quantifies efficiency/utilization losses—these are losses
due to common causes. Common cause losses result from subtle problems
without major identifiers, generally accepted as “that's the way things are”
unless targeted for elimination by six-sigma black belts and management.
In many production facilities, this category is a major source of losses,
greater than all availability/reliability/maintainability losses combined.
The sum of the reliability losses plus efficiency/utilization losses constitutes a
hidden factory measured in output quantities.
Production effectiveness = (annual output)/(annual output + hidden
factory losses). These details are shown graphically on a Weibull probability
plot. Contrast the production effectiveness calculation (obtained in minutes)
to the effectiveness equation (obtained in hours/weeks).
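A minimal Python sketch of the production effectiveness arithmetic above (the output and loss quantities are made-up tons/year):

    annual_output = 310_000.0                 # tons/year actually produced
    reliability_losses = 22_000.0             # special-cause gaps vs. demonstrated line
    efficiency_utilization_losses = 38_000.0  # common-cause gap vs. nameplate line

    hidden_factory = reliability_losses + efficiency_utilization_losses
    production_effectiveness = annual_output / (annual_output + hidden_factory)

    print(f"Hidden factory = {hidden_factory:,.0f} tons/year")
    print(f"Production effectiveness = {production_effectiveness:.1%}")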
Why: You must see the losses on a Weibull probability plot to believe they exist.
Use the graphics to sell an improvement program based on diagnosis of the
problem and where to attack. The technique provides both visual and
qualitative results. The analysis goes onto one side of one sheet of paper.
This is a simple tool used for strong results in a creative and problem solving
organization. Reliability values and the slope of the demonstrated line (beta)
are benchmarkable. Process reliability techniques measure system
performance, in production output quantities, and produce a single
production effectiveness index in percentage terms which is similar to the
effectiveness equation.
When: Works well on daily production data accumulated over a period of time in
order to see the patterns of performance.
Where: Useful for any production facility including electrical power generation,
chemical plants (both batch and continuous process), refineries,
pharmaceuticals, semiconductors, packaging facilities, and other complicated
production facilities where achieving a simple index of “how are we doing” is
difficult to achieve. For more details and articles, see hyperlinks at the bottom
of the page: http://www.barringer1.com/prtraining.htm.
Return to top

Quality Function Deployment-


What: QFD is a bad translation of a good reliability technique for getting the voice of
the customer into the design process so the product delivered is the product
the customer desires. In particular, it is applicable to soft issues that are
difficult to specify.
Why: The method helps pinpoint: 1) what to do, 2) the best ways to accomplish the
objective, 3) the best order for achieving the design objectives, and 4) the
staffing/assets required to complete the task.
When: QFD is a major up-front effort (as is the case with most Japanese techniques)
to learn and understand the customer’s requirements and the approach that
will satisfy their objectives.
Where: The methodology is used as a team approach to solving problems and
satisfying customers, beginning with a listing of customer requirements,
converting customer requirements into engineering characteristics (the house
of quality), converting engineering characteristics into parts characteristics
(the house of parts deployment), converting parts characteristics into process
characteristics (the house of process planning), and finally, converting the
process characteristics into production characteristics (the house of production
planning). As with all Japanese techniques, the up-front costs are high and
many clever graphical tools exist for transferring information with the
intention of decreasing costs downstream while satisfying customer’s needs.
Return to top

Reliability-
What: Reliability is the probability that a device, system, or process will perform its
prescribed duty without failure for a given time when operated correctly in a
specified environment. This means reliability is concerned with the
probability of future failures, predicted from what has been observed in the
past.
Why: Reliability has two broad ranges of meanings:
1) qualitatively—operating without failure for long periods of time just as the
advertisements for sale suggest, and
2) quantitatively—where life is predictable, long, and measurable in tests to
assure satisfactory field conditions are achieved to meet customer
requirements.
Reliability is concerned with failure-free operation for periods of time,
whereas quality is concerned with avoiding non-conformances at a specified
time prior to shipment; thus, reliability measures a dynamic situation but
quality measures a static situation. As in physics, statics (not time dependent
as with quality issues) is easier to understand and calculate than dynamics
(time dependent as with reliability issues), which involves higher levels of
math and greater mental capabilities for comprehension.
When: Reliability is expected for new equipment to start, run, and continue to
function for long periods of time without failure. Reliability is also expected
when the equipment is dormant and called to duty. Reliability is also
expected upon service or restoration and resumption of long life. Reliability
is designed into the system by up-front activities, and reliability is sustained
by careful operation of the system along with careful nurturing of the system
with sustaining maintenance activities. Reliability always terminates in a
failure and the roots of failure can be due to design, fabrication, installation,
operation, maintenance (repair and periodic servicing), and management of
the system—in short, there are many ways and means to kill the system but
few ways to keep it operating without failure.
Where: The adage says “the proof of the pudding is in the eating,” and, for reliability,
the proof of the system is in the long failure-free interval. Reliability tools are
used from stem to stern to demonstrate high reliability (the absence of failures
for long periods of time) by use of many tools such as:
reliability acceptance test to demonstrate long life;
reliability analysis to compute the expected results;
reliability and maintainability modeling, the mathematical tasks that predict the
expected results from the elements;
reliability apportionment to allocate life issues in a top-down manner to
meet an overall reliability goal;
reliability assessment determines the achieved level of reliability of an
existing system using data gathered during test or use;
reliability assurance implements planned management and technical
measures to provide confidence that a reliability target is obtained and
maintained;
reliability block diagrams to graphically and mathematically calculate
reliability results prior to building a system;
reliability-centered maintenance is the systematic approach to identify
preventive support and service according to a set of procedures to reduce and
avoid failures;
reliability confidence limits demonstrate the limits for reliability within a
given confidence limit;
reliability control is the coordination and direction of system dependability
through design activities and management planning;
reliability critical item identification whereby failure significantly affects
system safety/cost or operational success or maintenance/logistics support
costs;
reliability data is the basic age-to-failure data as life unit information relating
to the time-to-failure when organized by probability distributions;
reliability degradation which incurs loss of the failure-free performance due
to poor workmanship or bad parts or improper operation or abuse or
inadequate maintenance;
reliability design practices are a series of trade-off-tools to meet or beat the
design specification for reliability;
reliability development/growth tests are the evaluations to disclose
deficiencies and verify corrective actions to prevent reoccurrence of the
failures to achieve the design specifications and sustain reliability growth
toward longer times between failure;
reliability estimates are life values used prior to statistical experimentation
with the end products to make predictions, or assessments, or stress analysis
evaluations;
reliability function is the graphical representation of life characteristics
plotted against operating time;
reliability growth achievement is the systematic improvement of an
item's/system's dependability by removing failure mechanisms through
corrective actions to eliminate deficiencies and flaws, often achieved by means
of test-analyze-and-fix;
reliability growth models (Crow-AMSAA) measures the reliability growth
by means of log-log plots of cumulative failures on the Y-axis and cumulative
time on the X-axis to demonstrate with statistics that failures are coming more
slowly and reliability goals have been achieved;
reliability guarantee is the commitment by suppliers to provide a given
mean time between replacements or given maintenance and overhaul intervals
for equipment;
reliability improvement is the identification of failure modes and effects
having a critical impact on the system failure potential of the design along
with the systematic removal of the failures to produce long life without
failures;
reliability index is the ratio of the mean reliability level achieved to the
acceptable level specified in the design as a figure of merit;
reliability measurement is failure-free endurance assessment activity for
making decisions about reliability and demonstrating compliance;
reliability mission is the mission time for demonstrating failure-free
performance;
reliability prediction is the process of quantitatively assessing whether a
proposed or existing design meets a specified life requirement;
reliability prediction functions estimate the life characteristics for setting
goals and evaluating the design benchmarks and needs;
reliability prediction limitations describes the shortcomings in life values by
analytical methods;
reliability prediction requirements describes life assumptions,
environmental data, and failure rates for the design;
reliability prediction summary is a report providing conclusions and
recommendations based upon a reliability assessment analysis;
reliability program comprises the activities to organize and achieve a system to
ensure reliability goals are met and deficient areas are shored up;
reliability program plan is the formal written definition of the specific tasks
to fulfill the reliability requirements;
reliability qualification test (RQT) is an evaluation conducted under
specified conditions using items representative of the approved product
configuration;
reliability quantitative elements are the life characteristics and factors
considered in predicting and measuring reliability performance;
reliability requirements are the numerical values representing a specified
failure-free life or dependability performance characteristic;
reliability sequential tests are evaluations of the number of failures and the
time required to reach a decision based on the accumulated results of the
reliability tests;
reliability tasks describe the activities required to achieve a reliability
program;
reliability tests are the formal evaluation to determine a product’s longevity
for the failure-free interval or stability relative to time/usage;
and finally,
reliability with repair is the failure-free performance achieved by
redundancy with permitted online repairs without interrupting equipment
operation.
Return to top

Reliability Audits-
What: Reliability audits verify your reliability program is effective and find areas of
weakness for corrective action. They are inquiries by factual examination of
elements of the system with written objective criteria for performance,
beginning with an assessment of how management is involved and whether they
are effective in building a productive reliability program.
Why: Most organizations know where they are strong. On an objective basis, few
organizations know where they are weak. Reliability audits are fact-finding
exercises, similar to financial and quality audits, to ferret out weaknesses for
corrective action. The questions to be answered are:
1) How well are you doing what you promised against your reliability policy?
2) How well is upper management doing against company objectives for
reliability?
3) How well are reliability plans, systems, and procedures working?
4) How well are plans, systems, and procedures being executed against the
policy?
5) How well are productive efforts for reliability working toward achieving
the goals?
6) How well has the reliability system been communicated to employees and
are they committed to understanding and implementing the improvements?
and
7) Are financial objectives being met as a result of ongoing reliability
improvements? (Which is the main objective of the audit—not just a rigid
procedural/bureaucratic compliance to details).
When: Detailed audits should occur annually, with a follow-up six months later to
ensure that corrective action has been implemented. Without a six-month
deadline, few tasks will be completed because of procrastination.
Where: Audits are needed for 1) reliability system management, 2) new techniques,
technology, developments, and controls, 3) supplier control (internal and
external), 4) process operation and control, 5) reliability data programs, 6)
problem-solving techniques, 7) control of reliability measurements, 8) human
resources involvement, 9) customer satisfaction assessment (internal/external),
and 10) software reliability (excluding Microsoft products used in the office
environment).
Return to top

Reliability Block Diagrams-


What: Reliability block diagram (RBD) models are graphical representations of a
calculation methodology for reliability systems.
Why: The RBD models allow calculation of system reliability based on
knowing/assuming failure details of the components, starting with the least
component and growing the model to the greatest system to predict
performance from the elements.
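A minimal Python sketch of series/parallel RBD arithmetic (the block reliabilities are hypothetical mission values):

    def series(*blocks):
        """All blocks must survive for the system to survive."""
        product = 1.0
        for r in blocks:
            product *= r
        return product

    def parallel(*blocks):
        """The path fails only if every redundant block fails."""
        q = 1.0
        for r in blocks:
            q *= (1.0 - r)
        return 1.0 - q

    pump_train = parallel(0.90, 0.90)         # two redundant pumps
    system = series(0.98, pump_train, 0.95)   # feed -> pump train -> reactor
    print(f"System mission reliability = {system:.4f}")  # 0.9217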
When: RBDs are used in upfront designs as a performance parameter and after the
system is constructed to ferret out poorly performing blocks that limit the
system performance.
Where: Frequently used as a trade-off tool to search for the lowest long-term cost of
ownership and to help sell alternative courses of action for moderating the
effects of reliability issues or overcoming poor performance with alternative
designs. The results can be calculated before building the system, and the
calculations provide knowledge about availability, maintenance
interventions required for failures, and the number of spare parts required to
sustain operations. For other definitions see MIL-HDBK-338, sections 4 and
6.
Return to top

Reliability-Centered Maintenance-
What: Reliability-centered maintenance (RCM) is a systematic planning process
used to determine the maintenance requirements for a system. RCM expects
the system has an inherent reliability and maintenance requirements are
imposed upon the baseline of inherent safety and inherent reliability designed
into the system (the design sets the standard, it can be high, medium, or low).
Why: RCM does what is required to make sure the systems continue to do what the
users want done. If an excellent maintenance program demonstrates that the
expected reliability is lacking, then the system must be improved by design
changes to the physical assets or to the manner in which the assets are used.
When: RCM requires a cultural change in both management and employees to “do
maintenance by the numbers”. This requires discipline in the organization to
perform the FMEAs that drive the work process for maintenance and it also
requires defining functional failures.
Where: RCM works better in top-quartile manufacturers who have a disciplined work
force and are interested in achieving excellence in 1) safety, 2) operability, 3)
reduced maintenance downtime by a disciplined approach to the maintenance
activities, 4) high uptimes, and 5) a reduction in failures. Lacking one or
more of the five efforts at excellence generally results in a failed RCM
program.
Return to top

Reliability Engineering-
What: A strategic job for preparing plans to reduce the failures and the cost of
failures as a preventative measure to reduce the cost of unreliability. Acquires
failure data and analyzes the data to quantify the financial impact and prepare
long-term solutions to prevent reoccurrences to improve reliability and
uptime. Determines the cost advantages and proposes alternatives for solving
the problem and recommends the alternative with the lowest long-term cost of
ownership. The purpose of these actions is to prevent failures.
Why: Prevents future failures by working on medium- and long-term projects using
technology to solve the problems. As required, provides technical assistance
to maintenance engineers to aid their efforts for quickly restoring equipment
to service.
When: Provides expertise for avoiding failures by means of a technical solution to
reduce the high-cost reliability problems on the Pareto distribution.
Where: Provides technical support and solutions for management on longer range
problems, and as required, supplies technical assistance to maintenance
engineers for immediate and difficult restoration projects as a liaison effort.
Supports task improvements to accomplish longer term objectives (think
months and quarters), which will result in smoother operations, at lower costs,
without failures.
Return to top

Reliability Growth Models-


What: Reliability growth models are important management concepts for making
reliability visual with simple displays. The simple log-log plots of cumulative
failures on the Y-axis against cumulative time on the X-axis often make
straight lines where the slope of the trend line is highly significant for telling
if failures are coming faster (β>1), which is undesirable, slower (β<1), which
is desirable, or without improvement/deterioration (β=1), which usually drifts
toward undesirable results. The reliability growth models are frequently
called Crow-AMSAA plots in honor of Larry Crow's proof of why the charts
work, as described in MIL-HDBK-189, from his time at AMSAA.
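A minimal Python sketch of reading the slope from the log-log data (the cumulative failure times are made-up hours):

    import math

    cum_hours = [120.0, 400.0, 950.0, 1800.0, 3100.0, 5000.0]  # hours at each failure
    cum_failures = range(1, len(cum_hours) + 1)

    x = [math.log(t) for t in cum_hours]
    y = [math.log(c) for c in cum_failures]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n

    # Least-squares slope of log(cumulative failures) vs. log(cumulative time).
    beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
    trend = "faster (bad)" if beta > 1 else "more slowly (good)"
    print(f"beta = {beta:.2f} -> failures coming {trend}")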
Why: Both engineers and management must see reliability problems to fix them.
The simple log-log plots make the models visible. The task of the reliability
engineer is to put favorable cusps on the Crow-AMSAA trend lines to make
failures come more slowly and thus decrease the long-term cost of ownership.
If you’re doing your improvement job correctly, you’ll never have many
failures until you have a cusp.
When: The plots are useful for development tasks (where they first were used) or to
long-term operations. They work for safety programs, plant improvement
programs, environmental programs, or for cost problems. Use the plots as
“show me, don’t tell me,” how the projects are proceeding and the key metric
in the form of line slope is easy to understand and easy to communicate in less
than 60 seconds.
Where: They are used for technical development issues or for management reviews.
A picture is worth a thousand words for getting management’s attention for
focusing on a problem. Likewise the charts are highly useful for showing the
reductions in failures that have occurred from making a desirable and
permanent fix.
Return to top

Reliability Policies-
What: Management communicates with their staffs through important policy
statements. Management policies are general and relate to procedures and
rules which are specific for implementing policies. Written statements of
policy regarding reliability are decisive documents about avoiding system
failures in the same way that safety policies address the need for absence of
human injuries, quality policies address the need for absence of product
discrepancies, and environmental policies address the need for avoiding spills
and releases. Management also needs to issue a reliability policy statement,
which may read like this:
We will build an economical and failure-free process that will operate
for 5 years between planned outages.
This statement will clearly communicate that failures to the process (which is
the money machine) are to be abhorred and avoided!
Why: Process failures are clearly money issues because, when the process ceases to
run, the company has no income, thus process failures are to be abhorred for
killing the money machine.
When: Implementing a policy before constructions of new facilities is important to
use the policy as design criteria. When implemented with older facilities, the
task is more difficult and old facilities may never be able to comply with the
objectives at a reasonable cost alternative.
Where: Responsibility for implementing the policy lies with:
1) the chief operating officer must authorize the policy and ensure the policy
is applied throughout the operations under the administrative directive that
sets the guidelines for financial and engineering measures,
2) the engineering/R&D executives are responsible for ensuring the policy is
implemented by systems engineering, design engineering, project engineering,
pilot plant engineering, and test engineering,
3) the manufacturing executive is responsible for ensuring that the reliability
policy is carried out by the materials and procurement functions, industrial
engineering functions, manufacturing engineering functions, operations
functions, and maintenance functions,
4) the quality assurance executive is responsible for the dissemination of the
reliability policy, its annual review and auditing for compliance to the spirit of
the policy, and for making recommendations to the chief operating officer
concerning continued relevance, applicability, and effectiveness, and
5) the human resources executive is responsible for ensuring that all new
employees are indoctrinated into the purpose and implementation of the
reliability policy as a part of the operation’s mission, goals, and priorities.
Return to top

Reliability Testing-
What: Suppliers have two strategies for testing: 1) test for success and/or 2) test for
failures. Reliability testing produces failures, particularly when the tests are
accelerated with extra loads, and this may be troublesome to have in the
records for future lawsuits. Thus, it is often to everyone's advantage to
perform reliability tests under code names to protect against the broad rules of
legal discovery.
Why: The reliability tests will determine a product’s longevity and failure-free
performance. This requires data recording and data integrity. Plans must be
set for how the tests are to be conducted, loads to be handled, duration of the
tests, environmental conditions, operating modes, failure definitions, and
documentation for recording/analyzing the test data.
When: Reliability tests are usually run prior to release of the product for sale or after
the product has been released and troublesome failures appear in field
applications where no problems were expected.
Where: Laboratory tests are conducted in many cases, but in other cases the data may
simply come from field use. Note that the induced failures require extra
components, which must be expected and budgeted along with the extra costs for
data acquisition/analysis.
Return to top

Simultaneous Testing-
What: For inexpensive components and inexpensive tests, simultaneous tests involve
many components under test loads/conditions at the same time for the purpose
of quickly acquiring data and producing test analysis as the failures occur. In
simultaneous testing, the suspensions (censored data) become important
details for use in the statistical analysis. Most simultaneous tests are
accelerated to generate the data in a short period of time, although this carries
the risk of introducing unexpected failure modes (but this can also be useful
information for anticipating field failures).
Why: Conducting analysis of the early test results, when only a few failures have
occurred, will give precursors as to passing/failing the longer-term tests. If
the early test results look encouraging, the larger test may be allowed to run to
conclusion. However if early test results are disappointing, the test may be
abandoned without using all of the testing budget so that remedial action can
occur prior to completing the full-scale planned test.
When: This testing is usually conducted prior to release of products. However, a
similar watch may be set up for warranty repairs to anticipate the cost
and extra supplies required to cope with an unexpected failure that was not
forecast.
Where: This strategy is appropriate for inexpensive components in the test laboratory.
However, for warranty problems, the issues are very appropriate for expensive
components or assemblies.
Return to top

Software Reliability-
What: Software does not wear out but it does fail and most failures are due to
specification errors and code errors, with only a few errors in copying or use.
The only software repair is reprogramming, and adding safety factors is
almost impossible. Software reliability improves by finding errors and fixing
the errors but estimating the number of errors that cause failures is extremely
difficult as many branches of software code may lie dormant and unused until
special events occur to make the latent failures obvious. Software failures are
not often time related but are more software code page dependent. Software
reliability is improved by extensive testing to disclose the failures and then
fixing them to repeat the test all over again to validate the fix did not generate
more failures and to continue the search of other latent defects.
Why: More than 50% of the software bugs (failures) occur from specifications with
lesser amounts of failures from system design and the coding process. This is
due to the lack of visibility in the software process along with problems from
those specifying the requirements with problem roots in ambiguities,
inconsistencies, incomplete statements, and lack of logical requirements. This
requires that both inputs and outputs for software must be specified in greater
detail than for mechanical, electrical, or system data to avoid the errors and
conflicts.
When: “Clean room” software procedures are a technique for extracting details from
the customers so the programmers get the scope of the project and the
input/output correct as an up-front effort to reduce errors and wasted code.
Acquiring the data is tedious, and roughly 80% of the software budget is spent
getting the details “right” before programming commences.
Where: Disciplined software specialists carefully work the plan up-front to reduce
errors and testing time. Undisciplined, so-called “neo-experts” want to see
busyness in code writing up front, and thus their software reliability is worse
for lacking a firm foundation from which to work.
Return to top

Sudden Death Testing-


What: For expensive components and expensive tests, sudden death tests involve a
few components that tie-up a test frame as they are heavily loaded under the
same test loads/conditions with several items being run at the same time.
When one of the items fails, the entire test frame is shut down so that you
have 1 failure (this is the sudden death!) and several suspensions because the
unfailed units are survivors as the test is halted until the test frame is loaded
with new samples for resumption of the life test. Opening the test frame
(instead of tying up the frame until all samples have failed) is cost effective.
If three units are tested per frame and each frame is halted on its first
failure, then four frames (12 samples in all) will literally yield only 4 failures
and 8 suspensions for preparing the Weibull analysis. Will the 4-failure + 8-
suspension data set be different than if all 12 samples had been run to
failure? The answer is yes, they will be different; but will they be significantly
different? The answer is no. So, as with simultaneous testing, the
suspensions (censored data) become important details for use in the statistical
analysis. Most sudden death tests are accelerated to generate the data in a
short period of time although this carries the risk of introducing unexpected
failure modes (but this can also be useful information for anticipating field
failures).
Why: Sudden death testing is all about the economics and shorter elapsed time for
results.
When: Sudden death testing is used for product acceptance tests.
Where: It is a quick test for many products and the ongoing test for production lots.
Return to top

Total Productive Maintenance-


What: Total productive maintenance (TPM) is a corporate-wide effort involving all
employees to fully use equipment to the maximum limit employing an
equipment-oriented management concept to reduce failures and increase
utilization of equipment and processes in a productive manner. TPM
programs are teamwork programs and require a corporate culture of teamwork
devoid of us vs. them issues. All employees are expected to accept ownership
of the equipment and processes to do many small things all the time to ensure
high levels of availability by eliminating failures in the early stages with low-
cost actions. The employees approach the process equipment as owners rather
than renters.
Why: Maximizing equipment uptime at lower costs by all employees working to
reduce the many small incidents that lead to a failure.
When: Major maintenance tasks are handled by the craftsmen. Most small tasks are
handled by operators in a never-ending effort of cleaning, lubricating, and
tightening to find problems early when they can be solved simply instead of
letting the problem grow to a major issue.
Where: TPM is a system-wide effort of providing care to the equipment rather than
saying “it’s not my job,” and “We’ve got to fill out the paperwork before
‘they’ can do anything.” The technique makes good use of the 5 human
senses but technical details must be taught to the work force to understand
good from bad and when action must be taken along with what must be
done—this requires a sharing environment where the work team works for the
common good of higher performance. If the culture is me, me, me, TPM will
not work.
Return to top

Weibayes Estimates-
What: If you’ve got one piece of failure data and nothing else, you’re a poor person
without much hope. If you’ve got one piece of failure data and a Weibull
database, you’re a rich person with a map on the back of an envelope and a
compass by your side to get you out of the abysmal swamp of ignorance and
misunderstanding.
Why: The Weibayes technique uses your failure data and past experience to make
a Weibull analysis forecast of what you should expect in the future; in
many cases, given a worst-case/best-case hypothesis, a failure forecast can
be generated.
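A minimal Python sketch of a Weibayes eta estimate in the style of The New Weibull Handbook, assuming beta is already known from the Weibull database (the running times, beta, and the single observed failure are all hypothetical):

    def weibayes_eta(times, beta, failures):
        """Characteristic life from running times with an assumed beta:
        eta = (sum(t_i^beta) / r)^(1/beta)."""
        r = max(failures, 1)  # with zero failures, assume one as a conservative bound
        return (sum(t**beta for t in times) / r) ** (1.0 / beta)

    unit_hours = [1500.0, 1500.0, 1200.0, 900.0]  # running times, one failure seen
    eta = weibayes_eta(unit_hours, beta=2.5, failures=1)
    print(f"Weibayes characteristic life eta = {eta:.0f} hours")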
When: Use the technique when you lack specific details but you know something
from your past experience—often the past experience reduces errors of
Weibull analysis. Use Weibayes analysis to make sense out of emotional
nonsense.
Where: Use the technique to say something and point noses in the right direction
rather than playing the role of Chicken Little with the sky falling. Some data
is better than no data in most cases, and when you can keep your wits and
everyone else is in panic mode, it quiets the problem to allow reason to
prevail.
Return to top

Weibull Analysis-
What: Weibull analysis is the tool of choice for most reliability engineers when they
consider what to do with age-to-failure data. It uses the two-parameter
Weibull distribution, which says mathematically that reliability
R(t) = e^−((t/η)^β), where t is time and η is a scale factor known as the
characteristic life. (Most Weibull distributions have tailed data and lack an
easy way to describe central tendency, as mode ≠ median ≠ mean; however,
regardless of the β value, which is a shape factor, all of the cumulative
distribution functions pass through the η value at 63.2%, which entitles η to
be known as the single-point characteristic life.)
Be careful in use of the three-parameter Weibull equation! It is frequently
misused simply to get a good curve fit! The three-parameter Weibull requires
compliance with these four requirements:
1) you must see curvature of data on a two-parameter plot (concave
downward curves imply a failure free interval on the age-to-failure axis
whereas concave upward curves imply a percentage of the population are
prefailed),
2) you must have a physical reason for why a three-parameter distribution
exists (producing a better curve fit is not a valid reason!),
3) you must have at least 21 failure data points (if curvature is slight you may
need 100+ data points), and
4) the goodness of curve fit must be significantly better after use of the
three-parameter distribution.
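A minimal Python sketch of the two-parameter reliability function (the beta and eta values are hypothetical):

    from math import exp

    def weibull_reliability(t, beta, eta):
        """R(t) = exp(-(t/eta)^beta) for the two-parameter Weibull."""
        return exp(-((t / eta) ** beta))

    beta, eta = 2.3, 4200.0  # shape factor and characteristic life (hours)
    for t in (1000.0, eta, 8000.0):
        print(f"R({t:6.0f} h) = {weibull_reliability(t, beta, eta):.4f}")
    # At t = eta the CDF is 63.2% for any beta, so R(eta) = 0.368 always.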
Why: The Weibull distribution is so frequently used for reliability analysis because
one set of math (based on the premise that the weakest link in the chain causes
failure) describes infant mortality, chance failures, and wear-out failures. Also,
the Weibull distribution has a closed-form solution:
1) for the probability distribution function (PDF),
2) for the cumulative distribution function (CDF),
3) for the reliability function (1-CDF), and
4) the instantaneous failure rate which is also known as the hazard function.
For engineers, discrete solutions are preferred rather than use of tables
because of simplicity. In a similar manner, engineers strongly need graphics
of the Weibull distribution whereas statisticians do not find the graphics
nearly as useful for comprehension.
When: Use Weibull analysis when you have age-to-failure data.
• When you have age-to-failure data by component, the analysis is very
helpful because the β-values will tell you the modes of failure which no other
distribution will do [β<1 implies infant mortality with decreasing failure rates,
β≈1 implies chance failures with a constant failure rate, and β>1 implies
wear-out failure modes with increasing failure rates—when you know the
failure mode you know which “medicine” to apply]!
• When you have age-to-failure data for the system, the β-values have NO
physical significance and the β- and η-values only explain how the system is
functioning—this means you lose significant physical information for
problem solving.
Where: When in doubt, use the Weibull distribution to analyze age-to-failure data. It
works with test data. It works with field data. It works with warranty data. It
works with accelerated testing data. The Weibull distribution is valid for
~85% to 95% of all life data, so play the odds and start with Weibull analysis.
The major competing reliability distribution for Weibull analysis is the
lognormal distribution which is driven by accelerating events. For additional
information read The New Weibull Handbook, 5th edition by Dr. Robert B.
Abernethy and use the SuperSMITH Weibull and SuperSMITH Visual
software for analyzing the data (both programs are bundled at a reduced price
as SuperSMITH).
Return to top

Weibull Corrective Action-


What: Starting with Weibull analysis of component failures, the shape factor β
derived from the Weibull analysis provides an objective guide for selecting
repair strategies.
Why: Experience has shown when shape factor beta is:
β < 1, failure rates are declining with time as occurs with infant mortality
failure modes. This condition provides a run to failure strategy. Older
components are better than new components because the failure rate for the
population is lower than when new.
β ≈ 1, failure rates are constant with time as occurs with chance failure modes.
This condition provides a run to failure strategy (or a run until the component
failure mode changes to a wearout failure mode). An old component is as
good as a new component.
β > 1, failure rates are increasing with time as occurs with wearout failure
modes. If the cost of failures in service is much greater than the cost for a
replacement, the component may have an optimum replacement interval for
timed replacements. If the cost of failures in service is equal to or only slightly
larger than the cost of a replacement, the component may have a run-to-failure
strategy.
Bottom line: You must know your Weibull failure modes and your costs to
make a good maintenance decision.
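A minimal Python sketch of the decision rules above (the beta band and the cost figures are illustrative judgment values, not fixed rules):

    def repair_strategy(beta, cost_of_failure, cost_of_replacement):
        if abs(beta - 1.0) <= 0.1:   # treat as chance failures
            return "Run to failure: an old unit is as good as a new one"
        if beta < 1.0:               # infant mortality
            return "Run to failure: old units outperform new units"
        if cost_of_failure > cost_of_replacement:   # wear-out, failures costly
            return "Seek an optimum timed-replacement interval"
        return "Wear-out, but replacement saves little: run to failure"

    print(repair_strategy(beta=3.2, cost_of_failure=50_000, cost_of_replacement=4_000))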
When: Collect data from the FRACAS system. Perform a Weibull analysis. Store
the data in a Weibull database. Use the Weibull facts for making fact based
technical decisions.
Where: Weibull corrective action is used by maintenance engineers and reliability
engineers. It is a useful tool for understanding scatter in the data and
provides guidance for taking the appropriate corrective action.
Return to top

Weibull Database-
What: The smartest way to maintain a reliability database is in Weibull format and
Weibull databases are available. Seldom do you see Weibull databases from
vendors because they jealously protect their data for proprietary reasons—
they live/die financially from the Weibull database information.
Why: The Weibull databases simplify the complications of failure data into two
statistical values of great importance:
β tells you HOW things fail at the component level, and
η tells you WHEN things fail.
The results are key benchmark data that tell you how you’re doing.
When: Gather your failure data and create your own database. No one is going to
give you their database because they put much sweat and tears into cleaning
up the data so it is useful. The data needs to be locally generated because it
tells you: 1) the life from the grade of equipment you purchase, 2) the grade
of operation of the equipment—do you operate it like a 16-year-old teenager
or a wise 65-year-old?, 3) the grade of maintenance you use to renew its life,
and 4) management's expectations for how to treat the system.
Where: Data collection for a Weibull database often seems, to many, a silly exercise
by maintenance to accumulate data, with much ridicule from the
unknowledgeable about why so much effort is being spent to build it.
When adversity arises, the Weibull database becomes
everyone's prized possession of proprietary information. Remember the
words of Rudyard Kipling about the plight of the English soldier, to paraphrase:
In peacetime it's Tommy this and Tommy that, and Tommy get out of the
way…but let the bullets fly in wartime and it's Mr. This and Mr. That and
Mr., if you please! Everyone wants the baby but no one wants the dirty
diapers that go with every baby! If you don't have a Weibull database, you're
already too late, because your competitor has one started and is using it to your
disadvantage, and he's not going to tell you why you're left in the dirt!
Return to top
Comments:
Refer to the caveats on the Problem Of The Month Page about the limitations of the
following solution. Maybe you have a better idea on how to solve the problem. Maybe
you find where I've screwed up the solution and you can point out my errors as you check
my calculations. E-mail your comments, criticism, and corrections to: Paul Barringer by
clicking here. Return to top of page.

You can download a copy of this page as a PDF file.

Return to Barringer & Associates, Inc. homepage

Last revised November 8, 2010


© Barringer & Associates, Inc. 2007
