Chapter 7
Failure Rate Analysis
Abstract
Failure prediction is one of the key challenges that have to be mastered for a new arena
of fault tolerance techniques: the proactive handling of faults. As a definition, prediction
is a statement about what will happen or might happen in the future. A failure is
defined as “an event that occurs when the delivered service deviates from correct
service.” The main point here is that a failure refers to misbehavior that can be observed
by the user, which can either be a human or another computer system. Things may go
wrong inside the system, but as long as it does not result in incorrect output (including
the case that there is no output at all) there is no failure. Failure prediction is about
assessing the risk of failure for some time in the future. In my approach, failures are predicted by analyzing error events that have occurred in the system. Since, of course, not all events that have ever occurred can be processed, only the events within a time window called the embedding time are used. Failure probabilities are computed not just for one point of time in the future, but for a time interval called the prediction interval.
1. Introduction
Failure prediction is one of the key challenges that must be mastered for a new arena of fault tolerance techniques: the proactive handling of faults. As a definition, prediction is a statement about what will happen or might happen in the future. A failure is defined as "an event that occurs when the delivered service deviates from correct service."
The main point here is that a failure refers to misbehavior that can be observed by the user, which can be either a human or another computer system. Things may go wrong inside the system, but as long as this does not result in incorrect output (including the case in which there is no output at all), there is no failure. Failure prediction is about assessing the risk of failure over some time in the future.
In my view, failure prediction is performed by analyzing error events that have occurred in the system. Failure probabilities are computed not just for one point of time in the future, but for a whole time interval, called the prediction interval.
Failure rates and their projections are important factors in insurance, business, and regulatory practice, and are fundamental to the design of safe systems throughout a national or international economy. From an economic point of view, the inactivity caused by machinery failures, that is, downtime, can be very costly. Repairing broken-down machines is also expensive, because breakdowns consume resources: manpower, spare parts, and even lost production. As a result, repair costs are an important component of total machine ownership costs. Traditional maintenance policies include corrective maintenance (CM) and preventive maintenance (PM). With a CM policy, maintenance is performed after a breakdown or the occurrence of an obvious fault. With a PM policy, maintenance is performed to prevent equipment breakdown. For example, it has been reported that in developing countries almost 53% of total machine expenses are spent on repairing machine breakdowns, compared with about 8% in developed countries, and that establishing an effective and practicable repair and maintenance program could reduce these costs by up to 50%.
Maintenance activities, methodologies, and tools as a whole aim to ensure the continuity of the production process. Traditionally, this objective was achieved by overhauling or replacing critical systems, or through operational and functional redundancy, in order to guarantee an excess of productive capacity. All of these approaches have shown partial inefficiencies: redundant systems and surplus capacity tie up capital that could be used more profitably for production activities, while very conservative overhaul policies amount to a rather expensive way of meeting the required standards. Maintenance has therefore evolved from a simple repair activity into a complex managerial task whose main aim is the prevention of failure. An optimal maintenance approach is a key support to industrial production in the contemporary process industry, and many tools have been developed for improving and optimizing this task.
The majority of industrial systems have a high level of complexity; nevertheless, in many cases they can be repaired. Moreover, historical and/or benchmarking data related to system failure and repair patterns are difficult to obtain and are often not reliable enough, due to various practical constraints. In such circumstances, a good RAM (reliability, availability, and maintainability) analysis can play a key role in the design phase and in any modification required to achieve the optimized performance of such systems. Assessing component reliability is a basic step toward appropriate maintenance performance; available reliability assessment procedures are based on the availability of knowledge about component states. Nevertheless, component states are often uncertain or unknown, particularly during the early stages of development of new systems. In these cases, understanding how uncertainties will affect the system reliability evaluation is essential. System reliability often depends on age, intrinsic factors (dimensioning, component quality, material, etc.), and use conditions (environment, load rate, stress, etc.). The parameter defining a machine's reliability is the failure rate (λ), and this value
is characteristic of the frequency with which breakdowns occur. In this context, failure rate analysis constitutes a strategic method for integrating reliability, availability, and maintainability, using engineering methods, tools, and techniques (such as mean time to failure, equipment downtime, and system availability values) to identify and quantify the equipment and system failures that prevent the achievement of their objectives. First, we define some common terms related to failure rate:
• Failure
A failure occurs when a component is not available. Components fail for different reasons: a component may fail because it has been randomly chosen and marked as failed in order to assess its effect, or it may fail because another component that it depends on has broken down. In reliability engineering, a failure is the event in which a component or system does not deliver its intended performance and is considered unavailable.
• Error
In reliability engineering, an error is defined as an incorrect state or deviation that is the root cause of a failure.
• Fault
In reliability engineering, a fault is defined as a malfunction that is the root cause of an error. Within this chapter, however, we may refer to a component failure as a fault that can lead to a system failure. This is done wherever there is a risk of confusion between a failure occurring at an intermediate level (referred to as a fault) and one occurring at the final, system level (referred to as a failure).
2. Failure Rate
The reliability of a machine is its probability of performing its function over a defined period, within certain restrictions and under certain conditions. Reliability is the quantitative expression of a machine's operational availability; it can therefore be characterized by the period over which a machine can operate without any breakdowns. Equipment reliability depends on the failure frequency, which is expressed by the mean time between failures (MTBF). Reliability predictions are based on failure rates. The failure intensity, or conditional failure rate λ(t), can be defined as "the anticipated number of times an item will fail in a specified time period, given that it was as good as new at time zero and is functioning at time t". This calculated value provides a measure of reliability for an item of equipment and is commonly expressed in failures per million hours (f/mh). For example, a component with a failure rate of 10 f/mh would be expected to fail 10 times in a 1-million-hour period. Failure rate calculations are based on complex models that include factors such as stress, environment, and temperature, using component-specific data. In the prediction model,
assembled components are organized in series. Thus, the failure rate of an assembly is calculated as the sum of the individual failure rates of the components within the assembly. The MTBF is determined using Eq. (1). The failure rate λ, which is equal to the reciprocal of the mean time between failures (MTBF) expressed in hours, is calculated using Eq. (2) [1]:
$$\mathrm{MTBF} = \frac{T}{n} \tag{1}$$

$$\lambda = \frac{1}{\mathrm{MTBF}} \tag{2}$$
where MTBF is the mean time between failures (h), T is the total operating time (h), n is the number of failures, and λ is the failure rate (failures per hour, commonly scaled to failures per 10^6 h),
or, equivalently, $\lambda = n/T$.
NOTE: Although MTBF was designed for use with repairable items, it is commonly
used for both repairable and non-repairable items. For non-repairable items, MTBF is
the time until the first (and only) failure after t0.
3. Units
Any unit of time can be used as the failure rate unit, but hours are the most common in practice. Other units, such as miles or revolutions, can also replace the time unit.
Because the failure rates of individual components are often very low, in engineering notation they are frequently expressed as failures per million (10^6) hours.
The failures-in-time (FIT) rate of a component is the number of failures that can be expected in one billion (10^9) device-hours of operation (e.g., 1000 components for 1 million hours, 1 million components for 1000 hours each, or some other combination). The semiconductor industry currently uses this unit.
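As a quick illustration of these units (not part of the original chapter), the following Python sketch converts an assumed per-hour failure rate into failures per 10^6 hours and into FIT; the numeric value of λ is chosen purely for illustration.

```python
def per_hour_to_per_million_hours(lam_per_hour: float) -> float:
    # failures per 10^6 device-hours
    return lam_per_hour * 1.0e6

def per_hour_to_fit(lam_per_hour: float) -> float:
    # FIT = failures per 10^9 device-hours
    return lam_per_hour * 1.0e9

lam = 2.5e-8  # assumed component failure rate, failures per hour (illustrative)
print(f"{per_hour_to_per_million_hours(lam):.3f} failures per 10^6 h")  # 0.025
print(f"{per_hour_to_fit(lam):.1f} FIT")                                # 25.0
```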
Example 1. Suppose we want to estimate the failure rate of a certain component and carry out the following test. Each of 10 identical components is tested until it either fails or reaches 1000 hours, at which time the test is terminated for that component. The results are shown in Table 1:
Component     Hours     Result
1             950       Failed
2             720       No failure
3             467       No failure
4             635       No failure
5             1000      Failed
6             602       No failure
7             1000      Failed
8             582       No failure
9             940       Failed
10            558       Failed
Totals        7454      5 failures

Table 1. Test results for the 10 components.
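A minimal Python sketch (not from the original text) that applies Eqs. (1) and (2) to the data in Table 1; the tuple encoding of the table is my own.

```python
# Apply Eqs. (1) and (2) to the Example 1 test data (Table 1).
# Each tuple is (hours accumulated by the component, True if it failed).
test_data = [
    (950, True), (720, False), (467, False), (635, False), (1000, True),
    (602, False), (1000, True), (582, False), (940, True), (558, True),
]

total_hours = sum(hours for hours, _ in test_data)       # T = 7454 h
failures = sum(1 for _, failed in test_data if failed)   # n = 5

mtbf = total_hours / failures          # Eq. (1): MTBF = T / n  -> ~1490.8 h
failure_rate = failures / total_hours  # Eq. (2): lambda = 1 / MTBF = n / T

print(f"T = {total_hours} h, n = {failures}")
print(f"MTBF = {mtbf:.1f} h")
print(f"lambda = {failure_rate:.6f} failures/h "
      f"= {failure_rate * 1e6:.0f} failures per 10^6 h")
```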
Example 2. Suppose a tractor is operated 6540 hours in a year and the MTBF of this tractor model is 1,050,000 hours. The expected number of failures per tractor per year is then 6540/1,050,000 ≈ 0.0062; that is, in an average year we can expect about 0.62% of these tractors to fail.
Example 3. Now assume a tractor is operated 6320 hours a year and its MTBF is 63,000 hours. Then 6320/63,000 ≈ 0.10, i.e., roughly 10% of these tractors may break down in an average year.
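The tractor examples can be reproduced with a short sketch like the one below; the linear estimate t/MTBF is the one used above, while the form 1 − exp(−t/MTBF) is shown only as a comparison under an assumed exponential life distribution.

```python
import math

def expected_annual_failures(hours_per_year: float, mtbf_hours: float) -> float:
    # Linear estimate used in Examples 2 and 3: expected failures per unit-year.
    return hours_per_year / mtbf_hours

def prob_failure_within(hours: float, mtbf_hours: float) -> float:
    # Exponential life model: probability of at least one failure within 'hours'.
    return 1.0 - math.exp(-hours / mtbf_hours)

print(expected_annual_failures(6540, 1_050_000))  # Example 2: ~0.0062 -> ~0.62 %
print(expected_annual_failures(6320, 63_000))     # Example 3: ~0.100  -> ~10 %
print(prob_failure_within(6320, 63_000))          # ~0.095 with the exponential model
```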
One of the basic measures of reliability for non-repairable systems is the mean time to failure (MTTF). This statistical value is defined as the average time expected until the first failure of a piece of equipment. MTTF is intended as a mean over a long period of time and a large number of units. For constant failure rate systems, the MTTF is the inverse of the failure rate, 1/λ. If the failure rate λ is given in failures per million hours, then MTTF = 1,000,000/λ for components with exponential life distributions. More generally:
$$\mathrm{MTTF} = \int_{0}^{\infty} R(t)\,dt \tag{3}$$
For repairable systems, the MTTF is the expected time from a repair to the first, or next, failure.
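A small numeric check of Eq. (3) under an assumed constant failure rate (not from the original text): integrating R(t) = exp(−λt) over a long horizon should reproduce 1/λ. The rate value and grid are arbitrary illustrations.

```python
import numpy as np

lam = 2.0e-4                           # assumed constant failure rate, failures per hour
t = np.linspace(0.0, 1.0e5, 200_001)   # grid out to 100,000 h, far beyond 1/lambda
R = np.exp(-lam * t)                   # reliability function R(t) = exp(-lambda * t)

# Trapezoidal approximation of Eq. (3): MTTF = integral of R(t) from 0 to infinity.
mttf_numeric = float(np.sum(0.5 * (R[:-1] + R[1:]) * np.diff(t)))

print(f"numerical MTTF ~ {mttf_numeric:.1f} h, analytical 1/lambda = {1.0 / lam:.1f} h")
```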
Failure density f(t): the failure density of a component or system is the probability per unit time that the first failure occurs in the component or system at time t, given that the component or system was operating at time zero.
Failure rate r(t): the failure rate of a component or system is the probability per unit time that the component or system experiences its first failure at time t, given that it was operating at time zero and has survived to time t.
Conditional failure rate or conditional failure intensity λ(t): the conditional failure intensity of a component or system is the probability per unit time that a failure occurs in the component or system at time t, given that the component or system was operating, or was repaired to be as good as new, at time zero and is operating at time t.
Unconditional failure intensity or failure frequency ω(t): the unconditional failure intensity of a component or system is the probability per unit time that the component or system fails at time t, given that it was operating at time zero. The following relation (4) holds between these failure parameters [2]:
$$r(t) = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{R(t)} \tag{4}$$
The difference between the definitions of the failure rate r(t) and the conditional failure intensity λ(t) is that the failure rate refers to the first failure of the component or system, rather than to any failure. If the failure rate is constant over the time considered, or if the component is non-repairable, these two quantities are the same, so r(t) = λ(t).
The conditional failure intensity (CFI) λ(t) and the unconditional failure intensity ω(t) differ because the CFI carries the additional condition that the component or system has survived to time t. Equation (5) expresses the relationship between these two quantities mathematically:
$$\lambda(t) = \frac{\omega(t)}{1 - Q(t)} \tag{5}$$

where Q(t) is the unavailability of the component or system at time t.
If the failure rate is constant then the following expressions (6) apply:
$$r(t) = \lambda(t) = \lambda, \qquad f(t) = \lambda\, e^{-\lambda t}, \qquad R(t) = e^{-\lambda t} \tag{6}$$
As can be seen from the equation above, a constant failure rate results in an exponential failure
density distribution.
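The relations in Eqs. (4) and (6) can be sanity-checked with a few lines of Python (a sketch of mine, not from the chapter); the failure rate value below is an arbitrary assumption, and the point is simply that r(t) = f(t)/R(t) stays equal to λ while f(t) decays.

```python
import math

lam = 1.0e-3  # assumed constant failure rate, failures per hour (illustrative)

def R(t: float) -> float:
    # Reliability, Eq. (6): R(t) = exp(-lambda * t)
    return math.exp(-lam * t)

def f(t: float) -> float:
    # Failure density, Eq. (6): f(t) = lambda * exp(-lambda * t)
    return lam * math.exp(-lam * t)

def r(t: float) -> float:
    # Failure rate, Eq. (4): r(t) = f(t) / R(t)
    return f(t) / R(t)

for t in (10.0, 100.0, 1000.0):
    # r(t) stays equal to lambda even though f(t) decays exponentially.
    print(f"t = {t:6.0f} h   R = {R(t):.4f}   f = {f(t):.6f}   r = {r(t):.6f}")
```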
Availability is the fraction of time a system is operational. It is commonly computed as A = MTBF/(MTBF + MDT), where the mean down time (MDT) includes all repair time, any self-imposed downtime, and any logistics or administrative delays. The MDT and the MTTR (mean time to repair) differ because the MDT includes any and all of the delays involved, whereas the MTTR looks specifically at the repair time.
Sometimes the mean time to repair (MTTR) is used in this formula instead of the MDT, but the MTTR may not be identical to the MDT, because it covers only the repair activity itself. Whichever of MDT or MTTR is used, it is important that it reflects the total time for which the equipment is unavailable for service; otherwise the computed availability will be incorrect.
In the process industries, the MTTR is often taken to be 8 hours, the length of a common work shift, but the actual repair time in a particular installation may well be different.
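A small sketch of the availability calculation discussed above; the MTBF, MTTR, and MDT values are illustrative assumptions, chosen only to show how much the extra delays in the MDT reduce the computed availability.

```python
def availability(mtbf_hours: float, downtime_hours: float) -> float:
    # Steady-state availability: uptime fraction = MTBF / (MTBF + down time per failure)
    return mtbf_hours / (mtbf_hours + downtime_hours)

mtbf = 10_000.0  # assumed MTBF, hours
mttr = 8.0       # bare repair time: one work shift
mdt = 48.0       # assumed total down time, including logistics and administrative delays

print(f"Availability using MTTR only: {availability(mtbf, mttr):.5f}")  # ~0.99920
print(f"Availability using full MDT : {availability(mtbf, mdt):.5f}")   # ~0.99522
```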
PFD is the probability of failure on demand. Safety systems are often designed to work in the background, monitoring a process but not taking any action until a safety limit is exceeded, at which point they must act to keep the process safe. Such safety systems are often known as emergency shutdown (ESD) systems.
PFD expresses the unavailability of a safety function: if a demand to act occurs after some time, what is the probability that the safety function has already failed? As you might expect, the PFD equation looks like the equation for general unavailability (7) [3]:

$$\mathrm{PFD} \approx \lambda_{DU} \cdot \mathrm{MDT} \tag{7}$$
Note that we are really talking about PFDavg here, the mean probability of failure on demand, which is the correct term to use, since the probability does change over time: the failure probability of a system depends on how long ago you tested it.
λDU is the failure rate of dangerous undetected failures. We do not count any failures that are assumed to be "safe," perhaps because they cause the process to shut down; we count only those failures that remain hidden but will prevent the safety function from operating when it is called upon. This is important because it reminds us not to assume that a safety-related product is generally more reliable than a general-purpose product. The aim of safety-related product design is an especially low failure rate for the safety function, but the product's overall failure rate (and hence MTBF) may not be so impressive.
The MDT for a safety function is therefore determined by the fact that a dangerous undetected failure will not become apparent until either a demand comes along or a proof test reveals it.
Suppose we proof test our safety function every year or two, say every T1 hours. The safety
function is equally likely to fail at any time between one proof test and the next, so, on average
it is down for T1/2 hours.
From this we get the simplest form of the PFD calculation for safety functions [3]:

$$\mathrm{PFD}_{avg} \approx \frac{\lambda_{DU}\, T_1}{2} \tag{8}$$
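A minimal sketch of this simplified PFD calculation; the dangerous-undetected failure rate and the one-year proof test interval are assumed values for illustration only.

```python
def pfd_avg(lambda_du_per_hour: float, proof_test_interval_hours: float) -> float:
    # Simplified average probability of failure on demand: lambda_DU * T1 / 2
    return lambda_du_per_hour * proof_test_interval_hours / 2.0

lambda_du = 2.0e-7  # assumed dangerous-undetected failure rate, failures per hour
t1 = 8760.0         # assumed proof test interval: one year, in hours

print(f"PFDavg ~ {pfd_avg(lambda_du, t1):.2e}")  # ~8.8e-04
```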
3.6. SIL
In reliability engineering, SIL is one of the most abused terms. "SIL" is often used to suggest that an equipment item or system has better quality, higher reliability, or some other desirable feature. It does not. SIL actually means safety integrity level and has a range between 1 and 4. It is used to describe the degree of safety protection required by a process and, consequently, the safety reliability required of the safety system to achieve that protection. SIL4 represents the highest level of safety protection and SIL1 the lowest.
Many products are advertised as "SIL" rated, implying that they are suitable for use in safety systems. Whether this is actually true depends on a lot of detail, which is beyond the scope of this chapter. But remember that even when a product genuinely meets the "SIL" requirements, this only tells you that it will do a particular job in a safety system. Its safety reliability may be high, but its general reliability may not be, as mentioned in the previous section.
Useful to remember
• If an item works for a long time without breakdown, it can be said to be highly reliable.
• If an item does not fail very often and, when it does, it can be quickly returned to
service, it would be highly available.
• If a system is reliable in performing its safety function, it is considered to be safe.
The system may fail much more frequently in modes that are not considered to be
dangerous.
• Finally, a safety system may have a lower overall MTBF than a non-safety system
performing a similar function.
• “SIL” does not mean a guarantee of quality or reliability, except in a defined safety
context.
• MTBF is a measure of reliability, but it is not the expected life, the useful life, or the
average life.
• Calculations of reliability and failure rate of redundant systems are complex and
often counter-intuitive.
4. Failure types
Failures generally be grouped into three basic types, though there may be more than one cause
for a particular case. The three types included: early failures, random failures and wear-out
In the early life stage, failures (infant mortality) are often due to defects that escape the manufacturing process. As the defective parts fail and leave behind a population of defect-free products, the number of failures caused by manufacturing problems decreases; consequently, the early-stage failure rate decreases with age. During the useful life, failures may be related to freak accidents and mishandling that subject the product to unexpected stress conditions. The failure rate over the useful life is generally very low and roughly constant. As the equipment reaches the wear-out stage, degradation results from repeated or sustained stress conditions, and the failure rate increases dramatically as more and more equipment fails from wear-out. When the failure rate is plotted over time, as illustrated in Figure 1, these stages form the so-called "bathtub" curve.
Many methods are used to ensure the integrity of a design. Some of the design techniques include: burn-in (to stress devices under constant operating conditions); power cycling (to stress devices under the surges of turn-on and turn-off); temperature cycling (to mechanically and electrically stress devices over the temperature extremes); vibration; testing at the thermal destruct limits; highly accelerated stress and life testing; and so on. Despite the use of all these design tools, and of manufacturing tools such as six sigma and other quality improvement techniques, there will still be some early failures, because we are not able to control processes at the molecular level. There is always the risk that, even when the most up-to-date techniques are used in design and manufacture, early breakdowns will happen. To reduce this risk, especially in newer products, stress screening is used, which consumes some of the early useful life of a module. The initial peak at the start of operating life represents the highest risk of failure; with this technique, the units begin their service life somewhere closer to the flat portion of the bathtub curve. Burn-in and temperature cycling are the two factors used to consume this early operating life. The amount of screening needed for acceptable quality is a function of the process grade as well as its history: M-Grade modules are screened more than I-Grade modules, and I-Grade modules are screened more than C-Grade units.
Consider another example: a sample of 15,000 18-year-old humans, observed for 1 year, during which 15 of them die. The death rate during this period is 15/15,000 = 0.1% per year, and its inverse, the "MTBF," is 1/0.001 = 1000 years. This example shows that a high MTBF value is quite different from life expectancy. As people become older, more deaths occur, so the best way to estimate the mean life would be to monitor the sample until all of its members reach the end of life and then average the resulting life spans. This would give a figure on the order of 75-80 years, which is far more realistic.
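The arithmetic of this example in a few lines of Python (my own sketch); the figure of 15 deaths is implied by the 15/15,000 ratio quoted above.

```python
# Illustrative numbers: 15,000 people observed for one year, 15 deaths
# (the 15 deaths are implied by the 15/15,000 ratio in the text).
sample_size = 15_000
deaths_in_year = 15

death_rate_per_year = deaths_in_year / sample_size  # 0.001, i.e. 0.1 % per year
naive_mtbf_years = 1.0 / death_rate_per_year        # 1000 years

print(f"Death rate: {death_rate_per_year:.3%} per year")
print(f"Naive 'MTBF' from that rate: {naive_mtbf_years:.0f} years")
print("Mean lifespan is on the order of 75-80 years, so MTBF is not expected life.")
```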
There are two major categories of system outages: (1) unplanned outages (failures) and (2) planned outages (maintenance), both of which lead to downtime. In terms of cost, unplanned and planned outages are comparable, but the use of redundant components may mitigate their impact. A planned outage usually has a sustainable impact on system availability, provided it is scheduled appropriately. Planned outages mostly happen because of maintenance; typical causes of planned downtime include periodic backups, configuration changes, software upgrades, and patches. According to prior research, about 44% of downtime at service providers is unscheduled, and such downtime can be very costly.
Another categorization can be:
• Internal outage
• External outage
Specification and design flaws, manufacturing defects, and wear-out are categorized as internal factors. Radiation, electromagnetic interference, operator error, and natural disasters can be considered external factors. However well designed a system is, and however reliable its components are, failures are unavoidable; what is possible is to mitigate their impact on the system.
The most common ways of obtaining failure rate data are the following:
• Historical data about the device or system under consideration.
Many organizations record failure information for the equipment or systems they produce, and failure rates can be calculated from these records for those devices or systems. For recently introduced equipment or systems, historical data for similar equipment or systems can serve as a useful estimate.
• Government and commercial failure rate data.
Handbooks of failure rate data for various types of equipment are available from government and commercial sources. MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, is a military standard that provides failure rate data for many military electronic components. Several commercially available failure rate data sources focus on commercial components, including some non-electronic components.
• Testing
The most accurate source of data is to test samples of the actual devices or systems in order to generate failure data. This is often prohibitively expensive or impractical, so the previous data sources are often used instead.
Distributions
Probability distributions used in failure modeling fall into two classes: discrete and continuous.
4.8. Derivations of failure rate equations for series and parallel systems
This section shows the derivations of the system failure rates for series and parallel configura-
tions of constant failure rate components in Lambda Predict.
For a series configuration, the system reliability is the product of the component reliabilities:

$$R_S = R_1 \cdot R_2 \cdots R_n \tag{9}$$
where R1, R2, …, Rn are the values of reliability for the n components. If the failure rates of the
components are λ1, λ2,…, λn, then the system reliability is:
$$R_S(t) = e^{-\lambda_1 t} \cdot e^{-\lambda_2 t} \cdots e^{-\lambda_n t} = e^{-(\lambda_1 + \lambda_2 + \cdots + \lambda_n)t} \tag{10}$$
Therefore, the system reliability can be expressed in terms of the system failure rate, λS, as:
$$R_S(t) = e^{-\lambda_S t} \tag{11}$$
where $\lambda_S = \sum_{i=1}^{n} \lambda_i$ and λS is constant. Note that since the component failure rates are constant, the system failure rate is constant as well. In other words, the system failure rate at any
mission time is equal to the steady-state failure rate when constant failure rate components are
arranged in a series configuration. If the components have identical failure rates, λC, then:
$$\lambda_S = n\,\lambda_C \tag{12}$$
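A short sketch of the series-system results in Eqs. (10)-(12); the component failure rates and mission time are assumed values, and the check simply confirms that multiplying the component reliabilities and using the summed rate λS give the same system reliability.

```python
import math

component_rates = [2.0e-6, 5.0e-6, 1.0e-5]  # assumed constant failure rates, failures/h

# Eq. (11): for a series system the component failure rates simply add.
lambda_series = sum(component_rates)

t = 10_000.0  # assumed mission time, hours
R_product = math.prod(math.exp(-lam * t) for lam in component_rates)  # Eq. (10)
R_series = math.exp(-lambda_series * t)                               # Eq. (11)

print(f"lambda_S = {lambda_series:.2e} failures/h")
print(f"R_S(t) from the product of component reliabilities: {R_product:.6f}")
print(f"R_S(t) from exp(-lambda_S * t)                     : {R_series:.6f}")
```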
It should be pointed out that if n blocks with non-constant (i.e., time-dependent) failure rates
are arranged in a series configuration, then the system failure rate has a similar equation to the
one for constant failure rate blocks arranged in series and is given by:
$$\lambda_S(t) = \sum_{i=1}^{n} \lambda_i(t) \tag{13}$$
Now consider a system of n identical components arranged in parallel. The system reliability is:

$$R_S = 1 - (1 - R_C)^n \tag{14}$$
where RC is the reliability of each component. Substituting the expression for component
reliability in terms of the constant component failure rate, λC, yields:
$$R_S(t) = 1 - \left(1 - e^{-\lambda_C t}\right)^n \tag{15}$$
Notice that this equation does not reduce to the form of a simple exponential distribution like
for the case of a system of components arranged in series. In other words, the reliability of a
system of constant failure rate components arranged in parallel cannot be modeled using a
constant system failure rate model.
To find the failure rate of a system of n components in parallel, the relationship between the
reliability function, the probability density function and the failure rate is employed. The
failure rate is defined as the ratio between the probability density and reliability functions, or:
$$\lambda_S(t) = \frac{f_S(t)}{R_S(t)} \tag{16}$$
Because the probability density function can be written in terms of the time derivative of the
reliability function, the previous equation becomes:
$$\lambda_S(t) = -\frac{1}{R_S(t)}\,\frac{dR_S(t)}{dt} \tag{17}$$

$$\frac{dR_S(t)}{dt} = n\left[1 - R_C(t)\right]^{\,n-1}\frac{dR_C(t)}{dt} \tag{18}$$

$$f_S(t) = -\frac{dR_S(t)}{dt} = n\left[1 - R_C(t)\right]^{\,n-1} f_C(t) \tag{19}$$
Substituting into the expression for the system failure rate yields:
$$\lambda_S(t) = \frac{n\left[1 - R_C(t)\right]^{\,n-1} f_C(t)}{1 - \left[1 - R_C(t)\right]^{\,n}} \tag{20}$$
For constant failure rate components, the system failure rate becomes:
$$\lambda_S(t) = \frac{n\,\lambda_C\, e^{-\lambda_C t}\left(1 - e^{-\lambda_C t}\right)^{\,n-1}}{1 - \left(1 - e^{-\lambda_C t}\right)^{\,n}} \tag{21}$$
Thus, the failure rate for identical constant failure rate components arranged in parallel is time-
dependent. Taking the limit of the system failure rate as t approaches infinity leads to the
following expression for the steady-state system failure rate:
$$\lambda_S(\infty) = \lim_{t \to \infty} \frac{n\,\lambda_C\, e^{-\lambda_C t}\left(1 - e^{-\lambda_C t}\right)^{\,n-1}}{1 - \left(1 - e^{-\lambda_C t}\right)^{\,n}} \tag{22}$$

$$\lambda_S(\infty) = \lambda_C \tag{23}$$
So the steady-state failure rate for a system of constant failure rate components in a simple
parallel arrangement is the failure rate of a single component. It can be shown that for a k-
out-of-n parallel configuration with identical components:
$$\lambda_S(\infty) = k\,\lambda_C \tag{24}$$
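A sketch of the parallel-system failure rate of Eq. (21), showing numerically that it approaches λC as t grows (Eqs. (22) and (23)); the component rate, n, and k are assumed values, and the k-out-of-n line only evaluates the steady-state estimate k·λC from Eq. (24).

```python
import math

def lambda_parallel(n: int, lam_c: float, t: float) -> float:
    # Eq. (21): failure rate of n identical constant-rate components in parallel.
    q = 1.0 - math.exp(-lam_c * t)  # probability that a single component has failed by t
    return n * lam_c * math.exp(-lam_c * t) * q ** (n - 1) / (1.0 - q ** n)

lam_c = 1.0e-4  # assumed component failure rate, failures per hour
n = 3           # assumed number of parallel components

for t in (1.0e2, 1.0e4, 1.0e5):
    print(f"t = {t:9.0f} h   lambda_S(t) = {lambda_parallel(n, lam_c, t):.3e}")
# lambda_S(t) climbs toward lambda_C = 1.0e-4 as t grows, per Eqs. (22)-(23).

k = 2  # assumed k for a k-out-of-n arrangement
print(f"k-out-of-n steady-state estimate, Eq. (24): {k * lam_c:.1e} failures/h")
```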
Author details
Fatemeh Afsharnia
Address all correspondence to: [email protected]
References
[1] Billinton R, Allan RN. Reliability Evaluation of Engineering Systems (Concepts and Tech-
niques). New York, London: Plenum Press; 1992. 453 pp
[2] Available from: http://www.reliabilityeducation.com/ReliabilityPredictionBasics.pdf
[3] Available from: http://www.mtl-inst.com