ECT144


cahiers techniques 144

introduction to dependability
design
P. Bonnefoi

Pascal Bonnefoi earned his engineering degree from ESE in 1985. After working for a year in Operational Research for the French Navy, he started work in 1986 as a reliability analyst for Merlin Gerin, in Reliability studies, for which he developed a series of special software packages. He also taught courses in this field in the industrial and academic worlds. He is presently working as a software engineer for HANDEL, a Merlin Gerin subsidiary.

MERLIN GERIN
service information
38050 Grenoble Cedex
France MERLIN GERIN
tél. : 76.57.60.60
la maîtrise de l'énergie électrique
E/CT 144
GROUPE SCHNEIDER
December 1990
Equipment failures, unavailability of a power supply, stoppage of automated equipment and accidents are quickly becoming unacceptable events, be it to the ordinary citizen or industrial manufacturers.
Dependability and its components: reliability, maintainability, availability and safety, have become a science that no designer can afford to ignore.
This technical report presents the basic concepts and an explanation of its basic computational methods.
Some examples and several numerical values are given to complement the formulas and references to the various computer tools usually applied in this field.

Table of contents

1. Importance of dependability
   In housing                                p. 2
   In services                               p. 2
   In industry                               p. 2
2. Dependability characteristics
   Reliability                               p. 2
   Failure rate                              p. 2
   Availability                              p. 3
   Maintainability                           p. 4
   Safety                                    p. 4
3. Dependability characteristics interdependence
   Interrelated quantities                   p. 5
   Conflicting requirements                  p. 5
   Time average related quantities           p. 6
4. Types of defects
   Physical defects                          p. 7
   Design defects                            p. 7
   Operating errors                          p. 7
5. From component to system: modeling aspects
   Data bases for system components          p. 8
   FMECA method                              p. 11
   Reliability block diagram                 p. 11
   Fault trees analysis                      p. 14
   State graphs                              p. 17
6. Conclusion                                p. 19
7. References and Standards                  p. 20

cahiers techniques Merlin Gerin n° 144 / p.1


1. importance of dependability

Prehistoric men had to depend on their arms for survival. Modern man is surrounded by ever more sophisticated tools and systems on which he depends for safety, efficiency and comfort.
Ordinary citizens are especially concerned in everyday life by:
■ the reliability of the TV set,
■ the availability of the mains supply,
■ the maintainability of freezers and cars,
■ the safety of their boiler valves.
Bankers and, in general, service industries give a lot of weight to:
■ computer reliability,
■ availability of heating,
■ maintainability of elevators,
■ fire related safety.
In competitive industries it is not possible to tolerate production losses. This is even more so for complex industrial processes. In these cases one vies to obtain the best:
■ reliability of command and control systems,
■ availability of machine tools,
■ maintainability of production tools,
■ personnel and invested capital safety.
These characteristics, known under the general term of DEPENDABILITY, are related to the concept of reliance (to depend upon something). They are quantified in relation to a goal, they are computed in terms of a probability and are obtained by the choice of an architecture and its components. They can be verified by suitable tests or by experience.
For over 20 years Merlin Gerin has pioneered work in the DEPENDABILITY field: in the past, with its contribution to the design of nuclear power plants or the high availability of power supplies used at the launching site of the ARIANE space program; nowadays, by its design of products and systems used worldwide.

2. dependability characteristics

reliability
Light bulbs are used by everyone: individuals, bankers and industrial workers. When turned on, a light bulb is expected to work until turned off. Its reliability is the probability that it works until time t and it is a measure of the light bulb's aptitude to function correctly.
Definition:
The reliability of an item is the probability that this item will be able to perform the function it was designed to accomplish under given conditions during a time interval (t1,t2); it is written R(t1,t2).
This definition follows the one given by the IEC (International Electrotechnical Commission) International Electrotechnical Vocabulary, Chapter 191. Certain basic concepts used by this definition must be detailed:
Function: the reliability is a characteristic assigned to the system's function. Knowledge of its hardware architecture is usually not enough. Functional analysis methods must be used to determine the reliability.
Conditions: the environment has a fundamental role in reliability. This is also true for the operating conditions. Hardware aspects are clearly insufficient.
Time interval: we wish to emphasize an interval of time as opposed to a specific instant. Initially, the system is supposed to work. The problem is to determine for how long. In general t1=0 and it is possible to write R(t) for the reliability function.

failure rate
Consider the light bulb example again. Its failure rate at time t, written as λ(t), gives the probability per unit time that it will suddenly burn out in the interval of time (t, t+∆t), given that it kept working until time t. Failure rates are time rates and, as such, their units are inverse time.
Mathematically, the failure rate is written as:

\lambda(t) = \lim_{\Delta t \to 0} \left( \frac{1}{\Delta t} \cdot \frac{R(t) - R(t+\Delta t)}{R(t)} \right) = -\frac{1}{R(t)} \, \frac{dR(t)}{dt} \qquad (1)

For a human being, the failure rate measures the probability of death occurring in the next hour:
λ(20 years) = 10^-6 per hour.
If λ is represented as a function of age, one obtains the curve given in figure 1.



After the high values corresponding to the infant mortality period, λ reaches the value of adult age during which it becomes constant since causes of death are mainly accidental and thus, independent of age. After 60 years old, old age causes λ to increase. Experience seems to show that many electronic components follow a similar bathtub curve, from which the same terminology is borrowed: infant mortality, useful life and wearout.

fig. 1: bathtub curve (λ(t) versus t, with three successive regions: infant mortality, useful life, wearout)

During the useful life, λ is constant and Equation (1) becomes R(t) = exp(-λt).
This is the exponential distribution and the shape of the reliability function is given in figure 2.
The exponential distribution is one among many other possibilities. Mechanical devices which are subject to wearout since the beginning of their operating life can follow other distributions, like Weibull's distribution. In this case the failure rate is time dependent. A curve illustrating the time dependency of λ is seen in figure 3, in which no plateau, as in figure 1, exists.
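The link between equation (1) and the two distributions mentioned above can be checked numerically. The following is a minimal sketch, not part of the original report: the failure rate value, the Weibull scale and shape parameters, and the function names are all hypothetical and chosen only for illustration. It evaluates the finite-difference form of equation (1) and shows that the estimate stays constant for an exponential R(t) but grows with time for a Weibull R(t) with shape greater than 1.

```python
import math

def reliability_exponential(t, lam):
    """R(t) = exp(-lam * t): constant failure rate lam."""
    return math.exp(-lam * t)

def reliability_weibull(t, eta, beta):
    """Weibull reliability, assuming the usual scale (eta) / shape (beta) form."""
    return math.exp(-((t / eta) ** beta))

def failure_rate(reliability, t, dt=1e-3):
    """Finite-difference form of equation (1):
    lambda(t) ~ (R(t) - R(t + dt)) / (dt * R(t))."""
    r_t = reliability(t)
    return (r_t - reliability(t + dt)) / (dt * r_t)

LAM = 1e-4                 # hypothetical constant failure rate, per hour
ETA, BETA = 5000.0, 2.0    # hypothetical Weibull parameters (hours, shape)

for t in (100.0, 1000.0, 5000.0):
    exp_rate = failure_rate(lambda x: reliability_exponential(x, LAM), t)
    wb_rate = failure_rate(lambda x: reliability_weibull(x, ETA, BETA), t)
    print(f"t = {t:6.0f} h   exponential: {exp_rate:.3e} /h   Weibull: {wb_rate:.3e} /h")
```

With a shape parameter of 2 the Weibull failure rate increases linearly with time, which corresponds to the absence of a plateau described for figure 3.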

fig. 2: exponential reliability (R(t) = e^-λt versus t, decreasing from 1 towards 0)

fig. 3: wearout reliability curve (λ(t) versus t, after the infant mortality period)

availability
To illustrate the concept of availability consider the case of an automobile. A vehicle must start and run upon demand. Its past history may be of little relevance. The availability is a measure of its aptitude to run properly at a given instant.
Definition:
The availability of a device is the probability that this device be in such a state so as to perform the function for which it was designed under given conditions and at a given time t, under the assumption that external conditions needed are assured. We will use the symbol A(t).
This definition, inspired by the one given by the IEC, mimics the one for the reliability. However, its time characteristics are basically different since the concept of interest is an instant of time instead of a time length. For a repairable system, functioning at time t does not necessarily imply functioning throughout [0,t]. This is the main difference between availability and reliability.
It is possible to plot the availability curve



as a function of time for a repairable device, having exponential times to failure and to repair (see figure 4).
It can be seen that the availability has a limiting value which, by definition, is the asymptotic availability. This limit is reached after a certain time. The limiting reliability is always zero since, eventually, all devices will fail. (This last point is controversial when dealing with software).

fig. 4: availability as a function of time (the curve, labeled D(t) in the figure, starts at 1 and decreases towards the asymptote D∞)

Consider again the case of the automobile. Two kinds of cars can have poor availabilities: those with frequent failures and those which do not fail often but instead spend a long time in the garage for repairs. Thus, although the reliability is an important component of the availability, the aptitude to being promptly repaired is also of paramount importance: this is measured by the maintainability.

maintainability
Many designers seek top performance for their products, sometimes neglecting to consider the possibility of failure. When all the effort has been concentrated on having a functioning system, it is difficult to consider what would happen in case of failure. Still, this is a fundamental question to ask. If a system is to have high availability, it should very rarely fail but it should also be possible to quickly repair it. In this context, the repair activity must encompass all the actions leading to system restoration, including logistics. The aptitude of a system to be repaired is therefore measured by its maintainability.
Definition:
The maintainability of an item is the probability that a given active maintenance operation can be accomplished in a given time interval [t1,t2]. It is written as M(t1,t2).
This definition also follows closely that of the IEC's international vocabulary. It shows that the maintainability is related to repair in a manner similar to that of reliability and failure. The maintainability M(t) is also defined using the same hypotheses as R(t).
The repair rate µ(t) is introduced in a way analogous to the failure rate. When it can be considered constant, the repair times follow an exponential distribution and the probability that the repair is completed within [0,t] is:
M(t) = 1 - exp(-µt).

safety
It is possible to distinguish between dangerous failures and safe ones. The difference does not lie so much in the failures themselves but in their consequences. Switching off the light signals in a train station or suddenly switching them from green to red has an impact (all trains stop) but is not functionally dangerous. The situation is totally different if the lights were accidentally to turn all to green. Safety is the probability to avoid dangerous events.
The concept of safety is closely linked to that of risk which, in turn, depends not only on the probability of occurrence but also on the criticality of the event. It is possible to accept a life threatening risk (maximum criticality) if the probability of such an event is minimal. If it is just a matter of having a broken limb the acceptable probability might be greater. The curve in figure 5 illustrates the concept of acceptable risk.

fig. 5: the level of risk is a function of both criticality and probability of occurrence (criticality versus probability of occurrence, with a curve separating the acceptable risk region from the unacceptable risk region)



3. dependability characteristics interdependence

interrelated quantities
The examples given so far have shown that the concept of dependability is a function of four quantifiable characteristics: these are related to each other in the way shown by figure 6.
These four quantities must be considered in all dependability studies. The dependability is thus often designated in terms of the initials RAMS.
Reliability: probability that the system be failure free in the interval [0,t].
Availability: probability that the system works at time t.
Maintainability: probability that the system be repaired in the interval [0,t].
Safety: probability that a catastrophic event is avoided.

fig. 6: the components of dependability (AVAILABILITY, SAFETY, RELIABILITY, MAINTAINABILITY)

conflicting requirements
Some of the requirements of the dependability can be contradictory.
An improved maintainability can bring about some choices which degrade the reliability (for example, the addition of components to simplify the assembly-disassembly operations). The availability is therefore a compromise between reliability and maintainability. A dependability study allows the analyst to obtain a numerical estimate of this compromise.
Similarly, safety and availability might conflict with each other.
We have noted that the safety of a system is defined as the probability to avoid a catastrophic event and is often maximum when the system is stopped. In this case, its availability is zero! Such a case arises when a bridge is closed to traffic when there is a risk of collapse. Conversely, to improve the availability of their fleet, certain airlines are known to have neglected their preventive maintenance activities thus diminishing flight safety. In order to ascertain the optimum compromise between safety and availability it is necessary to produce a scientific computation of these characteristics.

A system can be described as being in one of three states, see figure 7. In addition to the normal functioning state, two further failed states can be considered: a failsafe state and a state of dangerous failure. In order to simplify this description we are including in the failed states all modes of degraded performance, labeled "incorrect performance".
The time spent before leaving state A is characteristic of the reliability. The time spent in state B, after a safe failure, is characteristic of the maintainability. The ratio between the time spent in state A and the total time is characteristic of the availability.
The aptitude of the system to avoid spending any time in state C is a characteristic of safety. It can be seen that state B is acceptable in terms of safety but is a source of unavailability.

fig. 7: failsafe: availability; dangerous failure: safety (state diagram. STATE A: normal functioning. STATE B: incorrect performance and not dangerous, reached from A by a failsafe failure and left by repair. STATE C: incorrect performance and dangerous, reached from A by a dangerous failure)



time average related quantities
In addition to the previously mentioned probabilities (reliability, availability, maintainability and safety) of occurrence of events, it is common to use mean times before the occurrence of events in order to describe the dependability.
Mean times
It is useful to recall here the exact definition of all the mean times as they are often misunderstood. The worst example of abuse is probably the most widely known, the MTBF, which is often confused with lifetime.
On the average, in a homogeneous population of items following an exponential distribution, about 2/3 of these items will have failed after a time equal to the MTBF. A single system having a constant failure rate will have a 63% chance of having failed after such a time.
The definitions and relative positions of these mean times during the life of a system are given in figure 8.
MTTF or MTFF (Mean Time To First Failure): the mean time before the occurrence of the first failure.
MTBF (Mean Time Between Failures): mean time between two consecutive failures in a repairable system.
MDT (Mean Down Time): mean time between the instant of failure and total restoration of the system. It includes the failure detection time, the repair time and the reset time.
MTTR (Mean Time To Repair): mean time to actually restore the system to an operating condition.
MUT (Mean Up Time): mean failure free time.

fig. 8: diagram for mean times in the case of a system with no interruptions due to preventive maintenance (time axis alternating between up state and failed state, with successive failure and repair events; MTTF runs up to the first failure, each MTBF spans one MDT followed by one MUT)

Important relations and numerical values
There are many mathematical relations linking the quantities introduced thus far:
For an exponential distribution with R(t) = exp(-λt) one has MTTF = 1/λ. In this case, for a non repairable system, we have MTBF = MTTF (in fact, in this case, all failures are "first" failures). This explains why the classical formula used for electronic components (non repairable) is: MTBF = 1/λ.
The above formula is only valid for exponential distributions (constant failure rates) and, strictly speaking, for non repaired items although it is possible to apply it for repaired systems with very small MDTs. Analogously, when repair times obey an exponential distribution, it is possible to show that MTTR = 1/µ.
One also has: MTBF = MUT + MDT. In general it is also true that MDT = MTTR, except for the logistic delay and restart times. Furthermore:
■ asymptotic availability

A_\infty = \lim_{t \to +\infty} A(t) = \frac{MUT}{MTBF}

This formula illustrates the assertion made earlier concerning the availability (ratio of correct performance time to total time). This quantity corresponds to the asymptotic value given in figure 4.
■ asymptotic unavailability = 1 - asymptotic availability

U_\infty = \lim_{t \to +\infty} \left( 1 - A(t) \right)

The asymptotic unavailability is usually easier to express numerically than the availability: it is much easier to read 10^-6 than 0.999999.
For exponential distributions, using the equations MUT = 1/λ and MDT = 1/µ one obtains:

U_\infty = \frac{\lambda}{\lambda + \mu} \quad \text{or} \quad A_\infty = \frac{\mu}{\lambda + \mu}
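The relations above can be illustrated with a short numerical sketch. It is not taken from the report: the closed-form A(t) used here is the standard result for a single repairable item that is up at t = 0 with constant failure rate λ and constant repair rate µ, and the numerical values and function names are hypothetical. It shows A(t) starting at 1 and settling at MUT/MTBF = µ/(λ+µ), the limit drawn in figure 4.

```python
import math

def availability(t, lam, mu):
    """Point availability A(t) of one repairable item, up at t = 0,
    with constant failure rate lam and constant repair rate mu
    (standard two-state result, assumed here, not stated in the text)."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

def asymptotic_availability(mut, mdt):
    """A_inf = MUT / MTBF, with MTBF = MUT + MDT."""
    return mut / (mut + mdt)

lam = 1e-3   # hypothetical failure rate: one failure per 1000 h on average
mu = 0.1     # hypothetical repair rate: 10 h mean down time
mut, mdt = 1.0 / lam, 1.0 / mu

a_inf = asymptotic_availability(mut, mdt)
for t in (0.0, 10.0, 50.0, 500.0):
    print(f"A({t:5.0f} h) = {availability(t, lam, mu):.6f}")
print(f"A_inf = {a_inf:.6f}   U_inf = {1.0 - a_inf:.2e}")
```

The time constant of the transient is 1/(λ+µ), so with repair much faster than failure the limit is reached after a few mean repair times.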



λ is often much smaller than µ since the repair times are much smaller than the times to failure. It is therefore possible to simplify the denominator and write:

U_\infty = \frac{\lambda}{\mu} = \lambda \cdot MTTR

This last formula illustrates, in the case of exponential distributions, the compromise between reliability and maintainability which has to be optimized to improve the availability.
The table of figure 9 gives failure rates and mean times to failure for certain devices belonging to the electronic and electrotechnical fields.

          resistors        microproc.   fuses and circuit-breakers,   generator   mains outages
                                        300 ft. cables, busbars
λ (/h)    10^-9            10^-6        10^-7 to 10^-6                10^-5       10^-2
MTTF      1000 centuries   100 years    100 to 1000 years             10 years    4 days

fig. 9: failure rates and mean times to failure for certain devices belonging to the electronic and electrotechnical fields

It can be seen that the reliability is degraded when the complexity of the system increases. This corresponds to a well-known rule of dependability design: simplify as much as possible.
The concept of mean time is often misunderstood. For example the next two sentences have, for exponential distributions, the same meaning: "The MTTF is 100 years" and "The odds are one in 100 to observe a failure in the first year". Still, the second sentence seems more worrisome for a manufacturer selling 10 000 devices of this type per year. On the average, about 100 units will fail in the first year.
To illustrate the impact of redundancy on the unavailability, consider the national power grid. One is concerned with the delivery of energy to the final user. The unavailability is about 10^-3. This corresponds to about 9 hours of downtime per year. For a computer room having a heavily redundant system of Uninterruptible Power Supplies (UPS), it is possible to reduce this figure by a factor of 1000 to 10 000.
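The orders of magnitude quoted above are easy to reproduce. The sketch below is illustrative only; the function names and the failure and repair rates used for the approximation check are hypothetical. It compares the exact unavailability λ/(λ+µ) with the approximation λ.MTTR, converts an unavailability of 10^-3 into hours of downtime per year, and computes the expected number of first-year failures among 10 000 devices with an MTTF of 100 years.

```python
import math

HOURS_PER_YEAR = 8760

def unavailability_exact(lam, mu):
    """U_inf = lam / (lam + mu) for exponential failure and repair times."""
    return lam / (lam + mu)

def unavailability_approx(lam, mttr):
    """U_inf ~ lam * MTTR, valid when lam is much smaller than mu = 1 / MTTR."""
    return lam * mttr

def downtime_hours_per_year(unavailability):
    """Mean down time accumulated over one year."""
    return unavailability * HOURS_PER_YEAR

def first_year_failure_probability(mttf_years):
    """1 - R(1 year) for an exponential distribution with the given MTTF."""
    return 1.0 - math.exp(-1.0 / mttf_years)

lam, mttr = 1e-4, 10.0   # hypothetical: one failure per 10 000 h, 10 h to repair
print(unavailability_exact(lam, 1.0 / mttr), unavailability_approx(lam, mttr))

print(f"{downtime_hours_per_year(1e-3):.2f} h of downtime per year")

p = first_year_failure_probability(100.0)
print(f"P(failure in first year) = {p:.5f}, about {10_000 * p:.0f} units out of 10 000")
```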

4. types of defects

The design of a system with respect to its dependability goals implies the need to identify and take into account the various possible causes of defects.
One can suggest the following classification:

physical defects
induced by internal causes (breakdown of a component) or external causes (electromagnetic interferences, vibrations,...).

design defects
comprising hardware and software design errors.

operating errors
arising from an incorrect use of the equipment:
■ hardware being used in an inappropriate environment,
■ human operating or maintenance errors,
■ sabotage.
The various techniques discussed in this document concern mostly physical defects. Nevertheless, human and software errors are also very important although the state of the art in these fields is not as advanced as for physical defects. Still, within the scope of this document, we feel the following elements are worth mentioning:
Software aspects
■ the reliability of a piece of software in which all the inputs are exhaustively tested is equal to 1 forever. Nevertheless, this is unrealistic for real life, complex programs.
■ having two redundant programs implies development by different software teams using different algorithms. This is the principle behind fault tolerant software in which a majority vote may be implemented.
■ most software reliability models can be split in two major categories:
  ■ complexity models: based upon a measure of the complexity of the code or algorithm,
  ■ reliability growth models: based upon previous observed failure history.
■ the quantitative evaluation of the different models does not allow yet for a systematic study of software reliability. The best results are obtained in particular cases and for given environments (language, methods). This is the case for the SPIN (Integrated Digital Protection System) software developed by Merlin Gerin for use in nuclear power plants. Merlin Gerin is also an active participant in different working groups dealing with software reliability (see references). The Technical Paper CT 117 gives further details on this subject. The title is "Methods for developing dependability related software".
Human reliability
Some of the recent major catastrophes have shown that the human factor can have great impact, not only from the operator standpoint but also at the designer's stage. The more freedom of action is given to a human operator the more the risks are increased. This also includes management, as the Challenger Space Shuttle accident has shown: it is possible to go all the way up to the designers of the working structure of the designer's team! Many disciplines are called upon to tackle the problem of human reliability, among them psychology and ergonomics.
Qualitative approaches are predominant in this field. The efforts lie mostly in the modeling of the human operator, task classification and human errors. The most advanced studies belong to the nuclear and aerospace industries. Human behavior is known as much by simulators as by field reports. Both sources can be compared to each other. Some references exist which propose some numerical values. However, these must be used with utmost caution. According to these references it is feasible to assign an error probability depending on the nature of the activity: mechanical, procedure or cognitive action.

5. from component to system: modeling aspects

data bases for system components
Electronics
Reliability calculations have been widely used in this field for many years. The two best known data bases are the Military Handbook 217 (version E at present) issued in the U.S. and the "Recueil de données de fiabilité", from CNET (French Telecom Center), see figure 11 for an example. Merlin Gerin participates in its updates.
These data bases allow the calculation of the failure rates of electronic components, assumed to be constant. These rates are a function of the application characteristics, environment, load, etc. The type of component is also relevant, e.g., number of gates, value of the resistance, etc. Computation is usually faster with the CNET approach but many specialized computer programs exist to implement either technique with ease.
As an example, let us take a 50 kΩ resistor mounted on an electronic board and used inside an electric switchboard.
It is necessary to consult the table given in figure 11 in order to determine the corresponding correcting values. The environment is "au sol" (fixed, ground) and therefore, the environment corrective factor is:
ΠE = 2.9
The resistance value gives the corresponding multiplying factor:
ΠR = 1
This resistor is taken as being "non qualified" which gives the multiplying quality factor
ΠQ = 7.5
The load factor ρ is a characteristic of the application, as opposed to the other factors which are characteristic of the component itself. If the load factor is 0.7 and the environmental temperature for the board is 90°C, the diagram gives
λb = 15
The global failure rate for this resistor is thus obtained by multiplying all the corrective factors and the base failure rate:
λ = λb.ΠR.ΠE.ΠQ = 0.33 x 10^-6 / hour
If at the design stage the reliability goals have been integrated, then:
■ better thermal designs will allow a lowering of the environment temperature,
■ better board designs will lower the load factor ρ.
With t = 60°C and ρ = 0.2 the diagram gives:
λb = 1.7
If now a qualified component is selected, we have: ΠQ = 2.5, which gives λ = 0.012 x 10^-6 / hour, that is an improvement factor of about 30.
Knowledge of the reliability of each component provides a means to obtain the reliability of the boards (which are repairable or replaceable), and therefore that of whole electronic systems. This is done by using the techniques described in the rest of this report.
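The multiplication of the base failure rate by the corrective factors can be written out explicitly. In this sketch the function and variable names are hypothetical, and the base failure rate is assumed to be expressed in units of 10^-9 per hour: this unit is not stated in the text but is the one that reproduces the quoted results of 0.33 x 10^-6 and 0.012 x 10^-6 per hour.

```python
def component_failure_rate(base_rate, pi_r, pi_e, pi_q):
    """Failure rate = base rate x resistance factor x environment factor
    x quality factor. The result has the same unit as base_rate
    (assumed here to be 1e-9 per hour)."""
    return base_rate * pi_r * pi_e * pi_q

# Initial design: rho = 0.7, 90 degrees C, non qualified component.
initial = component_failure_rate(base_rate=15, pi_r=1, pi_e=2.9, pi_q=7.5)

# Improved design: rho = 0.2, 60 degrees C, qualified component.
improved = component_failure_rate(base_rate=1.7, pi_r=1, pi_e=2.9, pi_q=2.5)

print(f"initial : {initial * 1e-9:.3e} per hour")
print(f"improved: {improved * 1e-9:.3e} per hour")
print(f"improvement factor: {initial / improved:.1f}")
```

The exact ratio of the two products is close to 26; the improvement factor of 30 quoted in the text is a rounded order of magnitude.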
As an example, let us take a 50 kΩ



Mechanics and electromechanics
Data bases in these fields exist although they are not really "standards". Some sources are:
■ RAC, NPRD 3: report by the Reliability Analysis Center (RADC, Griffiss AFB), under contract from the US DoD, dealing with non electronic parts.
■ IEEE STD 500: field data on reliability of electrical, electronic and mechanical equipment used in nuclear power plants.
In France and the US, some reference books exist that deal specifically with mechanical components.
As an example of data relevant to our activities, figure 10 gives some information concerning circuit breakers. This comes from RAC's NPRD 3-1985. First, there is a failure mode distribution in a pie chart. For example, 34% of all field failures are due to the circuit breaker failing to open when it should. The table in figure 10 gives a point estimate of the failure rate for the thermal function of circuit breakers. Various information items given are as follows:
■ environment: GF, Ground Fixed, industrial conditions.
■ failure rate estimate: 0.335 x 10^-6 h^-1
■ a 60% confidence interval for the failure rate using the 20% lower and 80% upper bounds.
■ the number of records used in this calculation, i.e. 2.
■ the number of observed failures: here 3.
■ the total number of operating hours: 8.944 x 10^6 h.
The actual knowledge of the global failure rate and the failure mode distribution allows the calculation of the probability of specific events by using a simple proportionality rule.
For example, for the "stuck closed" mode, we have a corresponding failure rate of:
0.335 x 10^-6 x 34/100 = 1.14 x 10^-7 h^-1
Another approach can sometimes be more relevant: instead of considering the calendar time, the number of make-break operations can be tallied. Then, a test is planned in which a sample is selected and the reliability is estimated using a more realistic model (e.g. Weibull distribution).
Which technique to use is largely a matter of determining the kind of failure one wishes to study: contact wear is related to the number of make and break cycles whereas corrosion is time dependent. Specific use and environment conditions are always important.

Failure mode distribution (pie chart): noisy, no movement, intermittent, degraded, stuck closed, stuck open, out of adjustment, others. The shares shown are 34 %, 15 %, 15 %, 9 %, 8 %, 8 %, 6 % and 4 %; the 34 % share is the "stuck closed" mode.

component   APPL   user   point      60 % upper    20 % lower   80 % upper   no. of   no. of   operating
part type   ENV    code   estimate   single-side   interval     interval     recs     fail     HRS (E6)
thermal     GF     M      0.335      -             0.171        0.621        2        3        8.944

fig. 10: failure modes and reliability data for circuit breakers
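The proportionality rule applied to the circuit breaker data can be sketched as follows. The function names are hypothetical. The first function, which divides the number of observed failures by the cumulated operating hours, is an observation about the table of figure 10 (3 failures over 8.944 x 10^6 h gives the tabulated 0.335 x 10^-6 per hour) rather than a method stated in the text.

```python
def point_estimate(failures, operating_hours):
    """Failure rate estimate: observed failures / cumulated operating hours."""
    return failures / operating_hours

def mode_failure_rate(global_rate, mode_share):
    """Proportionality rule: rate of one failure mode = global rate x share."""
    return global_rate * mode_share

global_rate = point_estimate(failures=3, operating_hours=8.944e6)
stuck_closed = mode_failure_rate(0.335e-6, 0.34)

print(f"global failure rate : {global_rate:.3e} per hour")
print(f"stuck closed mode   : {stuck_closed:.3e} per hour")
```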



Readers interested in this kind of information can refer to the American standard MIL-HDBK-217E.

fig. 11: example of CNET publications



Failure Modes, Effects and Criticality Analysis (FMECA) method
This is a technique to analyse the reliability of a system in terms of the failure modes of its components. The IEC has issued a standard (IEC 812) giving a description of this technique. Each element of the system can, in turn, be analyzed using one of the relevant data bases. The hardware structure of the system as well as its functional characteristics allow the analyst to inductively assess the effect of each and all of the failure modes corresponding to each element and their effects on the system.
An FMECA should also give an estimate of the criticality of each failure mode, see figure 12. This depends on two factors: the probability of occurrence of failure and the seriousness of its consequences. Thus an FMECA is a tool to study the influence of the component failures on the system.
The main interest of this technique lies in its exhaustiveness. It is nevertheless incomplete in that the combination of effects must be separately considered. This can be accomplished using the methods described in the rest of this chapter.

component         function              failure mode      cause         effect               criticality   comments
circuit-breaker   switch                stuck closed      solder        no shedding          2
«                 «                     unable to close   mechanical    no power             2
«                 short circuit prot.   unable to open    solder        no protect           4             action
«                 current path          sudden open       adjustment    no power             3
«                 «                     heat              bad contact   electronic failure   2

fig. 12: example of FMECA table

Reliability Block Diagram (RBD)
The RBD method is a simple tool to represent a system through its (non-repairable) components. Using the RBD allows the computation of the reliability of systems having series, parallel, bridge and k-out-of-n architectures or any of their combinations. Although it is possible to apply the RBD technique to repairable systems, the implementation is much more difficult.
Series-parallel systems
Two components are in series, from the reliability standpoint, if both are necessary to perform a given function. They are in parallel when the system works if at least one of the two components works, see figure 13.

fig. 13: series/parallel systems (series: blocks 1 and 2 one after the other; parallel: blocks 1 and 2 side by side)

These considerations are easily generalized to more than two components.
Whenever two components are in series and can be considered to be independent (the failure of one does not modify the probability of failure of the other), the reliability of this system can be calculated by multiplying the individual reliabilities together since the first component AND the second must work:
R(t) = R1(t).R2(t).
In the case of two independent components in parallel, the system works if one OR the other works. It is easy to calculate the unreliability of the system since it is equal to the product of the two component unreliabilities: the system fails if the first component AND the second component fail:
1 - R(t) = (1 - R1(t)).(1 - R2(t)).
Or equivalently:
R(t) = R1(t) + R2(t) - R1(t).R2(t).
In this case, components 1 and 2 are said to be in active redundancy. The redundancy would be passive if one of the parallel components is turned on only in the case of failure of the first. This is the case of auxiliary power generators.
For the particular case of non repairable components following an exponential distribution of times to failure, one can write:
For the series case:
R(t) = exp(-λ1t).exp(-λ2t) = exp(-(λ1+λ2)t).
It follows that the system's times to failure also follow an exponential distribution (constant failure rate), since the reliability function is an exponential with:
λ = λ1 + λ2
For the parallel case:
R(t) = exp(-λ1t) + exp(-λ2t) - exp(-(λ1+λ2)t).
Here, the reliability function is not an exponential. Therefore, it can be concluded that the failure rate is not constant.
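The series and parallel rules for independent, non repairable components translate directly into code. The following is a minimal sketch with hypothetical failure rates and function names; it checks that the series combination of two exponential components is again exponential with rate λ1 + λ2, and that the parallel combination equals R1 + R2 - R1.R2.

```python
import math

def series(*reliabilities):
    """All components must work: product of the reliabilities."""
    return math.prod(reliabilities)

def parallel(*reliabilities):
    """At least one component must work: 1 - product of the unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

def exp_reliability(lam, t):
    """R(t) = exp(-lam * t) for a constant failure rate lam."""
    return math.exp(-lam * t)

lam1, lam2 = 1e-4, 2e-4   # hypothetical failure rates, per hour
t = 1000.0                # hypothetical mission time, hours

r1, r2 = exp_reliability(lam1, t), exp_reliability(lam2, t)
r_series = series(r1, r2)
r_parallel = parallel(r1, r2)

print(f"R1 = {r1:.4f}  R2 = {r2:.4f}")
print(f"series   : {r_series:.4f}  (exp(-(lam1+lam2)t) = {math.exp(-(lam1 + lam2) * t):.4f})")
print(f"parallel : {r_parallel:.4f}")
```

Both helpers accept any number of components, which is how the formulas generalize beyond two blocks.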



All these formulas can be generalized to a system with n non repairable components, mixing series and parallel architectures.

k-out-of-n redundancies
A k-out-of-n system, or simply K/N, is an n-component system in which k or more components are needed for the system to work properly. We will consider only active redundancies here, see figure 14.

fig. 14: K/N redundant systems

Let us call Ri(t) the reliability of each one of the n components of the system. In some simple cases the reliability of the system can be computed by adding the probabilities of the mutually exclusive favourable combinations:
■ 2/3 system (exactly two components working, or all three):
R = R1.R2 + R1.R3 + R2.R3 - 2.R1.R2.R3
■ series system (n/n):
$$R(t) = \prod_{i=1}^{n} R_i(t)$$
■ parallel system (1/n):
$$1 - R(t) = \prod_{i=1}^{n} \bigl(1 - R_i(t)\bigr)$$
■ k/n system of identical components: if we write Ri(t) = r(t), then
$$R(t) = \sum_{i=k}^{n} C_n^i \, r(t)^i \, \bigl(1 - r(t)\bigr)^{n-i}$$
where C_n^i is the number of combinations of i components among n.
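The k/n formulas can be checked numerically. The sketch below is an illustration only: the function names and the sample reliabilities are assumptions. It evaluates the identical-component formula and, for components with different reliabilities, sums the probabilities of all mutually exclusive working and failed combinations.

```python
from itertools import product
from math import comb


def k_out_of_n_identical(k, n, r):
    """Reliability of a k/n system of identical components of reliability r."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


def k_out_of_n(k, reliabilities):
    """Reliability of a k/n system by enumerating all 2^n component states."""
    total = 0.0
    for states in product((True, False), repeat=len(reliabilities)):
        if sum(states) >= k:  # at least k components work in this combination
            p = 1.0
            for works, r in zip(states, reliabilities):
                p *= r if works else 1 - r
            total += p
    return total


# Hypothetical component reliabilities, for illustration only.
r1, r2, r3 = 0.9, 0.8, 0.7
print(k_out_of_n(2, [r1, r2, r3]))
print(r1 * r2 + r1 * r3 + r2 * r3 - 2 * r1 * r2 * r3)
```

Both printed values agree, and k = n or k = 1 reproduce the series and parallel formulas. The enumeration grows as 2^n, so it is only practical for small systems.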

Bridge systems
These are systems which cannot be described by simple series-parallel combinations. They can, however, be reduced to series-parallel cases by an iterative procedure, see figure 15.

fig. 15: bridge systems
In order to compute the reliability of this system in terms of the five non repairable component reliabilities it is necessary to apply conditional probabilities:
R = R3.R(given that 3 works) + (1-R3).R(given that 3 has failed).
It is thus possible to derive the system reliability R(t) by decomposing the original bridge system into the two disjoint systems illustrated in figure 16.

Example: reliability of an intrusion detection system.
The system consists of two sensors, a vibration sensor and a photoelectric cell. Each of these sensors could be connected to its specific alarm, as in figure 17, and we would have two independent branches. However, a bridge system would result if each sensor is connected to either one of the two alarms, as in figure 18, through a coupler. We will calculate the reliability improvement due to this modification. Let us also suppose that the mission time of this system is three months, i.e., the maximum expected absence during which the system must function. Furthermore, after each mission, the system is thoroughly checked and maintained and can be considered as good as new when reset. During the mission, there are no repairable elements.
Let us use the following realistic constant failure rates to obtain the different orders of magnitude:
Vibration sensor: λ1 = 2 × 10^-4
Photoelectric cell: λ2 = 10^-4
Coupler: λ3 = 10^-5
Alarms: λ4 = λ5 = 4 × 10^-4
All these failure rates are given in (hours)^-1.
■ computation for Diagram A of figure 17.
This is a simple case of two parallel branches, each having two components in series:
Reliability of Branch 1: R1(t).R4(t)
Reliability of Branch 2: R2(t).R5(t)
System reliability: RA(t) = R1(t).R4(t) + R2(t).R5(t) - R1(t).R4(t).R2(t).R5(t)
Using Ri(t) = exp(-λit) with t = 3 months = 2190 hours as the mission time, one obtains: RA(3 months) = 0.51.



fig. 16: decomposition of a bridge system

fig. 17: alarms with no coupling, diagram A (vibration sensor 1 connected to alarm 1, component 4; photoelectric cell 2 connected to alarm 2, component 5)

■ computation for Diagram B of figure 18.
This is the bridge system. Whenever the coupler has failed we are back to the diagram of figure 17, whose reliability is RA. On the other hand, when it works, we have 1 and 2 in parallel, both in series with 4 and 5, themselves in parallel. The system reliability for figure 18 is then:
RB = (1-R3).RA + R3.(R1+R2-R1.R2).(R4+R5-R4.R5)
The numerical computation gives RB(3 months) = 0.61.
In spite of the excellent reliability of the coupler, the system's reliability is only marginally improved. This numerical example shows, through a simple calculation, that there is not much sense in having a more expensive set-up.
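The two numerical results of the intrusion detection example can be reproduced with a few lines of Python. This is a sketch under the assumptions of the example (independent, non repairable components with exponential times to failure). The function and variable names are illustrative.

```python
import math

MISSION_HOURS = 2190.0  # three months, as in the example

# Failure rates in h^-1 from the example, indexed by component number.
RATES = {1: 2e-4, 2: 1e-4, 3: 1e-5, 4: 4e-4, 5: 4e-4}


def parallel(ra, rb):
    """Reliability of two independent components or branches in parallel."""
    return ra + rb - ra * rb


def diagram_a(r):
    """Two independent branches: sensor 1 with alarm 4, sensor 2 with alarm 5."""
    return parallel(r[1] * r[4], r[2] * r[5])


def diagram_b(r):
    """Bridge system, conditioned on the state of the coupler (component 3)."""
    coupler_works = parallel(r[1], r[2]) * parallel(r[4], r[5])
    coupler_failed = diagram_a(r)
    return r[3] * coupler_works + (1.0 - r[3]) * coupler_failed


R = {i: math.exp(-lam * MISSION_HOURS) for i, lam in RATES.items()}
print(round(diagram_a(R), 2), round(diagram_b(R), 2))  # prints 0.51 0.61
```

Conditioning on the coupler is what turns the bridge into two series-parallel diagrams. The same pattern applies to any bridge by conditioning on its central component.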
fig. 18: system with coupler, diagram B

Case of repairable elements
RBD's cannot be used as systematically as before:
■ for two components in parallel, the equation relating R(t) to R1(t) and R2(t) is no longer valid. In fact, a working system in the interval [0,t] may correspond to an alternating working condition between 1 and 2: with non repairable components at least one component must work throughout the time interval [0,t], whereas with repairable components both can fail, but not simultaneously.
■ the equation R(t) = R1(t).R2(t) remains valid for a series system of two repairable components.
■ in the case of repairable components the main concern is the numerical estimate of the availability. It is possible to use the RBD's with the same formulas as for the reliability calculations:
A(t) = A1(t).A2(t) for a series system
A(t) = A1(t)+A2(t)-A1(t).A2(t) for parallel systems.
These formulas are valid only for simple cases. For instance, the formula A(t) = A1(t)+A2(t)-A1(t).A2(t) ceases to be valid if only one repairman is available (instead of as many as necessary). This sequential feature, i.e. having a component waiting to be repaired while the other is being serviced, cannot be modeled by a simple RBD. State graphs, dealt with later, are suited to such cases.



fault trees analysis
The computation of the system's failure probability is the main goal of this type of analysis. It is based upon a graphical construction representing all the combinations of events, essentially through AND-gates and OR-gates, that may lead to a catastrophic event.
Except for extremely simple cases, computer resources must be used to evaluate the probability of the catastrophic event. It is then possible to modify the structure of the system's design to lower this probability.

Basic procedure
A deep understanding of the system and a clear definition of the "catastrophic event" are essential to build the fault tree. The catastrophic event, sometimes called the "top event", is then analyzed in terms of its immediately preceding causes. Then, each one of these causes is analyzed in terms of its own immediately preceding causes until the basic events are reached. These are supposed to be independent.
A simple example is given in figure 19 and its corresponding fault tree in figure 20. This tree only contains OR-gates connecting the intermediate events (rectangles) and the basic events. The basic events are represented by circles.

fig. 19: electrical supply for a motor (fuse, switch, motor M). The top event is: motor unable to start.

fig. 20: fault tree for fig. 19 circuit (top event: motor idling and unable to start; immediate causes: no power, motor failure; intermediate causes: no + link, no - link, dead battery; basic events: open wire, fuse, switch, open wire)

It is convenient to define a cut-set as a simultaneous combination of basic events that, by themselves, produce the top event.
The analysis proceeds in two phases:
■ qualitative analysis: the minimal cut-sets, or min cuts, are obtained. The min cuts are the minimal combinations of basic events that lead to the top event. The order of a min cut is simply the number of basic events it contains.
■ quantitative analysis: this is performed using the min cuts and the probability of occurrence of the basic events. This gives an approximate value for the probability of the top event. It is also necessary to validate the accuracy of this approximation in a systematic fashion. Then, depending on the objectives of the analysis, different probabilities are used to compute the system reliability or its availability.
We can illustrate these ideas by two examples:
■ an overhead projector with one lamp inside and one spare. The top event is "no working lamp available", see figure 21. A single AND-gate is necessary. The probability of this happening is seen to be 2 in a thousand.



fig. 21: fault tree for an overhead projector (top event "no light", failure probability P; AND-gate over "1st light bulb dead", P1, and "2nd light bulb dead or missing", P2; one order 2 min-cut)
P = P1 × P2 = 0.05 × 0.04 = 2 × 10^-3

■ a simple light bulb. The top event is "no light", see figure 22. A single OR-gate is necessary. The probability of the top event is seen to be about 0.001, i.e. one chance in a thousand of not having light. The main cause for this event is the burn out of the light bulb.
In the general case it is often possible to obtain an exact calculation of the probability of the top event using recursivity instead of the min cuts: Boolean probability calculations are performed for each gate in terms of the sub-trees being input to the gate considered. The assumption of independence must be verified but this procedure leads to an exact evaluation of the top event. Thus, the recursive calculation allows a comparison to the min-cut approach. Both methods are complementary.
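The recursive calculation can be sketched as follows. The tree encoding (a number for a basic event, a tuple for a gate) and the function name are assumptions made for this illustration. They do not describe the software used by the author. The result is exact only when the sub-trees entering each gate are independent, which in particular requires that no basic event appears in more than one branch.

```python
from math import prod


def top_event_probability(node):
    """Evaluate a fault tree recursively.

    A node is either a basic event probability (a number) or a pair
    (gate, children) where gate is "AND" or "OR".
    """
    if isinstance(node, (int, float)):
        return float(node)
    gate, children = node
    probabilities = [top_event_probability(child) for child in children]
    if gate == "AND":
        # All inputs must occur.
        return prod(probabilities)
    if gate == "OR":
        # The event occurs unless none of the inputs occurs.
        return 1.0 - prod(1.0 - p for p in probabilities)
    raise ValueError(f"unknown gate: {gate}")


# The two examples, with the probabilities quoted in figures 21 and 22.
projector = ("AND", [0.05, 0.04])
light_bulb = ("OR", [1e-4, 1e-3])

print(top_event_probability(projector))
print(top_event_probability(light_bulb))
```

Gates can be nested to any depth, for instance `("OR", [("AND", [0.05, 0.04]), 1e-3])`, which is how larger trees would be described with this encoding.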
fig. 22: a fault tree for a light bulb (top event "no light", failure probability P; OR-gate over the basic events "light bulb dead" and "no mains", with probabilities P1 and P2; two order 1 min-cuts)
1 - P = (1 - P1)(1 - P2) = (1 - 10^-4)(1 - 10^-3) = 0.9989

Application of fault tree using min-cuts to the availability of a low voltage network.
The fault tree corresponding to the network given in figure 23 is shown in figure 24. Power is considered to be either present or absent. The top event is assumed to be the absence of power at the output, noted E.
In building this tree certain assumptions are made:
■ only two failure modes are considered for the circuit-breakers: sudden contact break and failure to open upon a short-circuit.
■ each transformer line can, by itself, supply voltage to the main network, to which E belongs.
■ the two mains supplies are coming from two different Medium Voltage sources. This reduces the Common Mode failure to the unavailability of the High Voltage supply.
Each event in the Fault Tree will have a certain probability of occurrence associated with it. In this case the probability will be the unavailability. The unavailability associated with the basic events is calculated by the formula:
U ≈ λ.MTTR.
λ is the failure rate corresponding to a particular failure mode of a component. It can be obtained from several sources of field data.

fig. 23: low voltage network (circuit-breakers A to F, busbars 1 to 3)
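As a hedged illustration of the quantitative phase, the sketch below combines U ≈ λ.MTTR with the usual first-order approximation in which the top event unavailability is the sum, over the min cuts, of the product of the basic event unavailabilities. This approximation is an assumption about the method; the report does not spell out the formula used by its software. All event names, rates, repair times and min cuts are hypothetical and are not the data of figure 24.

```python
from math import prod

# Hypothetical basic events: (failure rate in h^-1, MTTR in hours).
BASIC_EVENTS = {
    "cable": (1e-6, 10.0),
    "busbar": (1e-7, 20.0),
    "line_a": (1e-4, 10.0),
    "line_b": (1e-4, 10.0),
}

# Hypothetical min cuts: two of order 1 and one of order 2.
MIN_CUTS = [("cable",), ("busbar",), ("line_a", "line_b")]


def basic_unavailability(failure_rate, mttr):
    """U ~ lambda.MTTR, valid when lambda.MTTR is much smaller than 1."""
    return failure_rate * mttr


def quantify(min_cuts, basic_events):
    """Return the approximate top event unavailability and each cut's share in percent."""
    u = {name: basic_unavailability(lam, mttr) for name, (lam, mttr) in basic_events.items()}
    cut_values = [prod(u[name] for name in cut) for cut in min_cuts]
    total = sum(cut_values)
    shares = [100.0 * value / total for value in cut_values]
    return total, shares


total, shares = quantify(MIN_CUTS, BASIC_EVENTS)
for cut, share in zip(MIN_CUTS, shares):
    print(cut, round(share, 1))
print("minutes per year:", total * 365 * 24 * 60)
```

The percentage attached to each min cut plays the role of the importance measure: the cut with the largest share is the first candidate for improvement.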



fig. 24: fault tree corresponding to fig. 23 network (top event: no power in output E; gates are labelled G11* to G62* and basic events 2*1* to 7*4*)



MTTR is the Mean Time to Repair and it depends on the component being considered as well as the particular installation, technology, geographical location and service contract.
In some instances a specific value of a probability is unknown. A worst case situation, or upper bound, is therefore assumed. For example, we have taken the upper bound probability of a short-circuit above F to be 10^-2.
The results of the Fault Tree Analysis, shown in figure 25, indicate that the unavailability on output E is about 10^-5, which corresponds to about 5 minutes per year. The min cut approach allows, in addition to the calculation of the probability of the top event, the assessment of the weight each min cut carries in producing the top event. Figure 25 also shows this weight, as the percentage of the total unavailability which can be attributed to each min cut. This contribution is one measure of the importance of the min cut.
An examination of the relative importances of the min cuts shows that the cable linking busbar 1 to busbar 3 (third min cut) is critical. To a lesser extent this is also true of the two busbars 1 and 3. If these components were improved, the mains supply would then become critical. If a further improvement of the overall availability became essential, it would be necessary to incorporate an auxiliary power supply, such as a diesel generator.
A detailed study of the availability of an electrical supply is presented in Merlin Gerin's Technical paper "Sureté et distribution électrique" (in French).

fig. 25: contributions of network components to its unavailability
unavailability: 1.01 E-05, i.e. 1.01 × 10^-5
list of min cuts and their importance (min cuts indicated on the fault tree, percent contribution):
 1 : 2*1*        : 9.5
 2 : 2*3*        : 1.6
 3 : 3*1*        : 68
 4 : 3*2*        : 1.6
 5 : 3*4* , 3*5* : 0.013
 6 : 4*1*        : 9.5
 7 : 5*2*        : 9.9
 8 : 5*4* , 6*3* : 9.1 E-6
 9 : 5*4* , 6*4* : 3.2 E-6
10 : 7*1* , 7*3* : 0.00058
11 : 7*1* , 7*4* : 1.3 E-5
12 : 7*2* , 7*3* : 1.3 E-5
13 : 7*2* , 7*4* : 2.7 E-7

state graphs
State graphs, also called Markov graphs, allow a powerful modeling of systems under certain restrictive assumptions. The analysis proceeds from the actual construction of the graph to solving the corresponding equations and, finally, to the interpretation of results in terms of reliability and unavailability. Mathematically, a great simplification is obtained by considering only the calculation of time independent quantities.

Construction of the graph
The graph represents all the possible states of the system as well as the transitions between these states. These transitions correspond to the different events that concern the components of the system. In general, these events are either failures or repairs. As a consequence, the transition rates between states are essentially failure rates or repair rates, possibly weighted by probabilities such as that of an equipment refusing to turn on upon demand.
The graph on figure 26 shows the behavior of a system with a single repairable component.

fig. 26: elementary state graph (up state to down state at failure rate λ; down state to up state at repair rate µ)

Assumptions
A model is said to be markovian if the following conditions are satisfied:
■ the evolution of the system depends only on its present state and not on its past history,
■ the transition rates are constant, i.e. only exponential distributions are considered,
■ there is a finite number of states,
■ at any given time there cannot be more than one transition.

Equations
Under the above hypotheses, the probability of the system being in state Ei at time t+dt can be written as:
Pi(t+dt) = P(the system is in state Ei and it stays there) + P(the system comes from another state Ej).
For a graph having n states, n differential equations are obtained which can be written as:
$$\frac{d\Pi(t)}{dt} = \Pi(t)\,[A]$$
where: Π(t) = [P1(t), P2(t), …, Pn(t)]
[A] is called the transition matrix of the graph.
The solution of this equation in matrix form is performed by computer and gives the probabilities Pi(t), that is the probability of the system being in state i as a function of all the transition rates and the initial state.

Computation of dependability quantities
The availability being the probability of the system being in a working state, it follows:
$$A(t) = \sum_i P_i(t)$$
where Pi(t) = probability of being in working state Ei.
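For the elementary graph of figure 26 the matrix equation can be integrated in a few lines. The sketch below is an illustration only: the rates are hypothetical, the function names are assumptions, and a classical fourth-order Runge-Kutta scheme stands in for whatever solver a dedicated tool would use. The result is compared with the well-known closed form A(t) = µ/(λ+µ) + λ/(λ+µ).exp(-(λ+µ)t) for a component that starts in the up state.

```python
import math


def transition_matrix(lam, mu):
    """Matrix [A] for states (up, down): off-diagonal terms are transition rates."""
    return [[-lam, lam], [mu, -mu]]


def derivative(pi, a):
    """Row vector times matrix: dPi/dt = Pi.[A]."""
    n = len(pi)
    return [sum(pi[i] * a[i][j] for i in range(n)) for j in range(n)]


def integrate(pi0, a, t_end, steps):
    """Integrate dPi/dt = Pi.[A] from 0 to t_end with a Runge-Kutta (RK4) scheme."""
    h = t_end / steps
    pi = list(pi0)
    for _ in range(steps):
        k1 = derivative(pi, a)
        k2 = derivative([p + 0.5 * h * k for p, k in zip(pi, k1)], a)
        k3 = derivative([p + 0.5 * h * k for p, k in zip(pi, k2)], a)
        k4 = derivative([p + h * k for p, k in zip(pi, k3)], a)
        pi = [p + h * (a1 + 2 * a2 + 2 * a3 + a4) / 6
              for p, a1, a2, a3, a4 in zip(pi, k1, k2, k3, k4)]
    return pi


def availability_closed_form(lam, mu, t):
    """A(t) for one repairable component that is up at t = 0."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)


LAM, MU = 1e-3, 1e-1  # hypothetical rates in h^-1
p_up, p_down = integrate([1.0, 0.0], transition_matrix(LAM, MU), 50.0, 5000)
print(p_up, availability_closed_form(LAM, MU, 50.0))
```

The availability is the probability of the single working state, here `p_up`. For large t it tends to µ/(λ+µ), a time independent quantity.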



The reliability is the probability of being in a working state without ever having passed through a down state. A graph is constructed by deleting all transitions going from a failed state to a working state. Once the new probabilities Pi'(t) are obtained, we have:
$$R(t) = \sum_i P'_i(t)$$
where the sum runs over the working states.
There are two other quantities which are very simple to obtain:
■ the mean time of state occupancy:
$$T_i = \frac{1}{\sum (\text{rates of departure from state } i)}$$
■ the occupancy frequency corresponding to state i:
$$f_i = \frac{P_i}{T_i}$$
The characteristic mean times MTTF, MTTR, MUT, MDT, MTBF are calculated using matrix calculus and some of the equations already discussed. For the MTTF, the initial state of the system must be specified in terms of the probabilities of the system being initially in each one of its different states.

Application: Uninterruptible Power Supplies (UPS) in parallel
A UPS is a device which improves the quality of the electrical supply. It is often used for critical applications such as computers and their peripherals. We will consider a typical configuration (Triple Modular Redundancy), i.e. the UPS's constitute a 2/3 redundant system. The unavailability is not the only quantity of interest: the MTTF gives the mean time before the first black-out.
In the construction of the state graph it is here possible to use the fact that the three UPS's are identical and therefore states can be grouped according to the number of failed UPS's. The failure and repair rates for the UPS's, λ and µ respectively, are given in figure 27.

fig. 27: UPS's in parallel (states 0, 1, 2 and 3; failure transitions 3λ, 2λ and λ from each state to the next; repair transitions µ, 2µ and 3µ back to the previous state)

The number associated with each state corresponds to the number of failed UPS's. Each working UPS in state Ei adds its own exit rate λ towards state Ei+1. These exit rates are 3λ, 2λ and λ respectively.
The up states are 0 and 1. We assume that the repair strategy is such that up to three repairmen can work simultaneously, one on each failed UPS. Thus, the transition rates corresponding to the repair activity are proportional to the number of failed UPS's in the state being considered. The numerical values are as follows:
λ = 2 × 10^-5 h^-1 ; µ = 10^-1 h^-1
Figure 28 gives the computed results corresponding to the time independent quantities.

fig. 28: values corresponding to the graph on figure 27
Time independent quantities (times in hours):
Unavailability : 1.199360E-07    Availability : 9.999999E-01
MTTF           : 4.169167E+07    MTTR         : 8.333667E+00
MUT            : 4.169167E+07    MDT          : 5.000333E+00
MTBF           : 4.169167E+07

It can be seen that the MTTF is here 4.17 × 10^7 hours whereas the non redundant case (3/3) has an MTTF equal to 1/(3λ) = 1.67 × 10^4 hours.
For the asymptotic unavailability the change is from 1.20 × 10^-7 for the redundant system to 6 × 10^-4 for the non redundant (3/3) system. The comparison of these figures is easily visualized through the graph itself: in the redundant case, the unavailability U is calculated by summing the probabilities of the two failed states, i.e., U = P2+P3 while, in the non redundant case, the sum is performed over three failed states:
U = P1+P2+P3
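The time independent figures of this application can be recomputed by hand from the graph of figure 27. The sketch below is an illustration: the function names are assumptions, and the steady-state probabilities are obtained from the balance of flows between neighbouring states, which applies because the graph is a simple chain. The MTTF is derived from the mean times spent in the up states, once starting from state 0 and once from state 1, since the initial state used for figure 28 is not stated.

```python
LAM = 2e-5  # failure rate of one UPS, h^-1
MU = 1e-1   # repair rate of one UPS, h^-1


def steady_state(up_rates, down_rates):
    """Steady-state probabilities of a chain 0 <-> 1 <-> ... <-> n.

    up_rates[i] is the rate from state i to i+1, down_rates[i] the rate
    from state i+1 back to i. Balance: P[i] * up = P[i+1] * down.
    """
    weights = [1.0]
    for up, down in zip(up_rates, down_rates):
        weights.append(weights[-1] * up / down)
    total = sum(weights)
    return [w / total for w in weights]


def mttf_two_out_of_three(lam, mu):
    """Mean time to first system failure, starting from state 0 and from state 1.

    T0 = 1/(3.lam) + T1
    T1 = 1/(2.lam + mu) + mu/(2.lam + mu) * T0
    """
    t0 = (1 / (3 * lam) + 1 / (2 * lam + mu)) * (2 * lam + mu) / (2 * lam)
    t1 = t0 - 1 / (3 * lam)
    return t0, t1


p = steady_state([3 * LAM, 2 * LAM, LAM], [MU, 2 * MU, 3 * MU])
u_redundant = p[2] + p[3]             # 2/3 system: down states 2 and 3
u_non_redundant = p[1] + p[2] + p[3]  # 3/3 system: down states 1, 2 and 3
mttf_from_0, mttf_from_1 = mttf_two_out_of_three(LAM, MU)
mttf_non_redundant = 1 / (3 * LAM)

print(u_redundant, u_non_redundant)
print(mttf_from_0, mttf_from_1, mttf_non_redundant)
```

Both MTTF values round to the 4.17 × 10^7 hours quoted in the text. The value obtained from state 1 coincides with the MTTF and MUT figures listed in figure 28.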



6. conclusion

Dependability is a concept becoming ever more critical for comfort, efficiency and safety. It can be controlled and calculated. It can be designed in, be it for devices, architectures or systems. Dependability characteristics are now frequently included in specifications and contracts. The existence of computational methods and tools allows the systematic study of dependability during the design phase and for quality assurance purposes.
An intuitive insight, combined with exact or approximate calculations, allows the comparison of different configurations and thus provides an evaluation of the risk associated with a better performance, i.e. performance adapted to clearly specified needs.



7. references and standards

Military Handbook 217E, DoD (U.S.A.), October 1986.
Recueil de données de fiabilité, CNET (Centre National d'Etudes des Télécommunications, France), 1983.
IEEE Std. 493 and IEEE Std. 500 (Institute of Electrical and Electronic Engineers), 1980 and 1984.
NPRD document 3, Nonelectronics Parts Reliability Data, Reliability Analysis Center (RADC), 1985.
A. Pagès, M. Gondran: "Fiabilité des systèmes", Eyrolles, France, 1983.
A. Villemeur: "Sureté de fonctionnement des systèmes industriels", Eyrolles, France, 1988.
International Electrotechnical Vocabulary, VEI 191, International Electrotechnical Commission, June 1988.
Proceedings of the 15th InterRam conference, Portland, Oregon, June 1988.
C. Marcovici, J. C. Ligeron: "Techniques de fiabilité en mécanique", Pic, France, 1974.
EPRI document 3593, Electrical Power Research Institute, Hannaman, Spurgin, 1984.
NUREG document 2254, US Nuclear Regulatory Commission, Bell, Swain, 1983.
Merlin Gerin Technical Report 117: "Méthode de développement d'un logiciel de sureté", A. Jourdil, R. Galera, 1982.
Merlin Gerin Technical Report 134: "Approche industrielle de la sureté de fonctionnement", H. Krotoff, 1985.
Merlin Gerin Technical Report 148: "Sureté et distribution électrique", G. Gatine, 1990.

IEC Standard 271: List of basic terms, definitions and related mathematics for reliability.
IEC Standard 300: Reliability and maintainability management.
IEC Standard 362: Guide for the collection of reliability, availability and maintainability data from field performance of electronic items.
IEC Standard 409: Guide for the inclusion of reliability clauses into specifications for components (or parts) for electronic equipment.
IEC Standard 605: Equipment Reliability Testing.
IEC Standard 706: Guide on maintainability of equipment.
IEC Standard 812: Analysis techniques for system reliability - Procedure for failure mode and effects analysis (FMEA).
IEC Standard 863: Presentation of reliability, maintainability and availability predictions.
IEC Standard 1014: Programmes for reliability growth.

Merlin Gerin's dependability experts have published extensively in this field and have presented papers in most international reliability conferences.
Merlin Gerin is also an active participant in several national and international committees dealing with dependability:
■ chairmanship of the French National Committee for IEC TC 56 activities (dependability) and expert with IEC Working Group 4, TC 56 (statistical methods),
■ software dependability with the European Group of EWICS-TC7: computer and critical applications,
■ French AFCET Working Group on computer systems dependability,
■ updating contributions to the French CNET Electronic components reliability handbook,
■ Working Group IFIP 10.4 on Dependable Computing.
