ECT144
introduction to dependability design
P. Bonnefoi
MERLIN GERIN
service information
38050 Grenoble Cedex
France MERLIN GERIN
tél. : 76.57.60.60
la maîtrise de l'énergie électrique
E/CT 144
GROUPE SCHNEIDER
December 1990
Equipment failures, unavailability of a power supply, stoppage of automated equipment and accidents are quickly becoming unacceptable events, be it to the ordinary citizen or industrial manufacturers.
Dependability and its components: reliability, maintainability, availability and safety, have become a science that no designer can afford to ignore.
This technical report presents the basic concepts and an explanation of its basic computational methods.
Some examples and several numerical values are given to complement the formulas and references to the various computer tools usually applied in this field.

Table of contents
1. Importance of dependability
   In housing
   In services
   In industry
2. Dependability characteristics
   Reliability
   Failure rate
   Availability
   Maintainability
   Safety
3. Dependability characteristics interdependence
   Interrelated quantities
   Conflicting requirements
   Time average related quantities
4. Types of defects
   Physical defects
   Design defects
   Operating errors
5. From component to system: modeling aspects
   Data bases for system components
   FMECA method
   Reliability block diagram
   Fault trees analysis
   State graphs
6. Conclusion
7. References and Standards
1. importance of dependability
Prehistoric men had to depend on their arms for survival. Modern man is surrounded by ever more sophisticated tools and systems on which he depends for safety, efficiency and comfort.

in housing
Ordinary citizens are especially concerned in everyday life by:
■ the reliability of the TV set,
■ the availability of the mains supply,
■ the maintainability of freezers and cars,
■ the safety of their boiler valves.

in services
Bankers and, in general, service industries give a lot of weight to:
■ computer reliability,
■ availability of heating,
■ maintainability of elevators,
■ fire related safety.

in industry
In competitive industries it is not possible to tolerate production losses. This is even more so for complex industrial processes. In these cases one vies to obtain the best:
■ reliability of command and control systems,
■ availability of machine tools,
■ maintainability of production tools,
■ personnel and invested capital safety.
These characteristics, known under the general term of DEPENDABILITY, are related to the concept of reliance (to depend upon something). They are quantified in relation to a goal, they are computed in terms of a probability and are obtained by the choice of an architecture and its components. They can be verified by suitable tests or by experience.
For over 20 years Merlin Gerin has pioneered work in the DEPENDABILITY field: in the past, with its contribution to the design of nuclear power plants or the high availability of power supplies used at the launching site of the ARIANE space program; nowadays, by its design of products and systems used worldwide.
2. dependability characteristics

reliability
Light bulbs are used by everyone: individuals, bankers and industrial workers. When turned on, a light bulb is expected to work until turned off. Its reliability is the probability that it works until time t and it is a measure of the light bulb's aptitude to function correctly.
Definition:
The reliability of an item is the probability that this item will be able to perform the function it was designed to accomplish under given conditions during a time interval (t1,t2); it is written R(t1,t2).
This definition follows the one given by the IEC (International Electrotechnical Commission) International Electrotechnical Vocabulary, Chapter 191. There are certain basic concepts used by this definition which must be detailed:
Function: the reliability is a characteristic assigned to the system's function. Knowledge of its hardware architecture is usually not enough. Functional analysis methods must be used to determine the reliability.
Conditions: the environment has a fundamental role in reliability. This is also true for the operating conditions. Hardware aspects are clearly insufficient.
Time interval: we wish to emphasize an interval of time as opposed to a specific instant. Initially, the system is supposed to work. The problem is to determine for how long. In general t1=0 and it is possible to write R(t) for the reliability function.

failure rate
Consider the light bulb example again. Its failure rate at time t, written as λ(t), gives the probability that it will suddenly burn out in the interval of time (t, t+∆t), given that it kept working until time t. Failure rates are time rates and, as such, their units are inverse time.
Mathematically, the failure rate is written as:
$$\lambda(t) = \lim_{\Delta t \to 0} \left( \frac{1}{\Delta t} \cdot \frac{R(t) - R(t+\Delta t)}{R(t)} \right) = -\frac{1}{R(t)} \frac{dR(t)}{dt} \qquad (1)$$
For a human being, the failure rate measures the probability of death occurring in the next hour:
λ(20 years) = 10^-6 per hour.
If λ is represented as function of age, one obtains the curve given in figure 1.
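As a minimal numerical sketch of equation (1), the Python fragment below estimates λ(t) by finite differences from a reliability function. The function names, the step size and the example rate of 10^-4 per hour are illustrative assumptions, not values from this report. With the exponential law R(t) = exp(-λt), the estimate should return the same constant λ whatever the time at which it is evaluated.

```python
import math

def failure_rate(reliability, t, dt=1e-3):
    """Finite-difference estimate of equation (1):
    lambda(t) ~ (R(t) - R(t + dt)) / (dt * R(t))."""
    r_t = reliability(t)
    return (r_t - reliability(t + dt)) / (dt * r_t)

# Assumed example: constant failure rate of 1e-4 per hour.
LAMBDA = 1e-4

def exponential_reliability(t):
    return math.exp(-LAMBDA * t)

rate_early = failure_rate(exponential_reliability, 100.0)
rate_late = failure_rate(exponential_reliability, 10_000.0)
print(rate_early, rate_late)  # both close to 1e-4 per hour
```

A failure rate that changes with age, as for a human being, would show up here as different values at the two evaluation times.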
[fig. 2: exponential reliability, R(t) = e^(-λt) plotted against t]
[figure: failure rate λ(t) plotted against t, showing the infant mortality period]

availability
To illustrate the concept of availability consider the case of an automobile. A vehicle must start and run upon demand. Its past history may be of little relevance. The availability is a measure of its aptitude to run properly at a given instant.
Definition:
The availability of a device is the probability that this device be in such a state so as to perform the function for which it was designed under given conditions and at a given time t, under the assumption that external conditions needed are assured. We will use the symbol A(t).
This definition, inspired by the one given by the IEC, mimics the one for the reliability. However, its time characteristics are basically different since the concept of interest is an instant of time instead of a time length. For a repairable system, functioning at time t does not necessarily imply functioning throughout [0,t]. This is the main difference between availability and reliability.
It is possible to plot the availability curve [...]
[fig. 4: availability as a function of time t, with asymptotic level D∞]

maintainability
Many designers seek top performance for their products, sometimes neglecting to consider the possibility of failure. When all the effort has been concentrated on having a functioning system, it is difficult to consider what would happen in case of failure. Still, this is a fundamental question to ask. If a system is to have high availability, it should very rarely fail but it should also be possible to quickly repair it. In this context, the repair activity must encompass all the actions leading to system restoration, including logistics. The aptitude of a system to be repaired is therefore measured by its maintainability.
Definition:
The maintainability of an item is the probability that a given active maintenance operation can be accomplished in a given time interval [t1,t2]. It is written as M(t1,t2).
This definition also follows closely that of the IEC's international vocabulary. It shows that the maintainability is related to repair in a manner similar to that of reliability and failure. The maintainability M(t) is also defined using the same hypotheses as R(t).
[fig. 5: the level of risk is a function of both criticality and probability of occurrence (axes: criticality against probability of occurrence; regions: acceptable risk, unacceptable risk)]
The repair rate µ(t) is introduced in a way analogous to the failure rate. When it can be considered constant, the repair time follows an exponential distribution and the probability that the repair is completed within [0,t] is:
M(t) = 1 - exp(-µt).
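A short Python sketch of the constant repair rate case, taking M(t) as the probability that a repair is completed within [0, t], so that it grows from 0 toward 1 as t increases. The repair rate of 0.5 per hour is an assumed value chosen only for illustration, and the function name is hypothetical.

```python
import math

def maintainability(mu, t):
    """Probability that a repair is finished within t hours,
    for a constant repair rate mu (per hour)."""
    return 1.0 - math.exp(-mu * t)

MU = 0.5                      # assumed repair rate, per hour
mean_repair_time = 1.0 / MU   # mean of the exponential law, in hours

for hours in (1, 2, 8):
    print(hours, round(maintainability(MU, hours), 3))
```

With a constant rate, the mean repair time is simply 1/µ, which is why a high repair rate and a short repair time describe the same property.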
3. dependability characteristics interdependence

interrelated quantities
The examples given so far have shown that the concept of dependability is a function of four quantifiable characteristics: these are related to each other in the way shown by figure 6.
These four quantities must be considered in all dependability studies. The dependability is thus often designated in terms of the initials RAMS.
Reliability: probability that the system be failure free in the interval [0,t].
Availability: probability that the system works at time t.
Maintainability: probability that the system be repaired in the interval [0,t].
Safety: probability that a catastrophic event is avoided.

[fig. 6: the components of dependability: availability, safety, reliability, maintainability]

[...] one of three states, see figure 7. In addition to the normal functioning state, two further failed states can be considered: a failsafe state and a state of dangerous failure. In order to simplify this description we are including in the failed states all modes of degraded performance, labeled "incorrect performance".
The time spent before leaving state A is characteristic of the reliability. The time spent in state B, after a safe failure, is characteristic of the maintainability. The ratio between the time spent in state A and the total time is characteristic of the availability.
The aptitude of the system to avoid spending any time in state C is a characteristic of safety. It can be seen that state B is acceptable in terms of safety but is a source of unavailability.

[fig. 7: failsafe: availability; dangerous failure: safety. State A (normal functioning), state B (incorrect performance and not dangerous), state C (incorrect performance and dangerous); transitions labelled failsafe and repair between A and B, dangerous failure between A and C]

conflicting requirements
Some of the requirements of the dependability can be contradictory.
An improved maintainability can bring about some choices which degrade the reliability (for example, the addition of components to simplify the assembly-disassembly operations). The availability is therefore a compromise between reliability and maintainability. A dependability study allows the analyst to obtain a numerical estimate of this compromise.
Similarly, safety and availability might conflict with each other.
We have noted that the safety of a system is defined as the probability to avoid a catastrophic event and is often maximum when the system is stopped. In this case, its availability is zero! Such a case arises when a bridge is closed to traffic when there is a risk of collapse. Conversely, to improve the availability of their fleet, certain airlines are known to have neglected their preventive maintenance activities thus diminishing flight safety. In order to ascertain the optimum compromise between safety and availability it is necessary to produce a scientific computation of these characteristics.
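time average related quantities
$$A_\infty = \lim_{t \to +\infty} A(t)$$
$$U_\infty = \lim_{t \to +\infty} \bigl(1 - A(t)\bigr)$$
[fig. 8: diagram for mean times in the case of a system with no interruptions due to preventive maintenance (alternating MDT and MUT intervals along the time axis)]
$$U_\infty = \frac{\lambda}{\lambda + \mu} \quad \text{or} \quad A_\infty = \frac{\mu}{\lambda + \mu}$$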
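The asymptotic expressions above can be evaluated directly. In this Python sketch the rates are assumed values (one failure per 1000 hours, a 10 hour mean repair time) and the function name is hypothetical. It also assumes a single component with constant rates, for which the mean up time is MUT = 1/λ and the mean down time is MDT = 1/µ, so that A∞ can equally be written MUT/(MUT + MDT).

```python
def steady_state(lam, mu):
    """Return (A_inf, U_inf) for constant failure rate lam and repair rate mu."""
    availability = mu / (lam + mu)
    unavailability = lam / (lam + mu)
    return availability, unavailability

lam = 1e-3   # assumed failure rate, per hour
mu = 1e-1    # assumed repair rate, per hour

a_inf, u_inf = steady_state(lam, mu)

mut = 1.0 / lam   # mean up time, hours
mdt = 1.0 / mu    # mean down time, hours
a_from_mean_times = mut / (mut + mdt)

print(a_inf, u_inf, a_from_mean_times)
```

Both routes give the same availability, which is a quick consistency check when field data are recorded as mean times rather than as rates.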
[fig. 9: failure rates and mean times to failure for certain devices belonging to the electronic and electrotechnical fields]
4. types of defects
The design of a system with respect to its dependability goals implies the need to identify and take into account the various possible causes of defects.
One can suggest the following classification:

physical defects
induced by internal causes (breakdown of a component) or external causes (electromagnetic interferences, vibrations,...).

design defects
comprising hardware and software design errors.

operating errors
arising from an incorrect use of the equipment:
■ hardware being used in an inappropriate environment,
■ human operating or maintenance errors,
■ sabotage.
The various techniques discussed in this document concern mostly physical defects. Nevertheless, human and software errors are also very important although the state of the art in these fields is not as advanced as for physical defects. Still, within the scope of this document, we feel the following elements are worth mentioning:

Software aspects
■ the reliability of a piece of software in which all the inputs are exhaustively tested is equal to 1 forever. Nevertheless, this is unrealistic for real life, complex programs.
■ having two redundant programs implies development by different software teams using different algorithms. This is the principle behind fault tolerant software in which a majority vote may be implemented.
■ most software reliability models can be split in two major categories:
  ■ complexity models: based upon a measure of the complexity of the code or algorithm,
  ■ reliability growth models: based upon previous observed failure history.
■ the quantitative evaluation of the [...]
5. from component to system: modeling aspects

data bases for system components
Electronics
Reliability calculations have been widely used in this field for many years. The two best known data bases are the Military Handbook 217 (version E at present) issued in the U.S. and the "Recueil de données de fiabilité", from CNET (French Telecom Center), see figure 11 for an example. Merlin Gerin participates in its updates.
These data bases allow the calculation of the failure rates of electronic components, assumed to be constant. These rates are a function of the application characteristics, environment, load, etc. The type of component is also relevant, e.g., number of gates, value of the resistance, etc.
Computation is usually faster with the CNET approach but many specialized computer programs exist to implement either technique with ease.
As an example, let us take a 50 kΩ resistor mounted on an electronic board inside an electric switchboard. It is necessary to consult the table given in figure 11 in order to determine the corresponding correcting values. The environment is "au sol" (fixed, ground) and therefore the environment corrective factor is:
ΠE = 2.9
The resistance value gives the corresponding multiplying factor:
ΠR = 1
This resistor is taken as being "non qualified", which gives the multiplying quality factor
ΠQ = 7.5
The load factor ρ is a characteristic of the application, as opposed to the other factors which are characteristic of the component itself. If the load factor is 0.7 and the environmental temperature for the board is 90°C, the diagram gives
λb = 15
The global failure rate for this resistor is thus obtained by multiplying all the corrective factors and the base failure rate:
$$\lambda = \lambda_b \cdot \Pi_R \cdot \Pi_E \cdot \Pi_Q = 0.33 \times 10^{-6} \text{ per hour}$$
If at the design stage the reliability goals have been integrated, then:
■ better thermal designs will allow a lowering of the environment temperature,
■ better board designs will lower the load factor ρ.
With t = 60°C and ρ = 0.2 the diagram gives:
λb = 1.7
If now a qualified component is selected, we have ΠQ = 2.5, which gives λ = 0.012 × 10^-6 per hour, that is an improvement factor of roughly 30 (0.33/0.012 ≈ 27).
Knowledge of the reliability of each component provides a means to obtain the reliability of the boards (which are repairable or replaceable), and therefore that of whole electronic systems. This is done by using the techniques described in the rest of this report.
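The arithmetic of the resistor example can be checked with a few lines of Python. This sketch assumes that the base failure rates read from the diagram (15 and 1.7) are expressed in units of 10^-9 per hour, which is what the quoted results imply; the function name is hypothetical.

```python
def component_failure_rate(base_rate, pi_r, pi_e, pi_q):
    """Product of the base failure rate and the corrective factors.
    base_rate is assumed to be in units of 1e-9 per hour."""
    return base_rate * pi_r * pi_e * pi_q * 1e-9

# Non-qualified resistor, load factor 0.7, board at 90 degrees C.
initial = component_failure_rate(15, pi_r=1, pi_e=2.9, pi_q=7.5)

# Qualified resistor, load factor 0.2, board at 60 degrees C.
improved = component_failure_rate(1.7, pi_r=1, pi_e=2.9, pi_q=2.5)

print(f"{initial:.2e} per hour")    # about 3.3e-07
print(f"{improved:.2e} per hour")   # about 1.2e-08
print(round(initial / improved, 1))
```

Because the failure rate is a plain product, the gain comes from two independent levers: the base rate (temperature and load) and the quality factor.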
[fig. 10: failure modes and reliability data for circuit breakers (pie chart of failure modes: noisy, no movement, intermittent, degraded, stuck closed, stuck open, out of adjustment, others)]
Bridge systems
These are systems which cannot be described by simple series-parallel combinations. They can, however, be reduced to series-parallel cases by an iterative procedure, see figure 15.
[fig. 15: bridge systems]
In order to compute the reliability of this system in terms of the five non repairable component reliabilities it is necessary to apply conditional probabilities:
$$R = R_3 \cdot R(\text{given that 3 works}) + (1 - R_3) \cdot R(\text{given that 3 has failed})$$
It is thus possible to derive the system reliability R(t) by decomposing the original bridge system in the two disjoint systems illustrated in figure 16.
Example: reliability of an intrusion detection system.
The system consists of two sensors, a vibration sensor and a photoelectric cell. Each of these sensors could be connected to its specific alarm, as in figure 17, and we would have two independent branches. However, a bridge system would result if each sensor is connected to either one of the two alarms, as in figure 18, through a coupler. We will calculate the reliability improvement due to this modification. Let us also suppose that the mission time of this system is three months, i.e., the maximum expected absence during which the system must function. Furthermore, after each mission, the system is thoroughly checked and maintained and can be considered as good as new when reset. During the mission, there are no repairable elements. Let us use the following realistic constant failure rates to obtain the different orders of magnitude:
Vibration sensor: λ1 = 2 × 10^-4
Photoelectric cell: λ2 = 10^-4
Coupler: λ3 = 10^-5
Alarms: λ4 = λ5 = 4 × 10^-4
All these failure rates are given in (hours)^-1.
■ computation for Diagram A of figure 17
This is a simple case of two parallel branches, each having two components in series:
Reliability of Branch 1: R1(t).R4(t)
Reliability of Branch 2: R2(t).R5(t)
System reliability:
$$R_A(t) = R_1(t) R_4(t) + R_2(t) R_5(t) - R_1(t) R_4(t) R_2(t) R_5(t)$$
Using Ri(t) = exp(-λi t) with t = 3 months = 2190 hours as the mission time one obtains: RA(3 months) = 0.51.
[fig. 17: alarms with no coupling, diagram A: vibration sensor (1) in series with alarm 1 (4); photoelectric cell (2) in series with alarm 2 (5)]
■ computation for Diagram B of figure 18
This is the bridge system. Whenever the coupler has failed we are back to the diagram of figure 17. On the other hand, when it works, we have 1 and 2 in parallel, both in series with 4 and 5, themselves in parallel. The system reliability for figure 18 is then:
$$R_B = (1 - R_3) \cdot R_A + R_3 \cdot (R_1 + R_2 - R_1 R_2) \cdot (R_4 + R_5 - R_4 R_5)$$
The numerical computation gives RB(3 months) = 0.61.
In spite of the excellent reliability of the coupler, the system's reliability is only marginally improved. This numerical example shows, through a simple calculation, that there is not much sense in having a more expensive set-up.
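The two results of the intrusion detection example can be reproduced with the short Python sketch below. The failure rates and the 2190 hour mission time are those of the example; the helper name `parallel` and the dictionary layout are illustrative choices.

```python
import math

MISSION_HOURS = 2190.0  # three months

# Constant failure rates from the example, per hour.
rates = {1: 2e-4, 2: 1e-4, 3: 1e-5, 4: 4e-4, 5: 4e-4}
R = {k: math.exp(-lam * MISSION_HOURS) for k, lam in rates.items()}

def parallel(a, b):
    """Reliability of two independent blocks in parallel."""
    return a + b - a * b

# Diagram A: (1 in series with 4) in parallel with (2 in series with 5).
r_a = parallel(R[1] * R[4], R[2] * R[5])

# Diagram B: condition on the state of the coupler (component 3).
r_coupler_works = parallel(R[1], R[2]) * parallel(R[4], R[5])
r_b = R[3] * r_coupler_works + (1.0 - R[3]) * r_a

print(round(r_a, 2), round(r_b, 2))  # 0.51 0.61
```

Setting the coupler failure rate to zero in `rates` shows the upper bound of what the coupling can bring, since the alarms remain the weakest components.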
[fig. 18: system with coupler, diagram B: sensors 1 and 2 connected to alarms 4 and 5 through coupler 3]
Case of repairable elements
RBD's cannot be used as systematically as before:
■ for two components in parallel, the equation relating R(t) to R1(t) and R2(t) is no longer valid. In fact, a working system in the interval [0,t] may correspond to an alternating working condition between 1 and 2. With non repairable components, at least one component must work throughout the time interval [0,t], whereas with repairable components both can fail, but not simultaneously.
■ the equation R(t) = R1(t).R2(t) remains valid for a series system of two repairable components.
■ in the case of repairable components the main concern is the numerical estimate of the availability. It is possible to use the RBD's with the same formulas as for the reliability calculations:
A(t) = A1(t).A2(t) for a series system
A(t) = A1(t)+A2(t)-A1(t).A2(t) for parallel systems.
These formulas are valid only for simple cases.
For instance, the formula A(t) = A1(t)+A2(t)-A1(t).A2(t) ceases to be valid if only one repairman is available (instead of as many as necessary). This sequential feature, i.e. having a component waiting to be repaired while the other is being serviced, cannot be modeled by a simple RBD. In these cases the State Graphs, to be dealt with later, are adapted to this problem.
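A minimal Python sketch of the two availability formulas, under the simple-case assumptions stated above: independent components, each with its own repairman. The rates are assumed values, and each component availability is taken at its asymptotic value µ/(λ+µ).

```python
def availability(lam, mu):
    """Asymptotic availability of one repairable component."""
    return mu / (lam + mu)

def series(a1, a2):
    """Both components must be available."""
    return a1 * a2

def parallel(a1, a2):
    """At least one component must be available."""
    return a1 + a2 - a1 * a2

# Assumed rates, per hour, for two identical components.
a1 = availability(lam=1e-3, mu=1e-1)
a2 = availability(lam=1e-3, mu=1e-1)

a_series = series(a1, a2)
a_parallel = parallel(a1, a2)
print(a_series, a_parallel)
```

The sketch does not cover the single repairman case: there the two components are no longer independent, and the product formulas overestimate the availability.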
[figure: fault tree of the electrical network, gates G11* to G62*. Labelled events: no power to BB 3; BB 3 failure; sudden opening of C.B. E; short circuit through F; no power to BB 1; BB 1 failure; cable; short circuit through C; C.B. C stuck on short circuit; double line failure; no HV supply; line A; line B; BB 2; transfo A; transfo B; C.B. A; C.B. B]
A visual inspection of the relative importance of the min cuts shows that the cable linking busbar 1 to busbar 3 (third min cut) is critical. To a lesser extent this is also true of the two busbars 1 and 3. If these components were improved, the mains supply would then become critical. If a further improvement of the overall availability became essential, it would be necessary to incorporate an auxiliary power supply, such as a diesel generator.
A detailed study of the availability of an electrical supply is presented in Merlin Gerin's Technical paper "Sureté et distribution électrique" (in French).

unavailability: 1.01 E-05, i.e. 1.01 × 10^-5
list of min cuts and their importance
number : min cuts indicated on the fault tree : percent contribution
1 : 2*1* : 9.5
2 : 2*3* : 1.6
3 : 3*1* : 68
4 : 3*2* : 1.6
5 : 3*4*, 3*5* : 0.013
6 : 4*1* : 9.5
7 : 5*2* : 9.9
8 : 5*4*, 6*3* : 9.1 E-6
9 : 5*4*, 6*4* : 3.2 E-6
10 : 7*1*, 7*3* : 0.00058
11 : 7*1*, 7*4* : 1.3 E-5
12 : 7*2*, 7*3* : 1.3 E-5
13 : 7*2*, 7*4* : 2.7 E-7
fig. 25: contributions of network components to its unavailability

state graphs
State graphs, also called Markov graphs, allow a powerful modeling of systems under certain restrictive assumptions. The analysis proceeds from the actual construction of the graph to solving the corresponding equations and, finally, to the interpretation of results in terms of reliability and unavailability. Mathematically, a great simplification is obtained by considering only the calculation of time independent quantities.
Construction of the graph
The graph represents all the possible states of the system as well as the transitions between these states. These [...]

[fig. 26: elementary state graph: the up state leads to the down state at the failure rate λ; the down state leads back to the up state at the repair rate µ]

Equations
Under the above hypotheses, the probability of the system being in state Ei at time t+dt can be written as: Pi(t+dt) = P(the system is in state Ei and it stays [...]
[...] where Pi(t) = probability of being in working state Ei.
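For the elementary graph of figure 26, the state equations can be written and solved directly. The Python sketch below, with assumed rates and an assumed time step, integrates dP_up/dt = -λ·P_up + µ·P_down by a simple Euler scheme, starting from the up state, and compares the result with the known closed form A(t) = µ/(λ+µ) + λ/(λ+µ)·exp(-(λ+µ)t). The function names are hypothetical, and this is only one possible way of solving such a graph.

```python
import math

def availability_closed_form(lam, mu, t):
    """A(t) for the two-state graph, starting in the up state at t = 0."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

def availability_euler(lam, mu, t_end, dt=0.01):
    """Step the state equation dP_up/dt = -lam*P_up + mu*P_down forward in time."""
    p_up, p_down = 1.0, 0.0
    for _ in range(int(round(t_end / dt))):
        flow = (-lam * p_up + mu * p_down) * dt
        p_up += flow      # probability gained (or lost) by the up state
        p_down -= flow    # is lost (or gained) by the down state
    return p_up

lam, mu = 1e-2, 1e-1   # assumed rates, per hour
numeric = availability_euler(lam, mu, t_end=50.0)
exact = availability_closed_form(lam, mu, 50.0)
print(numeric, exact, mu / (lam + mu))
```

After a few multiples of 1/(λ+µ) the availability is already very close to its time independent value µ/(λ+µ), which illustrates why restricting the calculation to time independent quantities is such a simplification.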
6. conclusion
Dependability is a concept becoming ever more critical for comfort, efficiency and safety. It can be controlled and calculated. It can be designed in, be it for devices, architectures or systems.
Dependability characteristics are now frequently included in specifications and contracts. The existence of computational methods and tools allows the systematic study of the dependability during the design phase and for quality assurance purposes.
An intuitive insight, combined with exact or approximate calculations, allows the comparison of different configurations and thus provides an evaluation of the risk associated with a better performance, i.e. performance adapted to clearly specified needs.
7. references and standards
IEC Standard 271
List of basic terms, definitions and related mathematics for reliability.
IEC Standard 300
Reliability and maintainability management.
IEC Standard 362
Guide for the collection of reliability, availability and maintainability data from field performance of electronic items.
IEC Standard 409
Guide for the inclusion of reliability clauses into specifications for components (or parts) for electronic equipment.
IEC Standard 605
Equipment Reliability Testing.
IEC Standard 706
Guide on maintainability of equipment.
IEC Standard 812
Analysis techniques for system reliability - Procedure for failure mode and effects analysis (FMEA).
IEC Standard 863
Presentation of reliability, maintainability and availability predictions.
IEC Standard 1014
Programmes for reliability growth.

Merlin Gerin's dependability experts have published extensively in this field and have presented papers in most international reliability conferences.
Merlin Gerin is also an active participant in several national and international committees dealing with dependability:
■ chairmanship of the French National Committee for IEC TC 56 activities (dependability) and expert with IEC Working Group 4, TC 56 (statistical methods),
■ software dependability with the European Group of EWICS-TC7: computer and critical applications,
■ French AFCET Working Group on computer systems dependability,
■ updating contributions to the French CNET Electronic components reliability handbook,
■ Working Group IFIP 10.4 on Dependable Computing.