Chapter - 3 System Reliability Models and Redundancy Techniques in System Design

CHAPTER – 3
SYSTEM RELIABILITY MODELS AND REDUNDANCY TECHNIQUES IN SYSTEM DESIGN
Table of Contents
3.1 Failures and Failure Modes
3.2 Causes of Failures and Unreliability
3.3 Reliability of a Product from Test Data
3.4 Mean Time to Failure (MTTF)
3.5 Time-Dependent Hazard Models
3.5.1 Field-Data Curves
3.5.2 Constant-Hazard Model
3.5.3 Linear-Hazard Model
3.5.4 Non-Linear Hazard Model
3.5.5 Gamma Model
3.5.6 Other Models
3.6 Stress-Dependent Hazard Models
3.7 Computation of Reliability Function Using Markov Model
3.8 System Reliability Models
3.8.1 Series Systems
3.8.2 Systems with Parallel Components
3.8.3 k-out-of-m Systems
3.8.4 Non-Series-Parallel Systems
3.8.5 Systems with Mixed-Mode Failures
3.8.6 Fault-Tree Technique
3.9 Redundancy Techniques in System Design
3.9.1 Component Versus Unit Redundancy
3.9.2 Weakest-Link Technique
3.9.3 Mixed Redundancy
3.9.4 Standby Redundancy
3.9.5 Redundancy Optimization
3.9.6 Problem Formulation
3.9.7 Computational Procedure
3.10 Conclusions
“A system is a collection of components, subsystems and/or assemblies arranged to a specific design in order to achieve the desired functions with acceptable performance and reliability.” The types of components, their qualities, their quantities and the way in which they are arranged within the system have a direct effect on system reliability. Depending on the functional relationship among the components, the malfunction of a single component or part may lead to the breakdown of the whole system in some cases but not in others. This necessitates a careful study of component failures, their failure modes and their failure models.
3.1 FAILURES AND FAILURE MODES
“Failure is defined as non-conformance to some defined performance criterion.” Some products have well-defined failures while others do not. For example, electric bulbs and switches have clearly distinct failures: either they are working or they are not working (failed). Such products are known as two-state products. Other products, such as voltage stabilizers and resistors, work within a range. For example, the output voltage of a stabilizer must lie within the limits V− and V+, and the device is considered to have failed only when the output voltage crosses these limits. A precise definition of failure is therefore needed to evaluate the quantitative reliability of a device.
Years of failure data collected from different devices have revealed that failures can be grouped into different kinds according to their nature. When a large population of elements is put into operation, a large number of failures may occur at the start; these are called initial failures or infant mortality. Initial failures are due to production defects such as weak parts, poor fit, poor insulation and bad assembly. This period is known as the burn-in or debugging period, because the malfunctioning units are eliminated during it.
“The failures which occur after the initial failures, due to a sharp change in the parameters determining the performance of the units, as a result of a change in either the working stresses or the environmental conditions, are called random failures or catastrophic failures.” Random failures occur infrequently over a long period of operation, during which it is difficult to predict when stresses will occur and how large they will be.
With the passage of time, the units wear out and begin to weaken. A gradual drift in the values of the parameters affects the performance of the product, and when the parameters exceed their tolerance limits, the product fails. This is called the wear-out region; here the failure rate increases, and wear-out failures are very difficult to predict.
The characteristic curve depicting these modes of behaviour, often known as the bath-tub curve, is shown in Fig. 3.1.
Manufacturers of high-reliability products subject them to an initial burn-in period in order to weed out units with manufacturing defects. The useful lifetime of the product is the period t2 − t1. After time t2, the product is replaced with a pre-tested product.
[Fig. 3.1 Bath-tub curve: failure rate versus time, showing the infant-mortality region (before t1), the useful-life region (t1 to t2) and the wear-out region (after t2)]
The curve is plotted with the failure rate on the y-axis and the product life on the x-axis. Life can be measured in cycles, actuations, years, hours, minutes or any other quantifiable unit of time or use. The number of failures per unit time among the surviving units is known as the failure rate. The plot shows that many products begin their lives with a high failure rate, owing to poor workmanship, poor quality control of incoming parts, manufacturing defects, etc., and then exhibit a declining failure rate. The failure rate then generally stabilizes at a more or less constant level, where the observed failures are chance failures; this is the useful-life region. With continued use, products experience more wear and tear, the failure rate begins to rise and wear-out failures start to occur. Human mortality follows the same pattern: the mortality rate is high during the first year or so of life, drops to a low, nearly constant level during the teenage and early adult years, and rises again with age. Infant mortality, which occurs early in life, is characterized by a decreasing failure rate. Failures in this period do not occur randomly in time; rather, they result from a few components with gross defects and from a lack of adequate quality-assurance controls in the manufacturing process.
Causes of an increasing failure rate (wear-out failures) include:
• Oxidation, friction wear, shrinkage, corrosion, atomic migration, breakdown of insulation, fatigue, etc.
• Populations of substandard items with microscopic flaws, which fail when random fluctuations (transients) of stress exceed the item strength.
• Defects usually related to quality assurance and manufacture, e.g. in joints, connections and welds, cracks, warp, impurities, dirt, insulation or coating flaws, and incorrect positioning or adjustment.
3.2 CAUSES OF FAILURES AND UNRELIABILITY
There can be many causes of failure of the components and equipment in a system. Some are known, while others remain unknown owing to the complexity of the system and its environment. A few of them are listed below:
1. Poor component design or system design
2. Wrong production technique
3. Lack of complete knowledge and skill
4. High complexity of equipment
5. Poor maintenance policies
6. Organizational inflexibility and complexity
7. Human errors
8. Failures resulting from environmental factors, software elements and human factors
9. Common-mode failure in a redundant system, where all replicated units fail due to a common cause.
Component hazard data can be obtained in the following two ways:
a) From part-failure data obtained either from consumers' failure reports or from life tests. It is not practicable to determine failure rates of products for all working conditions in this way, but it is possible to forecast reliability under prescribed stress conditions from the available data through interpolation or extrapolation.
b) From basic failure rates and working stresses, by applying stress models.
3.3 RELIABILITY OF A PRODUCT FROM TEST DATA
Consider a population of N items put into operation at time t = 0. As time passes, the items fail. Let $N_s(t)$ denote the number of items surviving and $N_f(t)$ the number of items that have failed after time t. Each item fails independently, with probability of failure

$$F(t) = 1 - R(t) \tag{3.1}$$

where R(t) represents the probability of survival. Then

$$R(t) = \frac{N_s(t)}{N} = 1 - \frac{N_f(t)}{N} \tag{3.2}$$
The hazard rate, regarded as a measure of the instantaneous speed of failure, is defined as

$$z(t) = \lim_{\Delta t \to 0} \frac{N_s(t) - N_s(t+\Delta t)}{N_s(t)\,\Delta t} = -\frac{1}{N_s(t)}\frac{dN_s(t)}{dt} \tag{3.3}$$

Differentiating R(t) with respect to t,

$$\frac{dR(t)}{dt} = \frac{1}{N}\frac{dN_s(t)}{dt}$$

and substituting in terms of the hazard rate,

$$\frac{dR(t)}{dt} = -\frac{N_s(t)}{N}\,z(t) = -R(t)\,z(t)$$

Then,

$$z(t) = -\frac{1}{R(t)}\frac{dR(t)}{dt} \tag{3.4}$$

Integrating both sides (using R(0) = 1),

$$\int_0^t z(x)\,dx = -\ln R(t)$$

or

$$R(t) = \exp\left[-\int_0^t z(x)\,dx\right] \tag{3.5}$$

Then,

$$F(t) = 1 - \exp\left[-\int_0^t z(x)\,dx\right]$$

$$f(t) = \frac{dF(t)}{dt} = z(t)\exp\left[-\int_0^t z(x)\,dx\right] = z(t)\,R(t)$$

Therefore,

$$z(t) = \frac{f(t)}{R(t)} = \frac{f(t)}{1 - F(t)} \tag{3.6}$$
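Eqs. (3.5) and (3.6) can be checked numerically. The Python sketch below assumes a hypothetical hazard rate z(t) = 0.02t, chosen only for illustration. It computes R(t) by trapezoidal integration of the hazard, obtains f(t) = dF/dt by a central difference, and then recovers the hazard as f(t)/[1 − F(t)].

```python
import math

def cumulative_hazard(z, t, steps=10_000):
    """Trapezoidal approximation of the integral of z(x) from 0 to t."""
    h = t / steps
    total = 0.5 * (z(0.0) + z(t))
    for i in range(1, steps):
        total += z(i * h)
    return total * h

def reliability(z, t):
    """Eq. (3.5): R(t) = exp(-integral of z(x) from 0 to t)."""
    return math.exp(-cumulative_hazard(z, t))

def hazard_from_density(z, t, dt=1e-4):
    """Eq. (3.6): z(t) = f(t) / (1 - F(t)), with f(t) = dF/dt by central difference."""
    def F(u):
        return 1.0 - reliability(z, u)
    f = (F(t + dt) - F(t - dt)) / (2 * dt)
    return f / (1.0 - F(t))

def z(t):
    """Hypothetical linear hazard rate used only for this illustration."""
    return 0.02 * t

t = 5.0
R = reliability(z, t)             # exact value: exp(-0.01 * t**2) = exp(-0.25)
z_est = hazard_from_density(z, t)
print(f"R({t}) = {R:.6f}, recovered z({t}) = {z_est:.6f}, true z({t}) = {z(t):.6f}")
```

The recovered hazard agrees with z(5) = 0.1, confirming that f(t)/R(t) returns the hazard used to build R(t).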
3.4 MEAN TIME TO FAILURE (MTTF)
The expected time to failure of a non-repairable system is described as the mean time to failure. “MTTF is an estimate of the average, or mean, time until a component's first failure or disruption in the operation of the product, process, procedure or design occurs.” MTTF assumes that the product cannot resume any of its normal operations because it cannot be repaired. “It is frequently used to describe system or equipment reliability, is of use when estimating maintenance costs, and is significant even if there is no constant failure rate.” MTTF is properly used only for components that cannot be repaired; for repairable items that are returned to service, the mean time between failures (MTBF) is used instead.
Often, knowledge of the mean time to failure of a product is more important than the complete failure details. The MTTF is assumed to be the same for all products that are identical in design and operate under identical conditions. If life-test data are available for a population of N items with failure times $t_1, t_2, \ldots, t_N$, the MTTF can be expressed as

$$\text{MTTF} = \frac{1}{N}\sum_{i=1}^{N} t_i \tag{3.7}$$

When a component is described by a hazard model and its reliability function, the MTTF is the expected value of the random variable T denoting the time to failure of the component:

$$\text{MTTF} = E[T] = \int_0^\infty t\,f(t)\,dt \tag{3.8}$$

Since

$$f(t) = \frac{dF(t)}{dt} = -\frac{dR(t)}{dt}$$

integration by parts gives

$$\text{MTTF} = -\int_0^\infty t\,dR(t) = \Big[-t\,R(t)\Big]_0^\infty + \int_0^\infty R(t)\,dt = \int_0^\infty R(t)\,dt \tag{3.9}$$

(the boundary term vanishes because $t\,R(t) \to 0$ as $t \to \infty$ whenever the MTTF is finite).

The MTTF can also be evaluated from the Laplace transform of R(t):

$$\text{MTTF} = \int_0^\infty R(t)\,dt = \lim_{t \to \infty}\int_0^t R(x)\,dx$$

By the final-value theorem,

$$\lim_{t \to \infty}\int_0^t R(x)\,dx = \lim_{s \to 0} R(s)$$

where R(s) is the Laplace transform of R(t). Thus,

$$\text{MTTF} = \lim_{s \to 0} R(s) \tag{3.10}$$
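To illustrate Eqs. (3.7) and (3.9), the sketch below computes the MTTF in two ways: as the sample mean of a hypothetical set of life-test failure times, and as the area under an exponential reliability curve with an assumed rate λ = 0.002 per hour, which should approach 1/λ = 500 h. Both the failure times and λ are illustrative values, not data from this chapter. For the exponential case, Eq. (3.10) gives the same answer, since the Laplace transform of $e^{-\lambda t}$ is $1/(s+\lambda)$, which tends to $1/\lambda$ as $s \to 0$.

```python
import math

def mttf_from_test_data(failure_times):
    """Eq. (3.7): mean of the observed failure times of N items."""
    return sum(failure_times) / len(failure_times)

def mttf_from_reliability(R, t_max, steps=100_000):
    """Eq. (3.9): trapezoidal estimate of the integral of R(t) from 0 to t_max.

    t_max must be large enough that R(t_max) is negligible.
    """
    h = t_max / steps
    total = 0.5 * (R(0.0) + R(t_max))
    for i in range(1, steps):
        total += R(i * h)
    return total * h

times = [120.0, 340.0, 560.0, 780.0, 1000.0]   # hypothetical failure times (h)
mttf_data = mttf_from_test_data(times)

lam = 0.002                                    # hypothetical constant failure rate (1/h)
mttf_integral = mttf_from_reliability(lambda t: math.exp(-lam * t), t_max=10_000.0)
print(mttf_data, round(mttf_integral, 3))      # 560.0 500.0
```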
3.5 TIME DEPENDENT HAZARD MODELS
3.5.1 Field-Data Curves
From failure-analysis data, the hazard rate of a component can be computed for various time intervals using the formula

$$z(t) = \frac{N_s(t) - N_s(t+\Delta t)}{N_s(t)\,\Delta t} \tag{3.11}$$

A general hazard-rate curve is shown in Fig. 3.2. Here the time intervals are assumed to be equal, although they need not be.

[Fig. 3.2 Hazard-rate curve: z(t) versus time, over the range 0 to 100]
As the amount of data becomes large and the time interval approaches zero, the piecewise hazard-rate function tends to the continuous hazard-rate function. Using such failure curves, the failure rate of other identical components operating under identical conditions can be predicted. This, however, requires mathematical models that approximately describe the failure behaviour. A few such models, depicting various shapes of failure curves, are discussed in the following sections.
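Eq. (3.11) can be applied directly to field data. The sketch below uses hypothetical survivor counts $N_s(t)$ recorded at t = 0, 20, ..., 100 h for N = 1000 items (illustrative numbers only) and returns the piecewise hazard rate for each interval, together with R(t) = $N_s(t)/N$ from Eq. (3.2). Passing the observation times explicitly lets the intervals be unequal.

```python
def piecewise_hazard(times, survivors):
    """Eq. (3.11): z = [Ns(t) - Ns(t + dt)] / [Ns(t) * dt] for each interval."""
    rates = []
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        failed = survivors[k] - survivors[k + 1]
        rates.append(failed / (survivors[k] * dt))
    return rates

times = [0, 20, 40, 60, 80, 100]                # observation times (h), hypothetical
survivors = [1000, 900, 850, 820, 790, 740]     # Ns(t), hypothetical counts
hazards = piecewise_hazard(times, survivors)
reliab = [n / survivors[0] for n in survivors]  # Eq. (3.2): R(t) = Ns(t)/N

for (t0, t1), rate in zip(zip(times, times[1:]), hazards):
    print(f"{t0:>3}-{t1:<3} h  z = {rate:.5f} per h")
```

With these made-up counts the estimated hazard first falls and then rises again, mimicking the bath-tub shape of Fig. 3.1.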
3.5.2 Constant-Hazard Model
This model assumes that the hazard rate is constant, i.e. it does not change significantly with component age. The constant-hazard model is well suited to a product functioning during its useful lifetime.
The reliability under a constant-hazard model is

$$R(t) = 1 - F(t) = e^{-\lambda t} \tag{3.12}$$

For a constant failure rate and an exponential probability distribution, the failure density is

$$f(t) = \lambda e^{-\lambda t} \tag{3.13}$$

and the reliability function is $R(t) = e^{-\lambda t}$. The hazard function is then

$$h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda \tag{3.14}$$

Since the hazard rate is constant and equal to the failure rate, the constant-hazard model takes the form

$$z(t) = \lambda \tag{3.15}$$

where λ is a constant. Many components, particularly electronic products, exhibit this behaviour, and the model has been used in reliability studies for many years.
An item with a constant hazard rate has the following failure density, reliability and unreliability functions:

$$f(t) = \lambda e^{-\lambda t} \tag{3.16}$$

$$R(t) = e^{-\lambda t} \tag{3.17}$$

$$F(t) = 1 - e^{-\lambda t} \tag{3.18}$$

The mean time to failure of the item is

$$\text{MTTF} = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda} \tag{3.19}$$
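A short numerical check of Eqs. (3.14) to (3.19), assuming a hypothetical failure rate λ = 0.001 per hour: the hazard f(t)/R(t) stays equal to λ at every t, and the reliability at t = MTTF = 1/λ is $e^{-1} \approx 0.368$, i.e. only about 37 % of such items survive to the mean life.

```python
import math

lam = 0.001   # hypothetical constant failure rate (per hour)

def f(t):
    """Eq. (3.16): failure density."""
    return lam * math.exp(-lam * t)

def R(t):
    """Eq. (3.17): reliability."""
    return math.exp(-lam * t)

def F(t):
    """Eq. (3.18): unreliability."""
    return 1.0 - R(t)

mttf = 1.0 / lam                                       # Eq. (3.19)
hazards = [f(t) / R(t) for t in (0.0, 500.0, 5000.0)]  # Eq. (3.14): all equal lam
print(mttf, [round(h, 6) for h in hazards], round(R(mttf), 4))
```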
3.5.3 Linear-Hazard Model
Many components under mechanical stress fail through deterioration or wear-out, and their hazard rate increases with time. The linear-hazard model, the simplest time-dependent model, has the form

$$z(t) = bt, \quad t > 0 \tag{3.20}$$

where b is a constant. Then

$$R(t) = \exp\left[-\int_0^t bx\,dx\right] = \exp\left(-\frac{bt^2}{2}\right) \tag{3.21}$$

$$f(t) = bt\,\exp\left(-\frac{bt^2}{2}\right) \tag{3.22}$$

where f(t) is the Rayleigh density function. In some cases, the portion of the bath-tub curve beyond the useful period may follow this model. The mean time to failure is

$$\text{MTTF} = \int_0^\infty e^{-bt^2/2}\,dt = \frac{\Gamma(1/2)}{2\sqrt{b/2}} = \sqrt{\frac{\pi}{2b}} \tag{3.23}$$
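Eq. (3.23) can be checked against a direct numerical integration of Eq. (3.21). The sketch assumes an illustrative value b = 1e-4 per h²; `math.gamma(0.5)` equals √π, so both closed forms in (3.23) agree.

```python
import math

b = 1e-4   # hypothetical slope of the linear hazard z(t) = b t (per h^2)

def R(t):
    """Eq. (3.21): reliability of the linear-hazard (Rayleigh) model."""
    return math.exp(-b * t * t / 2.0)

# Eq. (3.23): two equivalent closed forms of the MTTF
mttf_gamma = math.gamma(0.5) / (2.0 * math.sqrt(b / 2.0))
mttf_closed = math.sqrt(math.pi / (2.0 * b))

# Trapezoidal integral of R(t) over [0, 1000] h; R(1000) = exp(-50) is negligible
steps, t_max = 100_000, 1000.0
h = t_max / steps
mttf_numeric = h * (0.5 * (R(0.0) + R(t_max)) + sum(R(i * h) for i in range(1, steps)))
print(round(mttf_closed, 3), round(mttf_numeric, 3))
```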
3.5.4 Non-Linear Hazard Model
Since the hazard rate is not always a linearly increasing function of time, a more general form of the hazard model is

$$z(t) = a t^b \tag{3.24}$$

where a and b are constants. This gives

$$R(t) = \exp\left[-\frac{a t^{b+1}}{b+1}\right] \tag{3.25}$$

$$f(t) = a t^b \exp\left[-\frac{a t^{b+1}}{b+1}\right] \tag{3.26}$$

This general form, called the Weibull model, generates a wide range of curves for various values of a and b. It includes both of the previously discussed models: it reduces to the constant-hazard model when b = 0 and to the linearly increasing hazard model when b = 1. The parameter b affects the shape of z(t) and a affects its amplitude, so they are called the shape and scale parameters respectively.
For this model,

$$\text{MTTF} = \frac{\Gamma\left(\frac{1}{b+1}\right)}{(b+1)\left(\frac{a}{b+1}\right)^{1/(b+1)}} = \frac{d\,\Gamma(d)}{(ad)^d} \tag{3.27}$$

where $d = \dfrac{1}{b+1}$.
3.5.5 Gamma Model
Because the gamma distribution is a flexible life-distribution model, it may offer a good fit to some sets of failure data. However, it is not widely used as a life-distribution model for common failure mechanisms.
The hazard model and its associated functions are

$$z(t) = \frac{\lambda(\lambda t)^{a-1} e^{-\lambda t}}{(a-1)!\,R(t)} \tag{3.28}$$

$$R(t) = \sum_{j=0}^{a-1} \frac{(\lambda t)^j}{j!}\,e^{-\lambda t} \tag{3.29}$$

$$f(t) = \frac{\lambda}{(a-1)!}(\lambda t)^{a-1} e^{-\lambda t} \tag{3.30}$$

where a is a positive integer and λ is a positive constant, and

$$\text{MTTF} = \frac{a}{\lambda}$$

For a > 1, z(t) increases with time, and for a = 1 the model becomes a constant-hazard model. These functions apply to a component that is replaced (a − 1) times by identical components with practically zero replacement time; in effect, a total of a components are used in sequence to accomplish the task. The case a = 1 corresponds to a single component.
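The gamma-model functions (3.28) to (3.30) are easy to evaluate for integer a. The sketch below assumes illustrative values λ = 0.01 per hour and a = 3; it checks that the integral of R(t) reproduces MTTF = a/λ = 300 h and that the hazard rate increases with time for a > 1, approaching λ from below.

```python
import math

def gamma_reliability(t, lam, a):
    """Eq. (3.29): R(t) = sum_{j=0}^{a-1} (lam t)^j e^(-lam t) / j!."""
    x = lam * t
    return sum(x ** j / math.factorial(j) for j in range(a)) * math.exp(-x)

def gamma_density(t, lam, a):
    """Eq. (3.30): f(t) = lam (lam t)^(a-1) e^(-lam t) / (a-1)!."""
    x = lam * t
    return lam * x ** (a - 1) * math.exp(-x) / math.factorial(a - 1)

def gamma_hazard(t, lam, a):
    """Eq. (3.28): z(t) = f(t) / R(t)."""
    return gamma_density(t, lam, a) / gamma_reliability(t, lam, a)

lam, a = 0.01, 3          # hypothetical rate (1/h) and number of components
steps, t_max = 50_000, 5000.0
h = t_max / steps
mttf_numeric = h * (0.5 * (gamma_reliability(0.0, lam, a) + gamma_reliability(t_max, lam, a))
                    + sum(gamma_reliability(i * h, lam, a) for i in range(1, steps)))
print(round(mttf_numeric, 3), a / lam)
print([round(gamma_hazard(t, lam, a), 5) for t in (10.0, 100.0, 1000.0)])
```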
3.5.6 Other Models
In statistics and probability theory, the exponential distribution is a continuous probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. Note that the exponential distribution is not the same as the exponential family of distributions, a large class of probability distributions that includes the exponential distribution as one of its members but also includes the Poisson, binomial, normal and gamma distributions, among many others.
A few more models are sometimes applied to describe failure curves that do not fit the models discussed earlier. The shifted model takes the form

$$z(t) = a(t - t_0)^b, \quad t > t_0 \tag{3.31}$$

It is used when the hazard rate is almost zero for some initial period, and is called the three-parameter Weibull model. This model roughly combines the constant and linear models to describe a failure curve that is initially constant and then increases.
The following model can also serve this purpose to a limited extent:

$$z(t) = a e^{ct} \tag{3.32}$$

where a and c are constants. For this model,

$$R(t) = \exp\left[-\frac{a}{c}\left(e^{ct} - 1\right)\right] \tag{3.33}$$

This model is generally known as the exponential-hazard model.
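Eq. (3.33) follows from Eq. (3.5), since $\int_0^t a e^{cx}\,dx = (a/c)(e^{ct} - 1)$. The sketch below confirms this numerically for illustrative constants a = 1e-4 per hour and c = 0.01 per hour (both hypothetical).

```python
import math

a, c = 1e-4, 0.01   # hypothetical constants of z(t) = a e^(c t)

def z(t):
    """Eq. (3.32): exponential-hazard model."""
    return a * math.exp(c * t)

def R_closed(t):
    """Eq. (3.33): closed-form reliability."""
    return math.exp(-(a / c) * (math.exp(c * t) - 1.0))

def R_numeric(t, steps=20_000):
    """Eq. (3.5) with the hazard integrated by the trapezoidal rule."""
    h = t / steps
    integral = h * (0.5 * (z(0.0) + z(t)) + sum(z(i * h) for i in range(1, steps)))
    return math.exp(-integral)

for t in (50.0, 200.0, 400.0):
    print(t, round(R_closed(t), 6), round(R_numeric(t), 6))
```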
3.6 STRESS-DEPENDENT HAZARD MODELS
The reliability of a component is usually defined under stated operating and environmental conditions, which implies that any variation in these conditions can affect the failure rate of the component and hence its reliability. For almost all components the failure rate is stress-dependent. For a component whose reliability is influenced by more than one kind of stress, a power-function model of the following form is used:

$$h(t) = z(t)\,S_1^{a_1} S_2^{a_2} \tag{3.34}$$

where $S_1$ and $S_2$ are the stress conditions and $a_1$, $a_2$ are positive constants. For instance, for an electrical item the stresses may be

$$S_1 = \frac{\text{operating voltage}}{\text{rated voltage}}, \qquad S_2 = \frac{\text{operating current}}{\text{rated current}}$$

Such models are applied in the accelerated testing of products. Statistical data show that the exponent a varies from 1 to 8 depending on the type of component. Other factors, such as quality, application and complexity factors, are also used for accurate estimation of failure rates.
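As an illustration of Eq. (3.34), the sketch below scales a base hazard rate for voltage and current stress. All numbers (base rate, ratings, operating point and exponents a1 = 3, a2 = 2) are hypothetical, with the exponents chosen inside the 1 to 8 range quoted above.

```python
def stressed_hazard(z_base, stress_ratios, exponents):
    """Eq. (3.34): h = z * S1^a1 * S2^a2 * ... (power-function stress model)."""
    h = z_base
    for s, a in zip(stress_ratios, exponents):
        h *= s ** a
    return h

z_base = 2e-6                 # hypothetical base hazard rate (per hour)
S1 = 12.0 / 15.0              # operating voltage / rated voltage
S2 = 0.5 / 1.0                # operating current / rated current
h = stressed_hazard(z_base, [S1, S2], [3, 2])
print(f"h = {h:.3e} per hour")   # 2e-6 * 0.8**3 * 0.5**2 = 2.56e-7
```

Operating below the ratings (stress ratios below 1) lowers the hazard rate, which is the basis of derating.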
3.7 COMPUTATION OF RELIABILITY FUNCTION USING MARKOV MODEL
This method of estimating the reliability of a part can be used if the hazard rate of the part is known. The part is assumed to exist in only two states:
State 0: the part is good
State 1: the part has failed
[Fig. 3.3 Markov graph: State 0 goes to State 1 with probability z(t)Δt and remains in State 0 with probability 1 − z(t)Δt; State 1 remains in State 1 with probability 1]
If the part is in state 0 at time t, the probability that it fails and reaches state 1 in the time interval Δt is z(t)Δt, where z(t) is the hazard rate of the component. Let $P_0(t)$ be the probability that the component has not failed by time t (its reliability) and $P_1(t)$ the probability that it has failed by time t. Then the probability that the component is still in state 0 at time t + Δt is

$$P_0(t+\Delta t) = \left[1 - z(t)\,\Delta t\right]P_0(t) \tag{3.35}$$

Once the component fails, the probability of its remaining in state 1 (the failed state) is unity, because the component is not repaired. The probability of the component being in state 1 at time t + Δt is therefore

$$P_1(t+\Delta t) = z(t)\,\Delta t\,P_0(t) + P_1(t) \tag{3.36}$$

These probability equations can be rewritten as

$$\frac{P_0(t+\Delta t) - P_0(t)}{\Delta t} = -z(t)\,P_0(t), \qquad \frac{P_1(t+\Delta t) - P_1(t)}{\Delta t} = z(t)\,P_0(t) \tag{3.37}$$

As Δt → 0, we have

$$\frac{dP_0(t)}{dt} = -z(t)\,P_0(t)$$

The solution of this equation yields

$$\ln P_0(t) = -\int_0^t z(x)\,dx + c_1$$

$$P_0(t) = \exp\left[-\int_0^t z(x)\,dx + c_1\right] = c_2 \exp\left[-\int_0^t z(x)\,dx\right] \tag{3.38}$$

where $c_1$ and $c_2 = e^{c_1}$ are constants of integration. Initially, at t = 0, the component is in state 0, and therefore $P_0(0) = 1 = c_2$. Thus the reliability of the component is

$$R(t) = P_0(t) = \exp\left[-\int_0^t z(x)\,dx\right] \tag{3.39}$$
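The difference equations (3.35) and (3.36) can also be stepped forward directly, which is a common way to solve Markov models numerically when no closed form is available. The sketch below uses simple forward steps with a hypothetical hazard z(t) = 0.02t and compares $P_0(t)$ with the closed-form result (3.39), $\exp(-0.01t^2)$; the step size Δt = 1e-4 is an arbitrary choice.

```python
import math

def markov_two_state(z, t_end, dt=1e-4):
    """Step Eqs. (3.35)-(3.36) from P0(0) = 1, P1(0) = 0 up to t_end."""
    p0, p1 = 1.0, 0.0
    n = round(t_end / dt)
    for k in range(n):
        t = k * dt
        moved = z(t) * dt * p0      # probability of the 0 -> 1 transition in dt
        p0, p1 = p0 - moved, p1 + moved
    return p0, p1

def z(t):
    """Hypothetical hazard rate used only for this illustration."""
    return 0.02 * t

p0, p1 = markov_two_state(z, 5.0)
exact = math.exp(-0.01 * 5.0 ** 2)  # Eq. (3.39) for this hazard
print(round(p0, 5), round(exact, 5), round(p0 + p1, 12))
```

Because the failed state is absorbing, probability only moves from state 0 to state 1, so $P_0 + P_1$ stays equal to 1 at every step.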
3.8 SYSTEM RELIABILITY MODELS
The main task of a systems engineer is to evaluate the various reliability parameters of the systems under study. System configurations vary from simple ones, consisting of one or two components, to complex ones involving thousands of components. Such systems can be analysed by decomposing them into subsystems of suitable size, each performing a specific function. The reliabilities of the subsystems are evaluated and then combined, using certain probability laws, to find the reliability of the complete system. This approach requires complete information about the physical structure of the system and the nature of its functions, so that the behaviour of the system can be determined whenever a subsystem fails. Each subsystem may consist of one or more components whose reliabilities are known.
The reliability models for various kinds of subsystems (or systems) are developed in this section. All the models assume that each component fails independently of the others, i.e. the failure of any component does not alter the failure probabilities of the remaining components.
3.8.1 Series Systems
In a series system, in which the components are connected serially, the failure of any one component leads to system failure.
Consider a system with a total of n components. If the functional diagram indicates that successful operation of the system requires the correct operation of all n components, the system arrangement is said to be of series type. Such systems are represented as shown in Fig. 3.4 for the purpose of reliability estimation. Information from the IN end reaches the OUT end only if all n components function properly. Many complex systems can be reduced to such a simple structure.
[Fig. 3.4 Series system: components 1, 2, ..., n connected in series between IN and OUT]
Let $E_i$ represent the event that component i is in the working state and $\bar{E}_i$ the event that component i is in the failed state. The intersection of $E_1, E_2, \ldots, E_n$ represents the event of success of the system. Let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the hazard rates of components 1, 2, ..., n respectively, and let $P_i(t)$ be the probability that component i is working at time t.
The reliability of a series system, $R_s$, is the probability that all the units are in good condition:

$$R_s = \Pr(E_1 \cap E_2 \cap \cdots \cap E_n) = \Pr(E_1)\,\Pr(E_2 \mid E_1)\,\Pr(E_3 \mid E_1 E_2)\cdots\Pr(E_n \mid E_1 E_2 \cdots E_{n-1}) \tag{3.40}$$

Since the components fail independently, each conditional probability equals the corresponding unconditional one, so for constant hazard rates

$$R_s = P_1(t)\,P_2(t)\cdots P_n(t) = e^{-\lambda_1 t}\,e^{-\lambda_2 t}\cdots e^{-\lambda_n t} \tag{3.41}$$

i.e. the reliability of a series system is the product of its component reliabilities.
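A minimal sketch of Eq. (3.41) for components with constant hazard rates; the three rates and the mission time are hypothetical. Because the exponents add, the series system behaves like a single component whose hazard rate is $\lambda_1 + \lambda_2 + \cdots + \lambda_n$, so its reliability is never higher than that of its weakest component.

```python
import math

def series_reliability(rates, t):
    """Eq. (3.41): product of the component reliabilities exp(-lambda_i t)."""
    r = 1.0
    for lam in rates:
        r *= math.exp(-lam * t)
    return r

rates = [1e-4, 2e-4, 5e-4]   # hypothetical hazard rates (per hour)
t = 100.0                    # hypothetical mission time (hours)
Rs = series_reliability(rates, t)
print(round(Rs, 6), round(math.exp(-sum(rates) * t), 6))   # both equal exp(-0.08)
```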
3.8.2 Systems with Parallel Components
“A parallel system is a system in which the components are connected in parallel and the system does not fail as long as even one component is in good working condition, i.e. the system fails only when all the components have failed.”
Let $E_i$ be the event that component i is in good working state and $\bar{E}_i$ the event that it is in the failed state, let n be the number of components in the system, and let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the hazard rates of components 1, 2, ..., n respectively.
[Fig. 3.5 Parallel system: components 1, 2, ..., n connected in parallel between IN and OUT]
Let $P_i(t)$ be the probability that component i is functioning at time t. The system unreliability is

$$Q_p = \Pr(\bar{E}_1 \cap \bar{E}_2 \cap \cdots \cap \bar{E}_n) = \Pr(\bar{E}_1)\,\Pr(\bar{E}_2 \mid \bar{E}_1)\cdots\Pr(\bar{E}_n \mid \bar{E}_1 \cdots \bar{E}_{n-1}) \tag{3.42}$$

Since the components fail independently,

$$Q_p = [1 - P_1(t)][1 - P_2(t)]\cdots[1 - P_n(t)] = \left(1 - e^{-\lambda_1 t}\right)\left(1 - e^{-\lambda_2 t}\right)\cdots\left(1 - e^{-\lambda_n t}\right) \tag{3.43}$$

i.e. the product of the unreliabilities of the components. The system reliability is

$$R_p = 1 - Q_p$$

i.e. the reliability of a parallel system is 1 minus the product of its component unreliabilities.
If all units have the same reliability p(t), adding one more unit to an existing system of m units increases the system reliability to

$$R_{m+1}(t) = 1 - [1 - p(t)]^{m+1} \tag{3.44}$$

and the increase in reliability is

$$\Delta R_{m+1}(t) = R_{m+1}(t) - R_m(t) = [1 - p(t)]^m - [1 - p(t)]^{m+1} = p(t)[1 - p(t)]^m \tag{3.45}$$

Hence the following recursive formula is available for estimating the reliability:

$$R_1(t) = p(t), \qquad R_m(t) = R_{m-1}(t) + \Delta R_m(t), \quad m \ge 2 \tag{3.46}$$

If the failure rates are constant,

$$R(t) = 1 - \left(1 - e^{-\lambda t}\right)^m$$

and the mean time to failure of the system is

$$\text{MTTF} = \int_0^\infty \left[1 - \left(1 - e^{-\lambda t}\right)^m\right] dt$$

Putting $1 - e^{-\lambda t} = x$, so that $dt = dx/[\lambda(1-x)]$, we get

$$\text{MTTF} = \frac{1}{\lambda}\int_0^1 \frac{1 - x^m}{1 - x}\,dx = \frac{1}{\lambda}\int_0^1 \left(1 + x + x^2 + \cdots + x^{m-1}\right)dx = \frac{1}{\lambda}\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{m}\right) = \frac{1}{\lambda}\sum_{i=1}^{m}\frac{1}{i} \tag{3.47}$$

For large values of m,

$$\text{MTTF} \approx \frac{1}{\lambda}\left[\ln m + 0.577 + \frac{1}{2m}\right] \tag{3.48}$$

It is observed that the improvement in mean life with redundancy is only logarithmic. If the component reliabilities are not equal, the problem becomes more complex: the MTTF must then be obtained by integrating the reliability term by term,

$$\text{MTTF} = \int_0^\infty \Big[1 - \prod_{i=1}^{m}\left(1 - e^{-\lambda_i t}\right)\Big]dt = \sum_{i}\frac{1}{\lambda_i} - \sum_{i<j}\frac{1}{\lambda_i + \lambda_j} + \cdots + (-1)^{m+1}\frac{1}{\lambda_1 + \lambda_2 + \cdots + \lambda_m}$$
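The parallel-system results (3.43), (3.45), (3.47) and (3.48) can be sketched as follows for identical units with a hypothetical constant failure rate λ = 0.001 per hour and mission time t = 500 h. The example prints the diminishing gain from each added unit and compares the exact harmonic-sum MTTF with the large-m approximation.

```python
import math

def parallel_reliability(p_list):
    """Eq. (3.43): 1 minus the product of component unreliabilities."""
    q = 1.0
    for p in p_list:
        q *= 1.0 - p
    return 1.0 - q

def parallel_mttf(lam, m):
    """Eq. (3.47): (1/lambda) * (1 + 1/2 + ... + 1/m) for m identical units."""
    return sum(1.0 / i for i in range(1, m + 1)) / lam

def parallel_mttf_approx(lam, m):
    """Eq. (3.48): large-m approximation (1/lambda)[ln m + 0.577 + 1/(2m)]."""
    return (math.log(m) + 0.577 + 1.0 / (2 * m)) / lam

lam, t = 0.001, 500.0          # hypothetical failure rate and mission time
p = math.exp(-lam * t)         # reliability of one unit at time t
for m in range(1, 5):
    gain = parallel_reliability([p] * (m + 1)) - parallel_reliability([p] * m)
    print(m, round(parallel_reliability([p] * m), 4), "gain from unit", m + 1, "=", round(gain, 4))

print(parallel_mttf(lam, 10), parallel_mttf_approx(lam, 10))
```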
3.8.3 k-out-of-m Systems
An important class of practical systems is one in which more than one of the parallel components is required to meet the demand. For example, in a power generating station with four generators, two generators may be sufficient to supply the necessary power to the consumers, and the other two are added to improve supply reliability. Another instance is a four-engine aircraft in which two engines are required for successful operation and two are kept as standby units. Many such systems are found in industry and other applications. The models developed for a simple parallel system cannot be applied to these systems.
The binomial distribution can be used to estimate the reliability of systems with identical and statistically independent components. If the probability of survival of each component is p, the probability that exactly x out of m components survive is

$$P(m, x) = B(m, x)\,p^x (1-p)^{m-x} \tag{3.49}$$

where

$$B(m, x) = \binom{m}{x}$$

is the binomial coefficient. If k is the minimum number of components required for the system to operate without failure, the system functions if k, k + 1, k + 2, ..., or m components are functioning. The system reliability is the sum of the binomial probabilities for x varying from k to m, i.e.

$$R = \sum_{i=k}^{m} B(m, i)\,p^i (1-p)^{m-i} \tag{3.50}$$

When k = m the system becomes a series system, and when k = 1 it becomes a parallel system.
In a k-out-of-m system, the reliability of the system is improved by the addition of m − k units. Such systems are sometimes called partially redundant systems, to indicate that k > 1. The k units are called basic units, whose functioning is essential for the success of the system. If the demand on the system increases, the number of basic units required increases and hence the number of redundant units decreases.
[Fig. 3.6 Power distribution system for a chemical processing industry (elements G1, G2 and F1 to F4)]
For example, consider a 4-unit power generating system in which each unit has a capacity of 65 MW, so that the system can supply a demand of up to 260 MW. When the capacity exceeds the demand, the system can meet the demand even if a unit fails. When the demand is less than 130 MW, the system functions as a 2-out-of-4 system, with a reserve of another 130 MW. The demand can be increased by up to a further 130 MW without installing any additional unit, but only at the cost of reliability. A total demand of, say, 195 MW means that the system works as a 3-out-of-4 system, with a reduction in reliability given by

$$\Delta R = 6p^2(1-p)^2$$

There is a corresponding reduction in the mean life of the system; for constant failure rates, the reduction is 1/(2λ), i.e. half the mean life 1/λ of a single unit.
Using the binomial failure probabilities, the unreliability of a k-out-of-m system can be stated as

$$Q(k, m) = \sum_{i=m-k+1}^{m} \binom{m}{i} (1-p)^i\, p^{m-i} \tag{3.51}$$

For highly reliable components with constant hazard rates,

$$Q(k, m) \approx B(m, k-1)\,(\lambda t)^{m-k+1} \tag{3.52}$$

Substituting k = 1 in this equation, it reduces to

$$Q(1, m) \approx (\lambda t)^m \tag{3.53}$$

If the values of m and k are large, the approximate formula for the mean life is

$$\text{MTTF} \approx \frac{1}{\lambda}\ln\left(\frac{m + \frac{1}{2}}{k - \frac{1}{2}}\right) \tag{3.54}$$
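The k-out-of-m formulas can be sketched with `math.comb` (Python 3.8+). The example assumes a hypothetical unit reliability p = 0.9 and reproduces the 4-unit power-station case: moving from 2-out-of-4 to 3-out-of-4 costs exactly $6p^2(1-p)^2$ in reliability. For the mean life it uses the constant-hazard result $\text{MTTF} = (1/\lambda)\sum_{i=k}^{m} 1/i$, which is not stated in the text but follows because, while j units are working, the time to the next failure has mean 1/(jλ); the failure rate λ = 0.001 per hour is hypothetical.

```python
import math

def k_out_of_m_reliability(k, m, p):
    """Eq. (3.50): probability that at least k of m identical units survive."""
    return sum(math.comb(m, i) * p ** i * (1 - p) ** (m - i) for i in range(k, m + 1))

def k_out_of_m_unreliability(k, m, p):
    """Eq. (3.51): probability that m-k+1 or more units have failed."""
    return sum(math.comb(m, i) * (1 - p) ** i * p ** (m - i) for i in range(m - k + 1, m + 1))

def k_out_of_m_mttf(k, m, lam):
    """Identical units with constant rate lam: (1/lam) * sum_{i=k}^{m} 1/i."""
    return sum(1.0 / i for i in range(k, m + 1)) / lam

def k_out_of_m_mttf_approx(k, m, lam):
    """Eq. (3.54): (1/lam) * ln((m + 1/2) / (k - 1/2)) for large m and k."""
    return math.log((m + 0.5) / (k - 0.5)) / lam

p, lam = 0.9, 0.001                       # hypothetical unit reliability and rate
r24 = k_out_of_m_reliability(2, 4, p)     # demand below 130 MW
r34 = k_out_of_m_reliability(3, 4, p)     # demand of 195 MW
print(round(r24, 4), round(r34, 4), round(r24 - r34, 4))           # drop = 6 p^2 (1-p)^2
print(round(k_out_of_m_mttf(2, 4, lam) - k_out_of_m_mttf(3, 4, lam), 6))  # 1/(2*lam)
```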
3.8.4 Non-Series-Parallel Systems
When a complex system is simplified, it may produce a non-series-parallel configuration; the bridge configuration is of this type. The reliability of such a configuration can be estimated using appropriate probability rules and a logic diagram. A simple non-series-parallel structure, a bridge configuration, is shown in Fig. 3.7. $S_1, S_2, \ldots$ are the subsystems, and the direction of the arrows shows the direction of flow. Such cases can be analysed by an approach called the logic-diagram technique.
[Fig. 3.7 Bridge network: subsystems S1 to S5 between IN and OUT, with S5 as the bridging element]
Using the logic-diagram approach, the system diagram is converted into a logic diagram consisting of simple parallel paths between the IN and OUT terminals. The successful functioning of the system depends on the successful operation of the elements in the different paths: if there is at least one continuous path between the IN and OUT terminals, the system functions successfully. The logic diagram for the system of Fig. 3.7 is shown in Fig. 3.8. Subsystem S5, like all the other subsystems, is unidirectional.
[Fig. 3.8 Logic diagram for Fig. 3.7: parallel paths S1-S3, S1-S5-S4 and S2-S4 between IN and OUT]
The logic diagram is a plain series-parallel system whose reliability can be estimated by combining the series and parallel models developed in Sections 3.8.1 and 3.8.2, but care must be taken to recognize the interdependency of the paths. Although the failures of the subsystems are assumed to be independent, the failures of the paths are not, since some subsystems appear in more than one path. The following rule takes care of this problem:
If $\Pr(E_1) = p_1 p_2$ and $\Pr(E_2) = p_2 p_3$, then

$$\Pr(E_1) \cdot \Pr(E_2) = p_1 p_2 p_3 \tag{3.55}$$

That is, $p_i \cdot p_i = p_i$ for any value of i.
Let the reliability of subsystem $S_i$ be $p_i$ and the reliability of path i be $R_i$. Then

$$R_1 = p_1 p_3, \qquad R_2 = p_1 p_5 p_4, \qquad R_3 = p_2 p_4$$

and the reliability of the entire structure is

$$R = 1 - (1 - R_1)(1 - R_2)(1 - R_3) = R_1 + R_2 + R_3 - R_1R_2 - R_2R_3 - R_1R_3 + R_1R_2R_3$$

On simplification, using $p_i \cdot p_i = p_i$, we get

$$R = p_1p_3 + p_1p_5p_4 + p_2p_4 - p_1p_3p_4p_5 - p_1p_2p_4p_5 - p_1p_2p_3p_4 + p_1p_2p_3p_4p_5 \tag{3.56}$$

For $p_1 = p_2 = \cdots = p_5 = p$,

$$R = 2p^2 + p^3 - 3p^4 + p^5 \tag{3.57}$$

In some cases the subsystem S5 can be a bidirectional element. For instance, in the systems shown in Fig. 3.9, subsystem S5 transmits signals in either direction. The logic diagram showing the paths for these systems is illustrated in Fig. 3.10.
[Fig. 3.9 Bridge networks with bidirectional element S5 (S1, S2 in the upper branch; S3, S4 in the lower branch): (a) power distribution system, (b) relay network]
[Fig. 3.10 Logic diagram of the systems in Fig. 3.9: parallel paths S1-S2, S1-S5-S4, S3-S5-S2 and S3-S4 between INPUT and OUTPUT]
The reliability of the bridge networks shown in Fig. 3.9, expanded with the same rule $p_i \cdot p_i = p_i$, is given by

$$R = 1 - (1 - p_1p_2)(1 - p_3p_4)(1 - p_1p_5p_4)(1 - p_3p_5p_2)$$

$$= p_1p_2 + p_3p_4 + p_1p_5p_4 + p_3p_5p_2 - p_1p_2p_3p_4 - p_2p_3p_4p_5 - p_3p_4p_5p_1 - p_4p_5p_1p_2 - p_5p_1p_2p_3 + 2p_1p_2p_3p_4p_5 \tag{3.58}$$

For the case $p_i = p$ for all i,

$$R = 2p^2 + 2p^3 - 5p^4 + 2p^5 \tag{3.59}$$

For p = 0.8, Eq. (3.57) yields R = 0.891 and Eq. (3.59) yields R = 0.911.
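Because the path events share subsystems, the expansions (3.56) and (3.58) can be cross-checked by brute force. The sketch below enumerates all $2^5$ up/down states of S1 to S5, assuming independent subsystems, and adds up the probabilities of the states in which at least one path of the logic diagram is fully up. The path sets are read from Figs. 3.8 and 3.10.

```python
from itertools import product

def reliability_from_paths(paths, p):
    """Exact reliability by enumerating all up/down states of the subsystems."""
    names = sorted(set().union(*paths))
    total = 0.0
    for states in product((True, False), repeat=len(names)):
        up = {n for n, s in zip(names, states) if s}
        prob = 1.0
        for n, s in zip(names, states):
            prob *= p[n] if s else 1.0 - p[n]
        if any(path <= up for path in paths):   # at least one path fully working
            total += prob
    return total

p = {i: 0.8 for i in range(1, 6)}
unidirectional = [{1, 3}, {1, 5, 4}, {2, 4}]            # Fig. 3.8
bidirectional = [{1, 2}, {3, 4}, {1, 5, 4}, {3, 5, 2}]  # Fig. 3.10
r_uni = reliability_from_paths(unidirectional, p)
r_bi = reliability_from_paths(bidirectional, p)
print(round(r_uni, 5), round(r_bi, 5))   # 0.89088 0.91136
```

Setting the reliability of S5 to 0 or 1 in the dictionary reproduces the special cases discussed next.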
If we put $p_5 = 0$ in Eqs. (3.56) and (3.58), the results are

$$R = p_1p_3 + p_2p_4 - p_1p_2p_3p_4 \quad \text{(from Eq. (3.56))}$$

$$R = p_1p_2 + p_3p_4 - p_1p_2p_3p_4 \tag{3.60}$$

The slight difference in the positions of $p_2$ and $p_3$ is due to the difference in the physical locations of S2 and S3 in the two figures.
The condition $p_5 = 0$ means that subsystem S5 is no longer in operation, and the system configuration is as shown in Fig. 3.11. In this case the reliability is given by

$$R = 1 - (1 - p_1p_2)(1 - p_3p_4) \tag{3.61}$$

It may be noted that Eqs. (3.60) and (3.61) are identical. Similarly, if we assume $p_5 = 1$, Eq. (3.58) yields

$$R = p_1p_2 + p_3p_4 + p_1p_4 + p_3p_2 - p_2p_3p_4 - p_3p_4p_1 - p_1p_2p_4 - p_1p_2p_3 + p_1p_2p_3p_4 \tag{3.62}$$

[Fig. 3.11 Bridge network with open link: paths S1-S2 and S3-S4 in parallel between IN and OUT]
The assumption $p_5 = 1$ implies that the subsystem is replaced by a direct link of unit reliability, and the system can be represented as shown in Fig. 3.12. The system reliability is then the product of the reliabilities of sections 1 and 2, where

$$R(\text{Section 1}) = 1 - (1 - p_1)(1 - p_3) = p_1 + p_3 - p_1p_3$$

$$R(\text{Section 2}) = 1 - (1 - p_2)(1 - p_4) = p_2 + p_4 - p_2p_4$$

[Fig. 3.12 Bridge network with a link of unit reliability: Section 1 (S1 and S3 in parallel) in series with Section 2 (S2 and S4 in parallel)]
Then

$$R = (p_1 + p_3 - p_1p_3)(p_2 + p_4 - p_2p_4) = p_1p_2 + p_2p_3 + p_3p_4 + p_4p_1 - p_1p_2p_3 - p_2p_3p_4 - p_3p_4p_1 - p_4p_1p_2 + p_1p_2p_3p_4 \tag{3.63}$$

It may be noted that Eq. (3.63) is the same as Eq. (3.62). For equal p's, Eq. (3.63) becomes

$$R = 4p^2 - 4p^3 + p^4 = 0.9216 \quad \text{for } p = 0.8$$
3.8.5 Systems With Mixed-Mode Failures
So far it has been assumed that a system fails when there is an open circuit between its IN and OUT terminals. Such a failure is known as an “open-mode failure”. There are, however, exceptions. For example, an electric appliance fails to function when there is a short circuit between its IN and OUT terminals, and a relay is declared failed if it does not open when it is required to open. Such failures are called “short-mode failures”. When a system contains components of this type, both types of failure, together with their probabilities, must be considered in estimating system reliability.
The reliability of a component subject to both types of failure is given by
p=1-(q0+qs) -----(3.64)
where q0 and qs are the failure probabilities due to the open mode and the short mode, respectively. Consider a system of two capacitors connected in parallel, as shown in Fig. 3.13(a). An open circuit in one of the capacitors does not cause the system to fail, but a short circuit across either of them does. In the reliability sense, therefore, the capacitors are in parallel for the open mode but in series for the short mode.
The logic diagrams used for evaluating the failure probabilities are shown in Fig. 3.13(b).
Fig. 3.13 Systems with mixed-mode failures: (a) system mode (capacitors C1 and C2 in parallel); (b) logic mode, (i) open mode, (ii) short mode
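For the two-capacitor example, a closed form follows from the logic diagrams. Assume each capacitor is independently good (probability p), open (q0) or shorted (qs), and that these three states are mutually exclusive, as in Eq. (3.64). The parallel group then works only if no capacitor is shorted and not all of them are open, so R = (1 - qs)^n - q0^n for n capacitors. The sketch below, with illustrative function names, checks this expression against a brute-force enumeration of element states.

```python
from itertools import product


def parallel_mixed_mode(q0, qs, n=2):
    """R = (1 - qs)**n - q0**n for n identical, independent parallel elements."""
    return (1.0 - qs) ** n - q0 ** n


def parallel_mixed_mode_enum(q0, qs, n=2):
    """Brute-force check over the good/open/short state of every element."""
    probs = {"good": 1.0 - q0 - qs, "open": q0, "short": qs}  # Eq. (3.64)
    total = 0.0
    for states in product(probs, repeat=n):
        if "short" in states:                 # one short circuit shorts the whole group
            continue
        if all(s == "open" for s in states):  # every element open: no conducting path
            continue
        prob = 1.0
        for s in states:
            prob *= probs[s]
        total += prob
    return total


print(round(parallel_mixed_mode(0.1, 0.05), 4))  # 0.8925 for two capacitors
```

Adding more capacitors in parallel shrinks the open-mode term q0^n but also shrinks the factor (1 - qs)^n. For that reason, parallel redundancy does not keep improving reliability when short-mode failures dominate.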
3.8.6 Fault-Tree Technique
Fault tree analysis was developed in the early 1960s. Since then, fault trees have been adopted in a broad variety of engineering disciplines for reliability and safety analysis. They graphically represent the relationships among failures and other events within a system. Basic events at the base of the fault tree are connected via logic symbols (known as gates) to one or more top events. These top events represent known hazards or system failure modes whose probability is to be predicted. The basic events usually represent human and component faults for which statistical failure and repair data are available.
Common basic events include:
Switch fails closed
Pump failure
Operator does not respond
Temperature controller failure
Fault trees can be used to investigate large and complex systems. They are particularly well suited to representing and analyzing redundancy arrangements, and they also handle common-cause events easily.
The fault-tree approach to the reliability analysis of complex systems is based on the events that cause the system to fail. “A fault tree (FT) is a diagrammatic representation of all possible fault events, their logical combinations, and their relationship to the system failure”. The faults at the lowest level of the system are represented at the base of the tree and the system fault at the top. The events at the lowest level are known as “basic events”. “Events resulting from combinations of basic events, which may or may not be of interest in themselves, are represented as intermediate events”. The probability of system failure is found by combining the failure probabilities of the basic events to obtain those of the intermediate events and, finally, that of the top event.
For simplicity, consider the three-unit system shown in Fig. 3.14. This system fails if component C1 fails, if components C2 and C3 both fail, or if all three components fail (a case already covered by the first condition). The fault-tree diagram of this system is shown in Fig. 3.15. Logical OR and AND gates are used to combine two or more events.

Fig. 3.14 A three-unit system (C1 in series with the parallel pair C2, C3)
An AND gate is a multi-input gate whose output event occurs only when all of its input events occur. If Y represents the output event and X1, X2, …, Xn are the input events, then
$$Y = X_1 \cap X_2 \cap X_3 \cap \cdots \cap X_n$$
An OR gate, on the other hand, produces an output when at least one of its input events occurs. In this case,
$$Y = X_1 \cup X_2 \cup X_3 \cup \cdots \cup X_n$$
Fig. 3.15 Fault-tree diagram for the system in Fig. 3.14 (top event “System fails” is an OR gate over event E1 and an AND gate with inputs E2 and E3)
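When the basic events are independent, the gates translate directly into probability rules. An AND gate multiplies its input probabilities, and an OR gate gives one minus the product of the complements. The sketch below applies these rules to the tree of Fig. 3.15. It assumes q1, q2 and q3 are the failure probabilities of C1, C2 and C3 (events E1, E2, E3); the numbers are illustrative.

```python
import math


def and_gate(*q):
    """Output event occurs only if all (independent) input events occur."""
    return math.prod(q)


def or_gate(*q):
    """Output event occurs if at least one (independent) input event occurs."""
    return 1.0 - math.prod(1.0 - qi for qi in q)


def top_event_probability(q1, q2, q3):
    """Fig. 3.15: system fails if E1 occurs OR (E2 AND E3) occur."""
    return or_gate(q1, and_gate(q2, q3))


q_sys = top_event_probability(0.1, 0.2, 0.3)
print(round(q_sys, 4), round(1.0 - q_sys, 4))  # 0.154 0.846 (failure, reliability)
```

The complement 1 - q_sys equals the series-parallel reliability (1 - q1)(1 - q2 q3) of Fig. 3.14. This holds only because each basic event appears once in the tree. If the same event fed several gates, the simple gate rules would no longer be exact.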
3.9 REDUNDANCY TECHNIQUES IN SYSTEM DESIGN
“Redundancy is the provision of alternative means or parallel
paths in a system for accomplishing a given task such that all means
must fail before causing a system failure”. System reliability and mean life can be increased by applying redundancy at various levels. The different approaches to introducing redundancy in a system are:
1. A duplicate path is provided for the entire system; this is known as system or unit redundancy.
2. A redundant path is provided for each component individually; this is called component redundancy.
3. In the third approach, the weak components are identified and strengthened for reliability.
4. In the last approach, a mix of the above techniques, known as mixed redundancy, is used depending on the reliability requirements and configuration of the system.
The choice of a particular technique depends on many factors, for example the weight, size, initial cost and operating characteristics of the components or systems. In electrical and electronic systems in particular, redundancy at the component level may alter the operating characteristics of the main system, so such systems require particular attention.
Redundancy can be either ‘active’, in which all redundant elements operate simultaneously to perform the same function, or ‘standby’, in which a duplicate element is switched into service when the primary element fails.
3.9.1 Component Versus Unit Redundancy
Duplication is easier at the unit level than at the component level, but component redundancy achieves higher reliability than unit redundancy.
Consider a two-component series system. Redundancy can be applied in two ways, as shown in Fig. 3.16. Assuming that the units are statistically independent and identical at each level, the reliability of the system with unit redundancy is
$$R_u = 1-(1-p_1p_2)(1-p_1p_2) = 2p_1p_2 - p_1^2p_2^2 \tag{3.65}$$
Fig. 3.16 Two-element active redundant systems: (a) unit redundancy, (b) component redundancy
where p1 and p2 are the reliabilities of components C1 and C2. In the case of component redundancy, the reliability is
$$R_c = [1-(1-p_1)^2]\,[1-(1-p_2)^2] = 4p_1p_2 - 2p_1^2p_2 - 2p_1p_2^2 + p_1^2p_2^2 \tag{3.66}$$
If we assume p1 = p2 = p for simplicity, we obtain
$$R_u = 2p^2 - p^4, \qquad R_c = p^2(2-p)^2 \tag{3.67}$$
Then,
$$R_c - R_u = p^2\left[(2-p)^2 - (2-p^2)\right] = p^2(2 - 4p + 2p^2) = 2p^2(1-p)^2 \tag{3.68}$$
It is evident from Eq. (3.68) that Rc - Ru > 0 for 0 < p < 1 and Rc - Ru = 0 for p = 1. This shows that, as far as reliability is concerned, redundancy at the component level is superior to redundancy at the unit level. The same holds when the primary components of the system are nonidentical.
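The statement for nonidentical components can be checked directly from Eqs. (3.65) and (3.66). Subtracting one from the other and factorizing gives a difference that is non-negative whenever 0 ≤ p1, p2 ≤ 1. It reduces to Eq. (3.68) when p1 = p2 = p:

$$\begin{aligned}
R_c - R_u &= \left(4p_1p_2 - 2p_1^2p_2 - 2p_1p_2^2 + p_1^2p_2^2\right) - \left(2p_1p_2 - p_1^2p_2^2\right) \\
&= 2p_1p_2 - 2p_1^2p_2 - 2p_1p_2^2 + 2p_1^2p_2^2 \\
&= 2p_1p_2(1-p_1)(1-p_2) \ge 0
\end{aligned}$$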
This analysis can be extended to the more general case in which the unit consists of n components in series. If m-1 redundant components are placed in parallel with the component at each stage, the system reliability is
$$R_c = [1-(1-p)^m]^n \tag{3.70}$$
With unit redundancy, if m-1 units are added in parallel with the primary unit, the reliability is
$$R_u = 1-(1-p^n)^m \tag{3.71}$$
A comparison of Eqs. (3.70) and (3.71) shows that a series system with five redundant components has a greater reliability than a series system with three redundant units.
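Equations (3.70) and (3.71) are easy to tabulate. The sketch below uses illustrative function names and assumes identical, independent components of reliability p, with n components in series and m elements in each parallel group. It evaluates both equations for a few cases.

```python
def component_redundancy(p, n, m):
    """Eq. (3.70): each of the n series stages has m identical elements in parallel."""
    return (1.0 - (1.0 - p) ** m) ** n


def unit_redundancy(p, n, m):
    """Eq. (3.71): m identical copies of the n-component series unit in parallel."""
    return 1.0 - (1.0 - p ** n) ** m


for n, m in [(2, 2), (3, 2), (5, 3)]:
    rc = component_redundancy(0.9, n, m)
    ru = unit_redundancy(0.9, n, m)
    print(f"n={n}, m={m}: Rc={rc:.4f}, Ru={ru:.4f}, Rc-Ru={rc - ru:.4f}")
```

For n = m = 2 the difference reproduces Eq. (3.68), and for every n and m the component-redundant arrangement is at least as reliable as the unit-redundant one.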
Another important redundancy technique is partial redundancy, popularly known as the k-out-of-m system. It becomes a series structure when k = m and a parallel structure when k = 1. Consider a simple 2-out-of-3 system. Physically, this system has three components in parallel, and it is successful as long as any two of them are working. In this case, component redundancy means duplication of each component.
The reliability of the system without redundancy is
$$R = p^2(3-2p) = 0.648 \quad \text{for } p = 0.6$$
With unit redundancy this becomes
$$R_u = 1-(1-0.648)^2 = 0.876$$
It may be noted that component redundancy changes the system into a “2-out-of-6” system, and the net reliability is
$$R_c = \sum_{i=2}^{6}\binom{6}{i}p^i(1-p)^{6-i} = 15p^2(1-p)^4 + 20p^3(1-p)^3 + 15p^4(1-p)^2 + 6p^5(1-p) + p^6 = 0.959 \quad \text{for } p = 0.6$$
Again, component redundancy gives higher reliability than unit redundancy.
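The three figures for the 2-out-of-3 example can be reproduced with the binomial form of the k-out-of-m reliability. The sketch below assumes identical, independent elements; the function name is illustrative.

```python
from math import comb


def k_out_of_m(k, m, p):
    """Reliability of a k-out-of-m system of identical, independent elements."""
    return sum(comb(m, i) * p**i * (1.0 - p) ** (m - i) for i in range(k, m + 1))


p = 0.6
r = k_out_of_m(2, 3, p)          # basic 2-out-of-3 system
r_unit = 1.0 - (1.0 - r) ** 2    # two such systems in parallel (unit redundancy)
r_comp = k_out_of_m(2, 6, p)     # every component duplicated -> 2-out-of-6
print(round(r, 3), round(r_unit, 3), round(r_comp, 3))  # 0.648 0.876 0.959
```

The same function also covers the limiting cases: k = m gives the series reliability p^m, and k = 1 gives the parallel reliability 1 - (1 - p)^m.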
3.9.2 Weakest-Link Technique
For a series structure, the reliability is at most equal to the reliability of the weakest component in the structure. Consider a simple system of two components A and B in series, with reliabilities 0.8 and 0.4, respectively. This system has a reliability of 0.8 × 0.4 = 0.32, which is much less than 0.4, the reliability of the weakest component.
The system reliability can be enhanced by one of the following methods:
1. Apply redundancy across B only
2. Apply redundancy across A only.
3. Apply redundancy across both A and B individually (component
redundancy).
Various system configurations and their resulting reliabilities are presented in Table 3.1. Types (b) and (c) each use one additional component to improve the reliability. Their reliabilities show that applying redundancy across the weaker equipment gives a higher reliability than applying it across the stronger equipment. Whatever configuration is used to improve a series structure, its overall reliability remains below that of its weakest section, so there is little pay-off in investing in any section other than the weakest one. If equipments A and B cost the same, the reliability-cost ratio of type (b) is lower than that of type (c).
Table 3.1 Redundancy Techniques (pa = 0.8, pb = 0.4)
Type  System configuration                                               System reliability
(a)   A and B in series                                                  Ra = pa pb = 0.32
(b)   two A in parallel, in series with B                                Rb = [1 - (1 - pa)^2] pb = 0.384
(c)   A in series with two B in parallel                                 Rc = pa [1 - (1 - pb)^2] = 0.512
(d)   two A-B paths in parallel (unit redundancy)                        Rd = 2 pa pb - pa^2 pb^2 = 0.5376
(e)   two A in parallel, in series with two B in parallel (component)    Re = pa pb (2 - pa)(2 - pb) = 0.6144
(f)   A in series with three B in parallel                               Rf = pa [1 - (1 - pb)^3] = 0.6272
(g)   A in series with four B in parallel                                Rg = pa [1 - (1 - pb)^4] = 0.6963
(h)   path A-(two B in parallel) in parallel with path A-B               Rh = 1 - [1 - pa pb (2 - pb)](1 - pa pb) = 0.6682
(i)   two A in parallel, in series with three B in parallel              Ri = [1 - (1 - pa)^2][1 - (1 - pb)^3] = 0.7526
(j)   three A-B paths in parallel (unit redundancy)                      Rj = 1 - (1 - pa pb)^3 = 0.6856
(k)   two paths A-(two B in parallel) in parallel                        Rk = 1 - [1 - pa pb (2 - pb)]^2 = 0.7619
(l)   three A in parallel, in series with three B in parallel            Rl = [1 - (1 - pa)^3][1 - (1 - pb)^3] = 0.7777

Now consider configurations (c), (f) and (g), in which redundancy is provided across component B only. In (c) the reliability of the B section is increased to 1 - (0.6)^2 = 0.64, which is still less than the reliability of A (0.8). It is therefore advisable to improve the reliability of B further, as shown in (f). It may be noted that the reliability of type (f) is higher than that of type (e), even though both employ the same number of equipments. In (f) the reliability of the B section is improved to 0.784, still slightly below 0.8, and in (g) to 0.8704, which is now higher than 0.8, so any further improvement should be made in A. Compare the reliabilities of types (g) and (i), both having the same number of equipments: type (i) is superior to type (g).
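The entries of Table 3.1 and the comparisons above can be reproduced with two helper functions for series and active-parallel blocks. The sketch below assumes independent elements; the configuration keys follow the table's type letters, and the helper names are illustrative.

```python
import math


def series(*r):
    """Series block: every element must work."""
    return math.prod(r)


def parallel(*r):
    """Active parallel block: at least one element must work."""
    return 1.0 - math.prod(1.0 - x for x in r)


pa, pb = 0.8, 0.4
A2, A3 = parallel(pa, pa), parallel(pa, pa, pa)
B2, B3, B4 = parallel(pb, pb), parallel(pb, pb, pb), parallel(pb, pb, pb, pb)

table_3_1 = {
    "a": series(pa, pb),
    "b": series(A2, pb),
    "c": series(pa, B2),
    "d": parallel(series(pa, pb), series(pa, pb)),
    "e": series(A2, B2),
    "f": series(pa, B3),
    "g": series(pa, B4),
    "h": parallel(series(pa, B2), series(pa, pb)),
    "i": series(A2, B3),
    "j": parallel(series(pa, pb), series(pa, pb), series(pa, pb)),
    "k": parallel(series(pa, B2), series(pa, B2)),
    "l": series(A3, B3),
}

for name, r in table_3_1.items():
    print(f"({name}) {r:.4f}")
```

Composing the two helpers mirrors the way each configuration is drawn, so a new layout can be evaluated by nesting calls rather than expanding its reliability polynomial by hand.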
3.9.3 Mixed Redundancy
Component and unit redundancies are easy to design and simple to implement, but they may not be the best configurations, and there may be scope for further improvement in their reliability-cost ratios. A comparison of types (j), (k) and (l), each of which uses six equipments, demonstrates this. In configuration (k), the reliability of the weak component is enhanced first and unit redundancy is then applied. Its reliability is considerably higher than that of the unit redundancy in (j), although still slightly below that of the component redundancy in (l). The reliability of (k) can be further increased by providing a link between the two sections.
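One possible reading of “a link between the two sections” is that the junctions following the A elements in the two paths of (k) are joined. The two A elements then form one parallel group and the four B elements another. Under that assumption (an interpretation, not a configuration listed in Table 3.1), the linked arrangement can be compared with (k) and (l) as follows.

```python
import math


def parallel(*r):
    """Active parallel block of independent elements."""
    return 1.0 - math.prod(1.0 - x for x in r)


pa, pb = 0.8, 0.4
r_k = parallel(pa * parallel(pb, pb), pa * parallel(pb, pb))  # type (k), no link
r_l = parallel(pa, pa, pa) * parallel(pb, pb, pb)              # type (l), component redundancy
r_k_linked = parallel(pa, pa) * parallel(pb, pb, pb, pb)       # (k) with the assumed link
print(round(r_k, 4), round(r_l, 4), round(r_k_linked, 4))      # 0.7619 0.7777 0.8356
```

With the same six equipments, the linked version places four elements across the weak component B and two across A. Under this interpretation it outperforms plain component redundancy.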
3.9.4 Standby Redundancy
Not all components and equipment are suitable for “active redundancy”. For example, resistors and capacitors create design problems when they are placed actively in parallel: if one of two such components operating in parallel fails, the circuit constants change. Similarly, two electric generators operating at different frequencies cannot be placed in parallel. Standby redundancy is used to increase the reliability of such systems.
In standby redundancy, the failed component or equipment is replaced, automatically or manually, by its “equivalent”. The reliability of the operator or of the sensing and switching mechanism must be considered in such cases.
Suppose that in a system a switch (contact), activated by a feedback sensing and control device, is used to replace a faulty Unit-1 by a standby Unit-2. The system reliability is then given by
$$R = p_1p_2p_c + (1-p_1)p_2p_cp_s + p_1(1-p_2)p_tp_c \tag{3.72}$$
where
p1 = reliability of Unit-1
p2 = reliability of Unit-2
pc = reliability of the switch
ps = reliability of the sensing and control device
pt = probability that no chance (premature) switching occurs.
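Equation (3.72) can be cross-checked by enumerating the states of the five independent elements. The success logic coded below is an assumption inferred from the terms of Eq. (3.72). The switch contact must always work. If Unit-1 works but Unit-2 has failed, no premature switching may occur. If Unit-1 fails, the sensing device must detect the failure and Unit-2 must work. The function names and numbers are illustrative.

```python
from itertools import product


def standby_closed_form(p1, p2, pc, ps, pt):
    """Eq. (3.72)."""
    return p1 * p2 * pc + (1 - p1) * p2 * pc * ps + p1 * (1 - p2) * pt * pc


def standby_enumerated(p1, p2, pc, ps, pt):
    """Sum the probabilities of all element states in which the system works."""
    probs = (p1, p2, pc, ps, pt)
    total = 0.0
    for u1, u2, c, s, t in product((True, False), repeat=5):
        if not c:            # switch contact failed
            works = False
        elif u1 and u2:      # either unit can carry the load
            works = True
        elif u1:             # Unit-2 failed: there must be no premature switching
            works = t
        elif u2:             # Unit-1 failed: the failure must be sensed
            works = s
        else:                # both units failed
            works = False
        if works:
            prob = 1.0
            for up, p in zip((u1, u2, c, s, t), probs):
                prob *= p if up else 1 - p
            total += prob
    return total


print(round(standby_closed_form(0.9, 0.9, 0.99, 0.95, 0.98), 4))
```

Agreement between the two functions confirms that the three terms of Eq. (3.72) are mutually exclusive and together cover every successful state under this reading of the switching logic.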
3.9.5 Redundancy Optimization
Using redundancy, adequately reliable systems can be constructed from less reliable elements. Providing parallel elements at each stage enhances system reliability, but the complexity, weight, size and cost of the system increase substantially. When designing a redundant system, it is therefore essential to keep the cost and weight to a minimum and the reliability as high as possible. For simple parallel systems, the cost and reliability can be expressed as
C = F(c, m, method of redundancy)
R = G (p, m, method of redundancy).
where ‘p’ and ‘c’ are the reliability and cost of each element and ‘m’ is the number of parallel elements.
Thus, the optimal redundancy problem reduces to determining the value of m that gives the maximum ‘R’ for a given ‘C’, or the minimum ‘C’ for a given ‘R’.
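For a single group of m identical elements in active parallel, a natural concrete choice is R = 1 - (1 - p)^m and C = c·m. The linear cost model matches assumption (ii) in the list that follows. Under these assumptions, both forms of the problem can be solved by a direct search over m, as in this sketch with hypothetical function names.

```python
def min_elements_for_reliability(p, r_target, m_max=1000):
    """Smallest m with 1 - (1 - p)**m >= r_target (minimum C for a given R)."""
    for m in range(1, m_max + 1):
        if 1.0 - (1.0 - p) ** m >= r_target:
            return m
    raise ValueError("target not reachable within m_max elements")


def max_reliability_for_budget(p, c, budget):
    """Largest affordable m under C = c*m and its reliability (maximum R for a given C)."""
    m = int(budget // c)
    if m < 1:
        raise ValueError("budget does not cover even one element")
    return m, 1.0 - (1.0 - p) ** m


print(min_elements_for_reliability(0.9, 0.995))    # elements needed for R >= 0.995
print(max_reliability_for_budget(0.9, 2.5, 10.0))  # (m, R) for a budget of 10
```

A loop is used instead of a closed-form logarithm so that floating-point rounding cannot push the answer past an exact integer boundary.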
The redundancy optimization problem can be solved by a simple method under cost and volume constraints. The assumptions made in solving the problem are as follows.
(i) The number of redundant components in each stage is a continuous variable, and the final solution is obtained by rounding the solution to integer values.
(ii) The cost and volume/weight of any stage increase linearly with the number of components in that stage.
(iii) Active redundancy is considered.
(iv) The failure of each element is independent of the others.
(v) The probability of failure of elements due to short circuit is zero.
3.9.6 Problem Formulation
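At least one element in each stage must function for the successful operation of a series system. If $m_i$ elements, each with unreliability $q_i$, are used at stage i, then the reliability of the entire system having n stages is
$$R = \prod_{i=1}^{n}\left(1 - q_i^{m_i}\right) \tag{3.73}$$
and the unreliability is
$$Q = 1 - \prod_{i=1}^{n}\left(1 - q_i^{m_i}\right) \tag{3.74}$$
When the terms $q_i^{m_i}$ are small, their products may be neglected. Then
$$R \approx 1 - \sum_{i=1}^{n} q_i^{m_i} \tag{3.75}$$
and
$$Q \approx \sum_{i=1}^{n} q_i^{m_i} \tag{3.76}$$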
The total reliability in a multi-constraint problem is limited by the following constraints:
$$\sum_{i=1}^{n} c_{ij}\,m_i \le C_j, \qquad j = 1, 2, \ldots, K \tag{3.77}$$
where $c_{ij}$ is the amount of the jth resource used by one element at the ith stage and $C_j$ is the limit on that resource. The constraints might represent total cost ($C_1$), total volume ($C_2$), etc. Equations (3.75) and (3.76) represent the objective functions, and Eq. (3.77) represents the constraint function.
The problem can now be stated as follows. Find the vector M = (m1, m2, ..., mn) that will
$$\text{maximize } R = \prod_{i=1}^{n}\left(1 - q_i^{m_i}\right)$$
subject to
$$m_i \ge 1,\ i = 1, 2, \ldots, n; \qquad c_{ij} > 0; \qquad \sum_{i=1}^{n} c_{ij}\,m_i \le C_j,\ j = 1, 2, \ldots, K$$
3.9.7 Computational Procedure
Since $m_i$ is assumed to be a continuous variable, the maximum of R can be found by differentiating Eq. (3.73) with respect to $m_i$. From Eq. (3.73),
$$\ln R = \sum_{i=1}^{n}\ln\left(1 - q_i^{m_i}\right)$$
Differentiating, we get
$$\frac{1}{R}\frac{\partial R}{\partial m_i} = \frac{-q_i^{m_i}\ln q_i}{1 - q_i^{m_i}} \tag{3.78}$$
Neglecting the products of the $q_i^{m_i}$ terms, Eq. (3.78) can be written as
$$\frac{\partial R}{\partial m_i} = -q_i^{m_i}\ln q_i, \qquad i = 1, 2, \ldots, n \tag{3.79}$$
The optimal values $m_i^*$ are obtained by equating these derivatives for all stages:
$$q_1^{m_1^*}\ln q_1 = q_2^{m_2^*}\ln q_2 = \cdots = q_n^{m_n^*}\ln q_n \tag{3.80}$$
Assuming that $m_1^*$ is known, the other values $m_2^*, m_3^*, \ldots, m_n^*$ can be determined from Eq. (3.80):
$$m_i^* = \frac{\ln(EH/G)}{G}, \qquad i = 2, 3, \ldots, n \tag{3.81}$$
where
E = $q_1^{m_1^*}$
H = ln(q1)
G = ln(qi)
The detailed procedure for computing the vector ‘M’ is given below:
Step 1: Choose an initial value of $m_1^*$.
Step 2: Find $m_i^*$, i = 2, 3, ..., n, from Eq. (3.81).
Step 3: Check the cost and volume constraints. If they are not violated, go to Step 4; otherwise go to Step 5.
Step 4: Give a small positive increment ‘dm’ to $m_1^*$ and go to Step 2.
Step 5: Round off the $m_i^*$ such that the constraints are not violated.
Step 6: Calculate the reliability, cost and volume of the resulting system.
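The six steps can be sketched in Python as follows. This is a minimal illustration under the assumptions listed in Section 3.9.5, with several choices that the text leaves open: a starting value m1* = 1, a fixed increment dm = 0.01, raising any mi* below 1 to 1, and rounding the last feasible continuous solution down to integers. Rounding down keeps every linear constraint satisfied. The stage data in the usage lines are hypothetical.

```python
import math


def allocation_from_m1(m1, q):
    """Eq. (3.81): m_i* = ln(E*H/G) / G with E = q1**m1, H = ln q1, G = ln q_i."""
    E, H = q[0] ** m1, math.log(q[0])
    m = [m1]
    for qi in q[1:]:
        G = math.log(qi)
        m.append(math.log(E * H / G) / G)
    return [max(1.0, mi) for mi in m]  # enforce m_i >= 1 (assumed clamping)


def resource_use(m, c, n_resources):
    """Left-hand sides of Eq. (3.77): sum_i c[i][j] * m[i] for each resource j."""
    return [sum(c[i][j] * m[i] for i in range(len(m))) for j in range(n_resources)]


def reliability(m, q):
    """Eq. (3.73): product over stages of (1 - q_i**m_i)."""
    return math.prod(1.0 - qi ** mi for qi, mi in zip(q, m))


def optimize_redundancy(q, c, C, m1=1.0, dm=0.01):
    last_feasible = None
    while True:
        m = allocation_from_m1(m1, q)                          # Steps 1-2
        if any(u > lim for u, lim in zip(resource_use(m, c, len(C)), C)):
            break                                              # Step 3 -> Step 5
        last_feasible = m
        m1 += dm                                               # Step 4
    if last_feasible is None:
        raise ValueError("the starting allocation already violates a constraint")
    m_int = [max(1, math.floor(mi)) for mi in last_feasible]   # Step 5 (round down)
    return m_int, reliability(m_int, q), resource_use(m_int, c, len(C))  # Step 6


# Hypothetical data: three stages; per-element cost and volume of each stage.
q = [0.2, 0.3, 0.25]
c = [[1.0, 2.0], [2.0, 1.5], [1.5, 3.0]]
C = [15.0, 25.0]
print(optimize_redundancy(q, c, C))
```

Rounding every mi* down is the simplest safe choice, but it can leave resources unused. A better integer solution may exist nearby, for example by incrementing individual mi while all constraints still hold.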
3.10 Conclusions
The basic analytical functions of reliability engineering, together with the construction of the reliability bathtub curve, are presented in this chapter. Various types of system configuration and their reliability functions are described. The problem formulation and computational procedure for the redundancy optimization of series-parallel systems with cost and volume/weight constraints are also presented.
