Reliability
History
The word reliability can be traced back to 1816, when it was first attested in the writings of the poet Samuel Taylor Coleridge.[7] Before World War II the term was linked mostly to repeatability: a test (in any type of science) was considered reliable if the same results would be obtained repeatedly. In the 1920s, product improvement through the use of statistical quality control was promoted by Dr. Walter A. Shewhart at Bell Labs.[8] Around this time Waloddi Weibull was working on statistical models for fatigue. The development of reliability engineering proceeded on a parallel path with quality. The modern use of the word reliability was defined by the U.S. military in the 1940s and has evolved to the present. It initially meant that a product would operate when expected (nowadays called mission readiness) and for a specified period of time. A notable figure in the history of reliability engineering was the German rocket scientist Wernher von Braun, described by NASA as, without doubt, the greatest rocket scientist in history. Early in his career he worked on the German V-1 missiles, which were plagued with reliability problems that could not be solved at the time by component improvements alone. Some of the first ideas of implementing redundancy in systems were adopted at this time. During his long career at NASA, von Braun was also famous for his very conservative engineering approach, using ample safety factors and redundancy. Some say that this approach cost the U.S. the race to put the first man into space, but contributed to the success of putting the first man on the Moon. He was the chief architect of the Saturn V launch vehicle, the super booster that propelled the Apollo spacecraft to the Moon.
Around World War II and afterwards, many reliability issues were due to the inherent unreliability of electronics and to fatigue. In 1945, M.A. Miner published the seminal paper titled "Cumulative Damage in Fatigue" in an ASME journal. A main application of reliability engineering in the military was the vacuum tube, as used in radar systems and other electronics, whose reliability proved very problematic and costly. The IEEE formed the Reliability Society in 1948. In 1950, on the military side, a group called the Advisory Group on the Reliability of Electronic Equipment (AGREE) was formed. This group recommended the following three main ways of working:
1. Improve Component Reliability
2. Establish quality and reliability requirements (also) for suppliers
3. Collect field data and find root causes of failures
In the 1960s, more emphasis was given to reliability testing at component and system level. The famous military standard MIL-STD-781 was created at that time. Around this period the much-used, and also much-debated, military handbook 217 was published by RCA (Radio Corporation of America) and was used for the prediction of failure rates of components. The emphasis on component reliability and empirical research (e.g., MIL-HDBK-217) alone slowly decreased, and more pragmatic approaches, as used in the consumer industries, came into use. The 1980s was a decade of great changes. Televisions had become all-semiconductor. Automobiles rapidly increased their use of semiconductors, with a variety of microcomputers under the hood and in the dash. Large air conditioning systems developed electronic controllers, as had microwave ovens and a variety of other appliances. Communications systems began to adopt electronics to replace older mechanical switching systems. Bellcore issued the first consumer prediction methodology for telecommunications, and SAE developed a similar document, SAE870050, for automotive applications. The nature of predictions evolved during the decade, and it became apparent that die complexity wasn't the only factor that determined failure rates for ICs. Kam Wong published a paper questioning the bathtub curve[9]; see also Reliability Centered Maintenance. During this decade, the failure rate of many components dropped by a factor of 10. Software became important to the reliability of systems. By the 1990s, the pace of IC development was picking up. Wider use of stand-alone microcomputers was common, and the PC market helped keep IC densities following Moore's law, doubling about every 18 months. Reliability engineering was now changing towards understanding the physics of failure. Failure rates for components kept dropping, but system-level issues became more prominent. Systems thinking became more and more important. For software, the Capability Maturity Model (CMM) was developed, which gave a more qualitative approach to reliability. ISO 9000 added reliability measures as part of the design and development portion of certification. The expansion of the World Wide Web created new challenges of security and trust. The older problem of too little reliability information available had now been replaced by too much information of questionable value. Consumer reliability problems could now be tracked with data and discussed online in real time. New technologies such as micro-electromechanical systems (MEMS), handheld GPS, and hand-held devices that combined cell phones and computers all represented challenges to maintaining reliability. Product development time continued to shorten through this decade, and what had been done in three years was now done in 18 months. This meant that reliability tools and tasks had to be tied more closely to the development process itself. In many ways, reliability became part of everyday life and consumer expectations.
Overview
Objective
The objectives of reliability engineering, in the order of priority, are: [10]
1. To apply engineering knowledge and specialist techniques to
prevent or to reduce the likelihood or frequency of failures.
2. To identify and correct the causes of failures that do occur,
despite the efforts to prevent them.
3. To determine ways of coping with failures that do occur, if their
causes have not been corrected.
4. To apply methods for estimating the likely reliability of new
designs, and for analysing reliability data.
The reason for the priority emphasis is that the first objective is by far the most effective way of working, in terms of minimizing costs and generating reliable products. The primary skills that are required, therefore, are the ability to understand and anticipate the possible causes of failures, and knowledge of how to prevent them. It is also necessary to have knowledge of the methods that can be used for analysing designs and data.
Effective reliability engineering requires an understanding of the basics of failure mechanisms, for which experience, broad engineering skills, and good knowledge of many different special fields of engineering are required,[11] for example:
Tribology
Stress (mechanics)
Thermal engineering
Electrical engineering
Material science
...
Definitions
Reliability may be defined in the following ways:
The idea that an item is fit for a purpose with respect to time
The capacity of a designed, produced, or maintained item to perform as required over time
Because of the many reliability techniques, their expense, and the varying degrees of reliability required for different situations, most projects develop a reliability program plan to specify the reliability tasks that will be performed for that specific system.
Consistent with the creation of safety cases, for example per ARP4761, the goal of reliability assessments is to provide a robust set of qualitative and quantitative evidence that the use of a component or system will not be associated with unacceptable risk. The basic steps to take are described in the cited guidance.[13]
Risk here is the combination of the probability and the severity of the failure incident (scenario) occurring. In a de minimis definition, the severity of failures includes the cost of spare parts, man-hours, logistics, damage (secondary failures), and downtime of machines which may cause production loss. A more complete definition of failure also can mean injury, dismemberment, and death of people within the system (witness mine accidents, industrial accidents, space shuttle failures) and the same to innocent bystanders (witness the citizenry of cities like Bhopal, Love Canal, Chernobyl, or Sendai, and other victims of the 2011 Tōhoku earthquake and tsunami); in this case, reliability engineering becomes system safety. What is acceptable is determined by the managing authority, customers, or the affected communities. Residual risk is the risk that is left over after all reliability activities have finished; it includes the unidentified risk and is therefore not completely quantifiable.
Unreliability also drives costs beyond the repairs themselves: reduced availability, logistic delays, lack of repair facilities, extensive retrofit and complex configuration management costs, and others. The problem of unreliability may also be increased by the "domino effect" of maintenance-induced failures after repairs. Focusing only on maintainability is therefore not enough. If failures are prevented, none of the other issues are of any importance, and therefore reliability is generally regarded as the most important part of availability. Reliability needs to be evaluated and improved in relation to both availability and the total cost of ownership (due to the cost of spare parts, maintenance man-hours, transport costs, storage costs, part obsolescence risks, etc.). But, as GM and Toyota have belatedly discovered, TCO also includes the downstream liability costs when reliability calculations do not sufficiently or accurately address customers' personal bodily risks. Often a trade-off is needed between the two. There might be a maximum ratio between availability and cost of ownership. Testability of a system should also be addressed in the plan, as this is the link between reliability and maintainability. The maintenance strategy can influence the reliability of a system (e.g., by preventive and/or predictive maintenance), although it can never bring it above the inherent reliability.
The reliability plan should clearly provide a strategy for availability control. Whether only availability or also cost of ownership is more important depends on the use of the system. For example, a system that is a critical link in a production chain (e.g., a big oil platform) is normally allowed to have a very high cost of ownership if that cost translates to even a minor increase in availability, as the unavailability of the platform results in a massive loss of revenue which can easily exceed the high cost of ownership. A proper reliability plan should always address RAMT analysis in its total context. RAMT stands in this case for reliability, availability, maintainability/maintenance, and testability in the context of the customer's needs.
Reliability requirements
For any system, one of the first tasks of reliability engineering is to adequately specify the reliability and maintainability requirements derived from the overall availability needs and, more importantly, from proper design failure analysis or preliminary prototype test results. Clear requirements (able to be designed to) should constrain the designers from designing particular unreliable items/constructions/interfaces/systems. Setting only availability, reliability, testability, or maintainability allocation targets (e.g., maximum failure rates) is not appropriate; this is a broad misunderstanding about reliability requirements engineering. Reliability requirements address the system itself, including test and assessment requirements, and associated tasks and documentation. Reliability requirements are included in the appropriate system or subsystem requirements specifications, test plans, and contract statements. Creation of proper lower-level requirements is critical.[14]
Provision of only quantitative minimum targets (e.g., MTBF values or failure rates) is not sufficient, for several reasons. One reason is that a full validation (related to correctness and verifiability in time) of a quantitative reliability allocation (requirement specification) at lower levels of complex systems often cannot be made, as a consequence of (1) the fact that the requirements are probabilistic, (2) the extremely high level of uncertainty involved in showing compliance with all these probabilistic requirements, and (3) the fact that reliability is a function of time, and accurate estimates of a (probabilistic) reliability number per item are available only very late in the project, sometimes even only after many years of in-service use. Compare this problem with the continuous (re-)balancing of, for example, lower-level system mass requirements in the development of an aircraft, which is already often a big undertaking. Notice that in this case masses differ only by a few percent, are not a function of time, and the data is non-probabilistic and already available in CAD models. In the case of reliability, the levels of unreliability (failure rates) may change by factors of decades (thousands of percent) as a result of very minor deviations in design, process, or anything else.[15] The information is often not available without huge uncertainties within the development phase. This makes the allocation problem almost impossible to solve in a useful, practical, valid manner that does not result in massive over- or under-specification. A pragmatic approach is therefore needed; for example, the use of general levels/classes of quantitative requirements depending only on the severity of failure effects. Also, the validation of results is a far more subjective task than for any other type of requirement. (Quantitative) reliability parameters, in terms of MTBF, are by far the most uncertain design parameters in any design.
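To illustrate why failure rates can shift by orders of magnitude from minor deviations, the sketch below uses Basquin's stress-life relation as an assumed illustrative model; the exponent value and stress ratios are assumptions for illustration, not figures from this article.

```python
# Illustrative sketch (assumed model and numbers): Basquin's stress-life
# relation N = C * S**(-b) implies that the ratio of fatigue lives for a
# stress change S2/S1 is (S2/S1)**(-b). The exponent b is material-dependent;
# b = 10 is a rough value for some metals, used here only for illustration.

def fatigue_life_ratio(stress_ratio: float, b: float = 10.0) -> float:
    """Ratio of fatigue lives N2/N1 for a stress ratio S2/S1 under Basquin's law."""
    return stress_ratio ** (-b)

for increase in (1.05, 1.10, 1.25):
    ratio = fatigue_life_ratio(increase)
    print(f"{(increase - 1) * 100:.0f}% higher stress -> life multiplied by {ratio:.3f}")
# A 25% stress increase cuts the expected life by roughly a factor of ten here,
# i.e., failure behaviour shifts by an order of magnitude from a minor deviation.
```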
Furthermore, reliability design requirements should drive a (system or part) design to incorporate features that prevent failures from occurring, or limit consequences from failure, in the first place. Merely making predictions could distract the engineering effort into a kind of accounting work. A design requirement should be precise enough that a designer can "design to" it and can also prove, through analysis or testing, that the requirement has been achieved, if possible within some stated confidence. Any type of reliability requirement should be detailed and could be derived from failure analysis (finite element stress and fatigue analysis, reliability hazard analysis, FTA, FMEA, human factor analysis, functional hazard analysis, etc.) or any type of reliability testing. Also, requirements are needed for verification tests (e.g., required overload loads or stresses) and the test time needed. To derive these requirements in an effective manner, a systems-engineering-based risk assessment and mitigation logic should be used. Robust hazard log systems must be created that contain detailed information on why and how systems could fail or have failed. Requirements are to be derived and tracked in this way. These practical design requirements shall drive the design and not be used only for verification purposes. These requirements (often design constraints) are in this way derived from failure analysis or preliminary tests. Understanding the difference between this approach and purely quantitative (logistic) requirement specification (e.g., failure rate / MTBF setting) is paramount in the development of successful (complex) systems.[16]
The maintainability requirements address the costs of repairs as well as repair time. Testability (not
to be confused with test requirements) requirements provide the link between reliability and
maintainability and should address detectability of failure modes (on a particular system level),
isolation levels and the creation of diagnostics (procedures).
As indicated above, reliability engineers should also address requirements for various reliability
tasks and documentation during system development, test, production, and operation. These
requirements are generally specified in the contract statement of work and depend on how much
leeway the customer wishes to provide to the contractor. Reliability tasks include various analyses,
planning, and failure reporting. Task selection depends on the criticality of the system as well as
cost. A safety critical system may require a formal failure reporting and review process throughout
development, whereas a non-critical system may rely on final test reports. The most common
reliability program tasks are documented in reliability program standards, such as MIL-STD-785 and
IEEE 1332. Failure reporting, analysis, and corrective action systems (FRACAS) are a common approach for product/process reliability monitoring.
Human error can contribute to unreliability in many areas, including:
Assumptions
Design
Design drawings
Statistical analysis
Manufacturing
Quality control
Maintenance
Maintenance manuals
Training
etc.
However, humans are also very good at detecting such failures, correcting them, and improvising when abnormal situations occur. The policy that human actions should be completely ruled out of any design and production process to improve reliability may therefore not be effective. Some tasks are better performed by humans and some are better performed by machines.[17]
Furthermore, human errors in management and the organization of data and information or the
misuse or abuse of items may also contribute to unreliability. This is the core reason why high levels
of reliability for complex systems can only be achieved by following a robust systems
engineering process with proper planning and execution of the validation and verification tasks. This
also includes careful organization of data and information sharing and creating a "reliability culture"
in the same sense as having a "safety culture" is paramount in the development of safety critical
systems.
Furthermore, the most unreliable and important items (the most interesting candidates for a reliability investigation) are most often subjected to many modifications and changes. Engineering designs in most industries are updated frequently. This is the reason why the standard (reactive or proactive) statistical methods and processes used in the medical industry or the insurance branch are not as effective for engineering. Another surprising but logical argument is that, to be able to accurately predict reliability by testing, the exact mechanisms of failure must in most cases already be known, and therefore in most cases can be prevented! Following the incorrect route of trying to quantify and solve a complex reliability engineering problem in terms of MTBF or probability, using a reactive approach, is referred to by Barnard as "Playing the Numbers Game" and is regarded as bad practice.[5]
For existing systems, it is arguable that responsible programs would directly analyse and try to correct the root cause of discovered failures, and thereby may render the initial MTBF estimate fully invalid, as new assumptions (themselves subject to high error levels) about the effect of the patch or redesign must be made. Another practical issue is the general lack of availability of detailed failure data, together with inconsistent filtering of failure (feedback) data and the ignoring of statistical errors, which are very high for rare events (like reliability-related failures). Very clear guidelines must be present to count and compare failures related to different types of root causes (e.g., manufacturing-, maintenance-, transport-, or system-induced failures, or inherent design failures). Comparing different types of causes may lead to incorrect estimations and incorrect business decisions about the focus of improvement.
Performing a proper quantitative reliability prediction for systems may be difficult and very expensive if done by testing. At part level, results can often be obtained with higher confidence, as many samples might be used for the available testing budget; unfortunately, these tests might lack validity at system level due to the assumptions that had to be made for part-level testing. These authors argue that it cannot be emphasized enough that testing for reliability should be done to create failures in the first place, to learn from them, and to improve the system or part. The general conclusion is drawn that an accurate and absolute prediction of reliability, whether by field data comparison or by testing, is in most cases not possible. An exception might be failures due to wear-out problems such as fatigue failures. In the introduction of MIL-STD-785 it is written that reliability prediction should be used with great caution, if not used only for comparison in trade-off studies.
See also: Risk assessment § Quantitative risk assessment, criticism paragraph
One of the most important design techniques is redundancy. This means that if one part of the system fails, there is an alternate success path, such as a backup system. The reason why this is the ultimate design choice is that high-confidence reliability evidence for new parts or items is often not available, or is extremely expensive to obtain. By creating redundancy, together with a high level of failure monitoring and the avoidance of common-cause failures, even a system with relatively poor single-channel (part) reliability can be made highly reliable at system level (mission reliability). No reliability testing is required for this. Furthermore, by using redundancy together with dissimilar design and manufacturing processes (different suppliers) for the single independent channels, less sensitivity to quality issues (e.g., early-childhood failures) is created, and very high levels of reliability can be achieved at all moments of the development cycle (from early life to long term). Redundancy can also be applied in systems engineering by double checking requirements, data, designs, calculations, software, and tests to overcome systematic failures.
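A minimal sketch of the redundancy argument above: assuming independent, identical channels (i.e., common-cause failures excluded, which must be justified separately in a real design), system mission reliability grows quickly with the number of parallel channels. The channel reliability value is assumed for illustration.

```python
# Minimal sketch (assumed numbers): mission reliability of n independent,
# identical channels in parallel. The system succeeds if at least one
# channel survives; independence excludes common cause failures.

def parallel_reliability(r_channel: float, n: int) -> float:
    """Probability that at least one of n independent channels survives."""
    return 1.0 - (1.0 - r_channel) ** n

r = 0.90  # assumed single-channel mission reliability
for n in (1, 2, 3):
    print(f"{n} channel(s): R_system = {parallel_reliability(r, n):.4f}")
# 1 channel: 0.9000, 2 channels: 0.9900, 3 channels: 0.9990
```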
Another design technique to prevent failures is called physics of failure. This technique relies on understanding the physical static and dynamic failure mechanisms. It accounts for variation in load, strength, and stress leading to failure at a high level of detail, possibly with the use of modern finite element method (FEM) software programs that can handle complex geometries and mechanisms such as creep, stress relaxation, fatigue, and probabilistic design (Monte Carlo simulations / DOE). The material or component can be redesigned to reduce the probability of failure and to make it more robust against variation. Another common design technique is component derating: selecting components whose tolerance significantly exceeds the expected stress, such as using a heavier-gauge wire that exceeds the normal specification for the expected electrical current.
Another effective way to deal with unreliability issues is to perform analyses that predict degradation, enabling the prevention of unscheduled downtime events/failures. RCM (reliability-centered maintenance) programs can be used for this.
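As a hedged illustration of such degradation-based prediction, the sketch below fits a linear trend to an invented degradation indicator and extrapolates it to an assumed failure threshold to estimate remaining useful life; real RCM programs use validated degradation models and measured condition data.

```python
# Hedged sketch (invented data): fit a linear trend to a degradation
# indicator and extrapolate to an assumed failure threshold.
import numpy as np

hours = np.array([0.0, 500.0, 1000.0, 1500.0, 2000.0])  # operating hours
wear = np.array([0.02, 0.11, 0.19, 0.31, 0.40])         # assumed wear indicator
threshold = 0.80                                        # assumed failure threshold

slope, intercept = np.polyfit(hours, wear, 1)           # least-squares line
hours_at_threshold = (threshold - intercept) / slope
remaining = hours_at_threshold - hours[-1]
print(f"Predicted threshold crossing at ~{hours_at_threshold:.0f} h "
      f"(~{remaining:.0f} h remaining useful life)")
```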
Many tasks, techniques and analyses are specific to particular industries and applications.
Commonly these include:
Built-in test (BIT) (testability analysis)
Failure mode and effects analysis (FMEA)
Reliability hazard analysis
Accelerated testing
Electromagnetic analysis
Testability analysis
Manual screening
Answers to the first questions will drive improvement in design and processes.[4] When failure mechanisms are really understood, solutions to prevent failure are easily found. Required numbers alone (e.g., MTBF) will not drive good designs. The huge number of (un)reliability hazards that are generally part of complex systems must first be classified and ordered (based on qualitative and quantitative logic where possible) to allow efficient assessment and improvement. This is partly done in plain language and propositional logic, but also based on experience with similar items. This can, for example, be seen in descriptions of events in fault tree analysis, FMEA analysis, and hazard (tracking) logs. In this sense, language and proper grammar (part of qualitative analysis) play an important role in reliability engineering, just as they do in safety engineering and in general within systems engineering. Engineers may ask why this is needed: precisely because systems engineering is very much about finding the correct words to describe the problem (and the related risks) to be solved by the engineering solutions we intend to create. In the words of Jack Ring, the systems engineer's job is to "language the project" [Ring et al. 2000].[19] Language in itself is about putting order in a description of the reality of a (failure of a) complex function/item/system in complex surroundings. Reliability engineers use both quantitative and qualitative methods, which extensively use language, to pinpoint the risks to be solved.
The importance of language also relates to the risks of human error, which can be seen as the ultimate root cause of almost all failures; this is discussed further below. As an example, proper instructions (often written by technical authors in so-called simplified English) are needed in maintenance manuals, operation manuals, emergency procedures, and elsewhere to prevent systematic human errors in any maintenance or operational task that may result in system failures.
Reliability modelling
Reliability modelling is the process of predicting or understanding the reliability of a component or system prior to its implementation. Two types of analysis that are often used to model the availability behavior of a complete system (including effects from logistics issues like spare part provisioning, transport, and manpower) are fault tree analysis and reliability block diagrams. At component level, the same types of analysis can be used together with others. The input for the models can come from many sources: testing, earlier operational experience, field data, or data handbooks from the same or mixed industries. In all cases, the data must be used with great caution, as predictions are only valid when the same product is used in the same context. Often predictions are only made to compare alternatives.
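The following is a minimal sketch of the reliability block diagram logic mentioned above, for a series system containing one redundant parallel stage; all block reliabilities and the independence assumption are illustrative, not taken from a real prediction.

```python
# Minimal reliability block diagram (RBD) sketch with assumed block values:
# a series system in which one stage is a redundant parallel pair.
# Independence between blocks is assumed.

def series(*rs: float) -> float:
    """All series blocks must work."""
    result = 1.0
    for r in rs:
        result *= r
    return result

def parallel(*rs: float) -> float:
    """At least one parallel block must work."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

r_pump, r_valve, r_controller = 0.95, 0.99, 0.90  # assumed block reliabilities
r_system = series(r_pump, r_valve, parallel(r_controller, r_controller))
print(f"R_system = {r_system:.4f}")  # 0.95 * 0.99 * 0.99 = 0.9311
```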
For part-level predictions, two separate fields of investigation are common: the physics-of-failure approach, which uses an understanding of the physical failure mechanisms involved, and the empirical parts-stress modelling approach, which predicts failure rates from component counts and applied stresses.
Reliability theory
Main articles: Reliability theory, Failure rate and Survival analysis
Reliability is defined as the probability that a device will perform its intended function during a
specified period of time under stated conditions. Mathematically, this may be expressed as

R(t) = \Pr\{T > t\} = \int_t^{\infty} f(x)\,dx ,

where f(x) is the failure probability density function and t is the length of the period of time (which is assumed to start from time zero).
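For the special case of a constant failure rate λ, the density is f(x) = λe^(−λx), so the reliability function reduces to R(t) = e^(−λt) with MTBF = 1/λ. The short sketch below evaluates this case; the failure rate value is an assumption for illustration.

```python
# Sketch of the constant failure rate special case (assumed rate value):
# f(x) = lam * exp(-lam * x) gives R(t) = exp(-lam * t) and MTBF = 1 / lam.
import math

lam = 1e-4        # assumed failure rate, failures per hour
mtbf = 1.0 / lam  # 10,000 hours

def reliability(t_hours: float) -> float:
    """Probability of surviving beyond t hours at constant failure rate lam."""
    return math.exp(-lam * t_hours)

print(f"R(1000 h) = {reliability(1000):.4f}")  # ~0.9048
print(f"R(MTBF)   = {reliability(mtbf):.4f}")  # ~0.3679: only ~37% reach the MTBF
```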
A special case of mission success is the single-shot device or system. These are devices or systems
that remain relatively dormant and only operate once. Examples include automobile airbags,
thermal batteries and missiles. Single-shot reliability is specified as a probability of one-time
success, or is subsumed into a related parameter. Single-shot missile reliability may be specified as
a requirement for the probability of a hit. For such systems, the probability of failure on demand
(PFD) is the reliability measure which actually is an unavailability number. This PFD is derived from
failure rate (a frequency of occurrence) and mission time for non-repairable systems.
For repairable systems, it is obtained from failure rate and mean-time-to-repair (MTTR) and test
interval. This measure may not be unique for a given system as this measure depends on the kind of
demand. In addition to system level requirements, reliability requirements may be specified for
critical subsystems. In most cases, reliability parameters are specified with appropriate
statistical confidence intervals.
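As a sketch of how such numbers are derived, the example below applies two standard first-order approximations: PFD ≈ 1 − e^(−λt) for a non-repairable mission, and the low-demand average PFD ≈ λτ/2 for a system proof-tested at interval τ. The parameter values are assumed for illustration.

```python
# Hedged sketch (assumed numbers) of two common first-order approximations:
# non-repairable mission:               PFD ~ 1 - exp(-lam * t)
# repairable, proof-tested at interval: average PFD ~ lam * tau / 2
import math

lam = 2e-6            # assumed dangerous failure rate per hour
mission_time = 100.0  # hours, non-repairable single-shot mission
tau = 8760.0          # hours between proof tests (yearly), repairable case

pfd_mission = 1.0 - math.exp(-lam * mission_time)
pfd_average = lam * tau / 2.0

print(f"Non-repairable mission PFD ~ {pfd_mission:.2e}")  # ~2.0e-04
print(f"Low-demand average PFD    ~ {pfd_average:.2e}")   # ~8.8e-03
```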
Reliability testing
The purpose of reliability testing is to discover potential problems with the design as early as
possible and, ultimately, provide confidence that the system meets its reliability requirements.
Reliability testing may be performed at several levels and there are different types of testing.
Complex systems may be tested at component, circuit board, unit, assembly, subsystem and system
levels.[20][1] (The test level nomenclature varies among applications.) For example, performing
environmental stress screening tests at lower levels, such as piece parts or small assemblies,
catches problems before they cause failures at higher levels. Testing proceeds during each level of
integration through full-up system testing, developmental testing, and operational testing, thereby
reducing program risk. However, testing does not mitigate unreliability risk.
With each test, both statistical type I and type II errors can be made; their probabilities depend on sample size, test time, assumptions, and the needed discrimination ratio. There is a risk of incorrectly rejecting a good design (type I error, the producer's risk) and a risk of incorrectly accepting a bad design (type II error, the consumer's risk).
It is not always feasible to test all system requirements. Some systems are prohibitively expensive to
test; some failure modes may take years to observe; some complex interactions result in a huge
number of possible test cases; and some tests require the use of limited test ranges or other
resources. In such cases, different approaches to testing can be used, such as (highly) accelerated
life testing, design of experiments, and simulations.
The desired level of statistical confidence also plays a role in reliability testing. Statistical
confidence is increased by increasing either the test time or the number of items tested. Reliability
test plans are designed to achieve the specified reliability at the specified confidence level with the
minimum number of test units and test time. Different test plans result in different levels of risk to the
producer and consumer. The desired reliability, statistical confidence, and risk levels for each side
influence the ultimate test plan. The customer and developer should agree in advance on how
reliability requirements will be tested.
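A common concrete instance of this trade-off is the zero-failure ("success run") demonstration test, where the binomial relation (1 − C) = R^n gives the required number of units. The sketch below computes it for assumed reliability and confidence targets.

```python
# Sketch of a zero-failure ("success run") demonstration plan: from the
# binomial relation (1 - C) = R**n, the required number of units tested
# without failure is n = ln(1 - C) / ln(R). Targets below are assumptions.
import math

def success_run_units(reliability: float, confidence: float) -> int:
    """Units that must pass, failure-free, to show `reliability` at `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

print(success_run_units(0.90, 0.90))  # 22 units
print(success_run_units(0.99, 0.90))  # 230 units: tighter targets get expensive fast
```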
A key aspect of reliability testing is to define "failure". Although this may seem obvious, there are
many situations where it is not clear whether a failure is really the fault of the system. Variations in
test conditions, operator differences, weather and unexpected situations create differences between
the customer and the system developer. One strategy to address this issue is to use a scoring
conference process. A scoring conference includes representatives from the customer, the
developer, the test organization, the reliability organization, and sometimes independent observers.
The scoring conference process is defined in the statement of work. Each test case is considered by
the group and "scored" as a success or failure. This scoring is the official result used by the reliability
engineer.
As part of the requirements phase, the reliability engineer develops a test strategy with the customer.
The test strategy makes trade-offs between the needs of the reliability organization, which wants as
much data as possible, and constraints such as cost, schedule and available resources. Test plans
and procedures are developed for each reliability test, and results are documented.
Accelerated testing
The purpose of accelerated life testing (ALT test) is to induce field failure in the laboratory at a much
faster rate by providing a harsher, but nonetheless representative, environment. In such a test, the product is expected to fail in the lab just as it would have failed in the field, but in much less time.
The main objective of an accelerated test is either of the following:
To discover failure modes
To predict the normal field life from the high stress lab life
An accelerated testing program can be broken down into a sequence of steps: defining the objective and scope of the test, identifying the stresses to apply and their levels, conducting the test, and analyzing the data. Common models used to relate applied stress to product life include:
Arrhenius model
Eyring model
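As an illustration of the Arrhenius model listed above, the sketch below computes a temperature acceleration factor. The activation energy and temperatures are assumed values; in practice the activation energy depends on the specific failure mechanism.

```python
# Illustration of the Arrhenius acceleration factor:
# AF = exp[(Ea / k) * (1/T_use - 1/T_stress)], temperatures in kelvin.
# Ea = 0.7 eV is a commonly quoted ballpark, used here only as an assumption.
import math

K_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / K_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

af = arrhenius_af(ea_ev=0.7, t_use_c=40.0, t_stress_c=125.0)
print(f"Acceleration factor ~ {af:.0f}")  # each lab hour ~ this many field hours
```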
Software reliability
Further information: Software reliability
Software reliability is a special aspect of reliability engineering. System reliability, by definition,
includes all parts of the system, including hardware, software, supporting infrastructure (including
critical external interfaces), operators and procedures. Traditionally, reliability engineering focuses on
critical hardware parts of the system. Since the widespread use of digital integrated
circuit technology, software has become an increasingly critical part of most electronics and, hence,
nearly all present day systems.
There are significant differences, however, in how software and hardware behave. Most hardware
unreliability is the result of a component or material failure that results in the system not performing
its intended function. Repairing or replacing the hardware component restores the system to its
original operating state. However, software does not fail in the same sense that hardware fails.
Instead, software unreliability is the result of unanticipated results of software operations. Even
relatively small software programs can have astronomically large combinations of inputs and states
that are infeasible to exhaustively test. Restoring software to its original state only works until the
same combination of inputs and states results in the same unintended result. Software reliability
engineering must take this into account.
Despite this difference in the source of failure between software and hardware, several software
reliability models based on statistics have been proposed to quantify what we experience with
software: the longer software is run, the higher the probability that it will eventually be used in an
untested manner and exhibit a latent defect that results in a failure (Shooman 1987), (Musa 2005),
(Denney 2005).
As with hardware, software reliability depends on good requirements, design and implementation.
Software reliability engineering relies heavily on a disciplined software engineering process to
anticipate and design against unintended consequences. There is more overlap between
software quality engineering and software reliability engineering than between hardware quality and
reliability. A good software development plan is a key aspect of the software reliability program. The
software development plan describes the design and coding standards, peer reviews, unit
tests, configuration management, software metrics and software models to be used during software
development.
A common reliability metric is the number of software faults, usually expressed as faults per
thousand lines of code. This metric, along with software execution time, is key to most software
reliability models and estimates. The theory is that software reliability increases as the number of faults (or the fault density) decreases. Establishing a direct connection between fault density and mean time between failures is difficult, however, because of the way software faults are distributed in the code, their severity, and the probability of the combination of inputs necessary to encounter the fault. Nevertheless, fault density serves as a useful indicator for the reliability engineer. Other software metrics, such as complexity, are also used. Fault density as a metric remains controversial, since changes in software development and verification practices can have a dramatic impact on overall defect rates.
Testing is even more important for software than hardware. Even the best software development process results in some software faults that are nearly undetectable until tested. As with hardware, software is tested at several levels, starting with individual units, through integration and full-up system testing. Unlike hardware, it is inadvisable to skip
levels of software testing. During all phases of testing, software faults are discovered, corrected, and
re-tested. Reliability estimates are updated based on the fault density and other metrics. At a system
level, mean-time-between-failure data can be collected and used to estimate reliability. Unlike
hardware, performing exactly the same test on exactly the same software configuration does not
provide increased statistical confidence. Instead, software reliability uses different metrics, such
as code coverage.
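As a sketch of how fault counts feed a reliability estimate, the example below evaluates the failure intensity of a simple reliability growth model in the spirit of Musa's basic execution-time model; the initial intensity and total expected failures are assumed values for illustration.

```python
# Sketch in the spirit of the basic execution-time model (Musa): failure
# intensity decreases linearly with the number of failures experienced,
# lam(mu) = lam0 * (1 - mu / nu0). lam0 and nu0 are assumed values.

lam0 = 10.0  # assumed initial failure intensity, failures per 1000 CPU hours
nu0 = 100.0  # assumed total failures expected over infinite execution time

def failure_intensity(mu: float) -> float:
    """Failure intensity after mu failures have been experienced and fixed."""
    return lam0 * max(0.0, 1.0 - mu / nu0)

for mu in (0, 25, 50, 90):
    print(f"after {mu:3d} failures: lambda = {failure_intensity(mu):4.1f} per 1000 h")
```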
Eventually, the software is integrated with the hardware in the top-level system, and software
reliability is subsumed by system reliability. The Software Engineering Institute's capability maturity model is a common means of assessing the overall software development process for reliability and quality purposes.
In safety engineering, reliability requirements are sometimes extremely high. Safety engineering deals with unwanted dangerous events (for life, property, and the environment) in the same sense as reliability engineering, but it normally does not directly look at cost and is not concerned with repair actions after failures or accidents (at system level). Another difference is the level of impact of failures on society, and the resulting control by governments: safety engineering is often strictly regulated by governments (e.g., in the nuclear, aerospace, defense, rail, and oil industries).[21]
Furthermore, safety engineering and reliability engineering may even have contradicting requirements. This relates to system-level architecture choices.[citation needed] For example, in train signal control systems it is common practice to use a fail-safe system design concept. In this concept, wrong-side failures need to be fully controlled to an extremely low failure rate. These failures are related to possibly severe effects, like frontal collisions (e.g., two GREEN lights shown at once). Systems are designed so that the vast majority of failures will simply result in a temporary or total loss of signals, or in open relay contacts, generating RED lights for all trains. This is the safe state: all trains are stopped immediately. This fail-safe logic might unfortunately lower the reliability of the system, because any full, temporary, or intermittent failure is quickly latched into a shut-down (safe) state, increasing the risk of false tripping. Different solutions are available for this issue; see the section on fault tolerance below.
Fault tolerance
Reliability can be increased here by using 2oo2 (2-out-of-2) redundancy at part or system level, but this in turn lowers the safety levels (more possibilities for wrong-side and undetected dangerous failures). Fault-tolerant voting systems (e.g., 2oo3 voting logic) can increase both reliability and safety at system level. In this case both the so-called "operational" or "mission" reliability and the safety of a system can be increased. This is also common practice in aerospace systems that need continued availability and do not have a fail-safe mode: for example, flight computers and the related electrical, mechanical, and/or hydraulic steering functions need to be working at all times, as there are no safe fixed positions for the rudder or other steering surfaces when the aircraft is flying.
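A minimal sketch of this trade-off, assuming independent channels with separate "safe" (spurious trip) and "dangerous" failure probabilities, both invented for illustration: 2oo2 reduces spurious trips but increases dangerous failures relative to a single channel, while 2oo3 majority voting improves both.

```python
# Sketch of the reliability/safety trade-off between voting architectures,
# with independent channels and two assumed per-channel failure modes:
# "safe" failures causing a spurious trip (p_s) and undetected dangerous
# failures (p_d). All probabilities are invented for illustration.
from math import comb

def at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent channels have failed."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_s, p_d = 1e-3, 1e-4  # assumed per-channel safe / dangerous failure probabilities

# 2oo2: both channels must demand a trip, so a spurious trip needs both to
# fail safe, while a single dangerous failure defeats the function.
# 2oo3: majority voting improves both figures relative to one channel.
cases = {
    "1oo1": (p_s, p_d),
    "2oo2": (at_least(2, 2, p_s), at_least(1, 2, p_d)),
    "2oo3": (at_least(2, 3, p_s), at_least(2, 3, p_d)),
}
for name, (spurious, dangerous) in cases.items():
    print(f"{name}: spurious trip ~ {spurious:.1e}, dangerous failure ~ {dangerous:.1e}")
```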
Reliability operational assessment
After a system is produced, reliability engineering monitors, assesses, and corrects deficiencies. Monitoring includes electronic and visual surveillance of critical parameters identified during the fault tree analysis design stage. Data collection is highly dependent on the nature of the system. Most large organizations have quality control groups that collect failure data on vehicles, equipment, and machinery. Consumer product failures are often tracked by the number of returns. For systems in dormant storage or on standby, it is necessary to establish a formal surveillance program.
Reliability organizations
What is MIL-HDBK-217?
The original reliability prediction handbook was MIL-HDBK-217, the military handbook for "Reliability Prediction of Electronic Equipment". MIL-HDBK-217 is published by the Department of Defense, based on work done by the Reliability Analysis Center and Rome Laboratory at Griffiss AFB, NY.
The MIL-HDBK-217 handbook contains failure rate models for the various part types used in electronic systems, such as ICs, transistors, diodes, resistors, capacitors, relays, switches, connectors, etc. These failure rate models are based on the best field data that could be obtained for a wide variety of parts and systems; this data is then analyzed and massaged, with many simplifying assumptions thrown in, to create usable models.
The latest version of MIL-HDBK-217 is MIL-HDBK-217F, Notice 2 (217F2). You can get a copy of MIL-HDBK-217F2 from any source that provides Mil Specs,