REDUNDANCY CLASSIFICATION FOR FAULT
TOLERANT COMPUTER DESIGN
ALGIRDAS PAKSTAS*, IGOR SCHAGAEV**, JANUSZ ZALEWSKI***
*SIMT, University of North London, Holloway Rd, 166-220,N7 8DB, a.pakstas(ir,uiil.ac.uk,
** Inst. Control Sci., Profsoyuznaya st. 65, Moscow, Russia,
[email protected]
***ECE, University of Central Florida, Orlando, FL 32816-2450,
[email protected]
Abstract
The aim is to build a non-contradictory, complete set of definitions
and descriptions related, on one side, to the subject (computer
systems) and, on the other, to the required feature (tolerance to
hardware faults).
This paper discusses principles of redundancy
classification for the design of fault tolerant
computer systems. The basic functions of a
classification (definitive, characteristic and
predictive) are presented. It is shown that the proposed
classification of redundancy possesses substantial
predictive power. The proposed classification is suited to
analysing the roles of hardware and software in
achieving fault tolerance of the system.
1.1. Approaches
The formation of a framework of definitions
and their interrelations in any subject domain of
human knowledge is in general based on the application
of philosophical categories and on selecting which
features of these categories are the most important
for the object of research. The growth of interest in
this area is reflected, for example, by the
simultaneous publication of two monographs with
identical titles by two different authors [1,2].
The interrelations between categories are
specified in these monographs either with the
dialectic approach or with the relatively recently developed
structural approach. In the theory of computer systems,
as in fundamental philosophy, the formation of
concepts and the analysis of their
interrelations are closely connected to the nature of
the objects and their specific features.
Computer systems (CS) as a phenomenon, and fault
tolerance of a computer system as a new required
feature, should also be described in terms of
redundancy and the application of redundancy types.
Doing this makes it possible to develop a framework
of terms and concepts for analysing existing CS as well
as for designing and developing new fault tolerant
computer systems (FTCS) of different kinds.
An attempt to build a classification of
redundancy that eases the analysis of CS and their
features during the design process is the core of this
paper. To confirm the workability of the
proposed classification, an analysis of triplicated
RAM is presented, and it is shown how this structure
should be modernized.
1.2. Structure of the Paper
The paper is organized as follows. Part 2 comments on the
problem of redundancy classification and discusses
the required functions of classifications. Part 3 presents
the development of a new classification and the way it is
applied. Part 4 states directions for further
research on the fundamentals of fault
tolerant computer design.
Keywords: philosophy, redundancy classification,
fault tolerance, hardware, software.
Only method is able to control the thought, lead
it to and keep it within the subject.
Hegel, (Foreword to Encyclopaedia of Philosophical
Sciences) Berlin, 25.05.1827
1. Introduction
Fault tolerant computers differ from ordinary
ones by carrying out special actions to tolerate
hardware faults after they appear. These
actions comprise several steps of a special Algorithm
of Fault Tolerance (AFT) that eliminates the influence
of a fault on the computing process. The classical
sequence for achieving fault tolerance of a computer
consists of the phases of detecting and locating the
faulty unit and reconfiguring the structure of the
system. In fact, much more has to be done to
provide correct functioning of a computer when a
fault occurs.
Fault tolerance as a new feature of the system is
rather synthetic and consists of completing all
phases of this algorithm. Clearly, the
algorithm of fault tolerance requires extra means
and resources. All these “means and resources” are
based on different types of redundancy. A
question then arises: which types of redundancy
can be used, and how can they be applied in the
different phases of the AFT?
The development process of the “fault
tolerization” of a computer system requires a full
and correct set of definitions and their
characteristics. A description of the relations between
the definitions is also required. The classification of
redundancy proposed and discussed here is an attempt to
build such a set.
0-7803-7087-2/01/$10.00 ©2001 IEEE
beginning of the theory. The presence of a PF and
the power of its predictions can serve as a formal measure
of the value of a classification as a whole. One good example
of a classification with huge predictive power is
Mendeleev's Periodic Table of the chemical
elements.
Another example of the successful application of
classification can be found in [8], where the levels of
information interaction in homogeneous structures
were detailed, forming a base for further
research.
A classification of the kinds of interaction at the level of
operators of a programming language for parallel
computing allowed the author of [9] to
construct a theoretical base for this area of research.
The problems of classifying the hierarchy in
parallel computations, and of implementing it
in hardware and software, are described in
[10]. This work also predicted the appearance of
special arithmetic, vector and coherent processors,
and defined the role of operating systems in
maintaining and managing parallelism.
Thus, after looking at these examples, one may
ask: what lies at the core of the success of one
classification and the failure of another? The success
of the PF of a new classification depends on the
simultaneous observation and satisfaction of several
conditions.
The first condition is accuracy and precision in
selecting the aim that is pursued in the analysis
of the new subject domain and the key required feature:
some kind of aim integrity. The second condition,
no less important, concerns the internal structure of
the classification and a rigorous approach to constructing
and introducing terms and concepts within
it.
Classifications [3-7] suffer from a variety of
concepts and aims, which made it impossible to limit and
form a basis and to select an essential structure of terms
within the classification. Too many objects,
qualities and features were discussed in [3-7].
To summarize the requirements for a classification,
note:
1. Any classification, as the beginning of a
theory, should be considered from the point of view
of its performance of three interconnected functions:
definitive, characteristic and predictive.
2. A classification must be constructed under a strict
principle of aim integrity, i.e. with the selection of a
single feature to achieve.
3. Each phase of the introduction, presentation and
detailed development of the classification
must be rigorously analysed. Otherwise, the success
of the classification is doubtful.
4. Only really essential features and details of the
analysed objects should be included in the
classification. For example, energy terms should
be used in the classification of power generation
systems, where the main function of these systems is
2. Classification and Its Functions
Computer systems are used nowadays in
various important areas such as banking, the
military, aviation, intensive health care, industrial
control, space exploration etc. All of these areas
demand the highest possible (and sometimes
impossible) reliability of functioning. Among reliability
measures, availability is of special interest. The
methods for achieving reliability are based either on
increasing the reliability of hardware
components or on fault tolerance of the computer and its
basic parts.
An additional requirement for industrial
computers is usually the reduction or elimination of
service or maintenance,
because in most applications servicing the computer
during the mission is impossible. To cope with the
problem of reliability of CS for critical
applications, the ability of the system to "repair itself"
in real time during operation is needed. In other words,
the CS must be fault tolerant. This new feature, fault
tolerance of functioning, demands additional overheads in
software and hardware. This area of research
became visible in the late 50's [MI. Several attempts
to define the ways and means that
help implement hardware fault tolerance
were presented in [3-7]. But even after these
prominent efforts a general question has not been
answered: which features of classifications are the
most important, and how should a correct
classification be formed? An attempt to answer this
question is the subject of this
paper.
Which functions of a classification are the most
important, and how should a classification itself be evaluated?
This question can be answered by
analysing and selecting the criteria that a
classification must match.
Classifications [3-7] relate to the area of
information processing. All of them have a
definitive function, whereby terms and concepts are
named. Call this the definitive function
(DF). The DF answers the question "what is it?".
The second function describes the interrelations
between definitions, i.e. characterizes them; call this
the characteristic function (CF). The CF
answers the question "how are these definitions
connected?".
In [7] the terms energy, information and matter
were used to describe the redundancy types. In [3]
redundancies of hardware, program and time were
considered. In [6], with the introduction of
dependability, the features of objects were analyzed
together with the forms of influence on the objects:
hardware faults, software errors and user mistakes.
The most successful and best known classifications
have a third function as well: a predictive one.
The predictive function (PF) answers the questions
"what if" and "what next" and can be considered as
just production of energy. And, no doubt,
information processing systems should be discussed
in terms related to information.
Again, a classification of distributed systems
should present in some way the categories of
dimensions in systems of this kind.
descriptions and indicate the three main variables for
achieving fault tolerance. On the other hand, a
computer system has its own duality and consists of
software and hardware (third level of Fig.1).
Therefore, there are two carriers of the above
mentioned redundancy types. Thus, redundancy has
six different variants: three based on hardware and
three based on software.
The more certain and determined the redundancy
types are, the more characteristics can be
expressed in terms of the classification and the more
predictions can be generated.
The introduced definitions of redundancy types are
clarified with examples from the daily practice of fault
tolerant computer design:
Hardware based redundancy types:
• H(2M) - structural (material) redundancy of
hardware, such as a duplicated computer system;
• H(M1,M2) - a duplicated fault tolerant
computer system with different (non-identical)
units;
• H(I1) - a redundant bit of information, to check
parity errors of data;
• H(nT) - special hardware to repeat or delay
computing to avoid the influence of a malfunction;
• H(dT) - special hardware to delay execution
(as in a timing diagram) to avoid a malfunction.
Software based redundancy types:
• S(2T) - double repetition of the same program
on the same hardware, used to check the results;
• S(1) - informational redundancy of the
program, for example back-up files;
• S(M1,M2) - two different copies of the
program realising the same function;
• S(dT) - time delays realised in software to
wait for a guaranteed result.
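The six variants described above (two carriers times three first-level types) can be sketched as a small data structure. This is only an illustrative sketch: the names `Carrier`, `Kind`, `Redundancy`, `all_variants` and `EXAMPLES`, and the simplified labels such as `H(M)`, are assumptions made here and are not notation taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Carrier(Enum):
    """Carrier of redundancy: the two parts of a computer system."""
    HARDWARE = "H"
    SOFTWARE = "S"


class Kind(Enum):
    """First-level redundancy types (letters are illustrative labels)."""
    STRUCTURAL = "M"
    INFORMATIONAL = "I"
    TIME = "T"


@dataclass(frozen=True)
class Redundancy:
    carrier: Carrier
    kind: Kind

    def label(self) -> str:
        # Simplified label, e.g. "H(M)"; multiplicities such as 2M are omitted.
        return f"{self.carrier.value}({self.kind.value})"


def all_variants() -> list:
    """Enumerate every carrier/type combination (2 x 3 = 6 variants)."""
    return [Redundancy(c, k) for c, k in product(Carrier, Kind)]


# Hypothetical mapping of some examples from the list above to their class.
EXAMPLES = {
    "duplicated computer system": Redundancy(Carrier.HARDWARE, Kind.STRUCTURAL),
    "parity bit": Redundancy(Carrier.HARDWARE, Kind.INFORMATIONAL),
    "program run twice on the same hardware": Redundancy(Carrier.SOFTWARE, Kind.TIME),
    "back-up files": Redundancy(Carrier.SOFTWARE, Kind.INFORMATIONAL),
}

if __name__ == "__main__":
    for variant in all_variants():
        print(variant.label())
```

Keeping the carrier and the type as separate fields mirrors the two lower levels of the classification, so a design can be queried by either dimension.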
3. Formation of the Redundancy Classification
Consider the process of constructing a classification
for computer systems when tolerance
to hardware faults is required. Tolerance to faults is
the realization of a transition of the
system from a correct (fault free) state to another
workable state in which an existing fault does not
influence the functioning of the computer. A fault tolerant
computer differs from a usual one in the way it is
repaired: a fault tolerant CS, by definition, must repair
and recover itself after faults arise, whereas an
ordinary CS requires a special interruption for
service and repair. A well-known sequence of steps
for implementing fault tolerance includes the phases of
a) fault detection, b) fault location and c) exclusion
of the faulty unit from the system so that it can continue
functioning.
All phases in this sequence should be completed
in proper time and in such a manner that the system
notices neither the fault nor the process of its
elimination. There is another type of computer
system that tolerates hardware faults with
degradation in performance and/or functioning;
systems of this kind are called graceful degradation
systems.
These systems will not be discussed here
because, strictly speaking, they cannot be classified
as fault tolerant. In reality, many more actions
(steps of the algorithm) are needed to eliminate
malfunctions and permanent faults. These actions
depend on the various functions of the basic
components of the computer system, their roles and
their “power” to provide tolerance to hardware faults.
We construct the classification of redundancy here
gradually, detailing the steps and concepts
involved. Computer systems, like other
systems, can be analysed using first order categories
such as matter and time. For a CS these are structure and
time. Beyond this, the basic function of the system is
information processing, so information must also be
included in the first level description.
The same classification can be used in the
research areas of dependability and software reliability,
but in these cases, in accordance with the
declared principle of aim integrity, one should not
expect the classification to have a strong predictive
function.
Consider now how the classification of redundancy
is applied to provide fault
tolerance of the computer. A CS, as mentioned,
is a pair of software and hardware. A hardware fault
can damage both the hardware and the state of the software.
Therefore, further correct computing after the
appearance of even a malfunction, without recovery
of the hardware and software states, is
doubtful. The situation becomes even more
complex because the hardware of a computer consists
of two different subsystems, which are involved in
information processing in different ways; call these
subsystems passive and active. The active subsystem
produces information in each clock cycle (tact) of the processor, and
a hardware fault in it might be detected and
[Figure: REDUNDANCY OF FTC at the top, linked to STRUCTURAL, INFORMATIONAL and TIME, with SOFTWARE and HARDWARE as carriers]
Fig. 1. Classification of FTC redundancy.
The top level of Fig.1 describes the core of
realizing fault tolerance, which is redundancy;
straight lines connect redundancy with the first level
located almost in the same tact of computing.
The active subsystem includes processors, I/O
controllers, etc.
Within the passive subsystem information is
recorded and stored. In general, a
hardware fault occurring within a passive subsystem can
manifest itself with a delay, sometimes counted in
hours. A passive subsystem consists of memory
(RAM), Flash RAM, file structures on discs etc.
The delay between the appearance and the manifestation
of a hardware fault is known as the latent period of
the fault. This latency means that software will be
damaged by hardware faults. Since the 60's [12]
and up to now there has been no practical evidence or
theoretical result proving that it is possible to
design a fault tolerant computer such that the software state
remains authentic after a hardware fault. On the
contrary, practical estimates [13] and the comments
given in [14] show that the program or data might
be damaged long before the manifestation of a
hardware fault. Latency of faults, uncertainty about
the faulty period and the influence on software cause the
problem of effectively forming, correctly searching for
and recovering the state of the program, both for distributed [15]
and for standalone computer systems [16].
These arguments are sufficient to introduce an
extension of the AFT in which the actions
directed at hardware (A-C) are complemented by similar
actions related to software (D-F). This generalised algorithm
is presented in Fig.2A.
transparent for testing procedures. There is no
doubt that the border between these definitions of
permanent fault and malfunction is rather obscure.
For example, when a malfunction happens
exactly at the end of the program execution, then in terms
of that particular program it is considered a
permanent fault. Again, if a malfunction
persists in the system longer than the time needed to execute the
particular program, it is considered a permanent
fault. In fault tolerant systems it then
becomes possible to locate the fault and reconfigure the
system to an acceptable configuration, continue
the calculations, and later check precisely
the nature and type of the fault [17]. Determining
the type of fault becomes more complex in
multiprocessor and pipeline systems, which
make intensive use of asynchronous hardware blocks.
Current practice shows that a hardware malfunction
is easier to eliminate than a permanent
fault; very often repeating the program
or a segment of it saves the situation. Therefore,
placing steps that eliminate malfunctions into the
algorithm of fault tolerance can increase the reliability
of the system as a whole. This is the more true, the higher
the ratio of malfunctions to permanent faults in the system.
In availability terms, the faster the influence of a malfunction
is eliminated from the system, the better.
The second aspect relates more to the technology
and environment of the hardware operation. The
first one requires implanting into the algorithm
some actions that check the type of fault as close as
possible to its beginning. These actions are printed in bold
in the algorithm body. The generalised algorithm of
fault tolerance (GAFT) is presented in Fig.2B.
A. Prove that a fault does not exist;
   otherwise
B. Locate the faulty component;
C. Reconfigure the hardware;
D. Prove that the software is not affected;
   otherwise
E. Locate faulty states of the program and define
   the correct state from which to continue;
F. Recover the system from a previously stored
   correct state of the software and
G. Continue the operation.
Fig.2A First generalization of AFT

A. Prove that a fault does not exist;
   otherwise
B. Determine the type of fault;
C. If the fault is permanent then
D.    Locate the faulty hardware component;
E.    Reconfigure the hardware to exclude the faulty unit;
   otherwise
F. Prove that the fault does not affect the
   software;
G. Locate faulty states of the program and define
   the correct state of the software;
H. Recover the system from a previously stored
   correct state of the software and continue.
This algorithm better presents the sequence of
steps and the nature of the actions needed to achieve
tolerance to hardware faults, because it reflects the
duality of the object structure (software, hardware).
Nevertheless, the physics of the fault itself is still
not represented in the structure of the algorithm.
Consider then the nature of a hardware fault as far
as it is reflected in the behaviour of the system.
Hardware faults are classed as permanent
(solid) or temporary (usually called
malfunctions). A permanent fault manifests itself again
when the execution of the program is repeated.
A malfunction is recovered by repeating
the program or a part of it. Malfunctions are
Fig.2B Generalization of the AFT
GAFT consists of two nearly identical parts: one
for hardware and another for software. It is initiated
by external events. The following should be
taken into account as external events: hardware checking signals,
periodic runs of testing procedures, or software
initiated signals as in the acceptance test approach.
It was shown in previous papers [16-17] that
the influence of a malfunction on a program can be
eliminated by recovering the code and variables of
the program, together with the operating system
records about this particular program. For
algorithm 2B this means that once the type of fault
has been determined to be a malfunction, it is
possible to jump to steps G and H. In terms of
classic reliability theory, stationary availability is
defined as working time divided by the sum of
working time and repair time. For permanent
faults the repair time is the time needed to execute all
steps A-H of algorithm 2B. For a
malfunction it is just steps A, B, F, G and H. Both branches of
the algorithm should be executed as fast as possible.
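The dependence of repair time, and hence of stationary availability, on the type of fault can be illustrated with a minimal sketch. The step paths follow the text (A-H for a permanent fault; A, B, F, G, H for a malfunction); the function names and the per-step times are hypothetical values chosen only for illustration.

```python
# Steps of the generalised algorithm executed for each fault type, as
# stated in the text above.
PERMANENT_PATH = ("A", "B", "C", "D", "E", "F", "G", "H")
MALFUNCTION_PATH = ("A", "B", "F", "G", "H")


def gaft_path(fault_type: str) -> tuple:
    """Return the sequence of steps executed for the given fault type."""
    if fault_type == "permanent":
        return PERMANENT_PATH
    if fault_type == "malfunction":
        return MALFUNCTION_PATH
    raise ValueError(f"unknown fault type: {fault_type!r}")


def repair_time(step_times: dict, fault_type: str) -> float:
    """Sum the execution times of the steps on the relevant path."""
    return sum(step_times[step] for step in gaft_path(fault_type))


def stationary_availability(working_time: float, repair: float) -> float:
    """Working time divided by the sum of working time and repair time."""
    return working_time / (working_time + repair)


if __name__ == "__main__":
    # Hypothetical timing: every step takes one time unit.
    step_times = {step: 1.0 for step in PERMANENT_PATH}
    for fault_type in ("permanent", "malfunction"):
        t_repair = repair_time(step_times, fault_type)
        print(fault_type, t_repair, stationary_availability(100.0, t_repair))
```

Under these assumed timings the malfunction path is shorter, which is why determining the type of fault early improves availability when malfunctions dominate.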
In this respect the result that proves the
possibility of concurrent execution of the
steps related to hardware and software
[17] is of special interest. This concurrency
makes it possible to determine the
theoretical minimum of time redundancy required to
tolerate both types of hardware fault.
Consider the specifics of the algorithm steps
related to software recovery. Because of the
latency of hardware faults, several consecutive
recovery points (RPs) may store erroneous data.
Then even several iterative recovery steps may
not be enough to reach a correct state of
hardware and software from which to continue execution.
The problem of determining the correct RP
is of special importance and might be the subject of
further research. Methods of searching for the correct RP
are described in [16].
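Why fault latency complicates recovery can be illustrated with a simple backward scan over stored recovery points. This sketch is not the method of [16]; the linear scan, the function `find_correct_rp` and the `is_correct` check are assumptions introduced here for illustration only.

```python
from typing import Callable, Optional, Sequence, TypeVar

State = TypeVar("State")


def find_correct_rp(
    recovery_points: Sequence[State],
    is_correct: Callable[[State], bool],
) -> Optional[int]:
    """Scan recovery points from newest to oldest.

    `recovery_points` is ordered from oldest to newest. Because of fault
    latency, the most recent points may already hold erroneous data, so the
    scan walks backwards until a point passes the (assumed) check.
    Returns the index of that point, or None if no stored point is correct.
    """
    for index in range(len(recovery_points) - 1, -1, -1):
        if is_correct(recovery_points[index]):
            return index
    return None


if __name__ == "__main__":
    # Hypothetical history: the fault occurred after the second point was
    # stored but manifested only after the fourth one.
    history = [
        {"time": 0, "corrupted": False},
        {"time": 10, "corrupted": False},
        {"time": 20, "corrupted": True},
        {"time": 30, "corrupted": True},
    ]
    print(find_correct_rp(history, lambda rp: not rp["corrupted"]))
```

The longer the latent period, the more recovery points have to be skipped, which is one way to see why a single rollback step may not be sufficient.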
GAFT in our taxonomy represents the new feature
(fault tolerance) that is sought for the system.
The structure of the algorithm also reflects the physics
of the matter (hardware faults) against which GAFT and
the system are designed. On the other hand,
the redundancy classification itself is closely
connected with the object being analysed (the computer
system) and describes in general terms the basic types
of redundancy applicable to achieving fault
tolerance. We combine the redundancy classification
with our algorithm (the sequence of steps for eliminating
the influence of a fault). The resulting taxonomy is
presented in Table 1.
Thus, as shown in Table 1, a computer system
is considered fault tolerant if and only if its
algorithm of fault tolerance is realized in full, i.e.
from the step A to the step 1.
Various fault tolerant systems may differ in the
time needed to implement the steps of the algorithm, in the types
of redundancy used at the various steps, and
in the types of fault that have to be tolerated.
The taxonomy presented in Table 1 also makes it possible
to analyse the fault tolerant features of computer
architectures and to evaluate how effective the various
types of redundancy are in achieving the required
feature.
Table 1. Taxonomy of redundancy types and
relevant algorithms of fault tolerance.
Note also that the combined cost of the applied redundancy types
and the architecture of the CS could be
considered an important argument for or against a
selected architecture and engineering solution.
3.1. Dependability Revisited
The taxonomy developed here concerns only fault
tolerance of computer systems and supports
statements such as: the hardware of
the system is correct, or the system is tolerant to
hardware faults of certain types. Denote the set of states
of hardware by Sh. Then fault tolerance is achieved
when the predicate of hardware correctness P on
the set of states Sh is true: P(Sh)=true. By analogy,
a similar predicate of the absence of software errors
on the set of states of the program is defined as
P(Ss)=true. The predicate of the absence of operator
(user) errors is denoted P(Su). Thus, using this
notation we may define the term dependability as
P(Sh)&P(Ss)&P(Su)=true.
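The dependability condition can be written as a conjunction of three checks. In the sketch below, each predicate is assumed to hold on a set of states when it holds for every state in that set; this reading, and the names `holds` and `dependable`, are assumptions made here rather than definitions from the paper.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def holds(predicate: Callable[[T], bool], states: Iterable[T]) -> bool:
    """Assumed reading of P(S): the predicate is true for every state in S."""
    return all(predicate(state) for state in states)


def dependable(
    hw_states: Iterable[T],
    sw_states: Iterable[T],
    user_states: Iterable[T],
    hw_correct: Callable[[T], bool],
    sw_error_free: Callable[[T], bool],
    user_error_free: Callable[[T], bool],
) -> bool:
    """P(Sh) & P(Ss) & P(Su) = true."""
    return (
        holds(hw_correct, hw_states)
        and holds(sw_error_free, sw_states)
        and holds(user_error_free, user_states)
    )


if __name__ == "__main__":
    # Hypothetical states encoded as strings; "ok" marks a correct state.
    def is_ok(state: str) -> bool:
        return state == "ok"

    print(dependable(["ok", "ok"], ["ok"], ["ok"], is_ok, is_ok, is_ok))
    print(dependable(["ok", "fault"], ["ok"], ["ok"], is_ok, is_ok, is_ok))
```

The first conjunct alone corresponds to the hardware fault tolerance that the taxonomy of this paper addresses.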
4. Some General Problems of Redundancy:
Instead of Conclusion
Leaving aside the details of the hardware and
software features required by the algorithm of fault
tolerance, further research on the application of the
redundancy classification should be conducted in terms of
the complexity of the system.
The complexity of the system is expressed in the volume of
(internal) information and of the hardware that processes this
information. This complexity determines the length
of the algorithm of fault tolerance (each phase becomes
larger, longer etc.).
Dynamic complexity might be described, for
example, as mean rate of information dI processed
by the system in time dI(t).
Input data combine with internal data and
hardware and change the state of the system. From
the point of view of embedding a new quality, fault
tolerance, there are two different ways: an extension
of the input information, or an extension of the internal
structures of software and hardware, as we did in this paper.
The steps of the algorithm divide into checking steps and
recovery steps. Redundancy introduced into the
system may be spent on checking and on recovery.
When the level of redundancy is limited (and this is
always the case), the problems of optimally
splitting, utilising and monitoring the redundancy arise. This is
important when the type of fault is defined and proven
by practice.
Another approach to building a fault tolerant system
might be as follows: some amount of
redundancy is given. Which fault types is it
possible to tolerate in the system with this
level of redundancy, and how should the
redundancy be split effectively between the phases of
checking and recovery?
Additionally, beyond the static analysis of the
volume of redundancy needed for fault tolerance, there is
also a dynamic aspect. When checking is
applied periodically, the level of redundancy that must be
supported for both checking and correction (recovery) of the
system changes with the checking period. The approach of
running an acceptance test [15]
between procedures requires much more hardware and
software effort to recover the state of the system and
tolerate a fault than the approaches of [16,17].
Some other theoretical problems in the area of
fault tolerance that follow from this paper are:
• to prove that some kinds of redundancy are
inapplicable to the realization of some steps of the
algorithm of fault tolerance - such a negative
result would reduce the number of variants
and structures and, therefore, of redundancy options,
and could be extremely useful;
• to analyse the restrictions of different kinds of
redundancy with respect to the criteria of performance, reliability,
availability and cost - evaluating these restrictions
would narrow the selection of architectures,
redundancy types and their
combinations, say, by cost criteria, which simplifies
design;
• to evaluate the possibility of applying the
offered taxonomy as a logical core of concurrent
design for fault tolerant systems;
• to develop and construct reliability models of
each kind of redundancy - this would make it possible
to estimate automatically, during the design process,
the possible (expected, achievable)
reliability and thus to ensure the optimum choice.
These directions, together with further
development of the approach proposed in this paper,
seem to be a definite and productive area of further
research.
7. Parkhomenko P.P. About redundancy classification, 1988, Preprint of ICSAN, pp. 1-7.
8. Evreinov E.B. Homogeneous universal computer systems with high performance. “Science”, 1966.
9. Mirenkov N.N. Parallel algorithms and correctness of programs, Programmirovanie, 1985, 6, pp. 3-16.
10. Golovkin B.A. Structures of high performance computer systems and their relations with the structures of programs and algorithms, Technical Cybernetics 5, 1985, pp. 194-229.
11. Cristian F. Rigorous approach to fault tolerant systems development. IBM Report RJ 3784, January 1983.
12. Katzan H. Computer organisation and the system 370. Van Nostrand Reinhold Co. 1971.
13. Chillarege R., Iyer R.K. Measurement based analysis of error latency. IEEE Trans. on Comp., Vol. C-36, No. 5, 1987, pp. 529-537.
14. Gifford D., Spector A. Case study: IBM system 360-370, Comm. ACM, 1987, Vol. 30, No. 4.
15. Rendall B. System structure for software fault tolerance. Trans. on Soft. Eng., Vol. SE-1, 2, 191-209.
16. Schagaev I. Algorithm of Computation Recovery. Automatic and Remote Control, 7, 1986.
17. Schagaev I. Determination of type of hardware faults by software means. Ibid., 3, 1990.
18. Pierce Y. Fault tolerant computer systems, 1965. Addison Wesley.
19. Schagaev I. Yet another approach to classification of redundancy. IMEKO Symposium CIM, 1990, Helsinki, Finland, pp. 117-124.
References
1. Tulenov M.T. Dialectic categories and their interrelations. “High School”, 1986, Moscow.
2. Samburov E.A. Dialectic categories and their interrelations. “Science”, 1987, Moscow.
3. Avizienis A. Architectures of fault tolerant computing systems. FTCS 1975, pp. 3-16.
4. Avizienis A., Laprie J.K. Dependable computing: from concepts to design diversity. Proc. IEEE, Vol. 74, No. 5, May 1986.
5. Avizienis A. N-version approach to fault tolerance software. IEEE Trans. on Soft. Eng., Vol. SE-11, No. 12, pp. 1491-1501, Dec. 1985.
6. Laprie J.K. Dependability concepts and terminology, ESPRIT BRA, Project 3092,