
Redundancy classification for fault tolerant computer design

2001, 2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat.No.01CH37236)

The paper discusses principles of redundancy classification for the design of fault tolerant computer systems. The basic functions of a classification (definitive, characteristic and predictive) are presented. It is shown that the proposed classification of redundancy possesses substantial predictive power. The proposed classification is suited to analysing the roles of hardware and software in achieving fault tolerance of the system.

ALGIRDAS PAKSTAS*, IGOR SCHAGAEV**, JANUSZ ZALEWSKI***

*SIMT, University of North London, Holloway Rd, 166-220, N7 8DB, a.pakstas(ir,uiil.ac.uk
**Inst. Control Sci., Profsoyuznaya st. 65, Moscow, Russia, [email protected]
***ECE, University of Central Florida, Orlando, FL 32816-2450, [email protected]

Keywords: philosophy, redundancy classification, fault tolerance, hardware, software.

Only method is able to control the thought, lead it to and keep it within the subject.
Hegel (Foreword to Encyclopaedia of Philosophical Sciences), Berlin, 25.05.1827

1. Introduction

Fault tolerant computers differ from ordinary ones by performing special actions to tolerate hardware faults after they appear. These actions form the steps of a special Algorithm of Fault Tolerance (AFT), which eliminates the influence of a fault on the computing process. The classical sequence for achieving fault tolerance of a computer consists of the phases of detection and location of the faulty unit and reconfiguration of the structure of the system. In practice, much more has to be done to keep the computer functioning correctly when a fault occurs.

Fault tolerance as a new feature of the system is rather synthetic and consists in completing all phases of the mentioned algorithm. Clearly, the algorithm of fault tolerance requires extra means and resources. All these "means and resources" are based on different types of redundancy. A question then arises: which types of redundancy can be used, and how can they be applied in the different phases of the AFT?

The process of "fault toleralization" of a computer system requires a full and correct set of definitions and their characteristics. A description of the relations between the definitions is also required. The classification of redundancy proposed and discussed here is an attempt to build a non-contradictory, complete set of definitions and descriptions related, on one side, to the subject (computer systems) and, on the other, to the required feature (tolerance to hardware faults).

1.1. Approaches

The formation of a framework of definitions and their interrelations in any subject domain of human knowledge is in general based on applying philosophical categories and selecting which features of these categories are the most important for the object of research. The growth of interest in this area is reflected, for example, by the simultaneous publication of two monographs with identical titles by two different authors [1,2]. The interrelations between categories are specified in these monographs using either the dialectic approach or the relatively recently developed structural approach. In the theory of computer systems, as in fundamental philosophy, the formation of concepts and the analysis of their interrelations are closely connected to the nature of the objects and their specific features.

Computer systems (CS) as a phenomenon, and fault tolerance of a computer system as a new required feature, should also be described in terms of redundancy and the application of redundancy types. Doing this makes it possible to develop a framework of terms and concepts for analysing existing CS as well as for designing and developing new kinds of fault tolerant computer systems (FTCS). An attempt to build a classification of redundancy that eases the analysis of CS and their features during the design process is the core of this paper. As a confirmation of the workability of the proposed classification, an analysis of triplicated RAM is presented, and it is shown how this structure should be modernized.

1.2. Structure of the Paper

The paper is organized as follows. Part 2 comments on the problem of redundancy classification and discusses the required functions of classifications. Part 3 presents the development of a new classification and the way it is applied. Part 4 states directions of further research in the fundamentals of fault tolerant computer design.

0-7803-7087-2/01/$10.00 ©2001 IEEE

beginning of the theory. The presence of a PF and its power of prediction can serve as a formal measure of the value of the classification as a whole. One good example of a classification with huge predictive power is Mendeleev's Periodic Table of the chemical elements. Another example of the successful application of classification can be found in [8], where levels of information interactions in homogeneous structures were detailed, forming a base for further research. A classification of the kinds of interactions at the level of operators of a programming language for parallel computing allowed the author of [9] to construct a theoretical base for this area of research. The problems of classifying the hierarchy in parallel calculations, as well as implementing them in hardware and software, are described in [10]. This work also predicted the appearance of special arithmetic, vector and coherent processors, and defined the role of operating systems in maintaining and managing parallelism.

Thus, after looking at these examples, one may ask: what is at the core of the success of one classification and the failure of another? The success of the PF of a new classification depends on the simultaneous observation and satisfaction of several conditions.
The first condition is accuracy and precision in selecting the aim that is pursued in analysing a new subject domain and its key required feature: a kind of aim integrity. The second condition, no less important, concerns the internal structure of the classification and a rigorous approach to constructing and introducing terms and concepts within it. Classifications [3-7] suffer from a variety of concepts and aims, which made it impossible to limit and form a basis and to select an essential structure of terms within the classification. Too many objects, qualities and features were discussed in [3-7].

To summarize the requirements for a classification, note:

1. Any classification, as the beginning of a theory, should be considered from the point of view of performing three interconnected functions: definitive, characteristic and predictive.
2. A classification must be constructed under a strict principle of aim integrity, i.e. with the selection of a single feature to achieve.
3. Each phase of the introduction, presentation and detailed analysis of the classification development must be rigorously analysed. Otherwise, the success of the classification seems problematic.
4. Only the really essential features and details of the analysed objects have to be included in the classification. For example, energetic terms should be used in the classification of power generation systems, where the main function of these systems is

2. Classification and Its Functions

Computer systems are used nowadays in different and important areas such as banking, military, aviation, intensive health care, industrial control, space exploration etc. All of these areas demand the highest possible (and sometimes impossible) reliability of functioning. In terms of reliability, availability is of special interest. The methods of achieving reliability are based either on efforts to increase the reliability of hardware components or on fault tolerance of the computer and its basic parts.
An additional requirement for industrial computers is usually the reduction or elimination of service and maintenance, because in most applications servicing the computer during the mission is impossible. To cope with the problem of reliability of CS for critical applications, the ability to "repair itself" in real time during operation is needed. In other words, the CS must be fault tolerant. This new feature, fault tolerance of functioning, demands additional overheads in software and hardware. This area of research became visible in the late 50's [MI]. Several attempts to define ways and means that help implement hardware fault tolerance were presented in [3-7]. But even after these prominent efforts a general question has not been answered: which features of classifications are the most important, and how should a correct classification be formed? An attempt to answer this question can be considered the subject of this paper.

Which functions of a classification are most important, and how should a classification itself be evaluated? This question can be answered by analysing and selecting the criteria that a classification must match. Classifications [3-7] relate to the area of information processing. All of them have a definitive function, by which terms and concepts are named. Call this the definitive function (DF). The DF answers the question "what is it?". The second function describes the interrelations between definitions, i.e. characterizes them; call it the characteristic function (CF). The CF answers the question "how are these definitions connected?". In [7], the terms energy, information and matter were used to describe the redundancy types. In [3], redundancies of hardware, program and time were considered. In [6], with the introduction of dependability, the features of objects were analysed together with the forms of influence on the objects: hardware faults, software errors and user mistakes.
The most successful and best known classifications have a third function as well: a predictive one. The predictive function (PF) answers the questions "what if" and "what next" and can be considered as

just production of energy. And, without doubt, information processing systems should be discussed in terms related to information. Likewise, a classification of distributed systems should present in some way the categories of dimensions in this kind of system.

descriptions and indicate the three main variables for achieving fault tolerance. On the other hand, a computer system has its own duality and consists of software and hardware (third level of Fig. 1). Therefore, there are two carriers of the above-mentioned redundancy types. Thus, redundancy has six different variants: three based on hardware and three based on software. The more certain and determined the redundancy types are, the more characteristics can be expressed in terms of the classification and the more predictions can be generated. The introduced definitions of redundancy types are clarified with examples from the daily practice of fault tolerant computer design.

Hardware based redundancy types:

- H(2M): structural (material) redundancy of hardware, such as a duplicated computer system;
- H(M1,M2): duplicated fault tolerant computer system with different (non-identical) units;
- H(I1): redundant bit of information, to check parity errors of data;
- H(nT): special hardware to repeat or delay computing to avoid the influence of a malfunction;
- H(dT): special hardware to delay execution (as in a timing diagram) to avoid a malfunction.
Software based redundancy types:

- S(2T): double repetition of the same program on the same hardware, used to check the results;
- S(I): informational redundancy of the program, for example back-up files;
- S(M1,M2): two different copies of the program realising the same function;
- S(dT): time delays realised in software to wait for a guaranteed result.

3. Formation of the Redundancy Classification

Consider the process of constructing a classification for computer systems when tolerance to hardware faults is required. Tolerance to faults is the realization of a transition of the system from a correct (fault free) state to another workable state, in which an existing fault does not influence the functioning of the computer. A fault tolerant computer differs from a usual one in the way it is repaired: a fault tolerant CS, by definition, must repair and recover itself after faults arise, whereas an ordinary CS requires a special interruption for service and repair. A well-known sequence of steps for implementing fault tolerance includes the phases of a) fault detection, b) fault location and c) exclusion of the faulty unit from the system so that it can continue functioning. All phases in this sequence should be completed in proper time and in such a manner that the system notices neither the fault nor the process of its elimination.

There is another type of computer system, which tolerates hardware faults with degradation in performance and/or functioning; systems of this kind are called graceful degradation systems. These systems are not discussed here because, strictly speaking, they cannot be classified as fault tolerant. In reality, many more actions (steps of the algorithm) are needed to eliminate malfunctions and permanent faults. These actions depend on the various functions of the basic components of the computer system, their roles and their "power" to provide tolerance to hardware faults.
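The six redundancy variants named above (three redundancy types, each carried either by hardware or by software) can be enumerated mechanically. The following is a minimal sketch in Python, not part of the original paper: the class names `Carrier`, `Kind` and `Redundancy` are hypothetical, and the labels are simplified to one letter per type, so multipliers such as the 2 in H(2M) or the n in H(nT) are deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Carrier(Enum):
    """Carrier of redundancy: the two halves of a computer system."""
    HARDWARE = "H"
    SOFTWARE = "S"


class Kind(Enum):
    """First-level redundancy types: structural (material), informational, time."""
    STRUCTURAL = "M"
    INFORMATIONAL = "I"
    TIME = "T"


@dataclass(frozen=True)
class Redundancy:
    """One cell of the classification (hypothetical helper type)."""
    carrier: Carrier
    kind: Kind

    def label(self) -> str:
        # Simplified label: carrier letter plus type letter, e.g. "H(M)".
        return f"{self.carrier.value}({self.kind.value})"


# Every carrier combined with every redundancy type gives the six variants.
VARIANTS = [Redundancy(c, k) for c, k in product(Carrier, Kind)]
labels = [v.label() for v in VARIANTS]
```

Keeping carrier and type as separate fields mirrors the two levels of the classification, so a design can be described as a set of such cells.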
Let us construct here the classification of redundancy, gradually detailing the steps and concepts involved. Computer systems, like other systems, can be analysed using first order categories such as matter and time. For a CS these are structure and time. Beyond this, the basic function of the system is information processing, so information must also be considered a first level description.

[Figure: REDUNDANCY OF FTC at the top; STRUCTURAL, INFORMATIONAL and TIME below it; SOFTWARE and HARDWARE at the lowest level]

Fig. 1. Classification of FTC redundancy.

The top level of Fig. 1 describes the core of realizing fault tolerance: it is redundancy. Straight lines connect redundancy with the first level

The same classification can be used for the dependability and software reliability areas of research, but in these cases, according to the declared principle of aim integrity, one should not expect the classification to have a strong predictive function.

Consider now how the classification of redundancy is applied to provide fault tolerance of the computer. A CS, as mentioned, is a pair of software and hardware. A hardware fault can damage both the hardware and the state of the software. Therefore, further correct computing after the appearance of even a malfunction, without recovery of the hardware and software states, seems problematic. The situation becomes even more complex because the hardware of a computer consists of two different subsystems, which are involved in information processing in different ways; call these subsystems passive and active. The active subsystem produces information in each clock cycle (tact) of the processor, and a hardware fault in it might be detected and located almost in the same tact of computing. The active subsystem includes processors, I/O controllers, etc. Within the passive subsystem, information is recorded and stored. Generally, a hardware fault occurring within a passive subsystem can manifest itself with a delay, sometimes counted in hours.
A passive subsystem consists of memory (RAM), Flash RAM, the file structure with discs, etc. The delay between the appearance and the manifestation of a hardware fault is known as the latent period of the fault. This latency means that software will be damaged by hardware faults. Since the 60's [12] and up to now there has been no practical evidence or theoretical result proving that a fault tolerant computer can be designed so that the software state remains authentic after a hardware fault. On the contrary, the practical estimation in [13] and the comments given in [14] show that the program or data might be damaged long before the manifestation of a hardware fault. The latency of faults, the uncertainty of the faulty period and the influence on software cause the problem of effective formation, correct searching and recovery of the program for distributed [15] and for standalone computer systems [16]. The presented arguments are sufficient to introduce an extension of the AFT in which the actions directed at hardware (A-C) are complemented by similar actions related to software (D-F). This generalised algorithm is presented in Fig. 2A.

transparent for testing procedures. There is no doubt that the border between these definitions of permanent fault and malfunction is rather obscure. For example, when a malfunction happens exactly at the end of the program execution, in terms of that particular program it is considered a permanent fault. Again, if a malfunction persists in the system longer than the time needed to execute the particular program, it is considered a permanent fault. In fault tolerant systems it then becomes possible to locate the fault and reconfigure the system to an acceptable configuration, to continue the calculations and later to check precisely the nature and type of the fault [17]. The problem of determining the fault type becomes more complex in multiprocessor and pipeline systems, which intensively use asynchronous blocks of hardware.
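The working distinction drawn above, that a fault which outlasts the re-execution of the program is treated as permanent, can be sketched as a simple repetition check. This is an illustrative sketch only, not the method of [17]: the function name `classify_fault`, the `run_check` callback (assumed to return True when the program or segment produces a correct result) and the `repetitions` parameter are all hypothetical.

```python
from typing import Callable, Optional


def classify_fault(run_check: Callable[[], bool], repetitions: int = 1) -> Optional[str]:
    """Classify a fault by repeating the program (or a segment of it).

    Returns None when the first run is correct, "malfunction" when a
    repeated run succeeds, and "permanent" when every repetition fails.
    """
    if run_check():
        return None  # no fault observed on the first run
    for _ in range(repetitions):
        if run_check():
            return "malfunction"  # the fault disappeared on repetition
    return "permanent"  # the fault outlasted every repetition
```

For example, a check that fails once and then succeeds is reported as a malfunction, whereas a check that never succeeds is reported as permanent. The number of repetitions is a design choice that trades time redundancy against the risk of misclassifying a long-lasting malfunction.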
Current practice shows that a hardware malfunction is easier to eliminate than a permanent fault. Very often, repeating the program or a segment of it saves the situation. Therefore, placing some steps to eliminate malfunctions into the algorithm of fault tolerance can increase the reliability of the system as a whole. The higher the ratio of malfunctions to permanent faults in the system, the more this holds. In availability terms, the faster the influence of a malfunction is eliminated from the system, the better. The second aspect relates more to the technology and environment of the hardware operation. The first one requires introducing into the algorithm some actions that check the type of fault as close to the beginning as possible. These actions are printed in bold in the algorithm body. The generalised algorithm of fault tolerance (GAFT) is presented in Fig. 2B.

A. Prove that a fault does not exist; otherwise
B. Determine the type of fault;
C. If the fault is permanent then
D. Locate the faulty component of the hardware;
E. Reconfigure the hardware to exclude the faulty unit; otherwise
F. Prove that the fault does not affect the software;
G. Locate the faulty states of the program and define the correct state of the software;
H. Recover the system from a previously stored correct state of the software and continue.

A. Prove that a fault does not exist; otherwise
B. Locate the faulty component;
C. Reconfigure the hardware;
D. Prove that the software is not affected; otherwise
E. Locate the faulty states of the program and define the correct state from which to continue;
F. Recover the system from a previously stored correct state of the software; and
G. Continue the operation.

Fig. 2A. First generalization of the AFT

The given algorithm better presents the sequence of steps and the nature of the actions for achieving fault tolerance of hardware because it represents the duality of the object structure (software, hardware).
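The step sequence of the generalised algorithm with steps A-H can be sketched as a small control-flow function. This is a reading of the listing under stated assumptions rather than the authors' implementation: the names `FaultType` and `gaft_steps` are hypothetical, and the sketch assumes that a permanent fault passes through all steps A-H while a malfunction skips the hardware steps C-E.

```python
from enum import Enum
from typing import List, Optional


class FaultType(Enum):
    MALFUNCTION = "malfunction"
    PERMANENT = "permanent"


def gaft_steps(fault: Optional[FaultType]) -> List[str]:
    """Return the labels of the GAFT steps executed for a given fault.

    `fault` is None when step A proves that no fault exists.
    """
    steps = ["A"]                 # A: prove that a fault does not exist
    if fault is None:
        return steps
    steps.append("B")             # B: determine the type of fault
    if fault is FaultType.PERMANENT:
        steps += ["C", "D", "E"]  # C-E: locate the faulty unit, reconfigure hardware
    steps += ["F", "G", "H"]      # F-H: check software, find correct state, recover
    return steps
```

Returning the executed steps as data, instead of performing them, keeps the sketch independent of any particular hardware or operating system and makes the two branches easy to compare.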
Nevertheless, the physics of the fault itself has not yet been represented in the structure of the algorithm. Consider then the nature of a hardware fault as far as it is reflected in the behaviour of the system. Hardware faults are divided into permanent (solid) and temporary ones (usually called malfunctions). A permanent fault manifests itself again when the execution of the program is repeated. A malfunction is recovered by repeating the program or a part of it. Malfunctions are

Fig. 2B. Generalization of the AFT

The GAFT consists of two nearly identical parts: one for hardware and another for software. It is initiated by external events. The external events to be taken into account here are hardware checking signals, periodic runs of testing procedures, or software-initiated signals, as in the acceptance test approach. It was shown in previous papers [16-17] that the influence of a malfunction on a program can be eliminated by recovering the code and variables of the program, together with the operating system records about this particular program. For the algorithm of Fig. 2B this means that, once the type of fault has been determined to be a malfunction, it is possible to jump to steps G and H.

In terms of classic reliability theory, stationary availability is defined as working time divided by the sum of working time and repair time. For permanent faults, the repair time is the time needed to execute all steps A-H of the algorithm of Fig. 2B. For a malfunction it is just A, B, F, G and H. Both branches of the algorithm should be executed as fast as possible. In this respect, the result proving that the mentioned hardware and software steps can be executed concurrently [17] is of special interest. Using this concurrency, it becomes possible to determine the theoretical minimum time redundancy required to tolerate both types of hardware fault.

The specifics of the algorithm steps related to software recovery deserve comment.
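The availability definition above can be illustrated with a short calculation. This is a sketch under stated assumptions: the function names and the per-step durations are hypothetical (one time unit per step, chosen only for illustration), and the steps are assumed to run strictly one after another, so the concurrency result of [17] is not modelled.

```python
from typing import Dict, Iterable


def repair_time(step_times: Dict[str, float], steps: Iterable[str]) -> float:
    """Repair time as the sum of the durations of the executed steps."""
    return sum(step_times[s] for s in steps)


def stationary_availability(working_time: float, repair: float) -> float:
    """Working time divided by the sum of working time and repair time."""
    return working_time / (working_time + repair)


PERMANENT_PATH = "ABCDEFGH"   # all steps A-H
MALFUNCTION_PATH = "ABFGH"    # steps A, B, F, G and H only

# Hypothetical durations: one time unit per step.
step_times = {step: 1.0 for step in "ABCDEFGH"}

t_permanent = repair_time(step_times, PERMANENT_PATH)      # 8 units
t_malfunction = repair_time(step_times, MALFUNCTION_PATH)  # 5 units

a_permanent = stationary_availability(100.0, t_permanent)
a_malfunction = stationary_availability(100.0, t_malfunction)
```

With equal working time, the shorter malfunction branch gives the higher availability, which is why determining the fault type early in the algorithm pays off.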
Because of the latency of hardware faults, several consecutive recovery points (RPs) may store erroneous data. Then even several iterative steps of recovery may not be enough to reach a correct state of hardware and software from which to continue execution. The problem of determining the correct RP is of special importance and may be the subject of further research. Methods of searching for the correct RP are described in [16].

In our taxonomy, GAFT represents the new feature (fault tolerance) that is sought for the system. The structure of the algorithm also reflects the physics of the matter (the hardware fault) against which GAFT and the system have been designed. On the other hand, the redundancy classification itself is closely connected with the object under analysis (the computer system) and describes in general terms the basic types of redundancy applicable to achieving fault tolerance. We now combine the redundancy classification with our algorithm (the sequence of steps to eliminate the influence of a fault). The resulting taxonomy is presented in Table 1. As shown in Table 1, a computer system is considered fault tolerant if and only if its algorithm of fault tolerance is realized in full, i.e. from the step A to the step 1. Various fault tolerant systems may differ in the time taken to implement the steps of the algorithm, in the types of redundancy used in the various steps, and in the types of fault that have to be tolerated. The taxonomy presented in Table 1 also makes it possible to analyse the fault tolerant features of computer architectures and to evaluate how effective the various types of redundancy are in achieving the required feature.

Table 1. Taxonomy of redundancy types and relevant algorithms of fault tolerance.

Note also that the cost of the applied redundancy types, combined with the architecture of the CS, can be an important argument for or against a selected architecture and engineering solution.

3.1. Dependability Revisited

The taxonomy developed here concerns only fault tolerance of computer systems and supports the correctness of statements such as: the hardware of the system is correct, or the system is tolerant to hardware faults of certain types. Denote the set of states of the hardware by Sh. Then fault tolerance is achieved when the predicate of hardware correctness P on the set of states Sh is true: P(Sh) = true. By analogy, the predicate of absence of software errors on the set of states of the program is defined as P(Ss) = true. The predicate of absence of operator (user) errors is denoted P(Su). Using this notation, we may define the term dependability as P(Sh) & P(Ss) & P(Su) = true.

4. Some General Problems of Redundancy: Instead of a Conclusion

Setting aside the details of the hardware and software features required for the algorithm of fault tolerance, further research on applying the redundancy classification should be conducted in terms of the complexity of the system. The complexity of the system is present in the volume of (internal) information and in the hardware that processes this information. This complexity determines the length of the algorithm of fault tolerance (each phase becomes larger, longer, etc.). Dynamic complexity might be described, for example, as the mean rate of information dI processed by the system in time, dI(t). Input data combine with internal data and hardware and change the state of the system. From the point of view of embedding a new quality, fault tolerance, there are two different ways: extending the input information, or extending the internal structures of software and hardware, as we did in this paper.

Algorithm steps divide into checking steps and recovery steps. The redundancy introduced into the system may be spent on checking and on recovery. When the redundancy level is limited (and it always is), the problems of optimal splitting, utilisation and monitoring of redundancy arise. This is important when the type of fault is defined and proven by practice.
Another approach to building a fault tolerant system can be seen as follows: some amount of redundancy is given. Which fault types can be tolerated in the system with this level of redundancy, and how can the redundancy be split effectively between the phases of checking and recovery? In addition to the static analysis of the volume of redundancy for fault tolerance, there is also a dynamic aspect. When checking is applied periodically, the level of redundancy required for both checking and correction (recovery) of the system changes depending on this period. The approach of running an acceptance test [15] between procedures requires much more hardware and software effort to recover the state of the system and tolerate a fault than the approaches of [16,17].

Some other theoretical problems in the area of fault tolerance that follow from this paper are:

- to prove that some kinds of redundancy are inapplicable for realizing some steps of the algorithm of fault tolerance; such a negative result would reduce the number of variants and structures and, therefore, redundancy options, and could be extremely useful;
- to analyse the restrictions of different kinds of redundancy by criteria of performance, reliability, availability and cost; these evaluated restrictions would reduce the selection of architectures, redundancy types and their combinations, say, by cost criteria, which simplifies design;
- to evaluate the possibility of applying the offered taxonomy as a logical core of concurrent design for fault tolerant systems;
- to develop and construct reliability models of each kind of redundancy; this would permit automatic estimation, during the design process, of the possible (expected, achievable) reliability and thus ensure the optimum choice.

These directions, together with further development of the approach proposed in this paper, seem to be a definite and productive area of further research.

References

1. Tulenov M.T. Dialectic categories and their interrelations. "High School", 1986, Moscow.
2. Samburov E.A. Dialectic categories and their interrelations. "Science", 1987, Moscow.
3. Avizienis A. Architectures of fault tolerant computing systems. FTCS 1975, pp. 3-16.
4. Avizienis A., Laprie J.C. Dependable computing: from concepts to design diversity. Proc. IEEE, Vol. 74, No. 5, May 1986.
5. Avizienis A. N-version approach to fault tolerant software. IEEE Trans. on Soft. Eng., Vol. SE-11, No. 12, pp. 1491-1501, Dec. 1985.
6. Laprie J.C. Dependability concepts and terminology. ESPRIT BRA, Project 3092.
7. Parkhomenko P.P. About redundancy classification. 1988, Preprint of ICSAN, pp. 1-7.
8. Evreinov E.B. Homogeneous universal computer systems with high performance. "Science", 1966.
9. Mirenkov N.N. Parallel algorithms and correctness of programs. Programmirovanie, 1985, 6, pp. 3-16.
10. Golovkin B.A. Structures of high performance computer systems and their relations with the structures of programs and algorithms. Technical Cybernetics, 5, 1985, pp. 194-229.
11. Cristian F. Rigorous approach to fault tolerant systems development. IBM Report RJ 3784, January 1983.
12. Katzan H. Computer organisation and the system 370. Van Nostrand Reinhold Co., 1971.
13. Chillarege R., Iyer R.K. Measurement based analysis of error latency. IEEE Transactions on Computers, Vol. C-36, No. 5, 1987, pp. 529-537.
14. Gifford D., Spector A. Case study: IBM system 360-370. Comm. ACM, 1987, Vol. 30, No. 4.
15. Randell B. System structure for software fault tolerance. Trans. on Soft. Eng., Vol. SE-1, No. 2, pp. 191-209.
16. Schagaev I. Algorithm of Computation Recovery. Automatic and Remote Control, 7, 1986.
17. Schagaev I. Determination of type of hardware faults by software means. Ibid., 3, 1990.
18. Pierce Y. Fault tolerant computer systems. Addison Wesley, 1965.
19. Schagaev I. Yet another approach to classification of redundancy. IMEKO Symposium CIM, 1990, Helsinki, Finland, pp. 117-124.