Probabilistic error propagation model for mechatronic systems

Andrey Morozov

2014, Mechatronics

14 pages

Mechatronics xxx (2014) xxx–xxx

Journal homepage: www.elsevier.com/locate/mechatronics

Probabilistic error propagation model for mechatronic systems

Andrey Morozov (corresponding author), Klaus Janschek

Institute of Automation, Technische Universität Dresden, 01062 Dresden, Germany

Article info

Article history: Received 5 November 2013. Accepted 15 September 2014. Available online xxxx.

Keywords: Control flow graph; Data flow graph; Error propagation analysis; Discrete time Markov chain; Dependability; UML

Abstract

This paper addresses a probabilistic approach to error propagation analysis of a mechatronic system. These types of systems require highly abstract models for the proper mapping of the mutual interaction of heterogeneous system components such as software, hardware, and physical parts. A literature overview reveals a number of appropriate error propagation models that are based on a Markovian representation of control flow. However, these models imply that data errors always propagate through the control flow. This assumption limits their applicability for systems in which components can be triggered in arbitrary order with non-sequential data flow. A motivational example, discussed in this paper, shows that control and data flows must be considered separately for an accurate description of an error propagation process. For this reason, we introduce a new concept of error propagation analysis. The central idea is a synchronous examination of two directed graphs: a control flow graph and a data flow graph. The structures of these graphs can be derived systematically during system development. Knowledge of the operational profile and the properties of individual system components allows the definition of additional parameters of the error propagation model. A discrete time Markov chain is applied for the modeling of fault activation, error propagation, and error detection during operation of the system.
A state graph of this Markov chain can be generated automatically using the discussed dual-graph representation. A specific approach to computation of this Markov chain makes it possible to obtain the probabilities of erroneous and error-free system execution scenarios. This information plays a valuable role in the development of dependable systems. For instance, it can help to define an effective testing strategy, to perform accurate reliability estimation, and to speed up error detection and fault localization processes. This paper contains a comprehensive description of the mathematical framework of the new dual-graph error propagation model and a Markov-based method for error propagation analysis.

© 2014 Elsevier Ltd. All rights reserved.

Corresponding author: A. Morozov. Tel.: +49 351 46332202; fax: +49 351 46337039. E-mail address: [email protected]

1. Introduction

The research results presented in this article belong to a rather young scientific domain: system dependability. For this reason, different terms describe similar entities in the various papers devoted to error propagation analysis. In this article, the term "error" is used in a general context that fits the engineering domain. This paper adheres to the definition proposed by Laprie [1]. A brief overview of the dependability research domain helps to distinguish the term "error" from other similar terms. Dependability is the ability of a system to deliver a service that can be justifiably trusted. The service delivered by a system is its behavior as it is perceived by its user. Laprie describes dependability from three points of view: the attributes of dependability, the means by which dependability is attained, and the threats to dependability. We focus on the threats: A fault is a defect in the system that can be activated and cause an error. An error is an incorrect internal state of the system, or a discrepancy between the intended behavior of a system and its actual behavior.
A failure is an instance in time when the system displays behavior that is contrary to its specification. Activation of a fault leads to the occurrence of an error. The invalid internal system state, generated by an error, may lead to another error or to a failure. Failures are defined according to the system boundary. If an error propagates outside the system, a failure is said to occur.

http://dx.doi.org/10.1016/j.mechatronics.2014.09.005
0957-4158/© 2014 Elsevier Ltd. All rights reserved.
Please cite this article in press as: Morozov A, Janschek K. Probabilistic error propagation model for mechatronic systems. Mechatronics (2014), http://dx.doi.org/10.1016/j.mechatronics.2014.09.005

Analysis of fault activation, error propagation, and error (or failure) detection is defined in this article as error propagation analysis. The results of this analysis are extremely helpful in a wide range of analytical tasks associated with dependable systems development. Error propagation analysis gives sound support for reliability evaluation, because error propagation has a significant influence on the system behavior in critical situations. Error propagation analysis is also a necessary activity for safety system design. It helps to estimate the likelihood of error propagation to hazardous parts of the system and to identify parts of the system that should be protected with error detection or error recovery mechanisms more strongly than the others. Another possible application area is system testing and debugging. An accurate error propagation analysis assists in selecting an appropriate testing strategy. It helps to identify the most critical parts of the system (from either the reliability or the safety point of view) and to generate a set of test cases that will stimulate fault activation in these particular parts and allow the detection of the resulting errors.
Probabilistic error propagation analysis can be used for system diagnostics. When an error is detected in observable system outputs, it helps to trace the error propagation path back to the error source. This speeds up error localization, system testing, and debugging. In real systems, fault activation and further error propagation are very complex processes. An accurate error propagation analysis therefore needs a strong mathematical framework. The specifics of the mechatronic domain bring additional complexity: mechatronic systems incorporate an assembly of heterogeneous components (mechanical, electrical, computer, and information technology) with various mutual interactions. The goal of mechatronic system design is to ensure a proper and coordinated operation of these elements within a feedback structure under all possible operational conditions. According to Janschek [2], one of the big challenges of mechatronics is the use of appropriate models, which describe this mutual interaction on a common abstract layer. Error propagation analysis, as an essential part of mechatronic system design, also requires a specific model. This model must be able to operate with abstract entities to represent various properties of the heterogeneous mechatronic components. Such an error propagation model is presented in detail in this article. Some of the basic ideas can also be found in our previous publications [3–5].

2. State of the art

Most safety-critical mechatronic systems consist of a mix of software and hardware elements. Typically, error propagation analysis of hardware is based on one of the classical reliability evaluation techniques: failure modes and effects analysis (FMEA), hazard and operability studies (HAZOP), fault tree analysis (FTA), event trees (ET), etc.
Generally, the process of failure analysis consists of several activities: identifying failures of individual components, modeling the failure logic of the entire system, analyzing the effect of a failure on other components, and determining and engineering the mitigation of potential hazards. With the emergence of component-based development approaches, investigations began exploring component-oriented safety analysis techniques, mainly focusing on creating encapsulated error propagation models. These failure propagation models describe how failure modes of incoming messages, together with internal component faults, propagate to failure modes of outgoing messages [6–10]. In the software engineering domain, the majority of classical error propagation approaches are based on fault injection or error injection techniques, combined with further statistical evaluation. Three of them were introduced by Voas [11–13]. An empirical study of the propagation of data-state errors was presented in [14]. Candea et al. present a technique for automatically capturing dynamic fault propagation information in [15]. The authors use instrumented middleware to discover potential failure points in the application. Khoshgoftaar et al. [16] describe the identification of software modules that do not propagate errors induced by a suite of test cases. A number of papers describe the influence of software error propagation phenomena on system reliability [17,18]. Four error propagation models that can be considered the best candidates for the analysis of mechatronic systems are listed in this section. Unlike the models discussed above, these four candidates are abstract enough to cope with the heterogeneity of components of mechatronic systems and have a strong mathematical foundation. The key properties of these models are compared in Table 1.
Table 1. Comparison of the suitable models for the error propagation analysis of mechatronic systems.

Abdelmoez et al., 2002/2004/2005
  Application areas: COTS
  Required data: State and sequence UML diagrams
  Main idea: An early estimate of the error propagation probabilities between system components in terms of states and messages
  Purpose: General use and reliability assessment
  Deficiencies: Not abstract enough. Requires very specific and detailed system models

Hiller et al., 2001/2005/2007
  Application areas: Modular software for embedded systems
  Required data: Source code and reliability measurements
  Main idea: The concept of error permeability through a system module
  Purpose: Placement of EDM and ERM. Reduction of error propagation by design
  Deficiencies: More oriented to module level rather than system level analysis. Applicable only for a software part of the system

Mohamed et al., 2008/2010
  Application areas: COTS
  Required data: Component UML diagrams, estimated fault activation and error propagation probabilities
  Main idea: Error propagation through an architectural service route
  Purpose: Reliability assessment
  Deficiencies: Not comprehensive enough. Can be considered an offshoot of Cortellessa’s model

Cortellessa et al., 2006/2007
  Application areas: COTS and SOA
  Required data: Fault activation, error propagation, and control flow transition probabilities
  Main idea: Probabilistic error propagation analysis using Markovian representation of control flow
  Purpose: Placement of error detection and error recovery mechanisms. Identification of critical components. Development of cost-effective testing strategies
  Deficiencies: Does not distinguish between control and data flows

Abdelmoez’s model [19–21] is a design-level model for error propagation analysis of COTS systems that was also extended for reliability evaluation in [22]. This model uses information about system states and messages in order to compute the probability of error propagation between system components. The advantage of this model is the possibility of its application in the early phases of system development. However, it requires a very detailed and specific UML description that should also be very accurate for obtaining trustworthy results. Based on this concept, Hiller et al. [23–26] introduce the concept of error permeability through software modules and an error propagation model. This model is defined for modular software of embedded systems and can be used for dependable system design. Hiller’s model seems more suitable for real-world application than Abdelmoez’s model because it operates at the source-code level. The detailed case studies and the software tool PROPANE have proven this fact. However, the discussed concept can only be applied to the software part of a mechatronic system because the theoretical background of this model is not comprehensive enough. Mohamed and Zulkernine [27] present another approach to error propagation analysis and its application for system reliability assessment, based on the definition of architectural service routes. In spite of several deviations, Mohamed’s model can be considered an offshoot of Cortellessa’s model [28,29]. Cortellessa’s model is based on the Markov representation of system control flow. It was originally developed for Commercial Off-The-Shelf (COTS) systems and later extended for Service-Oriented Architecture (SOA) systems.
This model has the strongest mathematical background in comparison to the other error propagation models that have been discussed in this section. The authors demonstrate its applicability for smart placing of error detection and error recovery mechanisms, planning of cost-effective testing strategies, and system reliability evaluation.

3. General concept and motivation

After the literature overview, the general idea of Cortellessa’s model was selected as the starting point of this research project. Nevertheless, this model has a significant disadvantage. The authors assume that "data errors always propagate through control flow" [28]. This means that Cortellessa’s model does not distinguish between control flow and data flow. It works correctly for systems with a straightforward design. However, in real-world systems, sequential execution of several components does not imply that each component uses the data produced by the previous component. The authors do not take into account systems with components that can be triggered in arbitrary order with non-sequential data flow. The presented work aims to eliminate this drawback. A small counter-example of a collision avoidance system, shown in Fig. 1, demonstrates a particular case in which Cortellessa’s model is inapplicable. It also shows that accurate system level error propagation analysis requires simultaneous examination of both control flow and data flow structures. Fig. 1(a) describes a mobile robot equipped with collision avoidance software. A control unit installed in the robot receives data from sensors and controls the movement of the robot. The robot has two speed sensors embedded in two wheels. The robot is also equipped with two ultrasonic sensors that measure the distance to obstacles: one sensor on the front and another one on the back. A part of the collision avoidance software is represented on the right side of Fig. 1.
It is a simple algorithm that computes two parameters: the speed of the robot and the time to collision. The algorithm starts by reading the information from the sensors into the variables fd, bd, ls, and rs. The variables ls and rs represent the speeds of the left and right wheels of the robot, while fd and bd hold the current distance to an obstacle measured by the front and back ultrasonic sensors. Next, the algorithm computes the overall speed of the robot and saves it to a variable s. A positive value of s means that the robot moves forward, and the time to collision t is computed using the information from the front sensor, fd. Otherwise, it is assumed that the robot moves backward, and the variable bd is used for the computation of t. In the discussed example, the sensors represent system inputs and the variables t and s represent system outputs.

Fig. 1. A part of a collision avoidance system of a mobile robot and three possible scenarios of error propagation.

This gives rise to two primary questions that concern the error propagation process through this system: (1) What is the likelihood of observing an error in the output, given an error in the input? (2) Where is the source of the error that is observed in the output? Fig. 1(b)–(d) show three situations that should be discussed in order to answer the first question. Assume that a fault is activated in the left sensor (Fig. 1(b)). In this case, the resulting error affects the variable ls. The variable ls is used in the computation of s, which is then used in the computation of t. Hence, the error propagates through the system to both outputs t and s. In this situation, t and s have a data dependence on the variable ls. In the case of fault activation in the back sensor (Fig. 1(c) and (d)), the variable bd contains an erroneous value. This error can only propagate to the output t, because the computation of s does not depend on bd. However, in this case, error propagation depends on control flow. If the robot is moving backward, then the error actually affects the output t (Fig. 1(d)). Otherwise, if the robot is moving forward, the variable fd is used instead of bd, and the error is masked (Fig. 1(c)). Assume that the operational profile of this mobile robot is known, e.g. from statistical experiments: it moves backward only 1/10 of the time during typical use. This assumption gives an answer to the first question. An error from the left or right sensor will always propagate to both outputs. An error from the front sensor will propagate to t in 90% of the cases. An error from the back sensor will propagate to t in 10% of the cases. The second question refers to an inverse problem. The answer to this question depends not only on the system behavior, but also on the likelihood of fault activation in each sensor. However, even without this information, it is clear that observation of an error in both s and t means that the left sensor or the right sensor (or maybe both of them) produces erroneous data. If there is an error only in the output t and the probabilities of fault activation in each sensor are equal, then the error comes from the back sensor in 10% of the cases and from the front sensor in 90%. This discussion demonstrates that even in this primitive case, only the separate analysis of control and data flows describes the actual error propagation process. Generally speaking, a data flow analysis shows only the possibility of error propagation from one part of a system to another. A probabilistic control flow analysis makes it possible to estimate the likelihood of this propagation.
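The arithmetic behind the two answers can be retraced with a few lines of Python. This is only an illustrative sketch of the robot example, not part of the model itself: the names `reach_t`, `reach_s`, and `posterior_only_t` are hypothetical, and the calculation assumes that exactly one sensor is faulty and that the prior fault activation probabilities are equal, as stated in the text.

```python
from fractions import Fraction

# Operational profile from the text: the robot moves backward 1/10 of the time.
P_BACKWARD = Fraction(1, 10)

# P(error reaches output t | fault in the given sensor), as argued in the text.
reach_t = {"left": 1, "right": 1, "front": 1 - P_BACKWARD, "back": P_BACKWARD}
# P(error reaches output s | fault in the given sensor): s depends only on ls and rs.
reach_s = {"left": 1, "right": 1, "front": 0, "back": 0}


def posterior_only_t(prior):
    """P(sensor is the source | error observed in t but not in s).

    Assumption (not from the paper): exactly one sensor is faulty.
    """
    weights = {
        sensor: prior[sensor] * reach_t[sensor] * (1 - reach_s[sensor])
        for sensor in prior
    }
    total = sum(weights.values())
    return {sensor: weight / total for sensor, weight in weights.items()}


equal_prior = {sensor: Fraction(1, 4) for sensor in reach_t}
posterior = posterior_only_t(equal_prior)
```

With equal priors, `posterior["front"]` is 9/10 and `posterior["back"]` is 1/10, which matches the figures quoted above; the wheel sensors are ruled out because their errors would also appear in s.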
Simultaneous examination of the control and data flows is a key feature of the error propagation model introduced in this article. A general structure of the entire approach presented in this article is shown in Fig. 2. Section 4 introduces a dual-graph error propagation model. This model can be generated using a base-line system representation. The dual-graph error propagation model is a new mathematical framework for probabilistic error propagation analysis that makes simultaneous examination of control and data flow graphs of a system possible. Section 5 discusses an extensive approach to error propagation analysis that uses an absorbing discrete time Markov chain to describe error propagation processes during system execution and a specific method for computation of the Markov chain that enables the probabilities of different system execution scenarios to be obtained in terms of fault activation, error propagation, and error detection. Section 6 demonstrates applicability of the introduced concept using a typical mechatronic system.

4. Dual-graph error propagation model

4.1. Formal definition

A dual-graph error propagation model (EPM) is an abstract mathematical framework for error propagation analysis (EPA) of mechatronic systems. A behavioral aspect of a system under test forms the basis of the introduced EPM. The system is considered to be a nonempty set of $N_E$ independent elements:

$$E := \{e_1, e_2, \dots, e_{N_E}\}$$

Each element represents an executable part of the system. A software function that is executed during the operation, a hardware sensor that performs measurements, a controller, or even an actuator are all examples of these elements. Execution of the element represents an operation of the corresponding part of the system. An element $e_{init} \in E$ defines the initial element. It is assumed that a system can only have one initial element and that the operation starts with the execution of this element.
Two directed graph models are defined using this set of elements: a control flow graph (CFG) and a data flow graph (DFG). A control flow graph is a structural representation of system control flow. It represents a possible order of execution of system elements:

$$G^{CF} := \langle E, A^{CF} \rangle, \quad E = \{e_1, \dots, e_{N_E}\}, \quad A^{CF} := \{(e_i, e_j, p^{CF}_{e_i,e_j}), \dots, (e_k, e_l, p^{CF}_{e_k,e_l})\}$$

The elements $E$ play the role of nodes of the CFG. Arcs of the CFG represent control flow transitions between the elements. The set of arcs is denoted by $A^{CF}$. An arc $(e_i, e_j, p^{CF}_{e_i,e_j}) \in A^{CF}$ from an element $e_i$ to an element $e_j$ defines that $e_j$ can be executed immediately after execution of $e_i$. Also, in contrast to the initial element, final elements are defined as the elements without outgoing CFG arcs. These elements are called "final" because the system operation stops after their execution. A transition probability is defined as a property of each arc of the CFG. The transition probability through an arc from an element $e_i$ to an element $e_j$ is denoted by $p^{CF}_{e_i,e_j}$. All values of the transition probabilities lie within the interval $[0, 1]$, and the sum of the transition probabilities of all outgoing arcs of each CFG node equals 1. In the visual representation of a CFG, transition probabilities are denoted by numbers near the arcs. Fig. 3(a) shows an example of a simple CFG.

Fig. 2. A general structure of the presented approach to error propagation analysis.

A data flow graph is a structural representation of the data flow of a system:

$$G^{DF} := \langle E, A^{DF} \rangle, \quad E = \{e_1, \dots, e_{N_E}\}, \quad A^{DF} := \{(e_i, e_j), \dots, (e_k, e_l)\}$$

The DFG of the system contains the same set of nodes $E$ as the CFG.
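The two graph definitions map naturally onto plain dictionaries and sets. The following Python sketch is one possible encoding, given under stated assumptions: the four element names and the arcs are a hypothetical decomposition loosely inspired by the robot example, not the reference model of Fig. 3, and `final_elements` and `validate_model` are illustrative helper names. The validation covers only the constraints stated in the text (probabilities in [0, 1], outgoing probabilities summing to 1, both graphs sharing the node set E).

```python
from math import isclose

# Hypothetical element set E and initial element (assumption, not from Fig. 3).
ELEMENTS = {"read", "speed", "ttc_front", "ttc_back"}
INITIAL = "read"

# CFG arcs: (e_i, e_j) -> transition probability p^CF_{e_i,e_j}.
CFG = {
    ("read", "speed"): 1.0,
    ("speed", "ttc_front"): 0.9,
    ("speed", "ttc_back"): 0.1,
}

# DFG arcs: (e_i, e_j) means that output of e_i is used as input of e_j.
DFG = {
    ("read", "speed"),
    ("read", "ttc_front"),
    ("read", "ttc_back"),
    ("speed", "ttc_front"),
    ("speed", "ttc_back"),
}


def final_elements(elements, cfg):
    """Final elements are the elements without outgoing CFG arcs."""
    sources = {src for (src, _dst) in cfg}
    return elements - sources


def validate_model(elements, cfg, dfg, initial):
    """Return a list of violated constraints (empty if the model is consistent)."""
    problems = []
    if initial not in elements:
        problems.append(f"initial element {initial!r} is not in E")
    for (src, dst), p in cfg.items():
        if src not in elements or dst not in elements:
            problems.append(f"CFG arc ({src}, {dst}) leaves the node set E")
        if not 0.0 <= p <= 1.0:
            problems.append(f"CFG arc ({src}, {dst}) has probability {p}")
    for src, dst in dfg:
        if src not in elements or dst not in elements:
            problems.append(f"DFG arc ({src}, {dst}) leaves the node set E")
    # Every non-final node must distribute probability 1 over its outgoing arcs.
    for element in sorted(elements - final_elements(elements, cfg)):
        total = sum(p for (src, _dst), p in cfg.items() if src == element)
        if not isclose(total, 1.0):
            problems.append(f"outgoing probabilities of {element} sum to {total}")
    return problems
```

Keeping the CFG and DFG as separate structures over the same node set mirrors the dual-graph idea: the DFG says where an error could travel, while the CFG probabilities say how often the corresponding elements actually run.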
Arcs of the DFG show the possibility of data transfer between the elements. For example, an arc $(e_i, e_j) \in A^{DF}$ from an element $e_i$ to an element $e_j$ denotes that the output of $e_i$ will be used as input in $e_j$. Fig. 3(b) demonstrates a small example of a DFG. Faults can be activated in the elements during their execution and result in the occurrence of errors. Error propagation through the system comprises two aspects: error propagation between the elements and error propagation through the elements. The error propagation between the elements is determined by the DFG structure. The error propagation through the elements depends on the properties of a particular element. Each element is defined using four parameters: fault activation probability (FAP), error propagation probability (EPP), error detection probability (EDP), and error detection behavior (EDB). The two graphs and these parameters of all of the elements describe an operational profile of the system. Superposition of control flow sequences on the data flow graph shows error propagation during system execution. Simultaneous probabilistic analysis of the CFG and the DFG forms the backbone of the application of the introduced error propagation model. From a data transfer perspective, the element execution is the transformation of input data to output data (see Fig. 4(a)). Each element $e_i$ of the EPM contains $N^{in}_{e_i}$ inputs and $N^{out}_{e_i}$ outputs:

$$I_{e_i} := \{i_1, i_2, \dots, i_{N^{in}_{e_i}}\}, \quad O_{e_i} := \{o_1, o_2, \dots, o_{N^{out}_{e_i}}\}$$

A typical example of an element is a software function that takes a set of input parameters and transforms them into a set of output parameters. Faults can be activated during the execution of an element. An error occurs because of the fault activation. The defined EPM implies that the resulting error propagates to all outputs of the element (see Fig. 4(b)).
In other words, whenever the element $e_i$ is executed, all of its outputs become erroneous with a given fault activation probability $p^{FA}_{e_i}$. It is also assumed that fault activations in different elements during the operation of the system are independent events. The probability of fault-free execution of $e_i$ is denoted by $\bar{p}^{FA}_{e_i}$ and equals $1 - p^{FA}_{e_i}$. If there is an error in at least one of the inputs of an element $e_i$, the error propagates through this element to all data outputs with a given error propagation probability $p^{EP}_{e_i}$ (see Fig. 4(c)). The probability of error masking in the element $e_i$ is defined as $\bar{p}^{EP}_{e_i} = 1 - p^{EP}_{e_i}$. Error propagation through the element and fault activation in the element are considered to be independent events. Some system elements can be equipped with error detection mechanisms (EDM). An error in the input of an element $e_i$ can be detected with the given error detection probability denoted by $p^{ED}_{e_i}$ (see Fig. 4(d)). It is assumed that only errors in inputs are detectable. An error that occurs because of fault activation within the element can be detected only by the elements that will be executed later. In other words, an EDM checks the inputs at the beginning of the execution of the element, before possible fault activation. $\bar{p}^{ED}_{e_i} = 1 - p^{ED}_{e_i}$ defines the probability that the error was not detected by the EDM. The parameter $b_{e_i}$ defines the error detection behavior (EDB) for the element $e_i$. This parameter takes one of the following three values: fail-stop (FS), error message (EM), or error correction (EC).

4.2. Obtaining the required parameters

It is assumed that some kind of base-line system description is available. Different types of base-line system models can be transformed into the dual-graph error propagation model. In the early phases of the system development life cycle, UML diagrams (see [30]) can be used for derivation of the system level parameters.
UML is a widely accepted candidate for heterogeneous system modeling. Although it was originally designed for software systems, it is equally applicable to the modeling of heterogeneous systems such as mechatronic systems on a certain abstraction level, e.g. via the UML extension for systems engineering, SysML (see [31]). The core aspects, structure, behavior, and interaction, can be described with UML diagrams in a formal and transparent manner. UML activity diagrams are the most suitable for the presented approach, due to their very transparent definition of control and data flows. For this reason, this type of diagram was used in the case study presented in Section 6. However, UML activity diagrams do not provide the control flow probabilities. UML stereotypes (see [32]) can be used for their customization. A trivial extension of the UML activity diagrams enables the information about control flow probabilities to be appended to the diagram. However, at the design phase, it is rather difficult to define probabilistic control flow properly, because it strongly depends on the future operational profile. Therefore, the introduced EPM is applicable in the early phases only if the operational profile is known and the specifications of the elements of the future system are available for definition of the parameters of the elements. Other UML diagrams, such as state and sequence diagrams, can also be used. Besides UML, other types of state machine models, such as Matlab SIMULINK state charts, are acceptable for obtaining system level parameters.

Fig. 3. A control flow graph, a data flow graph, and a list of the probabilistic parameters of elements of a dual-graph error propagation model.

Fig. 4. Element level properties: (a) data inputs and outputs, (b) fault activation, (c) error propagation, (d) error detection.

In later phases, the necessary parameters can be obtained more accurately using the available hardware specifications and software source code. However, application of the error propagation model to an already developed system requires methods for system decomposition. Practically every study devoted to architecture-level system analysis touches on the problem of decomposing the system into a set of executable elements. As previously mentioned, control flow exists mostly because of the software part of the system. Therefore, it is advisable to use the software part as a basis for the decomposition, extending the actions of the software elements with the hardware and physical activities, as shown in the case study (see Section 6). In the software domain, different authors propose considering single instructions, basic blocks of code, functions, modules, classes, components, and services as the atomic elements of the system. The level of decomposition clearly depends on the trade-off between the number of elements, their complexity (size), and the available information about each element. This problem has been discussed in [33]. Too many small elements lead to a large state space of the mathematical model and difficulties in evaluation. On the other hand, too few elements make it hard to distinguish how different elements contribute to the system properties. Several software-based approaches to system decomposition are presented in [34–36]. After the decomposition, the system level and element level parameters should be defined. Existing methods for the definition of the structures of control and data flow graphs are shown in [37–43]. The following techniques are suggested for obtaining the element level parameters.
Fault activation probabilities in software and hardware parts of the system can be estimated using the numerous reliability models already discussed in Section 2. Studies of computer hardware faults that also result in data errors are discussed in [44–46]. These articles describe the influence of environmental factors such as increasing heat, lowering voltage, radiation, etc. on the occurrence of so-called bit-flips in computer hardware (memory, hard drive, CPU) that result in errors in computation and stored variables. Error propagation probabilities for software parts can be evaluated using the methods proposed in [24] in the section "Estimating Error Permeability: An Experimental Approach". Error propagation through the hardware parts can be evaluated using an analysis of their specifications and statistical tests. For the physical parts, the creation of corresponding software-based models that describe their behavior is recommended. These models can be used for obtaining the error propagation probabilities at the software level. Error detection probabilities strongly depend on the error detection mechanisms that will be used. The best solution is statistical evaluation and individual structural analysis for each particular case.

5. Error propagation analysis

This section introduces a general and comprehensive approach to error propagation analysis using the discussed dual-graph error propagation model. The central idea of this approach is the application of a discrete time Markov chain (DTMC) model. Informally, a DTMC model can be represented by a so-called state graph. The nodes of this graph describe the state space of the Markov chain. The directed arcs are weighted by the probabilities of transition between the states of the DTMC. We call the state graph of the DTMC model that is used for the error propagation analysis an error propagation graph (EPG).
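To make the state-graph view concrete, the sketch below computes the probability of ending in each absorbing state of a small DTMC by pushing probability mass along the weighted arcs. It is a generic textbook-style computation given as an assumption-laden illustration, not the specific computation method of this article. The function name `absorption_probabilities`, the state names, and the transition values are all hypothetical.

```python
def absorption_probabilities(transitions, start, tol=1e-12, max_steps=10_000):
    """Probability of ending in each absorbing state of a DTMC.

    transitions: dict mapping a state to {next_state: probability}.
    States without outgoing arcs are treated as absorbing (final).
    """
    mass = {start: 1.0}  # probability mass currently in non-final states
    absorbed = {}        # probability mass that has reached final states
    for _ in range(max_steps):
        next_mass = {}
        for state, p in mass.items():
            outgoing = transitions.get(state)
            if not outgoing:
                absorbed[state] = absorbed.get(state, 0.0) + p
                continue
            for nxt, q in outgoing.items():
                next_mass[nxt] = next_mass.get(nxt, 0.0) + p * q
        mass = next_mass
        if sum(mass.values()) < tol:
            break
    return absorbed


# Hypothetical state graph with one loop; the numbers are illustrative only.
example = {
    "s0": {"done_ok": 0.7, "s1": 0.3},
    "s1": {"s0": 0.5, "done_error": 0.5},
}
result = absorption_probabilities(example, "s0")
```

For this toy chain the closed-form values are 0.7 / 0.85 for `done_ok` and 0.15 / 0.85 for `done_error`, so the iterative result can be checked by hand. For larger graphs, a linear-algebra solution of the absorbing chain would usually be preferable to plain iteration.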
The nodes of this graph (or states of the DTMC model) represent various states of the system in terms of fault activation, error propagation, and error detection. The arcs show probabilistic transitions between these states. The EPG can be automatically generated using the data contained in the EPM. This section describes the structure of the EPG and the application of the mathematical apparatus of Markov chains for the error propagation analysis.

5.1. Error propagation graph

An error propagation graph (EPG) is a state graph of the discrete time Markov chain. It is a weighted, directed graph:

$G_{EP} := \langle S, A_{EP} \rangle$
$S := \{ s_1, \ldots, s_{N_S} \}$
$A_{EP} := \{ (s_i, s_j, p_{s_i,s_j}), \ldots, (s_k, s_l, p_{s_k,s_l}) \}$

$S$ is the set of nodes of the EPG. Each node $s_i \in S$ represents a system state. $A_{EP}$ is the set of arcs of the EPG. The weights assigned to these arcs define the probabilities of state changes. Fig. 3 shows a dual-graph error propagation model of a reference system. This model allows for the construction of an EPG of this system. Fig. 5 demonstrates a part of this EPG. The algorithm for the EPG generation will be described later in this section.

Fig. 5. A part of an EPG generated using the reference EPM.

A node of an EPG, or a state of the corresponding DTMC model, describes the system state in between the executions of two elements. Exceptions are the initial and final states, which describe the state of the system before the execution of the initial element and after the execution of a final element, respectively. Each node $s_i \in S$ is characterized by four parameters:

$s_i := \langle e^{next}_{s_i}, E^{FA}_{s_i}, E^{EP}_{s_i}, E^{ED}_{s_i} \rangle$

The parameter $e^{next} \in (E \cup \{\text{'none'}, \text{'FS'}\})$ defines the element from the set $E$ that will be executed next. This parameter describes the control flow aspect of the system. An expression $e^{next}_{s_i} = e_k$ shows that the system state $s_i$ refers to the moment in time when the execution of a particular element has already been completed, and the system is about to run the element $e_k$. In addition, $e^{next}$ can take two specific values: 'none' or 'FS'. Both values show that the system has finished its execution. The states whose $e^{next}$ parameter equals 'none' or 'FS' are defined as final states of the system. The value 'none' describes a regular completion of system execution. The value 'FS' shows an unscheduled system stop (fail-stop) because of error detection in the last element executed.

The parameter $E^{FA} \subseteq E$ defines the subset of system elements where faults have already been activated. It means that elements can only be appended to $E^{FA}$ during the system operation. Also note that $E^{FA}$ is a set, not a sequence. Therefore, it shows neither the order of fault activations nor their number in the case of several fault activations in the same element.

The parameter $E^{EP} \subseteq E$ defines the subset of system elements to which errors have propagated at this moment. It describes the current faulty (or fault-free) condition of the system. In other words, $E^{EP}$ is the subset of elements whose outputs are currently erroneous. Unlike the subset $E^{FA}$, elements can be either appended to or removed from $E^{EP}$ during system execution.

The parameter $E^{ED} \subseteq E$ defines the subset of system elements where errors have been detected. It shows the error detection history in the same way as $E^{FA}$ shows the fault activation history.

As an illustration, Fig. 6 depicts a correspondence between a state of the system (shown in Fig. 3) and a node $s_D$ of the EPG (shown in Fig. 5). The left side of Fig. 6 describes the current state of the system using additional signs and color markers on the control and data flow graphs.
The CFG shows that elements $e_1$ and $e_2$ have already been executed, and an element $e_3$ will be executed next (highlighted in blue). This fact is represented by the parameter $e^{next} = e_3$ at the node of the EPG. The lightning bolt sign on the data flow graph means that there was a fault activation during the execution of the element $e_1$. $E^{FA} = \{e_1\}$ at the EPG node demonstrates that fact. Because of this fault activation, the element $e_1$ has an erroneous output, and this error further propagates to the element $e_2$ (highlighted in red). It is shown by the parameter $E^{EP} = \{e_1, e_2\}$ of the node $s_D$. Finally, the loupe sign near the element $e_2$ on the DFG denotes that the error has been detected in the element $e_2$. Correspondingly, $E^{ED} = \{e_2\}$ at the EPG node.

Fig. 6. A correspondence between a system state (left) and a node of an EPG (right).

Fig. 5 demonstrates a part of the EPG and describes the key principles of representation of the error propagation process. A state $s_A$ is defined as the initial state of the EPG. It shows that the first element to be executed will be $e_1$. It also shows that there are no faults or errors in the system at this moment. The CFG of the reference EPM (see Fig. 3) defines that the element $e_2$ will be the next to be executed, and there are no other options. Therefore, the value of the $e^{next}$ parameter of both descendants of the node $s_A$ of the EPG ($s_B$ and $s_C$) is the element $e_2$. There are, however, two possible execution scenarios for $e_1$: a faulty one and a fault-free one. Hence, the system changes its state from $s_A$ to $s_B$ with the fault activation probability $p^{FA}_{e_1}$, and to $s_C$ with the probability $\bar{p}^{FA}_{e_1} = (1 - p^{FA}_{e_1})$.
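The node structure and the first transition described above can be illustrated with a minimal sketch. The class name `EpgNode`, its field names, the helper `successors_of_initial`, and the numeric value 0.1 are assumptions made for this illustration only; they are not taken from the reference implementation in [47].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpgNode:
    """One EPG node <e_next, E_FA, E_EP, E_ED> (hypothetical field names)."""
    e_next: str                   # element to be executed next, or 'none' / 'FS'
    fa: frozenset = frozenset()   # elements where faults have been activated (history)
    ep: frozenset = frozenset()   # elements whose outputs are currently erroneous
    ed: frozenset = frozenset()   # elements where errors have been detected (history)

    def is_final(self) -> bool:
        # Final states are those with e_next equal to 'none' or 'FS'.
        return self.e_next in ("none", "FS")


def successors_of_initial(p_fa_e1: float) -> dict:
    """Successors of s_A after running e1; e2 is the only CFG successor of e1."""
    s_b = EpgNode("e2", fa=frozenset({"e1"}), ep=frozenset({"e1"}))  # faulty run
    s_c = EpgNode("e2")                                              # fault-free run
    return {s_b: p_fa_e1, s_c: 1.0 - p_fa_e1}


s_a = EpgNode("e1")                # initial state: no faults, no errors
arcs = successors_of_initial(0.1)  # 0.1 is an illustrative value, not from the paper
```

Making the node immutable and hashable lets it serve directly as a dictionary key, so identical system states reached along different paths collapse into a single EPG node.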
It is also possible to return to the state $s_A$ from $s_C$ after the fault-free execution of $e_2$ because of the existence of the CFG arc $(e_2, e_1)$. It shows that the EPG is neither a tree nor a cycle-free directed graph. However, it is not possible to return to $s_A$ from $s_B$ or any other descendant of $s_B$ because $E^{FA}_{s_B} \not\subseteq E^{FA}_{s_A}$. A similar situation happens because of the parameter $E^{ED}$. An EPG will inevitably have groups of nodes that are connected one-way.

The outgoing arcs from the state $s_B$, such as $(s_B, s_D)$, $(s_B, s_E)$, and $(s_B, s_F)$, show the different execution scenarios that are possible for element $e_2$. Namely, $(s_B, s_D)$ defines fault-free execution with error propagation, error detection, and the control flow transition to element $e_3$. An EPG arc $(s_B, s_E)$ shows fault-free execution with error masking, without error detection, and the control flow transition to element $e_1$. An EPG arc $(s_B, s_F)$ also shows fault-free execution with error masking, without error detection, but with the control flow transition to element $e_4$. Other arcs define other possible combinations. A state change sequence $[s_A \to s_B \to s_E \to s_J]$ shows a particular case of system execution: an error arises because of fault activation in $e_1$ ($E^{EP}_{s_B} = \{e_1\}$) and vanishes because of a second, fault-free execution of $e_1$ ($E^{EP}_{s_J} = \emptyset$).

An algorithm for EPG generation is described in detail in [47]. This algorithm transforms a CFG with defined control flow probabilities, a DFG, and the probabilistic properties of the elements of a given EPM into an error propagation graph. The algorithm works iteratively. At the beginning, the algorithm creates an initial node of the EPG using the initial element of the CFG as the value of the parameter $e^{next}$ and empty sets for $E^{FA}$, $E^{EP}$, and $E^{ED}$. This node is considered to be the current node in the first iteration. With each succeeding iteration, the algorithm identifies all possible states to which the system can move after the execution of $e^{next}$, and computes the corresponding transition probabilities. These states are appended as new nodes to the EPG, as well as the arcs from the current node to the new nodes. After that, the algorithm selects the first unvisited node of the EPG as the current node and starts a new iteration. This process repeats until all nodes of the EPG have been visited.

5.2. System execution scenarios

According to the EPG, system operation starts with an initial state and is completed in one of the final states. These states are in focus because they correspond to different system execution scenarios. For example, the node $s_L$ (see Fig. 5) represents a fault-free completion of system execution. The node $s_H$ represents a fail-stop because of error detection in the element $e_3$. It happens because of the "fail-stop" error detection behavior that is defined for $e_3$ in the original EPM. The node $s_I$ shows that an error has propagated to $e_3$ from $e_1$ as in the case of $s_H$, but has not been detected in $e_3$. The node $s_K$ shows that an error has propagated to $e_4$, has been detected in $e_4$, and has been corrected because of the "error correction" value of the error detection behavior parameter of $e_4$.

The states of the Markov chain represented by the final nodes of the EPG are absorbing states. A state of a Markov chain is called absorbing if it is impossible to leave it. A Markov chain is absorbing if it has at least one absorbing state, and if from every state it is possible to go to an absorbing state (not necessarily in one step). In an absorbing Markov chain, a state which is not absorbing is called transient [48]. The mathematical apparatus of the absorbing DTMC enables the absorption probabilities for all final states to be estimated.

Let us consider an arbitrary absorbing DTMC. Renumber the states so that the transient states come first. If there are $r$ absorbing states and $t$ transient states, the transition matrix will have the following canonical form:

$P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix}$

Here $I$ is an $r$-by-$r$ identity matrix, $0$ is an $r$-by-$t$ zero matrix, $R$ is a nonzero $t$-by-$r$ matrix that defines the transitions from transient to absorbing states, and $Q$ is a $t$-by-$t$ matrix that represents the transitions between transient states of the DTMC. The first $t$ states are transient, and the last $r$ states are absorbing. The absorption probabilities can be computed as the following matrix product:

$B = N R$

where $B$ is a $t$-by-$r$ matrix of absorption probabilities, and $N$ is the so-called fundamental matrix of the DTMC:

$N = I + Q + Q^2 + \cdots = (I - Q)^{-1}$

The element $b_{i,j}$ of matrix $B$ represents the probability of absorption in the absorbing state $j$ starting from the transient state $i$. The knowledge about the probability distribution between the final states allows system behavior to be predicted. This is a general and comprehensive way of analyzing fault activation, error propagation, and error detection processes. Let us consider three typical problems of error propagation analysis that can be solved using the introduced method.

Problem 1. The majority of the existing error propagation models are developed for one specific goal: to estimate the probability of error propagation from one system element to another. This goal can be achieved using the introduced approach. Assume, for instance, that we are only interested in the probability of error propagation from an element $e_i$ to an element $e_j$. The following four steps have to be performed in order to obtain this probability.

1. Define fault activation in $e_i$ as the initial state of the system: $s_0 := \langle e^{next} : e_i;\ E^{FA} : \{e_i\};\ E^{EP} : \{\};\ E^{ED} : \{\} \rangle$
2. Apply the EPG generation algorithm, starting with the state $s_0$, to generate $S$ and $A_{EP}$.
3. Turn the EPG states that contain $e_j$ in $E^{EP}$ into absorbing states by removing their outgoing arcs: $A_{EP} := A_{EP} \setminus (s_a, s_b, p),\ \forall s_b \in S \text{ and } s_a \in S : e_j \in E^{EP}_{s_a}$
4. Compute and sum up the probabilities of absorption for the absorbing states that contain $e_j \in E^{EP}$.

The value obtained is the sum of the probabilities of the execution scenarios that lead to error propagation from the element $e_i$ to the element $e_j$. This approach allows an error propagation matrix that contains the probabilities of error propagation between all pairs of system elements to be constructed. An analysis of this matrix helps to identify exactly how vulnerable the system is to fault activation and which parts of the system should be protected first.

Problem 2. Another example of the application of the approach is the estimation of the probability of error propagation to a defined set of critical elements, $E_C \subseteq E$. The following algorithm solves this task.

1. Define the default initial state of the system: $s_0 := \langle e^{next} : e_0;\ E^{FA} : \{e_i\};\ E^{EP} : \{\};\ E^{ED} : \{\} \rangle$, where $e_0$ is the initial element of the CFG.
2. Apply the EPG generation algorithm to generate $S$ and $A_{EP}$.
3. Turn the EPG states that contain at least one critical element $e_c \in E_C$ in $E^{EP}$ into absorbing states by removing their outgoing arcs: $A_{EP} := A_{EP} \setminus (s_a, s_b, p),\ \forall s_b \in S \text{ and } s_a \in S : \exists e_c \in E_C \text{ and } e_c \in E^{EP}_{s_a}$
4. Compute and sum up the probabilities of absorption for the absorbing states defined in the previous step.

The result of this algorithm is the probability of error propagation to at least one of the critical elements. For instance, it can help in the reliability analysis of a system that has several important data outputs. This approach can also be applied to estimate the safety of a mechatronic system by defining the critical elements that can harm an operator or the environment because of an erroneous input. It allows the probabilities of error propagation to this set of critical elements to be computed, given the fault activation in a particular system element. This algorithm also enables a sensitivity analysis and identifies the elements that have the most effect on the critical ones.

Problem 3. One other example of the application of the approach concerns the system testing process. The following algorithm can estimate the probability of error detection in a particular element $e_d$ that is equipped with an error detection mechanism.

1. Define the default initial state of the system: $s_0 := \langle e^{next} : e_0;\ E^{FA} : \{e_i\};\ E^{EP} : \{\};\ E^{ED} : \{\} \rangle$, where $e_0$ is the initial element of the CFG.
2. Apply the EPG generation algorithm to generate $S$ and $A_{EP}$.
3. Turn the EPG states that contain $e_d$ in $E^{ED}$ into absorbing states by removing their outgoing arcs: $A_{EP} := A_{EP} \setminus (s_a, s_b, p),\ \forall s_b \in S \text{ and } s_a \in S : e_d \in E^{ED}_{s_a}$
4. Compute and sum up the probabilities of absorption for the absorbing states defined in the previous step.

These probabilities of error detection help to identify test cases that will stimulate fault activations in critical elements and will allow the majority of the occurred errors to be detected.

The introduced approach can be applied to a variety of tasks in system analysis. It can be used in order to improve reliability estimation. The majority of existing reliability models do not take error propagation into account at all. The introduced approach can extend the available reliability models or can even be applied separately. It also enables the probability of error propagation to critical elements to be estimated, supporting existing safety models. The introduced approach can be used to select a system protection strategy by identifying the most suitable places for error detection mechanisms. It can also be applied for fault localization in mechatronic systems, observing a number of system outputs that help to speed up the diagnostics and debugging of the system. The application concept of the dual-graph error propagation model for fault localization was demonstrated in [3].

6. Case study

This section contains a brief overview of a case study that evaluates the introduced approach to error propagation analysis. A detailed description can be found in [5,47]. The sample system in question is a caterpillar mobile robot, shown in Fig. 7. As a typical mechatronic system, it consists of mutually coupled hardware, software, and physical components. The task of this system is to move the robot from its current location to a target point.

Fig. 7. The "caterpillar mobile robot" reference system.

An abstraction of the caterpillar mobile robot system as a control system block diagram is shown in Fig. 8. The controlled variable is the pose (position and orientation) of the robot. During each control cycle, the control software determines the current location and orientation of the robot using the navigation camera that is located above the scene. After that, it computes the speeds for the caterpillars and the length of the time interval for the servo motor activation. Next, the control software sends corresponding commands to the servo motors that correct the robot motion to go towards the target point. This procedure is repeated until the robot reaches the anticipated target point.

Fig. 8. A block diagram of a control loop of the "caterpillar mobile robot" reference system.

Fig. 9 shows the structure of the case study. It starts with the decomposition of the original robot control system into several executable elements.
A UML activity diagram [30] of the mechatronic system under consideration is shown in Fig. 10. This diagram was used as a baseline system model within a formal design process because of its explicit representation of the system elements and of the control and data flows. UML diagrams are usually applied to describe only the software part of a system. However, UML activity diagrams can model very abstract entities. In this particular case, each activity represents not just an action of the robot control software, but also the hardware and even the physical motion of the robot. The described activities are considered to be actions of the executable elements of the system. Therefore, there is a one-to-one correspondence between these activities and the elements of the dual-graph error propagation model of the "caterpillar mobile robot" reference system.

Fig. 10. The UML activity diagram of the "caterpillar mobile robot" reference system.

Software implemented fault injection (SWIFI) was used in order to obtain a faulty version of the reference system. During the development, debugging, and testing of the original system, information about introduced software and hardware bugs and design faults was tracked and saved in a special fault list. Only the faults from this list have been injected in order to ensure realistic behavior of the faulty version. However, not every fault is suitable for this case study. We focused on temporary faults [1] with low fault activation probabilities that result in erroneous data outputs. A temporary fault, in contrast to a permanent fault, is activated only under specific execution conditions such as specific values of input parameters, environmental impact, or a control flow decision. The faults have been injected into all elements except "GuiOut" because it has no data output. Software bugs have been injected into "Init", "CalcAngle", "CorrStep", and "RegStep".
For example, many geometric computations are performed in order to determine the angle between the current and the desired robot orientation. During one of these computations in the faulty version of this element, radians are used instead of grads. The faulty part of the code is only activated under specific circumstances that depend on the current robot orientation and the target position. Activation of this fault results in an incorrect value of the computed angle. In the elements "RcgPic" and "NxtGo", hardware faults have been emulated by software means. For instance, a physical defect in the robot construction is simulated in the faulty version of the element "NxtGo": one wheel of the mobile robot scratches the frame and moves slowly. This results in an unintended robot movement as soon as it turns left. A complete list of injected faults is presented in Appendix C of [47].

Fig. 9. A structure of the case study.

After the fault injection, a separate statistical element level analysis was performed for each individual element. Fault activation probabilities were estimated by executing the original and faulty versions of the elements in parallel (500 runs of the entire system). Error propagation probabilities were estimated using erroneous and error-free inputs for the faulty elements. A software model of the physical robot movement has been developed for the element "NxtGo" and used in parallel with the real-world robot. The element "GuiOut" has a non-zero error detection probability. This probability has been determined using a defined difference between the erroneous robot pose and the expected one that is detectable by the naked eye.
The probabilities of fault activation (FAP), error propagation (EPP), and error detection (EDP) are shown on the DFG in Fig. 11. This system was developed according to the described UML representation. Hence, the system level analysis was straightforward. The structures of the control and data flow graphs were easily derived (see Fig. 11). In addition, a number of statistical experiments (500 runs) have been carried out in order to obtain the probabilities of control flow transitions. These probabilities are shown on the CFG in Fig. 11.

The described system level and element level analyses of the reference robot control system allowed the construction of the error propagation model that is shown in Fig. 11. The generation of a Markov chain was the next step. The generated Markov chain contains 30,431 arcs and 11,887 states: 1 initial state, 9944 transient states, and 1942 absorbing states. This means that there are 1942 possible system execution scenarios, and the probabilities of absorption in these final states define the probabilities of the corresponding scenarios. All computations have been made on a regular laptop with a 2.4 GHz Intel Core 2 Duo processor and 2 GB RAM. The generation time was about 1 min.

Several computation approaches, discussed in detail in [47], have been applied in order to obtain these estimated results. The direct algebraic method, discussed in the previous section, failed because of a simple memory overflow. Next, we tried an iterative approximation algorithm. The input parameters of this algorithm are the transition probability matrix $P$ of the DTMC, a set of absorbing states, and a predefined accuracy of the estimation. The algorithm contains a loop for a step-wise computation of the DTMC. In each step, a vector that shows the current distribution of probabilities among the DTMC states is computed. Because of the absorbing character of the DTMC, these probabilities eventually consolidate in the absorbing states.
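The step-wise computation can be sketched as follows. This is a minimal illustration, not the implementation from [47]: the function name `absorption_by_iteration`, the sparse dictionary representation of the transition matrix, and the toy chain are assumptions. The loop stops once the probability mass remaining in transient states falls below the requested accuracy.

```python
def absorption_by_iteration(transitions, initial, absorbing,
                            accuracy=1e-12, max_steps=100_000):
    """Propagate the state distribution step by step until the probability
    mass left in transient states drops below `accuracy`.

    transitions: {transient_state: {next_state: probability}} (sparse rows of P)
    Returns ({absorbing_state: probability}, remaining_transient_mass).
    """
    absorbed = {state: 0.0 for state in absorbing}
    if initial in absorbed:
        absorbed[initial] = 1.0
        return absorbed, 0.0
    transient = {initial: 1.0}
    for _ in range(max_steps):
        if sum(transient.values()) <= accuracy:
            break
        step = {}
        for state, mass in transient.items():
            for target, p in transitions[state].items():
                if target in absorbed:
                    absorbed[target] += mass * p   # mass that stays absorbed
                else:
                    step[target] = step.get(target, 0.0) + mass * p
        transient = step
    return absorbed, sum(transient.values())


# Toy chain (not from the paper): two transient and two absorbing states.
toy = {
    "t0": {"t1": 0.5, "ok": 0.5},
    "t1": {"t0": 0.5, "FS": 0.5},
}
probs, residual = absorption_by_iteration(toy, "t0", {"ok", "FS"})
```

For this toy chain the exact absorption probabilities are 2/3 for "ok" and 1/3 for "FS", which the iteration approaches from below. Only the non-zero transitions are stored, so the memory footprint grows with the number of arcs rather than with the square of the number of states.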
Therefore, the values of the elements of the vector that represent the absorbing states increase, and the values that represent transient states decrease. The computation continues until the desired accuracy is reached. The sum of the values of the transient elements of the distribution vector is considered to be the current accuracy. This method gave a satisfactory result. It has been applied with different values of the accuracy parameter. Even with an accuracy as high as $10^{-14}$, the computation time was less than 30 s.

The exact results were also obtained using a state space reduction technique, described in [49], and the further application of the direct computation method. The state space reduction from 11,887 states down to 7000 states took about 20 min. The computation time of the reduced DTMC using the direct method was around 15 min.

Generally, a DTMC computation time $t_c$ can be represented as a polynomial function of the number of states $N_S$ that, in turn, depends on the number of elements of an error propagation model $N_E$. It is possible to evaluate an upper bound of $N_S$:

$N_S \le (N_E + 2) \cdot 2^{3 N_E}$

It shows that $N_S$ can grow exponentially with the growth of $N_E$. According to this, the described approach is only applicable for systems with a small number of elements.

Fig. 11. Control flow graph and data flow graph of the reference system with fault activation, error propagation, and error detection probabilities.

In fact, the number of states is much less than the defined upper bound. The parameter $N_S$ strongly depends on the properties of the error propagation model: the structures of the control flow and data flow graphs and the properties of its elements.
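The upper bound $N_S \le (N_E + 2) \cdot 2^{3 N_E}$ follows from the node definition in Section 5.1: $e^{next}$ can take $N_E + 2$ values (any element, 'none', or 'FS'), and each of the three subsets $E^{FA}$, $E^{EP}$, and $E^{ED}$ can be chosen in $2^{N_E}$ ways. The short sketch below evaluates the bound. The function name is hypothetical, and $N_E = 7$ is an assumption based on the seven activities named in this section ("Init", "RcgPic", "CalcAngle", "CorrStep", "RegStep", "NxtGo", "GuiOut").

```python
def epg_state_upper_bound(n_elements: int) -> int:
    """(N_E + 2) choices for e_next times 2**N_E choices for each of
    the three subsets E_FA, E_EP, and E_ED."""
    return (n_elements + 2) * 2 ** (3 * n_elements)


# Assumed element count of the case study (see the lead-in above).
bound_for_case_study = epg_state_upper_bound(7)   # 9 * 2**21 = 18,874,368
generated_states = 11_887                         # reported for the generated DTMC
fraction_reached = generated_states / bound_for_case_study
```

Under this assumption the generated chain occupies well below one percent of the theoretical state space, which is in line with the observation that the actual number of states is much smaller than the bound.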
For instance, the lack of a data flow arc between two elements results in a lack of several states of the DTMC because error propagation between these elements is impossible. The same holds for the control flow structure. If an element $e_j$ cannot be executed after an element $e_i$, then error propagation from $e_i$ to $e_j$ is impossible as well. In real-world systems, the control flow and data flow graphs are rather sparse, which reduces $N_S$. The element level properties of the EPM also affect $N_S$. Assume that the fault activation probability of some element equals zero. In this case, the DTMC has no states that contain the corresponding element in $E^{FA}$. A similar situation occurs if an error detection probability equals zero (or one). As a general rule, the more element execution scenarios are possible, the more states the corresponding Markov chain has.

Although the real value of $N_S$ is lower than the upper bound estimation, it will still grow exponentially. For instance, even if only fault activation is considered, without further error propagation and error detection, the DTMC will contain at least $2^{N_E}$ states. Chapter 7 of [47] describes several methods that can help in coping with this problem. One of them is a low probability limitation technique. This technique has been evaluated using the constructed EPM. The idea of the low probability limitation is to stop the process of DTMC generation as soon as an appropriate level of accuracy has been reached. Here, the accuracy depends on the number of already generated final states and their probabilities of absorption. The accuracy is computed as $1 - P_{known}$, where $P_{known}$ is the sum of the probabilities of absorption of the already generated final states. Examination of the DTMC generation process has shown that there is no strict dependency between the number of final states and the total number of generated states.
However, in our case 1734 of 1942 final states have already been created in the middle of the generation process. So it seems reasonable to stop the process after the first 30 s and analyze the generated part of the DTMC. It is obvious that the absorbing probabilities of the final states are not equal. However we observed that the final states with high probabilities of absorption were generated early. We already had an accuracy around 0.15 with the part of the DTMC that consists of only 2000 states. This experiment proves that application of the low probability limitation technique is reasonable for a computation of complex error propagation models. After that a number of experiments with the real system have been carried out. The faulty and original versions of the system are run parallel for statistical evaluation of system behavior. Both versions of the system work according to the same control flow. Control flow decisions for the original system have been provided by the faulty one. In contrast to the control flow, two separate data flows are kept: error-free data flow of the original system and data flow of the faulty system that can contain errors. The physical movement of the robot is controlled by the faulty system. The original system uses a software-implemented model of robot kinematics. The fault activations and propagation of errors are detected in a similar way as during the element level analysis. During the system operation, an original version and a faulty version of each element are executed using the erroneous data flow in order to detect fault activation. Likewise, a faulty version of an element is executed using the error-free and the erroneous data flows in order to detect error propagation through the element. The described system has been run 446 times. During these runs, the system has demonstrated 52 different execution scenarios. The first three results according to the statistical estimation of their probabilities are shown in Fig. 12. 
Twelve of the most frequent execution scenarios have been selected for evaluation. Statistical estimation of the other scenarios is considered unreliable because of the small number of their occurrences. The computational error of this estimation is less than 1% for the standard 95% confidence level.

Fig. 12. Three of the most frequent execution scenarios of the reference system.

Table 2
Comparison of the experimental results and the prediction of the EPM.

N | FA | EP | ED | Exp. | Mod. | Dif.
1 | None | n.a. | n.a. | 0.3296 | 0.3127 | 0.0167
2 | CorrStep | RcgPic, CorrStep, NxtGo | FS | 0.1861 | 0.1545 | 0.0316
3 | CalcAngle | RcgPic, CalcAngle, CorrStep, NxtGo | FS | 0.1099 | 0.1067 | 0.0032
4 | CalcAngle | CalcAngle | n.a. | 0.0381 | 0.0288 | 0.0093
5 | RegStep | RcgPic, RegStep, NxtGo | FS | 0.0381 | 0.0453 | 0.0072
6 | RcgPic | RcgPic | FS | 0.0314 | 0.0492 | 0.0179
7 | NxtGo | RcgPic, NxtGo | n.a. | 0.0247 | 0.0502 | 0.0255
8 | CorrStep | RcgPic, CorrStep, NxtGo | n.a. | 0.0247 | 0.0226 | 0.0021
9 | CalcAngle, CorrStep | RcgPic, CalcAngle, CorrStep, NxtGo | FS | 0.0247 | 0.0254 | 0.0007
10 | CalcAngle, RegStep | RcgPic, CalcAngle, RegStep, NxtGo | FS | 0.0179 | 0.0060 | 0.0119
11 | NxtGo | RcgPic, NxtGo | FS | 0.0179 | 0.0502 | 0.0322
12 | Init | Init, RcgPic | FS | 0.0135 | 0.0247 | 0.0113

FA – Fault activation, EP – Error propagation, ED – Error detection. Exp. – The probabilities obtained by the statistical experiments. Mod. – The probabilities predicted by the error propagation model. Dif. – Difference.

Table 2 shows twelve execution scenarios of the selected system. The model prediction and the experimental results are quite similar. The average difference is about 0.0142, and the maximum difference is 0.0322. However, only the first three execution scenarios have a probability greater than 0.1.
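The summary figures quoted for Table 2 can be rechecked directly from its columns. The sketch below is only an illustration with hypothetical variable names; the numbers are copied from the Exp. and Dif. columns of Table 2. Because the tabulated values are rounded to four digits, the recomputed mean may deviate slightly from the quoted 0.0142.

```python
# Values copied from Table 2 (scenarios 1 to 12).
exp = [0.3296, 0.1861, 0.1099, 0.0381, 0.0381, 0.0314,
       0.0247, 0.0247, 0.0247, 0.0179, 0.0179, 0.0135]
dif = [0.0167, 0.0316, 0.0032, 0.0093, 0.0072, 0.0179,
       0.0255, 0.0021, 0.0007, 0.0119, 0.0322, 0.0113]

mean_dif = sum(dif) / len(dif)                  # average absolute difference
max_dif = max(dif)                              # largest absolute difference
above_01 = sum(1 for p in exp if p > 0.1)       # scenarios with Exp. > 0.1

# Relative difference per scenario, which grows for the rare scenarios.
relative_dif = [d / p for d, p in zip(dif, exp)]
```

The relative differences make the caveat about rare scenarios concrete: an absolute difference of a few hundredths is small for the most frequent scenario, but it can exceed the experimental probability itself for scenarios observed only a handful of times.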
The probabilities of the other scenarios are lower, and the relative difference between the experimental and model results becomes more and more significant. Nevertheless, these results prove that the model can accurately predict the most probable scenarios. It also gives a preliminary estimation of the probabilities of the other execution scenarios. At least it can distinguish between realistic and practically impossible execution scenarios. For example, the sorted list of predicted execution scenarios has been crosschecked with the experiments. The first predicted scenario that was not met among the experiments has a predicted probability of only 0.0074. It means that the likelihood that this execution scenario would be absent from all of our 446 experiments is only about 3.5%. All other scenarios with probabilities greater than 0.0074 have been met at least once.

7. Conclusion

Error propagation analysis is an important part of a variety of system analysis tasks. System design, reliability and safety analysis, testing, diagnostics, and many other activities that are aimed at the development of a dependable system require a deep understanding of its behavior in an erroneous condition. The specifics of the mechatronic domain necessitate the development of an appropriate mathematical model. This model must operate with abstract entities to represent various properties of the heterogeneous components of mechatronic systems and provide the methods for probabilistic error propagation analysis. A suitable model was presented in this article. The following primary research results have been achieved: The mechatronic-oriented error propagation model. The overview of the existing approaches to error propagation analysis (see Section 2) has revealed that there are several error propagation models that can be theoretically adapted to the analysis of mechatronic systems. However, there is no complete mathematical framework that enables an analysis of the heterogeneous mechatronic components.
Development of the error propagation model oriented to the mechatronic domain has been the first achievement.

The new concept of error propagation analysis. The most appropriate error propagation model has been selected among the available models as the starting point of this research. Despite its strong probabilistic background, this model has a significant disadvantage: it implies that data errors always propagate through the control flow. This assumption makes it inapplicable to systems in which components can be triggered in arbitrary order with non-sequential data flow. This article shows that the system control and data flows must be considered separately for an accurate description of an error propagation process. This has motivated the development of a new approach to error propagation analysis based on synchronous examination of two directed graphs: a control flow graph and a data flow graph (see Section 3). No similar approaches have been found among the existing error propagation models or in domains related to system dependability. The new dual-graph approach to error propagation analysis has been the second achievement.

Extensive mathematical framework. The new concept enables system behavior to be modeled in a more flexible and accurate manner than the models described in Section 2. Moreover, unlike the existing approaches, which only aim at computing error propagation probabilities between the system elements, the presented model considers the entire chain of events, starting with fault activation and ending with error detection or system failure. This enables a system execution process to be modeled using the probabilistic control flow graph, the activation of multiple faults during the system execution, the propagation of the resulting errors through the data flow structure, and the detection of these errors by protection mechanisms, taking into account different types of further system behavior, e.g. fail-stop or error correction.
This extensive definition of the error propagation processes makes the model applicable to different system analysis tasks. However, during model development significant effort was made to preserve a balance between the breadth of application and model complexity. The development of the new mathematical framework of the dual-graph error propagation model has been the third achievement.

Comprehensive approach to error propagation analysis. The comprehensive approach to error propagation analysis has been described in Section 5. The main idea of this approach is the application of a discrete time Markov chain to model system execution in terms of fault activation, error propagation, and error detection. This Markov chain can be generated automatically from the data contained in the error propagation model and makes it possible to obtain the probabilities of different erroneous and error-free scenarios of system operation. The introduced approach supports customization and can be extended or specialized depending on the defined system analysis task. The development of this approach to error propagation analysis has been the fourth achievement.

Although this research is oriented more toward theoretical aspects, it has been verified using a typical mechatronic system (see Section 6). The probabilities of possible execution scenarios, predicted by the introduced error propagation model, have been compared to the experimental evaluation. The obtained numerical results lead to the conclusion that the presented approach is accurate and applicable to the error propagation analysis of mechatronic systems.
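The Markov chain step summarized above can be illustrated with a toy absorbing discrete time Markov chain. This is a minimal sketch, not the model from Section 5: the two transient states, the three absorbing outcomes, and all transition probabilities are hypothetical, and the helper name `absorption_probabilities` is ours. It relies on the standard absorbing-chain relation (I − Q)B = R, where Q holds the transitions among transient states, R holds the transitions from transient to absorbing states, and B holds the probabilities of ending in each absorbing state.

```python
def absorption_probabilities(Q, R):
    """Solve (I - Q) B = R by Gauss-Jordan elimination with partial pivoting.

    Q[i][j]: probability of moving from transient state i to transient state j.
    R[i][k]: probability of moving from transient state i to absorbing state k.
    Returns B, where B[i][k] is the probability of ending in absorbing state k
    when the chain starts in transient state i.
    """
    n = len(Q)
    # Build the augmented matrix [I - Q | R].
    aug = [
        [(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)] + list(R[i])
        for i in range(n)
    ]
    for col in range(n):
        # Pick the row with the largest pivot to keep the elimination stable.
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical toy system (all names and numbers are illustrative assumptions).
transient = ["executing, no error", "executing, error present"]
absorbing = ["finished error-free",
             "fail-stop (error detected)",
             "finished with undetected error"]

Q = [
    [0.5, 0.1],  # keep running cleanly / a fault is activated
    [0.0, 0.3],  # the error stays undetected while execution continues
]
R = [
    [0.4, 0.0, 0.0],  # a clean run can only finish error-free
    [0.0, 0.5, 0.2],  # an erroneous run is detected or finishes undetected
]

B = absorption_probabilities(Q, R)
for name, prob in zip(absorbing, B[0]):
    print(f"{name}: {prob:.4f}")
```

Starting from the error-free state, each entry of `B[0]` is the probability of one end-of-run outcome, which is the kind of quantity that Section 6 compares against experimental frequencies.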
Acknowledgements

This work has been supported by the Erasmus Mundus External Co-operation Window Programme of the European Union.

Please cite this article in press as: Morozov A, Janschek K. Probabilistic error propagation model for mechatronic systems. Mechatronics (2014), http://dx.doi.org/10.1016/j.mechatronics.2014.09.005
