Mechatronics xxx (2014) xxx–xxx
Contents lists available at ScienceDirect
Mechatronics
journal homepage: www.elsevier.com/locate/mechatronics
Probabilistic error propagation model for mechatronic systems
Andrey Morozov ⇑, Klaus Janschek
Institute of Automation, Technische Universität Dresden, 01062 Dresden, Germany
Article info
Article history:
Received 5 November 2013
Accepted 15 September 2014
Available online xxxx
Keywords:
Control flow graph
Data flow graph
Error propagation analysis
Discrete time Markov chain
Dependability
UML
Abstract
This paper addresses a probabilistic approach to error propagation analysis of mechatronic systems. Systems of this type require highly abstract models that properly capture the mutual interaction of heterogeneous system components such as software, hardware, and physical parts. A literature overview reveals a number of appropriate error propagation models that are based on a Markovian representation of control flow. However, these models imply that data errors always propagate through the control flow. This assumption makes them unsuitable for systems in which components can be triggered in arbitrary order with non-sequential data flow. A motivational example, discussed in this paper, shows that control and data flows must be considered separately for an accurate description of an error propagation process.
For this reason, we introduce a new concept of error propagation analysis. The central idea is a synchronous examination of two directed graphs: a control flow graph and a data flow graph. The structures of these graphs can be derived systematically during system development. Knowledge of the operational profile and of the properties of individual system components allows the definition of additional parameters of the error propagation model. A discrete time Markov chain is applied to model fault activation, error propagation, and error detection during operation of the system. A state graph of this Markov chain can be generated automatically using the discussed dual-graph representation. A specific approach to the computation of this Markov chain makes it possible to obtain the probabilities of erroneous and error-free system execution scenarios.
This information plays a valuable role in the development of dependable systems. For instance, it can help to define an effective testing strategy, to perform accurate reliability estimation, and to speed up error detection and fault localization processes. This paper contains a comprehensive description of the mathematical framework of the new dual-graph error propagation model and of a Markov-based method for error propagation analysis.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
The research results presented in this article belong to a rather young scientific domain: system dependability. For this reason, different terms can describe similar entities in the various papers devoted to error propagation analysis. In this article, the term “error” is used in a general context that fits the engineering domain. This paper adheres to the definition proposed by Laprie [1]. A brief overview of the dependability research domain helps to distinguish the term “error” from other similar terms.
⇑ Corresponding author. Tel.: +49 351 46332202; fax: +49 351 46337039.
E-mail address: [email protected] (A. Morozov).
Dependability is the ability of a system to deliver a service that can be justifiably trusted. The service delivered by a system is its behavior as it is perceived by its user. Laprie describes dependability from three points of view: the attributes of dependability, the means by which dependability is attained, and the threats to dependability. We focus on the threats:
Fault: a defect in the system that can be activated and cause an error.
Error: an incorrect internal state of the system, or a discrepancy between the intended behavior of a system and its actual behavior.
Failure: an instance in time when the system displays behavior that is contrary to its specification.
Activation of a fault leads to the occurrence of an error. The
invalid internal system state, generated by an error, may lead to
another error or to a failure. Failures are defined according to the
system boundary. If an error propagates outside the system, a
failure is said to occur.
http://dx.doi.org/10.1016/j.mechatronics.2014.09.005
0957-4158/© 2014 Elsevier Ltd. All rights reserved.
Please cite this article in press as: Morozov A, Janschek K. Probabilistic error propagation model for mechatronic systems. Mechatronics (2014), http://
dx.doi.org/10.1016/j.mechatronics.2014.09.005
Analysis of fault activation, error propagation, and error (or failure) detection is defined in this article as error propagation analysis. The results of this analysis are extremely helpful in a wide range of analytical tasks associated with the development of dependable systems. Error propagation analysis gives sound support for reliability evaluation, because error propagation has a significant influence on the system behavior in critical situations. It is also a necessary activity in safety system design. It helps to estimate the likelihood of error propagation to hazardous parts of the system and to identify parts of the system that should be protected with error detection or error recovery mechanisms more strongly than the others. Another possible application area is system testing and debugging. An accurate error propagation analysis assists in selecting an appropriate testing strategy. It helps to identify the most critical parts of the system (from either a reliability or a safety point of view) and to generate a set of test cases that will stimulate fault activation in these particular parts and allow the detection of the resulting errors. Probabilistic error propagation analysis can also be used for system diagnostics. When an error is detected in observable system outputs, it helps to trace the error propagation path back to the error source. This speeds up error localization, system testing, and debugging.
In real systems, fault activation and further error propagation are very complex processes. An accurate error propagation analysis therefore needs a strong mathematical framework. The specifics of the mechatronic domain bring additional complexity. Mechatronic systems incorporate an assembly of heterogeneous components (mechanical, electrical, computer, and information technology) with various mutual interactions. The goal of mechatronic system design is to ensure a proper and coordinated operation of these elements within a feedback structure under all possible operational conditions. According to Janschek [2], one of the big challenges of mechatronics is the use of appropriate models, which describe this mutual interaction on a common abstract layer. Error propagation analysis, as an essential part of mechatronic system design, also requires a specific model. This model must be able to operate with abstract entities in order to represent various properties of the heterogeneous mechatronic components. A suitable error propagation model is presented in detail in this article. Some of the basic ideas can also be found in our previous publications [3–5].
2. State of the art
Most safety-critical mechatronic systems consist of a mix of software and hardware elements. Typically, error propagation analysis of hardware is based on one of the classical reliability evaluation techniques: failure mode and effects analysis (FMEA), hazard and operability studies (HAZOP), fault tree analysis (FTA), event trees (ET), etc. Generally, the process of failure analysis consists of several activities: identifying failures of individual components, modeling the failure logic of the entire system, analyzing the effect of a failure on other components, and determining and engineering the mitigation of potential hazards. With the emergence of component-based development approaches, researchers began exploring component-oriented safety analysis techniques, mainly focusing on creating encapsulated error propagation models. These failure propagation models describe how failure modes of incoming messages, together with internal component faults, propagate to failure modes of outgoing messages [6–10].
In the software engineering domain, the majority of classical error propagation approaches are based on fault injection or error injection techniques, combined with subsequent statistical evaluation. Three of them were introduced by Voas [11–13]. An empirical study of the propagation of data-state errors was presented in [14]. Candea et al. [15] present a technique for automatically capturing dynamic fault propagation information. The authors use instrumented middleware to discover potential failure points in the application. Khoshgoftaar et al. [16] describe the identification of software modules that do not propagate errors induced by a suite of test cases. A number of papers describe the influence of software error propagation phenomena on system reliability [17,18].
Four error propagation models that can be considered as the
best candidates for the analysis of mechatronic systems are listed
in this section. Unlike the models discussed above, these four
candidates are abstract enough to cope with the heterogeneity of
components of mechatronic systems and have a strong mathematical
foundation. The key properties of these models are compared in
Table 1.
Abdelmoez’s model [19–21] is a design-level model for error propagation analysis of COTS systems that was also extended for reliability evaluation in [22]. This model uses information about system states and messages in order to compute the probability of error propagation between system components. The advantage of this model is that it can be applied in the early phases of system development. However, it requires a very detailed and specific UML description, which must also be very accurate to yield trustworthy results.
Table 1
Comparison of the suitable models for the error propagation analysis of mechatronic systems.
Abdelmoez et al., 2002/2004/2005
  Application areas: COTS
  Required data: State and sequence UML diagrams
  Main idea: An early estimate of the error propagation probabilities between system components in terms of states and messages
  Purpose: General use and reliability assessment
  Deficiencies: Not abstract enough. Requires very specific and detailed system models
Hiller et al., 2001/2005/2007
  Application areas: Modular software for embedded systems
  Required data: Source code and reliability measurements
  Main idea: The concept of error permeability through a system module
  Purpose: Placement of EDM and ERM. Reduction of error propagation by design
  Deficiencies: More oriented to module level rather than system level analysis. Applicable only for a software part of the system
Mohamed et al., 2008/2010
  Application areas: COTS
  Required data: Component UML diagrams, estimated fault activation and error propagation probabilities
  Main idea: Error propagation through an architectural service route
  Purpose: Reliability assessment
  Deficiencies: Not comprehensive enough. Can be considered an offshoot of Cortellessa’s model
Cortellessa et al., 2006/2007
  Application areas: COTS and SOA
  Required data: Fault activation, error propagation, and control flow transition probabilities
  Main idea: Probabilistic error propagation analysis using Markovian representation of control flow
  Purpose: Placement of error detection and error recovery mechanisms. Identification of critical components. Development of cost-effective testing strategies
  Deficiencies: Does not distinguish between control and data flows
Hiller et al. [23–26] introduce the concept of error permeability through software modules, together with an error propagation model based on this concept. This model is defined for modular software of embedded systems and can be used for dependable system design. Hiller’s model seems more suitable for real-world application than Abdelmoez’s model because it operates at the source-code level. The detailed case studies and the software tool PROPANE support this assessment. However, the discussed concept can only be applied to the software part of a mechatronic system, because the theoretical background of this model is not comprehensive enough.
Mohamed and Zulkernine [27] present another approach to error propagation analysis and its application to system reliability assessment, based on the definition of architectural service routes. In spite of several deviations, Mohamed’s model can be considered an offshoot of Cortellessa’s model [28,29]. Cortellessa’s model is based on the Markov representation of system control flow. It was originally developed for Commercial Off-The-Shelf (COTS) systems and later extended for Service-Oriented Architecture (SOA) systems. This model has the strongest mathematical background of the error propagation models discussed in this section. The authors demonstrate its applicability for the smart placement of error detection and error recovery mechanisms, the planning of cost-effective testing strategies, and system reliability evaluation.
3. General concept and motivation
After the literature overview, the general idea of Cortellessa’s model was selected as the starting point of this research project. Nevertheless, this model has a significant disadvantage. The authors assume that “data errors always propagate through control flow” [28]. This means that Cortellessa’s model does not distinguish between control flow and data flow. It works correctly for systems with a straightforward design. However, in real-world systems, sequential execution of several components does not imply that each component uses the data produced by the previous component. The authors do not take into account systems with components that can be triggered in arbitrary order with non-sequential data flow. The presented work aims to eliminate this drawback.
A small counter-example of a collision avoidance system,
shown in Fig. 1, demonstrates a particular case of inapplicability
of Cortellessa’s model. It also shows that accurate system level
error propagation analysis requires simultaneous examination of
both control flow and data flow structures.
Fig. 1(a) describes a mobile robot, equipped with collision
avoidance software. A control unit installed in the robot receives
data from sensors and controls the movement of the robot. The
robot has two speed sensors embedded in two wheels. This robot
is also equipped with two ultrasonic sensors that can measure
the distance to obstacles: one sensor on the front and another
one on the back. A part of the collision avoidance software is represented on the right side of Fig. 1. It is a simple algorithm that computes two parameters: the speed of the robot and the time to collision. This algorithm starts by reading the information from the sensors into the variables fd, bd, ls, and rs. The variables ls and rs represent the speeds of the left and right wheels of the robot, while fd and bd hold the current distance to an obstacle measured by the front and back ultrasonic sensors, respectively. Once this is done, the algorithm computes the overall speed of the robot and saves it to a variable s. A positive value of s means that the robot moves forward, and the time to collision t is computed using information from the front sensor, fd. Otherwise, it is assumed that the robot moves backward, and the variable bd is used for the computation of t.
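The algorithm described above can be sketched in a few lines of Python. This is only an illustration: the text does not state how s and t are calculated, so the mean of the wheel speeds and the ratio distance/speed are assumptions, as is the function name `collision_avoidance_step`.

```python
def collision_avoidance_step(fd, bd, ls, rs):
    """One pass of the algorithm: returns the outputs (s, t).

    fd, bd: distances measured by the front and back ultrasonic sensors.
    ls, rs: speeds of the left and right wheels.
    """
    # Assumption: the overall speed is the mean of the two wheel speeds.
    s = (ls + rs) / 2.0
    if s > 0:
        # Moving forward: only the front distance fd influences t.
        t = fd / s
    elif s < 0:
        # Moving backward: only the back distance bd influences t.
        t = bd / (-s)
    else:
        # Assumption: a stationary robot never reaches the obstacle.
        t = float("inf")
    return s, t


# An erroneous bd is masked while the robot moves forward ...
forward_ok = collision_avoidance_step(fd=2.0, bd=1.0, ls=1.0, rs=1.0)
forward_bad_bd = collision_avoidance_step(fd=2.0, bd=999.0, ls=1.0, rs=1.0)
# ... but reaches the output t when the robot moves backward.
backward_ok = collision_avoidance_step(fd=2.0, bd=1.0, ls=-1.0, rs=-1.0)
backward_bad_bd = collision_avoidance_step(fd=2.0, bd=999.0, ls=-1.0, rs=-1.0)
```

The branch on the sign of s is the control flow decision that determines which of the two distance inputs has a data dependence with t in a given run.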
In the discussed example, the sensors represent system inputs and the variables t and s represent system outputs. This gives rise to two primary questions that concern the error propagation process through this system:
1. What is the likelihood of observing an error in the output, given an error in the input?
2. Where is the source of the error that is observed in the output?
Fig. 1. A part of a collision avoidance system of a mobile robot and three possible scenarios of error propagation.
Fig. 1(b)–(d) show three situations that should be discussed in
order to answer the first question. Assume that a fault is activated
in the left sensor (Fig. 1(b)). In this case, an occurred error will
affect the variable ls. The variable ls is used in the computation
of s, which is then used in the computation of t. Hence, the error
will propagate through the system to both outputs t and s. In this
situation, t and s have a data dependence on the variable ls.
In case of fault activation in the back sensor (Fig. 1(c) and (d)),
the variable bd contains an erroneous value. This error can only
propagate to the output t, because computation of s does not
depend on bd. However, in this case, error propagation depends
on control flow. If the robot is moving backward, then the error will
actually affect the output t (Fig. 1(d)). Otherwise, if the robot is
moving forward, the variable fd will be used instead of bd, and
the error will be masked (Fig. 1(c)).
Assume that the operational profile of this mobile robot is known, e.g. from statistical experiments: it moves backward only 1/10 of the time during typical use. This assumption gives an answer to the first question. An error from the left or right sensor will always propagate to both outputs. An error from the front sensor will propagate to t in 90% of the cases. An error from the back sensor will propagate to t in 10% of the cases.
The second question refers to an inverse problem. The answer to this question depends not only on the system behavior, but also on the likelihood of fault activation in each sensor. However, even without this information, it is clear that the observation of an error in both s and t means that the left sensor or the right sensor (or maybe both of them) produces erroneous data. If there is an error only in the output t and the probabilities of fault activation in each sensor are equal, then the error comes from the back sensor in 10% of the cases and from the front sensor in 90%.
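The inverse reasoning above is an application of Bayes' rule, which can be reproduced with the following minimal sketch. It assumes that exactly one sensor is faulty in a given run; the dictionary layout and the function name are illustrative and do not come from the article.

```python
# Operational profile from the text: the robot moves backward 1/10 of the time.
P_BACKWARD = 0.1
P_FORWARD = 1.0 - P_BACKWARD

# P(error observed in t only | error source), read off the three scenarios:
# wheel sensor errors always reach s as well, so they never show up in t alone.
P_T_ONLY_GIVEN_SOURCE = {
    "left": 0.0,
    "right": 0.0,
    "front": P_FORWARD,
    "back": P_BACKWARD,
}


def posterior_given_error_in_t_only(prior):
    """Bayes' rule: P(source | error in t only), one faulty sensor per run."""
    joint = {src: prior[src] * P_T_ONLY_GIVEN_SOURCE[src] for src in prior}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this prior")
    return {src: value / total for src, value in joint.items()}


# Equal fault activation probabilities for all four sensors.
uniform_prior = {src: 0.25 for src in P_T_ONLY_GIVEN_SOURCE}
posterior = posterior_given_error_in_t_only(uniform_prior)
```

With a non-uniform prior, for example a back sensor that is far less reliable than the front one, the same function shifts the posterior accordingly, which is why the answer to the second question depends on the fault activation likelihoods.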
This discussion demonstrates that, even in this primitive case, only a separate analysis of control and data flows describes the error propagation process accurately. Generally speaking, a data flow analysis shows only the possibility of error propagation from one part of a system to another. A probabilistic control flow analysis makes it possible to estimate the likelihood of this propagation. Simultaneous examination of the control and data flows is a key feature of the error propagation model introduced in this article.
The general structure of the approach presented in this article is shown in Fig. 2. Section 4 introduces a dual-graph error propagation model. This model can be generated using a base-line system representation. The dual-graph error propagation model is a new mathematical framework for probabilistic error propagation analysis that makes simultaneous examination of the control and data flow graphs of a system possible. Section 5 discusses an extensive approach to error propagation analysis. It uses an absorbing discrete time Markov chain to describe error propagation processes during system execution, together with a specific method for the computation of this Markov chain that yields the probabilities of different system execution scenarios in terms of fault activation, error propagation, and error detection. Section 6 demonstrates the applicability of the introduced concept using a typical mechatronic system.
4. Dual-graph error propagation model
4.1. Formal definition
A dual-graph error propagation model (EPM) is an abstract
mathematical framework for error propagation analysis (EPA) of
mechatronic systems. A behavioral aspect of a system under test
forms the basis of the introduced EPM. The system is considered to be a nonempty set of $N_E$ independent elements:
$$E := \{e_1, e_2, \ldots, e_{N_E}\}$$
Each element represents an executable part of the system. A
software function, which is executed during the operation, a
hardware sensor that performs measurements, a controller, or even
an actuator are all examples of these elements. Execution of the
element represents an operation of the corresponding part of the
system. An element $e_{init} \in E$ defines an initial element. It is assumed that a system can have only one initial element and that the operation starts with the execution of this element.
Two directed graph models are defined using this set of
elements: a control flow graph (CFG) and a data flow graph (DFG).
A control flow graph is a structural representation of system control
flow. It represents a possible order of execution of system
elements:
$$G^{CF} := \langle E, A^{CF} \rangle: \quad E = \{e_1, \ldots, e_{N_E}\}, \quad A^{CF} := \{(e_i, e_j, p^{CF}_{e_i,e_j}), \ldots, (e_k, e_l, p^{CF}_{e_k,e_l})\}$$
The elements $E$ play the role of nodes of the CFG. Arcs of the CFG represent control flow transitions between the elements. The set of arcs is denoted by $A^{CF}$. An arc $(e_i, e_j, p^{CF}_{e_i,e_j}) \in A^{CF}$ from an element $e_i$ to an element $e_j$ defines that $e_j$ can be executed immediately after the execution of $e_i$. In contrast to the initial element, final elements are defined as the elements without outgoing CFG arcs. These elements are called “final” because the system operation stops after their execution.
Fig. 2. A general structure of the presented approach to error propagation analysis.
A transition probability is defined as a property of each arc of the CFG. The transition probability through an arc from an element $e_i$ to an element $e_j$ is denoted by $p^{CF}_{e_i,e_j}$. All values of the transition probabilities lie within the interval $[0, 1]$, and the sum of the transition probabilities of all outgoing arcs of each non-final CFG node equals 1. In the visual representation of a CFG, transition probabilities are denoted by numbers near the arcs. Fig. 3(a) shows an example of a simple CFG.
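The constraints on a CFG stated above lend themselves to a simple consistency check. The following sketch stores the arcs as a dictionary from (source, target) pairs to transition probabilities; this data layout, the function name `validate_cfg`, and the three-element example are assumptions made for illustration and are not the graph of Fig. 3(a).

```python
import math


def validate_cfg(elements, arcs, initial):
    """Check the CFG constraints and return the sorted final elements.

    elements: set of element names (the nodes E).
    arcs: dict mapping (source, target) -> transition probability.
    initial: name of the single initial element.
    """
    if initial not in elements:
        raise ValueError("initial element is not part of the element set")
    outgoing_sum = {}
    for (src, dst), p in arcs.items():
        if src not in elements or dst not in elements:
            raise ValueError(f"arc ({src}, {dst}) uses an unknown element")
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability of arc ({src}, {dst}) is not in [0, 1]")
        outgoing_sum[src] = outgoing_sum.get(src, 0.0) + p
    # Outgoing probabilities of every node that has outgoing arcs must sum to 1.
    for src, total in outgoing_sum.items():
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"outgoing probabilities of {src} sum to {total}")
    # Final elements are the elements without outgoing CFG arcs.
    return sorted(e for e in elements if e not in outgoing_sum)


# Hypothetical three-element system.
example_elements = {"e1", "e2", "e3"}
example_arcs = {("e1", "e2"): 0.7, ("e1", "e3"): 0.3, ("e2", "e3"): 1.0}
final_elements = validate_cfg(example_elements, example_arcs, initial="e1")
```

Running such a check before any further analysis catches incomplete operational profiles early, for instance a branch whose probabilities were estimated independently and do not add up to 1.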
A data flow graph is a structural representation of a data flow of
a system:
$$G^{DF} := \langle E, A^{DF} \rangle: \quad E = \{e_1, \ldots, e_{N_E}\}, \quad A^{DF} := \{(e_i, e_j), \ldots, (e_k, e_l)\}$$
The DFG of the system contains the same set of nodes $E$ as the CFG. Arcs of the DFG show the possibility of data transfer between the elements. For example, an arc $(e_i, e_j) \in A^{DF}$ from an element $e_i$ to an element $e_j$ denotes that the output of $e_i$ will be used as input in $e_j$. Fig. 3(b) demonstrates a small example of a DFG.
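Since DFG arcs express only the possibility of data transfer, a plain reachability search over the DFG gives the set of elements that an error could reach at all. The sketch below uses a breadth-first search; the element names model the robot example of Section 3 and are hypothetical, as is the function name `possibly_affected`.

```python
from collections import deque


def possibly_affected(dfg_arcs, source):
    """Elements that an error originating in `source` could reach via data flow."""
    successors = {}
    for src, dst in dfg_arcs:
        successors.setdefault(src, []).append(dst)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical DFG of the collision avoidance example.
robot_dfg = [
    ("left_sensor", "compute_s"),
    ("right_sensor", "compute_s"),
    ("compute_s", "compute_t"),
    ("front_sensor", "compute_t"),
    ("back_sensor", "compute_t"),
]
from_left = possibly_affected(robot_dfg, "left_sensor")
from_back = possibly_affected(robot_dfg, "back_sensor")
```

The result says nothing about how likely the propagation is. That requires the control flow probabilities, which is the reason the two graphs are analyzed together.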
Faults can be activated in the elements during their execution
and result in the occurrence of errors. Error propagation through
the system comprises two aspects: error propagation between
the elements and error propagation through the elements. The
error propagation between the elements is determined by the
DFG structure. The error propagation through the elements
depends on the properties of a particular element. Each element
is defined using four parameters: fault activation probability
(FAP), error propagation probability (EPP), error detection probability (EDP), and error detection behavior (EDB). The two graphs
and these parameters of all of the elements describe an operational
profile of the system. Superposition of control flow sequences on
the data flow graph shows error propagation during system execution. Simultaneous probabilistic analysis of the CFG and the DFG
forms a backbone of application of the introduced error propagation model.
From a data transfer perspective, the execution of an element is the transformation of input data to output data (see Fig. 4(a)). Each element $e_i$ of the EPM contains $N^{in}_{e_i}$ inputs and $N^{out}_{e_i}$ outputs:
$$I_{e_i} := \{i_1, i_2, \ldots, i_{N^{in}_{e_i}}\}, \quad O_{e_i} := \{o_1, o_2, \ldots, o_{N^{out}_{e_i}}\}$$
A typical example of an element is a software function that takes a set of input parameters and transforms them into a set of output parameters. Faults can be activated during the execution of an element, and an error occurs as a result of the fault activation. The defined EPM implies that such an error propagates to all outputs of the element (see Fig. 4(b)). In other words, each time the element $e_i$ is executed, all of its outputs become erroneous with a given fault activation probability $p^{FA}_{e_i}$. It is also assumed that fault activations in different elements during the operation of the system are independent events. The probability of fault-free execution of $e_i$ is denoted by $\bar{p}^{FA}_{e_i}$ and equals $(1 - p^{FA}_{e_i})$.
If there is an error in at least one of the inputs of an element $e_i$, the error propagates through this element to all data outputs with a given error propagation probability $p^{EP}_{e_i}$ (see Fig. 4(c)). The probability of error masking in the element $e_i$ is defined as $\bar{p}^{EP}_{e_i} = 1 - p^{EP}_{e_i}$. Error propagation through the element and fault activation in the element are considered to be independent events.
Some system elements can be equipped with error detection mechanisms (EDM). An error in the input of an element $e_i$ can be detected with the given error detection probability, denoted by $p^{ED}_{e_i}$ (see Fig. 4(d)). It is assumed that only errors in inputs are detectable. An error that occurs because of fault activation within the element can be detected only by elements that are executed later. In other words, an EDM checks the inputs at the beginning of the execution of the element, before possible fault activation. $\bar{p}^{ED}_{e_i} = (1 - p^{ED}_{e_i})$ defines the probability that the error is not detected by the EDM. The parameter $b_{e_i}$ defines the error detection behavior (EDB) of the element $e_i$. This parameter takes one of the following three values: fail-stop (FS), error message (EM), or error correction (EC).
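The four element level parameters can be collected in a small data structure, sketched below. The class and method names are illustrative. The method `p_output_error` is our reading of the stated independence of fault activation and error propagation, not a formula given in the text: with an erroneous input and no detection, the outputs stay correct only if the error is masked and no fault is activated.

```python
from dataclasses import dataclass
from enum import Enum


class EDB(Enum):
    """Error detection behavior of an element."""
    FAIL_STOP = "FS"
    ERROR_MESSAGE = "EM"
    ERROR_CORRECTION = "EC"


@dataclass(frozen=True)
class Element:
    name: str
    p_fa: float  # fault activation probability (FAP)
    p_ep: float  # error propagation probability (EPP)
    p_ed: float  # error detection probability (EDP)
    edb: EDB     # error detection behavior (EDB)

    def __post_init__(self):
        for label, p in (("p_fa", self.p_fa), ("p_ep", self.p_ep), ("p_ed", self.p_ed)):
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"{label} must lie in [0, 1], got {p}")

    def p_fault_free(self):
        return 1.0 - self.p_fa

    def p_masking(self):
        return 1.0 - self.p_ep

    def p_undetected(self):
        return 1.0 - self.p_ed

    def p_output_error(self, input_erroneous):
        """P(outputs erroneous), detection ignored (assumed combination rule)."""
        if not input_erroneous:
            return self.p_fa
        return 1.0 - self.p_fault_free() * self.p_masking()


# Hypothetical element with made-up parameter values.
example_element = Element("filter", p_fa=0.1, p_ep=0.8, p_ed=0.5, edb=EDB.FAIL_STOP)
```

Keeping the complements as methods rather than stored fields avoids inconsistent parameter sets in which a probability and its complement do not sum to 1.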
4.2. Obtaining the required parameters
It is assumed that some kind of base-line system description is available. Different types of base-line system models can be transformed into the dual-graph error propagation model. In the early phases of the system development life cycle, UML diagrams (see [30]) can be used for the derivation of the system level parameters. UML is a widely accepted candidate for heterogeneous system modeling. Although it was originally designed for software systems, it is equally applicable to the modeling of heterogeneous systems such as mechatronic systems on a certain abstraction level, e.g. through the UML extension for systems engineering, SysML (see [31]). The core aspects (structure, behavior, and interaction) can be described with UML diagrams in a formal and transparent manner. UML activity diagrams are the most suitable for the presented approach because they define control and data flows very transparently. For this reason, this type of diagram was used in the case study presented in Section 6. However, UML activity diagrams cannot provide the control flow probabilities by themselves. UML stereotypes (see [32]) can be used to customize them: a trivial extension of the UML activity diagrams enables the information about control flow probabilities to be appended to the diagram.
However, at the design phase, it is rather difficult to define probabilistic control flow properly, because it strongly depends on the future operational profile. Therefore, the introduced EPM is applicable in the early phases only if the operational profile is known and the specifications of the elements of the future system are available for the definition of the element parameters. Other UML diagrams, such as state and sequence diagrams, can also be used. Besides UML, other types of state machine models, like Matlab SIMULINK state charts, are acceptable for obtaining system level parameters.
Fig. 3. A control flow graph, a data flow graph, and a list of the probabilistic parameters of elements of a dual-graph error propagation model.
Fig. 4. Element level properties: (a) data inputs and outputs, (b) fault activation, (c) error propagation, (d) error detection.
In later phases, the necessary parameters can be obtained more accurately using the available hardware specifications and software source code. However, applying the error propagation model to an already developed system requires methods for system decomposition. Practically every study devoted to architecture-level system analysis touches on the problem of decomposing the system into a set of executable elements. As previously mentioned, control flow exists mostly because of the software part of the system. Therefore, it is advisable to use the software part as a basis for the decomposition, extending the actions of the software elements with the hardware and physical activities, as shown in the case study (see Section 6).
In the software domain, different authors propose considering single instructions, basic blocks of code, functions, modules, classes, components, or services as the atomic elements of the system. The level of decomposition clearly depends on the trade-off between the number of elements, their complexity (size), and the available information about each element. This problem has been discussed in [33]. Too many small elements lead to a large state space of the mathematical model and difficulties in evaluation. On the other hand, too few elements make it difficult to distinguish how different elements contribute to the system properties. Several software-based approaches to system decomposition are presented in [34–36].
After the decomposition, the system level and element level
parameters should be defined. Existing methods for the definition
of the structures of control and data flow graphs are shown in
[37–43]. The following techniques are suggested for obtaining
the element level parameters. Fault activation probabilities in
software and hardware parts of the system can be estimated using
the numerous reliability models already discussed in Section 2.
The studies of computer hardware faults that also result in data
errors are discussed in [44–46]. These articles describe the
influence of environmental factors like increasing heat, lowering
voltage, radiation, etc. on the occurrence of so-called bit-flips in
computer hardware (memory, hard-drive, CPU) that result in
errors in computation and stored variables. Error propagation
probabilities for software parts can be evaluated using the
methods proposed in [24] in the section ‘‘Estimating Error
Permeability: An Experimental Approach’’. Error propagation
through the hardware parts can be evaluated using an analysis
of their specifications and statistical tests. For the physical parts, the creation of corresponding software-based models that describe their behavior is recommended. These models can be used for obtaining the error propagation probabilities at the software level. Error detection probabilities strongly depend on the error detection mechanisms that will be used. The best solution is a statistical evaluation and an individual structural analysis for each particular case.
5. Error propagation analysis
This section introduces a general and comprehensive approach
to error propagation analysis using the discussed dual-graph
error propagation model. The central idea of this approach is the
application of a discrete time Markov chain (DTMC) model.
Informally, a DTMC model can be represented by a so-called state
graph. The nodes of this graph describe a state space of the Markov
chain. The directed arcs are weighted by the probabilities of
transition between the states of the DTMC. We call the state graph of the
DTMC model that is used for the error propagation analysis
an error propagation graph (EPG). The nodes of this graph (or states
of the DTMC model) represent various states of the system in terms
of fault activation, error propagation, and error detection. The arcs
show probabilistic transitions between these states. The EPG can
be automatically generated using the data contained in the EPM.
This section describes the structure of the EPG and the application
of the mathematical apparatus of Markov chains for the error
propagation analysis.
5.1. Error propagation graph
An error propagation graph (EPG) is a state graph of the discrete time Markov chain. It is a weighted, directed graph:
$$G_{EP} := \langle S, A_{EP} \rangle$$
$$S := \{s_1, \ldots, s_{N_S}\}$$
$$A_{EP} := \{(s_i, s_j, p_{s_i,s_j}), \ldots, (s_k, s_l, p_{s_k,s_l})\}$$
$S$ is the set of nodes of the EPG. Each node $s_i \in S$ represents a system state. $A_{EP}$ is the set of arcs of the EPG. The weights assigned to these arcs define the probabilities of state changes.
Fig. 3 shows a dual-graph error propagation model of a reference system. This model allows for the construction of an EPG of
this system. Fig. 5 demonstrates a part of this EPG. The algorithm
for the EPG generation is described later in this section.
A node of an EPG, or a state of a corresponding DTMC model, describes the system state in between the executions of two elements. Exceptions are the initial and final states, which describe the state of the system before the execution of an initial element and after the executions of final elements, respectively. Each node $s_i \in S$ is characterized by four parameters:
$$s_i := \langle e^{next}_{s_i}, E^{FA}_{s_i}, E^{EP}_{s_i}, E^{ED}_{s_i} \rangle$$
The parameter $e^{next} \in (E \cup \{$‘none’, ‘FS’$\})$ defines an element from the set $E$ that will be executed next. This parameter describes the control flow aspect of the system. An expression $e^{next}_{s_i} = e_k$ shows
Fig. 5. A part of an EPG generated using the reference EPM.
that a system state $s_i$ refers to the moment in time when the execution of a particular element has already been completed, and the system is about to run the element $e_k$. In addition, $e^{next}$ can take two specific values: ‘none’ or ‘FS’. Both values show that the system has finished its execution. The states whose $e^{next}$ parameter equals ‘none’ or ‘FS’ are defined as final states of the system. The value ‘none’ describes a regular completion of system execution. The value ‘FS’ shows an unscheduled system stop (fail-stop) because of error detection in the last element executed.
The parameter $E^{FA} \subseteq E$ defines a subset of system elements where faults have already been activated. It means that elements can only be appended to $E^{FA}$ during the system operation. Also note that $E^{FA}$ is a set, not a sequence. Therefore, it does not show the order of fault activations or their number in the case of several fault activations in the same element.
The parameter $E^{EP} \subseteq E$ defines a subset of system elements to which errors have propagated at this moment. It describes the current faulty (or fault-free) condition of the system. In other words, $E^{EP}$ is the subset of elements whose outputs are currently erroneous. Unlike the subset $E^{FA}$, elements can be either appended to or removed from $E^{EP}$ during system execution.
The parameter $E^{ED} \subseteq E$ defines a subset of system elements where errors have been detected. It shows the error detection history, just as $E^{FA}$ shows the fault activation history.
As an illustration, Fig. 6 depicts a correspondence between a state of the system (shown in Fig. 3) and a node $s_D$ of the EPG (shown in Fig. 5). The left side of Fig. 6 describes the current state of the system using additional signs and color markers on the control and data flow graphs. The CFG shows that elements $e_1$ and $e_2$ have already been executed, and an element $e_3$ will be executed next (highlighted in blue). This fact is represented by the parameter $e^{next} = e_3$ at the node of the EPG. The lightning bolt sign on the data flow graph means that there was a fault activation during the execution of the element $e_1$. $E^{FA} = \{e_1\}$ at the EPG node demonstrates that fact. Because of this fault activation, the element $e_1$ has an erroneous output, and this error further propagates to the element $e_2$ (highlighted in red). It is shown by the parameter $E^{EP} = \{e_1, e_2\}$ of the node $s_D$. Finally, the loupe sign near the element $e_2$ on the DFG denotes that the error has been detected in the element $e_2$. Correspondingly, $E^{ED} = \{e_2\}$ at the EPG node.
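The four-parameter node described above maps naturally onto an immutable record. The following is a minimal sketch, not the authors' implementation; the class name `EPGState` and its field names are assumptions chosen for illustration. The instance `s_D` encodes the node discussed for Fig. 6.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class EPGState:
    """Hypothetical container for one EPG node <e_next, E_FA, E_EP, E_ED>."""
    e_next: str                           # element to run next, or 'none' / 'FS'
    e_fa: FrozenSet[str] = frozenset()    # elements with activated faults (history)
    e_ep: FrozenSet[str] = frozenset()    # elements whose outputs are currently erroneous
    e_ed: FrozenSet[str] = frozenset()    # elements where errors were detected (history)

    def is_final(self) -> bool:
        # 'none' marks a regular completion, 'FS' marks a fail-stop.
        return self.e_next in ("none", "FS")


# Node s_D: fault activated in e1, error present in e1 and e2,
# error detected in e2, and e3 is about to be executed.
s_D = EPGState(
    e_next="e3",
    e_fa=frozenset({"e1"}),
    e_ep=frozenset({"e1", "e2"}),
    e_ed=frozenset({"e2"}),
)
```

Because the record is frozen and built from frozensets, it is hashable, so nodes can be stored in sets or used as dictionary keys while a state graph is being explored.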
Fig. 5 demonstrates a part of the EPG and describes the key principles of representation of the error propagation process. A state $s_A$ is defined as the initial state of the EPG. It shows that the first element to be executed will be $e_1$. It also shows that there are no faults or errors in the system at this moment. The CFG of the reference EPM (see Fig. 3) defines that the element $e_2$ will be the next to be executed, and there are no other options. Therefore, the value of the $e^{next}$ parameter of both descendants of the node $s_A$
Fig. 6. A correspondence between a system state (left) and a node of an EPG (right).
of the EPG ($s_B$ and $s_C$) is the element $e_2$. There are, however, two possible execution scenarios for $e_1$: a faulty one and a fault-free one. Hence, the system changes its state from $s_A$ to $s_B$ with the fault activation probability $p^{FA}_{e_1}$, and to $s_C$ with the complementary probability $(1 - p^{FA}_{e_1})$.
It is also possible to return to the state $s_A$ from $s_C$ after the fault-free execution of $e_2$ because of the existence of the CFG arc $(e_2, e_1)$. This shows that the EPG is neither a tree nor a cycle-free directed graph. However, it is not possible to return to $s_A$ from $s_B$ or any other descendant of $s_B$ because $E^{FA}_{s_B} \not\subseteq E^{FA}_{s_A}$. A similar situation arises because of the parameter $E^{ED}$. The EPG will inevitably have groups of nodes that are connected one-way.
The outgoing arcs from the state $s_B$, like $(s_B, s_D)$, $(s_B, s_E)$, and $(s_B, s_F)$, show the different execution scenarios that are possible for element $e_2$. Namely, $(s_B, s_D)$ defines fault-free execution with error propagation, error detection, and the control flow transition to element $e_3$. An EPG arc $(s_B, s_E)$ shows fault-free execution with error masking, without error detection, and the control flow transition to element $e_1$. An EPG arc $(s_B, s_F)$ also shows fault-free execution with error masking, without error detection, but with the control flow transition to element $e_4$. Other arcs define other possible combinations.
A state change sequence $[s_A \to s_B \to s_E \to s_J]$ shows a particular case of system execution: an error arises because of fault activation in $e_1$ ($E^{EP}_{s_B} = \{e_1\}$) and vanishes because of a second fault-free execution of $e_1$ ($E^{EP}_{s_J} = \emptyset$).
An algorithm for EPG generation is described in detail in [47]. This algorithm transforms a CFG with defined control flow probabilities, a DFG, and the probabilistic properties of elements of a given EPM to an error propagation graph. The algorithm works iteratively. At the beginning the algorithm creates an initial node of the EPG using an initial element of the CFG as a value of the parameter $e^{next}$ and empty sets for $E^{FA}$, $E^{EP}$, and $E^{ED}$. This node is considered to be the current node in the first iteration. With each succeeding iteration, the algorithm identifies all possible states where the system can move after the execution of $e^{next}$, and computes the corresponding transition probabilities. These states are appended as new nodes to the EPG, as well as the arcs from the current node to the new nodes. After that, the algorithm selects the first unvisited node of the EPG as a current node and starts a new iteration. This process repeats until all nodes of the EPG have been visited.
5.2. System execution scenarios
According to the EPG, system operation starts with an initial state and is completed in one of the final states. These states are in focus because they correspond to different system execution scenarios. For example, the node $s_L$ (see Fig. 5) represents a fault-free completion of system execution. The node $s_H$ represents a fail-stop because of error detection in the element $e_3$. It happens because of the “fail-stop” error detection behavior that is defined for $e_3$ in the original EPM. The node $s_I$ shows that an error has propagated to $e_3$ from $e_1$ as in the case of $s_H$, but has not been detected in $e_3$. The node $s_K$ shows that an error has propagated to $e_4$, has been detected in $e_4$, and has been corrected because of the “error correction” value of the error detection behavior parameter of $e_4$.
The states of the Markov chain, represented by the final nodes of the EPG, are absorbing states. A state of a Markov chain is called absorbing if it is impossible to leave it. A Markov chain is absorbing if it has at least one absorbing state, and if from every state it is possible to go to an absorbing state (not necessarily in one step). In an absorbing Markov chain, a state which is not absorbing is called transient [48]. The mathematical apparatus of the absorbing DTMC enables the absorption probabilities for all final states to be estimated.
Let us consider an arbitrary absorbing DTMC. Renumber the states so that the transient states come first. If there are $r$ absorbing states and $t$ transient states, the transition matrix will have the following canonical form:
$$P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix}$$
Here $I$ is an r-by-r identity matrix, $0$ is an r-by-t zero matrix, $R$ is a nonzero t-by-r matrix that defines the transitions from transient to absorbing states, and $Q$ is a t-by-t matrix that represents the transitions between transient states of the DTMC. The first $t$ states are transient, and the last $r$ states are absorbing.
The absorption probabilities can be computed as the following matrix product:
$$B = N R$$
where $B$ is a t-by-r matrix of absorption probabilities, and $N$ is the so-called fundamental matrix of the DTMC:
$$N = I + Q + Q^2 + \cdots = (I - Q)^{-1}$$
The element $b_{i,j}$ of matrix $B$ represents the probability of absorption in absorbing state $j$ starting from transient state $i$.
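As a small worked illustration of this matrix product, the sketch below obtains $B$ by solving the linear system $(I - Q)B = R$ with Gauss-Jordan elimination over exact fractions, which avoids forming the inverse explicitly. The function name `absorption_probabilities` and the four-state toy chain are assumptions made for this example; they are not taken from the reference system.

```python
from fractions import Fraction


def absorption_probabilities(Q, R):
    """Solve (I - Q) B = R by Gauss-Jordan elimination; returns B (t-by-r)."""
    t = len(Q)
    # Build the augmented matrix [I - Q | R] with exact rational entries.
    M = [
        [Fraction(int(i == j)) - Fraction(Q[i][j]) for j in range(t)]
        + [Fraction(x) for x in R[i]]
        for i in range(t)
    ]
    for col in range(t):
        # Pick a row with a non-zero pivot and move it into place.
        pivot = next(row for row in range(col, t) if M[row][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [x / pivot_value for x in M[col]]
        # Eliminate this column from every other row.
        for row in range(t):
            if row != col and M[row][col] != 0:
                factor = M[row][col]
                M[row] = [a - factor * b for a, b in zip(M[row], M[col])]
    return [row[t:] for row in M]


# Toy chain (illustrative): two transient states followed by two absorbing states.
F = Fraction
Q = [[F(0), F(1, 2)],
     [F(1, 4), F(0)]]
R = [[F(1, 2), F(0)],
     [F(0), F(3, 4)]]
B = absorption_probabilities(Q, R)
```

For this toy chain each row of `B` sums to one, as expected for an absorbing chain: starting from either transient state, the process ends in one of the two absorbing states with certainty.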
Knowledge of the probability distribution over the
final states allows system behavior to be predicted. This is a
general and comprehensive way of analyzing fault activation, error
propagation, and error detection processes. Let us consider the
next three typical problems of error propagation analysis that
can be solved using the introduced method.
Problem 1. The majority of the existing error propagation models are developed for one specific goal: to estimate the probability of error propagation from one system element to another. This goal can be achieved using the introduced approach. Suppose, for instance, that we are only interested in the probability of error propagation from an element $e_i$ to an element $e_j$. The following four steps have to be performed in order to obtain this probability.
1. Define fault activation in $e_i$ as an initial state of the system:
$$s_0 := \langle e^{next}: e_i,\ E^{FA}: \{e_i\},\ E^{EP}: \{\},\ E^{ED}: \{\} \rangle$$
2. Apply the EPG generation algorithm, starting with the state $s_0$, to generate $S$ and $A_{EP}$.
3. Turn the EPG states that contain $e_j$ in $E^{EP}$ into absorbing states by removing their outgoing arcs:
$$A_{EP} := A_{EP} \setminus (s_a, s_b, p),\ \forall s_b \in S \text{ and } s_a \in S : e_j \in E^{EP}_{s_a}$$
4. Compute and sum up the probabilities of absorption for the absorbing states that contain $e_j \in E^{EP}$.
The value obtained is a sum of the probabilities of the execution scenarios that lead to error propagation from the element $e_i$ to the element $e_j$. This approach allows an error propagation matrix that contains the probabilities of error propagation between all pairs of system elements to be constructed. An analysis of this matrix helps to identify exactly how vulnerable the system is to fault activation and which parts of the system should be protected in the first place.
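The four steps of Problem 1 can be sketched as a generic worklist exploration followed by a step-wise propagation of probability mass. This is only an illustration under stated assumptions: the `successors` callback is a hypothetical stand-in for the EPG generation rules of [47], states are simplified to pairs of the next element and the set of erroneous elements, and the toy transition table is invented for the example.

```python
from collections import deque


def generate_epg(s0, successors, is_target):
    """Worklist exploration from s0 (steps 1 to 3).

    States for which is_target(s) holds are not expanded, so they have
    no outgoing arcs and therefore act as absorbing states.
    """
    arcs = {}                          # (state, next_state) -> probability
    seen, queue = {s0}, deque([s0])
    while queue:
        s = queue.popleft()
        if is_target(s):
            continue
        for nxt, p in successors(s):
            arcs[(s, nxt)] = arcs.get((s, nxt), 0.0) + p
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, arcs


def target_absorption(s0, arcs, is_target, tol=1e-12):
    """Step 4: total probability absorbed in target states.

    Assumes the chain is absorbing, so the moving mass eventually vanishes.
    """
    out = {}
    for (a, b), p in arcs.items():
        out.setdefault(a, []).append((b, p))
    moving, absorbed = {s0: 1.0}, {}
    while sum(moving.values()) > tol:
        nxt = {}
        for s, mass in moving.items():
            if s not in out:           # no outgoing arcs: absorbing state
                absorbed[s] = absorbed.get(s, 0.0) + mass
            else:
                for b, p in out[s]:
                    nxt[b] = nxt.get(b, 0.0) + mass * p
        moving = nxt
    return sum(m for s, m in absorbed.items() if is_target(s))


# Hypothetical toy model: a state is (e_next, set of erroneous elements).
EP = frozenset
toy = {
    ("e1", EP()): [(("e2", EP({"e1"})), 1.0)],
    ("e2", EP({"e1"})): [(("e3", EP({"e1", "e2"})), 0.3),
                         (("e3", EP({"e1"})), 0.7)],
    ("e3", EP({"e1", "e2"})): [(("none", EP({"e1", "e2", "e3"})), 0.5),
                               (("none", EP({"e1", "e2"})), 0.5)],
    ("e3", EP({"e1"})): [(("none", EP({"e1"})), 1.0)],
}


def toy_successors(state):
    return toy.get(state, [])


def error_in_e3(state):
    return "e3" in state[1]


s0 = ("e1", EP())
states, arcs = generate_epg(s0, toy_successors, error_in_e3)
p_e1_to_e3 = target_absorption(s0, arcs, error_in_e3)
```

Passing the absorbing condition as a predicate keeps the sketch reusable: replacing `error_in_e3` with a test against a set of critical elements, or against the error detection set, gives the same procedure for other absorbing conditions.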
Problem 2. Another example of the approach application is the estimation of the probability of error propagation to a defined set of critical elements, $E^C \subseteq E$. The following algorithm solves this task.
1. Define the default initial state of the system:
$$s_0 := \langle e^{next}: e_0,\ E^{FA}: \{e_i\},\ E^{EP}: \{\},\ E^{ED}: \{\} \rangle$$
where $e_0$ is the initial element of the CFG.
2. Apply the EPG generation algorithm to generate $S$ and $A_{EP}$.
3. Turn the EPG states that contain at least one critical element $e_c \in E^C$ in $E^{EP}$ into absorbing states by removing their outgoing arcs:
$$A_{EP} := A_{EP} \setminus (s_a, s_b, p),\ \forall s_b \in S \text{ and } s_a \in S : \exists e_c \in E^{EP}_{s_a} \text{ and } e_c \in E^C$$
4. Compute and sum up the probabilities of absorption for the absorbing states defined in the previous step.
The result of this algorithm is the probability of error propagation to at least one of the critical elements. For instance, it can help in the reliability analysis of a system that has several important data outputs. Also, this approach can be applied to estimate the safety of a mechatronic system, defining the critical elements that can harm an operator or the environment because of an erroneous input. It allows the probabilities of error propagation to this set of critical elements to be computed, given the fault activation in a particular system element. This algorithm also enables a sensitivity analysis and identifies the elements that have the most effect on the critical ones.
Problem 3. One other example of the approach application concerns the system testing process. The following algorithm can estimate the probability of error detection in a particular element $e_d$ that is equipped with an error detection mechanism.
1. Define the default initial state of the system:
$$s_0 := \langle e^{next}: e_0,\ E^{FA}: \{e_i\},\ E^{EP}: \{\},\ E^{ED}: \{\} \rangle$$
where $e_0$ is the initial element of the CFG.
2. Apply the EPG generation algorithm to generate $S$ and $A_{EP}$.
3. Turn the EPG states that contain $e_d$ in $E^{ED}$ into absorbing states by removing their outgoing arcs:
$$A_{EP} := A_{EP} \setminus (s_a, s_b, p),\ \forall s_b \in S \text{ and } s_a \in S : e_d \in E^{ED}_{s_a}$$
4. Compute and sum up the probabilities of absorption for the absorbing states defined in the previous step.
These probabilities of error detection help to identify test cases that will stimulate fault activations in critical elements and will allow the majority of the errors that occur to be detected.
The introduced approach can be applied to a variety of tasks in system analysis. It can be used in order to improve reliability estimation. The majority of existing reliability models do not take error propagation into account at all. The introduced approach can extend the available reliability models or can even be applied separately. It also enables the probability of error propagation to critical elements to be estimated, supporting existing safety models. The introduced approach can be used to select a system protection strategy by identifying the most suitable places for error detection mechanisms. It can also be applied for fault localization in mechatronic systems, observing a number of system outputs that help to speed up the diagnostics and debugging of the system. The application concept of the dual-graph error propagation model for fault localization was demonstrated in [3].
6. Case study
This section contains a brief overview of a case study that aims to evaluate the introduced approach to error propagation analysis. A detailed description can be found in [5,47]. The sample system in question is a caterpillar mobile robot, shown in Fig. 7. Like a typical mechatronic system, it consists of mutually coupled hardware, software, and physical components. The task of this system is to move a robot from its current location to a target point. An abstraction of the caterpillar mobile robot system as a control system block diagram is shown in Fig. 8.
A controlled variable is a pose (position and orientation) of the robot. During each control cycle, the control software determines the current location and orientation of the robot using the navigation camera that is located above the scene. After that it computes the speeds for the caterpillars and the length of a time interval for the servo motors activation. Next, the control software
Fig. 7. The ‘‘caterpillar mobile robot’’ reference system.
Fig. 8. A block diagram of a control loop of the ‘‘caterpillar mobile robot’’ reference
system.
sends corresponding commands to the servo motors that correct
the robot motion to go towards the target point. This procedure
is repeated until the robot reaches the anticipated target point.
Fig. 9 shows the structure of the case study. It starts with the decomposition of the original robot control system into several executable elements. A UML activity diagram [30] of the mechatronic system under consideration is shown in Fig. 10. This diagram was used as a baseline system model within a formal design process because of its explicit representation of system elements and of control and data flows. UML diagrams are usually applied to describe only the software part of a system. However, UML activity diagrams can model very abstract entities. In this particular case, each activity represents not just the action of the robot control software, but also the hardware and even the physical motion of the robot. The described activities are considered to be actions of the executable elements of the system. Therefore, there is a one-to-one correspondence between these activities and the elements of the dual-graph error propagation model of the “caterpillar mobile robot” reference system.
A software implemented fault injection (SWIFI) was used in
order to obtain a faulty version of the reference system. During
the development, debugging, and testing of the original system,
the information about introduced software and hardware bugs
and design faults has been tracked and saved into a special fault list. Only the faults from this list have been injected in order to
ensure the realistic behavior of the faulty version. However, not
every fault is suitable for this case study. We were focused on
the temporary faults [1] with low fault activation probabilities that
result in erroneous data outputs. A temporary fault, in contrast to a
permanent fault, is activated only under specific execution conditions like specific values of input parameters, environmental
impact, or a control flow decision. The faults have been injected
into all elements except ‘‘GuiOut’’ because it has no data output.
Fig. 10. The UML activity diagram of the ‘‘caterpillar mobile robot’’ reference
system.
Software bugs have been injected into ‘‘Init’’, ‘‘CalcAngle’’, ‘‘CorrStep’’, and ‘‘RegStep’’. For example, many geometric computations
are performed in order to determine the angle between the current
and the desired robot orientation. During one of these computations in the faulty version of this element, radians are used instead
of grads. The faulty part of the code is only activated under specific
circumstances that depend on the current robot orientation and
the target position. Activation of this fault results in an incorrect
value of the computed angle. In the elements ‘‘RcgPic’’ and
‘‘NxtGo’’, hardware faults have been emulated by the software
means. For instance, a physical defect in the robot construction is
simulated in the faulty version of the element NxtGo: one wheel
Fig. 9. A structure of the case study.
of the mobile robot scratches the frame and goes slowly. It results
in an unintended robot movement as soon as it turns left. A complete list of injected faults is presented in Appendix C of [47].
After the fault injection, a separate statistical element level analysis for each individual element was performed. Fault activation
probabilities were estimated by executing the original and faulty
versions of the elements in parallel ( 500 runs of the entire
system). Error propagation probabilities were estimated using
the erroneous and error-free inputs for the faulty elements. A software model of physical robot movement has been developed for
the element “NxtGo” and used in parallel with the real-world robot.
The element ‘‘GuiOut’’ has a non-zero error detection probability.
This probability has been determined using the defined difference
between the erroneous robot pose and the expected one that is
detectable by the naked eye. The probabilities of fault activation
(FAP), error propagation (EPP), and detection (EDP) are shown on
the DFG in Fig. 11.
This system was developed according to the described UML representation. Hence, the system level analysis was pretty straightforward. Structures of the control and data flow graphs were easily
derived (see Fig. 11). Also, a number of statistical experiments (
500 runs) have been done in order to obtain the probabilities of
control flow transitions. These probabilities are shown on the
CFG in Fig. 11.
The described system level and element level analyses of the
reference robot control system allowed the construction of an error
propagation model that is shown in Fig. 11. The generation of a
Markov chain was the next step. The generated Markov chain contains 30,431 arcs and 11,887 states: 1 initial state, 9944 transient
states, and 1942 absorbing states. This means that there are 1942
possible system execution scenarios, and the probabilities of
absorption in these final states define the probabilities of the
corresponding scenarios. All the computations have been made
on a regular laptop with a 2.4 GHz Intel Core 2 Duo processor
and 2 GB RAM. The generation time was about 1 min.
Several computation approaches, discussed in detail in [47],
have been applied in order to compute the absorption probabilities. The
direct algebraic method, discussed in the previous section, has
failed because of simple memory overflow. Next, we tried an iterative, approximation algorithm. The input parameters of this algorithm are: transition probability matrix P of the DTMC, a set of
absorbing states, and a predefined accuracy of the estimation.
The algorithm contains a loop for a step-wise computation of the
DTMC. On each step a vector that shows a current distribution of
probabilities among the DTMC states is computed. Because of the
absorbing character of the DTMC, these probabilities eventually
consolidate in the absorbing states. Therefore, the values of the elements of the vector that represent the absorbing states increase,
and the values that represent transient states decrease. The computation continues until the desired accuracy is reached. The
sum of values of the transient elements of the distribution vector
is considered to be a current accuracy. This method gave a satisfactory result. It has been applied with different values of the accuracy
parameter. Even with an accuracy as high as $10^{-14}$, the computation time was less than 30 s. The exact results were also obtained,
using a state space reduction technique, described in [49], and the
further application of the direct computation method. The state
space reduction from 11,887 states down to 7000 states took about
20 min. The computation time of the reduced DTMC using the
direct method was around 15 min.
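The iterative approximation described in this paragraph can be sketched as follows. This is a minimal illustration rather than the implementation used in the case study: the function name `iterate_to_absorption`, the dense list-of-lists matrix, and the four-state toy chain are assumptions, and a chain of the size reported above would more realistically use a sparse representation.

```python
def iterate_to_absorption(P, absorbing, start, accuracy=1e-14, max_steps=1_000_000):
    """Step-wise computation of the state distribution of an absorbing DTMC.

    P         : transition probability matrix (rows sum to 1; absorbing
                states have a self-loop with probability 1)
    absorbing : set of indices of the absorbing states
    start     : index of the initial state
    accuracy  : stop once the probability mass left in transient states
                is at most this value
    """
    n = len(P)
    dist = [0.0] * n
    dist[start] = 1.0
    for step in range(max_steps):
        # The mass still sitting in transient states is the current accuracy.
        transient_mass = sum(dist[i] for i in range(n) if i not in absorbing)
        if transient_mass <= accuracy:
            return dist, step
        # One step of the chain: dist <- dist * P.
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    raise RuntimeError("requested accuracy not reached within max_steps")


# Toy chain (illustrative): states 0 and 1 are transient, 2 and 3 are absorbing.
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.25, 0.0, 0.0, 0.75],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
dist, steps = iterate_to_absorption(P, absorbing={2, 3}, start=0)
```

Only one distribution vector is kept in memory at a time, which is why this approach can succeed where inverting $(I - Q)$ directly runs out of memory.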
Generally, a DTMC computation time $t_c$ can be represented as a polynomial function of the number of states $N_S$ that, in turn, depends on the number of elements of an error propagation model $N_E$. It is possible to evaluate an upper bound of $N_S$:
$$N_S \le (N_E + 2) \cdot 2^{3 N_E}$$
It shows that $N_S$ can grow exponentially with the growth of $N_E$. According to this, the described approach is only applicable for systems with a small number of elements. In fact, the number of
Fig. 11. Control flow graph and data flow graph of the reference system with fault activation, error propagation, and error detection probabilities.
states is much less than the defined upper bound. The parameter $N_S$ strongly depends on the properties of the error propagation model: the structures of the control flow and data flow graphs and the properties of its elements. For instance, the lack of a data flow arc between two elements results in a lack of several states of the DTMC because error propagation between these elements is impossible. The same holds for the control flow structure. If an element $e_j$ cannot be executed after an element $e_i$, then error propagation from $e_i$ to $e_j$ is impossible as well. In real-world systems, the control flow and data flow graphs are rather sparse, which reduces $N_S$. The element level properties of the EPM also affect $N_S$. Assume that the fault activation probability of some element equals zero. In this case, the DTMC has no states that contain the corresponding element in $E^{FA}$. A similar situation occurs if an error detection probability equals zero (or one). As a general rule, the more element execution scenarios are possible, the more states the corresponding Markov chain has.
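The upper bound on the number of states can be read as a simple counting argument over the four node parameters defined in Section 5.1. The derivation below is a sketch of that reading; the numeric value uses $N_E = 5$ purely as an illustration and is not taken from the case study.

```latex
% e^{next} takes one of N_E + 2 values (any element, 'none', or 'FS');
% each of E^{FA}, E^{EP}, E^{ED} is one of the 2^{N_E} subsets of E.
\[
N_S \;\le\; (N_E + 2)\cdot 2^{N_E}\cdot 2^{N_E}\cdot 2^{N_E}
     \;=\; (N_E + 2)\cdot 2^{3N_E}
\]
% Illustrative value only (N_E = 5 is an assumed example size):
\[
N_E = 5 \;\Rightarrow\; N_S \le 7 \cdot 2^{15} = 229{,}376
\]
```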
In spite of the fact that the real value of $N_S$ is lower than the upper bound estimation, it will still grow exponentially. For instance, even if only fault activation is considered, without further error propagation and error detection, the DTMC will contain at least $2^{N_E}$ states. Chapter 7 of [47] describes several methods that can help in coping with this problem. One of them is a low probability limitation technique. This technique has been evaluated using the constructed EPM. The idea of the low probability limitation is to stop the process of DTMC generation as soon as an appropriate level of accuracy has been reached. Here, the accuracy depends on the number of already generated final states and their probabilities of absorption. The accuracy is computed as $1 - P_{known}$, where $P_{known}$ is the sum of the probabilities of absorption of the already generated final states. Examination of the DTMC generation process has shown that there is no exact dependency between the number of final states and the total number of generated states. However, in our case 1734 of 1942 final states had already been created in the middle of the generation process. So it seems reasonable to stop the process after the first 30 s and analyze the generated part of the DTMC. It is obvious that the absorption probabilities of the final states are not equal. However, we observed that the final states with high probabilities of absorption were generated early. We already had an accuracy around 0.15 with the part of the DTMC that consists of only 2000 states. This experiment indicates that application of the low probability limitation technique is reasonable for a computation of complex error propagation models.
After that, a number of experiments with the real system were carried out. The faulty and original versions of the system were run in parallel for statistical evaluation of system behavior. Both versions of the system work according to the same control flow. Control flow decisions for the original system are provided by the faulty one. In contrast to the control flow, two separate data flows are kept: the error-free data flow of the original system and the data flow of the faulty system, which can contain errors. The physical movement of the robot is controlled by the faulty system. The original system uses a software-implemented model of robot kinematics. The fault activations and propagation of errors are detected in a similar way as during the element level analysis. During the system operation, an original version and a faulty version of each element are executed using the erroneous data flow in order to detect fault activation. Likewise, a faulty version of an element is executed using the error-free and the erroneous data flows in order to detect error propagation through the element.
The described system has been run 446 times. During these
runs, the system has demonstrated 52 different execution scenarios. The first three results according to the statistical estimation of
their probabilities are shown in Fig. 12.
Twelve of the most frequent execution scenarios have been selected for evaluation. Statistical estimation of the other scenarios is considered unreliable because of the small number of their occurrence. The computational error of this estimation is less than 1% for the standard 95% confidence level. Table 2 shows twelve execution scenarios of the selected system.
Fig. 12. Three of the most frequent execution scenarios of the reference system.
Table 2
Comparison of the experimental results and the prediction of the EPM.
| N | FA | EP | ED | Exp. | Mod. | Dif. |
|---|---|---|---|---|---|---|
| 1 | None | n.a. | n.a. | 0.3296 | 0.3127 | 0.0167 |
| 2 | CorrStep | RcgPic, CorrStep, NxtGo | FS | 0.1861 | 0.1545 | 0.0316 |
| 3 | CalcAngle | RcgPic, CalcAngle, CorrStep, NxtGo | FS | 0.1099 | 0.1067 | 0.0032 |
| 4 | CalcAngle | CalcAngle | n.a. | 0.0381 | 0.0288 | 0.0093 |
| 5 | RegStep | RcgPic, RegStep, NxtGo | FS | 0.0381 | 0.0453 | 0.0072 |
| 6 | RcgPic | RcgPic | FS | 0.0314 | 0.0492 | 0.0179 |
| 7 | NxtGo | RcgPic, NxtGo | n.a. | 0.0247 | 0.0502 | 0.0255 |
| 8 | CorrStep | RcgPic, CorrStep, NxtGo | n.a. | 0.0247 | 0.0226 | 0.0021 |
| 9 | CalcAngle, CorrStep | RcgPic, CalcAngle, CorrStep, NxtGo | FS | 0.0247 | 0.0254 | 0.0007 |
| 10 | CalcAngle, RegStep | RcgPic, CalcAngle, RegStep, NxtGo | FS | 0.0179 | 0.0060 | 0.0119 |
| 11 | NxtGo | RcgPic, NxtGo | FS | 0.0179 | 0.0502 | 0.0322 |
| 12 | Init | Init, RcgPic | FS | 0.0135 | 0.0247 | 0.0113 |
FA – Fault activation, EP – Error propagation, ED – Error detection.
Exp. – The probabilities obtained by the statistical experiments.
Mod. – The probabilities predicted by the error propagation model.
Dif. – Difference.
The model predictions and the experimental results are quite
similar. The average absolute difference is about 0.0142, and the maximum
difference is 0.0322. However, only the first three execution scenarios have a probability greater than 0.1. The probabilities of the
other scenarios are lower, so the relative difference between
the experimental and model results becomes increasingly significant. Nevertheless, these results indicate that the model can accurately predict the most probable scenarios. It also gives a
preliminary estimate of the probabilities of the other execution
scenarios. At the very least, it can distinguish between realistic and practically impossible execution scenarios. For example, the sorted list
of predicted execution scenarios has been cross-checked against the
experiments. The first predicted scenario that was not observed among
the experiments has a predicted probability of only 0.0074.
Assuming independent runs, the likelihood that a scenario with this probability does not appear
among our 446 experiments is about 3.5%. All other scenarios with
probabilities greater than 0.0074 have been observed at least once.
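
The summary figures quoted above can be re-derived from the Exp. and Mod. columns of Table 2. The following is a minimal sketch, not the authors' evaluation script: the variable names are illustrative, and the last part assumes that the 446 runs are independent and that a scenario occurs with a fixed probability in each run.

```python
# Exp. and Mod. columns of Table 2 (scenarios 1 to 12).
exp_prob = [0.3296, 0.1861, 0.1099, 0.0381, 0.0381, 0.0314,
            0.0247, 0.0247, 0.0247, 0.0179, 0.0179, 0.0135]
mod_prob = [0.3127, 0.1545, 0.1067, 0.0288, 0.0453, 0.0492,
            0.0502, 0.0226, 0.0254, 0.0060, 0.0502, 0.0247]

# Absolute difference per scenario, then its mean and maximum.
abs_diff = [abs(e - m) for e, m in zip(exp_prob, mod_prob)]
mean_diff = sum(abs_diff) / len(abs_diff)
max_diff = max(abs_diff)

# Relative difference with respect to the experimental value:
# it grows as the scenario probability shrinks.
rel_diff = [d / e for d, e in zip(abs_diff, exp_prob)]

# Assumption: independent runs, fixed per-run scenario probability p.
n_runs = 446
p = 0.0074
p_absent = (1 - p) ** n_runs   # scenario never observed in n_runs runs
expected_count = n_runs * p    # expected number of observations

print(f"mean |Exp - Mod| = {mean_diff:.4f}, max = {max_diff:.4f}")
print(f"relative difference, scenario 1: {rel_diff[0]:.1%}, scenario 11: {rel_diff[10]:.1%}")
print(f"P(absent in {n_runs} runs) = {p_absent:.3f}, expected count = {expected_count:.1f}")
```

Because the table entries are rounded to four decimals, the recomputed values may differ from the quoted ones in the last digit.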
7. Conclusion
Error propagation analysis is an important part of a variety of
system analysis tasks. System design, reliability and safety analysis, testing, diagnostics, and many other activities aimed
at developing a dependable system require a deep understanding of its behavior under erroneous conditions. The specifics
of the mechatronic domain necessitate the development of an
appropriate mathematical model. This model must operate with
abstract entities to represent various properties of the heterogeneous components of mechatronic systems and provide the methods for probabilistic error propagation analysis. A suitable model
was presented in this article. The following primary research results
have been achieved:
The mechatronic-oriented error propagation model. The overview
of the existing approaches to error propagation analysis (see Section 2) has revealed that there are several error propagation models that can be theoretically adapted to the analysis of mechatronic
systems. However, there is no complete mathematical framework
that enables an analysis of the heterogeneous mechatronic components. Development of the error propagation model oriented to the
mechatronic domain has been the first achievement.
The new concept of error propagation analysis. The most appropriate error propagation model has been selected among the available models to be the starting point of this research. Despite the
strong probabilistic background, this model has a significant
disadvantage: it implies that data errors always propagate through
control flow. This assumption makes it inapplicable to systems
in which components can be triggered in arbitrary order with non-sequential data flow. This article shows that the system control and
data flows must be separately considered for an accurate description of an error propagation process. This has motivated the development of a new approach to error propagation analysis based on
synchronous examination of two directed graphs: a control flow
graph and a data flow graph (see Section 3). No similar approaches
have been found among the existing error propagation models or
in domains related to system dependability. The new dual-graph
approach to error propagation analysis has been the second
achievement.
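
The difference between control flow and data flow can be illustrated with a toy example. This is a sketch under simplifying assumptions, not the model defined in Section 3: the elements `A`, `B`, `C` and the data items `u`, `x`, `y`, `z` are hypothetical, and an error is assumed to propagate with probability 1 whenever an element reads an erroneous input.

```python
# Control flow: the order in which elements run in one execution scenario.
control_path = ["A", "B", "C"]

# Data flow: the data items each element reads and writes.
reads = {"A": [], "B": ["u"], "C": ["x"]}
writes = {"A": ["x"], "B": ["y"], "C": ["z"]}


def propagate(control_path, reads, writes, faulty_element):
    """Return the set of erroneous data items after executing control_path.

    Assumption: the outputs of an element become erroneous if a fault is
    activated in it or if any of its inputs is erroneous; otherwise the
    element overwrites its outputs with correct values.
    """
    erroneous = set()
    for element in control_path:
        corrupted = element == faulty_element or any(
            item in erroneous for item in reads[element]
        )
        for item in writes[element]:
            if corrupted:
                erroneous.add(item)
            else:
                erroneous.discard(item)
    return erroneous


# A fault in A corrupts x; B runs next but never reads x, so y stays correct,
# while C reads x and produces an erroneous z.
print(sorted(propagate(control_path, reads, writes, "A")))
```

A model that lets errors follow the control path `A`, `B`, `C` would also mark the output of `B` as erroneous, which is the inaccuracy the dual-graph approach is meant to avoid.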
Extensive mathematical framework. The new concept enables
system behavior to be modeled in a more flexible and accurate
manner than the models described in Section 2. Moreover, unlike
the existing approaches that are only aimed at computation of
error propagation probabilities between the system elements, the
presented model considers the entire chain of events, starting with
fault activation and ending with error detection or system failure.
This makes it possible to model the system execution process using the
probabilistic control flow graph, the activation of multiple faults during system execution, the propagation of the resulting errors
through the data flow structure, and the detection of these errors by
protection mechanisms, taking into account different types of subsequent system behavior, e.g. fail-stop or error correction. This extensive definition of the error propagation process makes the model
applicable to different system analysis tasks. However, during
model development considerable effort was made to
preserve a balance between the breadth of application
and model complexity. The development of the new mathematical
framework of the dual-graph error propagation model has been the
third achievement.
Comprehensive approach to error propagation analysis. The comprehensive approach to error propagation analysis has been
described in Section 5. The main idea of this approach is the application of a discrete time Markov chain to model system
execution in terms of fault activation, error propagation, and
error detection. This Markov chain can be generated automatically using the data contained in the error propagation model
and makes it possible to obtain the probabilities of different erroneous and error-free scenarios of system operation. The introduced
approach supports customization and can be extended or
specialized depending on the defined system analysis task. The
development of this approach to error propagation analysis has been
the fourth achievement.
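
How scenario probabilities are obtained from an absorbing discrete time Markov chain can be sketched on a toy chain. The state names and transition probabilities below are invented for illustration and are not the state graph generated by the approach of Section 5; the sketch only shows the generic technique of pushing probability mass through the transient states until it is absorbed.

```python
# Hypothetical chain: two transient states and three absorbing outcomes.
# Absorbing states are those that do not appear as keys.
transitions = {
    "exec":  {"exec": 0.50, "error": 0.05, "ok": 0.45},
    "error": {"error": 0.20, "detected": 0.60, "failure": 0.20},
}


def absorption_probabilities(transitions, start, tol=1e-12):
    """Return the probability of ending in each absorbing state.

    The distribution over transient states is stepped forward until the
    remaining transient probability mass drops below tol.
    """
    dist = {start: 1.0}
    absorbed = {}
    while sum(dist.values()) > tol:
        next_dist = {}
        for state, mass in dist.items():
            for target, prob in transitions[state].items():
                if target in transitions:   # still transient
                    next_dist[target] = next_dist.get(target, 0.0) + mass * prob
                else:                       # absorbed
                    absorbed[target] = absorbed.get(target, 0.0) + mass * prob
        dist = next_dist
    return absorbed


result = absorption_probabilities(transitions, "exec")
for outcome, prob in sorted(result.items()):
    print(f"{outcome}: {prob:.4f}")
```

For larger chains the same quantities are usually computed in closed form from the fundamental matrix of the absorbing chain rather than by iteration.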
Although this research is oriented mainly toward theoretical aspects, it has been verified using a typical mechatronic system
(see Section 6). The probabilities of possible execution scenarios
predicted by the introduced error propagation model have been
compared with the experimental evaluation. The obtained numerical
results support the conclusion that the presented
approach is accurate and applicable to the error propagation analysis of mechatronic systems.
Acknowledgements
This work has been supported by the Erasmus Mundus External
Co-operation Window Programme of the European Union.
References
[1] Laprie JC, Avizienis A, Kopetz H. Dependability: basic concepts and
terminology. Secaucus (NJ, USA): Springer-Verlag; 1992.
[2] Janschek K. Mechatronic systems design: methods, models, concepts. Springer;
2011. <http://books.google.com/books?id=L0MhTwEACAAJ>.
[3] Morozov A, Janschek K. Fast abstract: dual graph model for software errors
localization. In: 21st IEEE international symposium on software reliability
engineering; 2010.
[4] Morozov A, Janschek K. Dual graph error propagation model for mechatronic
system analysis. In: Proceedings of the 18th IFAC world congress, August 28–
September 2, 2011, Milano, Italy; 2011. p. 9893–8.
[5] Morozov A, Janschek K. Case study results for probabilistic error propagation
analysis of a mechatronic system. In: Tagungsband Fachtagung Mechatronik
2013, Aachen, 06.03.–08.03.2013; 2013. p. 229–34.
[6] Ge X, Paige RF, McDermid JA. Probabilistic failure propagation and
transformation analysis. In: SAFECOMP’09; 2009. p. 215–28.
[7] Fenelon P, Mcdermid JA, Dd Y. An integrated toolset for software safety
analysis. J Syst Softw 1993;21:279–90.
[8] Papadopoulos Y, Mcdermid J, Sasse R, Heiner G. Analysis and synthesis of the
behaviour of complex programmable electronic systems in conditions of
failure. Reliab Eng Syst Safety 2001;71:229–47.
[9] Kaiser B, Liggesmeyer P, Jäckel O. A new component concept for fault trees. In:
Proceedings of the 8th Australian workshop on safety critical systems and
software (SCS’03), Adelaide; 2003. p. 37–46.
[10] Wallace M. Modular architectural representation and analysis of fault
propagation and transformation. In: Proc. FESCA 2005. ENTCS, vol.
141(3). Elsevier; 2005. p. 53–71.
[11] Voas JM. PIE: a dynamic failure-based technique. IEEE Trans Softw Eng
1992;18:717–27.
[12] Voas JM, Morell LJ. Propagation and infection analysis (PIA) applied to
debugging. In: Proceedings of Southeastcon’90; 1990. p. 379–83.
[13] Voas JM. Error propagation analysis for cots systems. IEEE Comput Control Eng
J 1997;8:269–72.
[14] Michael CC, Jones RC. On the uniformity of error propagation in software. In:
Proceedings of the 12th annual conference on computer assurance (COMPASS
’97); 1996. p. 68–76.
[15] Candea G, Delgado M, Chen M, Fox A. Automatic failure-path inference: a
generic introspection technique for internet applications. In: Proceedings of
the third IEEE workshop on internet applications, WIAPP ’03. Washington (DC,
USA): IEEE Computer Society; 2003. p. 132–41. <http://portal.acm.org/
citation.cfm?id=832311.837386>.
[16] Khoshgoftaar TM, Allen EB, Tang WH, Michael CC, Voas JM. Identifying
modules which do not propagate errors. In: Proceedings of the 1999 IEEE
symposium on application – specific systems and software engineering and
technology, ASSET ’99. Washington (DC, USA): IEEE Computer Society; 1999. p.
185. <http://portal.acm.org/citation.cfm?id=786771.787114>.
[17] Sanyal S, Shah V, Bhattacharya S. Framework of a software reliability
engineering tool. In: IEEE international symposium high-assurance systems
engineering; 1997.
[18] Zhang F, Xingshe Z, Yunwei D, Junwen C. Consider of fault propagation in
architecture-based software reliability analysis. In: IEEE/ACS international
conference computer systems and applications, 2009. AICCSA; 2009. p. 783–6.
[19] Ammar HH, Nassar D, Abdelmoez W, Shereshevsky M. A framework for
experimental error propagation analysis of software architecture specifications.
In: ISSRE’02; 2002.
[20] Abdelmoez W, Nassar DM, Shereshevsky M, Gradetsky N, Gunnalan R, Ammar
HH, et al. Error propagation in software architectures. In: Proceedings of the
software metrics, 10th international symposium. Washington (DC, USA): IEEE
Computer Society; 2004. p. 384–93. <http://portal.acm.org/citation.cfm?id=
1018439.1021921. doi:10.1109/METRICS.2004.20>.
[21] Nassar D, Rabie W, Shereshevsky M, Gradetsky N, Ammar HH, Bogazzi S, et al.
Estimating error propagation probabilities in software architectures; 2002.
[22] Popic P, Desovski D, Abdelmoez W, Cukic B. Error propagation in the reliability
analysis of component based systems. In: Proceedings of the 16th IEEE
international symposium on software reliability engineering. Washington (DC,
USA): IEEE Computer Society; 2005. p. 53–62. http://dx.doi.org/10.1109/
ISSRE.2005.18. <http://portal.acm.org/citation.cfm?id=1104997.1105235>.
[23] A. Jhumka, M. Hiller, N. Suri, Assessing inter-modular error propagation in
distributed software. In: Symp. on reliable distributed systems distributed
software; 2001. p. 152–61.
[24] Hiller M, Jhumka A, Suri N. An approach for analysing the propagation of data
errors in software. In: Proceedings of the 2001 international conference on
dependable systems and networks (formerly: FTCS), DSN ’01. Washington (DC,
USA): IEEE Computer Society; 2001. p. 161–72. <http://portal.acm.org/
citation.cfm?id=647882.738068>.
[25] Hiller M, Jhumkas A, Suri N. Tracking the propagation of data errors in
software. In: Dependable computing systems: paradigms. Performance issues
and applications. Wiley; 2005.
[26] Hiller M, Jhumka A, Suri N. Propane: an environment for examining the
propagation of errors in software. SIGSOFT Softw Eng Notes 2002;27:
81–5.
[27] Mohamed A, Zulkernine M. On failure propagation in component-based
software systems. In: Proceedings of the 2008 the eighth international
conference on quality software. Washington (DC, USA): IEEE Computer
Society; 2008. p. 402–11. http://dx.doi.org/10.1109/QSIC.2008.46. <http://
portal.acm.org/citation.cfm?id=1441433.1442755>.
[28] Cortellessa V, Grassi V. Role and impact of error propagation in software
architecture reliability. TRCS 007/2006. Technical report. Dipartimento di
Informatica, Universita’ dell’Aquila; 2006. <http://www.di.univaq.it/cortelle/
docs/internalreport.pdf>.
[29] Cortellessa V, Grassi V. A modeling approach to analyze the impact of error
propagation on reliability of component-based systems. In: Proceedings of the
10th international conference on component-based software engineering,
CBSE’07. Berlin, Heidelberg: Springer; 2007. p. 140–56. <http://portal.acm.org/
citation.cfm?id=1770657.1770670>.
[30] OMG. Unified modeling language (UML). Core specification; 2010a.
[31] OMG. Systems modeling language (SysML). Core specification; 2010b.
[32] OMG. UML superstructure specification, v2.0; 2005.
[33] Goseva-Popstojanova K, Trivedi KS. Architecture-based approach to reliability
assessment of software systems. Perform Eval 2001;45:179–204.
[34] Constantinescu C. A decomposition method for reliability analysis of real-time
computing systems. In: 1994. Proceedings. Annual reliability and maintainability
symposium; 1994. p. 272–7. http://dx.doi.org/10.1109/RAMS.1994.291119.
[35] Chen Li Yan. Decomposition method for software reliability analysis. Comput
Eng Des 2007.
[36] Boudali H, Sozer H, Stoelinga M. Architectural availability analysis of software
decomposition for local recovery. Secure Syst Integr Reliab Improv 2009;
0:14–22.
[37] Aho AV, Sethi R, Ullman JD. Compilers: principles, techniques, and
tools. Boston (MA, USA): Addison-Wesley Longman Publishing Co., Inc.; 1986.
[38] Landi W, Ryder BG. A safe approximate algorithm for interprocedural pointer
aliasing. SIGPLAN Not 2004;39:473–89.
[39] van Eijndhoven JTJ, Stok L. A data flow graph exchange standard. In: 1992.
Proceedings., [3rd] European conference design automation, Brussels,
Belgium; 1992. p. 193–9. http://dx.doi.org/10.1109/EDAC.1992.205921.
[40] Harrold MJ, Offutt AJ, Tewary K. An approach to fault modeling and fault
seeding using the program dependence graph. J Syst Softw 1997;36: 273–95.
[41] Gokhale SS, Wong WE, Trivedi KS, Horgan JR. An analytical approach to
architecture-based software reliability prediction. In: Computer performance
and dependability symposium; 1998. p. 13.
[42] Yacoub SM, Cukic B, Ammar HH. Scenario-based reliability analysis of
component-based software. In: Proceedings of the 10th international
symposium on software reliability engineering, ISSRE ’99. Washington (DC,
USA): IEEE Computer Society; 1999. <http://portal.acm.org/citation.cfm?id=
851020.856175>.
[43] Namballa R, Ranganathan N, Ejnioui A. Control and data flow graph extraction
for high-level synthesis. In: VLSI, 2004. Proceedings. IEEE computer society
annual symposium; 2004. p. 187–92. http://dx.doi.org/10.1109/ISVLSI.2004.
1339528.
[44] Borkar S. Designing reliable systems from unreliable components: the
challenges of transistor variability and degradation. IEEE Micro 2005.
[45] Schroeder B, Pinheiro E, Weber W-D. Dram errors in the wild: a large-scale
field study. In: SIGMETRICS ’09: proceedings of the eleventh international joint
conference on Measurement and modeling of computer systems. USA: ACM;
2009.
[46] Nightingale EB, Douceur JR, Orgovan V. Cycles, cells and platters: an empirical
analysis of hardware failures on a million consumer PCs. In: Sixth conference
on Computer systems (EuroSys), 2011. p. 343–56.
[47] Morozov A. Dual-graph model for error propagation analysis of mechatronic
systems. Beiträge aus der Automatisierungstechnik, Vogt; 2012. <http://
books.google.de/books?id=YLxilAEACAAJ>.
[48] Grinstead CM, Snell JL. Chapter 11: markov chains. In: Introduction to
probability. American Math. Society; 1997.
[49] Górajski M. Reduction of absorbing Markov chain. Ann UMCS Math 2010;63:
91–107.
Please cite this article in press as: Morozov A, Janschek K. Probabilistic error propagation model for mechatronic systems. Mechatronics (2014), http://
dx.doi.org/10.1016/j.mechatronics.2014.09.005