Building Business Cases for Risk and Reliability Technologies
V. Volovoi
Independent Consultant
Alpharetta, Georgia, USA
ABSTRACT: System reliability tools as a means for supporting business cases are considered. The resulting quantitative insights connect new technology developers with the business and management world. Recent
advances in the ability to generate (sense), collect, store, and process large amounts of data have direct implications for reliability engineering. The long-term visions of the “Industrial Internet” and the “Internet of
Things” are very attractive indeed. However, in the short term, there is a great need for incremental steps in
demonstrating the value of the associated suite of technologies. To this end, improving reliability of complex
systems and mitigating the negative impact of their failures are often listed among the short-to-medium term use
cases. The paper focuses on the desired properties of system reliability models that can quantify the benefits for
these use cases. Specifically, the models must be created relatively quickly with minimal resources and provide a high level of transparency, so that they can be documented in a way that makes them easily auditable
and understandable by the decision makers. To address these needs, the use of a new version of Stochastic Petri
Nets (SPNs), called Abridged Petri Nets (APNs), is suggested as a candidate framework. Several specific examples
are considered. In particular, condition-based maintenance (CBM) concepts with both periodic and continuous
monitoring are investigated and compared. The utility of various performance measures for CBM is discussed, including false positive rates and the correlation between the outcomes of consecutive inspections. In addition, a scenario
of introducing new technologies (such as advanced fleet-based or Big Data analytics) into the maintenance and
spares supply chain is presented.
1 INTRODUCTION
The modern capabilities for dealing with large-scale
data collection, storage, and statistical processing
have clearly become a focal point of public attention,
as testified by the popularity of buzzwords such as
“Analytics” and “Big Data” (Siegel 2013). It seems
natural to assume that these capabilities are directly
applicable for addressing the challenges of predicting
failures of complex systems. Furthermore, Amazon,
Google, and similar “New Economy” companies at
the forefront of the Big Data movement also operate
“cloud” infrastructures, and so should be interested in
preventing their failures. However, this particular example of complex engineering systems benefits from
the feasibility of affordable and highly redundant configurations, with the consequences of individual failures often negligible. This is not always the case with
other complex (e.g., safety-critical) systems.
At the same time, at least at the moment, Big Data embraces correlation at the expense of causation. The vast majority of modern analytics focuses
on domain-independent machine-learning methods.
They fall under the category of black-box as opposed
to white-box models (Blischke & Murthy 2000). In
this context, the key question is whether the failure
process of an entity is modeled with or without the
explicit recognition of individual constituents (components) that comprise the entity. Here “component”
refers to an elementary building block of a white-box
(system) model, which can correspond to a lower-level entity if models are constructed hierarchically,
or to the lowest level of the hierarchy, as determined
by practical considerations (e.g., individual modules,
such as line-replaceable units or LRU).
White-box models explicitly describe interrelationships among the entities that comprise the system,
and so can be considered as a subset of engineering models. In contrast, black-box models do not require explicit modeling of the constituents, as the focus is on the best possible mapping between the inputs and the outputs without any attempt to analyze
the causal mechanisms that lead to the observed or
predicted mapping. As a result, such methods have a broad range of applications, and they can be very useful in predicting regularly occurring events in existing
complex systems. However, they are of limited use in
predicting the behavior of new systems, the impact
of new features introduced to existing systems, or extreme events affecting existing systems for which not
enough historical data exists.
The shifting balance between causation and correlation is rather explicitly advocated (Siegel 2013), and can be considered a natural compensation for the over-reliance in the recent past on constructing causal models, which was often counter-productive. As eloquently explained by one of the living classics of psychology, humans have a clear propensity to construct causal models even when it is not rational (Kahneman 2011), so the current compensation brought in by the Big Data tide has merit. There is another important distinction that can also explain why, from this viewpoint, the subject of this paper is distinct from mainstream Big Data applications. Most current applications of Big Data focus on consumer behavior, where causal connections are tenuous at best. In contrast, the application of Big Data
to engineering systems (the so-called “Internet of Things” (Atzori et al. 2010) as applied to large industrial systems, sometimes referred to as the “Industrial Internet”) is fundamentally different from this perspective,
as the components and systems have been designed
(engineered) to follow certain logical relationships,
so that known causal relationships are present at least
most of the time. It is too early to tell how this increased role of causality will play out in the context
of the Internet of Things. However, it is clear that the
higher consequences of errors require a more direct
reliance on engineering models, and pure domain-independent machine learning is insufficient.
The juxtaposition between causality and correlation is not a zero-sum game, and as happened many
times before in the history of science, a tool from
one field can successfully migrate into another field,
thus enhancing the recipient. The obvious near-term
benefits to system reliability include drastic improvements to, and automation of, failure data collection, which provide better input for reliability models and allow verification of the quality of reliability predictions. One of the promising directions lies
at the intersection of analytics and failure modeling and anomaly detection, where large-scale experiments demonstrated the feasibility and usefulness of
automatic processing of operational data (Chu et al.
2011).
SPNs are well recognized as a flexible framework for dynamic system reliability (i.e., “white-box”) modeling. However, SPN models are generally perceived as too complicated (and even spaghetti-like) to be effective in facilitating direct communication with decision makers (Bowden 2000). As a result, there
is a recent trend for using SPN as an intermediate
layer of modeling hidden from the end user (Signoret
et al. 2013). However, this approach results in the
loss of the visibility of the modeled dynamic interactions. In contrast, a streamlined version of SPN called
Abridged Petri Nets (APNs) (Volovoi 2013) retains
modeling power of SPNs while enhancing the visual
modeling clarity, making the models more suitable for
the end user consumption and facilitating auditable
tool-independent self-documentation. APNs are briefly discussed next, along with an application-based (as opposed to a computer-science) perspective on SPNs.
2 ABRIDGED PETRI NETS
Both Markov chains and Stochastic Petri nets (SPNs)
can be viewed as state-space representations of
stochastic processes. While each state of a Markov
chain represents a possible state of the entire modeled system, Petri nets model the states of individual
components rather than the explicit states of the entire
system. Graphically, Petri nets complement Markov
chain-state diagrams with two new types of objects:
first, small filled circles (called tokens) denoting individual components are introduced, each placed inside of one of the larger hollow circles that denote
the potential states of those components (the latter
entities are named “places” as opposed to “states”
in Markov diagrams). Second, in order to model interactions among components, the tokens are routed
among places via intermediate stops or junctures,
called transitions, which are denoted with solid rectangles. No two places can be connected by an arc directly; instead, they must be connected through a
transition. The number of input and output arcs does
not need to coincide, enabling the merging and splitting of tokens and their routes.
The timing of state changes can be modeled by
specifying time delays for transition “firing”: an
atomic (i.e., indivisible) action that removes tokens
from all input places for the transition and deposits
tokens into its output places. Such Petri nets are timed
Petri nets, or, more specifically, Stochastic Petri Nets
(SPNs) (Balbo 2007, Haas 2002), when delays can be
nondeterministic and follow a specified distribution.
In simulation, no limitations on the associated types of distributions are needed. Historically, however, the name SPN often referred to models with exponentially distributed delays only, so that they could be converted to Markov chains and solved using appropriate techniques for the underlying differential equations.
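For completeness, the standard property behind this conversion can be stated explicitly: if a marking enables transitions with exponentially distributed firing delays of rates λ1, ..., λn, then the time to the next firing is exponentially distributed with rate λ1 + ... + λn, and transition i fires first with probability

  P(transition i fires first) = λi / (λ1 + ... + λn),

which are exactly the two quantities needed to define the corresponding continuous-time Markov chain.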
Inhibitors provide a “zero test” capability and are
known to increase the modeling power of Petri nets to
that of a Turing machine (Balbo 2007, Haas 2002).
Both inhibitors and enablers (non-zero test arcs) are
usually considered supplementary mechanisms for
modeling interactions rather than a replacement for
fork-joining capabilities provided by transitions. Historically, inhibitors were viewed with a certain degree of skepticism by the Petri net community, as
they make traditional analysis of structural properties
(such as reachability analysis) more complex. However, the latest view of this drawback is more nuanced, as newer algorithms can successfully handle inhibitors (Ciardo 2004). In addition, there is a sufficient number of applications (e.g.,
modeling failure and maintenance processes of complex systems) where the state-space of the problem is
relatively well understood and the main utility of the
modeling consists of quantitative performance evaluation of the system.
These considerations provide an impetus for relying on triggers as the sole (or at least the main) alternative mechanism for describing components’ interactions. This enables direct connectivity of places, thus
disposing of the hubs (transitions) altogether. When
combined with hierarchical representation, the result
is a compact yet powerful modeling framework. It
can be shown that the combination of enablers and
inhibitors with direct transitions between places allows the modeling of any system that can be modeled using Petri nets with multiple inputs and outputs to a transition (in other words, the modeling
power is not reduced). The result is referred to as
Abridged Petri Nets (APNs) (Volovoi 2013) and resembles a hybrid between traditional Markov chain state diagrams and Petri nets: transitions connect places
directly (similar to Markov chains), but tokens are
present to represent individual components of the system (similar to Petri nets). Importantly, the tokens
have discrete labels (colors) as well as continuous labels (age) (Volovoi 2004).
Transitions as the hubs or junctures of tokens’
movements (depicted as rectangles) provide an ingenious mechanism for modeling components’ interactions in classical SPNs. The original idea of Petri
nets was conceived in the context of chemical reactions (Petri & Reisig 2008), where the merging (joining) and splitting (forking) of entities is quite common, which might explain the fundamental role of
transitions that provide a direct means of modeling
these processes. The merging and splitting is also important in the context of informational flows. In other
applications such events occur as well, but equally
if not more important are the changes that occur
to entities individually. In fact, there are clear competitive advantages to modular structuring of complex systems (Simon 2002), which might explain its
prominence in both natural and engineering systems
leading to so-called near-decomposability (Courtois
1977). With this consideration in mind, entities that
comprise complex systems can be considered as operating mainly independently of each other, with interactions occurring relatively rarely. However, when
these interactions do occur, they are critical to the system’s behavior.
The APN modeling framework reflects this near-decomposability and considers independent (parallel) behavior as the default, while explicitly focusing
attention on the interactions. The joining (merging)
and forking (splitting) of entities represent important
mechanisms for modeling a system’s interactions, and
for those types of interactions the use of the junctures
for routing tokens among places is appropriate. However, quite often the actual mechanisms are related to
enabling or inhibiting state transitions, while joining
and forking provide a means of modeling such mechanisms. The difference is subtle and often not noticeable when tokens are indistinguishable (as in classical SPNs), but the consequences become significant
when the entities represented by tokens have distinct
identities (labels, such as colors (Jensen 1993)). Next, the specific properties of APNs are briefly described; a minimal simulation sketch in Python follows the list.
1. An APN is defined as a network of places (denoted as hollow large circles) that are connected
by directed arcs (transitions). Changes in the system’s state are modeled by transition firing: i.e.,
moving a token from the transition’s input place
to its output place. The combined position of
APN tokens at any given moment represents the
net marking and fully specifies the modeled system.
2. Each transition has no more than a single input
and single output place (if an input place is of
the source type, it generates a new token every
time the transition fires, while the output place
can be of the sink type: upon the firing of such a
transition, a token is removed from the net).
3. The use of token junctures, which correspond to the transitions used in regular SPNs, is optional. While junctures do not increase the modeling power, there are situations where the merging and splitting of tokens allows for more compact models. In contrast to SPNs, these entities are relatively rare, and they are analogous to the batching and separating building blocks in process-interaction frameworks for discrete-event simulations (Law & Kelton 2000). These entities are not used in the current paper.
4. Each token can have a discrete label (color) that
can change when a token is fired in accordance
with the policy specified by the firing transition.
In addition, tokens have continuous labels (ages)
that can change both when tokens move, and
with the progression of time while a token stays
in the same place (the latter property is specified by the aging transition for the place, which
is not necessarily the same as the firing transition (Volovoi 2004)).
5. A transition is enabled or disabled based on the
combined marking of the input places of the associated triggers (inhibitors and enablers). Inhibitors are depicted as arcs originating at a place
and terminating at a transition with a hollow circle. An inhibitor of multiplicity k disables a transition that it terminates at if the number of tokens
in its input place is at least k. An enabler (depicted as an arc originating at a place and terminating at a transition with a filled circle) is
opposite to an inhibitor: a transition is disabled
unless an enabler of multiplicity k has at least
k tokens in its input place (Volovoi 2006). Enablers are effectively test arcs (Christensen & Hansen 1993), which are used in systems biology modeling (Matsuno et al. 2003), where they
are denoted with directed dashed arcs; notations
used in this paper are chosen to emphasize the
fact that enablers are the opposite of inhibitors.
It can be shown that the use of triggers fully compensates for the single-input, single-output limitation on transitions (no modeling power is lost as a result).
6. Transitions have color- and age-dependent policies that specify the delay between the moment
when the token is enabled and when it is fired
(for example one can specify separate distributions for distinct colors, while the age can accumulate as the cumulative distribution function of
the aging transition). If a token-transition pair is
enabled, a firing delay is specified based on the
combination of token and transition properties.
If the token stays enabled throughout the delay,
after this delay expires the token is fired. If there
are multiple enabled tokens in the same place,
they all can participate in the firing “race” in parallel. Similarly, the same token can be involved
in a race with several transitions. If a token-transition pair is disabled, the firing is preempted
and the aging label of the token can change as
a result of being enabled for a finite amount of
time.
7. The delays can be deterministic (including zero
delay) or follow any specified random distributions. The firing after a specified delay is
“atomic”: it is a single action of moving a token
from an input place to an output place (tokens
don’t dwell between places as they potentially do
in some versions of SPNs (Bowden 2000)).
8. The performance of the system is based on the statistical properties of the marking of the system. “Sensors” or “listeners” at each place (denoted with a small filled square at the bottom right of the place) can evaluate the probability and the number of times that a given threshold on the number of tokens is crossed, or evaluate the relevant statistics about the number of tokens at a given place.
In the latter case the correlation matrix for all
results can be evaluated as well, providing the
mechanism for calculating the variances (in addition to the mean values) of global metrics that
aggregate the readings of individual sensors.
9. Place fusion, commonly used in hierarchical Petri nets (see, for example, (Jensen 1993)), is employed to connect different parts of the model.
Fused places appear as distinct graphical entities
during the model construction, but represent the same entity in simulation. This feature is not directly employed in the current paper, but it is critical for creating system-level models with multiple components.

Figure 1: APN model for CBM with continuous monitoring (places Good, Damaged, After Detection, Replacement, and Failure, connected by transitions T1–T9)
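To make the preceding description concrete, the following minimal Python sketch (illustrative only, not the author's tool; all names are chosen here) implements the core APN ingredients: tokens with a discrete label (color), single-input/single-output transitions with per-color delay policies and color changes, and enabler/inhibitor triggers. The timing semantics are deliberately simplified: delays are resampled after every event, which is exact only for memoryless delays, and aging policies are reduced to bookkeeping of the time of the last move.

import random
from dataclasses import dataclass, field

@dataclass(eq=False)
class Token:
    color: int = 0
    age: float = 0.0

@dataclass
class Transition:
    src: str                                   # single input place
    dst: str                                   # single output place
    delay: dict                                # color -> sampler(rng); missing color = disabled
    color_change: int = 0
    enablers: list = field(default_factory=list)    # [(place name, multiplicity k), ...]
    inhibitors: list = field(default_factory=list)  # [(place name, multiplicity k), ...]

def triggers_ok(tr, places):
    """A transition is enabled only if every enabler holds and no inhibitor fires."""
    return (all(len(places[p]) >= k for p, k in tr.enablers) and
            all(len(places[p]) < k for p, k in tr.inhibitors))

def step(places, transitions, now, rng):
    """Sample a delay for every enabled (token, transition) pair and fire the earliest one."""
    candidates = []
    for tr in transitions:
        if not triggers_ok(tr, places):
            continue
        for tok in places[tr.src]:
            sampler = tr.delay.get(tok.color)
            if sampler is not None:
                candidates.append((now + sampler(rng), tok, tr))
    if not candidates:
        return None                              # dead marking
    t_fire, tok, tr = min(candidates, key=lambda c: c[0])
    places[tr.src].remove(tok)                   # atomic firing: move the token ...
    tok.color += tr.color_change                 # ... updating its discrete label
    tok.age = t_fire                             # crude age bookkeeping (time of last move)
    places[tr.dst].append(tok)
    return t_fire

# Toy example: one component alternating between "Up" and "Down".
rng = random.Random(42)
places = {"Up": [Token()], "Down": []}
transitions = [
    Transition("Up", "Down", {0: lambda r: r.expovariate(1 / 100.0)}),  # failure, MTTF 100 h
    Transition("Down", "Up", {0: lambda r: r.expovariate(1 / 5.0)}),    # repair, MTTR 5 h
]
t, failures = 0.0, 0
while t is not None and t < 1000.0:
    down_before = len(places["Down"])
    t = step(places, transitions, t, rng)
    if t is not None and t < 1000.0 and len(places["Down"]) > down_before:
        failures += 1
print("failures observed in 1000 h:", failures)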
3 CBM WITH CONTINUOUS MONITORING
There is a significant body of literature devoted to evaluating CBM performance, including descriptions of best practices; see, for example, (ADS-79D-HDBK 2013). Two types of performance measures
can be identified:
1. Diagnostic, or classification performance measures that quantify the ability to classify the entities into the groups directly related to specific
maintenance guidance. In the simplest scenario,
a two-way classification (“good,” no maintenance action vs. “bad,” triggering a maintenance
action) can be considered. Multi-way classifications are also possible, such as score cards described in (ADS-79D-HDBK 2013) that colorcode the entities in accordance with the urgency
of the maintenance actions. In either case, the
classification corresponds to some physical difference (fault) that can be independently verified (e.g., presence of cracks of a certain size, presence of debris, etc.). The corresponding performance measures (such as false positive and false
negative rates) can be evaluated based on success/failure criteria using standard statistical procedures.
2. Prognostic, or reliability, performance measures that are directly related to the timing of a failure (or a degradation that has direct negative economic
consequences). To this end, remaining useful life
(RUL) or P-F interval, which is characterized by
a distribution, is usually discussed. The utilization of this type of measure is significantly more vague. While measures for the accuracy of RUL prediction have been proposed based on the comparison between the predicted and the actual RUL, neither specific performance requirements for this measure nor statistical procedures for evaluating it are provided (ADS-79D-HDBK 2013). The main difficulty of using
accuracy-based performance measures for RUL
prediction is that RUL is an intermediate indicator that is further transformed into a maintenance
decision, and so the consequences of errors in
predicting RUL must be assessed in the context
of the quality of that decision (i.e., how those
errors affected the decision). If a RUL prediction error followed a well-behaved distribution,
such as a normal distribution centered around
zero (so that the error shrinks as the prognostic
distance decreases), averaging the errors would
provide a consistent performance measure. However, this is not always the case, as described next, and making this assumption can lead to incorrect conclusions (a toy numerical illustration follows this list).
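As a toy numerical illustration of this point (all numbers below are hypothetical and chosen only to show the effect), consider a bi-modal RUL prediction-error sample: averaging the errors produces a modest-looking figure even though a tenth of the predictions are off by enough to change the maintenance decision.

import random

rng = random.Random(0)
# 90% of predictions: small, well-behaved error; 10%: gross error of ~60 h
# (e.g., a mistaken identification of the fault). All numbers are hypothetical.
errors = [rng.gauss(0.0, 5.0) if rng.random() < 0.9 else rng.gauss(60.0, 10.0)
          for _ in range(100_000)]
mean_error = sum(errors) / len(errors)
gross = sum(abs(e) > 30.0 for e in errors) / len(errors)
print("mean RUL error (h):", round(mean_error, 1))             # looks modest (~6 h)
print("share of decision-changing errors:", round(gross, 3))   # ~10%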
There is no consensus about the relationship between
the two types of metrics. For example, it is not even
clear whether RUL is sufficient on its own or whether classification measures (such as false positive and negative rates) are also needed. Theoretically, if the actual P-F interval distribution is fully known for each
entity, one can recover both the missed defects and
the false positive rates. Indeed, let us consider an entity that is flagged to be in need of a maintenance action. The actual P-F distribution can be visualized as
having a bi-modal shape: the first (larger) peak corresponds to the correct identification of the fault, while the second (smaller) peak, located at larger time values, corresponds to a mistaken identification of the fault, allowing for the possibility that the RUL is quite large.
A similar bi-modal structure of the P-F distribution arises for a part that is deemed normal by the CBM algorithm (with the order of the peaks by size reversed: the first, smaller peak corresponds to the missed defects). Depending on the nature of the features used
for fault identification, the distribution of RUL can be
more complicated. However, from the practical perspective, either some assumptions are made regarding the shape of the underlying distribution, or the first
two moments (mean and variance) of this distribution
are conveyed. As a result, classification performance
measures are often lost in translation (see for example (Feldman et al. 2009)) and accuracy-based performance measures are of limited use.
A pragmatic approach consists of combining both
types of performance measures, making them complementary to each other. Specifically, this implies
that the resolution of RUL prediction should match
the classification groups that correspond to the specific maintenance actions. For example, if there are four groups used in a score card (ADS-79D-HDBK 2013) going from green to red, a distribution of RUL would be estimated for the entire sub-population of each group (without attempting to make a finer distinction of RUL within each group, other than perhaps accounting for the age (usage) of an entity). This sub-population RUL can then be directly related to the maintenance urgency, based on risk acceptance, in a rigorous quantitative manner. Let us next demonstrate this approach for a two-way classification.
A model of a single-component CBM with continuous monitoring (Volovoi 2012) is considered and implemented in APN. The resulting model is shown in figure 1, with transition properties provided in table 1. A two-phase degradation is considered, with the parameters of the transition from the “Good” state (or place, as it is referred to in Petri nets) to the “Damaged” place (transition T1 in figure 1) and of the transition from the “Damaged” place to the “Failure” place (transition T8 in figure 1) provided in (Goode et al.
2000). The token color in the model represents the outcome of the CBM technology: color 0 indicates that the component is deemed to be normal and should remain in operation, while color 1 indicates a damaged diagnosis by CBM. Correspondingly, the fast transitions T4 and T6 are only enabled for color 1. In general, the duration of these two transitions corresponds to the maintenance delay (how long it takes for a component deemed damaged to be replaced) (Volovoi 2012). In the considered example, an immediate maintenance action is modeled to facilitate comparison with the periodic inspection scenarios.
When the transition T1 fires the token, the modeled component transitions to the damaged state. The two outgoing transitions from the “Damaged” place correspond to a probabilistic choice: the firing of the token by transition T2 corresponds to the situation where the damage is detected by the CBM technology, and the token changes its color from 0 to 1; otherwise, the damage is undetected, the token is fired by transition T3, and its color remains 0. A simple way to implement this probabilistic choice is to assign fast exponential distributions to the firing delays of the involved transitions, with the relative rates proportional to the desired probabilities.
In the example considered, T2 and T3 are chosen to represent a 95% detection rate (see table 1). A similar arrangement is made to model false positives. The scale of T5 is chosen so that for a given component there is a 5% chance that it will be erroneously identified as damaged while it was in the normal state (a 5% false positive rate, or 95% specificity). Note that T5 is chosen to follow a Weibull distribution with the same shape parameter as the T1 transition. As discussed in (Volovoi & Vega 2012), the shape of the distribution is important for the outcome of the race between two transitions. For example, selecting an exponential distribution for T5 with the same 19:1 ratio of mean delays relative to T1 would provide a significantly different (in this case, larger) false positive rate.
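A short Monte Carlo sketch (Python standard library only; the scales in part (2) are generic and purely illustrative) of both mechanisms discussed above: the probabilistic choice implemented by fast competing exponential transitions, and the fact that the winner of a race between two transitions depends on the shape of the delay distributions and not only on the ratio of their means.

import random

N = 200_000
rng = random.Random(1)

# (1) Probabilistic choice via two "fast" competing exponential transitions:
#     with rates 9.5E+06 and 5.0E+05 (a 19:1 ratio), the faster branch wins
#     with probability 9.5/(9.5 + 0.5) = 95%, mirroring the T2/T3 arrangement.
detect = sum(rng.expovariate(9.5e6) < rng.expovariate(5.0e5) for _ in range(N))
print("P(detection branch wins) ~", detect / N)                 # ~0.95

# (2) The shape of the competing distributions matters, not just the 19:1
#     ratio of mean delays (generic scales used here for illustration).
slow_exp = sum(rng.expovariate(1 / 19.0) < rng.expovariate(1 / 1.0)
               for _ in range(N))
slow_wbl = sum(rng.weibullvariate(19.0, 2.91) < rng.weibullvariate(1.0, 2.91)
               for _ in range(N))
print("slower competitor wins, exponential race:    ", slow_exp / N)  # ~0.05
print("slower competitor wins, Weibull race (k=2.91):", slow_wbl / N)  # much smaller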
Name   Color change   Color   Policy type   Par 1      Value (1000 h)          Par 2   Value
T1     0              0       Weibull       Scale      0.526                   Shape   2.91
T2     +1             0       Exponential   Rate       9.5E+06
                      1       None
T3     0              0       Exponential   Rate       5.0E+05
T4     0              0       None
                      1       Fixed         Duration   1E-06
T5     +1             0       Weibull       Scale      9.994                   Shape   2.91
       0              1       None
T6     0              0       None
                      1       Fixed         Duration   1E-06
T7     -1             0       Fixed         Duration   1E-06
T8     0              0       Weibull       Scale      0.222                   Shape   1.03
T9     -1             0       Fixed         Duration   1E-06
T10    0              0       Fixed         Duration   2E-06
T11    0              0       Fixed         Duration   Varies (see Table 2)
Table 1: Transition descriptions for the APN CBM model
Figure 2: APN model for CBM with periodic inspection (the model of figure 1 augmented with the Calendar and Periodic Inspection places and transitions T10 and T11)
4 CBM WITH PERIODIC INSPECTIONS
Next, we develop a model with periodic inspection
that is consistent with the model developed in the
previous section for continuous monitoring. Specifically, we consider a situation where the observations
occur periodically, and those observations effectively
identify a “signature” (feature or pattern) of a fault
of interest in a repeatable (and usually automated) fashion. Since this “signature” is imperfect, there are cases where both false positive and missed defect errors occur. However, in this model we neglect the errors (“noise”) introduced by a single periodic inspection (in other words, if an inspection is repeated immediately after another inspection, no additional benefit is obtained). This model is appropriate for automated repeatable procedures and can be contrasted
with a model of periodic inspections where errors are
specific to a particular inspection instance (Volovoi
2007). The latter is more appropriate for independent inspections that have individual characteristics,
for example, a visual inspection for cracks.
T11 (h)   Failures   Replacements
1000      12.27       2.66
500        9.96       5.99
200        6.53      11.04
100        4.28      14.39
50         2.79      16.65
20         1.75      18.25
10         1.37      18.83
0          1.00      19.40
Table 2: Expected number of failures and replacements for the CBM model

It must be noted that the independence of inspection results must be treated with caution, especially when it is used to justify a very high overall success rate of a procedure. For example, (ADS-79D-HDBK 2013) considers a scenario where the individual probability of detection of a fault is 0.9, with
six inspections assumed to achieve a six-nines probability of detection. The situation is similar to common-cause failures in redundant systems, as it is
highly unlikely that the inspections are truly independent. Indeed, the defect location and orientation are
common factors affecting the probability of detection
(e.g., a hard-to-reach location can combine with an orientation that is the least detectable from the viewpoint
provided by the likely access to the inspected specimen).
The APN model is shown in figure 2. The model is similar to the one for continuous monitoring (figure 1), except that two places are added in the top-right portion of the model to represent the periodic inspection schedule. The calendar token needs to be in the “Periodic Inspection” place for the transitions T6 and T4 to be enabled (see the two enablers to those transitions originating in the “Periodic Inspection” place). Transition T10 is chosen to be slow enough to make sure that T6 and T4 are enabled long enough to fire the component token that has color 1 (i.e., the duration of T10 is longer than the corresponding policies of those two transitions for color 1; see table 1).
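A simplified Monte Carlo sketch of this periodic-inspection model is given below (a hand-rolled renewal simulation, not the APN itself, under the assumptions stated above: the classification outcome is fixed at damage onset, inspections follow a fixed calendar, and maintenance is immediate; false positives are neglected for brevity). It is expected to reproduce the qualitative trend of table 2 rather than the exact figures.

import math
import random

# Weibull parameters (hours) from table 1, scales converted from the 1000-h units.
SCALE_GD, SHAPE_GD = 526.0, 2.91      # Good -> Damaged (T1)
SCALE_DF, SHAPE_DF = 222.0, 1.03      # Damaged -> Failure (T8)
P_DETECT = 0.95                       # probability the damage signature is detectable
TAU = 10_000.0                        # simulated horizon, hours

def one_history(interval, rng):
    """Failures and replacements over TAU for a fixed inspection interval (0 = continuous)."""
    t, failures, replacements = 0.0, 0, 0
    while t < TAU:
        t_damage = t + rng.weibullvariate(SCALE_GD, SHAPE_GD)
        t_fail = t_damage + rng.weibullvariate(SCALE_DF, SHAPE_DF)
        if rng.random() < P_DETECT:
            # next calendar inspection at or after damage onset (immediate if continuous)
            t_detect = (math.ceil(t_damage / interval) * interval if interval > 0
                        else t_damage)
        else:
            t_detect = math.inf               # missed defect: never detected
        if t_detect < min(t_fail, TAU):
            replacements += 1
            t = t_detect                      # renewed with a new component
        elif t_fail < TAU:
            failures += 1
            t = t_fail
        else:
            break
    return failures, replacements

def estimate(interval, n=20_000, seed=0):
    rng = random.Random(seed)
    f_total = r_total = 0
    for _ in range(n):
        f, r = one_history(interval, rng)
        f_total += f
        r_total += r
    return f_total / n, r_total / n

for interval in (1000, 500, 100, 0):
    print(interval, estimate(interval))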
The results of 1 million Monte Carlo simulations for τ = 10,000 hours and various inspection intervals are shown in table 2. The last row corresponds to the continuous monitoring results. The expected numbers of replacements and failures are shown, so combining this information with the costs of each type of event can provide a foundation for cost-benefit analysis, as illustrated below.
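As a toy example of such a cost-benefit combination (the unit costs are hypothetical, and the cost of performing inspections or installing continuous monitoring is deliberately left out), one can weight the expected counts from table 2 by per-event costs and compare intervals:

# Unit costs below are assumed purely for illustration.
COST_FAILURE, COST_REPLACEMENT = 50_000.0, 5_000.0

table2 = {   # inspection interval (h): (expected failures, expected replacements)
    1000: (12.27, 2.66), 500: (9.96, 5.99), 200: (6.53, 11.04), 100: (4.28, 14.39),
    50: (2.79, 16.65), 20: (1.75, 18.25), 10: (1.37, 18.83), 0: (1.00, 19.40),
}

cost = {t: f * COST_FAILURE + r * COST_REPLACEMENT for t, (f, r) in table2.items()}
for t in sorted(cost, reverse=True):
    print(f"interval {t:>4} h: expected cost {cost[t]:>10,.0f}")
print("lowest-cost interval:", min(cost, key=cost.get), "h")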
To put these numbers in perspective, we note that the mean time to failure of the modeled component is MTTF = 688.4 hours, so the run-to-failure (RTF) scenario results in m∞ = τ/MTTF = 14.53 failures for the modeled time duration. Accounting for the finite duration (starting with a new component and simulating τ units of time) provides a very close value, m_RTF = 14.11. Finally, an age-based replacement every 500 hours would result in m_a500 = 5.34 failures with an additional 15.46 replacements.
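The MTTF and m∞ values follow directly from the Weibull parameters in table 1 (the mean of a Weibull with scale η and shape k is ηΓ(1 + 1/k)); a short check using only the Python standard library:

from math import gamma

# Weibull means: scale * Gamma(1 + 1/shape); scales from table 1, converted to hours.
mttf = 526.0 * gamma(1 + 1 / 2.91) + 222.0 * gamma(1 + 1 / 1.03)
tau = 10_000.0
print("two-phase MTTF ~", round(mttf, 1), "h")        # ~688 h
print("m_inf = tau / MTTF ~", round(tau / mttf, 2))   # ~14.5 run-to-failure failures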
5 UTILIZATION OF ANALYTICS
Let us consider the following scenario, which demonstrates the interaction of several technologies and a potential utilization of the APN capabilities. The new aspect, as compared to the CBM scenarios above, is the feasibility of a two-phase process (similar to what is often done in health care, where a first screening test can have potentially high false positive rates but still serves a useful purpose). The end-items (systems) have repairable line-replaceable units (LRUs) that have unique serial numbers and upon repair can be installed on an end-item that is different from the original. A portion of the LRU population transitions during operation to a “bad actor” state
(we assume that it happens to 10% of the population
every operational cycle). A regular part has a mean
time to failure (MTTF) of 250 days, while a repaired
“bad actor” part has MTTF of 50 days.
We consider a population of 200 parts; the total duration of the regular repair cycle is eight days (two days of retrograde shipment to the depot, five days of repair, and one day of shipment back from the depot). A business case is considered for using new testing equipment that detects the bad actors. The test takes one day, with only one item being tested at a time (so queues are possible). An item determined to be a “bad actor” undergoes an additional repair or replacement that takes two additional days (on top of the five days required for regular repair). Finally, there is fleet-level analytics relying on Big Data technologies (e.g., as described in (Chu et al. 2011)) that provides an (imperfect) means to predict (based on the usage history and other relevant “markers”) whether a part is a “bad actor”.
The APN model is shown in figure 3, and the corresponding parameters are given in table 3. This screening procedure has 10% false positive and 10% false negative rates (those rates can be changed by adjusting the parameters of transitions T4 and T5). The very definition of a “bad actor” is directly related to the results of the additional test (the units that fail the test are designated as the bad actors, and their performance is shown to be inferior to that of the units that pass the test). Three scenarios are considered: the baseline with no additional testing, a scenario where all the units are tested, and, finally, an “Analytics” scenario that allows selective testing: only items flagged by the fleet-wide analytics are tested. The parameters in table 3 correspond to the third scenario. In order to obtain the scenario with no testing, the right branch of the model is turned off by disabling the T5 transition for both colors. Similarly, turning off the T4 transition for both colors routes all units through the additional testing.
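For readers who prefer a conventional discrete-event view, the following sketch re-implements the scenario with SimPy (an assumed third-party package; the policy names and the choice of tallying failure counts rather than mean operating units are illustrative simplifications, and the additional test itself is treated as perfect):

import random
import simpy   # assumed dependency (pip install simpy); not the author's APN tool

RATE_REGULAR = 0.0036       # failure without conversion, per day (table 3, T1 color 0)
RATE_CONVERT = 4e-4         # failure with conversion to bad actor, per day (T2)
RATE_BAD = 0.02             # bad-actor failure rate, per day (T1 color 1)
SHIP_TO, REPAIR, SHIP_BACK = 2.0, 5.0, 1.0
TEST_TIME, EXTRA_REPAIR = 1.0, 2.0
FP, FN = 0.10, 0.10         # analytics false positive / false negative rates

def part(env, test_station, policy, counts):
    bad = False
    while True:
        if bad:
            dt = random.expovariate(RATE_BAD)
        else:
            dt = random.expovariate(RATE_REGULAR + RATE_CONVERT)
            if random.random() < RATE_CONVERT / (RATE_REGULAR + RATE_CONVERT):
                bad = True                         # this failure also creates a bad actor
        yield env.timeout(dt)                      # time in operation until failure
        counts["failures"] += 1
        yield env.timeout(SHIP_TO)
        flagged = (policy == "all") or (
            policy == "analytics" and random.random() < (1 - FN if bad else FP))
        if flagged:
            with test_station.request() as req:    # single test station, so queues form
                yield req
                yield env.timeout(TEST_TIME)
            if bad:                                # the test itself is treated as perfect
                yield env.timeout(EXTRA_REPAIR)
                bad = False                        # bad actor repaired/replaced
        yield env.timeout(REPAIR + SHIP_BACK)

def run(policy, n_parts=200, horizon=300.0, seed=1):
    random.seed(seed)
    env = simpy.Environment()
    test_station = simpy.Resource(env, capacity=1)
    counts = {"failures": 0}
    for _ in range(n_parts):
        env.process(part(env, test_station, policy, counts))
    env.run(until=horizon)
    return counts["failures"]

for policy in ("none", "all", "analytics"):
    print(policy, "-> failures in 300 days:", run(policy))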
There are several performance measures that can be
considered, including costs based on the expected
number of different maintenance actions (similar to
what was considered in the previous example). For
the sake of variety we focus instead on the availability
measure given by the number of operating units. Figure 4 shows the expected number of operating units
as a function of time for the first 300 days of operation.

Figure 3: APN model of a simple repair process with a “bad actor” sub-population (places Operation, Depot, Regular Repair, Return, Additional Test Queue, Additional Test, Rep. Bad Actors, and Bad Actor, connected by transitions T1–T11)

Name   Color change   Color   Policy type   Par 1      Value (days)   Description
T1     0              0       Exponential   Rate       0.0036         Regular failure rate
                      1       Exponential   Rate       0.02           Bad-actor failure rate
T2     +1             0       Exponential   Rate       4E-04          Failure with conversion to bad actor
                      1       None
T3     0              0       Fixed         Duration   2.0            Shipment to depot
T4     0              0       Exponential   Rate       9.0E+06        True negative
                      1       Exponential   Rate       1.0E+06        False negative
T5     +1             0       Exponential   Rate       1.0E+06        False positive
       0              1       Exponential   Rate       9.0E+06        True positive
T6     0              0       Fixed         Duration   1E-06          Immediate
T7     0              0       Fixed         Duration   1.0            Additional test time, regular items
                      1       None
T8     0              0       None
                      1       Fixed         Duration   1.0            Additional test time, bad actors
T9     -1             0       Fixed         Duration   2.0            Repair/replacement of bad actors
T10    0              0       Fixed         Duration   5.0            Regular repair
T11    0              0       Fixed         Duration   1.0            Shipment from depot
Table 3: Parameter descriptions for the APN model of the bad-actor problem

For clarity, the durations of the depot actions are chosen to be deterministic, while in reality those durations would vary, allowing for smoother trends. One
can observe a kink when the units start returning from
the depot and the three curves diverge. After that, the
curve with no testing continues its downward slope,
as the portion of “bad actors” in the population will
continue to increase. In contrast, both the selective and the full test scenarios converge to a steady-state solution relatively fast, with selective testing providing superior performance.
For the given set of parameters, when all units are tested there is a queue for the additional testing that, on average, holds about 1.18 units, while selective testing results in a negligibly small average number of units in the queue. If the duration of the additional test is increased to two days, the system effectively becomes unstable, with the queue length growing roughly linearly with time until the number of operating units becomes so small that the system reaches an equilibrium. Selective testing is even more beneficial in this case. The example demonstrates how a combination of two technologies (in this case, the introduction of additional testing and the fleet-wide analytics) provides benefits not achievable by each technology alone. One can envision evaluating the benefits of alternative operational policies, such as testing all units but only once per several repair cycles, etc.
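The reported instability can also be anticipated with a simple utilization argument (a rough lower-bound estimate, since bad actors fail more often than the 250-day MTTF assumed here):

# Back-of-the-envelope utilization check for the single test station in the
# "all tested" scenario: ~200 operating units with a 250-day MTTF send roughly
# 0.8 items per day to the test (faster-failing bad actors are ignored here).
arrival_rate = 200 / 250.0
for test_days in (1.0, 2.0):
    rho = arrival_rate * test_days
    verdict = "stable queue" if rho < 1.0 else "unstable: queue grows until operation shrinks"
    print(f"test duration {test_days} d -> utilization {rho:.2f} ({verdict})")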
Figure 4: APN results for the expected number of operating units (mean number of operating units versus time, in days, over 300 days for the No Testing, All Tested, and Selective Testing scenarios)
6 CONCLUSIONS
System reliability tools can rigorously support a business case for a new technology. Quantitative evaluation of the value proposition provides a vital link between the technology developers and the business
and/or management world. Implementation of a new
technology, such as prognostic health management or
other changes in maintenance operations, carries significant upfront costs. As a result, providing a convincing and credible assessment of the value proposition is
of great importance. A well-constructed system reliability model created to support a business case clarifies the quantitative requirements for the performance
measures of the developed technology. This not only facilitates obtaining the green light for the technology implementation, but also provides clear guidelines during
the implementation itself. Ideally, the original system-reliability model gets updated as the technology is developed, to reflect more detailed knowledge about the operation of the new system. To address these
needs, the use of a new version of Stochastic Petri
Nets (SPNs) called Abridged Petri Nets (APNs) is suggested. Several specific examples are considered. In
particular, condition-based maintenance (CBM) concepts with both periodic and continuous monitoring
are investigated and compared. The utility of various performance measures for CBM is discussed, including false positive rates and the correlation between the outcomes of consecutive inspections. In addition, a scenario of introducing new technologies (such as advanced fleet-based or Big Data analytics) into the
maintenance and spare supply chain is presented. In
all examples, the models are fully described in a tool-independent fashion: the results can be reproduced without the need for specific software.
REFERENCES
ADS-79D-HDBK (2013). Condition Based Maintenance System
for US Army Aircraft. Aeronautical Design Standard Handbook. US Army Research, Development, and Engineering
Command, Aviation Engineering Directorate.
Balbo, G. (2007). Introduction to generalized stochastic Petri
nets. In M. Bernardo and J. Hillston (Eds.), Formal Methods
for Performance Evaluation, Volume 4486 of Lecture Notes
in Computer Science, pp. 83–131. Springer-Verlag.
Blischke, W. & D. Murthy (2000). Reliability. Modeling, Prediction, and Optimization. New York, NY: Wiley.
Bowden, F. (2000). A brief survey and synthesis of the roles
of time in Petri nets. Mathematical and Computer Modelling 21, 55–68.
Christensen, S. & N. D. Hansen (1993). Coloured Petri nets
extended with place capacities, test arcs and inhibitor arcs.
In M. Ajmone Marsan (Ed.), Application and Theory of
Petri Nets, Lecture Notes in Computer Science, pp. 186–205.
Springer Berlin Heidelberg.
Chu, E., D. Gorinevsky, & S. Boyd (2011). Scalable statistical
monitoring of fleet data. In World IFAC Congress, Milano,
Italy.
Ciardo, G. (2004). Reachability set generation for Petri nets: Can
brute force be smart? In Application and Theory of Petri
Nets, Volume 3099 of Lecture Notes in Computer Science,
pp. 17–34. Springer Berlin Heidelberg.
Courtois, P. J. (1977). Decomposability: queueing and computer
system applications. New York, NY: Academic Press.
Feldman, K., T. Jazouli, & P. Sandborn (2009). A methodology for determining the return on investment associated with
prognostics and health management. IEEE Trans. on Reliability 58(2), 305–316.
Goode, K. B., J. Moore, & B. Roylance (2000). Plant machinery working life prediction method utilizing reliability and
condition-monitoring data. Proceedings of the Institution of
Mechanical Engineers, Part E: Journal of Process Mechanical Engineering 214(2), 109–122.
Haas, P. J. (2002). Stochastic Petri Nets. Modelling, Stability,
Simulation. New York: Springer.
Jensen, K. (1993). Coloured Petri Nets. Basic Concepts, Analysis Methods and Practical Use, Volume 1. Berlin: Springer.
Kahneman, D. (2011). Thinking, Fast and Slow. New York, NY:
Farrar, Straus and Giroux.
Atzori, L., A. Iera, & G. Morabito (2010). The Internet of Things: A survey. Computer Networks 54, 2787–2805.
Law, A. & W. Kelton (2000). Simulation Modeling and Analysis
(3rd ed.). New York, NY: McGraw-Hill.
Matsuno, H., Y. Tanaka, H. Aoshima, A. Doi, M. Matsui, &
S. Miyano (2003). Biopathways representation and simulation on hybrid functional Petri net. In Silico Biology 3(3), 389–404.
Petri, C. & W. Reisig (2008). Petri net. Scholarpedia 3, 6477.
Siegel, E. (2013). Predictive Analytics: The Power to Predict
Who Will Click, Buy, Lie, or Die. New York, NY: Wiley.
Signoret, J.-P., Y. Dutuit, P.-J. Cacheux, C. Folleau, S. Collas,
& P. Thomas (2013). Make your Petri nets understandable:
Reliability block diagrams driven Petri nets. Reliability Engineering and System Safety 113, 61–75.
Simon, H. A. (2002). Near decomposability and the speed of
evolution. Industrial and Corporate Change 11(3), 587–
599.
Volovoi, V. (2006). Stochastic Petri nets modeling using SPN. In
Proceedings of Annual Reliability and Maintainability Symposium, pp. 75–81. IEEE.
Volovoi, V. (2007). Developing system-level maintenance policies using stochastic Petri nets with aging tokens. In Proceedings of Annual Reliability and Maintainability Symposium,
pp. 2007RM–170. IEEE.
Volovoi, V. (2012). IVHM – The Business Case, Chapter Quantification of System-Level Business Effects of IVHM. Warrendale, PA: Society of Automotive Engineers (SAE) International.
Volovoi, V. (2013). Abridged Petri nets. arXiv:1312.2865.
Volovoi, V. & R. V. Vega (2012). On compact modeling of coupling effects in maintenance processes of complex systems.
International Journal of Engineering Science 51, 193–210.
Volovoi, V. V. (2004). Modeling of system reliability using Petri
nets with aging tokens. Reliability Engineering and System
Safety 84(2), 149–161.