PHMC 13 067
Thomas Hubauer1, Steffen Lamparter2, Mikhail Roshchin3, Nina Solomakhina4, and Stuart Watson5
1,2,3 Siemens AG, Munich, 81739, Germany
[email protected]
4 Siemens AG, Munich, 81739, Germany
[email protected]
5 Siemens Industrial Turbomachinery Ltd, Lincoln, LN5 7FD, United Kingdom
[email protected]
Annual Conference of the Prognostics and Health Management Society 2013
Table 3. Processed data
Figure 7. (a) Oscillatory signals. (b) Noisy data: BSignal measurements look like white noise. (c) RSignal shows divergent values. (d) SSignal and GSignal have rise/drop amplitudes that differ from those of the other duplicate sensors.
Thus, data typically does not expire or become irrelevant within a few years; on the contrary, it has to be stored for decades.

In conclusion, for successful information analysis it is crucial to determine how reliable the data is and to bring it into a representation convenient for the required purposes. In the following section we discuss methods and techniques developed to achieve this goal.

5. DATA ANALYSIS

This chapter examines techniques that help to make use of low-quality data. First, data cleaning methods are described and a proposal to improve them is made. Data analysis techniques are described in the second part of the chapter.

5.1. Quality assessment and cleaning

Data cleaning follows several directions, and existing techniques are aimed at particular problems (Rahm & Do, 2000): duplicate identification and elimination, data transformations, schema matching, data mining approaches and others. There are also unified techniques. The main scientific approaches include statistical, machine learning and knowledge-based approaches. In general, however, any data cleaning technique should satisfy several requirements (Rahm & Do, 2000):

• It should detect and remove all major errors and inconsistencies, both in individual data sources and when integrating multiple sources.
• It should be supported by tools to limit manual inspection and programming effort, and be extensible to easily cover additional sources.
• It should not be performed in isolation but together with schema-related data transformations based on comprehensive metadata.

Statistical methods are used to: (i) visualize the data; (ii) summarize and describe existing data by means of univariate and multivariate analysis; (iii) offer hypotheses and decisions with the aid of statistical tests; (iv) interpret data employing sampling techniques.

One of the most widely used statistical tools for data quality assessment is the quality indicator (Bergdahl et al., 2007), a measure of how well the provided information meets the criteria and requirements for output quality. There also exist a number of statistical/probabilistic techniques and their modifications (Winkler, 1999), 1-1 matching methods and the bridging file technique (Winkler, 2004).
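As an illustration of the quality indicator idea, the following minimal sketch computes two simple indicators for a series of sensor readings: completeness (share of readings that are present) and validity (share of present readings inside a plausible range). The function names, the example readings, the plausible range and the threshold are assumptions made for this illustration only; they are not taken from Bergdahl et al. (2007) or from the industrial use case.

```python
import math
from typing import List, Optional, Sequence


def present_values(values: Sequence[Optional[float]]) -> List[float]:
    """Return the readings that are neither None nor NaN."""
    return [v for v in values if v is not None and not math.isnan(v)]


def completeness(values: Sequence[Optional[float]]) -> float:
    """Share of readings that are present (hypothetical indicator)."""
    if not values:
        return 0.0
    return len(present_values(values)) / len(values)


def validity(values: Sequence[Optional[float]], low: float, high: float) -> float:
    """Share of present readings inside the plausible range [low, high]."""
    present = present_values(values)
    if not present:
        return 0.0
    return sum(1 for v in present if low <= v <= high) / len(present)


def meets_requirement(indicator: float, threshold: float) -> bool:
    """Compare an indicator against a required minimum value."""
    return indicator >= threshold


# Hypothetical temperature readings with two gaps and one implausible value.
readings = [20.5, None, 21.0, float("nan"), 950.0, 19.8, 20.1, 20.3]

c = completeness(readings)
v = validity(readings, low=-50.0, high=150.0)
print(f"completeness={c:.2f}, validity={v:.2f}")
print("complete enough:", meets_requirement(c, threshold=0.9))
```

In practice the thresholds and plausible ranges would come from the output quality requirements of the specific application rather than being fixed in code.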
model-driven correction of a model in case of changes in system structure.

5.2. Analysis and diagnostics

This subsection briefly explains how the cleaned data is studied and processed further in the current industry case. Its main use is continuous diagnosis of the condition of the appliance, in order to predict and prevent future faults of the machinery and to react instantly when anomalies or operating faults are detected. The two main approaches are (i) data-driven and (ii) knowledge-based techniques. Data-driven approaches include pattern recognition, neural networks and numerical approaches; knowledge-based techniques include case descriptions and the modeling of faults and correct behavior. The following factors determine the choice of the appropriate diagnostic method in a particular case (ISO 13379-1, 2009):

• application and initial design of the equipment;
• availability of data to be analyzed and its complexity; and
• required qualifications of the resulting computations and models.

A brief summary of the above-mentioned diagnostic techniques, as presented in (ISO 13379-1, 2009), follows.

Data-driven methods classify the different functioning states of an appliance: normal, fault one, fault two, etc. To achieve this, the model is first trained with historical data from each condition and then run on the new data that has to be classified.

The great advantage of the data-driven approach is that it does not require thorough knowledge of the system to be diagnosed. Another strong advantage is the absence of constraints on the data type of the independent variables. A disadvantage is that training a model may be computationally difficult, as constructing it requires a comparatively large amount of prescribed fault and non-fault states. In addition, modeling by this approach does not yield an explanatory diagnosis. The most common data-driven techniques are:

• Statistical data analysis, case-based reasoning;
• Neural networks;
• Classification trees;
• Random forests;
• Logistic regression;
• Support vector machines.

Knowledge-based approaches represent knowledge using various knowledge representation techniques and reason over it to infer new knowledge. Their biggest advantage is the possibility of thorough diagnostics. There are two fundamental knowledge-based methods used by engineers:

• Fault/symptom diagnostic approach;
• Causal tree diagnostic approach.

In special situations several approaches may be combined for better results. The two approaches are not disjoint, i.e. some methods can be assigned to both types. Each approach has its advantages as well as drawbacks, and an engineer chooses the appropriate diagnostic technique based on the type of appliance, the complexity of modeling, the availability of the necessary data and other factors.

6. CONCLUSION

In the current industry use case, data is employed to conduct the calculations necessary for emergency diagnostics, prognosis of efficiency and further analysis. However, imperfect, incomplete or defective information and data schemes have made these tasks rather difficult to realize: wrong, missing or incorrectly formatted data may result in erroneous computations and false decisions, which can be disastrous for industrial processes, especially in large-scale industries.

This paper studies data quality and different approaches to its assessment. We summarized and illustrated the most common defects of a large-scale industrial database, using the Siemens Energy Domain and its equipment measurements as an example. We also reviewed existing techniques that are used to overcome errors in data and proposed an approach to address data quality problems. As the examples show, data requires continuous control and quality improvement, although the design of a convenient technological solution to that challenge is far from trivial.

REFERENCES

ISO 13379-1. (2009). Condition monitoring and diagnostics of machines - Data interpretation and diagnostics techniques - Part 1: General guidelines. ISO, Geneva, Switzerland.
Batini, C., & Scannapieco, M. (2006). Data quality: concepts, methodologies and techniques. Springer.
Bergdahl, M., Ehling, M., Elvers, E., Földesi, E., Körner, T., Kron, A., and others (2007). Handbook on data quality assessment methods and tools, 9–10.
Buechi, M., Borthwick, A., Winkel, A., & Goldberg, A. (2003). ClueMaker: A Language for Approximate Record Matching. IQ, 207–223.
Corporation, A., & Consulting, W. M. (2011). Data quality in the insurance market.
Foken, T., Göckede, M., Mauder, M., Mahrt, L., Amiro, B., & Munger, W. (2005). Post-field data quality
control. Handbook of micrometeorology, Springer, 181–208.
Gendron, M. S., & D’Onofrio, M. J. (2001). Data quality in healthcare industry. Data Quality, 7(1), 23–31.
Kahn, B. K., Strong, D. M., & Wang, R. Y. (2002). Information quality benchmarks: product and service performance. Communications of the ACM, 45(4), 184–192.
Laudon, K. C. (1986). Data quality and due process in large interorganizational record systems. Communications of the ACM, 29(1), 4–11.
Lup Low, W., Li Lee, M., & Wang Ling, T. (2001). A knowledge-based approach for duplicate elimination in data cleaning. Information Systems, 26(8), 585–606.
Optique. (2012). Optique: project description. Retrieved November, 2012, from CVS: "http://www.optique-project.eu/about-optique/about-optique/".
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211–218.
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3–13.
Safran, D. G., Kosinski, M., Tarlov, A. R., Rogers, W. H., Taira, D. A., Lieberman, N., & Ware, J. E. (1998). The primary care assessment survey: tests of data quality and measurement performance. Medical care, 36(5), 728–739.
Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). Data quality in context. Communications of the ACM, 40(5), 103–110.
Tejada, S., Knoblock, C. A., & Minton, S. (2001). Learning object identification rules for information integration. Information Systems, 26(8), 607–633.
Wand, Y., & Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), 86–95.
Wang, R. Y., Strong, D. M., & Guarascio, L. M. (1996). Beyond accuracy: What data quality means to data consumers. J. of Management Information Systems, 12(4), 5–33.
Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau.
Winkler, W. E. (2004). Methods for evaluating and creating data quality. Information Systems, Elsevier, 29, 531–550.
Yan, S., Lee, D., Kan, M.-Y., & Giles, L. C. (2007). Adaptive sorted neighborhood methods for efficient record linkage. Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, 185–194.