PHMC 13 067


Analysis of data quality issues in real-world industrial data

Thomas Hubauer1, Steffen Lamparter2, Mikhail Roshchin3, Nina Solomakhina4, and Stuart Watson5
1,2,3 Siemens AG, Munich, 81739, Germany
[email protected]
4 Siemens AG, Munich, 81739, Germany
[email protected]
5 Siemens Industrial Turbomachinery Ltd, Lincoln, LN5 7FD, United Kingdom
[email protected]

ABSTRACT

In large industries, the use of advanced technological methods and modern equipment comes with the problem of storing, interpreting and analyzing huge amounts of information. Handling information becomes more complicated and more important at the same time, so data quality is one of the major challenges given the rapid growth of information, fragmentation of information systems, incorrect data formatting and other issues. The aim of this paper is to describe industrial data processing and analytics on a real-world use case. The most crucial data quality issues are described, examined and classified in terms of Data Quality Dimensions. Factual industrial information supports and illustrates each encountered data deficiency. In addition, we describe methods for eliminating data quality issues, as well as data analysis techniques that are applied after the data cleaning procedure, and propose an approach to address data quality problems in large-scale industrial datasets. These methods combine several well-known techniques from the worlds of both mathematical logic and statistics, improving the data quality procedure and cleaning results.

1. INTRODUCTION

Owing to decreasing software costs and technological improvements, the amount of data produced, processed and stored by companies grows continuously. This data contains information regarding work processes, equipment, the staff involved and more. Based on this data, decisions are made, long-term plans are drawn up and statistics are compiled. Therefore, even small amounts of poor-quality data may cause problems and costly consequences. Examples of complications caused by dirty information include wrong decisions, inadequate prognoses based on imperfect statistics, and troublesome handling and analysis of data. Hence, data cleansing is one of the most important tasks in information technologies, especially in knowledge-based systems. The current work examines data analysis and cleaning using an example from the Siemens Energy Sector, in particular its Oil and Gas solutions subdivision. Part of its operational data is analyzed for possible data quality problems, and a number of approaches to their solution are considered.

This paper is structured as follows. First, an introduction to the topic of data quality and a description of related work are provided. The third section describes an industrial use case and particular data schemes. The fourth section examines primary data characteristics describing its quality conditions: its first subsection lists generally defined data quality dimensions, which are then discussed in conjunction with our industrial scenario and illustrated with factual examples. In the fifth section, techniques and methods are proposed that help to overcome the difficulties of low data quality and make use of such information; data analysis techniques are also described. The paper concludes with a summary of findings and further requirements for future development in data quality assessment and data cleaning for industrial data-related procedures.

Thomas Hubauer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

2. RELATED WORK

Unsatisfactory data quality affects every field of action, in both IT-related procedures and business-related tasks. Many companies elaborate their own approaches to data quality assessment with respect to their own data purposes and types. Huge amounts of data, including names, addresses, and numerical and categorical values, have to be stored and manipulated. Towards the improvement of information quality assessment, there is a number of research works
Annual Conference of the Prognostics and Health Management Society 2013

Figure 1. Appliance structure and data flow


conducted by postal, insurance (Corporation & Consulting, 2011) and product trading (Pipino, Lee, & Wang, 2002) companies, criminal-record governmental systems (Laudon, 1986) and many others (Wang, Strong, & Guarascio, 1996). In industry and research, the challenge of complex data access is relevant as well: there exist a number of research works from different branches, such as industrial ecology (Weidema, B. P. & Wesnæs, 1996), the healthcare industry (Safran et al., 1998; Gendron & D'Onofrio, 2001), meteorology (Foken et al., 2005) and sensor networks (Wallis et al., 2007). Moreover, a project called "Optique" has recently been launched, intended to improve data quality and to provide quick end-user access to Big Data. It is conducted jointly by several European universities and two big industrial companies: Siemens AG and Statoil USA. The goals of the project are (Optique, 2012):

- to provide a semantic end-to-end connection between users and data sources;
- to enable users to rapidly formulate intuitive queries using familiar vocabularies and conceptualizations;
- to integrate data spread across multiple distributed data sources, including streaming sources;
- to exploit massive parallelism for scalability far beyond traditional RDBMSs and thus to reduce the turnaround time for information requests to minutes rather than days.

3. CASE STUDY: INDUSTRIAL DATA

This paper relates to data quality at the Siemens Energy Sector. Data handling and processing in the energy domain is becoming a big challenge as power generation grows more and more important over time.

Siemens Energy Services maintains thousands of power generation facilities, specifically their major core components: gas and steam turbines, called in the following "appliances" or "rotating equipment". Operational support is provided through a global network of more than 50 service centers. These centers are in turn linked to a common database center, which stores the information coming from the appliances in several thousand databases. In the remainder of this chapter, the data organization, processing and data types used in the tables are presented (see also Figure 1).

Each appliance comprises several industrial computers, which operate based upon information from sensors and serve the functions of (i) control unit and (ii) data collector. Overall, approximately 2000 sensors are used to monitor the functioning of a single appliance.

The control unit serves the following functions: receiving sensor measurements, real-time monitoring of the appliance, and communication of all information to a data collector. To conduct monitoring, it processes received sensor data in several ways and generates corresponding short messages ("events") that describe the status of a unit and its functioning. There are three levels of data processing offered by a control unit:

- no processing applied at all; the data remains as it was generated by the hardware sensors ("raw" data);
- soft sensors: small chunks of code which use predefined rules (i.e., thresholds, trends) in order to generate events for condition monitoring; there are usually approx. 2000 soft sensors, and most soft sensors are assigned to one or several hard sensors;
- simple analysis: information preprocessing based on hard sensors' measurements and soft sensors' calculations (e.g., Fast Fourier Transformation).

The main function of a data collector is to accumulate the information passed by the control unit and to send it regularly to the central database. The different types of tables stored in the databases are described below:

- Serial numbers, identification codes and all general characteristics of unit components. Exemplary Table 1 shows such information. In addition, the location of the appliance, weather conditions, history of operation and conducted maintenance, and performance indices are provided. Generally, information of that kind is polytypic: it contains strings, numerical data and other types.

Table 1. Main characteristics of an appliance

ID | Engine Type | Power Output | Frequency | …
T1 | TurbineType1 | 12.90 MW(e) | 50/60 Hz | …
T2 | TurbineType2 | 19.10 MW(e) | 50/60 Hz | …
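The soft-sensor level of processing described in this section can be sketched as a small rule applied to raw readings. The threshold value, sensor IDs and event text below are illustrative assumptions, not actual Siemens rules.

```python
# Minimal sketch of a threshold-based soft sensor; the 620-degree limit
# and the event wording are hypothetical.

def soft_sensor(readings, threshold=620.0):
    """Emit a condition-monitoring event for each reading above threshold.

    readings: list of (sensor_id, timestamp, value) tuples ("raw" data).
    Returns a list of short event messages, as a control unit would produce.
    """
    events = []
    for sensor_id, timestamp, value in readings:
        if value > threshold:  # predefined rule: simple upper threshold
            events.append((timestamp, f"TempHigh {sensor_id}"))
    return events

raw = [("TMP23-1", "2010/07/23 23:11:55", 601.0),
       ("TMP23-1", "2010/07/23 23:12:00", 625.3)]
print(soft_sensor(raw))  # only the 625.3 reading produces an event
```

In practice each such rule would be one of the roughly 2000 soft sensors attached to the hard sensors of an appliance.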


Table 2. Measurement data

SensorID | Timestamp | Value1 | Value2 | …
TMP23-1 | 2010/07/23 23:11:55 | 44 | 49 | …

- Measurements of sensors and monitoring devices. Table 2 depicts the schema of such data, which is often called "raw" data, referring to the fact that it represents unprocessed data incoming from the machinery itself. Tables of that category contain mostly numerical data and are extremely large.

- Pre-processed data and events. Typically, data preprocessed by the control unit and by soft sensors is stored in different tables. Though these tables are distinguished, they have the same structure, shown below in Table 3. Tables of this category are huge as well and consist mostly of text and date/time data.

- Processed data. In that category, the databases store results of analyses conducted previously by a service centre for a particular appliance. All diagnostics results based on data from the central database are stored in the database as well and might be used for further diagnostics.

Each table has up to 20 attributes and contains various data formats, including scaled (nominal, ordinal, interval, ratio types), separated (with comma, tabulation), binary, and floating-point (single, double) data types. The overall number of tables per single appliance exceeds 150. In sum, the tables with sensor and event data amount to 100 TB of timestamped data. Moreover, sensors continuously produce measurements at a rate between 1 and 1000 Hz, and about 30 GB of new sensor and event data are generated per day. Due to numerous causes, such as different device vendors or historical reasons, more than 10 different logical schemes are in use for the databases.

Thus, in the described situation there arises a number of challenges that complicate access to information and its processing; overcoming them requires a great amount of time and resources. Further in this paper, these challenges and approaches to them are discussed more thoroughly.

4. DATA QUALITY DIMENSIONS

In order to point out and classify the defects of data, special data characteristics have been defined. A deficient condition of any one of them has an impact on effective analysis and processing of the information. They are called Data Quality Dimensions.

The first part of this section lists general data characteristics used to describe data of any purpose. The second part explores dimensions of industrial data and provides some explanatory factual examples.

Figure 2. Data Quality Dimensions

4.1. Main characteristics of data

Overall, there are 16 typical data quality dimensions describing data features (Kahn, Strong, & Wang, 2002), as listed in Figure 2.

Typically, the classification of dimensions differs slightly depending on the purpose of the information and the data types used. From time to time, some dimensions are omitted and others are split up into several more concrete attributes. The reason is that in various fields of action some particular characteristics are more important and more attention is paid to them. For instance, for military government information, security is a major feature, whereas for postal services a complete and error-free address database is more of a priority. For easier prioritizing and handling of data quality issues, the dimensions can be clustered into three hyperdimensions (Karr, Sanil, & Banks, 2006):

- Process: characteristics related to the maintenance of data, such as Ease of Manipulation, Value-Added, Security.
- Data: characteristics of the information itself, such as Believability, Completeness, Free of Error, Objectivity, Relevancy.
- User: characteristics related to usage and interaction with users, such as Appropriate Amount of Information, Accessibility, Timeliness, Understandability.

Nevertheless, all above-listed data attributes are important for databases of any purpose, and there exist different techniques and methods to estimate them and to correct existing data in order to improve its attributes.
Table 3. Processed data

ApplianceID | Time | Class | ErrorCode | Downtime | …
XX476 | 2010/07/23 21:10:35 | Warning | OilTemperatureHigh | 00:00:05 | …
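Since every record in a table like Table 3 carries a timestamp, stretches with no records at all can be found by scanning consecutive timestamps. The one-hour tolerance in this sketch is an assumed, illustrative choice; a real tolerance would depend on the appliance and data type.

```python
# Sketch: find candidate data-loss periods in timestamped records
# (schema as in Table 3). The max_gap tolerance is hypothetical.
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(hours=1)):
    """Return (start, end) pairs where consecutive records are further
    apart than max_gap, i.e. candidate periods of missing data."""
    ts = sorted(datetime.strptime(t, "%Y/%m/%d %H:%M:%S") for t in timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > max_gap]

rows = ["2010/09/19 10:00:00", "2010/09/19 10:30:00", "2010/09/27 08:00:00"]
print(find_gaps(rows))  # one large hole between the 19th and the 27th
```

A check like this flags the kind of week-long absence of event data discussed under Completeness below.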

Obtaining acceptable data quality, however, often requires a lot of time and resources, and at times even manual correction to ensure the cleanliness of the data.

From now on we focus on industrial data and its significant data quality attributes, in particular in the domain of energy solutions.

4.2. Data Quality Dimensions in industry

In this section we analyze the quality of real industrial data based on (a relevant subset of) the data quality dimensions defined previously. One of the tools used during this project for analyzing data in the Siemens database and exploring its quality is the "Diagnostics of rotating equipment" software. Its main features include:

- loading sensor and event data from a database, corresponding to one particular or several appliances, components or devices during a certain time period;
- visualization of data using tables and graphs;
- analyzing sensor signals by means of statistical methods;
- identifying patterns in event data, i.e., revealing regularities preceding occurrences of a particular event.

In the following, we give concrete examples of the Data Quality Dimensions presented in Section 4.1. In order to illustrate the relevancy of data quality problems, various thermocouple measurements monitoring the functioning of a gas turbine are used.

Completeness, accessibility

The fullness of information, i.e., the fact that data is not missing and is sufficiently detailed, is the most important characteristic of data. Nevertheless, data loss is not uncommon in industry for several reasons. These reasons include the inability to access the required data: the appliance might be located in a remote region, and due to a bad (or absent) connection between the data collector in the unit and the main database, the information may be unavailable. Another reason is device faults. Depending on the cause, only one type of data table might be absent, "raw" or event data, and in that case it is still possible to make use of the available information in order to conduct an analysis. More severe is the case that no data for a particular period is available at all. Figure 3 depicts a loss of sensor measurements, whereas Figure 4 shows the absence of pre-processed event data for a week between the 20th and 26th of September.

Figure 3. Signal data loss

Figure 4. Event data loss

Consistent Representation

When information comes from multiple sources, it is essential to have the data represented in the same format. In the current use case there exist a number of contraventions:

- recordings of timestamps, as date and time can be written in several ways. For instance, devices of one kind write timestamps as DD/MM/YYYY~hh:mm:ss, while others use the format YYYY-MM-DD~hh:mm:ss, and many more types of devices have other date and time formats.
- the data types of some information sources and monitoring devices require conversion from one format to another, e.g., from String to Float or from String to Integer.
- different monitoring systems and control units indicate the same event in different ways. This happens for diverse reasons, such as various device vendors, different software versions or even location. Therefore, a lack of standardization might occur and the same entities and events might be denoted differently. As shown in Figure 5, when device S1 measurements show the failure of vibration devices, the corresponding event is denoted in several ways: a different quantity of spaces between

Figure 5. Different denotations of the same event


event message and sensor ID, sensor ID in parentheses, etc., which badly affects analysis and statistics.
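The representation problems listed above can be addressed with small normalization routines. The two format strings, the canonical target form, and the event-message rewriting rules below are assumptions for illustration, not the formats of any particular Siemens device.

```python
# Sketch of normalizing mixed timestamp formats and inconsistent event
# denotations. Formats and canonical forms are hypothetical.
import re
from datetime import datetime

FORMATS = ["%d/%m/%Y %H:%M:%S", "%Y-%m-%d %H:%M:%S"]  # one per device kind

def normalize_timestamp(text):
    """Try each known device format; return one canonical form."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
    raise ValueError(f"unknown timestamp format: {text!r}")

def normalize_event(message):
    """Collapse repeated spaces and drop parentheses around sensor IDs,
    so the same event is denoted one way."""
    message = re.sub(r"\((\w[\w-]*)\)", r"\1", message)  # "(S1)" -> "S1"
    return re.sub(r"\s+", " ", message).strip()

print(normalize_timestamp("23/07/2010 23:11:55"))   # 2010-07-23 23:11:55
print(normalize_event("Vibration failure   (S1)"))  # Vibration failure S1
```

Canonicalizing denotations in this way is what makes event counts and statistics comparable across devices.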
Free of Errors, Believability, Accuracy

In order to rely on the results of analysis, the data should be correct, precise and relevant. The possible causes of erroneous and inaccurate data are very diverse: (i) one or several devices of the appliance faulted and gave inaccurate or wrong measurements; (ii) a control unit failure occurred and there was an error during data preprocessing; (iii) there are three data transfer segments, from sensors to control unit, from control unit to data collector, and between the appliance and the data warehouse, and for each of them the frequency of data transfer and the speed of data flow differ; it might happen that a poor connection distorted the information on one of these segments. Below are listed a few examples of discredited data or insufficient data accuracy.

- Time synchronization. Timestamps of events and measurements incoming from several different devices might slightly differ due to such reasons as (1) the time settings of particular devices; (2) the frequency and duration of data transfers between components, control unit and data collector.
- Range of values. Figure 6a shows an example where thermocouple sensor measurement values are out of domain, namely minus temperatures. Additionally, outliers occasionally occur: spikes or sudden changes of value within the domain. They should be treated properly during the analysis. Figure 6b depicts an example of outliers: all sensors alternately show the range maximum and minimum.
- Oscillations and noise. Figure 7a shows heavy oscillations of all signals. Figure 7b depicts the case when signal measurements contain too much noisy data.
- Vast difference in measures. If there are several sensing elements which duplicate each other, and they measure completely different values, then it is problematic to rely on these measurements. In Figure 7c, the RSignal measurements differ from all other measurements by more than 100 degrees. In the case shown in Figure 7d, the duplicating sensors measure similar values, but as soon as the temperature drops or rises, the sensors' measurements change with different amplitudes, as marked inside the black rectangles.
- Signal alternation. Figure 8 shows the case when two signals at some moment alternated and swapped their measurements, as also marked with the black rectangles.

Figure 6. (a) Values out of range: minus temperatures. (b) Outliers: measurements of range minimum, maximum.

Ease of Manipulation, Data Schemes

Data schemes and structures are highly heterogeneous, depending upon which technique was used to create them, which unit they belong to, and where they come from historically. Moreover, not all foreign keys between databases are present. If information on the same entity is distributed among several sources, for instance, if information concerning a particular malfunction of an appliance should be extracted from the tables "Incident Summary", "Daily Event Log", "Burner tip temperature" and others, the problem of missing foreign keys does not allow for easy merging of data.

Timeliness, Appropriate Amount of Information

For a thorough analysis it is critical to have all data available and updated, though for each diagnostics case the considered time period differs: it might be sufficient to consider only the last hour in order to identify the cause of an event, but in other cases one needs to analyze the last several years, for example to detect the deterioration of a particular component.
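Two of the accuracy checks discussed in this subsection, domain (range) validation and spike detection, can be sketched as follows. The domain bounds and the jump threshold are illustrative assumptions, not real turbine limits.

```python
# Sketch of two accuracy checks: values outside the physical domain
# (e.g. negative turbine temperatures) and sudden spikes within it.
# The bounds and the 50-degree jump threshold are hypothetical.

def out_of_range(values, low=0.0, high=650.0):
    """Indices of measurements outside the plausible domain."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]

def spikes(values, max_jump=50.0):
    """Indices where the value jumps by more than max_jump from its
    predecessor - candidate outliers within the domain."""
    return [i for i in range(1, len(values))
            if abs(values[i] - values[i - 1]) > max_jump]

signal = [597.2, 598.5, -12.0, 599.6, 649.9, 600.3]
print(out_of_range(signal))  # [2]: a minus temperature
print(spikes(signal))        # [2, 3, 4]: jumps around the bad reading
```

Flagged indices would then be treated properly during analysis, e.g. excluded or interpolated, rather than fed to diagnostics as-is.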


Figure 7. (a) Oscillatory signals. (b) Noisy data - BSignal measurements look like white noise. (c) RSignal shows divergent values. (d) SSignal and GSignal have rise/drop amplitudes differing from the other duplicating sensors.

Thus, data usually does not expire or become irrelevant within several years, but on the other hand it has to be stored for decades.

In conclusion, for successful information analysis it is crucial to determine how reliable the data is and to bring it into a representation convenient for the required purposes. In the following section we discuss methods and techniques developed to achieve this goal.

5. DATA ANALYSIS

In this chapter we examine techniques which help to make use of low-quality data. First, data cleaning methods are described and a proposal to improve them is made. Data analysis techniques are described in the second part of the chapter.

5.1. Quality assessment and cleaning

There are several directions in data cleaning, and existing techniques are aimed at particular problems (Rahm & Do, 2000): duplicate identification and elimination, data transformations, schema matching, data mining approaches and others. Moreover, there are also unified techniques. The main scientific approaches include statistical, machine learning and knowledge-based approaches. In general, any data cleaning technique should satisfy several requirements (Rahm & Do, 2000):

- it should detect and remove all major errors and inconsistencies, both in individual data sources and when integrating multiple sources;
- it should be supported by tools to limit manual inspection and programming effort, and be extensible to easily cover additional sources;
- it should not be performed in isolation, but together with schema-related data transformations based on comprehensive metadata.

Statistical methods are used to: (i) visualize the data; (ii) summarize and describe existing data by means of univariate and multivariate analysis; (iii) offer hypotheses and decisions with the aid of statistical tests; (iv) interpret data employing sampling techniques.

One of the most widely used statistical tools for data quality assessment is the quality indicator (Bergdahl et al., 2007). It is a measure of how well the provided information meets the criteria and requirements for output quality. There also exist a number of statistical/probabilistic techniques and their modifications (Winkler, 1999), 1-1 matching methods and the bridging file technique (Winkler, 2004).
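A quality indicator in the sense above can be as simple as the share of records satisfying stated criteria. The two criteria used in this sketch, a parsable timestamp and a value inside an assumed domain, are illustrative choices, not the indicators of (Bergdahl et al., 2007).

```python
# Sketch of a simple quality indicator: the fraction of records that
# meet output-quality criteria. Both criteria here are hypothetical.
from datetime import datetime

def valid(record):
    """A record passes if its timestamp parses and its value is in domain."""
    sensor_id, timestamp, value = record
    try:
        datetime.strptime(timestamp, "%Y/%m/%d %H:%M:%S")
    except ValueError:
        return False
    return 0.0 <= value <= 650.0

def quality_indicator(records):
    """Share of valid records, between 0 and 1."""
    return sum(valid(r) for r in records) / len(records)

rows = [("TMPS1", "2010/08/28 13:21:55", 597.2),
        ("TMPS1", "28/08/2010 13:22:00", 598.5),   # wrong timestamp format
        ("TMPS1", "2010/08/28 13:22:05", -40.0)]   # value out of domain
print(quality_indicator(rows))  # 1 of 3 records passes
```

Tracked over time, such an indicator shows whether cleaning efforts actually improve an output table.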


Table 4. Measurement data for an exemplary sensor

Sensor ID | Timestamp | Value
TMPS1 | 2010/08/28 13:21:55 | 597.2
TMPS1 | 2010/08/28 13:22:00 | 598.5
TMPS1 | 2010/08/28 13:22:05 | 599.6
TMPS1 | 2010/08/28 13:22:10 | 600.3

Figure 8. BSignal and RSignal traded places.


In the current use case, the statistical approach is widely used for detecting faults in sensor readings. For large-scale databases that grow day by day with new portions of sensor measurements, it is highly essential to use fast and robust techniques for detecting changes in signal behavior. The main approach is time-series analysis, cross- and autocorrelation, spectrum and Fourier analyses in particular. In addition, simple indicators and distribution tests are exploited in order to quickly detect changes in the statistical parameters of sensor readings.

There exist a number of effective machine-learning algorithms. The most widely used are artificial neural networks, clustering algorithms, support vector machines and similarity learning. For the detection of faulty sensor readings, the machine learning approach is successfully used for the analysis of several sensor signals at once, in order to establish a confidence level for each device and thus to identify malfunctioning sensors straight away.

Another application of machine learning algorithms is duplicate elimination. For this task, clustering and neural networks are usually exploited. One more technique is the sorted neighbourhood method and its modifications (Bertolazzi, De Santis, & Scannapieco, 2003; Yan et al., 2007). All these methods are used in large-scale databases as well (Hernandez & Stolfo, 1995), and in the current use case they can be exploited to get rid of duplicates in event data.

For a knowledge-based approach, the application domain can be represented (Batini & Scannapieca, 2006):

- procedurally, in the form of program code, or implicitly as patterns of activation in a neural network;
- as an explicit and declarative representation, in terms of a knowledge base consisting of logical formulas or rules expressed in a representation language.

Typically, the most general approach to performing data transformations is extensions of the standard query language SQL (Rahm & Do, 2000), which allow flexible transformation step definitions and their easy reuse, and support query processing tasks.

Additionally, several systems have been developed which improve the quality of data by means of rules extracted from domain knowledge and domain-independent transformations (Batini & Scannapieca, 2006), e.g. the Intelliclean system (Lup, Lee, & Wang, 2001) aimed at efficient duplicate elimination, the Atlas technique (Tejada, Knoblock, & Minton, 2001), which allows new rules to be obtained through a learning process, and the Clue-Based method for record matching (Buechi et al., 2003).

Figure 9. Temperature sensor measurements range represented in a model

As a proposition for further work, we propose to combine existing techniques in order to increase the productivity and effectiveness of the data cleaning process. The dataset introduced here can serve as a test. As a motivating example, consider the measurements of a temperature sensor presented in Table 4, both in a semantic model and as a statistical value.

In the model-based representation of the sensor data, such as indicated in Figure 9, after processing the measurements presented in Table 4 the system would detect an outlier at time 13:22:10.

On the contrary, analyzing the data with statistical methods, a trend would be detected. Thus, having results available from both model-based reasoning and statistical techniques would prevent a false alarm.

Likewise, it is useful to combine multivariate statistical analysis and machine learning algorithms, such as clustering and neural networks, for establishing the quality of several sensors' measurements.

Therefore, that joint approach would help to improve the following weak points in managing low-quality data:

- efficient detection of data deficiencies, such as (i) false positive errors and (ii) false negative errors;
- detecting correlations between particular sequences of events and their consequences, and between measurements, using numerous solutions such as pattern-matching algorithms, independence tests and others; and

- model-driven correction of the model in case of changes in the system structure.
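The combination of model-based reasoning and statistical techniques proposed above can be sketched on the Table 4 readings: a range check against the model alone would flag the last value, while a simple trend test sees a steady rise and suppresses the alarm. The 600-degree model bound and the monotone-trend criterion are illustrative assumptions, not the actual model of Figure 9.

```python
# Sketch of the proposed joint check: model-based range validation plus
# a statistical trend test. Bound and trend criterion are hypothetical.

def model_violations(values, upper=600.0):
    """Model-based reasoning: indices exceeding the modelled range."""
    return [i for i, v in enumerate(values) if v > upper]

def is_trend(values):
    """Statistical view: a strictly monotone series is a trend,
    not an isolated outlier."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)

def alarms(values, upper=600.0):
    """Alarm only if the model is violated AND no trend explains it."""
    flagged = model_violations(values, upper)
    return [] if is_trend(values) else flagged

table4 = [597.2, 598.5, 599.6, 600.3]   # the Table 4 measurements
print(model_violations(table4))  # [3]: the model alone alarms at 13:22:10
print(alarms(table4))            # []: the rising trend prevents a false alarm
```

A genuine spike that no trend explains would still raise an alarm, which is exactly the false-positive/false-negative trade-off the joint approach targets.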
5.2. Analysis and diagnostics

In this subsection it is shortly explained how the cleaned data is studied and processed further in the current industry case. The main use is continuous diagnosing of the condition of the appliance, in order to predict and prevent future faults of the machinery and to react instantly as anomalies or faults in operation are detected. The two main approaches for that are (i) data-driven and (ii) knowledge-based techniques. Data-driven approaches include pattern recognition, neural networks and numerical approaches; knowledge-based techniques include case descriptions and the modeling of faults and correct behavior. The following factors determine the choice of the appropriate diagnosing method in a particular case (ISO 13379-1, 2009):

- application and initial design of the equipment;
- availability of the data to be analyzed and its complexity; and
- required qualifications of the resulting computations and models.

A brief summary of the above-mentioned diagnosing techniques, as presented in (ISO 13379-1, 2009), follows.

Data-driven approach methods classify the different functioning states of an appliance: normal, fault one, fault two, etc. In order to achieve this, the model is first trained with historical data from each condition and afterwards launched with the new data, which has to be classified.

The great advantage of the data-driven approach is that it does not need a thorough knowledge of the system to be diagnosed. The other strong advantage is the absence of constraints on the data type of the independent variables. As a disadvantage, it is worth mentioning that it might be computationally difficult to train a model, as it requires a comparatively large amount of prescribed fault and non-fault states to construct the model. In addition, modelling by this approach does not result in an explanatory diagnosis. The most common data-driven techniques are:

- statistical data analysis, case-based reasoning;
- neural networks;
- classification trees;
- random forests;
- logistic regression;
- support vector machines.

Knowledge-based approaches are used to represent knowledge using various knowledge representation techniques and to reason over it to infer new knowledge. Their biggest advantage is the possibility of thorough diagnostics. There are two fundamental knowledge-based methods used by engineers:

- the fault/symptom diagnostic approach;
- the causal tree diagnostic approach.

In special situations several approaches may be combined for better results; still, the approaches are not disjoint, i.e. there are methods which might be referred to both types. However, each approach has its advantages as well as drawbacks, and an engineer chooses the appropriate diagnostic technique based on the type of appliance, the complexity of modeling, the availability of necessary data and other factors.

6. CONCLUSION

For the current industry use case, data is employed to conduct calculations necessary for emergency diagnostics, prognosis of efficiency and further analysis. However, due to imperfect, incomplete or defective information and data schemes, these tasks have become rather difficult to realize: wrong, missing or incorrectly formatted data may result in erroneous computations and false decisions, which can be quite disastrous for industry processes, especially for large-scale industries.

The current paper studies data quality and different approaches to its assessment. We summarized and illustrated the most common defects of a large-scale industrial database using the example of the Siemens Energy Domain and its equipment measurements. We also reviewed existing techniques that are used to overcome errors in data and proposed an approach to address data quality problems. As shown in the examples, there is no doubt that data requires continuous control and quality improvement, although the design of a convenient technological solution to that challenge is far from trivial.

REFERENCES

ISO 13379-1 (2009). Condition monitoring and diagnostics of machines - data interpretation and diagnostics techniques - part 1: General guidelines. ISO, Geneva, Switzerland.
Batini, C., & Scannapieca, M. (2006). Data quality: concepts, methodologies and techniques. Springer.
Bergdahl, M., Ehling, M., Elvers, E., Földesi, E., Körner, T., Kron, A., and others (2007). Handbook on data quality assessment methods and tools, 9-10.
Buechi, M., Borthwick, A., Winkel, A., & Goldberg, A. (2003). ClueMaker: A language for approximate record matching. IQ, 207-223.
Corporation, A., & Consulting, W. M. (2011). Data quality in the insurance market.
Foken, T., Göockede, M., Mauder, M., Mahrt, L., Amiro, B., & Munger, W. (2005). Post-field data quality


control. Handbook of micrometeorology, Springer, 181-208.
Gendron, M. S., & D'Onofrio, M. J. (2001). Data quality in the healthcare industry. Data Quality, 7(1), 23-31.
Kahn, B. K., Strong, D. M., & Wang, R. Y. (2002). Information quality benchmarks: product and service performance. Communications of the ACM, 45(4), 184-192.
Laudon, K. C. (1986). Data quality and due process in large interorganizational record systems. Communications of the ACM, 29(1), 4-11.
Lup Low, W., Li Lee, M., & Wang Ling, T. (2001). A knowledge-based approach for duplicate elimination in data cleaning. Information Systems, 26(8), 585-606.
Optique. (2012). Optique: project description. Retrieved November, 2012, from "http://www.optique-project.eu/about-optique/about-optique/".
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211-218.
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
Safran, D. G., Kosinski, M., Tarlov, A. R., Rogers, W. H., Taira, D. A., Lieberman, N., & Ware, J. E. (1998). The primary care assessment survey: tests of data quality and measurement performance. Medical Care, 36(5), 728-739.
Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). Data quality in context. Communications of the ACM, 40(5), 103-110.
Tejada, S., Knoblock, C. A., & Minton, S. (2001). Learning object identification rules for information integration. Information Systems, 26(8), 607-633.
Wand, Y., & Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), 86-95.
Wang, R. Y., Strong, D. M., & Guarascio, L. M. (1996). Beyond accuracy: What data quality means to data consumers. J. of Management Information Systems, 12(4), 5-33.
Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau.
Winkler, W. E. (2004). Methods for evaluating and creating data quality. Information Systems, Elsevier, 29, 531-550.
Yan, S., Lee, D., Kan, M.-Y., & Giles, L. C. (2007). Adaptive sorted neighborhood methods for efficient record linkage. Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, 185-194.
