BookSlides 3A Data Exploration

Fundamentals of Machine Learning for Predictive Data Analytics
Chapter 3: Data Exploration
Sections 3.1, 3.2, 3.3, 3.4
John Kelleher and Brian Mac Namee and Aoife D’Arcy
john.d.kelleher@dit.ie brian.macnamee@ucd.ie aoife@theanalyticsstore.com

1 The Data Quality Report
  Case Study: Motor Insurance Fraud
2 Getting To Know The Data
3 Identifying Data Quality Issues
4 Handling Data Quality Issues
  Handling Missing Values
  Handling Outliers
5 Summary

The Data Quality Report

A data quality report includes tabular reports that describe the characteristics of each feature in an ABT (analytics base table) using standard statistical measures of central tendency and variation.
The tabular reports are accompanied by data visualizations:
- A histogram for each continuous feature in an ABT.
- A bar plot for each categorical feature in an ABT.
Table: The structures of the tables included in a data quality report to describe (a) continuous features and (b) categorical features.

(a) Continuous Features

| Feature | Count | % Miss. | Card. | Min. | 1st Qrt. | Mean | Median | 3rd Qrt. | Max. | Std. Dev. |
|---|---|---|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

(b) Categorical Features

| Feature | Count | % Miss. | Card. | Mode | Mode Freq. | Mode % | 2nd Mode | 2nd Mode Freq. | 2nd Mode % |
|---|---|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
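As a rough illustration of how the columns in these two tables could be computed, the following is a minimal sketch using only the Python standard library. The function names `continuous_report` and `categorical_report` are hypothetical, `None` is assumed to mark a missing value, and the sample values are illustrative. The sketch computes mode percentages over the non-missing values and counts only non-missing levels in the cardinality; both are assumptions, and other conventions (for example, counting "missing" as a level) are possible.

```python
import statistics
from collections import Counter


def continuous_report(values):
    """Summarise one continuous feature; None marks a missing value (assumption)."""
    present = sorted(v for v in values if v is not None)
    # Quartiles by linear interpolation; other quartile conventions exist.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(values),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "cardinality": len(set(present)),
        "min": present[0],
        "q1": q1,
        "mean": statistics.mean(present),
        "median": median,
        "q3": q3,
        "max": present[-1],
        "std_dev": statistics.stdev(present),
    }


def categorical_report(values):
    """Summarise one categorical feature; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    report = {
        "count": len(values),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "cardinality": len(set(present)),
    }
    # Mode and 2nd mode, with percentages taken over the non-missing values.
    ranked = Counter(present).most_common(2)
    for label, (level, freq) in zip(("mode", "second_mode"), ranked):
        report[label] = level
        report[label + "_freq"] = freq
        report[label + "_pct"] = 100.0 * freq / len(present)
    return report


# Illustrative values only: ten claim amounts and ten marital status entries.
claim_amount = [1625, 15028, -99999, 5097, 8869, 17480, 3017, 7463, 2067, 2260]
marital_status = [None, None, "Married", None, None, None, "Single", None, None, "Married"]

claim_report = continuous_report(claim_amount)
marital_report = categorical_report(marital_status)
print(claim_report)
print(marital_report)
```

Running one function per feature and stacking the resulting dictionaries gives one row per feature, which matches the layout of the two tables.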
The following slides show a portion of the ABT that has been developed for the motor insurance claims fraud detection problem, followed by the data quality report for that ABT.

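Table: Portions of the ABT for the motor insurance claims fraud detection problem.

| ID | TYPE | INC. | MARITAL STATUS | NUM CLMNTS. | INJURY TYPE | HOSPITAL STAY | CLAIM AMNT. | TOTAL CLAIMED | NUM CLAIMS | NUM SOFT TISS. | % SOFT TISS. | CLAIM AMT RCVD. | FRAUD FLAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CI | 0 | | 2 | Soft Tissue | No | 1,625 | 3,250 | 2 | 2 | 1.0 | 0 | 1 |
| 2 | CI | 0 | | 2 | Back | Yes | 15,028 | 60,112 | 1 | | 0 | 15,028 | 0 |
| 3 | CI | 54,613 | Married | 1 | Broken Limb | No | -99,999 | 0 | 0 | 0 | 0 | 572 | 0 |
| 4 | CI | 0 | | 4 | Broken Limb | Yes | 5,097 | 11,661 | 1 | 1 | 1.0 | 7,864 | 0 |
| 5 | CI | 0 | | 4 | Soft Tissue | No | 8,869 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | CI | 0 | | 1 | Broken Limb | Yes | 17,480 | 0 | 0 | 0 | 0 | 17,480 | 0 |
| 7 | CI | 52,567 | Single | 3 | Broken Limb | No | 3,017 | 18,102 | 2 | 1 | 0.5 | 0 | 1 |
| 8 | CI | 0 | | 2 | Back | Yes | 7,463 | 0 | 0 | 0 | 0 | 7,463 | 0 |
| 9 | CI | 0 | | 1 | Soft Tissue | No | 2,067 | 0 | 0 | 0 | 0 | 2,067 | 0 |
| 10 | CI | 42,300 | Married | 4 | Back | No | 2,260 | 0 | 0 | 0 | 0 | 2,260 | 0 |
| ... | | | | | | | | | | | | | |
| 300 | CI | 0 | | 2 | Broken Limb | No | 2,244 | 0 | 0 | 0 | 0 | 2,244 | 0 |
| 301 | CI | 0 | | 1 | Broken Limb | No | 1,627 | 92,283 | 3 | 0 | 0 | 1,627 | 0 |
| 302 | CI | 0 | | 3 | Serious | Yes | 270,200 | 0 | 0 | 0 | 0 | 270,200 | 0 |
| 303 | CI | 0 | | 1 | Soft Tissue | No | 7,668 | 92,806 | 3 | 0 | 0 | 7,668 | 0 |
| 304 | CI | 46,365 | Married | 1 | Back | No | 3,217 | 0 | 0 | | 0 | 1,653 | 0 |
| ... | | | | | | | | | | | | | |
| 458 | CI | 48,176 | Married | 3 | Soft Tissue | Yes | 4,653 | 8,203 | 1 | 0 | 0 | 4,653 | 0 |
| 459 | CI | 0 | | 1 | Soft Tissue | Yes | 881 | 51,245 | 3 | 0 | 0 | 0 | 1 |
| 460 | CI | 0 | | 3 | Back | No | 8,688 | 729,792 | 56 | 5 | 0.08 | 8,688 | 0 |
| 461 | CI | 47,371 | Divorced | 1 | Broken Limb | Yes | 3,194 | 11,668 | 1 | 0 | 0 | 3,194 | 0 |
| 462 | CI | 0 | | 1 | Soft Tissue | No | 6,821 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | | | | | | | | | | | | | |
| 491 | CI | 40,204 | Single | 1 | Back | No | 75,748 | 11,116 | 1 | 0 | 0 | 0 | 1 |
| 492 | CI | 0 | | 1 | Broken Limb | No | 6,172 | 6,041 | 1 | | 0 | 6,172 | 0 |
| 493 | CI | 0 | | 1 | Soft Tissue | Yes | 2,569 | 20,055 | 1 | 0 | 0 | 2,569 | 0 |
| 494 | CI | 31,951 | Married | 1 | Broken Limb | No | 5,227 | 22,095 | 1 | 0 | 0 | 5,227 | 0 |
| 495 | CI | 0 | | 2 | Back | No | 3,813 | 9,882 | 3 | 0 | 0 | 0 | 1 |
| 496 | CI | 0 | | 1 | Soft Tissue | No | 2,118 | 0 | 0 | 0 | 0 | 0 | 1 |
| 497 | CI | 29,280 | Married | 4 | Broken Limb | Yes | 3,199 | 0 | 0 | 0 | 0 | 0 | 1 |
| 498 | CI | 0 | | 1 | Broken Limb | Yes | 32,469 | 0 | 0 | 0 | 0 | 16,763 | 0 |
| 499 | CI | 46,683 | Married | 1 | Broken Limb | No | 179,448 | 0 | 0 | | 0 | 179,448 | 0 |
| 500 | CI | 0 | | 1 | Broken Limb | No | 8,259 | 0 | 0 | 0 | 0 | 0 | 1 |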
Table: A data quality report for the motor insurance claims fraud detection ABT.

(a) Continuous Features

| Feature | Count | % Miss. | Card. | Min. | 1st Qrt. | Mean | Median | 3rd Qrt. | Max. | Std. Dev. |
|---|---|---|---|---|---|---|---|---|---|---|
| INCOME | 500 | 0.0 | 171 | 0.0 | 0.0 | 13,740.0 | 0.0 | 33,918.5 | 71,284.0 | 20,081.5 |
| NUM CLAIMANTS | 500 | 0.0 | 4 | 1.0 | 1.0 | 1.9 | 2.0 | 3.0 | 4.0 | 1.0 |
| CLAIM AMOUNT | 500 | 0.0 | 493 | -99,999 | 3,322.3 | 16,373.2 | 5,663.0 | 12,245.5 | 270,200.0 | 29,426.3 |
| TOTAL CLAIMED | 500 | 0.0 | 235 | 0.0 | 0.0 | 9,597.2 | 0.0 | 11,282.8 | 729,792.0 | 35,655.7 |
| NUM CLAIMS | 500 | 0.0 | 7 | 0.0 | 0.0 | 0.8 | 0.0 | 1.0 | 56.0 | 2.7 |
| NUM SOFT TISSUE | 500 | 2.0 | 6 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 5.0 | 0.6 |
| % SOFT TISSUE | 500 | 0.0 | 9 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 2.0 | 0.4 |
| AMOUNT RECEIVED | 500 | 0.0 | 329 | 0.0 | 0.0 | 13,051.9 | 3,253.5 | 8,191.8 | 295,303.0 | 30,547.2 |
| FRAUD FLAG | 500 | 0.0 | 2 | 0.0 | 0.0 | 0.3 | 0.0 | 1.0 | 1.0 | 0.5 |
Table: A data quality report for the motor insurance claims fraud detection ABT.

(b) Categorical Features

| Feature | Count | % Miss. | Card. | Mode | Mode Freq. | Mode % | 2nd Mode | 2nd Mode Freq. | 2nd Mode % |
|---|---|---|---|---|---|---|---|---|---|
| INSURANCE TYPE | 500 | 0.0 | 1 | CI | 500 | 100.0 | - | - | - |
| MARITAL STATUS | 500 | 61.2 | 4 | Married | 99 | 51.0 | Single | 48 | 24.7 |
| INJURY TYPE | 500 | 0.0 | 4 | Broken Limb | 177 | 35.4 | Soft Tissue | 172 | 34.4 |
| HOSPITAL STAY | 500 | 0.0 | 2 | No | 354 | 70.8 | Yes | 146 | 29.2 |
Figure: Visualizations of the continuous and categorical features in the motor insurance claims fraud detection ABT in Table 2.
Histograms (y-axis: density): (a) INCOME, (b) NUM CLAIMANTS, (c) CLAIM AMOUNT, (d) TOTAL CLAIMED.
Histograms (y-axis: density): (a) NUM CLAIMS, (b) NUM SOFT TISSUE, (c) % SOFT TISSUE, (d) AMOUNT RECEIVED.
Bar plots (y-axis: density): (a) MARITAL STATUS (Missing, Married, Single, Divorced), (b) INJURY TYPE (Broken Limb, Soft Tissue, Back, Serious), (c) HOSPITAL STAY (No, Yes).
Bar plots (y-axis: density): (a) FRAUD FLAG (0, 1), (b) INSURANCE TYPE (CI).

Getting To Know The Data

For categorical features, we should:
- Examine the mode, 2nd mode, mode %, and 2nd mode %, as these tell us the most common levels within these features and will identify if any levels dominate the dataset.
For continuous features, we should:
- Examine the mean and standard deviation of each feature to get a sense of the central tendency and variation of the values within the dataset for the feature.
- Examine the minimum and maximum values to understand the range that is possible for each feature.
When we generate histograms of features, there are a number of common, well-understood shapes that we should look out for.
Figure: Histograms for different sets of data, each of which exhibits well-known, common characteristics: (a) Uniform, (b) Normal (Unimodal), (c) Unimodal (skewed right).
Figure: Histograms for different sets of data, each of which exhibits well-known, common characteristics: (a) Unimodal (skewed left), (b) Exponential, (c) Multimodal.
Uniform: A uniform distribution indicates that a feature is equally likely to take a value in any of the ranges present.
Normal (Unimodal): Features following a normal distribution are characterized by a strong tendency towards a central value and symmetrical variation to either side of this.
Unimodal (skewed right) and Unimodal (skewed left): Skew is simply a tendency towards very high (right skew) or very low (left skew) values.
Exponential: In a feature following an exponential distribution, the likelihood of occurrence of a small number of low values is very high, but sharply diminishes as values increase.
Multimodal: A feature characterized by a multimodal distribution has two or more very commonly occurring ranges of values that are clearly separated.
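One way to put a number on the difference between symmetric and skewed shapes is the sample skewness (the third standardized moment). This statistic is not part of the slides; it is brought in here only as an illustration. The sketch below draws synthetic samples with the standard library and assumes a hypothetical helper named `sample_skewness`. Values near 0 suggest symmetry, positive values suggest right skew, and negative values suggest left skew.

```python
import random
import statistics


def sample_skewness(values):
    """Third standardized moment of a sample (hypothetical helper)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Synthetic samples, not data from the ABT.
samples = {
    "uniform": [rng.uniform(0, 1) for _ in range(n)],
    "normal": [rng.gauss(0, 1) for _ in range(n)],
    "exponential": [rng.expovariate(1.0) for _ in range(n)],
}
# Mirroring a right-skewed sample around zero gives a left-skewed one.
samples["mirrored exponential"] = [-v for v in samples["exponential"]]

skews = {name: sample_skewness(vals) for name, vals in samples.items()}
for name, skew in skews.items():
    print(f"{name:>22}: skewness {skew:+.2f}")
```

A single number cannot reveal multimodality, so a histogram remains the more informative check for that shape.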
The probability density function for the normal distribution (or Gaussian distribution) is

$$N(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \tag{1}$$

where x is any value, and µ and σ are parameters that define the shape of the distribution: the population mean and population standard deviation.
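Equation (1) can be evaluated directly. The following minimal sketch assumes a hypothetical function name `normal_pdf` and uses only the standard `math` module.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution at x, following Equation (1)."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# The peak sits at x = mu; doubling sigma halves the height of the peak.
peak_sigma_1 = normal_pdf(0.0, 0.0, 1.0)
peak_sigma_2 = normal_pdf(0.0, 0.0, 2.0)
print(peak_sigma_1, peak_sigma_2)
```

Changing µ shifts the curve along the x-axis without changing its shape, while changing σ makes it wider and flatter or narrower and taller.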
Figure: Three normal distributions with different means but identical standard deviations (µ = 0, µ = −2, and µ = +2, each with σ = 1; x-axis: Value, y-axis: Density).
Figure: Three normal distributions with identical means but different standard deviations (µ = 0 with σ = 1, σ = 2, and σ = 6; x-axis: Value, y-axis: Density).
The 68-95-99.7 rule is a useful characteristic of the normal distribution.
The rule states that approximately:
- 68% of observations will be within one σ of µ.
- 95% of observations will be within two σ of µ.
- 99.7% of observations will be within three σ of µ.
Figure: An illustration of the 68-95-99.7 percentage rule that a normal distribution defines as the expected distribution of observations (x-axis marked at µ−3σ, µ−2σ, µ−σ, µ, µ+σ, µ+2σ, µ+3σ). The grey region defines the area where 95% of observations are expected.
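The three percentages in the rule can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2). The sketch below assumes a hypothetical function name `fraction_within`.

```python
import math


def fraction_within(k):
    """Probability that a normally distributed value lies within k sigma of mu."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} sigma: {100.0 * fraction_within(k):.2f}%")
```

The rule is an approximation: the exact figures are slightly above 68% and 95% and very close to 99.7%.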

Examine the data quality report for the motor insurance fraud
prediction scenario and comment on the central tendency and
variation of each feature.
Identifying Data Quality Issues

A data quality issue is loosely defined as anything unusual about the data in an ABT.
The most common data quality issues are:
- missing values
- irregular cardinality
- outliers
The data quality issues we identify from a data quality report will be of two types:
- Data quality issues due to invalid data.
- Data quality issues due to valid data.
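A first pass over a data quality report for these three kinds of issue could be sketched as follows. The rules here are assumptions made for illustration, not rules stated in the slides: any missing percentage above zero is flagged, a cardinality of 1 is treated as irregular, and outliers are flagged with the common 1.5 × IQR heuristic. The function name `find_issues` is hypothetical; the numbers are copied from the data quality report tables above.

```python
# feature -> (pct_missing, cardinality, min, q1, q3, max); None where not reported.
REPORT = {
    "INCOME": (0.0, 171, 0.0, 0.0, 33918.5, 71284.0),
    "CLAIM AMOUNT": (0.0, 493, -99999.0, 3322.3, 12245.5, 270200.0),
    "NUM SOFT TISSUE": (2.0, 6, 0.0, 0.0, 0.0, 5.0),
    "AMOUNT RECEIVED": (0.0, 329, 0.0, 0.0, 8191.8, 295303.0),
    "INSURANCE TYPE": (0.0, 1, None, None, None, None),
    "MARITAL STATUS": (61.2, 4, None, None, None, None),
}


def find_issues(pct_missing, cardinality, low, q1, q3, high):
    """Return a list of candidate issues for one feature (heuristic sketch)."""
    issues = []
    if pct_missing > 0:
        issues.append(f"missing values ({pct_missing}%)")
    if cardinality == 1:
        issues.append(f"irregular cardinality ({cardinality})")
    if q1 is not None:
        iqr = q3 - q1
        # Assumed heuristic: skip the fence test when the IQR is degenerate.
        if iqr > 0:
            if high > q3 + 1.5 * iqr:
                issues.append("outliers (high)")
            if low < q1 - 1.5 * iqr:
                issues.append("outliers (low)")
    return issues


issues = {feature: find_issues(*row) for feature, row in REPORT.items()}
for feature, found in issues.items():
    print(feature, found)
```

A scan like this only proposes candidates. Deciding whether each one stems from invalid or valid data still needs knowledge of the domain.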
Table: The structure of a data quality plan.

| Feature | Data Quality Issue | Potential Handling Strategies |
|---|---|---|
| ... | ... | ... |
Table: The data quality plan for the motor insurance fraud prediction ABT.

| Feature | Data Quality Issue | Potential Handling Strategies |
|---|---|---|
| NUM SOFT TISSUE | Missing values (2%) | |
| CLAIM AMOUNT | Outliers (high) | |
| AMOUNT RECEIVED | Outliers (high) | |
Handling Data Quality Issues

Handling Missing Values
Approach 1: Drop any features that have missing values.
Approach 2: Apply complete case analysis (delete any instance that is missing one or more feature values).
Approach 3: Derive a missing indicator feature from features with missing values.
Imputation replaces missing feature values with a plausible estimated value based on the feature values that are present.
The most common approach to imputation is to replace missing values for a feature with a measure of the central tendency of that feature.
We would be reluctant to use imputation on features missing in excess of 30% of their values and would strongly recommend against the use of imputation on features missing in excess of 50% of their values.
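A missing indicator feature and median imputation could be sketched as follows. The function names `missing_indicator` and `impute_median` are hypothetical, `None` is assumed to mark a missing value, and the sample values are illustrative. The 30% cut-off mirrors the rule of thumb above; turning it into a hard error is a design choice made for this sketch.

```python
import statistics


def missing_indicator(values):
    """Derive a binary feature: 1 where the original value is missing, else 0."""
    return [1 if v is None else 0 for v in values]


def impute_median(values, max_missing_fraction=0.30):
    """Replace missing values with the median of the values that are present."""
    present = [v for v in values if v is not None]
    missing_fraction = (len(values) - len(present)) / len(values)
    if missing_fraction > max_missing_fraction:
        raise ValueError(
            f"{missing_fraction:.0%} of values are missing; imputation is not advised"
        )
    fill = statistics.median(present)
    return [fill if v is None else v for v in values]


# Illustrative values only: ten soft tissue claim counts, one of them missing.
num_soft_tissue = [2, None, 0, 1, 0, 0, 1, 0, 0, 0]

indicator = missing_indicator(num_soft_tissue)
imputed = impute_median(num_soft_tissue)
print(indicator)
print(imputed)
```

Deriving the indicator before imputing keeps a record of which values were originally missing.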
Handling Outliers
The easiest way to handle outliers is to use a clamp transformation that clamps all values above an upper threshold and below a lower threshold to these threshold values, thus removing the offending outliers:

$$a_i = \begin{cases} \mathit{lower} & \text{if } a_i < \mathit{lower} \\ \mathit{upper} & \text{if } a_i > \mathit{upper} \\ a_i & \text{otherwise} \end{cases} \tag{2}$$

where a_i is a specific value of feature a, and lower and upper are the lower and upper thresholds.
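Equation (2) translates almost directly into code. The sketch below assumes a hypothetical function name `clamp`; the thresholds 0 and 80,000 are the manually chosen values listed in the data quality plan for the case study, and the sample values are illustrative.

```python
def clamp(values, lower, upper):
    """Apply the clamp transformation of Equation (2) to every value."""
    return [min(max(v, lower), upper) for v in values]


# Illustrative claim amounts, including extreme low and high values.
claim_amount = [-99999, 5097, 270200, 80000, 0]
clamped = clamp(claim_amount, lower=0, upper=80000)
print(clamped)
```

Values already inside the thresholds, including values equal to a threshold, pass through unchanged.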

What handling strategies would you recommend for the data quality issues found in the motor insurance fraud ABT?
Table: The data quality plan for the motor insurance fraud prediction ABT.

| Feature | Data Quality Issue | Potential Handling Strategies |
|---|---|---|
| NUM SOFT TISSUE | Missing values (2%) | Imputation (median: 0.0) |
| CLAIM AMOUNT | Outliers (high) | Clamp transformation (manual: 0, 80,000) |
| AMOUNT RECEIVED | Outliers (high) | Clamp transformation (manual: 0, 80,000) |
Summary
The key outcomes of the data exploration process are that the practitioner should:
1 Have gotten to know the features within the ABT, especially their central tendencies, variations, and distributions.
2 Have identified any data quality issues within the ABT, in particular missing values, irregular cardinality, and outliers.
3 Have corrected any data quality issues due to invalid data.
4 Have recorded any data quality issues due to valid data in a data quality plan along with potential handling strategies.
5 Be confident that enough good quality data exists to continue with a project.