3 Data Preprocessing


Data Pre-processing

Source: Chapter 3 of Data Mining by Han and Kamber

CSE25.8: Elective-I
MCA II Sem
March-June 2022

Content
• Data cleaning
• Data integration
• Data reduction
• Data transformation
• Data discretization

Data pre-processing
• How can the data be preprocessed in order to help improve the
  quality of the data and, consequently, of the mining results?
• How can the data be preprocessed so as to improve the efficiency
  and ease of the mining process?

Data Quality
• There are many factors comprising data quality,
including:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability

• Inaccurate, incomplete, and inconsistent data are commonplace
  properties of large real-world databases and data warehouses.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation

Data Cleaning
• Missing Values
• Noisy data

Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units

Issues of Information Integration

• e.g. An automobile company has 1000 dealers, each of which
  maintains a database of the cars in its stock. The company wants to
  create an integrated database containing information from all 1000
  sources.
• The integrated database will help dealers locate a particular model
  if they don’t have one in stock. It can also be used by corporate
  analysts to predict the market and adjust production to provide the
  models most likely to sell.
• However, all 1000 dealers do not use the same database schema.
• Cars(serialNo, model, color, autoTrans, cdPlayer, …)
• Autos(serial, model, color) and Options(serial, option)

Issues of Information Integration
• Data type differences
– Serial numbers may be represented by characters, numbers, fixed
v/s varying length at different sources
• Value differences
– The same concept may be represented by different constants at
different sources. E.g. color black can be given a code BL, and BL
may mean blue color at another source, etc.
• Semantic differences
– Terms may be given different interpretations at different sources.
E.g. One dealer may include trucks in Cars relation, another may
include station wagons and mini vans, etc.
• Missing values
– A source might not record information of a type that most other
  sources provide. E.g. one dealer might not record some minor
  accessories at all, etc.

Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing

Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data,
  e.g., instrument faulty, human or computer error, transmission error
  – incomplete: lacking attribute values, lacking certain attributes of
    interest, or containing only aggregate data
    • e.g., Occupation=“ ” (missing data)
  – noisy: containing noise, errors, or outliers
    • e.g., Salary=“−10” (an error)
  – inconsistent: containing discrepancies in codes or names, e.g.,
    • Age=“42”, Birthday=“03/07/2010”
    • Was rating “1, 2, 3”, now rating “A, B, C”
    • discrepancy between duplicate records
  – intentional (e.g., disguised missing data)
    • Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data


• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the
time of entry
– not register history or changes of the data
• Missing data may need to be inferred

How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as
Bayesian formula or decision tree
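As a quick illustration of the automatic strategies above, here is a minimal pandas sketch (not from the book; the column names and values are made up) that fills missing values with a global constant, the attribute mean, and the class-conditional mean.

```python
# A minimal sketch (not from the slides) of automatic missing-value filling
# with pandas; the column names and values are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "income": [45000, None, 52000, None, 61000],
    "class":  ["A", "A", "B", "B", "B"],
})

# Fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples in the same class (usually smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```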

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data

How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g.,
deal with possible outliers)

Binning

• In smoothing by bin boundaries, the minimum and maximum values in a
  given bin are identified as the bin boundaries.
• Each bin value is then replaced by the closest boundary value.
Binning
• In general, the larger the width, the greater the effect of
the smoothing.
• Alternatively, bins may be equal width, where the interval
range of values in each bin is constant.

• Binning is also used as a discretization technique.

Outlier detection with Clustering

A 2-D customer data plot showing three data clusters.


Outliers may be detected as values that fall outside of the cluster sets.
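A minimal sketch of this idea, assuming scikit-learn is available; the synthetic data, the choice of three clusters, and the 3-standard-deviation threshold are illustrative and not from the slides.

```python
# Illustrative sketch (not from the slides): flag points far from their
# cluster centroid as potential outliers. The threshold choice is arbitrary.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
X = np.vstack([X, [[10, -3]]])          # one far-away point

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points whose distance to their centroid is unusually large
threshold = dist.mean() + 3 * dist.std()
print("potential outliers:", np.where(dist > threshold)[0])
```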
Data discrepancy detection
• Discrepancies can be caused by several factors
– Poorly designed data entry forms that have many optional fields
– Human error/deliberate error in data entry
– Data decay (e.g. outdated addresses)
– Inconsistent data representations (e.g. where a given attribute
can have different names in different databases)
• In order to deal with data discrepancies, use knowledge we may
  already have regarding properties of the data, i.e., metadata
• Meta data includes:
– domain and data type of each attribute
– Acceptable values for each attribute
– Range of length of values, do all values fall within the range; e.g.
values that are more than 2 standard deviations away from mean
for an attribute may be flagged as potential outliers.

Field overloading
• Field overloading is another source of errors that
typically results when developers squeeze new attribute
definitions into unused bit portions of already defined
attributes

• e.g. using an unused bit of an attribute whose value range uses,
  say, only 31 out of 32 bits.
Applying various rules
• A unique rule says that each value of given attribute
must be different from all other values for that attribute.
• A consecutive rule says there can be no missing values
between lowest and highest values for the attribute, and
all values must also be unique (e.g. check numbers)
• A null rule specifies use of blanks, question marks,
special characters, or other strings that may indicate null
condition (e.g. where a value for a given attribute is not
available)

Data Cleaning as a Process


• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and clustering)
• Data migration and integration
– Data migration tools: allow transformations to be specified e.g.
replace string “gender” by “sex”
– ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheel)

Mitosheet

• Mitosheet is simply a spreadsheet that exists inside of your
  JupyterLab notebook.
• This allows you to very quickly bounce back and forth between doing
  data analysis in Python and exploring / manipulating your data in
  the Mitosheet.

https://docs.trymito.io/getting-started/installing-mito

Data Reduction
• Data reduction strategies include dimensionality
reduction, numerosity reduction, and data
compression.
• Dimensionality reduction is the process of reducing
the number of random variables or attributes under
consideration. Dimensionality reduction methods
include wavelet transform etc.

Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality

Data Normalization
• The measurement unit used can affect the data analysis.
• For example, changing measurement units from meters to
inches for height, or from kilograms to pounds for weight, may
lead to very different results.
• In general, expressing an attribute in smaller units will lead to
a larger range for that attribute, and thus tend to give such an
attribute greater effect or “weight.”
• To help avoid dependence on the choice of measurement
units, the data should be normalized or standardized.
• Normalizing the data attempts to give all
attributes an equal weight.

Data Normalization

• The attribute data are scaled so as to fall within a smaller range,
  such as -1.0 to 1.0, or 0.0 to 1.0.
• It also helps bring training and test data onto the same scale.

Data Normalization
• Normalization is particularly useful for classification algorithms
involving neural networks or distance measurements such as
nearest-neighbor classification and clustering.
• If using the neural network backpropagation algorithm for
classification, normalizing the input values for each attribute
measured in the training tuples will help speed up the learning
phase.
• For distance-based methods, normalization helps prevent
attributes with initially large ranges (e.g., income) from
outweighing attributes with initially smaller ranges (e.g.,
binary attributes).
• It is also useful when given no prior knowledge of the data.
• There are many methods for data normalization.

LP-Norm
• For normalizing data using LP norms, first we need to
calculate their norms.
• Norm of a vector represents some property of vector; e.g.
L2 norm of a vector denotes its length/magnitude.
• One important use of norm is to transform a given
vector into a unit-length vector
– i.e., making the magnitude of vector = 1, while still preserving its
direction.
• This is achieved by dividing each element in a vector by
its length i.e. its L2-norm.
• Another use of L1 & L2 norm:
– in computation of loss in regularized gradient descent algorithms.
– These are also used in famous Ridge and Lasso regression
algorithms.

LP-Norm
• The p-norm is a norm on suitable real vector spaces given by the
  pth root of the sum (or integral) of the pth-powers of the absolute
  values of the vector components.
• The general definition for the p-norm of a vector v that has N
  elements is:

  ||v||_p = ( Σ_{i=1}^{N} |v_i|^p )^(1/p)

  where p is any positive real value, Inf, or -Inf. Some common values
  of p are 1, 2, and Inf.
  • If p is 1, then the resulting 1-norm is the sum of the absolute
    values of the vector elements.
  • If p is 2, then the resulting 2-norm gives the vector magnitude or
    Euclidean length of the vector.
  • If p is Inf, then the resulting Inf-norm is the maximum of the
    absolute values of the vector elements: ||v||_Inf = max_i |v_i|.
• https://www.journaldev.com/45324/norm-of-vector-python
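A short NumPy sketch of these norms (the vector values are arbitrary); np.linalg.norm computes the L1, L2, and Inf norms directly, and dividing by the L2 norm yields a unit-length vector.

```python
# Quick sketch of computing L1, L2, and Inf norms with NumPy and using the
# L2 norm to produce a unit-length vector (the values here are arbitrary).
import numpy as np

v = np.array([3.0, -4.0, 12.0])

l1  = np.linalg.norm(v, ord=1)       # sum of absolute values -> 19.0
l2  = np.linalg.norm(v, ord=2)       # Euclidean length       -> 13.0
inf = np.linalg.norm(v, ord=np.inf)  # max absolute value     -> 12.0

unit_v = v / l2                      # unit-length vector, same direction
print(l1, l2, inf, np.linalg.norm(unit_v))   # last value is 1.0
```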

Rescaling
• Rescaling changes the distance between the
min and max values in a data set by stretching
or squeezing the points along the number line.

• The equation for rescaling data X to an arbitrary interval [a, b] is:

  X' = a + (X − min(X)) × (b − a) / (max(X) − min(X))

Example: Min-max normalization


• Min-max normalization: to [new_min_A, new_max_A]

  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

  – E.g. Let income range from $12,000 to $98,000 be normalized to
    [0.0, 1.0]. Then $73,600 is mapped to

    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

• Normalization by decimal scaling

  v' = v / 10^j   where j is the smallest integer such that max(|v'|) < 1
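A small Python sketch of both normalizations, reproducing the income example above; the helper names min_max and decimal_scaling are my own.

```python
# Sketch of min-max normalization and decimal scaling, reproducing the
# income example from the slide ($12,000–$98,000 mapped to [0.0, 1.0]).
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def decimal_scaling(v):
    v = np.asarray(v, dtype=float)
    # smallest j such that max(|v'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10 ** j

income = np.array([12000, 73600, 98000])
print(min_max(income))          # [0.    0.716 1.   ] (approximately)
print(decimal_scaling(income))  # [0.12  0.736 0.98 ]
```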
Standard Normal Distribution
• A standard normal distribution has mean = 0 and standard
deviation=1.
• Z-score normalization is useful in machine learning
settings since it can tell you:
– how far a data point is from the average of
the whole data set.
• It can be most appropriate when there are just a few
outliers, since it provides a simple way to compare a data
point to the norm.

• We calculate a z-score when comparing data sets that are likely to
  be similar because of some genetic or experimental reason, e.g.
– a physical attribute of an animal or
– results within a certain time frame.

Z-score
• z-scores measure the distance of a data point from the
mean in terms of the standard deviation.
• The standardized data set has mean=0 and standard
deviation 1, and retains the shape properties of the original
data set.

Z-score
• The relative positions of the data are preserved, so the shape of
  the distribution remains the same.
• z-score normalization:  v' = (v − μ_A) / σ_A
  (μ_A: mean, σ_A: standard deviation of attribute A)
• Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to

  (73,600 − 54,000) / 16,000 = 1.225
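A small sketch of z-score normalization using the slide's numbers; the helper name z_score is mine.

```python
# Sketch of z-score normalization with the slide's numbers
# (μ = 54,000, σ = 16,000, value $73,600).
import numpy as np

def z_score(v, mean=None, std=None):
    v = np.asarray(v, dtype=float)
    mean = v.mean() if mean is None else mean
    std = v.std() if std is None else std
    return (v - mean) / std

print(z_score(np.array([73600.0]), mean=54000.0, std=16000.0))  # [1.225]
```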

IQR
• The interquartile range (IQR) of a data set describes the
range of the middle 50% of values when the values are
sorted.

• If the median of the data is Q2, the median of the lower half of
  the data is Q1, and the median of the upper half of the data is Q3,
  then IQR = Q3 - Q1.
• The IQR is generally preferred over looking at the full
range of the data when the data contains outliers (very
large or very small values) because the IQR excludes the
largest 25% and smallest 25% of values in the data.
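A brief NumPy sketch of computing the quartiles and IQR; the sample values are arbitrary, and the 1.5 × IQR outlier flag at the end is a common convention, not something stated on the slide.

```python
# Sketch of computing Q1, Q3, and the IQR with NumPy; sample values arbitrary.
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(q1, q2, q3, iqr)

# A common rule of thumb: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)
```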

Normalisation
• It is necessary to save the normalization parameters
(e.g., the mean and standard deviation etc) so that
future data can be normalized in a uniform manner.

Redundancy and Correlation Analysis


• Redundancy is another important issue in data integration.
• An attribute (such as annual revenue, for instance) may be
redundant if it can be “derived” from another attribute or set
of attributes.
• Inconsistencies in attribute or dimension naming can also
cause redundancies in the resulting data set.
• Some redundancies can be detected by
correlation analysis.

Redundancy and Correlation Analysis
• Given two attributes, such analysis can measure how strongly
one attribute implies the other, based on the available data.
• For nominal data, we use the χ2 (chi-square) test.
• For numeric attributes, we can use the correlation coefficient and
  covariance, both of which assess how one attribute’s values vary
  from those of another.

                        Categorical            Continuous-value
  Categorical           χ2 test (p-value)      t-test (for 2 values)
                                               ANOVA (for > 2 values)
  Continuous-value      Logistic regression    Correlation coefficient
                                               (-1 to +1)

Correlation Analysis (Nominal Data)


• χ2 (chi-square) test:

  χ2 = Σ (Observed − Expected)^2 / Expected

• The larger the χ 2 value, the more likely the variables are
related
• The cells that contribute the most to the χ 2 value are
those whose actual count is very different from the
expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population

Expected value
• The expected value (EV) is an anticipated value for an
investment at some point in the future.
• In statistics and probability analysis, the expected
value is calculated by multiplying each of the possible
outcomes by the likelihood each outcome will occur and
then summing all of those values.

• The expected count/frequency for the joint event (A = a_i, B = b_j)
  of two attributes A and B is given as:

  e_ij = (count(A = a_i) × count(B = b_j)) / n

  where n is the total number of data tuples.
• Cells that contribute the most to the chi-square value are those
  whose actual count is very different from the expected count.

Chi Square P-Values


• The statistic tests the null hypothesis that A and B are
independent, that is, there is no correlation between
them.
• A chi square test gives a p-value.
– The p-value tells if your test results are significant or not.
• In order to perform a chi square test and get the p-value,
you need two pieces of information:

• Degrees of freedom:
  – for a contingency table with r rows and c columns, df = (r − 1) × (c − 1)
• The alpha level (α):
  – chosen by you, or the researcher
  – the usual alpha level is 0.05 (5%), but you could also use other
    levels like 0.01 or 0.10.

e.g.
• Suppose that a group of 1500 people was surveyed.
• The gender of each person was noted.
• Each person was polled as to whether his or her
preferred type of reading material was fiction or
nonfiction.
• Thus, we have two attributes, gender and preferred
reading.
• The observed frequency (or count) of each possible joint
event is summarized in the contingency table shown in
Table 3.1

Chi-Square Calculation: An Example

• χ2 (chi-square) calculation (numbers in parentheses are the expected
  counts, calculated based on the data distribution in the two
  categories):

  χ2 = (250 − 90)^2/90 + (50 − 210)^2/210 + (200 − 360)^2/360 + (1000 − 840)^2/840 = 507.93
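The same computation can be checked with SciPy, as sketched below. The row/column labels of the observed table are inferred from the expected counts above (the contingency table itself is not reproduced here), and correction=False disables Yates' continuity correction so the result matches the hand calculation.

```python
# Sketch of the slide's chi-square computation using SciPy; the observed
# counts are the four cells from the example above.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250,  200],    # fiction:     male, female
                     [ 50, 1000]])   # non-fiction: male, female

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)   # chi2 ≈ 507.93, p ≪ 0.001, dof = 1
print(expected)       # [[ 90. 360.] [210. 840.]]
```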
Hypothesis that the attributes are not
correlated is rejected
• For this 2 X 2 table, the degrees of freedom are (2-1) X
(2-1).

• For 1 degree of freedom, the chi-square value needed to reject the
  hypothesis at the 0.001 significance level is 10.828 (from any
  textbook on statistics).

• Since our computed value is above this, we can reject the hypothesis
  that gender and preferred reading are independent and conclude that
  the two attributes are (strongly) correlated for the given group of
  people.

• The next slide shows the chi-square distribution table.

Chi-distribution Table

Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product moment
  coefficient):

  r_{A,B} = Σ_{i=1}^{n} (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B)
          = ( Σ_{i=1}^{n} a_i b_i − n Ā B̄ ) / ((n − 1) σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective means
  of A and B, σ_A and σ_B are the respective standard deviations of A
  and B, and Σ a_i b_i is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s). The higher, the stronger correlation.
• rA,B = 0: independent; rAB < 0: negatively correlated
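A small sketch computing r_{A,B} both from the formula above and with NumPy; the sample values are made up.

```python
# Sketch of the Pearson correlation coefficient computed two ways
# (by the formula above and with np.corrcoef); the sample data are made up.
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.0, 3.0, 2.0, 5.0,  4.0])

n = len(a)
r_formula = ((a - a.mean()) * (b - b.mean())).sum() / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1))

r_numpy = np.corrcoef(a, b)[0, 1]
print(r_formula, r_numpy)   # both 0.8 -> positively correlated
```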

Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.
Heuristic Search in Attribute Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
– Best single attribute under the attribute independence
assumption: choose by significance tests
– Best step-wise feature selection:
• The best single-attribute is picked first
• Then the next best attribute conditioned on the first, ...
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
– Best combined attribute selection and elimination
– Optimal branch and bound:
• Use attribute elimination and backtracking

Attribute Creation (Feature Generation)


• Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
• Three general methodologies
– Attribute extraction
• Domain-specific
– Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet
transformation, manifold approaches (not covered)
– Attribute construction
• Combining features (see: discriminative frequent
patterns in Chapter 7)
• Data discretization

Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
– Duplicate much or all of the information contained in
one or more other attributes
– E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data
mining task at hand
– E.g., students' ID is often irrelevant to the task of
predicting students' GPA

Principal Component Analysis (PCA)


• Find a projection that captures the largest amount of
variation in data
• The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the
eigenvectors of the covariance matrix, and these
eigenvectors define the new space
Figure: 2-D data plotted along axes x1 and x2.
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
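A minimal NumPy sketch of these steps (normalize, eigendecompose the covariance matrix, keep the strongest components, project); the random data and the choice k = 1 are for illustration only.

```python
# Minimal PCA sketch (numeric data only): normalize, compute the covariance
# matrix, take its top-k eigenvectors, and project the data onto them.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 tuples, n = 3 attributes
X = (X - X.mean(axis=0)) / X.std(axis=0)  # normalize each attribute

cov = np.cov(X, rowvar=False)             # n x n covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order

k = 1                                     # keep the strongest component(s)
order = np.argsort(eigvals)[::-1][:k]
components = eigvecs[:, order]            # principal components (unit vectors)

X_reduced = X @ components                # project onto the new k-D space
print(X_reduced.shape)                    # (100, 1)
```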

Data Reduction 2: Numerosity Reduction


• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
– Ex.: Log-linear models—obtain value at a point in m-
D space as the product on appropriate marginal
subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression
and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions

Regression Analysis

Figure: data points (x, y) with a fitted regression line y = x + 1.

• Regression analysis: A collective name for techniques for the
  modeling and analysis of numerical data consisting of values of a
  dependent variable (also called response variable or measurement)
  and of one or more independent variables (aka. explanatory variables
  or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least squares
  method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data),
  inference, hypothesis testing, and modeling of causal relationships
Regression Analysis and Log-Linear Models

• Linear regression: Y = w X + b
  – Two regression coefficients, w and b, specify the line and are to
    be estimated by using the data at hand
  – Using the least squares criterion on the known values of Y1, Y2,
    …, X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
• Log-linear models:
– Approximate discrete multidimensional probability distributions
– Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
– Useful for dimensionality reduction and data smoothing
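A short sketch of the linear-regression case: fit Y = wX + b by least squares and keep only the two coefficients; the synthetic data and noise level are purely illustrative.

```python
# Sketch of parametric data reduction with linear regression: fit Y = w*X + b
# by least squares and store only the two coefficients (data are synthetic).
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 2.5 * X + 1.0 + rng.normal(scale=0.5, size=X.size)

w, b = np.polyfit(X, Y, deg=1)   # least-squares estimates of w and b
print(w, b)                      # close to 2.5 and 1.0

Y_hat = w * X + b                # the stored model reproduces Y approximately
```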

Histogram Analysis

• Divide data into buckets and store the average (sum) for each bucket
• Partitioning rules:
  – Equal-width: equal bucket range
  – Equal-frequency (or equal-depth)

Figure: an example histogram of prices with buckets from 10000 to 90000.
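A small NumPy sketch of the two partitioning rules; the price values are arbitrary.

```python
# Sketch of equal-width vs. equal-frequency bucketing with NumPy.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width buckets: split the range [min, max] into 3 equal intervals
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)

# Equal-frequency (equal-depth) buckets: bucket edges at the quantiles
eq_freq_edges = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
print(eq_freq_edges)
```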

Clustering
• Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
• Cluster analysis will be studied in depth in Chapter 10

Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic
rank
– Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification

Data Discretization Methods
• Typical methods: All the methods can be applied
recursively
– Binning
• Top-down split, unsupervised
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or
bottom-up merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)

Data discretization and reduction


• The binning technique reduces the number of distinct values per
  attribute.
• This acts as a form of data reduction for logic-based data mining
  methods, such as decision tree induction, which repeatedly make
  value comparisons on sorted data.
• Concept hierarchies are a form of data discretization that can also
  be used for data smoothing.
• A concept hierarchy for price, for example, may map real price
  values into inexpensive, moderately priced, and expensive, thereby
  reducing the number of data values to be handled by the mining
  process.

Simple Discretization: Binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well

• Equal-depth (frequency) partitioning


– Divides the range into N intervals, each containing approximately
same number of samples
– Good data scaling
– Managing categorical attributes can be tricky

Binning Methods for Data Discretization

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
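The example above can be reproduced with a short sketch like the following (the helper names are mine, not from the book).

```python
# Sketch reproducing the slide's equal-frequency binning and the two
# smoothing variants (by bin means and by bin boundaries).
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.split(np.sort(prices), 3)          # 3 equal-frequency bins

by_means = [np.full(len(b), round(b.mean())) for b in bins]

def to_boundaries(b):
    # replace each value by the closest of the bin's min/max boundary
    return np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())

by_boundaries = [to_boundaries(b) for b in bins]

print(by_means)        # [9,9,9,9], [23,23,23,23], [29,29,29,29]
print(by_boundaries)   # [4,4,4,15], [21,21,25,25], [26,26,26,34]
```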
Summary
• Data quality: accuracy, completeness, consistency,
timeliness, believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
– Entity identification problem
– Remove redundancies
– Detect inconsistencies
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
