R21 Unit 2


UNIT - II: Data Pre-processing

Aggregation, Sampling, Dimensionality Reduction, Feature Subset Selection, Feature Creation, Discretization and Binarization, Variable Transformation, Measures of Similarity and Dissimilarity.
• Preprocessing steps should be applied to make the data more
suitable for data mining.
• These techniques fall into two categories: selecting data objects and attributes for the analysis, and creating/changing the attributes.
Aggregation

• Aggregation is the combining of two or more objects into a single object.
• Consider a data set consisting of transactions recording the daily
sales of products in various store locations (Minneapolis,
Chicago, Paris, . . .) for different days over the course of a year.
• One way to aggregate transactions for this data set is to replace all the
transactions of a single store with a single storewide transaction.
• An obvious issue is how an aggregate transaction is created; i.e., how
the values of each attribute are combined across all the records.
• Quantitative attributes, such as price, are aggregated by taking a sum
or an average.
• A qualitative attribute, such as item, can be summarized as the set of
all the items that were sold at that location.
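• As a minimal sketch of this kind of aggregation (assuming a pandas DataFrame of transactions with hypothetical column names; the library choice and data are illustrative, not part of the original example):

```python
# A minimal sketch of store-level aggregation, assuming a pandas DataFrame
# of transactions with hypothetical columns: store, date, item, price.
import pandas as pd

transactions = pd.DataFrame({
    "store": ["Minneapolis", "Minneapolis", "Chicago", "Chicago"],
    "date":  ["2023-01-01", "2023-01-01", "2023-01-01", "2023-01-02"],
    "item":  ["shoes", "socks", "shoes", "hat"],
    "price": [50.0, 5.0, 55.0, 20.0],
})

# Quantitative attribute (price): aggregate by sum (or mean);
# qualitative attribute (item): summarize as the set of items sold.
storewide = transactions.groupby("store").agg(
    total_price=("price", "sum"),
    items_sold=("item", lambda s: set(s)),
)
print(storewide)
```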
• There are several motivations for aggregation.
• First, the smaller data sets resulting from data reduction require less memory and processing time.
• Second, aggregation can act as a change of scope by providing a high-level view of the data instead of a low-level view.
• Finally, the behavior of groups of objects or attributes is often more stable than that of individual objects or attributes.
• This example is based on precipitation in Australia from the
period 1982 to 1993.

Histogram of average monthly precipitation


Sampling

• Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed.
• Statisticians use sampling because obtaining the entire set of
data is too expensive or time consuming.
The key principle for effective sampling is
• Using a sample will work almost as well as using the entire data set
if the sample is representative.
• In turn, a sample is representative if it has approximately the same
property as the original set of data.
• If the mean (average) of the data objects is the property, then a
sample is representative if it has mean that is close to the original
data.
Sampling Approaches
• There are many sampling techniques.

Simple random sampling: For this type of sampling, there is an equal probability of selecting any particular item.
There are two variations on random sampling:
(1) Sampling without replacement: as each item is selected, it is removed from the set of all objects that together constitute the data set.
(2) Sampling with replacement: objects are not removed from the data set as they are selected for the sample.
• In sampling with replacement, the same object can be picked more than once.
• Stratified sampling starts with prespecified groups of objects.
• Equal numbers of objects are drawn from each group even though
the groups are of different sizes.
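• A minimal sketch of these sampling variants with NumPy (the data and group labels below are illustrative, not taken from the examples in the slides):

```python
# Illustrative sketch of simple random sampling (with/without replacement)
# and a basic stratified sample; the data and group labels are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(100)                          # 100 data objects
groups = np.repeat(["A", "B", "C", "D"], 25)   # prespecified groups

without_repl = rng.choice(data, size=10, replace=False)  # no duplicates
with_repl = rng.choice(data, size=10, replace=True)      # duplicates possible

# Stratified sampling: draw an equal number of objects from each group.
stratified = np.concatenate(
    [rng.choice(data[groups == g], size=5, replace=False)
     for g in np.unique(groups)]
)
print(without_repl, with_repl, stratified, sep="\n")
```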
• Example (Sampling and Loss of Information). Once a sampling
technique is selected, it is necessary to choose the sample size.

Example of the loss of structure with sampling.


• Example (Determining the Proper Sample Size): Given a set of data that consists of equal-sized groups, find at least one representative point for each of the groups.

Finding representative points from 10 groups.
Progressive Sampling
• The proper sample size can be difficult to determine, so adaptive
or progressive sampling schemes are sometimes used.
• These approaches start with a small sample, and then increase the
sample size until a sample of sufficient size has been obtained.
Dimensionality Reduction

• Data sets can have a large number of features.


• Consider a set of documents, where each document is represented
by a vector whose components are the frequencies with which each
word occurs in the document.
• In such cases, there are thousands of attributes.
• As another example, consider a set of time series consisting of the daily closing prices of various stocks over a period of 30 years.
There are a variety of benefits to dimensionality reduction.
• Many data mining algorithms work better if the dimensionality of the data is lower.
• Reduction of dimensionality can lead to a more understandable model.
• Dimensionality reduction allows the data to be more easily visualized.
• Finally, the amount of time and memory required by the data mining algorithm is reduced.
• The curse of dimensionality refers to the phenomenon that data analysis becomes harder as the dimensionality of the data increases.
• For classification, there may not be enough data objects to allow the creation of a reliable model.
• For clustering, the distances between points, which are critical for clustering, become less meaningful.
Linear Algebra Techniques for Dimensionality Reduction
• Principal Component Analysis (PCA) is a dimensionality reduction method used to reduce the dimensionality of data sets by transforming a large set of variables into a smaller one.
• PCA is a linear algebra technique for continuous attributes that finds new attributes (principal components) that
(1) are linear combinations of the original attributes,
(2) are orthogonal (perpendicular) to each other, and
(3) capture the maximum amount of variation in the data.
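• A minimal PCA sketch on synthetic correlated data (scikit-learn is an assumed toolchain; the slides do not prescribe an implementation):

```python
# Minimal PCA sketch on synthetic correlated data (toolchain is an assumption).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)   # make two attributes correlated
X[:, 4] = X[:, 1] + 0.1 * rng.normal(size=200)

pca = PCA(n_components=2)          # keep the two components with most variance
X_reduced = pca.fit_transform(X)   # components are orthogonal linear combinations

print(X_reduced.shape)                   # (200, 2)
print(pca.explained_variance_ratio_)     # fraction of variance captured
```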
Feature Subset Selection
• Another way to reduce the dimensionality is to use only a subset
of the features.
• This method is used if redundant or irrelevant features are present.
• Redundant and irrelevant features can reduce classification
accuracy.
• The ideal approach to feature selection is to try all possible
subsets of features as input to the data mining algorithm.
• There are three standard approaches to feature selection:
embedded, filter, and wrapper.
• Embedded approaches: Feature selection occurs as part of the data mining algorithm.
• Specifically, during the operation of the data mining algorithm, the algorithm itself decides which attributes to use and which to ignore.
• Example: decision tree classifiers.
• Filter approaches: Features are selected before the data mining algorithm is run, using some approach that is independent of the data mining task.
• Wrapper approaches: These methods use the target data mining algorithm as a black box to find the best subset of attributes, as in the sketch below.
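• A minimal sketch of a wrapper-style greedy forward selection (the scikit-learn classes, classifier choice, and synthetic data are assumptions for illustration, not part of the original slides):

```python
# Sketch of a wrapper approach: greedy forward selection that uses the target
# classifier as a black box. Classifier and data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, n_redundant=2, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Evaluate adding each remaining feature to the current subset.
    scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:          # stopping criterion: no improvement
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected features:", selected, "cv accuracy:", round(best_score, 3))
```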
An Architecture for Feature Subset Selection
• The feature selection process contains four parts: a measure for
evaluating a subset, a search strategy that controls the generation
of a new subset of features, a stopping criterion, and a validation
procedure.
• Filter methods and wrapper methods differ only in the way in
which they evaluate a subset of features.
Flowchart of a feature subset selection process
• Feature subset selection is a search over all possible subsets of features.
• Different types of search strategies can be used, but the search strategy should be computationally inexpensive and should find optimal or near-optimal sets of features.
• Next, an evaluation step is needed to judge how the current subset of features compares to others that have been considered.
• This requires an evaluation measure to determine the goodness of a
subset of attributes with respect to a data mining task
• Because the number of subsets is too large to examine them all, some sort of stopping criterion is necessary.
• This strategy is usually based on one or more conditions, such as the number of iterations or whether a subset of a certain size has been obtained.
• Finally, once a subset of features has been selected, the results of
the data mining algorithm on the selected subset should be
validated.
• A straightforward evaluation approach is to run the algorithm with
the full set of features and compare the full results to results
obtained using the subset of features.
• Feature Weighting is an alternative to keeping or eliminating
features.
• More important features are assigned a higher weight, while less
important features are given a lower weight.
• These weights are assigned based on domain knowledge about the
relative importance of features
Feature Creation
• It is possible to create, from the original attributes, a new set of
attributes that captures the important information in a data set.
• The number of new attributes can be smaller than the number of
original attributes.
• Three methods for creating new attributes are : feature extraction,
mapping the data to a new space, and feature construction.
Feature Extraction
• The creation of a new set of features from the original raw data
is known as feature extraction.
• Consider a set of photographs, where each photograph is to be
classified according to whether or not it contains a human face.
• The raw data is a set of pixels.
Mapping the Data to a New Space
• A totally different view of the data can reveal important features.
• Consider, for example, time series data, which contains periodic
patterns.
• If there is only a single periodic pattern and not much noise, then
the pattern is easily detected.
• Such patterns can be detected by applying a Fourier transform to
the time series to change to a representation in which frequency
information is used.
• The Fourier transform is a mathematical formula that transforms a
signal sampled in time or space to the same signal sampled in
temporal or spatial frequency.
Application of the Fourier transform in time series data.
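• A minimal sketch of mapping a noisy periodic time series to the frequency domain with NumPy's FFT (the series itself is synthetic, used only to illustrate the idea):

```python
# Sketch: reveal a periodic pattern in a synthetic time series by mapping it
# to the frequency domain with the (fast) Fourier transform.
import numpy as np

n = 512                               # number of samples
t = np.arange(n)
signal = np.sin(2 * np.pi * t / 32) + 0.3 * np.random.default_rng(0).normal(size=n)

spectrum = np.abs(np.fft.rfft(signal))         # magnitude of frequency components
freqs = np.fft.rfftfreq(n, d=1.0)              # corresponding frequencies (cycles/sample)

dominant = freqs[np.argmax(spectrum[1:]) + 1]  # skip the zero-frequency term
print("dominant frequency:", dominant, "-> period of about", 1 / dominant, "samples")
```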
Feature Construction
• Sometimes the features in the original data set have the necessary information, but not in a form suitable for the data mining algorithm.
• In this situation, one or more new features constructed from
original features, can be more useful than the original features.
Example (Density).
• Consider a data set consisting of information about historical artifacts, which contains the volume and mass of each artifact.
• Assume that these artifacts are made of materials such as bronze and gold, and we want to classify the artifacts with respect to the material.
• In this case, a density feature constructed from the mass and volume features, i.e., density = mass/volume, would be more useful than the original features.
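• A minimal sketch of constructing the density feature (the column names and values below are hypothetical):

```python
# Sketch: construct a density feature from mass and volume (hypothetical data).
import pandas as pd

artifacts = pd.DataFrame({
    "mass_g":     [193.0, 515.0, 89.0],
    "volume_cm3": [10.0, 60.0, 10.0],
})
artifacts["density"] = artifacts["mass_g"] / artifacts["volume_cm3"]  # g/cm^3
print(artifacts)  # the constructed feature separates the materials more directly
```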
Discretization and Binarization

• Some classification algorithms require that the data be in the form of categorical or binary attributes.
• Thus, it is often necessary to transform a continuous attribute into
a categorical attribute (discretization), and
• both continuous and discrete attributes may need to be transformed
into one or more binary attributes (binarization).
Binarization
• A simple technique to binarize a categorical attribute is the following: if there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m − 1].
• If the attribute is ordinal, then order must be maintained by the assignment.
• Next, convert each of these m integers to a binary number; ⌈log2(m)⌉ binary digits are required, and each binary digit becomes a binary attribute.
Conversion of a categorical attribute to three binary attributes.
• For example, in the table, attributes x2 and x3 are correlated because information about the "good" value is encoded using both attributes.
• In such cases, it is better to introduce one asymmetric binary attribute for each categorical value.
Conversion of a categorical attribute to five asymmetric binary attributes
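• A minimal sketch of both conversions for an ordinal attribute (the five values awful < poor < OK < good < great are illustrative; pandas is an assumed toolchain):

```python
# Sketch: binarize a categorical (ordinal) attribute in two ways.
import pandas as pd

values = pd.Series(["good", "awful", "great", "OK", "poor"])
order = ["awful", "poor", "OK", "good", "great"]

# Step 1: map each value to an integer in [0, m-1], preserving order.
codes = values.map({v: i for i, v in enumerate(order)})

# Variant A: encode the integer codes with ceil(log2(m)) = 3 binary digits.
binary_digits = codes.apply(lambda c: [int(b) for b in format(int(c), "03b")])
print(pd.DataFrame(binary_digits.tolist(), columns=["x1", "x2", "x3"]))

# Variant B: one asymmetric binary attribute per categorical value (one-hot).
print(pd.get_dummies(values)[order].astype(int))
```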
Discretization of Continuous Attributes

• Discretization is applied to attributes that are used in classification or association analysis.
• Transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories to have and determining how to map the values of the continuous attribute to these categories.
• In the first step, the values of the continuous attribute are sorted and then divided into n intervals by specifying n − 1 split points.
• In the second step, all the values in one interval are mapped to the same categorical value.
• Unsupervised Discretization A basic distinction between
discretization methods for classification is whether class
information is used (supervised) or not (unsupervised).
• The equal width approach divides the range of the attribute into a user-specified number of intervals, each having the same width. Such an approach can be affected by outliers.
• An equal frequency (equal depth) approach tries to put the same number of objects into each interval.
• As another example of unsupervised discretization, a clustering
method, such as K-means can be used.
Different discretization techniques
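• A minimal sketch of equal-width and equal-frequency discretization on synthetic skewed data (pandas cut/qcut are an assumed toolchain):

```python
# Sketch: unsupervised discretization of a continuous attribute into 4 intervals.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))  # skewed data

equal_width = pd.cut(x, bins=4)    # intervals of equal width (outlier-sensitive)
equal_freq = pd.qcut(x, q=4)       # intervals with (roughly) equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```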
Supervised Discretization
• A conceptually simple approach is to place the splits in a way
that maximizes the purity of the intervals.
• Entropy based approaches are used in discretization.
• Let k be the number of different class labels, m_i be the number of values in the ith interval of a partition, and m_ij be the number of values of class j in interval i.
• Then the entropy e_i of the ith interval is given by
e_i = − Σ_{j=1}^{k} p_ij log2(p_ij),
where p_ij = m_ij / m_i is the probability (fraction of values) of class j in the ith interval.
• The total entropy, e, of the partition is the weighted average of the individual interval entropies,
e = Σ_{i=1}^{n} w_i e_i,
where m is the number of values, w_i = m_i / m is the fraction of values in the ith interval, and n is the number of intervals.
• If an interval contains only values of one class (is perfectly pure),
then the entropy is 0.
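• A minimal sketch of computing the interval entropies and the total entropy for a hypothetical partition (the class counts below are illustrative):

```python
# Sketch: entropy-based evaluation of a discretization. m[i][j] is the number
# of values of class j in interval i; the counts below are illustrative.
import numpy as np

m = np.array([[8, 1],    # interval 1: 8 values of class 0, 1 of class 1
              [2, 9],    # interval 2
              [5, 5]])   # interval 3 (impure: entropy = 1)

m_i = m.sum(axis=1)                    # values per interval
p = m / m_i[:, None]                   # p_ij = m_ij / m_i
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(p > 0, p * np.log2(p), 0.0)
e_i = -terms.sum(axis=1)               # entropy of each interval
w_i = m_i / m_i.sum()                  # fraction of values in each interval
e = (w_i * e_i).sum()                  # total (weighted average) entropy

print("interval entropies:", e_i.round(3), "total entropy:", round(e, 3))
```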
Example (Discretization of Two Attributes)
• This method was used to independently discretize both the x and
y attributes of the two dimensional data.
• In the first discretization, the x and y attributes were both split
into three intervals.
• In the second discretization, the x and y attributes were both split
into five intervals.
Discretizing x and y attributes for four groups (classes) of points.
Variable Transformation

• Variable transformation refers to a transformation that is applied to all the values of a variable.
• There are two important types of variable transformations: simple functional transformations and normalization.
Simple Functions
• For this type of variable transformation, a simple mathematical
function is applied to each value individually.
• If x is a variable, then examples of such transformations include log x, e^x, √x, 1/x, sin x, or |x|.
• In statistics, variable transformations, especially sqrt, log, and 1/x, are often used to transform data that does not have a Gaussian (normal) distribution into data that does.
• Suppose the variable of interest is the number of data bytes in a
session, and the number of bytes ranges from 1 to 1 billion.
• This is a huge range, and it may be advantageous to compress it
by using a log10 transformation.
• Variable transformations should be applied with caution since they
change the nature of the data.
• For instance, the transformation 1/x reduces the magnitude of
values that are 1 or larger, but increases the magnitude of values
between 0 and 1.
• To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}.
Normalization or Standardization
• Another type of transformation is the normalization of a variable.
• The goal of normalization is to make an entire set of values have a
particular property.

• If x̄ is the mean (average) of the attribute values and s_x is their standard deviation, then the transformation x1 = (x − x̄) / s_x creates a new variable that has a mean of 0 and a standard deviation of 1.
• The mean and standard deviation are strongly affected by outliers, so the transformation is often modified.
• First, the mean is replaced by the median; second, the standard deviation is replaced by the absolute standard deviation.
• Specifically, if x is a variable, then the absolute standard deviation of x is given by
σ_A = Σ_{i=1}^{m} |x_i − µ|,
where x_i is the ith value of the variable, m is the number of objects, and µ is the median.
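• A minimal sketch contrasting the standard transformation with the more outlier-robust variant described above (the data is synthetic):

```python
# Sketch: standardization x1 = (x - mean) / std, and a robust variant that
# uses the median and the absolute standard deviation instead.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, 1000.0])  # one outlier

z_standard = (x - x.mean()) / x.std()

median = np.median(x)
abs_std = np.sum(np.abs(x - median))   # absolute standard deviation, as defined above
z_robust = (x - median) / abs_std

print(z_standard.round(3))
print(z_robust.round(3))
```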
Measures of Similarity and Dissimilarity

• The similarity between two objects is a numerical measure of the degree to which the two objects are alike.
• Similarities are higher for pairs of objects that are more alike.
• Similarities are between 0 (no similarity) and 1 (complete
similarity).
• The dissimilarity between two objects is a numerical measure of
the degree to which the two objects are different.
• Dissimilarities are lower for more similar pairs of objects.
• Frequently, the term distance is used as a synonym for
dissimilarity.
• Dissimilarities sometimes fall in the interval [0, 1], but it is also
common for them to range from 0 to ∞.
• Transformations are applied to convert a similarity to a
dissimilarity, or vice versa, or to transform a proximity measure to
fall within a range, such as [0,1].
• For example, if the similarities between objects range from 1 (not at
all similar) to 10 (completely similar), we can make them fall within
the range [0, 1] by using the transformation
s1 = (s−1)/9,
where s and s1 are the original and new similarity values, respectively.
• In the more general case, the transformation of similarities to the interval [0, 1] is given by
s1 = (s − min_s) / (max_s − min_s),
where max_s and min_s are the maximum and minimum similarity values.
• Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by
d1 = (d − min_d) / (max_d − min_d).
• If the similarity (or dissimilarity) falls in the interval [0,1], then
the dissimilarity can be defined as d = 1− s (s = 1 − d).
• If the dissimilarity ranges from 0 to ∞, then transformations such as s = e^(−d) or s = 1/(1 + d) can be used.
Similarity and Dissimilarity between Simple Attributes
• Consider objects described by one nominal attribute.
• In this case, similarity is traditionally defined as 1 if the attribute values match, and as 0 otherwise.
• A dissimilarity would be defined in the opposite way: 0 if the
attribute values match, and 1 if they do not.
• x and y are two objects that have one attribute.
• Also, d(x, y) and s(x, y) are the dissimilarity and similarity
between x and y.
Similarity and Dissimilarity for simple attributes
Dissimilarities(Distances) between Data Objects
• The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by
d(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)^2 ),
where n is the number of dimensions and x_k and y_k are the kth attributes (components) of x and y.
• The Euclidean distance measure is generalized by the Minkowski distance metric
d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^(1/r),
where r is a parameter.
• The following are the three most common examples of Minkowski
distances.

• r = 1. City block (Manhattan, L1 norm) distance.

• r = 2. Euclidean distance (L2 norm).

• r = ∞. Supremum (Lmax or L∞ norm) distance.
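• A minimal sketch computing these three distances for two example points (the points are illustrative):

```python
# Sketch: Manhattan (r=1), Euclidean (r=2), and supremum (r=infinity) distances.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def minkowski(x, y, r):
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

print("L1  :", minkowski(x, y, 1))        # 3 + 2 + 0 = 5
print("L2  :", minkowski(x, y, 2))        # sqrt(9 + 4) = about 3.606
print("Linf:", np.max(np.abs(x - y)))     # supremum: maximum difference = 3
```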


• Distances, such as the Euclidean distance, have some well-known properties.
• If d(x, y) is the distance between two points x and y, then:
1. Positivity: (a) d(x, y) ≥ 0 for all x and y, and (b) d(x, y) = 0 only if x = y.
2. Symmetry: d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
Example
x and y coordinates of four two-dimensional points, with their Euclidean (L2) distance matrix and their L1 distance matrix.
Similarities between Data Objects
• For similarities, the triangle inequality typically does not hold.
• If s(x, y) is the similarity between points x and y, then the typical properties are:
1. s(x, y) = 1 only if x = y (0 ≤ s ≤ 1).
2. s(x, y) = s(y, x) for all x and y (symmetry).
Examples of Proximity Measures
Similarity Measures for Binary Data
• Similarity measures between objects that contain only binary
attributes are called similarity coefficients, and have values
between 0 and 1.
• A value of 1 indicates that the two objects are completely similar,
while a value of 0 indicates that the objects are not at all similar.
• Let x and y be two objects that consist of n binary attributes.
• The comparison of two such objects, i.e., two binary vectors, leads
to the following four quantities:

• f00 = the number of attributes where x is 0 and y is 0

• f01 = the number of attributes where x is 0 and y is 1

• f10 = the number of attributes where x is 1 and y is 0

• f11 = the number of attributes where x is 1 and y is 1


Simple Matching Coefficient
• One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as
SMC = (f11 + f00) / (f01 + f10 + f11 + f00).
• This measure counts both presences and absences equally.
Jaccard Coefficient
• Suppose that x and y are data objects that represent two rows (two
transactions) of a transaction matrix.
• If each asymmetric binary attribute corresponds to an item in a
store, then a 1 indicates that the item was purchased, while a 0
indicates that the product was not purchased.
• Jaccard coefficient is frequently used to handle objects consisting
of asymmetric binary attributes.
• The Jaccard coefficient, which is often symbolized by J, is given by
J = f11 / (f01 + f10 + f11).
Example
• To illustrate the difference between these two similarity measures, we can calculate SMC and J for a pair of binary vectors, as in the sketch below.
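• A minimal sketch computing SMC and J for two illustrative binary vectors (the specific vectors from the slide are not reproduced here):

```python
# Sketch: SMC vs. Jaccard for two illustrative binary vectors.
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)   # counts 0-0 matches
jaccard = f11 / (f01 + f10 + f11)             # ignores 0-0 matches
print("SMC =", smc, " J =", jaccard)          # here: SMC = 0.7, J = 0.0
```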
Cosine Similarity
• Documents are often represented as vectors, where each attribute
represents the frequency with which a particular term (word) occurs
in the document.
• Documents have thousands or tens of thousands of attributes (terms), and each document vector is sparse since it has relatively few non-zero attributes.
• The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then
cos(x, y) = (x · y) / (||x|| ||y||),
where · indicates the vector dot product, x · y = Σ_{k=1}^{n} x_k y_k, and ||x|| is the length of the vector x, i.e., ||x|| = sqrt( Σ_{k=1}^{n} x_k^2 ).

Example
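• A minimal sketch of cosine similarity for two hypothetical term-frequency vectors (the vectors are illustrative):

```python
# Sketch: cosine similarity between two hypothetical document term-frequency vectors.
import numpy as np

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos, 3))   # about 0.31 for these vectors
```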
Extended Jaccard Coefficient (Tanimoto Coefficient)
• The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes.
Correlation (similarity)
• The correlation between two data objects that have binary or
continuous variables is a measure of the linear relationship
between the attributes of the objects.
• Pearson's correlation coefficient between two data objects, x and y, is defined by
corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y).
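• A minimal sketch of Pearson's correlation with NumPy (the two vectors are illustrative, chosen so that the relationship is perfectly negative):

```python
# Sketch: Pearson's correlation between two data objects (vectors).
import numpy as np

x = np.array([-3.0, 6.0, 0.0, 3.0, -6.0])
y = np.array([ 1.0, -2.0, 0.0, -1.0, 2.0])

corr = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(round(corr, 3))             # -1.0: a perfect negative linear relationship
print(np.corrcoef(x, y)[0, 1])    # same result via the built-in helper
```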
Issues in Proximity Calculation

1) how to handle the case in which attributes are correlated,
2) how to calculate proximity between objects that are composed of different types of attributes, and
3) how to handle proximity calculation when attributes have different weights.
Standardization and Correlation for Distance Measures
• A first issue is how to handle the situation in which attributes do not have the same range of values.
• Euclidean distance was used to measure the distance between
people based on two attributes: age and income.
• Unless these two attributes are standardized, the distance
between two people will be dominated by income.
• A related issue is how to compute distance when there is
correlation between some of the attributes.
• A generalization of Euclidean distance, the Mahalanobis
distance, is useful when attributes are correlated.
• The Mahalanobis distance between two objects (vectors) x and y is defined as
mahalanobis(x, y) = (x − y) Σ^(−1) (x − y)^T,
where Σ^(−1) is the inverse of the covariance matrix of the data.


• The covariance matrix Σ is the matrix whose ijth entry is the covariance of the ith and jth attributes.
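• A minimal sketch of the Mahalanobis distance on synthetic correlated data (the data, points, and covariance values are illustrative):

```python
# Sketch: Mahalanobis distance between two points, given correlated data.
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.multivariate_normal(mean=[0, 0],
                               cov=[[3.0, 2.0], [2.0, 3.0]], size=500)

cov_inv = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance matrix

def mahalanobis(x, y, cov_inv):
    # (x - y) Sigma^{-1} (x - y)^T, as in the definition above
    # (some libraries, e.g. SciPy, return the square root of this quantity)
    diff = np.asarray(x) - np.asarray(y)
    return float(diff @ cov_inv @ diff)

x, y = [1.0, 1.0], [-1.0, -1.0]
print("Euclidean  :", float(np.linalg.norm(np.subtract(x, y))))
print("Mahalanobis:", mahalanobis(x, y, cov_inv))
```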
Combining Similarities for Heterogeneous Attributes
• One approach is to compute the similarity between each attribute separately, and then combine these similarities in a way that results in a similarity between 0 and 1.
• The overall similarity is defined as the average of all the
individual attribute similarities.
• Unfortunately, this approach does not work well if some of the
attributes are asymmetric attributes.
Using Weights
• So far, all attributes have been treated equally when computing proximity.
• This is not desirable when some attributes are more important
than others.
Selecting the Right Proximity Measure
• First, the type of proximity measure should fit the type of data. For
many types of dense, continuous data, distance measures such as
Euclidean distance are often used.
• For sparse data, which consists of asymmetric attributes, we
typically use similarity measures that ignore 0–0 matches.
• The cosine, Jaccard, and extended Jaccard measures are appropriate
for such data.
• For example, suppose we are interested in comparing time series. If the magnitude of the time series is important (for example, each time series represents the total sales of the same organization for a different year), then we could use Euclidean distance.


• If the time series represent different quantities (for example, blood pressure and oxygen consumption), then we usually want to determine whether the time series have the same shape.
• In this case, correlation would be more appropriate.
