R21 Unit 2
or association analysis.
these categories.
• In the first step, the values of the continuous attribute are sorted
and then divided into n intervals by specifying n−1 split points.
• In the second step, all the values in one interval are mapped to the
same categorical value.
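The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `discretize`, the sample ages, and the category labels are assumptions made for the example, and a value equal to a split point is assigned to the upper interval by convention.

```python
from bisect import bisect_right

def discretize(values, split_points):
    """Map each continuous value to the index of the interval it falls in.

    n-1 split points define n intervals, indexed 0 .. n-1.
    A value equal to a split point goes to the upper interval (assumed convention).
    """
    cuts = sorted(split_points)
    # bisect_right counts how many split points are <= v,
    # which is exactly the index of the interval containing v.
    return [bisect_right(cuts, v) for v in values]

# Hypothetical data: ages mapped to three categories using two split points.
ages = [3, 17, 25, 42, 68]
labels = ["child", "adult", "senior"]
codes = discretize(ages, [18, 65])
print([labels[c] for c in codes])
# ['child', 'child', 'adult', 'adult', 'senior']
```

Every value in the same interval receives the same code, which is the mapping to a categorical value described in the second step.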
Unsupervised Discretization
• A basic distinction between discretization methods for classification
is whether class information is used (supervised) or not
(unsupervised).
• The equal width approach divides the range of the attribute into a
user-specified number of intervals, each having the same width.
Such an approach can be badly affected by outliers: a single extreme
value stretches the range, so most objects fall into a few intervals.
• The equal frequency (equal depth) approach tries to put the same
number of objects into each interval.
• As another example of unsupervised discretization, a clustering
method, such as K-means, can be used.
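The difference between the equal width and equal frequency approaches is easiest to see on data with an outlier. The sketch below computes the n−1 split points for each approach; the function names and the sample data are assumptions for illustration, and a value equal to a split point is taken to belong to the upper interval.

```python
def equal_width_splits(values, n):
    """Return n-1 split points that cut [min, max] into n intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [lo + i * width for i in range(1, n)]

def equal_frequency_splits(values, n):
    """Return n-1 split points so that each interval holds about len(values)/n objects."""
    ordered = sorted(values)
    m = len(ordered)
    # The i-th split point is the first value of the (i+1)-th group.
    return [ordered[(i * m) // n] for i in range(1, n)]

# Hypothetical data with one outlier (100).
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_splits(data, 3))      # [34.0, 67.0]
print(equal_frequency_splits(data, 3))  # [4, 7]
```

With equal width, eight of the nine values land in the first interval and the middle interval is empty, which is the outlier effect noted above. With equal frequency, each of the three intervals receives three values.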
Different discretization techniques
Supervised Discretization
• A conceptually simple approach is to place the splits in a way
that maximizes the purity of the intervals.
• Entropy based approaches are used in discretization.
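• Let k be the number of different class labels, m_i be the number of
values in the ith interval of a partition, and m_ij be the number of
values of class j in interval i. Then the entropy e_i of the ith
interval is given by
  $e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}$,
where $p_{ij} = m_{ij}/m_i$ is the probability (fraction of values) of
class j in the ith interval.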
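• The total entropy, e, of the partition is the weighted average of the
individual interval entropies,
  $e = \sum_{i=1}^{n} w_i e_i$,
where m is the number of values, $w_i = m_i/m$ is the fraction of
values in the ith interval, and n is the number of intervals.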
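As a rough illustration of the entropy calculation, the sketch below computes the entropy of each interval from its class labels and then the total entropy as the average of the interval entropies, weighted by the fraction of values in each interval. The function names and the toy class labels are assumptions made for this example.

```python
from collections import Counter
from math import log2

def interval_entropy(labels):
    """Entropy of one interval: -sum_j p_ij * log2(p_ij), with p_ij = m_ij / m_i."""
    m_i = len(labels)
    return -sum((m_ij / m_i) * log2(m_ij / m_i)
                for m_ij in Counter(labels).values())

def total_entropy(intervals):
    """Weighted average of interval entropies, with weights w_i = m_i / m."""
    m = sum(len(iv) for iv in intervals)
    # Empty intervals carry zero weight and are skipped.
    return sum(len(iv) / m * interval_entropy(iv) for iv in intervals if iv)

# Hypothetical class labels of the values falling in each interval.
pure = [["a", "a", "a"], ["b", "b", "b"]]    # each interval holds one class
mixed = [["a", "b", "a"], ["b", "a", "b"]]   # classes are mixed in both intervals

print(total_entropy(pure))
print(total_entropy(mixed))
```

The pure partition has a total entropy of zero, while the mixed one is close to 0.918, so a supervised method that minimizes entropy would prefer split points that produce the first partition.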
$s = e^{-d}$, or
Similarity and Dissimilarity between Simple Attributes
• Consider objects described by one nominal attribute. Nominal values
convey only whether two objects are the same or different.
• Hence, in this case similarity is traditionally defined as 1 if the
attribute values match, and as 0 otherwise.
• A dissimilarity would be defined in the opposite way: 0 if the
attribute values match, and 1 if they do not.
• x and y are two objects that have one attribute.
• Also, d(x, y) and s(x, y) are the dissimilarity and similarity
between x and y.
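For a single nominal attribute, the definitions of d(x, y) and s(x, y) reduce to an equality test. A minimal sketch follows; the function names and the color values are illustrative assumptions.

```python
def nominal_dissimilarity(x, y):
    """d(x, y): 0 if the attribute values match, 1 if they do not."""
    return 0 if x == y else 1

def nominal_similarity(x, y):
    """s(x, y): 1 if the attribute values match, 0 otherwise (s = 1 - d)."""
    return 1 - nominal_dissimilarity(x, y)

# Hypothetical nominal attribute: eye color.
print(nominal_similarity("blue", "blue"))      # 1
print(nominal_dissimilarity("blue", "brown"))  # 1
```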
Similarity and Dissimilarity for simple attributes
Dissimilarities (Distances) between Data Objects
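• The Euclidean distance, d, between two points, x and y, in
one-, two-, three-, or higher-dimensional space, is given by
  $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$,
where n is the number of dimensions and x_k and y_k are the kth
attributes (components) of x and y.
• The Euclidean distance is generalized by the Minkowski distance,
  $d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}$,
where r is a parameter.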
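• The three most common examples of Minkowski distances are r = 1
(city block or L1 distance), r = 2 (Euclidean or L2 distance), and
r = ∞ (supremum or L∞ distance, the maximum difference between any
attribute of the two objects).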
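A short sketch of the Minkowski distance for the cases r = 1, r = 2, and the limit r → ∞ is given below. The function name `minkowski` and the two sample points are assumptions for illustration; the r = ∞ case is handled separately as the maximum absolute difference over all attributes.

```python
from math import inf

def minkowski(x, y, r):
    """Minkowski distance (sum_k |x_k - y_k|^r)^(1/r) between two points."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == inf:
        # Limit r -> infinity: the largest difference over all attributes.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

# Hypothetical points in two-dimensional space.
x, y = (0, 2), (3, 6)
print(minkowski(x, y, 1))    # 7.0
print(minkowski(x, y, 2))    # 5.0
print(minkowski(x, y, inf))  # 4
```

The attribute differences are 3 and 4, so the three distances are their sum, the length of the hypotenuse, and the larger of the two.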
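• Distances, such as the Euclidean distance, have the following
properties:
1. Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y.
2. Symmetry: d(x, y) = d(y, x) for all x and y.
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.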
L1 distance matrix
Similarities between Data Objects
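• For similarities, the triangle inequality typically does not hold,
but symmetry and positivity typically do.
• If s(x, y) is the similarity between points x and y, then the typical
properties of similarities are the following: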
1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
Examples of Proximity Measures
Similarity Measures for Binary Data
• Similarity measures between objects that contain only binary
attributes are called similarity coefficients, and have values
between 0 and 1.
• A value of 1 indicates that the two objects are completely similar,
while a value of 0 indicates that the objects are not at all similar.
• Let x and y be two objects that consist of n binary attributes.
• The comparison of two such objects, i.e., two binary vectors, leads
to the following four quantities: