Unit IV
• Classification:
– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection
Classification—A Two-Step Process
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
The CPT (conditional probability table) of a node gives the conditional probability of each of its values for every possible combination of values of its parents
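As a minimal numeric illustration of the Bayes theorem above (all probabilities below are made-up values):

# Bayes theorem: P(C|X) = P(X|C) * P(C) / P(X), with made-up numbers.
p_x_given_c = 0.8   # likelihood of observing attribute values X given class C
p_c = 0.1           # prior probability of class C
p_x = 0.2           # evidence: overall probability of observing X

p_c_given_x = p_x_given_c * p_c / p_x
print(p_c_given_x)  # 0.4: posterior probability of class C given X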
Discarding one or more subtrees and replacing them with leaves simplifies a decision tree, and
that is the main task in decision-tree pruning. In replacing a subtree with a leaf, the algorithm
expects to lower the predicted error rate and increase the quality of the classification model. But
computation of the error rate is not simple. An error rate based only on the training data set does
not provide a suitable estimate. One possibility for estimating the predicted error rate is to use a
new, additional set of test samples, if they are available, or to use cross-validation. This
technique divides the initially available samples into equal-sized blocks and, for each block, the
tree is constructed from all samples except that block and then tested on it. With
the available training and testing samples, the basic idea of decision-tree pruning is to remove
parts of the tree (subtrees) that do not contribute to the classification accuracy on unseen testing
samples, producing a less complex and thus more comprehensible tree. There are two ways in
which the recursive-partitioning method can be modified:
1. Deciding not to divide a set of samples any further under some conditions. The stopping
criterion is usually based on some statistical tests, such as the χ2 test: If there are no
significant differences in classification accuracy before and after division, then represent
the current node as a leaf. The decision is made in advance, before splitting, and therefore
this approach is called prepruning.
2. Retrospectively removing some of the tree structure using selected accuracy criteria. The
decision in this process of postpruning is made after the tree has been built.
C4.5 follows the postpruning approach, but it uses a specific technique to estimate the predicted
error rate. This method is called pessimistic pruning. For every node in a tree, the upper
confidence limit Ucf is estimated using statistical tables for the binomial distribution
(given in most textbooks on statistics). The parameter Ucf is a function of |Ti| (the number of
cases covered by node Ti) and E (the number of misclassified cases) for a given node.
C4.5 uses the default confidence level of 25% and compares U25%(|Ti|, E) for a given node Ti
with the weighted confidence limits of its leaves, where the weights are the numbers of cases
covered by the leaves. If the predicted error for the root node of a subtree is less than the
weighted sum of U25% for the leaves (the predicted error for the subtree), then the subtree is
replaced with its root node, which becomes a new leaf in the pruned tree.
Let us illustrate this procedure with one simple example. A subtree of a decision tree is
given in the figure, where the root node is the test x1 on three possible values {1, 2, 3} of the
attribute A. The children of the root node are leaves denoted with their corresponding classes and
(|Ti|, E) parameters. The question is whether the subtree can be pruned and
replaced with its root node as a new, generalized leaf node.
To analyze the possibility of replacing the subtree with a leaf node it is necessary to
compute a predicted error PE for the initial tree and for a replaced node. Using default
confidence of 25%, the upper confidence limits for all nodes are collected from statistical tables:
U25% (6, 0) = 0.206, U25%(9, 0) = 0.143, U25%(1, 0) = 0.750, and U25%(16, 1) = 0.157. Using these
values, the predicted errors for the initial tree and the replaced node are
PE(subtree) = 6 · 0.206 + 9 · 0.143 + 1 · 0.750 = 3.273
PE(node) = 16 · 0.157 = 2.512
Since the existing subtree has a higher value of predicted error than the replaced node, it
is recommended that the decision tree be pruned and the subtree replaced with the new leaf node.
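The arithmetic above can be checked with a short sketch. Here the upper confidence limit Ucf(|Ti|, E) is approximated with the Clopper-Pearson binomial upper bound from SciPy, which closely reproduces the tabulated values quoted in the text; the helper name u_cf is made up for illustration:

from scipy.stats import beta

def u_cf(n, e, cf=0.25):
    # Upper confidence limit for the error rate of a node covering n cases
    # with e misclassifications, at confidence level cf.
    return 1.0 if e >= n else beta.ppf(1 - cf, e + 1, n - e)

leaves = [(6, 0), (9, 0), (1, 0)]                      # (|Ti|, E) for the three leaves
pe_subtree = sum(n * u_cf(n, e) for n, e in leaves)    # ~3.27, close to 3.273 above
pe_leaf = 16 * u_cf(16, 1)                             # ~2.5, close to 2.512 above
print(pe_subtree, pe_leaf, pe_subtree > pe_leaf)       # True, so prune the subtree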
◼ Weakness
◼ Long training time
◼ Require a number of parameters typically best determined empirically, e.g., the
network topology or "structure"
◼ Poor interpretability: Difficult to interpret the symbolic meaning behind the
learned weights and of "hidden units" in the network
◼ Strength
◼ High tolerance to noisy data
◼ Ability to classify untrained patterns
◼ Well-suited for continuous-valued inputs and outputs
◼ Successful on a wide array of real-world data
◼ Algorithms are inherently parallel
◼ Techniques have recently been developed for the extraction of rules from trained
neural networks
A Neuron (= a perceptron)
◼ The n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping
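A minimal sketch of this mapping for a single neuron, with a sigmoid as the assumed nonlinear function and made-up weights:

import numpy as np

x = np.array([1.0, 0.5, -1.5])      # n-dimensional input vector
w = np.array([0.4, -0.2, 0.1])      # weight vector (made-up values)
b = 0.3                             # bias

net = np.dot(w, x) + b              # scalar product plus bias
y = 1.0 / (1.0 + np.exp(-net))      # nonlinear (sigmoid) mapping to output y
print(net, y)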
◼ Iteratively process a set of training tuples & compare the network's prediction with the
actual known target value
◼ For each training tuple, the weights are modified to minimize the mean squared error
between the network's prediction and the actual target value
◼ Modifications are made in the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence “backpropagation”
◼ Steps
◼ Initialize weights (to small random #s) and biases in the network
◼ Propagate the inputs forward (by applying activation function)
◼ Backpropagate the error (by updating weights and biases)
◼ Terminating condition (when error is very small, etc.)
◼ Efficiency of backpropagation: Each epoch (one iteration through the training set)
takes O(|D| * w) time, with |D| tuples and w weights, but the number of epochs can be
exponential in n, the number of inputs, in the worst case
◼ Sensitivity analysis: assess the impact that a given input variable has on a network output.
The knowledge gained from this analysis can be represented in rules
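A minimal sketch of the training loop described above on a toy XOR problem; the layer sizes, learning rate, and epoch count are assumed illustrative values rather than prescribed ones:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # training tuples
t = np.array([[0], [1], [1], [0]], dtype=float)               # known target values

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)   # input layer -> hidden layer
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)   # hidden layer -> output layer
lr = 1.0                                           # learning rate

for epoch in range(10000):
    # propagate the inputs forward
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backpropagate the error (gradients of the mean squared error)
    dy = (y - t) * y * (1 - y)
    dh = (dy @ W2.T) * h * (1 - h)
    # modify weights and biases in the "backwards" direction
    W2 -= lr * h.T @ dy;  b2 -= lr * dy.sum(axis=0)
    W1 -= lr * X.T @ dh;  b1 -= lr * dh.sum(axis=0)

print(np.round(y, 2))   # should move toward [0, 1, 1, 0]; exact values depend on the random start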
SVM—Linearly Separable
◼ A separating hyperplane can be written as
W · X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
◼ For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
◼ The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
◼ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
◼ This becomes a constrained (convex) quadratic optimization problem: Quadratic
objective function and linear constraints → Quadratic Programming (QP) → Lagrangian
multipliers
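A minimal sketch of the linearly separable case using scikit-learn's SVC with a linear kernel; the toy data and the large C value (which approximates a hard margin) are assumptions for illustration:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1., 1.], [2., 1.], [1., 2.], [4., 4.], [5., 4.], [4., 5.]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.coef_, clf.intercept_)   # W and b of the separating hyperplane W . X + b = 0
print(clf.support_vectors_)        # the training tuples lying on H1 / H2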
Associative Classification
◼ Associative classification
◼ Association rules are generated and analyzed for use in classification
◼ Search for strong associations between frequent patterns (conjunctions of
attribute-value pairs) and class labels
◼ Classification: Based on evaluating a set of rules in the form of
p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (conf, sup)
◼ Why effective?
◼ It explores highly confident associations among multiple attributes and may
overcome some constraints introduced by decision-tree induction, which
considers only one attribute at a time
In many studies, associative classification has been found to be more accurate than some
traditional classification methods, such as C4.5.
Associative Classification May Achieve High Accuracy and Efficiency (Cong et al.,
SIGMOD'05)
Fuzzy Set Approaches
◼ Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of
membership (such as using a fuzzy membership graph)
◼ Attribute values are converted to fuzzy values
◼ e.g., income is mapped into the discrete categories {low, medium, high} with
fuzzy values calculated
◼ For a given new sample, more than one fuzzy value may apply
◼ Each applicable rule contributes a vote for membership in the categories
◼ Typically, the truth values for each predicted category are summed, and these sums are
combined
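A minimal sketch of fuzzy membership for the income example; the piecewise-linear membership functions and all cut-off values are made up for illustration:

def membership(income):
    # Degrees of membership in {low, medium, high}, each in [0.0, 1.0].
    low    = max(0.0, min(1.0, (30000 - income) / 10000))      # 1 below 20k, 0 above 30k
    high   = max(0.0, min(1.0, (income - 40000) / 10000))      # 0 below 40k, 1 above 50k
    medium = max(0.0, min((income - 20000) / 10000,
                          (50000 - income) / 10000, 1.0))
    return {"low": low, "medium": medium, "high": high}

print(membership(42000))   # {'low': 0.0, 'medium': 0.8, 'high': 0.2}: more than one fuzzy value applies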
CLUSTER ANALYSIS
What is Cluster Analysis?
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
• Marketing: Help marketers discover distinct groups in their customer bases, and then use
this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent
faults
Interval-valued variables
• Standardize data
– Calculate the mean absolute deviation:
sf = (1/n)(|x1f − mf| + |x2f − mf| + ... + |xnf − mf|)
where mf = (1/n)(x1f + x2f + ... + xnf)
– Calculate the standardized measurement (z-score):
zif = (xif − mf) / sf
• Using mean absolute deviation is more robust than using standard deviation
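A minimal sketch of this standardization for one variable f, with made-up values:

import numpy as np

x = np.array([12.0, 15.0, 18.0, 45.0])   # observed values of variable f
m_f = x.mean()                           # mean
s_f = np.abs(x - m_f).mean()             # mean absolute deviation (not the standard deviation)
z = (x - m_f) / s_f                      # standardized measurements z_if
print(m_f, s_f, z)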
• Distances are normally used to measure the similarity or dissimilarity between two data
objects
• Some popular ones include the Minkowski distance:
d(i, j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + ... + |xip − xjp|^q)^(1/q)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q
is a positive integer
• If q = 1, d is the Manhattan distance:
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
• If q = 2, d is the Euclidean distance:
d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2)
– Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• Also one can use weighted distance, parametric Pearson product moment correlation, or
other dissimilarity measures.
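A minimal sketch of the three distances for two made-up 3-dimensional objects:

import numpy as np

i = np.array([1.0, 7.0, 3.0])
j = np.array([4.0, 5.0, 0.0])

def minkowski(i, j, q):
    # Minkowski distance of order q between two p-dimensional objects.
    return np.sum(np.abs(i - j) ** q) ** (1.0 / q)

print(minkowski(i, j, 1))   # Manhattan: |1-4| + |7-5| + |3-0| = 8
print(minkowski(i, j, 2))   # Euclidean: sqrt(9 + 4 + 9) ~ 4.69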
Binary Variables
• A contingency table for binary data:

                  Object j
                   1       0       sum
    Object i   1   a       b       a + b
               0   c       d       c + d
              sum  a + c   b + d   p

Nominal Variables
• Method 1: Simple matching
  d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of variables
• Method 2: use a large number of binary variables
Ordinal Variables
Ratio-Scaled Variables
Variables of Mixed Types
• A weighted formula combines the contributions of all p variables:
d(i, j) = (Σf=1..p δij(f) dij(f)) / (Σf=1..p δij(f))
– f is binary or nominal:
dij(f) = 0 if xif = xjf, and dij(f) = 1 otherwise
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled
• compute ranks rif and zif = (rif − 1) / (Mf − 1)
• and treat zif as interval-scaled
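A minimal sketch of the mixed-type formula above for two made-up objects with one nominal, one interval-scaled, and one ordinal attribute (all names and values are hypothetical):

def d_mixed(i, j, types, ranges):
    # Weighted combination d(i,j) = sum(delta_f * d_f) / sum(delta_f).
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        delta = 1.0                                  # both values are present here
        if t == "nominal":
            dij = 0.0 if i[f] == j[f] else 1.0       # simple matching
        elif t == "interval":
            dij = abs(i[f] - j[f]) / ranges[f]       # normalized distance
        else:                                        # ordinal: map ranks to [0, 1]
            Mf = ranges[f]
            zi, zj = (i[f] - 1) / (Mf - 1), (j[f] - 1) / (Mf - 1)
            dij = abs(zi - zj)
        num += delta * dij
        den += delta
    return num / den

obj_i = ("red", 30.0, 2)    # nominal colour, interval-scaled age, ordinal rank in {1, 2, 3}
obj_j = ("blue", 45.0, 3)
print(d_mixed(obj_i, obj_j, ["nominal", "interval", "ordinal"], [None, 60.0, 3]))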
Types of Clustering
HIERARCHICAL CLUSTERING
[Figure: the same points p1–p4 shown as a partitional clustering (Original Points) and as two alternative hierarchical clusterings, Dendrogram 1 and Dendrogram 2]
Hierarchical clustering decomposes the data objects into several levels of nested partitionings (a
tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected component then forms a cluster.
AGNES (Agglomerative Nesting)
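AGNES merges clusters bottom-up until all objects form one cluster, producing the dendrogram described above. A minimal sketch with SciPy's hierarchical-clustering routines; the toy points, the single linkage, and the cut height are assumed choices:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1., 1.], [1.2, 1.1], [5., 5.], [5.1, 4.9], [9., 1.]])
Z = linkage(X, method="single")                     # the merge tree behind the dendrogram
labels = fcluster(Z, t=2.0, criterion="distance")   # cut the dendrogram at height 2.0
print(labels)                                       # three clusters: the first pair, the second pair, the last point
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib.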
Birch: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan &
Livny, SIGMOD’96)
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for
multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data
that tries to preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a
few additional scans
Weakness: handles only numeric data, and is sensitive to the order of the data records.
CF-Tree in BIRCH
Clustering feature:
CF = (N, LS, SS): a summary of the statistics for a given subcluster (the 0th, 1st, and 2nd
moments of the subcluster from the statistical point of view), where N is the number of data
points, LS = Σi=1..N Xi is their linear sum, and SS = Σi=1..N Xi² is their square sum.
Registers crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
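A minimal sketch using scikit-learn's Birch implementation; the threshold and branching factor are assumed illustrative values. Phase 1 builds the CF tree incrementally, and the final n_clusters step plays the role of the global clustering phase:

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),      # two well-separated blobs
               rng.normal(5, 0.5, (100, 2))])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)                     # CF-tree construction, then global clustering
print(labels[:5], labels[-5:])                    # points from the two blobs get different labels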
K-means Example 1
Solve the following with K = 2: {2, 25, 10, 15, 5, 20, 4, 40}
Step 1: Break the set into 2 clusters. Randomly assign data into two clusters and
calculate the mean value:
K1: {2,10,5,4} Mean = 5.25
K2: {25,15,20,40} Mean = 25
Step 2: Reassign each value to the cluster whose mean is closer (5.25 for K1, 25 for K2).
Here 10 and 15 are both closer to 5.25 than to 25, so they move to K1:
K1: {2, 10, 15, 5, 4} Mean = 7.2
K2: {25, 20, 40} Mean = 28.33
Step 3: Reassign with the new means (7.2 and 28.33):
K1: {2, 10, 15, 5, 4} Mean = 7.2
K2: {25, 20, 40} Mean = 28.33
Both clusters are the same as in Step 2, so the final answer is
K1: {2, 4, 5, 10, 15} K2: {20, 25, 40}
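A minimal sketch that reproduces the worked example above (same data, K = 2, and the same initial assignment; the 10-iteration cap is just a safeguard):

import numpy as np

data = np.array([2., 25., 10., 15., 5., 20., 4., 40.])
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])            # initial split {2,10,5,4} / {25,15,20,40}

for _ in range(10):
    means = np.array([data[labels == k].mean() for k in (0, 1)])
    new_labels = np.abs(data[:, None] - means[None, :]).argmin(axis=1)   # nearest mean
    if np.array_equal(new_labels, labels):
        break                                           # assignments stable: converged
    labels = new_labels

print(means, [sorted(data[labels == k]) for k in (0, 1)])   # means 7.2 and ~28.33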
K-means Example 2
• K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference
point, a medoid can be used, which is the most centrally located object in a cluster.
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of the medoids by
one of the non-medoids if it improves the total distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale well for large data
sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): Randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)
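A rough PAM-style sketch (a hypothetical helper, not a library API): it starts from random medoids and greedily swaps a medoid with a non-medoid whenever the swap lowers the total distance of the resulting clustering. The toy data are made up:

import numpy as np
from scipy.spatial.distance import cdist

def pam(X, k, max_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    D = cdist(X, X)                                    # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)

    def cost(meds):
        return D[:, meds].min(axis=1).sum()            # total distance to nearest medoid

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for cand in range(len(X)):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand                        # try swapping medoid i with a non-medoid
                if cost(trial) < best:
                    medoids, best, improved = trial, cost(trial), True
        if not improved:
            break
    return medoids, D[:, medoids].argmin(axis=1)       # medoid indices and cluster labels

X = np.array([[1., 1.], [1.5, 2.], [3., 4.], [5., 7.], [3.5, 5.], [4.5, 5.], [3.5, 4.5]])
medoids, labels = pam(X, k=2)
print(X[medoids], labels)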
STING: A Statistical Information Grid Approach (Wang, Yang and Muntz, VLDB'97)
– Each cell at a high level is partitioned into a number of smaller cells in the next lower level
– Statistical info of each cell is calculated and stored beforehand and is used to answer queries
– Parameters of higher level cells can be easily calculated from parameters of lower level cells
• count, mean, standard deviation (s), min, max
• type of distribution—normal, uniform, etc.
– Use a top-down approach to answer spatial data queries
– Start from a pre-selected layer—typically with a small number of cells
– For each cell in the current level compute the confidence interval
Comments on STING
• Remove the irrelevant cells from further consideration
• When finished examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
– Query-independent, easy to parallelize, incremental update
– O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
– All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected
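A minimal sketch of the grid idea behind STING under assumed parameters (uniform points in the unit square, a 4 x 4 bottom layer and a 2 x 2 layer above it); only the count statistic is shown, and the parent-cell counts are derived directly from their children:

import numpy as np

rng = np.random.default_rng(1)
pts = rng.random((500, 2))                 # points in [0, 1) x [0, 1)

G = 4                                      # bottom layer: G x G cells
cell = (pts * G).astype(int)               # cell index of each point
count = np.zeros((G, G))
for cx, cy in cell:
    count[cx, cy] += 1

# Each parent cell summarizes a 2 x 2 block of children; counts simply add up,
# so higher-level parameters come from lower-level ones without rescanning the data.
parent_count = count.reshape(2, 2, 2, 2).sum(axis=(1, 3))
print(count, parent_count, sep="\n")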
WaveCluster: Clustering by Wavelet Analysis (1998)
• Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach which applies wavelet transform to the
feature space
• How to apply wavelet transform to find clusters
– Summarizes the data by imposing a multidimensional grid structure onto the
data space
– These multidimensional spatial data objects are represented in an n-
dimensional feature space
– Apply wavelet transform on the feature space to find the dense regions in the
feature space
– Apply the wavelet transform multiple times, which results in clusters at different
scales from fine to coarse
Wavelet Transform
• Wavelet transform: A signal processing technique that decomposes a signal into
different frequency sub-bands (can be applied to n-dimensional signals)
• Data are transformed to preserve relative distance between objects at different
levels of resolution
• Allows natural clusters to become more distinguishable
• Input parameters
– # of grid cells for each dimension
– the wavelet, and the # of applications of wavelet transform
• Why is wavelet transformation useful for clustering?
– Use hat-shaped filters to emphasize regions where points cluster, but simultaneously
suppress weaker information at their boundaries
– Effective removal of outliers, multi-resolution, cost effective
• Major features:
– Complexity O(N)
– Detect arbitrary shaped clusters at different scales
– Not sensitive to noise, not sensitive to input order
– Only applicable to low dimensional data
• Both grid-based and density-based
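A minimal sketch of the underlying idea rather than the WaveCluster algorithm itself: quantize 1-D data onto a grid and apply one unnormalized Haar wavelet step, so that the low-frequency band smooths the dense regions while the high-frequency band isolates boundary detail that can be suppressed. The data and grid size are made up:

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(2, 0.3, 200), rng.normal(8, 0.4, 300)])   # two dense regions
hist, _ = np.histogram(data, bins=16, range=(0, 10))
hist = hist.astype(float)                  # point counts per grid cell

low  = (hist[0::2] + hist[1::2]) / 2       # averages: coarse view where clusters stand out
high = (hist[0::2] - hist[1::2]) / 2       # details: boundary/noise information
print(low, high, sep="\n")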
Clustering High-Dimensional Data
• Clustering high-dimensional data
– Many applications: text documents, DNA micro-array data
– Major challenges:
• Many irrelevant dimensions may mask clusters
• Distance measure becomes meaningless—due to equi-distance
• Clusters may exist only in some subspaces
• Methods
– Feature transformation: only effective if most dimensions are relevant
• PCA & SVD useful only when features are highly correlated/redundant
– Feature selection: wrapper or filter approaches
• useful to find a subspace where the data have nice clusters
– Subspace-clustering: find clusters in all the possible subspaces
• CLIQUE, ProClus, and frequent pattern-based clustering
CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifying subspaces of a high dimensional data space that allow better
clustering than original space
• CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal-length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular units
– A unit is dense if the fraction of total data points contained in the unit exceeds the
input model parameter
– A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the
partition.
• Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
– Determine dense units in all subspaces of interest
– Determine connected dense units in all subspaces of interest.
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected dense units for each
cluster
– Determination of minimal cover for each cluster
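A minimal sketch of the first CLIQUE-style steps under assumed parameters (10 intervals per dimension and a 5% density threshold): count the points in each unit of every 1-D subspace, keep the dense units, and then build candidate 2-D dense units only from dense 1-D ones, following the Apriori principle:

import numpy as np
from itertools import combinations
from collections import Counter

rng = np.random.default_rng(0)
X = np.clip(np.vstack([rng.normal([0.2, 0.2, 0.5], 0.05, (200, 3)),
                       rng.normal([0.7, 0.7, 0.5], 0.05, (200, 3))]), 0, 0.999)
xi, tau = 10, 0.05 * len(X)                # intervals per dimension, density threshold

cells = (X * xi).astype(int)               # unit index of each point in each dimension
dense_1d = {d: {u for u, c in Counter(cells[:, d]).items() if c > tau}
            for d in range(X.shape[1])}

dense_2d = {}
for d1, d2 in combinations(range(X.shape[1]), 2):
    counts = Counter(map(tuple, cells[:, [d1, d2]]))
    dense_2d[(d1, d2)] = {u for u, c in counts.items()          # dense 2-D units whose
                          if c > tau and u[0] in dense_1d[d1]   # 1-D projections are
                          and u[1] in dense_1d[d2]}             # themselves dense
print(dense_1d, dense_2d, sep="\n")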
Strength and Weakness of CLIQUE
• Strength
– automatically finds subspaces of the highest dimensionality such that high density
clusters exist in those subspaces
– insensitive to the order of records in input and does not presume some canonical
data distribution
– scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weakness
– The accuracy of the clustering result may be degraded, which is the price paid for the
simplicity of the method
Clustering with User-Specified Constraints
• Example: Locating k delivery centers, each serving at least m valued customers and n
ordinary ones
• Proposed approach
– Find an initial “solution” by partitioning the data set into k groups and satisfying
user-constraints
– Iteratively refine the solution by micro-clustering relocation (e.g., moving δ μ-
clusters from cluster Ci to Cj) and “deadlock” handling (break the microclusters
when necessary)
– Efficiency is improved by micro-clustering
• How to handle more complicated constraints?
– E.g., having approximately the same number of valued customers in each cluster?
Can you solve it?
What Is Outlier Discovery?
• What are outliers?
– The set of objects that are considerably dissimilar from the remainder of the data
– Example: Sports: Michael Jordan, Wayne Gretzky, ...
• Problem: Define and find outliers in large data sets
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
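A minimal sketch of one common distance-based notion of outlier (points whose distance to their k-th nearest neighbour is unusually large); k = 3, the top-2 cut-off, and the two planted outliers are assumptions for illustration:

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # a normal-looking cloud of points
               [[8., 8.], [-7., 9.]]])         # two planted outliers

D = cdist(X, X)
kth = np.sort(D, axis=1)[:, 3]                 # distance to the 3rd nearest neighbour (column 0 is the point itself)
outliers = np.argsort(kth)[-2:]                # the two most isolated points
print(outliers, X[outliers])                   # the planted outliers (indices 50 and 51)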