
UNIT IV CLASSIFICATION AND CLUSTERING

• Classification:

– predicts categorical class labels


– classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
• Typical applications

– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection

Classification—A Two-Step Process

• Model construction: describing a set of predetermined classes


– Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or mathematical
formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified result from
the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set, otherwise over-fitting will occur
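
As a small illustration of the two steps, the sketch below builds a deliberately trivial majority-class model from a toy training set and then estimates its accuracy rate on an independent test set; the data and the "model" are made up purely for illustration.

# A minimal sketch of the two-step process: build a (deliberately trivial)
# model from a training set, then estimate its accuracy on an independent
# test set. The data and the majority-class "model" are illustrative only.
from collections import Counter

train = [({"student": "yes"}, "yes"), ({"student": "no"}, "no"),
         ({"student": "yes"}, "yes"), ({"student": "no"}, "yes")]
test  = [({"student": "yes"}, "yes"), ({"student": "no"}, "no")]

# Step 1 -- model construction: here the "model" is just the majority class.
majority_class = Counter(label for _, label in train).most_common(1)[0][0]
model = lambda tuple_: majority_class          # classifies every tuple the same way

# Step 2 -- model usage: compare known test labels with the model's output.
correct = sum(1 for x, label in test if model(x) == label)
accuracy = correct / len(test)                 # accuracy rate on the test set
print(f"accuracy = {accuracy:.2f}")            # -> accuracy = 0.50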

Process (1): Model Construction


Process (2): Using the Model in Prediction

Supervised vs. Unsupervised Learning


◼ Supervised learning (classification)
◼ Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
◼ The class labels of the training data are unknown
◼ Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data

Classification by Decision Tree Induction


Decision tree

– A flow-chart-like tree structure


– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Training Dataset

This follows an example from Quinlan’s ID3

age      income   student   credit_rating   buys_computer

<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)


– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
– There are no samples left
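
A compact sketch of this greedy induction is given below, assuming categorical attributes and using information gain as the selection measure; the toy tuples are illustrative, not the full training table.

# Sketch of top-down, recursive, divide-and-conquer tree induction with
# information gain on categorical attributes. Toy data for illustration only.
import math
from collections import Counter

def entropy(rows):
    counts = Counter(label for _, label in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr):
    total, remainder = len(rows), 0.0
    for value in {x[attr] for x, _ in rows}:
        subset = [(x, y) for x, y in rows if x[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

def build_tree(rows, attrs):
    labels = [y for _, y in rows]
    if len(set(labels)) == 1:                      # all samples in one class
        return labels[0]
    if not attrs:                                  # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a))
    node = {"attr": best, "branches": {}}
    for value in {x[best] for x, _ in rows}:       # partition on the chosen attribute
        subset = [(x, y) for x, y in rows if x[best] == value]
        node["branches"][value] = build_tree(subset, [a for a in attrs if a != best])
    return node

rows = [({"age": "<=30", "student": "no"}, "no"),
        ({"age": "<=30", "student": "yes"}, "yes"),
        ({"age": "31...40", "student": "no"}, "yes"),
        ({"age": ">40", "student": "yes"}, "yes")]
print(build_tree(rows, ["age", "student"]))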

Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules


• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
Example

IF age = “<=30” AND student = “no” THEN buys_computer = “no”

IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”

IF age = “31…40” THEN buys_computer = “yes”

IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”

IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
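
These rules can be generated mechanically by walking every root-to-leaf path and AND-ing the tests along the way. A small sketch, assuming the tree is stored as nested dictionaries (an illustrative layout, not a prescribed one):

# Sketch: turn a decision tree (nested dicts) into IF-THEN rules by
# collecting one conjunction per root-to-leaf path.
tree = {"attr": "age",
        "branches": {"<=30": {"attr": "student",
                              "branches": {"no": "no", "yes": "yes"}},
                     "31...40": "yes",
                     ">40": {"attr": "credit_rating",
                             "branches": {"excellent": "no", "fair": "yes"}}}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf: emit one rule
        ante = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {ante} THEN buys_computer = "{node}"']
    rules = []
    for value, child in node["branches"].items():  # one branch per attribute value
        rules += extract_rules(child, conditions + ((node["attr"], value),))
    return rules

for rule in extract_rules(tree):
    print(rule)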


Avoid Overfitting in Classification

• The generated tree may overfit the training data


– Too many branches, some may reflect anomalies due to noise or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a node if this would result
in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
• Use a set of data different from the training data to decide which is the
“best pruned tree”
Bayesian Classification:
• Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
• Incremental: Each training example can incrementally increase/decrease the probability
that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods are computationally intractable, they can provide
a standard of optimal decision making against which other methods can be measured

Estimating a-posteriori Probabilities

• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)

• P(X) is constant for all classes


• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) is infeasible!

Naïve Bayesian Classification

• Naïve assumption: attribute independence


P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)

• If i-th attribute is categorical:


P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in
class C
• If i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
• Computationally easy in both cases
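
A minimal naïve Bayesian classifier for categorical attributes might look like the sketch below; the training tuples are toy data and no smoothing of zero counts is applied.

# Naive Bayes sketch: estimate P(C) and each P(xi | C) as relative
# frequencies, then pick the class maximising P(C) * prod P(xi | C).
from collections import Counter, defaultdict

train = [({"student": "yes", "credit": "fair"}, "yes"),
         ({"student": "no",  "credit": "fair"}, "no"),
         ({"student": "yes", "credit": "excellent"}, "yes"),
         ({"student": "no",  "credit": "excellent"}, "no")]

prior = Counter(label for _, label in train)          # class counts
cond = defaultdict(Counter)                           # (class, attribute) -> value counts
for x, label in train:
    for attr, value in x.items():
        cond[(label, attr)][value] += 1

def classify(x):
    best_class, best_score = None, -1.0
    for c, n_c in prior.items():
        score = n_c / len(train)                      # P(C)
        for attr, value in x.items():                 # * prod P(xi | C), attribute independence
            score *= cond[(c, attr)][value] / n_c
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify({"student": "yes", "credit": "fair"}))   # -> "yes"
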
Bayesian Belief Networks

◼ Bayesian belief network allows a subset of the variables conditionally independent


◼ A graphical model of causal relationships
◼ Represents dependency among the variables
◼ Gives a specification of joint probability distribution

Bayesian Belief Network: An Example

The conditional probability table (CPT) for variable LungCancer:

CPT shows the conditional probability for each possible combination of its parents

Derivation of the probability of a particular combination of values of X, from CPT:


P(x1, …, xn) = ∏ i=1..n P(xi | Parents(Xi))
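
The factored joint probability can be evaluated directly from the CPTs. The sketch below uses a made-up two-variable network (Smoker → LungCancer) whose numbers are illustrative only, not the values from the figure.

# Sketch: evaluate P(x1, ..., xn) = prod_i P(xi | Parents(Xi)) from CPTs.
# The two-variable network and its numbers are made up for illustration.
parents = {"Smoker": [], "LungCancer": ["Smoker"]}
cpt = {
    # P(Smoker)
    ("Smoker", ()): {"yes": 0.3, "no": 0.7},
    # P(LungCancer | Smoker)
    ("LungCancer", ("yes",)): {"yes": 0.10, "no": 0.90},
    ("LungCancer", ("no",)):  {"yes": 0.01, "no": 0.99},
}

def joint(assignment):
    """Probability of a complete assignment, e.g. {"Smoker": "yes", "LungCancer": "no"}."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= cpt[(var, parent_values)][value]      # P(xi | Parents(Xi))
    return p

print(joint({"Smoker": "yes", "LungCancer": "no"}))   # 0.3 * 0.90 = 0.27
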
Association-Based Classification

• Several methods for association-based classification


– ARCS: Quantitative association mining and clustering of association rules (Lent
et al’97)
• It beats C4.5 in (mainly) scalability and also accuracy
– Associative classification: (Liu et al’98)
• It mines high support and high confidence rules in the form of “cond_set
=> y”, where y is a class label
– CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
• Emerging patterns (EPs): the itemsets whose support increases
significantly from one class to another
• Mine EPs based on minimum support and growth rate
Pruning of decision trees

Discarding one or more subtrees and replacing them with leaves simplifies a decision tree, and
that is the main task in decision-tree pruning. In replacing a subtree with a leaf, the algorithm
expects to lower the predicted error rate and increase the quality of the classification model. But
computation of error rate is not simple. An error rate based only on a training data set does not
provide a suitable estimate. One possibility to estimate the predicted error rate is to use a new,
additional set of test samples if they are available, or to use the cross-validation techniques. This
technique divides initially available samples into equal sized blocks and, for each block, the tree
is constructed from all samples except this block and tested with a given block of samples. With
the available training and testing samples, the basic idea of decision tree-pruning is to remove
parts of the tree (subtrees) that do not contribute to the classification accuracy of unseen testing
samples, producing a less complex and thus more comprehensible tree. There are two ways in
which the recursive-partitioning method can be modified:

1. Deciding not to divide a set of samples any further under some conditions. The stopping
criterion is usually based on some statistical tests, such as the χ2 test: If there are no
significant differences in classification accuracy before and after division, then represent
a current node as a leaf. The decision is made in advance, before splitting, and therefore
this approach is called prepruning.
2. Retrospectively removing some of the tree structure using selected accuracy criteria. The
decision in this process of postpruning is made after the tree has been built.

C4.5 follows the postpruning approach, but it uses a specific technique to estimate the predicted
error rate. This method is called pessimistic pruning. For every node in a tree, the estimate of
the upper confidence limit Ucf is computed using statistical tables for the binomial distribution
(given in most textbooks on statistics). The parameter Ucf is a function of ∣Ti∣ and E for a given node.
C4.5 uses a default confidence level of 25%, and compares U25%(∣Ti∣, E) for a given node Ti
with a weighted confidence of its leaves. Weights are the total number of cases for every leaf. If
the predicted error for a root node in a subtree is less than weighted sum of U25% for the leaves
(predicted error for the subtree), then a subtree will be replaced with its root node, which
becomes a new leaf in a pruned tree.

Let us illustrate this procedure with one simple example. A subtree of a decision tree is
given in Figure, where the root node is the test x1 on three possible values {1, 2, 3} of the
attribute A. The children of the root node are leaves denoted with corresponding classes and
(∣Ti∣, E) parameters. The question is to estimate the possibility of pruning the subtree and
replacing it with its root node as a new, generalized leaf node.

To analyze the possibility of replacing the subtree with a leaf node it is necessary to
compute a predicted error PE for the initial tree and for a replaced node. Using default
confidence of 25%, the upper confidence limits for all nodes are collected from statistical tables:
U25% (6, 0) = 0.206, U25%(9, 0) = 0.143, U25%(1, 0) = 0.750, and U25%(16, 1) = 0.157. Using these
values, the predicted errors for the initial tree and the replaced node are

PEtree = 6 × 0.206 + 9 × 0.143 + 1 × 0.750 = 3.273

PEnode = 16 × 0.157 = 2.512

Since the existing subtree has a higher value of predicted error than the replaced node, it
is recommended that the decision tree be pruned and the subtree replaced with the new leaf node.
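
The pruning decision just described can be reproduced with a few lines; the U25% values below are the ones quoted above, which would normally be looked up in binomial-distribution tables.

# Sketch of the pruning test: compare the predicted error of the subtree
# (weighted sum of the leaves' upper confidence limits) with that of the
# candidate leaf, and prune when the leaf's estimate is not worse.
leaves = [(6, 0.206), (9, 0.143), (1, 0.750)]   # (cases |Ti|, U25%(|Ti|, E)) per leaf
root = (16, 0.157)                              # the whole subtree collapsed to one leaf

pe_tree = sum(n * u for n, u in leaves)         # 6*0.206 + 9*0.143 + 1*0.750 = 3.273
pe_node = root[0] * root[1]                     # 16*0.157 = 2.512

if pe_node <= pe_tree:
    print(f"prune: PE_node = {pe_node:.3f} <= PE_tree = {pe_tree:.3f}")
else:
    print("keep the subtree")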

Rule Based Classification


Using IF-THEN Rules for Classification

◼ Represent the knowledge in the form of IF-THEN rules

R: IF age = youth AND student = yes THEN buys_computer = yes

◼ Rule antecedent/precondition vs. rule consequent


◼ Assessment of a rule: coverage and accuracy
◼ ncovers = # of tuples covered by R
◼ ncorrect = # of tuples correctly classified by R

coverage(R) = ncovers /|D| /* D: training data set */

accuracy(R) = ncorrect / ncovers
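
A short sketch of these two measures on a toy training set D (the rule and the tuples are illustrative):

# Sketch: compute coverage(R) and accuracy(R) for one IF-THEN rule over a
# toy training set D. The rule mirrors rule R from the text.
D = [({"age": "youth", "student": "yes"}, "yes"),
     ({"age": "youth", "student": "yes"}, "no"),
     ({"age": "youth", "student": "no"},  "no"),
     ({"age": "senior", "student": "yes"}, "yes")]

def rule(x):                                         # antecedent of R
    return x["age"] == "youth" and x["student"] == "yes"

covered = [(x, y) for x, y in D if rule(x)]
n_covers = len(covered)                              # tuples covered by R
n_correct = sum(1 for _, y in covered if y == "yes") # covered and labelled with R's class

print("coverage:", n_covers / len(D))                # 2/4 = 0.5
print("accuracy:", n_correct / n_covers)             # 1/2 = 0.5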

◼ If more than one rule is triggered, need conflict resolution


◼ Size ordering: assign the highest priority to the triggering rule that has the
“toughest” requirement (i.e., with the most attribute tests)
◼ Class-based ordering: decreasing order of prevalence or misclassification cost per
class
◼ Rule-based ordering (decision list): rules are organized into one long priority list,
according to some measure of rule quality or by experts

Rule Extraction from a Decision Tree

◼ Rules are easier to understand than large trees


◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction: the leaf holds the class
prediction
◼ Rules are mutually exclusive and exhaustive

◼ Example: Rule extraction from our buys_computer decision-tree

IF age = young AND student = no THEN buys_computer = no


IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes

Rule Extraction from the Training Data

◼ Sequential covering algorithm: Extracts rules directly from training data


◼ Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
◼ Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but
none (or few) of the tuples of other classes
◼ Steps:
◼ Rules are learned one at a time
◼ Each time a rule is learned, the tuples covered by the rule are removed
◼ The process repeats on the remaining tuples until a termination condition holds, e.g.,
when there are no more training examples or when the quality of a rule returned is below a
user-specified threshold
◼ Comparison with decision-tree induction: decision-tree induction learns a set of rules simultaneously
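
A bare-bones sketch of the sequential covering loop is shown below; it learns only single-condition rules for brevity, whereas FOIL, CN2, or RIPPER grow multi-condition rules and use more refined quality measures.

# Sequential covering sketch: repeatedly pick the best single attribute-value
# test for the target class, emit it as a rule, and remove the covered tuples.
def learn_one_rule(rows, target):
    # choose the (attribute, value) test with the highest accuracy on `target`
    tests = {(a, x[a]) for x, _ in rows for a in x}
    def acc(test):
        covered = [y for x, y in rows if x[test[0]] == test[1]]
        return sum(y == target for y in covered) / len(covered)
    return max(tests, key=acc)

def sequential_covering(rows, target):
    rules, remaining = [], list(rows)
    while any(y == target for _, y in remaining):        # tuples of the class remain
        attr, value = learn_one_rule(remaining, target)
        rules.append((attr, value, target))
        remaining = [(x, y) for x, y in remaining if x[attr] != value]  # remove covered
    return rules

rows = [({"outlook": "sunny"}, "play"), ({"outlook": "rain"}, "stay"),
        ({"outlook": "overcast"}, "play"), ({"outlook": "sunny"}, "play")]
print(sequential_covering(rows, "play"))
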
Classification by Backpropagation
◼ Backpropagation: A neural network learning algorithm
◼ Started by psychologists and neurobiologists to develop and test computational analogues
of neurons
◼ A neural network: A set of connected input/output units where each connection has a
weight associated with it
◼ During the learning phase, the network learns by adjusting the weights so as to be able
to predict the correct class label of the input tuples
◼ Also referred to as connectionist learning due to the connections between units
Neural Network as a Classifier

◼ Weakness
◼ Long training time
◼ Require a number of parameters typically best determined empirically, e.g., the
network topology or “structure”
◼ Poor interpretability: Difficult to interpret the symbolic meaning behind the
learned weights and of “hidden units” in the network
◼ Strength
◼ High tolerance to noisy data
◼ Ability to classify untrained patterns
◼ Well-suited for continuous-valued inputs and outputs
◼ Successful on a wide array of real-world data
◼ Algorithms are inherently parallel
◼ Techniques have recently been developed for the extraction of rules from trained
neural networks
A Neuron (= a perceptron)

◼ The n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping

A Multi-Layer Feed-Forward Neural Network


◼ The inputs to the network correspond to the attributes measured for each training tuple
◼ Inputs are fed simultaneously into the units making up the input layer
◼ They are then weighted and fed simultaneously to a hidden layer
◼ The number of hidden layers is arbitrary, although usually only one
◼ The weighted outputs of the last hidden layer are input to units making up the output
layer, which emits the network's prediction
◼ The network is feed-forward in that none of the weights cycles back to an input unit or to
an output unit of a previous layer
◼ From a statistical point of view, networks perform nonlinear regression: Given enough
hidden units and enough training samples, they can closely approximate any function
Backpropagation

◼ Iteratively process a set of training tuples & compare the network's prediction with the
actual known target value
◼ For each training tuple, the weights are modified to minimize the mean squared error
between the network's prediction and the actual target value
◼ Modifications are made in the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence “backpropagation”

◼ Steps
◼ Initialize weights (to small random #s) and biases in the network
◼ Propagate the inputs forward (by applying activation function)
◼ Backpropagate the error (by updating weights and biases)
◼ Terminating condition (when error is very small, etc.)

◼ Efficiency of backpropagation: Each epoch (one iteration through the training set)
takes O(|D| × w), with |D| tuples and w weights, but the number of epochs can be exponential in n,
the number of inputs, in the worst case
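
The sketch below implements these steps for a tiny one-hidden-layer network with sigmoid units and squared error, trained on the OR function; the topology, learning rate, and epoch count are illustrative choices only.

# Minimal backpropagation sketch: forward pass, backward error propagation,
# then weight/bias updates from the output layer back to the hidden layer.
import math, random

random.seed(1)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

n_in, n_hid = 2, 2
# weights and biases initialised to small random values
w_h = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
b_h = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
w_o = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
b_o = random.uniform(-0.5, 0.5)

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]   # the OR function
lr = 0.5

for epoch in range(5000):
    for x, t in data:
        # forward pass
        h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j]) for j in range(n_hid)]
        o = sigmoid(sum(w * hj for w, hj in zip(w_o, h)) + b_o)
        # backward pass: output error first, then hidden-layer errors
        err_o = o * (1 - o) * (t - o)
        err_h = [h[j] * (1 - h[j]) * w_o[j] * err_o for j in range(n_hid)]
        # updates in the "backwards" direction
        for j in range(n_hid):
            w_o[j] += lr * err_o * h[j]
            for i in range(n_in):
                w_h[j][i] += lr * err_h[j] * x[i]
            b_h[j] += lr * err_h[j]
        b_o += lr * err_o

for x, t in data:
    h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j]) for j in range(n_hid)]
    o = sigmoid(sum(w * hj for w, hj in zip(w_o, h)) + b_o)
    print(x, t, round(o, 2))        # predictions should be close to the targets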

◼ Rule extraction from networks: network pruning


◼ Simplify the network structure by removing weighted links that have the least
effect on the trained network
◼ Then perform link, unit, or activation value clustering
◼ The set of input and activation values are studied to derive rules describing the
relationship between the input and hidden unit layers

◼ Sensitivity analysis: assess the impact that a given input variable has on a network output.
The knowledge gained from this analysis can be represented in rules

SVM—Support Vector Machines

◼ A new classification method for both linear and nonlinear data


◼ It uses a nonlinear mapping to transform the original training data into a higher
dimension
◼ With the new dimension, it searches for the linear optimal separating hyperplane (i.e.,
“decision boundary”)
◼ With an appropriate nonlinear mapping to a sufficiently high dimension, data from two
classes can always be separated by a hyperplane
◼ SVM finds this hyperplane using support vectors (“essential” training tuples) and
margins (defined by the support vectors)
◼ Features: training can be slow but accuracy is high owing to their ability to model
complex nonlinear decision boundaries (margin maximization)
◼ Used both for classification and prediction
◼ Applications:
◼ handwritten digit recognition, object recognition, speaker identification,
benchmarking time-series prediction tests
SVM—General Philosophy
SVM—Margins and Support Vectors

SVM—Linearly Separable
◼ A separating hyperplane can be written as
W · X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
◼ For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
◼ The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
◼ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
◼ This becomes a constrained (convex) quadratic optimization problem: Quadratic
objective function and linear constraints → Quadratic Programming (QP) → Lagrangian
multipliers
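
Solving the quadratic program is normally delegated to an SVM library. As a rough, non-authoritative illustration of finding a separating hyperplane W · X + b = 0, the sketch below applies sub-gradient descent to the regularized hinge loss (a common surrogate for the SVM objective) on a tiny linearly separable 2-D set.

# Not the QP solver itself: a sketch that fits a linear separator w.x + b
# by sub-gradient descent on  lam*||w||^2 + max(0, 1 - y(w.x + b)).
import random

random.seed(0)
data = [([1.0, 2.0], +1), ([2.0, 3.0], +1), ([3.0, 1.0], -1), ([4.0, 2.0], -1)]

w, b = [0.0, 0.0], 0.0
lam, lr = 0.01, 0.1                      # regularization strength, learning rate

for step in range(2000):
    x, y = random.choice(data)
    margin = y * (w[0] * x[0] + w[1] * x[1] + b)
    if margin < 1:                       # point inside the margin: push it out
        w = [wi - lr * (2 * lam * wi - y * xi) for wi, xi in zip(w, x)]
        b = b + lr * y
    else:                                # correctly classified with margin: only shrink w
        w = [wi - lr * 2 * lam * wi for wi in w]

for x, y in data:
    score = w[0] * x[0] + w[1] * x[1] + b
    print(x, y, "+" if score >= 0 else "-")   # signs should match the labels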

Why Is SVM Effective on High Dimensional Data?

◼ The complexity of trained classifier is characterized by the # of support vectors rather


than the dimensionality of the data
◼ The support vectors are the essential or critical training examples —they lie closest to the
decision boundary (MMH)
◼ If all other training examples are removed and the training is repeated, the same
separating hyperplane would be found
◼ The number of support vectors found can be used to compute an (upper) bound on the
expected error rate of the SVM classifier, which is independent of the data dimensionality
◼ Thus, an SVM with a small number of support vectors can have good generalization,
even when the dimensionality of the data is high

Associative Classification

◼ Associative classification
◼ Association rules are generated and analyzed for use in classification
◼ Search for strong associations between frequent patterns (conjunctions of
attribute-value pairs) and class labels
◼ Classification: Based on evaluating a set of rules in the form of
p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (conf, sup)
◼ Why effective?
◼ It explores highly confident associations among multiple attributes and may
overcome some constraints introduced by decision-tree induction, which
considers only one attribute at a time
In many studies, associative classification has been found to be more accurate than some
traditional classification methods, such as C4.5.

Typical Associative Classification Methods

◼ CBA (Classification By Association: Liu, Hsu & Ma, KDD’98)


◼ Mine possible association rules in the form of
◼ Cond-set (a set of attribute-value pairs) → class label
◼ Build classifier: Organize rules according to decreasing precedence based on
confidence and then support
◼ CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
◼ Classification: Statistical analysis on multiple rules
◼ CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03)
◼ Generation of predictive rules (FOIL-like analysis)
◼ High efficiency, accuracy similar to CMAR
◼ RCBT (Mining top-k covering rule groups for gene expression data, Cong et al.
SIGMOD’05)
◼ Explore high-dimensional classification, using top-k rule groups
◼ Achieve high classification accuracy and high run-time efficiency

Associative Classification May Achieve High Accuracy and Efficiency (Cong et al.
SIGMOD05)

Other Classification Methods


The k-Nearest Neighbor Algorithm

◼ All instances correspond to points in the n-D space


◼ The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
◼ Target function could be discrete- or real- valued
◼ For discrete-valued, k-NN returns the most common value among the k training examples
nearest to xq
◼ k-NN for real-valued prediction for a given unknown tuple
◼ Returns the mean values of the k nearest neighbors
◼ Distance-weighted nearest neighbor algorithm
◼ Weight the contribution of each of the k neighbors according to their distance to
the query xq
◼ Give greater weight to closer neighbors
◼ Robust to noisy data by averaging k-nearest neighbors
◼ Curse of dimensionality: distance between neighbors could be dominated by irrelevant
attributes
◼ To overcome it, stretch the axes or eliminate the least relevant attributes
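
A minimal k-NN sketch with Euclidean distance and optional distance weighting (toy data; k and the weighting scheme are illustrative choices):

# k-nearest-neighbor sketch: closer neighbors get larger votes when weighted.
import math
from collections import defaultdict

train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([5.0, 5.0], "B"), ([5.5, 4.5], "B")]

def knn(query, k=3, weighted=True):
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda t: dist(query, t[0]))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = dist(query, x)
        votes[label] += 1.0 / (d ** 2 + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

print(knn([1.1, 1.0]))        # -> "A": the two closest neighbors dominate the vote
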
Genetic Algorithms

◼ Genetic Algorithm: based on an analogy to biological evolution


◼ An initial population is created consisting of randomly generated rules
◼ Each rule is represented by a string of bits
◼ E.g., if A1 and ¬A2 then C2 can be encoded as 100
◼ If an attribute has k > 2 values, k bits can be used
◼ Based on the notion of survival of the fittest, a new population is formed to consist of the
fittest rules and their offspring
◼ The fitness of a rule is represented by its classification accuracy on a set of training
examples
◼ Offspring are generated by crossover and mutation
◼ The process continues until a population P evolves in which each rule in P satisfies a
prespecified fitness threshold
◼ Slow but easily parallelizable

Rough Set Approach:

◼ Rough sets are used to approximately or “roughly” define equivalence classes


◼ A rough set for a given class C is approximated by two sets: a lower approximation
(certain to be in C) and an upper approximation (cannot be described as not belonging to
C)
Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a
discernibility matrix (which stores the differences between attribute values for each pair of data
tuples) can be used to reduce the computational intensity.
Figure: A rough set approximation of the set of tuples of the class C using lower and upper
approximation sets of C. The rectangular regions represent equivalence classes

Fuzzy Set approaches

◼ Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of
membership (such as using fuzzy membership graph)
◼ Attribute values are converted to fuzzy values
◼ e.g., income is mapped into the discrete categories {low, medium, high} with
fuzzy values calculated
◼ For a given new sample, more than one fuzzy value may apply
◼ Each applicable rule contributes a vote for membership in the categories
◼ Typically, the truth values for each predicted category are summed, and these sums are
combined

CLUSTER ANALYSIS
What is Cluster Analysis?

• Cluster: a collection of data objects


– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms

General Applications of Clustering

• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use
this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent
faults

What Is Good Clustering?


• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the
method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all
of the hidden patterns.

Requirements of Clustering in Data Mining


• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

Type of data in clustering analysis


• Interval-scaled variables:
• Binary variables:
• Nominal, ordinal, and ratio variables:
• Variables of mixed types:

Interval-valued variables
• Standardize data
– Calculate the mean absolute deviation:
sf = (1/n)(|x1f − mf| + |x2f − mf| + ... + |xnf − mf|)
where mf = (1/n)(x1f + x2f + ... + xnf)
– Calculate the standardized measurement (z-score):
zif = (xif − mf) / sf
• Using mean absolute deviation is more robust than using standard deviation
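
A small sketch of this standardization for one variable (the values are made up):

# Standardize one interval-scaled variable using the mean absolute deviation.
values = [15.0, 25.0, 35.0, 45.0]                        # x_1f ... x_nf for one variable f

m_f = sum(values) / len(values)                          # mean
s_f = sum(abs(x - m_f) for x in values) / len(values)    # mean absolute deviation
z = [(x - m_f) / s_f for x in values]                    # standardized measurements (z-scores)

print(m_f, s_f)    # 30.0 10.0
print(z)           # [-1.5, -0.5, 0.5, 1.5]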

Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data
objects
• Some popular ones include the Minkowski distance:

d(i, j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + ... + |xip − xjp|^q)^(1/q)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q
is a positive integer

• If q = 1, d is the Manhattan distance:

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

• If q = 2, d is the Euclidean distance:

d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2)
– Properties

• d(i, j) ≥ 0

• d(i, i) = 0

• d(i, j) = d(j, i)

• d(i, j) ≤ d(i, k) + d(k, j)

• Also one can use weighted distance, parametric Pearson product moment correlation, or
other dissimilarity measures.

Binary Variables

A contingency table for binary data (Object i vs. Object j):

                   Object j
                1      0     sum
Object i    1   a      b     a+b
            0   c      d     c+d
          sum  a+c    b+d     p

• Simple matching coefficient (invariant, if the binary variable is symmetric):

d(i, j) = (b + c) / (a + b + c + d)

• Jaccard coefficient (noninvariant if the binary variable is asymmetric):

d(i, j) = (b + c) / (a + b + c)

Dissimilarity between Binary Variables


• Example

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4


Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

– gender is a symmetric attribute


– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
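
The sketch below reproduces these three dissimilarities using the asymmetric (Jaccard-style) formula, with Y/P coded as 1 and N as 0 and the symmetric gender attribute ignored:

# Asymmetric binary dissimilarity for the example above.
records = {
    "jack": [1, 0, 1, 0, 0, 0],     # Fever, Cough, Test-1 ... Test-4
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

def d(i, j):
    a = sum(1 for x, y in zip(records[i], records[j]) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(records[i], records[j]) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(records[i], records[j]) if x == 0 and y == 1)
    return (b + c) / (a + b + c)    # the 0/0 matches (d) are left out (asymmetric)

print(round(d("jack", "mary"), 2))  # 0.33
print(round(d("jack", "jim"), 2))   # 0.67
print(round(d("jim", "mary"), 2))   # 0.75
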
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow,
blue, green
• Method 1: Simple matching
– m: # of matches, p: total # of variables

d(i, j) = (p − m) / p
• Method 2: use a large number of binary variables

– creating a new binary variable for each of the M nominal states

Ordinal Variables

An ordinal variable can be


• discrete or continuous
• order is important, e.g., rank
• Can be treated like interval-scaled
– replace xif by its rank rif ∈ {1, …, Mf}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th
variable by
zif = (rif − 1) / (Mf − 1)
– compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables

• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at


exponential scale, such as Ae^(Bt) or Ae^(−Bt)
• Methods:
– treat them like interval-scaled variables — not a good choice! (why?)
– apply logarithmic transformation
yif = log(xif)
– treat them as continuous ordinal data and treat their ranks as interval-scaled.

Variables of Mixed Types

• A database may contain all the six types of variables


– symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
• One may use a weighted formula to combine their effects:

d(i, j) = Σf=1..p δij(f) dij(f) / Σf=1..p δij(f)

– f is binary or nominal:
dij(f) = 0 if xif = xjf; otherwise dij(f) = 1
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled
• compute ranks rif and zif = (rif − 1) / (Mf − 1)
• and treat zif as interval-scaled

Types of Clustering

◼ Important distinction between hierarchical and partitional sets of clusters


◼ Partitional Clustering
◼ A division data objects into non-overlapping subsets (clusters) such that each data
object is in exactly one subset
◼ Hierarchical clustering
◼ A set of nested clusters organized as a hierarchical tree

HIERARCHICAL CLUSTERING

Figure: a partitional clustering of a small set of points (p1, p2, p3, p4) and two dendrograms of
the same points.

A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a
dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired
level; each connected component then forms a cluster.
AGNES (Agglomerative Nesting)

• Introduced in Kaufmann and Rousseeuw (1990)


• Implemented in statistical analysis packages, e.g., Splus
• Use the Single-Link method and the dissimilarity matrix.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
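
A toy single-link agglomerative sketch in the spirit of AGNES (1-D points for brevity; real implementations maintain and update a dissimilarity matrix instead of recomputing distances):

# Single-link agglomerative sketch: start with singleton clusters and merge
# the pair of clusters with the smallest minimum distance until 2 remain.
points = [1.0, 2.0, 9.0, 10.0, 25.0]
clusters = [[p] for p in points]                 # every object starts as its own cluster

def single_link(c1, c2):
    return min(abs(a - b) for a in c1 for b in c2)

while len(clusters) > 2:                         # stop at 2 clusters (or continue to 1)
    i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]      # merge the least-dissimilar pair
    del clusters[j]
    print(clusters)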

DIANA (Divisive Analysis)

• Introduced in Kaufmann and Rousseeuw (1990)


• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own

Recent Hierarchical Clustering Methods

• Major weakness of agglomerative clustering methods


– do not scale well: time complexity of at least O(n²), where n is the total number of
objects
– can never undo what was done previously
• Integration of hierarchical with distance-based clustering
– BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
– ROCK (1999): clustering categorical data by neighbor and link analysis
– CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)

Birch: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan &
Livny, SIGMOD’96)
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for
multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data
that tries to preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a
few additional scans
Weakness: handles only numeric data, and sensitive to the order of the data record.

Clustering Feature Vector in BIRCH

Clustering Feature: CF = (N, LS, SS)

N: the number of data points

LS: the linear sum of the N data points, Σi=1..N Xi

SS: the square sum of the N data points, Σi=1..N Xi²
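
A small sketch of a CF for one-dimensional points, including the additivity property that lets a CF-tree absorb new points and merge sub-clusters incrementally:

# BIRCH clustering feature CF = (N, LS, SS) for 1-D points, with additivity
# and the centroid derived from the summary alone.
def cf(points):
    return (len(points), sum(points), sum(x * x for x in points))

def cf_add(cf1, cf2):                      # CFs of disjoint sub-clusters simply add
    return tuple(a + b for a, b in zip(cf1, cf2))

a, b = [2.0, 3.0, 4.0], [8.0, 9.0]
cf_ab = cf_add(cf(a), cf(b))
assert cf_ab == cf(a + b)                  # (5, 26.0, 174.0)

N, LS, SS = cf_ab
print("centroid:", LS / N)                 # 5.2, without storing the raw points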

CF-Tree in BIRCH

Clustering feature:
Summary of the statistics for a given subcluster: the 0-th, 1st and 2nd moments of the sub
cluster from the statistical point of view.

Registers crucial measurements for computing clusters and utilizes storage efficiently

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering

– A nonleaf node in a tree has descendants or “children”


– The nonleaf nodes store sums of the CFs of their children

A CF tree has two parameters
– Branching factor: specifies the maximum number of children
– Threshold: the maximum diameter of sub-clusters stored at the leaf nodes
Clustering Categorical Data: The ROCK Algorithm

• ROCK: RObust Clustering using linKs


– S. Guha, R. Rastogi & K. Shim, ICDE’99
• Major ideas
– Use links to measure similarity/proximity
– Not distance-based
– Computational complexity: O(n² + n·mm·ma + n² log n), where mm and ma are the maximum and average numbers of neighbors
• Algorithm: sampling-based clustering
– Draw random sample
– Cluster with links
– Label data in disk
• Experiments
– Congressional voting, mushroom data

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

• CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99

• Measures the similarity based on a dynamic model



– Two clusters are merged only if the interconnectivity and closeness (proximity)
between two clusters are high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
– CURE ignores information about the interconnectivity of the objects, while ROCK ignores
information about the closeness of two clusters
• A two-phase algorithm
– Use a graph partitioning algorithm: cluster objects into a large number of
relatively small sub-clusters
– Use an agglomerative hierarchical clustering algorithm: find the genuine clusters
by repeatedly combining these sub-clusters

Partitioning Algorithms: Basic Concept

• Partitioning method: Construct a partition of a database D of n objects into a set of k


clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen’67): Each cluster is represented by the center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each
cluster is represented by one of the objects in the cluster
The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in 4 steps:


– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current partition. The
centroid is the center (mean point) of the cluster.
– Assign each object to the cluster with the nearest seed point.
– Go back to Step 2; stop when no reassignment occurs.
K-means Example 1

Solve the following with K=2: {2, 25, 10, 15, 5, 20, 4, 40}
Step 1: Randomly assign the data to two clusters and calculate the mean values:
K1: {2, 10, 5, 4} Mean = 5.25
K2: {25, 15, 20, 40} Mean = 25
Step 2: Reassign each value to the cluster whose mean is nearest (for example, 15 is closer
to 5.25 than to 25, since |15 − 5.25| = 9.75 < |15 − 25| = 10):
K1: {2, 4, 5, 10, 15} Mean = 7.2
K2: {20, 25, 40} Mean = 28.33
Step 3: Reassign with the new means: every value stays in its cluster, so the clusters are the
same as in Step 2 and the algorithm stops. The final answer is
K1: {2, 4, 5, 10, 15} K2: {20, 25, 40}

K-means Example 2

Use the K-means algorithm to create 3 clusters for the set of values {2, 3, 6, 8, 9, 12, 15, 18, 22}


Step 1: Randomly assign the values to three clusters and compute the means:
K1: 2, 8, 15 – mean 8.33
K2: 3, 9, 18 – mean 10
K3: 6, 12, 22 – mean 13.33
Step 2: Reassign each value to the cluster with the nearest mean:
K1: 2, 3, 6, 8, 9 – mean 5.6
K2: (empty) – mean taken as 0
K3: 12, 15, 18, 22 – mean 16.75
Step 3: Reassign
K1: 6, 8, 9 – mean 7.67
K2: 2, 3 – mean 2.5
K3: 12, 15, 18, 22 – mean 16.75
Step 4: Reassign
K1: 6, 8, 9 – mean 7.67
K2: 2, 3 – mean 2.5
K3: 12, 15, 18, 22 – mean 16.75
The last two steps give the same clusters, so the three clusters are:
Cluster 1 = {6, 8, 9}, Cluster 2 = {2, 3}, Cluster 3 = {12, 15, 18, 22}
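
A minimal k-means sketch that follows the four steps above; started from the initial partition of Example 1, it converges to the same two clusters. (It does not handle the empty-cluster case that arises in Example 2.)

# k-means for 1-D values: recompute centroids, reassign to the nearest seed,
# and stop when no assignment changes.
def kmeans(values, clusters):
    while True:
        means = [sum(c) / len(c) for c in clusters]               # step 2: centroids
        new = [[] for _ in clusters]
        for v in values:                                          # step 3: nearest seed point
            new[min(range(len(means)), key=lambda i: abs(v - means[i]))].append(v)
        if new == clusters:                                       # step 4: no change -> stop
            return clusters, means
        clusters = new

values = [2, 25, 10, 15, 5, 20, 4, 40]
initial = [[2, 10, 5, 4], [25, 15, 20, 40]]                       # step 1: random partition
clusters, means = kmeans(values, initial)
print(clusters)     # [[2, 10, 15, 5, 4], [25, 20, 40]]
print(means)        # [7.2, 28.33...]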

What Is the Problem of the K-Means Method?

• The k-means algorithm is sensitive to outliers!


– Since an object with an extremely large value may substantially distort the
distribution of the data.

• K-Medoids: Instead of taking the mean value of the object in a cluster as a reference
point, medoids can be used, which is the most centrally located object in a cluster.
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of the medoids by
one of the non-medoids if it improves the total distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale well for large data
sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): Randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)

CLARA (Clustering Large Applications) (1990)


• CLARA (Kaufmann and Rousseeuw in 1990)
– Built in statistical analysis packages, such as S+
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best
clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased

CLARANS (“Randomized” CLARA) (1994)

• CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)


• CLARANS draws a sample of neighbors dynamically
• The clustering process can be presented as searching a graph where every node is a
potential solution, that is, a set of k medoids
• If a local optimum is found, CLARANS starts with a new randomly selected node in
search of a new local optimum
• It is more efficient and scalable than both PAM and CLARA
• Focusing techniques and spatial access structures may further improve its performance
(Ester et al.’95)

Density-Based Clustering Methods

• Clustering based on density (local cluster criterion), such as density-connected points


• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
DBSCAN: The Algorithm

• Arbitrarily select a point p


• Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p and DBSCAN visits the next
point of the database.
• Continue the process until all of the points have been processed.
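
A compact sketch of this procedure (Euclidean distance; the points, Eps, and MinPts values are illustrative):

# DBSCAN sketch: expand a cluster from each unvisited core point; border
# points join a cluster but do not expand it; unreachable points are noise (-1).
import math

def dbscan(points, eps, min_pts):
    dist = lambda a, b: math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
    neighbors = lambda i: [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # not a core point (may later become a border point)
            labels[i] = -1
            continue
        labels[i] = cluster               # start a new cluster from this core point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # previously noise -> border point of this cluster
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels

pts = [(1, 1), (1.2, 1.1), (0.9, 1.0), (5, 5), (5.1, 5.2), (5.2, 4.9), (9, 9)]
print(dbscan(pts, eps=0.5, min_pts=3))    # two clusters and one noise point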

OPTICS: A Cluster-Ordering Method (1999)

• OPTICS: Ordering Points To Identify the Clustering Structure


– Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
– Produces a special order of the database wrt its density-based clustering structure
– This cluster-ordering contains information equivalent to the density-based clusterings
corresponding to a broad range of parameter settings
– Good for both automatic and interactive cluster analysis, including finding
intrinsic clustering structure
– Can be represented graphically or using visualization techniques
DENCLUE: Technical Essence
• Uses grid cells but only keeps information about grid cells that do actually contain data
points and manages these cells in a tree-based access structure
• Influence function: describes the impact of a data point within its neighborhood
• Overall density of the data space can be calculated as the sum of the influence function of
all data points
• Clusters can be determined mathematically by identifying density attractors
• Density attractors are local maxima of the overall density function

Grid-Based Clustering Method


• Using multi-resolution grid data structure
• Several interesting methods
– STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz
(1997)
– WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach using wavelet method
– CLIQUE: Agrawal, et al. (SIGMOD’98)
• On high-dimensional data (thus put in the section on clustering high-
dimensional data)
STING: A Statistical Information Grid Approach

• Wang, Yang and Muntz (VLDB’97)


• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of
resolution

The STING Clustering Method

– Each cell at a high level is partitioned into a number of smaller cells in the next lower level
– Statistical info of each cell is calculated and stored beforehand and is used to answer queries
– Parameters of higher level cells can be easily calculated from parameters of lower level cell
• count, mean, s, min, max
• type of distribution—normal, uniform, etc.
– Use a top-down approach to answer spatial data queries
– Start from a pre-selected layer—typically with a small number of cells
– For each cell in the current level compute the confidence interval
Comments on STING
• Remove the irrelevant cells from further consideration
• When finish examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
– Query-independent, easy to parallelize, incremental update
– O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
– All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected
WaveCluster: Clustering by Wavelet Analysis (1998)
• Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach which applies wavelet transform to the
feature space
• How to apply wavelet transform to find clusters
– Summarizes the data by imposing a multidimensional grid structure onto
data space
– These multidimensional spatial data objects are represented in an n-
dimensional feature space
– Apply wavelet transform on feature space to find the dense regions in the
feature space
– Apply wavelet transform multiple times which result in clusters at different
scales from fine to coarse
Wavelet Transform
• Wavelet transform: A signal processing technique that decomposes a signal into
different frequency sub-band (can be applied to n-dimensional signals)
• Data are transformed to preserve relative distance between objects at different
levels of resolution
• Allows natural clusters to become more distinguishable

The WaveCluster Algorithm

• Input parameters
– # of grid cells for each dimension
– the wavelet, and the # of applications of wavelet transform
• Why is wavelet transformation useful for clustering?
– Use hat-shape filters to emphasize region where points cluster, but simultaneously
suppress weaker information in their boundary
– Effective removal of outliers, multi-resolution, cost effective
• Major features:
– Complexity O(N)
– Detect arbitrary shaped clusters at different scales
– Not sensitive to noise, not sensitive to input order
– Only applicable to low dimensional data
• Both grid-based and density-based
Clustering High-Dimensional Data
• Clustering high-dimensional data
– Many applications: text documents, DNA micro-array data
– Major challenges:
• Many irrelevant dimensions may mask clusters
• Distance measure becomes meaningless—due to equi-distance
• Clusters may exist only in some subspaces
• Methods
– Feature transformation: only effective if most dimensions are relevant
• PCA & SVD useful only when features are highly correlated/redundant
– Feature selection: wrapper or filter approaches
• useful to find a subspace where the data have nice clusters
– Subspace-clustering: find clusters in all the possible subspaces
• CLIQUE, ProClus, and frequent pattern-based clustering
CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifying subspaces of a high dimensional data space that allow better
clustering than original space
• CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal-length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular units
– A unit is dense if the fraction of total data points contained in the unit exceeds the
input model parameter
– A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the
partition.
• Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
– Determine dense units in all subspaces of interests
– Determine connected dense units in all subspaces of interests.
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected dense units for each
cluster
– Determination of minimal cover for each cluster
Strength and Weakness of CLIQUE
• Strength
– automatically finds subspaces of the highest dimensionality such that high density
clusters exist in those subspaces
– insensitive to the order of records in input and does not presume some canonical
data distribution
– scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weakness
– The accuracy of the clustering result may be degraded at the expense of simplicity
of the method
Clustering with User-Specified Constraints

• Example: Locating k delivery centers, each serving at least m valued customers and n
ordinary ones
• Proposed approach
– Find an initial “solution” by partitioning the data set into k groups and satisfying
user-constraints
– Iteratively refine the solution by micro-clustering relocation (e.g., moving δ μ-
clusters from cluster Ci to Cj) and “deadlock” handling (break the microclusters
when necessary)
– Efficiency is improved by micro-clustering
• How to handle more complicated constraints?
– E.g., having approximately same number of valued customers in each cluster?! —
Can you solve it?
What Is Outlier Discovery?
• What are outliers?
– The set of objects that are considerably dissimilar from the remainder of the data
– Example: Sports: Michael Jordan, Wayne Gretzky, ...
• Problem: Define and find outliers in large data sets
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
