
Data Mining

UNIT-IV
Types Of Data Used In Cluster Analysis Are:
 Interval-Scaled variables
 Binary variables
 Nominal, Ordinal, and Ratio variables
 Variables of mixed types
Main memory-based clustering algorithms typically operate on either of the
following two data structures.
Types of data structures in cluster analysis are
 Data Matrix (or object by variable structure)
 Dissimilarity Matrix (or object by object structure)
Interval-Scaled Variables:
Interval-scaled variables are continuous measurements of a roughly linear
scale.
Typical examples include weight and height, latitude and longitude coordinates
(e.g., when clustering houses), and weather temperature.
The measurement unit used can affect the clustering analysis. For example,
changing measurement units from meters to inches for height, or from
kilograms to pounds for weight, may lead to a very different clustering
structure.
In general, expressing a variable in smaller units will lead to a larger range for
that variable, and thus a larger effect on the resulting clustering structure.
To help avoid dependence on the choice of measurement units, the data should
be standardized. Standardizing measurements attempts to give all variables an
equal weight.
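A commonly used standardization computes the mean absolute deviation of each variable f and then a z-score (the symbols below follow the usual convention and are not defined elsewhere in these notes):
m_f = (x_1f + x_2f + ... + x_nf) / n
s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
z_if = (x_if − m_f) / s_f
The mean absolute deviation s_f is more robust to outliers than the standard deviation, which is why it is often preferred for this purpose.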
Binary Variables:
A binary variable is a variable that can take only 2 values.
For example, a gender variable can take the two values male and female.
Contingency Table for Binary Data:
Let the two binary values be 0 and 1.
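In the standard formulation, for two objects i and j described by p binary variables, the contingency table of counts is:

                 object j = 1   object j = 0
object i = 1         q              r
object i = 0         s              t

where q is the number of variables equal to 1 for both objects, t the number equal to 0 for both, and r, s the numbers on which the objects disagree. For symmetric binary variables the dissimilarity is d(i, j) = (r + s) / (q + r + s + t); for asymmetric binary variables, where matching 0s carry no information, it is d(i, j) = (r + s) / (q + r + s), the complement of the Jaccard coefficient.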
Nominal or Categorical Variables:
A generalization of the binary variable in that it can take more than 2 states,
e.g., red, yellow, blue, green.
Method 1: Simple matching
The dissimilarity between two objects i and j can be computed based on the
simple matching.
m: the number of matches (i.e., the number of variables for which i and j are in the same state).
p: the total number of variables.
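The simple-matching dissimilarity is then d(i, j) = (p − m) / p. For example, if two objects agree on 2 of 4 nominal variables, d(i, j) = (4 − 2) / 4 = 0.5.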

Method 2: use a large number of binary variables


Create a new binary variable for each of the M nominal states; for a given object, the variable corresponding to its state is set to 1 and the others to 0 (a one-hot encoding, as sketched below).
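A minimal Python sketch of this encoding (the states listed are just the example colors above):

def one_hot(value, states=("red", "yellow", "blue", "green")):
    # one binary variable per nominal state; exactly one of them is 1
    return [1 if value == s else 0 for s in states]

print(one_hot("yellow"))   # [0, 1, 0, 0]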
Ordinal Variables:
An ordinal variable can be discrete or continuous. Here the order of the values is important, e.g., rank. Ordinal variables can be treated like interval-scaled variables by first replacing x_if by its rank r_if, and then mapping the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable by
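z_if = (r_if − 1) / (M_f − 1), where r_if ∈ {1, ..., M_f} is the rank of the i-th object on variable f and M_f is the number of ordered states of f.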

Ratio-Scaled Variables:
Ratio-scaled variable: It is a positive measurement on a nonlinear scale, approximately at an exponential scale, such as Ae^(Bt) or Ae^(−Bt).
Methods for handling ratio-scaled variables:
 Treat them like interval-scaled variables (not a good choice, because the exponential scale is likely to distort the results).
 Apply a logarithmic transformation, i.e., y_if = log(x_if), and then treat the result as interval-scaled (see the example after this list).
 Treat them as continuous ordinal data and treat their ranks as interval-scaled.
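For example, for a measurement that grows as x = Ae^(Bt), the transformed value y = log(x) = log(A) + Bt grows linearly in t, so interval-scaled techniques can then be applied to y.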
Variables Of Mixed Type:
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval-scaled, and ratio-scaled. Such a combination is referred to as variables of mixed type.
Cluster Analysis:
Clustering is the process of grouping a set of data objects into multiple groups
or clusters so that objects within a cluster have high similarity, but are very
dissimilar to objects in other clusters. Dissimilarities and similarities are
assessed based on the attribute values describing the objects and often involve
distance measures.

Clustering as a data mining tool has its roots in many application areas such
as biology, security, business intelligence, and Web search. Clustering
techniques, organized into the following categories: partitioning methods,
hierarchical methods and density-based methods.

Cluster analysis or simply clustering is the process of partitioning a set of data objects (or observations) into subsets. Each subset is a cluster, such that
objects in a cluster are similar to one another, yet dissimilar to objects in other
clusters. The set of clusters resulting from a cluster analysis can be referred to
as a clustering. The partitioning is not performed by humans, but by the
clustering algorithm. Hence, clustering is useful in that it can lead to the
discovery of previously unknown groups within the data.

Because a cluster is a collection of data objects that are similar to one another
within the cluster and dissimilar to objects in other clusters, a cluster of data
objects can be treated as an implicit class. In this sense, clustering is
sometimes called automatic classification. Again, a critical difference here is
that clustering can automatically find the groupings. This is a distinct
advantage of cluster analysis.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection, where outliers (values that
are “far away” from any cluster) may be more interesting than common cases.
Applications of outlier detection include the detection of credit card fraud and
the monitoring of criminal activities in electronic commerce.
Clustering is known as unsupervised learning because the class label
information is not present. For this reason, clustering is a form of learning by
observation, rather than learning by examples.

Requirements for Cluster Analysis:


The following are typical requirements of clustering in data mining.

 Scalability: Many clustering algorithms work well on small datasets containing fewer than several hundred data objects; however, a large
database may contain millions or even billions of objects, particularly in
Web search scenarios. Clustering on only a sample of a given large data
set may lead to biased results. Therefore, highly scalable clustering
algorithms are needed.
 Ability to deal with different types of attributes: Many algorithms are
designed to cluster numeric (interval-based) data. However, applications
may require clustering other data types, such as binary, nominal
(categorical), and ordinal data, or mixtures of these data types. Recently,
more and more applications need clustering techniques for complex data
types such as graphs, sequences, images, and documents.
 Discovery of clusters with arbitrary shape: Many clustering algorithms
determine clusters based on Euclidean or Manhattan distance measures.
Algorithms based on such distance measures tend to find spherical
clusters with similar size and density. However, a cluster could be of any
shape. Consider sensors, for example, which are often deployed for
environment surveillance. Cluster analysis on sensor readings can detect
interesting phenomena.
 Requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to provide domain knowledge
in the form of input parameters such as the desired number of clusters.
Consequently, the clustering results may be sensitive to such
parameters. Parameters are often hard to determine, especially for high-
dimensionality data sets and where users have yet to grasp a deep
understanding of their data.
 Ability to deal with noisy data: Most real-world data sets contain
outliers and/or missing, unknown, or erroneous data. Sensor readings,
for example, are often noisy—some readings may be inaccurate due to
the sensing mechanisms, and some readings may be erroneous due to
interferences from surrounding transient objects. Clustering algorithms
can be sensitive to such noise and may produce poor-quality clusters.
Therefore, we need clustering methods that are robust to noise.
 Incremental clustering and insensitivity to input order: In many
applications, incremental updates (representing newer data) may arrive
at any time. Some clustering algorithms cannot incorporate incremental
updates into existing clustering structures and, instead, have to recompute a new clustering from scratch. Clustering algorithms may also
be sensitive to the input data order. That is, given a set of data objects,
clustering algorithms may return dramatically different clusterings
depending on the order in which the objects are presented. Incremental
clustering algorithms and algorithms that are insensitive to the input
order are needed.
 Capability of clustering high-dimensionality data: A dataset can
contain numerous dimensions or attributes. When clustering documents,
for example, each keyword can be regarded as a dimension, and there
are often thousands of keywords. Most clustering algorithms are good at
handling low-dimensional data such as data sets involving only two or
three dimensions. Finding clusters of data objects in a high dimensional
space is challenging, especially considering that such data can be very
sparse and highly skewed.
 Constraint-based clustering: Real-world applications may need to
perform clustering under various kinds of constraints. Suppose that your
job is to choose the locations for a given number of new automatic teller
machines (ATMs) in a city. To decide upon this, you may cluster
households while considering constraints such as the city’s rivers and
highway networks and the types and number of customers per cluster. A
challenging task is to find data groups with good clustering behavior that
satisfy specified constraints.
 Interpretability and usability: Users want clustering results to be
interpretable, comprehensible, and usable. That is, clustering may need
to be tied in with specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of
clustering features and clustering methods.
The following are orthogonal aspects with which clustering methods can be
compared:

 The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among the clusters. That is, all
the clusters are at the same level conceptually. Such a method is useful,
for example, for partitioning customers into groups so that each group
has its own manager. Alternatively, other methods partition data objects
hierarchically, where clusters can be formed at different semantic levels.
 Separation of clusters: Some methods partition data objects into
mutually exclusive clusters. When clustering customers into groups so
that each group is taken care of by one manager, each customer may
belong to only one group. In some other situations, the clusters may not
be exclusive, that is, a data object may belong to more than one cluster.
 Similarity measure: Some methods determine the similarity between
two objects by the distance between them. Such a distance can be
defined on Euclidean space, a road network, a vector space, or any other
space. In other methods, the similarity may be defined by connectivity
based on density or contiguity, and may not rely on the absolute distance
between two objects. Similarity measures play a fundamental role in the
design of clustering methods. While distance-based methods can often
take advantage of optimization techniques, density- and continuity-based
methods can often find clusters of arbitrary shape.
 Clustering space: Many clustering methods search for clusters within the
entire given data space. These methods are useful for low-dimensionality
data sets. With high dimensional data, however, there can be many
irrelevant attributes, which can make similarity measurements
unreliable. Consequently, clusters found in the full space are often
meaningless.

Partitioning methods:

Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the
data into k groups such that each group must contain at least one object. In
other words, partitioning methods conduct one-level partitioning on data sets.
The basic partitioning methods typically adopt exclusive cluster separation.
That is, each object must belong to exactly one group.

Most partitioning methods are distance-based. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It
then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another. The general criterion
of a good partitioning is that objects in the same cluster are “close” or related to
each other, whereas objects in different clusters are “far apart” or very
different.

Achieving global optimality in partitioning-based clustering is often computationally prohibitive, potentially requiring an exhaustive enumeration of
all the possible partitions. Instead, most applications adopt popular heuristic
methods, such as greedy approaches like the k-means and the k-medoids
algorithms, which progressively improve the clustering quality and approach a
local optimum. These heuristic clustering methods work well for finding
spherical-shaped clusters in small- to medium-size databases. To find clusters
with complex shapes and for very large data sets, partitioning-based methods
need to be extended.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be
classified as being either agglomerative or divisive, based on how the
hierarchical decomposition is formed. The agglomerative approach, also
called the bottom-up approach, starts with each object forming a separate
group. It successively merges the objects or groups close to one another, until
all the groups are merged into one (the topmost level of the hierarchy), or a
termination condition holds. The divisive approach, also called the top-down
approach, starts with all the objects in the same cluster. In each successive
iteration, a cluster is split into smaller clusters, until eventually each object is
in one cluster, or a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is
done, it can never be undone. This rigidity is useful in that it leads to smaller
computation costs by not having to worry about a combinatorial number of
different choices.
Density-based methods: Most partitioning methods cluster objects based on
the distance between objects. Such methods can find only spherical-shaped
clusters and encounter difficulty in discovering clusters of arbitrary shapes.
Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing a given cluster as long as the density
(number of objects or data points) in the “neighborhood” exceeds some
threshold. For example, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise or outliers and discover
clusters of arbitrary shape.

Density-based methods can divide a set of objects into multiple exclusive clusters, or a hierarchy of clusters.
These methods are briefly summarized in Figure:

Let D be a data set of n objects to be clustered. An object is described by d variables, where each variable is also called an attribute or a dimension, and therefore may also be referred to as a point in a d-dimensional object space. Objects are represented in bold italic font (e.g., p).
k-Means: A Centroid-Based Technique:

Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck, that is, Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k (i ≠ j). An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intracluster similarity and low intercluster similarity. A centroid-
based partitioning technique uses the centroid of a cluster, Ci, to represent
that cluster. Conceptually, the centroid of a cluster is its center point. The
centroid can be defined in various ways such as by the mean or medoid of the
objects (or points) assigned to the cluster. The difference between an object p ∈
Ci and ci, the representative of the cluster, is measured by dist (p,ci), where
dist(x,y) is the Euclidean distance between two points x and y. The quality of
cluster Ci can be measured by the within cluster variation, which is the sum of
squared error between all objects in Ci and the centroid ci, defined as
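E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)²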

where E is the sum of the squared error for all objects in the data set; p is the
point in space representing a given object; and ci is the centroid of cluster Ci
(both p and ci are multidimensional). In other words, for each object in each
cluster, the distance from the object to its cluster center is squared, and the
distances are summed. This objective function tries to make the resulting k
clusters as compact and as separate as possible.

The k-means algorithm defines the centroid of a cluster as the mean value of
the points within the cluster. It proceeds as follows. First, it randomly selects k
of the objects in D, each of which initially represents a cluster mean or center.
For each of the remaining objects, an object is assigned to the cluster to which
it is the most similar, based on the Euclidean distance between the object and
the cluster mean. The k-means algorithm then iteratively improves the within-
cluster variation. For each cluster, it computes the new mean using the objects
assigned to the cluster in the previous iteration. All the objects are then
reassigned using the updated means as the new cluster centers. The iterations
continue until the assignment is stable, that is, the clusters formed in the
current round are the same as those formed in the previous round.

The k-means partitioning algorithm:

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
Input:

 k: the number of clusters,
 D: a data set containing n objects.
Output: A set of k clusters.
Method:

(1) arbitrarily choose k objects from D as the initial cluster centers;

(2) repeat

(3) (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;

(4) update the cluster means, that is, calculate the mean value of the objects
for each cluster;

(5) until no change;


Figure: Clustering of a set of objects using the k-means method; in (b), cluster centers are updated and objects are reassigned accordingly (the mean of each cluster is marked by a +).
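A minimal Python sketch of this procedure (illustrative only; it assumes NumPy is available and that D is an n-by-d array of numeric objects):

import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step (1): arbitrarily choose k objects as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # step (3): (re)assign each object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step (4): update each cluster mean from the objects assigned to it
        new_centers = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # step (5): stop when the means (and hence the assignment) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers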

The k-means method can be applied only when the mean of a set of objects is
defined. This may not be the case in some applications such as when data with
nominal attributes are involved. The k-modes method is a variant of k-means,
which extends the k-means paradigm to cluster nominal data by replacing the
means of clusters with modes. It uses new dissimilarity measures to deal with
nominal objects and a frequency-based method to update modes of clusters.
The k-means and the k-modes methods can be integrated to cluster data with
mixed numeric and nominal values.

The necessity for users to specify k, the number of clusters, in advance can be
seen as a disadvantage. The k-means method is not suitable for discovering
clusters with nonconvex shapes or clusters of very different size. Moreover, it is
sensitive to noise and outlier data points because a small number of such data
can substantially influence the mean value.

k-Medoids: A Representative Object-Based Technique:

The k-means algorithm is sensitive to outliers because such objects are far
away from the majority of the data, and thus, when assigned to a cluster, they
can dramatically distort the mean value of the cluster.

The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object. That is, an absolute-error criterion is used, defined as
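E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, oi)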
where E is the sum of the absolute error for all objects p in the data set, and oi
is the representative object of Ci. This is the basis for the k-medoids method,
which groups n objects into k clusters by minimizing the absolute error.

The Partitioning Around Medoids (PAM) algorithm is a popular realization of k-medoids clustering. It tackles the problem in an iterative, greedy way. Like
the k-means algorithm, the initial representative objects (called seeds) are
chosen arbitrarily. We consider whether replacing a representative object by a
non representative object would improve the clustering quality. All the possible
replacements are tried out. The iterative process of replacing representative
objects by other objects continues until the quality of the resulting clustering
cannot be improved by any replacement. This quality is measured by a cost
function of the average dissimilarity between an object and the representative
object of its cluster.

Specifically, let o1, ..., ok be the current set of representative objects (i.e., medoids). To determine whether a nonrepresentative object, denoted by
orandom, is a good replacement for a current medoid oj (1≤j≤k), we calculate the
distance from every object p to the closest object in the set
{o1,...,oj−1,orandom,oj+1,...,ok}, and use the distance to update the cost function.

Figure: Four cases of the cost function for k-medoids clustering.

A typical k-medoids partitioning algorithm like PAM works effectively for small
datasets, but does not scale well for large datasets. To deal with larger
datasets, a sampling-based method called CLARA (Clustering LARge
Applications) can be used. Instead of taking the whole data set into
consideration, CLARA uses a random sample of the dataset. The PAM
algorithm is then applied to compute the best medoids from the sample.
Ideally, the sample should closely represent the original data set.

CLARA builds clusterings from multiple random samples and returns the best
clustering as the output. The complexity of computing the medoids on a
random sample is O(ks2+k(n−k)), where s is the size of the sample, k is the
number of clusters, and n is the total number of objects. CLARA can deal with
larger data sets than PAM.

PAM, a k-medoids partitioning algorithm:


Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on
medoid or central objects.

Input:
 k: the number of clusters,
 D: a data set containing n objects.
Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects in D as the initial representative objects or
seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest
representative object;
(4) randomly select a nonrepresentative object, orandom;
(5) compute the total cost, S, of swapping representative object, oj,
with orandom;
(6) if S < 0 then swap oj with orandom to form the new set of k
representative objects;
(7) until no change;
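The swap cost S in step (5) can be computed directly from the absolute-error criterion. A minimal Python sketch, assuming NumPy, a data set D given as an n-by-d array, and medoids given as lists of row indices (the helper name total_cost is illustrative):

import numpy as np

def total_cost(D, medoid_idx):
    # absolute-error criterion: sum of distances from every object to its nearest medoid
    dists = np.linalg.norm(D[:, None, :] - D[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

# step (5): cost change of swapping the current medoid o_j for the candidate o_random
# S = total_cost(D, medoids_after_swap) - total_cost(D, current_medoids)
# step (6): perform the swap only if S < 0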
Hierarchical Methods:
While partitioning methods meet the basic clustering requirement of organizing
a set of objects into a number of exclusive groups, in some situations we may
want to partition our data into groups at different levels such as in a hierarchy.
A hierarchical clustering method works by grouping data objects into a
hierarchy or “tree” of clusters.
Representing data objects in the form of a hierarchy is useful for data
summarization and visualization. For example, as the manager of human
resources at AllElectronics, you may organize your employees into major
groups such as executives, managers, and staff. You can further partition
these groups into smaller subgroups. For instance, the general group of staff
can be further divided into subgroups of senior officers, officers,
and trainees. All these groups form a hierarchy. We can easily summarize or characterize the data that are organized into a hierarchy, which can be used to find, say,
the average salary of managers and of officers.
Hierarchical clustering methods can encounter difficulties regarding the
selection of merge or split points. Such a decision is critical, because once a
group of objects is merged or split, the process at the next step will operate on
the newly generated clusters. It will neither undo what was done previously,
nor perform object swapping between clusters. Thus, merge or split decisions,
if not well chosen, may lead to low-quality clusters. Moreover, the methods do
not scale well because each decision of merge or split needs to examine and
evaluate many objects or clusters.
A promising direction for improving the clustering quality of hierarchical
methods is to integrate hierarchical clustering with other clustering
techniques, resulting in multiple-phase (or multiphase) clustering. We
introduce two such methods, namely BIRCH and Chameleon. BIRCH begins
by partitioning objects hierarchically using tree structures, where the leaf or
low-level nonleaf nodes can be viewed as “microclusters” depending on the
resolution scale. It then applies other clustering algorithms to perform
macroclustering on the microclusters. Chameleon explores dynamic modeling in hierarchical clustering.
There are several orthogonal ways to categorize hierarchical clustering
methods. For instance, they may be categorized into algorithmic methods,
probabilistic methods, and Bayesian methods. Agglomerative, divisive, and
multiphase methods are algorithmic, meaning they consider data objects as
deterministic and compute clusters according to the deterministic distances
between objects.
Agglomerative versus Divisive Hierarchical Clustering:
A hierarchical clustering method can be either agglomerative or divisive,
depending on whether the hierarchical decomposition is formed in a bottom-up
(merging) or top-down (splitting) fashion. Let’s have a closer look at these
strategies.
An agglomerative hierarchical clustering method uses a bottom-up
strategy. It typically starts by letting each object form its own cluster and
iteratively merges clusters into larger and larger clusters, until all the objects
are in a single cluster or certain termination conditions are satisfied. The single
cluster becomes the hierarchy’s root. For the merging step, it finds the two
clusters that are closest to each other (according to some similarity measure),
and combines the two to form one cluster. Because two clusters are merged per
iteration, where each cluster contains at least one object, an agglomerative
method requires at most n iterations.
A divisive hierarchical clustering method employs a top-down
strategy. It starts by placing all objects in one cluster, which is the hierarchy’s
root. It then divides the root cluster into several smaller subclusters, and
recursively partitions those clusters into smaller ones. The partitioning process
continues until each cluster at the lowest level is coherent enough—either
containing only one object, or the objects within a cluster are sufficiently
similar to each other.
In either agglomerative or divisive hierarchical clustering, a user can
specify the desired number of clusters as a termination condition.
Example: Agglomerative versus divisive hierarchical clustering. The figure below shows the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, on a data set of five objects, {a,b,c,d,e}. Initially, AGNES, the agglomerative method, places each object into a cluster of its own. The clusters are then merged step-by-step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters.
Figure: Agglomerative and divisive hierarchical clustering on data objects {a,b,c,d,e}.

This is a single-linkage approach in that each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by
the similarity of the closest pair of data points belonging to different clusters.
The cluster-merging process repeats until all the objects are eventually merged
to form one cluster.
DIANA, the divisive method, proceeds in the contrasting way. All the objects are
used to form one initial cluster. The cluster is split according to some principle
such as the maximum Euclidean distance between the closest neighboring
objects in the cluster. The cluster-splitting process repeats until, eventually,
each new cluster contains only a single object.

Figure: Dendrogram representation for hierarchical clustering of data objects {a,b,c,d,e}.


A tree structure called a dendrogram is commonly used to represent the
process of hierarchical clustering. It shows how objects are grouped together
(in an agglomerative method) or partitioned (in a divisive method) step-by-step.
The figure shows a dendrogram for the five objects presented earlier, where l = 0 shows the five objects as singleton clusters at level 0. At l = 1, objects a and
b are grouped together to form the first cluster, and they stay together at all
subsequent levels. We can also use a vertical axis to show the similarity scale
between clusters. For example, when the similarity of two groups of objects,
{a,b} and {c,d,e}, is roughly 0.16, they are merged together to form a single
cluster.
A challenge with divisive methods is how to partition a large cluster into
several smaller ones. For example, there are 2^(n−1) − 1 possible ways to partition a
set of n objects into two exclusive subsets, where n is the number of objects.
When n is large, it is computationally prohibitive to examine all possibilities.
Distance Measures in Algorithmic Methods:
Whether using an agglomerative method or a divisive method, a core need is to
measure the distance between two clusters, where each cluster is generally a
set of objects.
Four widely used measures for distance between clusters are as follows, where |p − p′| is the distance between two objects or points p and p′; mi is the mean for cluster Ci; and ni is the number of objects in Ci. They are also known as linkage measures.
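In their standard form, these four measures are:
Minimum distance: dmin(Ci, Cj) = min { |p − p′| : p ∈ Ci, p′ ∈ Cj }
Maximum distance: dmax(Ci, Cj) = max { |p − p′| : p ∈ Ci, p′ ∈ Cj }
Mean distance: dmean(Ci, Cj) = |mi − mj|
Average distance: davg(Ci, Cj) = (1 / (ni · nj)) Σ_{p ∈ Ci} Σ_{p′ ∈ Cj} |p − p′|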
When an algorithm uses the minimum distance, dmin(Ci,Cj), to measure
the distance between clusters, it is sometimes called a nearest-neighbor
clustering algorithm. Moreover, if the clustering process is terminated when
the distance between nearest clusters exceeds a user-defined threshold, it is
called a single-linkage algorithm.
When an algorithm uses the maximum distance, dmax(Ci,Cj), to measure
the distance between clusters, it is sometimes called a farthest-neighbor
clustering algorithm. If the clustering process is terminated when the
maximum distance between nearest clusters exceeds a user-defined threshold,
it is called a complete-linkage algorithm.
The previous minimum and maximum measures represent two extremes
in measuring the distance between clusters. They tend to be overly sensitive to
outliers or noisy data. The use of mean or average distance is a compromise
between the minimum and maximum distances and overcomes the outlier
sensitivity problem. Whereas the mean distance is the simplest to compute, the average distance is advantageous in that it can handle categoric as well as numeric data, since the mean vector for categoric data can be difficult or impossible to define.
Example: Single versus complete linkages. Let us apply hierarchical clustering
to the data set of Figure (a). Figure (b) shows the dendrogram using single
linkage. Figure (c) shows the case using complete linkage, where the edges
between clusters {A,B,J,H} and {C,D,G,F,E} are omitted for ease of presentation.
This example shows that by using single linkages we can find hierarchical
clusters defined by local proximity, whereas complete linkage tends to find
clusters opting for global closeness.

Figure: Hierarchical clustering using single and complete linkages.


Density-Based Methods:
Partitioning and hierarchical methods are designed to find spherical-
shaped clusters. They have difficulty finding clusters of arbitrary shape such as
the “S” shape and oval clusters in Figure. Given such data, they would likely
inaccurately identify convex regions, where noise or outliers are included in the
clusters.
To find clusters of arbitrary shape, alternatively, we can model clusters
as dense regions in the data space, separated by sparse regions. This is the
main strategy behind density-based clustering methods, which can discover
clusters of nonspherical shape.
DBSCAN: Density-Based Clustering Based on Connected
Regions with High Density:
The density of an object o can be measured by the number of objects
close to o. DBSCAN (Density-Based Spatial Clustering of Applications with
Noise) finds core objects, that is, objects that have dense neighborhoods. It connects core objects and their neighborhoods to form dense regions as clusters.
A user-specified parameter ε > 0 is used to specify the radius of the neighborhood we consider for every object. The ε-neighborhood of an object o is the space within a radius ε centered at o. Due to the fixed neighborhood size parameterized by ε, the density of a neighborhood can be measured simply by the number of objects in the neighborhood. To determine whether a neighborhood is dense or not, DBSCAN uses another user-specified parameter, MinPts, which specifies the density threshold of dense regions. An object is a core object if the ε-neighborhood of the object contains at least MinPts objects. Core objects are the pillars of dense regions.

Figure: Clusters of arbitrary shape.


Algorithm: DBSCAN: a density-based clustering algorithm.
Input:
 D: a data set containing n objects,
 ε: the radius parameter, and
 MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
(1) mark all objects as unvisited;
(2) do
(3) randomly select an unvisited object p;
(4) mark p as visited;
(5) if the ε-neighborhood of p has at least MinPts objects
(6) create a new cluster C, and add p to C;
(7) let N be the set of objects in the ε-neighborhood of p;
(8) for each point p′ in N
(9) if p′ is unvisited
(10) mark p′ as visited;
(11) if the ε-neighborhood of p′ has at least MinPts points, add those points to N;
(12) if p′ is not yet a member of any cluster, add p′ to C;
(13) endfor
(14) output C;
(15) else mark p as noise;
(16) until no object is unvisited;
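As a usage illustration only, a minimal sketch assuming scikit-learn is installed (eps plays the role of ε and min_samples the role of MinPts; the sample points are made up):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],   # one dense region
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],   # another dense region
              [9.0, 0.0]])                          # an isolated point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
# labels assigns a cluster id to each object; -1 marks noise (here, the isolated point)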
Evaluation of Clustering:
Cluster evaluation assesses the feasibility of clustering analysis on a data set
and the quality of the results generated by a clustering method. The major
tasks of clustering evaluation include the following:
 Assessing clustering tendency. In this task, for a given data set, we
assess whether a nonrandom structure exists in the data. Blindly
applying a clustering method on a data set will return clusters; however,
the clusters mined may be misleading. Clustering analysis on a data set
is meaningful only when there is a nonrandom structure in the data.
 Determining the number of clusters in a data set. A few algorithms,
such as k-means, require the number of clusters in a data set as the
parameter. Moreover, the number of clusters can be regarded as an
interesting and important summary statistic of a data set. Therefore, it is
desirable to estimate this number even before a clustering algorithm is
used to derive detailed clusters.
 Measuring clustering quality. After applying a clustering method on a
data set, we want to assess how good the resulting clusters are. A
number of measures can be used. Some methods measure how well the
clusters fit the data set, while others measure how well the clusters
match the ground truth, if such truth is available. There are also
measures that score clusterings and thus can compare two sets of
clustering results on the same data set.
Measuring Clustering Quality:
Suppose you have assessed the clustering tendency of a given data set. You
may have also tried to predetermine the number of clusters in the set. You can
now apply one or multiple clustering methods to obtain clusterings of the data
set.
We have a few methods to choose from for measuring the quality of a
clustering. In general, these methods can be categorized into two groups
according to whether ground truth is available. Here, ground truth is the ideal
clustering that is often built using human experts.
If ground truth is available, it can be used by extrinsic methods, which compare the clustering against the ground truth and measure how well the clustering matches it. If the ground truth is unavailable, we can use intrinsic methods, which evaluate the
goodness of a clustering by considering how well the clusters are separated.
Ground truth can be considered as supervision in the form of “cluster labels.”
Hence, extrinsic methods are also known as supervised methods, while
intrinsic methods are unsupervised methods.
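One widely used intrinsic measure is the silhouette coefficient. A minimal sketch, assuming scikit-learn and a small illustrative numeric data set X:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])   # illustrative data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)   # near 1: well separated; near -1: poorly clustered
print(score)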

Short Questions

1. Define Clustering?
2. Explain the meaning of cluster analysis?
3. What are the applications of clustering?
4. State K-Means method?
5. Define Outlier Detection?
6. State the K-medoids algorithm?
7. Give the limitations of K-means algorithm.
8. Discuss the requirements of clustering in data mining.
9. State hierarchical method?
10. Differentiate agglomerative and divisive hierarchical clustering?
11. Compare and contrast K-means and K-medoids.
12. What are the key issues in hierarchical clustering?
13. Give the strengths and weakness of K-means algorithm.
14. Explain a technique to evaluate the quality of a cluster.

Essay Questions

1. How to assess the quality of a cluster?
2. Give the limitations of K-means algorithm.
3. How can you say that K-medoids clustering is better than K-means algorithm.
4. Describe the different types of hierarchical methods?
5. Explain about semi-supervised cluster analysis?
6. Discuss about the density-based outlier detection?
7. What is the goal of clustering? How can the goal be achieved? Explain using a clustering algorithm.
8. Write algorithms for k-means and k-medoids? Explain?
9. Explain the categories of major clustering methods?
10. Discuss the efficiency of k-medoids algorithm on large data sets.
11. How does the k-means algorithm work? Explain with an example.
12. State K-Means algorithm. Apply K-means algorithm with two iterations to form two
clusters by taking initial cluster centers as subjects 1 and subject 4.
13. What are outliers? Discuss the method adopted for outlier detection?
14. What is a distance-based outlier? What are efficient algorithms for mining distance-based outliers? How are outliers determined in this method?
15. Differentiate between AGNES and DIANA algorithms.
16. What is outlier detection? Explain distance based outlier detection.
17. Write the Partitioning Around Medoids (PAM) algorithm.
18. State K-means algorithm. Apply k-means algorithm with two iterations to form two
clusters by taking initial centroid as Subject 1 and 4 respectively.

Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
Outlier Analysis:
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
An outlier is not simply noise or an error. Instead, outliers are suspected of not being generated by the same mechanism as the rest of the data objects.
Outliers are of three types, namely
 Global (or Point) Outliers
 Collective Outliers
 Contextual (or Conditional) Outliers
Global Outliers:
They are also known as Point Outliers. These are the simplest form of outliers. If, in a
given dataset, a data point strongly deviates from all the rest of the data points, it is known
as a global outlier. Most outlier detection methods are aimed at finding global outliers.
For example, in an intrusion detection system, if a large number of packets are broadcast in a very short span of time, this may be considered a global outlier, and we can say that the particular system has potentially been hacked.

Collective Outliers
As the name suggests, if in a given dataset, some of the data points, as a whole, deviate
significantly from the rest of the dataset, they may be termed as collective outliers. Here,
the individual data objects may not be outliers, but when seen as a whole, they may
behave as outliers. To detect these types of outliers, we might need background
information about the relationship between those data objects showing the behavior of
outliers.
For example: In an intrusion detection system, a DoS (denial-of-service) packet sent from one computer to another may be considered normal behavior. However, if this happens with several computers at the same time, it may be considered abnormal behavior, and as a whole the packets can be termed collective outliers.

Contextual Outliers
They are also known as Conditional Outliers. If, in a given dataset, a data object deviates significantly from the other data points only in a specific context or condition, it is a contextual outlier. A data point may be an outlier under a certain condition and may show normal
behavior under another condition. Therefore, a context has to be specified as part of the
problem statement in order to identify contextual outliers. Contextual outlier analysis
provides flexibility for users where one can examine outliers in different contexts, which
can be highly desirable in many applications.
For example: A temperature reading of 40°C may behave as an outlier in the context of a
“winter season” but will behave like a normal data point in the context of a “summer
season”.
