Data Mining - UNIT-IV
Types Of Data Used In Cluster Analysis Are:
Interval-Scaled variables
Binary variables
Nominal, Ordinal, and Ratio variables
Variables of mixed types
Main memory-based clustering algorithms typically operate on one of the
following two data structures:
Data Matrix (or object-by-variable structure)
Dissimilarity Matrix (or object-by-object structure)
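As a minimal illustration (a NumPy sketch with made-up values; Euclidean distance is just one possible choice of dissimilarity), the data matrix stores the n objects as rows over p variables, while the dissimilarity matrix stores the pairwise dissimilarities d(i, j) between the n objects:

```python
import numpy as np

# Data matrix (object-by-variable structure): n objects as rows, p variables as columns
X = np.array([[1.0, 1.0],
              [1.5, 2.0],
              [3.0, 4.0]])

# Dissimilarity matrix (object-by-object structure): entry (i, j) holds d(i, j)
n = X.shape[0]
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D[i, j] = np.linalg.norm(X[i] - X[j])   # Euclidean distance as d(i, j)

print(D)   # symmetric, with zeros on the diagonal
```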
Interval-Scaled Variables:
Interval-scaled variables are continuous measurements of a roughly linear
scale.
Typical examples include weight and height, latitude and longitude coordinates
(e.g., when clustering houses), and weather temperature.
The measurement unit used can affect the clustering analysis. For example,
changing measurement units from meters to inches for height, or from
kilograms to pounds for weight, may lead to a very different clustering
structure.
In general, expressing a variable in smaller units will lead to a larger range for
that variable, and thus a larger effect on the resulting clustering structure.
To help avoid dependence on the choice of measurement units, the data should
be standardized. Standardizing measurements attempts to give all variables an
equal weight.
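Standardization converts the raw measurements to unit-free values. One common choice in cluster analysis is to compute, for each variable f, the mean absolute deviation s_f = (1/n)(|x_1f - m_f| + ... + |x_nf - m_f|), where m_f is the mean of f, and then replace each value by the z-score z_if = (x_if - m_f) / s_f. A minimal Python sketch (illustrative values only):

```python
import numpy as np

def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute
    deviation, which is less sensitive to outliers than the standard deviation."""
    m_f = values.mean()                   # mean of variable f
    s_f = np.abs(values - m_f).mean()     # mean absolute deviation of f
    return (values - m_f) / s_f           # z-scores

heights_m = np.array([1.50, 1.60, 1.75, 1.85])
print(standardize(heights_m))
# Converting to another unit (metres -> inches) no longer changes the result:
print(standardize(heights_m * 39.37))
```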
Binary Variables:
A binary variable is a variable that can take only two values.
For example, a gender variable generally takes the two values male and female.
Contingency Table for Binary Data
Let us consider binary values 0 and 1
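For two objects i and j described by p binary variables, the contingency table counts agreements and disagreements:

              object j
                1      0
object i   1    q      r
           0    s      t

Here q is the number of variables that equal 1 for both objects, t the number that equal 0 for both, and r and s the numbers of mismatches (q + r + s + t = p). For symmetric binary variables the dissimilarity is d(i, j) = (r + s) / (q + r + s + t); for asymmetric binary variables, where the 0-0 matches carry no information, it is d(i, j) = (r + s) / (q + r + s), the complement of the Jaccard coefficient.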
Nominal or Categorical Variables:
A generalization of the binary variable in that it can take more than 2 states,
e.g., red, yellow, blue, green.
Method 1: Simple matching
The dissimilarity between two objects i and j can be computed based on the
simple matching.
m: the number of matches (i.e., the number of variables for which i and j are in
the same state).
p: the total number of variables.
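With m and p defined this way, the simple matching dissimilarity can be written as
d(i, j) = (p - m) / p,
so two objects that agree on every nominal variable have dissimilarity 0, and objects that agree on none have dissimilarity 1.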
Ordinal Variables:
An ordinal variable resembles a nominal variable, except that its M_f states are ordered in a meaningful sequence. The value of variable f for the i-th object is first replaced by its rank r_if in {1, ..., M_f}. The range of each variable is then mapped onto [0, 1] by replacing the rank of the i-th object in the f-th variable by
z_if = (r_if - 1) / (M_f - 1),
after which the z_if values can be treated like interval-scaled values.
Ratio-Scaled Variables:
Ratio-scaled variable: a positive measurement on a nonlinear scale,
approximately at an exponential scale, such as Ae^(Bt) or Ae^(-Bt).
Methods:
Treat them like interval-scaled variables: usually not a good choice, because
the exponential scale distorts the distances.
Apply a logarithmic transformation, i.e., y = log(x), and then treat the
transformed values as interval-scaled.
Treat them as continuous ordinal data and treat their ranks as
interval-scaled.
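A minimal sketch of the logarithmic-transformation option (illustrative values; NumPy assumed): the ratio-scaled measurements are mapped through y = log(x), and the transformed values are then treated as interval-scaled:

```python
import numpy as np

# Ratio-scaled measurements that grow roughly exponentially (illustrative)
x = np.array([10.0, 100.0, 1000.0, 10000.0])

# Logarithmic transformation: y = log(x)
y = np.log(x)

# y can now be treated as interval-scaled, e.g. in pairwise distances
print(np.abs(y[:, None] - y[None, :]))
```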
Variables Of Mixed Types:
A database may contain all six types of variables:
symmetric binary, asymmetric binary, nominal, ordinal, interval-scaled, and ratio-scaled.
When objects are described by a combination of these, the variables are called mixed-type variables.
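A standard way to combine these different variable types into a single dissimilarity is to bring each per-variable dissimilarity d_ij^(f) onto a [0, 1] scale and average:
d(i, j) = ( Σ_f δ_ij^(f) d_ij^(f) ) / ( Σ_f δ_ij^(f) ),
where the indicator δ_ij^(f) is 0 if the value of variable f is missing for object i or j (or if f is an asymmetric binary variable and both values are 0), and 1 otherwise.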
Cluster Analysis:
Clustering is the process of grouping a set of data objects into multiple groups
or clusters so that objects within a cluster have high similarity, but are very
dissimilar to objects in other clusters. Dissimilarities and similarities are
assessed based on the attribute values describing the objects and often involve
distance measures.
Clustering as a data mining tool has its roots in many application areas such
as biology, security, business intelligence, and Web search. Clustering
techniques can be organized into the following categories: partitioning methods,
hierarchical methods, and density-based methods.
Because a cluster is a collection of data objects that are similar to one another
within the cluster and dissimilar to objects in other clusters, a cluster of data
objects can be treated as an implicit class. In this sense, clustering is
sometimes called automatic classification. Again, a critical difference here is
that clustering can automatically find the groupings. This is a distinct
advantage of cluster analysis.
Partitioning methods:
Given a set of n objects, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k ≤ n. Most partitioning
methods are distance based: starting from an initial partitioning, they use an
iterative relocation technique that tries to improve the partitioning by moving
objects from one cluster to another (k-means and k-medoids, discussed below,
are typical examples).
Hierarchical methods:
A hierarchical method creates a hierarchical decomposition of the given set of
data objects. Hierarchical methods suffer from the fact that once a step (merge
or split) is done, it can never be undone. This rigidity is useful in that it leads
to smaller computation costs by not having to worry about a combinatorial
number of different choices.
Density-based methods: Most partitioning methods cluster objects based on
the distance between objects. Such methods can find only spherical-shaped
clusters and encounter difficulty in discovering clusters of arbitrary shapes.
Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing a given cluster as long as the density
(number of objects or data points) in the “neighborhood” exceeds some
threshold. For example, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise or outliers and discover
clusters of arbitrary shape.
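A minimal sketch of the density test described above (a DBSCAN-style neighborhood check; the names eps and min_pts and their values are assumptions for illustration):

```python
import numpy as np

def is_dense(points, i, eps=0.5, min_pts=4):
    """Return True if the eps-neighborhood of points[i] contains at least
    min_pts points (counting the point itself), i.e. the density threshold
    for continuing to grow a cluster is met."""
    dists = np.linalg.norm(points - points[i], axis=1)
    return np.count_nonzero(dists <= eps) >= min_pts

pts = np.random.default_rng(0).random((100, 2))   # illustrative 2-D points
print(is_dense(pts, 0))
```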
The k-Means Method:
The quality of a partitioning of a data set D into clusters C1, ..., Ck is
measured by the within-cluster variation, that is, the sum of squared errors
E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)^2,
where E is the sum of the squared error for all objects in the data set; p is the
point in space representing a given object; and ci is the centroid of cluster Ci
(both p and ci are multidimensional). In other words, for each object in each
cluster, the distance from the object to its cluster center is squared, and the
distances are summed. This objective function tries to make the resulting k
clusters as compact and as separate as possible.
The k-means algorithm defines the centroid of a cluster as the mean value of
the points within the cluster. It proceeds as follows. First, it randomly selects k
of the objects in D, each of which initially represents a cluster mean or center.
For each of the remaining objects, an object is assigned to the cluster to which
it is the most similar, based on the Euclidean distance between the object and
the cluster mean. The k-means algorithm then iteratively improves the within-
cluster variation. For each cluster, it computes the new mean using the objects
assigned to the cluster in the previous iteration. All the objects are then
reassigned using the updated means as the new cluster centers. The iterations
continue until the assignment is stable, that is, the clusters formed in the
current round are the same as those formed in the previous round.
Algorithm: k-means.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects
for each cluster;
(5) until no change;
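A minimal NumPy sketch of these steps (random initial centers, reassignment by Euclidean distance, mean update, stop when the assignment is stable; the points are illustrative):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # (3) (re)assign each object to the most similar cluster,
        #     based on the Euclidean distance to the cluster mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # (5) assignment is stable: stop
        labels = new_labels
        # (4) update the cluster means from the objects assigned to each cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
labels, centers = k_means(X, k=2)
print(labels)
print(centers)
```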
The k-means method can be applied only when the mean of a set of objects is
defined. This may not be the case in some applications such as when data with
nominal attributes are involved. The k-modes method is a variant of k-means,
which extends the k-means paradigm to cluster nominal data by replacing the
means of clusters with modes. It uses new dissimilarity measures to deal with
nominal objects and a frequency-based method to update modes of clusters.
The k-means and the k-modes methods can be integrated to cluster data with
mixed numeric and nominal values.
The necessity for users to specify k, the number of clusters, in advance can be
seen as a disadvantage. The k-means method is not suitable for discovering
clusters with nonconvex shapes or clusters of very different size. Moreover, it is
sensitive to noise and outlier data points because a small number of such data
can substantially influence the mean value.
The k-means algorithm is sensitive to outliers because such objects are far
away from the majority of the data, and thus, when assigned to a cluster, they
can dramatically distort the mean value of the cluster.
A typical k-medoids partitioning algorithm like PAM works effectively for small
datasets, but does not scale well for large datasets. To deal with larger
datasets, a sampling-based method called CLARA (Clustering LARge
Applications) can be used. Instead of taking the whole data set into
consideration, CLARA uses a random sample of the dataset. The PAM
algorithm is then applied to compute the best medoids from the sample.
Ideally, the sample should closely represent the original data set.
CLARA builds clusterings from multiple random samples and returns the best
clustering as the output. The complexity of computing the medoids on a
random sample is O(ks^2 + k(n - k)), where s is the size of the sample, k is the
number of clusters, and n is the total number of objects. CLARA can deal with
larger data sets than PAM.
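A minimal sketch of the CLARA idea (k_medoids_fn stands for any routine that returns k medoids for a sample, for example a PAM-style routine such as the sketch given after the algorithm listing below; the sample sizes and counts are illustrative):

```python
import numpy as np

def total_cost(X, medoids):
    """Sum, over all objects in X, of the distance to the nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
    return d.min(axis=1).sum()

def clara(X, k, k_medoids_fn, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # draw a random sample of the data set
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        medoids = k_medoids_fn(X[idx], k)      # best medoids of the sample
        cost = total_cost(X, medoids)          # evaluate on the whole data set
        if cost < best_cost:                   # keep the best clustering seen
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```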
Algorithm: k-medoids (PAM).
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or
seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest
representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object, o_j,
with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k
representative objects;
(7) until no change;
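A minimal NumPy sketch of this swap-based loop (a simplified PAM-style routine, not the full PAM algorithm: it tries random single swaps and keeps a swap only when it lowers the total cost):

```python
import numpy as np

def pam_medoids(X, k, max_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects as the initial representative objects
    medoid_idx = rng.choice(len(X), size=k, replace=False)

    def cost(idx):
        # total distance of every object to its nearest representative object
        d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2)
        return d.min(axis=1).sum()

    current = cost(medoid_idx)
    for _ in range(max_trials):
        # (4) randomly select a nonrepresentative object o_random
        o_random = rng.integers(len(X))
        if o_random in medoid_idx:
            continue
        j = rng.integers(k)                    # representative object o_j to replace
        candidate = medoid_idx.copy()
        candidate[j] = o_random
        # (5)-(6) swap o_j with o_random only if the total cost decreases (S < 0)
        new_cost = cost(candidate)
        if new_cost < current:
            medoid_idx, current = candidate, new_cost
    return X[medoid_idx]

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
print(pam_medoids(X, k=2))
```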
Hierarchical Methods:
While partitioning methods meet the basic clustering requirement of organizing
a set of objects into a number of exclusive groups, in some situations we may
want to partition our data into groups at different levels such as in a hierarchy.
A hierarchical clustering method works by grouping data objects into a
hierarchy or “tree” of clusters.
Representing data objects in the form of a hierarchy is useful for data
summarization and visualization. For example, as the manager of human
resources at AllElectronics, you may organize your employees into major
groups such as executives, managers, and staff. You can further partition
these groups into smaller subgroups. For instance, the general group of staff
can be further divided into subgroups of senior officers, officers, and trainees.
All these groups form a hierarchy. We can easily summarize or characterize
the data that are organized into a hierarchy, which can be used to find, say,
the average salary of managers and of officers.
Hierarchical clustering methods can encounter difficulties regarding the
selection of merge or split points. Such a decision is critical, because once a
group of objects is merged or split, the process at the next step will operate on
the newly generated clusters. It will neither undo what was done previously,
nor perform object swapping between clusters. Thus, merge or split decisions,
if not well chosen, may lead to low-quality clusters. Moreover, the methods do
not scale well because each decision of merge or split needs to examine and
evaluate many objects or clusters.
A promising direction for improving the clustering quality of hierarchical
methods is to integrate hierarchical clustering with other clustering
techniques, resulting in multiple-phase (or multiphase) clustering. We
introduce two such methods, namely BIRCH and Chameleon. BIRCH begins
by partitioning objects hierarchically using tree structures, where the leaf or
low-level nonleaf nodes can be viewed as “microclusters” depending on the
resolution scale. It then applies other clustering algorithms to perform
macroclustering on the microclusters. Chameleon explores dynamic modeling
in hierarchical clustering.
There are several orthogonal ways to categorize hierarchical clustering
methods. For instance, they may be categorized into algorithmic methods,
probabilistic methods, and Bayesian methods. Agglomerative, divisive, and
multiphase methods are algorithmic, meaning they consider data objects as
deterministic and compute clusters according to the deterministic distances
between objects.
Agglomerative versus Divisive Hierarchical Clustering:
A hierarchical clustering method can be either agglomerative or divisive,
depending on whether the hierarchical decomposition is formed in a bottom-up
(merging) or topdown (splitting) fashion. Let’s have a closer look at these
strategies.
An agglomerative hierarchical clustering method uses a bottom-up
strategy. It typically starts by letting each object form its own cluster and
iteratively merges clusters into larger and larger clusters, until all the objects
are in a single cluster or certain termination conditions are satisfied. The single
cluster becomes the hierarchy’s root. For the merging step, it finds the two
clusters that are closest to each other (according to some similarity measure),
and combines the two to form one cluster. Because two clusters are merged per
iteration, where each cluster contains at least one object, an agglomerative
method requires at most n iterations.
A divisive hierarchical clustering method employs a top-down
strategy. It starts by placing all objects in one cluster, which is the hierarchy’s
root. It then divides the root cluster into several smaller subclusters, and
recursively partitions those clusters into smaller ones. The partitioning process
continues until each cluster at the lowest level is coherent enough—either
containing only one object, or the objects within a cluster are sufficiently
similar to each other.
In either agglomerative or divisive hierarchical clustering, a user can
specify the desired number of clusters as a termination condition.
Example: Agglomerative versus divisive hierarchical clustering. The figure
below shows the application of AGNES (AGglomerative NESting), an
agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis),
a divisive hierarchical clustering method, on a data set of five objects,
{a, b, c, d, e}. Initially, AGNES, the agglomerative method, places each object
into a cluster of its own. The clusters are then merged step-by-step according
to some criterion. For example, clusters C1 and C2 may be merged if an object
in C1 and an object in C2 form the minimum Euclidean distance between any
two objects from different clusters.
Figure: Agglomerative and divisive hierarchical clustering on data objects {a,b,c,d,e}.
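A minimal sketch of the agglomerative (AGNES-style) idea with single linkage, where the distance between two clusters is the minimum Euclidean distance between any pair of objects drawn from the two clusters; clusters are merged one pair per step until the requested number of clusters remains (the points are illustrative):

```python
import numpy as np

def single_link_agnes(X, n_clusters=2):
    # start by letting each object form its own cluster
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (0, 1, np.inf)
        # find the two clusters whose closest members are nearest to each other
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]   # merge the closest pair of clusters
        del clusters[b]
    return clusters

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
print(single_link_agnes(X, n_clusters=2))   # two groups of object indices
```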
Short Questions
1. Define Clustering?
2. Explain the meaning of cluster analysis?
3. What are the applications of clustering?
4. State K-Means method?
5. Define Outlier Detection?
6. State the K-medoids algorithm?
7. Give the limitations of K-means algorithm.
8. Discuss the requirements of clustering in data mining.
9. State hierarchical method?
10. Differentiate agglomerative and divisive hierarchical clustering?
11. Compare and contrast K-means and K-medoids.
12. What are the key issues in hierarchical clustering?
13. Give the strengths and weakness of K-means algorithm.
14. Explain a technique to evaluate the quality of a cluster.
Essay Questions
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
Outlier Analysis:
An outlier is a data object that deviates significantly from the rest of the data
objects and behaves in a different manner. Outliers can be caused by
measurement or execution errors. The analysis of outlier data is referred to as
outlier analysis or outlier mining.
An outlier should not simply be treated as noise or error. Instead, outliers are
suspected of not being generated by the same mechanism as the rest of the
data objects.
Outliers are of three types, namely
Global (or Point) Outliers
Collective Outliers
Contextual (or Conditional) Outliers
Global Outliers:
They are also known as point outliers and are the simplest form of outliers. If, in a
given data set, a data point deviates strongly from all the rest of the data points, it is
known as a global outlier. Most outlier detection methods are aimed at finding global
outliers.
For example, in an intrusion detection system, if a very large number of packets is
broadcast in a very short span of time, this may be considered a global outlier, and we
can suspect that the system in question has potentially been hacked.
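One simple way to flag such global outliers (a plain z-score rule; the threshold of 3 and the traffic counts are illustrative assumptions, not from the text):

```python
import numpy as np

def global_outliers(values, threshold=3.0):
    """Flag values whose z-score relative to the whole data set exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Illustrative packet counts per second; the sudden burst is flagged as a global outlier
packets_per_second = np.array([12, 15, 11, 14, 13, 12, 16, 15, 14, 13, 900])
print(global_outliers(packets_per_second))
```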
Collective Outliers
As the name suggests, if in a given data set some of the data points, taken as a whole,
deviate significantly from the rest of the data set, they may be termed collective outliers.
Here the individual data objects may not be outliers, but taken together they behave as
outliers. To detect these types of outliers, we might need background information about
the relationships among the data objects showing the outlying behavior.
For example, in an intrusion detection system, a denial-of-service (DoS) packet sent from
one computer to another may be considered normal behavior. However, if this happens
with several computers at the same time, it may be considered abnormal behavior, and
taken as a whole these events can be termed collective outliers.
Contextual Outliers
They are also known as conditional outliers. Here, a data object is an outlier if it deviates
significantly from the other data points with respect to a specific context or condition
only. A data point may be an outlier under one condition and may show normal behavior
under another condition. Therefore, a context has to be specified as part of the problem
statement in order to identify contextual outliers. Contextual outlier analysis provides
flexibility for users, who can examine outliers in different contexts, which can be highly
desirable in many applications.
For example: A temperature reading of 40°C may behave as an outlier in the context of a
“winter season” but will behave like a normal data point in the context of a “summer
season”.
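A minimal sketch of the contextual idea (the season is the context; the readings and the threshold of 2 are illustrative assumptions): each reading is judged against the statistics of its own context rather than against all readings:

```python
import numpy as np

def contextual_outliers(values, threshold=2.0):
    """Flag values that deviate strongly from the mean of their own context."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Illustrative temperature readings (deg C) grouped by context (the season)
readings = {
    "winter": np.array([2.0, -1.0, 4.0, 3.0, 40.0, 1.0, 0.0]),
    "summer": np.array([35.0, 38.0, 40.0, 37.0, 39.0, 36.0]),
}

for season, temps in readings.items():
    print(season, temps[contextual_outliers(temps)])
# 40 deg C is flagged in the winter context but not in the summer context
```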