Unit 3 Updated Notes
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIT 3 – UNSUPERVISED LEARNING
Clustering
Clustering is the task of dividing a population or set of data points into groups such that data points in the same group are more similar to one another than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Let’s understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.
Now that we understand what clustering is, let’s take a look at the types of clustering.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For example, in the above example each customer is put into exactly one of the 10 groups.
Soft Clustering: In soft clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that data point belonging to each cluster is assigned. For example, in the above scenario each customer is assigned a probability of belonging to each of the 10 clusters of the retail store.
Since the task of clustering is subjective, the means that can be used for achieving this goal are plentiful. Every methodology follows a different set of rules for defining the ‘similarity’ among data points. In fact, more than 100 clustering algorithms are known, but only a few are used widely. Let’s look at them in detail:
Connectivity models: As the name suggests, these models are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther away. These models can follow two approaches. In the first approach, they start by treating every data point as a separate cluster and then aggregate clusters as the distance decreases. In the second approach, all data points are placed in a single cluster, which is then partitioned as the distance increases. The choice of distance function is also subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.
Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters. The K-Means clustering algorithm is a popular algorithm that falls into this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.
Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, a Gaussian/normal distribution). These models often suffer from overfitting. A popular example of these models is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
Density Models: These models search the data space for regions of varying density of data points. They isolate the different density regions and assign the data points within each such region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
Now I will be taking you through two of the most popular clustering algorithms in detail – K Means
clustering and Hierarchical clustering. Let’s begin.
K Means Clustering
K-means is an iterative clustering algorithm that converges to a local optimum of the within-cluster sum of squared distances. The algorithm works in the following steps:
1. Specify the desired number of clusters K: let us choose K = 2 for these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster: let’s assign three points to cluster 1, shown using red color, and two points to cluster 2, shown using grey color.
3. Compute cluster centroids: the centroid of the data points in the red cluster is shown using a red cross and that of the grey cluster using a grey cross.
4. Re-assign each point to the closest cluster centroid: note that the data point at the bottom was assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
5. Re-compute cluster centroids: now we re-compute the centroids for both clusters.
6. Repeat steps 4 and 5 until no improvements are possible: we keep repeating the 4th and 5th steps until the assignments converge to a local optimum, i.e., until there is no further switching of data points between the two clusters for two successive iterations. This marks the termination of the algorithm, unless a maximum number of iterations is specified explicitly. A minimal code sketch of these steps is shown below.
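The following is a minimal NumPy sketch of the steps above, not production code; the toy array X, the random seed, and K = 2 are illustrative assumptions rather than part of the notes.

```python
import numpy as np

def k_means(X, k=2, max_iter=100, seed=0):
    """Minimal K-means following the steps above: random initial assignment,
    then alternate centroid computation and re-assignment until convergence."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly assign each point to a cluster (shuffled so no cluster starts empty)
    labels = rng.permutation(np.arange(len(X)) % k)
    for _ in range(max_iter):
        # Steps 3 and 5: compute the centroid of every cluster
        # (if a cluster ever becomes empty, re-seed it with a random point)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Step 4: re-assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop when no point switches cluster any more (a local optimum)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Toy example: 5 points in 2-D space with K = 2, as in the walkthrough above
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)
```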
Density-Based Clustering - Background
Eps: Eps (ε) is the radius that defines the neighborhood of a point; NEps(k) denotes the set of points within distance Eps of point k.
MinPts: MinPts refers to the minimum number of points required in the Eps neighborhood of a point for that point to be a core point.
Directly density reachable:
A point i is considered directly density reachable from a point k with respect to Eps and MinPts if
i belongs to NEps(k), and
k is a core point, i.e., NEps(k) contains at least MinPts points.
Density reachable:
A point i is density reachable from a point j with respect to Eps and MinPts if there is a chain of points i1, …, in with i1 = j and in = i such that each point ik+1 is directly density reachable from ik.
Density connected:
A point i is density connected to a point j with respect to Eps and MinPts if there is a point o such that both i and j are density reachable from o with respect to Eps and MinPts.
Suppose a set of objects is denoted by D'. We can say that an object i is directly density reachable from an object j only if it is located within the ε neighborhood of j and j is a core object.
An object i is density reachable from an object j with respect to ε and MinPts in a given set of objects D' only if there is a chain of objects i1, …, in with i1 = j and in = i such that each ik+1 is directly density reachable from ik with respect to ε and MinPts.
An object i is density connected to an object j with respect to ε and MinPts in a given set of objects D' only if there is an object o belonging to D' such that both i and j are density reachable from o with respect to ε and MinPts.
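To make these definitions concrete, here is a small NumPy sketch; the sample points and the Eps/MinPts values are arbitrary illustrations, not taken from any particular dataset.

```python
import numpy as np

def eps_neighborhood(X, idx, eps):
    """Indices of all points within distance Eps of point idx, i.e. N_Eps(idx)."""
    return np.where(np.linalg.norm(X - X[idx], axis=1) <= eps)[0]

def is_core(X, idx, eps, min_pts):
    """A point is a core point if its Eps neighborhood contains at least MinPts points."""
    return len(eps_neighborhood(X, idx, eps)) >= min_pts

def directly_density_reachable(X, i, k, eps, min_pts):
    """i is directly density reachable from k iff i lies in N_Eps(k) and k is a core point."""
    return i in eps_neighborhood(X, k, eps) and is_core(X, k, eps, min_pts)

# Arbitrary toy points: three close together and one far away
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [3.0, 3.0]])
print(directly_density_reachable(X, i=1, k=0, eps=0.5, min_pts=3))  # True
print(directly_density_reachable(X, i=3, k=0, eps=0.5, min_pts=3))  # False (too far away)
```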
Major Features of Density-Based Clustering
o It is a single-scan method (the database is scanned only once).
o It requires density parameters (such as Eps and MinPts) as a termination condition.
o It can handle noise in the data while forming clusters.
o Density-based clustering can discover clusters of arbitrary shape.
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on a density-based notion of a cluster and can discover clusters of arbitrary shape in spatial databases that contain noise (outliers).
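As an illustration, a short scikit-learn sketch is given below; the two-moons dataset and the eps/min_samples values are illustrative choices, not prescribed by the notes.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters with a non-spherical shape and a little noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighborhood radius (Eps) and min_samples plays the role of MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))  # cluster ids found; the label -1 marks points treated as noise
```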
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an ordering of the database with respect to its density-based clustering structure. This cluster ordering contains information equivalent to the density-based clusterings obtained for a broad range of parameter settings. OPTICS is useful for both automatic and interactive cluster analysis, including determining an intrinsic clustering structure.
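A brief scikit-learn sketch of OPTICS follows; the blob data and the min_samples/max_eps values are illustrative assumptions.

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Blobs of different densities, so no single eps value suits every cluster equally well
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.3, 0.8, 1.5], random_state=0)

opt = OPTICS(min_samples=10, max_eps=2.0).fit(X)

# The reachability values, taken in the computed ordering, describe the
# density-based clustering structure over a whole range of parameter settings
print(opt.labels_[:10])
print(opt.reachability_[opt.ordering_][:10])
```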
DENCLUE
DENCLUE (DENsity-based CLUstEring) models the overall density of the data as the sum of influence (kernel) functions centred at the data points; clusters are then identified by density attractors, i.e., local maxima of this estimated density function.
Hierarchical Clustering
Hierarchical clustering, as the name suggests is an algorithm that builds
hierarchy of clusters. This algorithm starts with all the data points assigned to a
cluster of their own. Then two nearest clusters are merged into the same cluster.
In the end, this algorithm terminates when there is only a single cluster left.
At the bottom, we start with 25 data points, each assigned to a separate cluster. The two closest clusters are then merged repeatedly until just one cluster remains at the top. The height in the dendrogram at which two clusters are merged represents the distance between those two clusters in the data space.
The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice of the number of clusters will be 4, as the red horizontal line in the dendrogram below covers the maximum vertical distance AB.
Dendrogram Distance
Two important things that you should know about hierarchical clustering are:
This algorithm has been implemented above using a bottom-up (agglomerative) approach. It is also possible to follow a top-down (divisive) approach, starting with all data points assigned to the same cluster and recursively performing splits until each data point is assigned a separate cluster.
The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters; common choices include single linkage (minimum pairwise distance), complete linkage (maximum pairwise distance), average linkage, and Ward’s criterion (minimum increase in within-cluster variance). A small code sketch follows below.
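The sketch below uses SciPy to build and cut such a hierarchy; the 25 random 2-D points and the choice of Ward linkage are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))          # 25 toy points, as in the dendrogram example above

# Bottom-up (agglomerative) clustering; "ward" is one possible closeness metric
Z = linkage(X, method="ward")

# dendrogram(Z) would draw the tree described above (requires matplotlib);
# cutting the tree gives flat cluster labels for a chosen number of clusters:
labels = fcluster(Z, t=4, criterion="maxclust")   # ask for 4 clusters
print(labels)
```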
Applications of Clustering
Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:
Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection
Partitioning methods
K-means clustering is a partitioning method and as anticipated, this method decomposes a
dataset into a set of disjoint clusters. Given a dataset, a partitioning method constructs several
partitions of this data, with each partition representing a cluster. In this type, the dataset is
divided into a set of k groups, where K is used to define the number of pre-defined groups. The
cluster center is created in such a way that the distance between the data points of one cluster is
minimum as compared to another cluster centroid. Partitional clustering decomposes a data set
into a set of disjoint clusters. Given a data set of N points, a partitioning method constructs K
(N ≥ K) partitions of the data, with each partition representing a cluster.
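The following minimal scikit-learn sketch shows such a partitioning; the synthetic blob data and K = 10 (echoing the earlier 10-customer-group example) are purely illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# N = 500 synthetic points partitioned into K = 10 disjoint clusters (K is fixed up front)
X, _ = make_blobs(n_samples=500, centers=10, random_state=7)

km = KMeans(n_clusters=10, n_init=10, random_state=7).fit(X)

print(km.labels_[:20])               # each point belongs to exactly one cluster (hard clustering)
print(km.cluster_centers_.shape)     # (10, 2): one centroid per cluster
print(km.inertia_)                   # within-cluster sum of squared distances being minimized
```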
CLIQUE, which defines a grid- and density-based approach for clustering in high-dimensional data space.
STING is a grid-based multiresolution clustering method in which the spatial area is divided into rectangular cells. There are generally several levels of such rectangular cells corresponding to multiple levels of resolution, and these cells form a hierarchical structure: each cell at a higher level is partitioned to form several cells at the next lower level. Statistical data regarding the attributes in each grid cell (including the mean, maximum, and minimum values) is precomputed and stored.
Statistical parameters of higher-level cells can easily be calculated from the parameters of the lower-level cells. These parameters include the attribute-independent parameter count and the attribute-dependent parameters mean, stdev (standard deviation), min (minimum), and max (maximum), as well as the type of distribution that the attribute values in the cell follow, such as normal, uniform, exponential, or none (if the distribution is unknown).
When the records are loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-level cells are computed directly from the records. The value of distribution can be assigned by the user if the distribution type is known beforehand, or obtained by hypothesis tests such as the χ2 test.
The distribution type of a higher-level cell can be computed from the majority of the distribution types of its corresponding lower-level cells in conjunction with a threshold filtering procedure. If the distributions of the lower-level cells disagree with each other and fail the threshold test, the distribution type of the higher-level cell is set to none.
The statistical parameters can be used in a top-down, grid-based approach as follows. First, a layer within the hierarchical structure is selected from which the query-answering procedure is to start. This layer generally includes a small number of cells. For every cell in the current layer, we compute the confidence interval (or estimated range of probability) reflecting the cell’s relevance to the given query. A toy sketch of precomputing and rolling up per-cell statistics is shown below.
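This is not an implementation of STING itself, but a toy pandas sketch of the underlying idea: precompute count/mean/stdev/min/max per bottom-level grid cell and roll them up to higher-level cells. The random data, grid size, and column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy spatial records: 2-D coordinates plus one numeric attribute
df = pd.DataFrame({
    "x": rng.uniform(0, 100, 1000),
    "y": rng.uniform(0, 100, 1000),
    "value": rng.normal(50, 10, 1000),
})

# Bottom level: divide the spatial area into 10 x 10 rectangular cells
df["cell_x"] = (df["x"] // 10).astype(int)
df["cell_y"] = (df["y"] // 10).astype(int)

# Precompute the statistical parameters of each bottom-level cell directly from the records
bottom = df.groupby(["cell_x", "cell_y"])["value"].agg(
    ["count", "mean", "std", "min", "max"]).reset_index()

# Higher level: each parent cell covers a 2 x 2 block of bottom-level cells.
# Its parameters are derived from the children: counts add up, min/max combine
# directly, and the mean is a count-weighted average of the child means.
bottom["parent_x"] = bottom["cell_x"] // 2
bottom["parent_y"] = bottom["cell_y"] // 2
bottom["weighted_sum"] = bottom["mean"] * bottom["count"]
grouped = bottom.groupby(["parent_x", "parent_y"])
higher = pd.DataFrame({
    "count": grouped["count"].sum(),
    "mean": grouped["weighted_sum"].sum() / grouped["count"].sum(),
    "min": grouped["min"].min(),
    "max": grouped["max"].max(),
})
print(higher.head())
```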
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-
dimensional space into a low-dimensional space so that the low-dimensional representation
retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
Working in high-dimensional spaces can be undesirable for many reasons; raw data are
often sparse as a consequence of the curse of dimensionality, and analyzing the data is
usually computationally intractable (hard to control or deal with). Dimensionality reduction is
common in fields that deal with large numbers of observations and/or large numbers of
variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.
Methods are commonly divided into linear and nonlinear approaches. Approaches can also be
divided into feature selection and feature extraction. Dimensionality reduction can be used
for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate
other analyses.
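As a quick illustration of the feature selection vs. feature extraction distinction mentioned above, the sketch below uses scikit-learn on the iris dataset; the dataset and the choice of two components are assumptions for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)           # 150 samples, 4 features

# Feature selection: keep a subset of the original columns (here the 2 best by ANOVA F-score)
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as linear combinations of all original ones
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # both (150, 2), reached in very different ways
```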
LDA (linear discriminant analysis) is closely related to analysis of variance (ANOVA) and regression analysis, which also
attempt to express one dependent variable as a linear combination of other features or
measurements.[1][2] However, ANOVA uses categorical independent variables and
a continuous dependent variable, whereas discriminant analysis has continuous independent
variables and a categorical dependent variable (i.e. the class label).[3] Logistic
regression and probit regression are more similar to LDA than ANOVA is, as they also explain
a categorical variable by the values of continuous independent variables. These other methods
are preferable in applications where it is not reasonable to assume that the independent
variables are normally distributed, which is a fundamental assumption of the LDA method.
LDA is also closely related to principal component analysis (PCA) and factor analysis in that
they both look for linear combinations of variables which best explain the data. [4] LDA
explicitly attempts to model the difference between the classes of data. PCA, in contrast, does
not take into account any difference in class, and factor analysis builds the feature
combinations based on differences rather than similarities. Discriminant analysis is also
different from factor analysis in that it is not an interdependence technique: a distinction
between independent variables and dependent variables (also called criterion variables) must
be made.
LDA works when the measurements made on independent variables for each observation are
continuous quantities. When dealing with categorical independent variables, the equivalent
technique is discriminant correspondence analysis. [5][6]
Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each
case must have a score on one or more quantitative predictor measures, and a score on a group
measure.[7] In simple terms, discriminant function analysis is classification - the act of
distributing things into groups, classes or categories of the same type.
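A short scikit-learn sketch contrasting LDA (which uses the known class labels) with PCA (which ignores them); the iris dataset and the two components are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# LDA: supervised, finds directions that best separate the known classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# PCA: unsupervised, finds directions of maximum overall variance, ignoring class labels
X_pca = PCA(n_components=2).fit_transform(X)

print(X_lda.shape, X_pca.shape)   # both (150, 2)
print(lda.predict(X[:5]))         # LDA also acts directly as a classifier
```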
Principal component analysis (PCA) is a popular technique for analyzing large datasets
containing a high number of dimensions/features per observation, increasing the
interpretability of data while preserving the maximum amount of information, and enabling the
visualization of multidimensional data. Formally, PCA is a statistical technique for reducing
the dimensionality of a dataset. This is accomplished by linearly transforming the data into a
new coordinate system where (most of) the variation in the data can be described with fewer
dimensions than the initial data. Many studies use the first two principal components in order
to plot the data in two dimensions and to visually identify clusters of closely related data
points. Principal component analysis has applications in many fields such as population
genetics, microbiome studies, and atmospheric science.
The principal components of a collection of points in a real coordinate space are a sequence
of p unit vectors, where the ith vector is the direction of a line that best fits the data while
being orthogonal to the first i-1 vectors. Here, a best-fitting line is defined as one that
minimizes the average squared perpendicular distance from the points to the line. These
directions constitute an orthonormal basis in which different individual dimensions of the data
are linearly uncorrelated. Principal component analysis is the process of computing the
principal components and using them to perform a change of basis on the data, sometimes
using only the first few principal components and ignoring the rest.
In data analysis, the first principal component of a set of p variables, presumed to be jointly
normally distributed, is the derived variable formed as a linear combination of the original
variables that explains the most variance. The second principal component explains the most
variance in what is left once the effect of the first component is removed, and we may proceed
through p iterations until all the variance is explained. PCA is most commonly used when
many of the variables are highly correlated with each other and it is desirable to reduce their
number to an independent set.
PCA is used in exploratory data analysis and for making predictive models. It is commonly
used for dimensionality reduction by projecting each data point onto only the first few
principal components to obtain lower-dimensional data while preserving as much of the data's
variation as possible. The first principal component can equivalently be defined as a direction
that maximizes the variance of the projected data. The i-th principal component can be taken as
a direction orthogonal to the first i-1 principal components that maximizes the variance of the
projected data.
For either objective, it can be shown that the principal components are eigenvectors of the
data's covariance matrix. Thus, the principal components are often computed by
eigendecomposition of the data covariance matrix or singular value decomposition of the data
matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely
related to factor analysis. Factor analysis typically incorporates more domain specific
assumptions about the underlying structure and solves eigenvectors of a slightly different
matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate
systems that optimally describe the cross-covariance between two datasets while PCA defines
a new orthogonal coordinate system that optimally describes variance in a single dataset.
Robust and L1-norm-based variants of standard PCA have also been proposed.
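The sketch below computes principal components via eigendecomposition of the covariance matrix using NumPy; the random, artificially correlated data is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples with 5 features, two of which are strongly correlated
X = rng.normal(size=(200, 5))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]

# Centre the data, then form the covariance matrix of the features
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# Eigendecomposition (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)

# Sort components by decreasing eigenvalue, i.e. by explained variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Change of basis: keep only the first two principal components
X_proj = Xc @ eigvecs[:, :2]

print(eigvals / eigvals.sum())   # fraction of variance explained by each component
print(X_proj.shape)              # (200, 2)
```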
quality, or in signal processing. The technique has successfully been
applied across a wide range of compression problems in pattern
recognition (specifically face recognition), image recognition, and
more.
3. Simplify complex business decisions. PCA has been employed to
simplify traditionally complex business decisions. For example,
traders use over 300 financial instruments to manage portfolios. The
algorithm has proven successful in the risk management of interest
rate derivative portfolios, lowering the number of financial
instruments from more than 300 to just 3-4 principal components.
4. Clarify convoluted scientific processes. The algorithm has been applied extensively in understanding the convoluted and multidirectional factors that increase the probability of neural ensembles triggering action potentials.
Let’s take a closer look at the first method - eigendecomposition of the covariance matrix - to gain a deeper appreciation of PCA. There are several steps in computing PCA:
1. Standardize (or at least centre) the data so that each feature has zero mean.
2. Compute the covariance matrix of the features.
3. Compute the eigenvectors and eigenvalues of the covariance matrix.
4. Sort the eigenvectors by decreasing eigenvalue and select the first few components.
5. Project the data onto the selected eigenvectors to obtain the reduced representation.
Keep in mind that the majority of data scientists will not calculate PCA by hand, but rather implement it in Python with ScikitLearn, or use R to compute it. These mathematical foundations enrich our understanding of PCA but are not necessary for its implementation. Understanding PCA allows us to have a better idea of its advantages and disadvantages.
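For completeness, here is the kind of scikit-learn usage the text alludes to; the random data stands in for a real dataset and the choice of three components is arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))            # stand-in for a real dataset

# Standardize first: PCA is sensitive to the scale of the features
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)      # variance captured by each component
print(X_reduced.shape)                    # (100, 3)
```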
Advantages of PCA: it removes correlated features, reduces overfitting, speeds up the training of downstream models, and makes high-dimensional data easier to visualize.
Disadvantages of PCA: the principal components are less interpretable than the original features, the data must be standardized before applying PCA, and some information is inevitably lost when components are discarded.
To start PCA on the right foot, you will need to have the right tools that help you collect data from multiple sources and prepare it for machine learning models. Keboola covers all the steps, so you won't have to think about the infrastructure, only about the added value your machine learning models will bring.
Impute missing values with a close approximation (e.g. the mean of the column).