Overview of Clustering:: UNIT-5
Overview of Clustering:
Clustering is the task of dividing a population or set of data points into groups such that data
points in the same group are more similar to each other than to data points in other groups. In
other words, it groups objects on the basis of the similarity and dissimilarity between them.
Why Clustering?
Clustering is important because it reveals the intrinsic grouping in unlabelled data. There is no
single criterion for a good clustering; it depends on the user and on which criteria satisfy their
particular need.
• Marketing: It can be used to characterize & discover customer segments for marketing
purposes.
• Biology: It can be used for classification among different species of plants and animals.
• Libraries: It is used in clustering different books on the basis of topics and information.
• Insurance: It is used to group customers and their policies and to help identify fraud.
K-means-
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into
different clusters. It is an iterative algorithm that divides the dataset into K clusters in such a
way that each data point belongs to only one group of points with similar properties. Here K
defines the number of pre-defined clusters that need to be created in the process: if K=2 there
will be two clusters, for K=3 there will be three clusters, and so on.
Use Cases:
Some specific applications of k-means are image processing, medical, and customer
segmentation.
→Image Processing
Video is one example of the growing volumes of unstructured data being collected. Within
each frame of a video, k-means analysis can be used to identify objects in the video. For
each frame, the task is to determine which pixels are most similar to each other. The
attributes of each pixel can include brightness, color, and location, the x and y coordinates
in the frame. With security video images, for example, successive frames are examined to
identify any changes to the clusters. These newly identified clusters may indicate unauthorized
access or other unexpected activity in the scene.
→Medical
Patient attributes such as age, height, weight, systolic and diastolic blood pressures,
cholesterol level, and other attributes can identify naturally occurring clusters. These
clusters could be used to target individuals for specific preventive measures or clinical
trial participation. Clustering, in general, is useful in biology for the classification of plants
and animals.
→Customer Segmentation
Marketing and sales groups use k-means to better identify customers who have similar
behaviors and spending patterns. For example, a wireless provider may look at the
following customer attributes: monthly bill, number of text messages, data volume
consumed, minutes used during various daily periods, and years as a customer. The
wireless company could then look at the naturally occurring clusters and consider tactics
to increase sales or reduce the customer churn rate, the proportion of customers who end their
relationship with the company.
Step-1: Select the number K to decide how many clusters are to be formed.
Step-2: Select K random points as centroids. (They can be points other than those in the input
dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step-3, i.e. reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go back to Step-4; otherwise the clusters are final and the
model is ready.
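The steps above can be illustrated with a minimal R sketch that implements the assign-and-update
loop directly on a small synthetic dataset (the data, the value of K, and the fixed number of
iterations are illustrative assumptions, not part of the original notes):

# Minimal sketch of the K-means assign/update loop on synthetic data
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)      # 50 synthetic 2-D points
k <- 3
centroids <- x[sample(nrow(x), k), ]   # Step 2: random initial centroids

for (i in 1:10) {                      # repeat Steps 3-5 a fixed number of times
  # Step 3: distance of every point to every centroid, then assign to the closest
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)
  # Step 4: recompute each centroid as the mean of the points assigned to it
  # (empty clusters are not handled in this sketch)
  centroids <- t(sapply(1:k, function(j) colMeans(x[cluster == j, , drop = FALSE])))
}
table(cluster)                         # cluster sizes after the final assignment

In practice the built-in kmeans() function, used in the iris example below, performs these same
steps with better initialisation and a proper convergence check.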
The Dataset
The Iris dataset consists of 50 samples from each of 3 species of Iris (Iris setosa, Iris
virginica, Iris versicolor). It is a multivariate dataset introduced by the British statistician
and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic
problems". Four features were measured from each sample, i.e. the length and width of the sepals
and petals; based on the combination of these four features, Fisher developed a linear
discriminant model to distinguish the species from each other.
# Loading data
data(iris)
# Structure
str(iris)
# Installing Packages
install.packages("ClusterR")
install.packages("cluster")
# Loading package
library(ClusterR)
library(cluster)
# Removing the Species label so that only the
# four numeric features are used for clustering
iris_1 <- iris[, -5]

# Fitting the K-Means clustering model to the data (K = 3)
set.seed(240)   # setting seed for reproducibility
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re

# Confusion Matrix
cm <- table(iris$Species, kmeans.re$cluster)
cm

# Plotting cluster centers
plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster,
     main = "K-means with 3 clusters")
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)   # pch is the symbol, cex its size

## Visualizing clusters
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")], y_kmeans,
         lines = 0, shade = TRUE, color = TRUE, labels = 2,
         plotchar = FALSE, span = TRUE, main = paste("Cluster iris"),
         xlab = 'Sepal.Length', ylab = 'Sepal.Width')
Output:
• Model kmeans_re:
Three clusters of sizes 50, 62, and 38 are formed. The ratio of the between-cluster sum of
squares to the total sum of squares is 88.4%.
• Cluster identification:
The cluster assigned to each observation is stored in kmeans.re$cluster and can be
compared with the actual species labels to judge how well the clustering recovers the
known classes.
• Confusion Matrix:
The confusion matrix compares the actual species with the assigned clusters and shows
how closely the three clusters match the three species.
• Plotting cluster centers:
In the plot, the cluster centers are marked with cross symbols in the same colour as their
cluster.
• Plot of clusters:
Three clusters are formed with varying sepal length and sepal width. The K-Means clustering
algorithm is simple and efficient, which is why it is widely used in industry.
Classification:
It is a supervised learning approach in which machines are trained using well-labelled training
data and, on the basis of that data, predict the output. Labelled data means that the input data
is already tagged with the correct output.
Classification is the process of categorizing a given set of data into classes. It can be
performed on both structured and unstructured data. The process starts with predicting the class
of given data points. The classes are often referred to as targets, labels, or categories, such
as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
Classification algorithms are used in many different places; popular use cases include email
spam detection, sentiment analysis, and medical diagnosis.
Decision Trees-
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.
• It is simple to understand, as it follows the same process that a human follows while
making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other algorithms.
• Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
• Depth: The depth of a node is the minimum number of steps required to reach the node
from the root.
• Decision Stump: The simplest short tree is called a decision stump, which is a decision
tree with the root immediately connected to the leaf nodes. A decision stump makes a
prediction based on the value of just a single input variable.
Example:
Suppose there is a candidate who has a job offer and wants to decide whether he should accept
the offer or not. To solve this problem, the decision tree starts with the root node (the Salary
attribute, chosen by an attribute selection measure). The root node splits further into the next
decision node (distance from the office) and one leaf node based on the corresponding labels.
The next decision node further splits into one decision node (cab facility) and one leaf node.
Finally, that decision node splits into two leaf nodes (Accept offer and Decline offer).
Multiple algorithms exist to implement decision trees, and the methods of tree construction
vary with different algorithms. Some popular algorithms include ID3, C4.5, and CART.
→ID3 Algorithm:
ID3 (Iterative Dichotomiser 3) is one of the earliest and most widely used decision tree
algorithms. It builds the tree top-down, at each step choosing the attribute that gives the
highest information gain.
→C4.5:
C4.5 is given a set of data representing records that are already classified.
The decision trees generated by the C4.5 algorithm can then be used to classify new records,
which is the main reason C4.5 is also known as a statistical classifier.
→CART:
As the name (Classification and Regression Trees) suggests, the CART algorithm is used to
generate both classification and regression decision trees.
1. Select the root node (S) using an attribute selection measure such as the Gini index or the
highest information gain.
2. On each iteration, the algorithm calculates the Gini index and information gain for the
attributes that have not been used yet.
3. The attribute with the lowest Gini index (or the highest information gain) is chosen to split
the current node.
4. The dataset is divided into subsets according to the values of the selected attribute.
5. The algorithm continues to recurse on each subset, considering only attributes that have not
been used before, until the decision tree is complete.
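To make the impurity measures used in these steps concrete, here is a small worked example in R
with made-up class counts (9 "Yes" and 5 "No" records split on a hypothetical binary attribute);
the functions and numbers are illustrative only:

# Gini index and entropy as functions of a vector of class proportions
gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))

parent <- c(Yes = 9, No = 5)           # class counts before the split
left   <- c(Yes = 6, No = 2)           # counts where the attribute is TRUE
right  <- c(Yes = 3, No = 3)           # counts where the attribute is FALSE

p_parent <- parent / sum(parent)
p_left   <- left / sum(left)
p_right  <- right / sum(right)

# Weight each child node by the fraction of records it receives
w <- c(sum(left), sum(right)) / sum(parent)
gini_split <- w[1] * gini(p_left) + w[2] * gini(p_right)
info_gain  <- entropy(p_parent) - (w[1] * entropy(p_left) + w[2] * entropy(p_right))

gini_split   # weighted Gini index of the split: lower is better (used by CART)
info_gain    # information gain of the split: higher is better (used by ID3/C4.5)

The algorithm evaluates every candidate attribute this way and splits on the one with the lowest
weighted Gini index or the highest information gain.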
Decision trees use greedy algorithms, in that they always choose the option that seems the
best available at that moment. At each step, the algorithm selects which attribute to use for
splitting the remaining records. This selection may not be the best overall, but it is
guaranteed to be the best at that step. This characteristic reinforces the efficiency of
decision trees. However, once a bad split is taken, it is propagated through the rest of the
tree.
First, evaluate whether the splits of the tree make sense. Conduct sanity checks by validating
the decision rules with domain experts and determine if the decision rules are sound.
Next, look at the depth and nodes of the tree. Having too many layers and obtaining nodes with
few members might be signs of overfitting. In overfitting, the model fits the training set well,
but it performs poorly on the new samples in the testing set.
For decision tree learning, overfitting can be caused by either the lack of training data or the
biased data in the training set. Two approaches can help avoid overfitting in decision tree
learning.
• Stop growing the tree early, before it reaches the point where all the training data is
perfectly classified.
• Grow the full tree, and then post-prune it with methods such as reduced-error pruning and
rule-based post-pruning.
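Both approaches can be sketched with the rpart package (a different package from the one used in
the R example later in this unit; the dataset and parameter values here are illustrative
assumptions):

library(rpart)

# Pre-pruning: stop growing early through the control parameters
fit_early <- rpart(Species ~ ., data = iris,
                   control = rpart.control(maxdepth = 3, minsplit = 20, cp = 0.01))

# Post-pruning: grow a deliberately large tree, then prune it back using the
# complexity parameter with the lowest cross-validated error
fit_full   <- rpart(Species ~ ., data = iris,
                    control = rpart.control(cp = 0.0001, minsplit = 2))
best_cp    <- fit_full$cptable[which.min(fit_full$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit_full, cp = best_cp)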
In the corresponding overfitting plot, the x-axis represents the amount of data and the y-axis
represents the error. The blue curve is the training set, and the red curve is the testing set.
To the left of the gray vertical line the model predicts well on the testing set, but to the
right of the gray line the model performs worse and worse on the testing set.
Decision trees are computationally inexpensive, and it is easy to classify the data.
Decision trees are able to handle both numerical and categorical attributes and are robust to
redundant or correlated variables.
Decision trees are not a good choice if the dataset contains many irrelevant variables.
Although decision trees are able to handle correlated variables, decision trees are not well
suited when most of the variables in the training set are correlated, since overfitting is likely to
occur.
For binary decisions, a decision tree works better if the training dataset consists of records with
an even probability of each result. In other words, the root of the tree has a 50% chance of
either classification.
When using methods such as logistic regression on a dataset with many variables, decision
trees can help determine which variables are the most useful to select based on
information gain. Then these variables can be selected for the logistic regression.
Decision Tree in R:
The data set used is the “readingSkills” dataset that ships with the party package.
library(datasets)
library(caTools)
library(party)
library(dplyr)
library(magrittr)
# Loading the readingSkills data and splitting it
data("readingSkills")
sample_data <- sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)
train_data  <- subset(readingSkills, sample_data == TRUE)
test_data   <- subset(readingSkills, sample_data == FALSE)

# Creation of the decision tree model
model <- ctree(nativeSpeaker ~ ., data = train_data)
plot(model)
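As a possible next step (not part of the original code), the fitted tree can be checked against
the held-out test set; the variable names follow the snippet above:

# Predicting the test set and measuring accuracy
pred <- predict(model, newdata = test_data)
table(Predicted = pred, Actual = test_data$nativeSpeaker)
mean(pred == test_data$nativeSpeaker)   # proportion of correct predictions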
Bayes' Theorem:
Bayes' theorem gives the probability of an event based on prior knowledge of conditions related
to that event. In the notation used later in this unit it can be written as
P(A|B) = P(B|A) * P(A) / P(B)
where P(A|B) is the posterior probability of A given B, P(B|A) is the likelihood, P(A) is the
prior probability of A, and P(B) is the probability of the evidence B.
Example:
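A simple worked illustration with made-up numbers: suppose 20% of emails are spam, the word
"free" appears in 60% of spam emails and in 5% of non-spam emails, and we want the probability
that an email containing "free" is spam.
P(Spam) = 0.20, P(Not Spam) = 0.80
P(free|Spam) = 0.60, P(free|Not Spam) = 0.05
P(free) = 0.60 * 0.20 + 0.05 * 0.80 = 0.12 + 0.04 = 0.16
P(Spam|free) = P(free|Spam) * P(Spam) / P(free) = 0.12 / 0.16 = 0.75
So, under these assumed numbers, an email containing the word "free" has a 75% chance of being
spam.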
Applications of Bayes’ Theorem
In the real world, there are plenty of applications of Bayes' Theorem. Some of them are given
below:
• It can also be used as a building block and starting point for more complex
methodologies, For example, The popular Bayesian networks.
• Used in classification problems and other probability-related questions.
• Bayesian inference, a particular approach to statistical inference.
• In genetics, Bayes’ theorem can be used to calculate the probability of an individual
having a specific genotype.
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and
used for solving classification problems.
The Naïve Bayes Classifier is one of the simplest and most effective classification algorithms
and helps in building fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis, and
classifying articles.
The name Naïve Bayes consists of two words, Naïve and Bayes, which can be described as follows:
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of the other features. For example, if a fruit is identified on
the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognised as an
apple; each feature individually contributes to identifying it as an apple, without depending on
the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
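Under this independence assumption, the posterior probability of a class is proportional to the
class prior multiplied by the per-feature conditional probabilities. Using the fruit example
above (an illustration, not a formula from the original notes):
P(apple | red, spherical, sweet) ∝ P(red|apple) * P(spherical|apple) * P(sweet|apple) * P(apple)
and the classifier predicts the class for which this product is largest.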
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play on a particular day according to
the weather conditions. To solve this problem, we follow the steps below:
Problem: If the weather is sunny, should the player play or not?
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny), where P(Sunny) = 0.35 and P(Yes) = 0.71.
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny), where P(Sunny|No) = 2/4 = 0.5, P(No) = 0.29 and
P(Sunny) = 0.35, so P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41.
The class with the larger posterior probability is chosen. In the original example P(Yes|Sunny)
works out to be larger than P(No|Sunny), so the player can play on a sunny day.
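The arithmetic above can be checked with a few lines of R; note that P(Sunny|Yes) is not listed
in the notes, so the value 0.3 used here is an assumption for illustration:

# Posterior probabilities for the weather example (values from the text,
# except p_sunny_yes, which is an assumed likelihood)
p_sunny     <- 0.35
p_yes       <- 0.71
p_no        <- 0.29
p_sunny_yes <- 0.3
p_sunny_no  <- 0.5

p_yes_sunny <- p_sunny_yes * p_yes / p_sunny
p_no_sunny  <- p_sunny_no  * p_no  / p_sunny
c(Yes = p_yes_sunny, No = p_no_sunny)   # the class with the larger value is predicted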
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a
dataset.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.
• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
Smoothing:
If one of the attribute values never appears with one of the class labels within the training
set, the corresponding conditional probability P(a|c) will equal zero. When this happens, the
posterior probability obtained by multiplying all the conditional probabilities immediately
becomes zero, regardless of how large some of the other conditional probabilities are.
Therefore, overfitting occurs. Smoothing techniques can be employed to adjust these
probabilities and to ensure a nonzero value.
A smoothing technique assigns a small nonzero probability to rare events not included in the
training dataset. Also, the smoothing addresses the possibility of taking the logarithm of zero
that may occur.
There are various smoothing techniques. Among them is the Laplace smoothing (or add-one)
technique that pretends to see every outcome once more than it actually appears.
One problem with Laplace smoothing is that it may assign too much probability to unseen events.
To address this, Laplace smoothing can be generalised so that, instead of adding one, a smaller
value ε in [0, 1] is added to each count; the smoothed estimate becomes
P(x) = (count(x) + ε) / Σx (count(x) + ε).
Smoothing techniques are available in most standard software packages for naïve Bayes
classifiers. However, if for some reason (like performance concerns) the naïve Bayes classifier
needs to be coded directly into an application, the smoothing and logarithm calculations should
be incorporated into the implementation.
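For example, the naiveBayes() function in the e1071 package (the package used in the R code
below) exposes Laplace smoothing through its laplace argument; the tiny data frame here is made
up purely to show the effect:

library(e1071)

# A categorical attribute where one level never occurs with one class
d <- data.frame(outlook = factor(c("sunny", "sunny", "rain", "rain", "overcast")),
                play    = factor(c("no", "no", "yes", "yes", "yes")))

m0 <- naiveBayes(play ~ outlook, data = d)               # contains zero probabilities
m1 <- naiveBayes(play ~ outlook, data = d, laplace = 1)  # smoothed: all nonzero
m1$tables$outlook                                        # conditional probability table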
Naïve Bayes in R:
# Loading data
data(iris)
# Structure
str(iris)
# Installing Packages
install.packages("e1071")
install.packages("caTools")
install.packages("caret")
# Loading package
library(e1071)
library(caTools)
library(caret)
# Splitting the data into training and test sets
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl  <- subset(iris, split == FALSE)

# Feature Scaling of the numeric columns
train_scale <- scale(train_cl[, 1:4])
test_scale  <- scale(test_cl[, 1:4])

# Fitting the Naive Bayes model to the training dataset
set.seed(120)   # setting seed for reproducibility
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)
classifier_cl
Output:
• Model classifier_cl:
The conditional probabilities for each feature (variable) are created separately by the
model. The a-priori probabilities are also calculated, which indicate the class
distribution of our data.
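As a possible next step (not shown in the original notes), the fitted model can be used to
predict the test set, and confusionMatrix() from the caret package summarises the result; the
variable names follow the snippet above:

# Predicting on the test data
y_pred <- predict(classifier_cl, newdata = test_cl)

# Confusion matrix and accuracy on the test set
confusionMatrix(y_pred, test_cl$Species)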