Overview of Clustering:: UNIT-5


UNIT-5

Overview of Clustering:

Clustering is a type of unsupervised learning method. An unsupervised learning method is a
method in which we draw inferences from datasets consisting of input data without labeled
responses.

Clustering is the task of dividing the population or data points into a number of groups such
that data points in the same group are more similar to each other than to the data points in
other groups. In short, it groups objects on the basis of the similarity and dissimilarity
between them.

Why Clustering?

Clustering is important because it determines the intrinsic grouping among the unlabelled
data present. There is no universal criterion for a good clustering. It depends on the user and
on which criteria satisfy their needs.

Applications of Clustering in different fields

• Marketing: It can be used to characterize and discover customer segments for marketing
purposes.
• Biology: It can be used for classification among different species of plants and animals.
• Libraries: It is used to cluster different books on the basis of topics and information.
• Insurance: It is used to understand customers and their policies and to identify fraud.

K-means-

K-Means clustering is an unsupervised learning algorithm. It is an iterative algorithm that
divides the unlabeled dataset into K different clusters in such a way that each data point
belongs to only one group of points with similar properties. Here K defines the number of
pre-defined clusters that need to be created in the process: if K=2, there will be two
clusters, for K=3 there will be three clusters, and so on.

Use Cases:

Some specific applications of k-means are image processing, medical, and customer
segmentation.

→Image Processing

Video is one example of the growing volumes of unstructured data being collected. Within
each frame of a video, k-means analysis can be used to identify objects in the video. For
each frame, the task is to determine which pixels are most similar to each other. The
attributes of each pixel can include brightness, color, and location, the x and y coordinates
in the frame. With security video images, for example, successive frames are examined to
identify any changes to the clusters. These newly identified clusters may indicate
unauthorized access to a facility.

→Medical

Patient attributes such as age, height, weight, systolic and diastolic blood pressures,
cholesterol level, and other attributes can identify naturally occurring clusters. These
clusters could be used to target individuals for specific preventive measures or clinical
trial participation. Clustering, in general, is useful in biology for the classification of
plants and animals as well as in the field of human genetics.

→Customer Segmentation

Marketing and sales groups use k-means to better identify customers who have similar
behaviors and spending patterns. For example, a wireless provider may look at the
following customer attributes: monthly bill, number of text messages, data volume
consumed, minutes used during various daily periods, and years as a customer. The
wireless company could then look at the naturally occurring clusters and consider tactics
to increase sales or reduce the customer churn rate, the proportion of customers who end
their relationship with a particular company.

Overview of the Method:

The working of the K-Means algorithm is explained in the steps below:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (They need not be points from the
input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the mean of the data points in each cluster and place the new centroid of
that cluster there.

Step-5: Repeat the third step, which means reassigning each data point to the new closest
centroid.

Step-6: If any reassignment occurs, then go to Step-4; otherwise go to FINISH.

Step-7: The model is ready.
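
The steps above can be sketched in a few lines of base R. This is only an illustrative sketch, not the implementation behind R's built-in kmeans(): the function name simple_kmeans, the toy data, and the choice to draw the starting centroids from the data rows are all assumptions made for the example.

```r
# Minimal k-means sketch following Steps 1-7; simple_kmeans is a hypothetical name.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 2: pick k distinct rows of the data as the starting centroids (an assumption)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of the nearest centroid
    # Step 6: stop as soon as no point changes its cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 4: move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Toy data: two well-separated groups of 20 points each (hypothetical)
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the result depends on the random starting centroids, practical implementations usually run several random starts and keep the best one, which is what the nstart argument of kmeans() does.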


Divide the following data into two different clusters
Perform K-means Analysis using R:

The Dataset

The Iris dataset consists of 50 samples from each of 3 species of Iris (Iris setosa, Iris
virginica, Iris versicolor). It is a multivariate dataset introduced by the British statistician
and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic
problems". Four features were measured from each sample, i.e. the length and width of the
sepals and petals, and based on the combination of these four features, Fisher developed a
linear discriminant model to distinguish the species from each other.

# Loading data
data(iris)

# Structure
str(iris)

Performing K-Means Clustering on Dataset


Using the K-Means clustering algorithm on the iris dataset, which includes 150 observations and
5 variables (four numeric measurements plus the Species label)

# Installing Packages
install.packages("ClusterR")
install.packages("cluster")

# Loading package
library(ClusterR)
library(cluster)

# Removing the Species label from the original dataset
iris_1 <- iris[, -5]

# Fitting K-Means clustering model to the dataset
set.seed(240) # Setting seed
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re

# Cluster identification for each observation
kmeans.re$cluster

# Confusion Matrix
cm <- table(iris$Species, kmeans.re$cluster)
cm

# Model Evaluation and visualization
plot(iris_1[c("Sepal.Length", "Sepal.Width")])
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster)
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster,
     main = "K-means with 3 clusters")

## Plotting cluster centers
kmeans.re$centers
kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]
# cex scales the symbol size, pch selects the plotting symbol
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)

## Visualizing clusters
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")], y_kmeans,
         lines = 0, shade = TRUE, color = TRUE, labels = 2,
         plotchar = FALSE, span = TRUE, main = paste("Cluster iris"),
         xlab = 'Sepal.Length', ylab = 'Sepal.Width')

Output:
• Model kmeans.re:

Three clusters are formed, with sizes 50, 62, and 38. The ratio of the between-cluster sum of
squares to the total sum of squares is 88.4%, so most of the variation in the data is
explained by the cluster assignment.

• Cluster identification:

kmeans.re$cluster lists, for each observation, the number (1 to 3) of the cluster to which it
was assigned.

• Confusion Matrix:

All 50 setosa fall into a single cluster. The cluster of size 62 contains 48 versicolor and 14
virginica, and the cluster of size 38 contains 36 virginica and 2 versicolor. Because K-Means
is unsupervised, the cluster numbers are arbitrary; the table only shows how well the clusters
line up with the species.

• K-means with 3 clusters plot:

The plot shows the three clusters in three different colors against Sepal.Length and
Sepal.Width.

• Plotting cluster centers:

In the plot, the centers of the clusters are marked with star symbols (pch = 8) in the same
color as their cluster.

• Plot of clusters:

Three clusters are formed with varying sepal length and sepal width. K-Means clustering is
widely used in industry.

Classification:

It is a supervised learning approach in which machines are trained using well-labelled training
data and, on the basis of that data, predict the output. Labelled data means that some
input data is already tagged with the correct output.

Classification is the process of categorizing a given set of data into classes. It can be
performed on both structured and unstructured data. The process starts with predicting the class
of given data points. The classes are often referred to as targets, labels, or categories, such
as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.

Classification algorithms can be used in different places. Below are some popular use cases of
classification algorithms:

• Email Spam Detection
• Speech Recognition
• Identification of cancer tumor cells
• Drug Classification
• Biometric Identification, etc.

Decision Trees-

Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.

It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.

→Advantages of the Decision Tree

• It is simple to understand, as it follows the same process that a human follows while
making any decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• It requires less data cleaning than other algorithms.

→Disadvantages of the Decision Tree

• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
• For more class labels, the computational complexity of the decision tree may increase.

Overview of a Decision Tree:

• Root Node: The root node is where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated
further after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: A node that is divided into sub-nodes is called the parent node, and the
sub-nodes are called its child nodes; the root node is the parent of the nodes directly below it.
• Depth: The depth of a node is the minimum number of steps required to reach the node
from the root.
• Decision Stump: The simplest short tree is called a decision stump, which is a decision
tree with the root immediately connected to the leaf nodes. A decision stump makes a
prediction based on the value of just a single input variable.

Example:
Suppose a candidate has a job offer and wants to decide whether to accept the offer or not.
To solve this problem, the decision tree starts with the root node (the Salary attribute,
chosen by the attribute selection measure, ASM). The root node splits further into the next
decision node (distance from the office) and one leaf node based on the corresponding labels.
The next decision node further splits into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the diagram below:

Decision Tree Algorithms:

Multiple algorithms exist to implement decision trees, and the methods of tree construction
vary with different algorithms. Some popular algorithms include ID3, C4.5, and CART.

→ID3 Algorithm:

ID3 (Iterative Dichotomiser 3) is one of the first and most used decision tree algorithms. It
works as follows:

• Calculate the Information Gain of each feature.
• If not all rows belong to the same class, split the dataset S into subsets
using the feature for which the Information Gain is maximum.
• Make a decision tree node using the feature with the maximum Information Gain.
• If all rows belong to the same class, make the current node a leaf node with the class
as its label.
• Repeat for the remaining features until we run out of features, or the decision tree
has all leaf nodes.
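
The Information Gain used in the steps above is commonly defined as the entropy of the class labels minus the weighted entropy of the labels after splitting on a feature. The following is a rough sketch under that assumption; the helper names entropy and info_gain and the toy weather data are hypothetical and only serve to illustrate the calculation.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # avoid log2(0)
  -sum(p * log2(p))
}

# Information gain obtained by splitting `labels` on the values of `feature`
info_gain <- function(feature, labels) {
  weights <- table(feature) / length(feature)        # share of rows in each subset
  child_entropy <- tapply(labels, feature, entropy)  # entropy inside each subset
  entropy(labels) -
    sum(as.vector(weights) * as.vector(child_entropy[names(weights)]))
}

# Hypothetical toy data: which feature would ID3 split on first?
outlook <- c("sunny", "sunny", "rainy", "rainy", "overcast", "overcast")
windy   <- c("yes", "no", "yes", "no", "yes", "no")
play    <- c("no", "no", "yes", "yes", "yes", "yes")
sapply(list(outlook = outlook, windy = windy), info_gain, labels = play)
```

In this toy data, outlook separates the classes perfectly while windy carries no information about play, so ID3 would place outlook at the root.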

→C4.5:

C4.5 is given a set of data representing things that are already classified.

Decision trees generated with the C4.5 algorithm can be used for classification of the
dataset, which is why C4.5 is also known as a statistical classifier.

• First, check for the base cases.
• For each attribute X, find the normalized information gain ratio from splitting on X.
• Let X be the attribute with the highest normalized information gain.
• Create a decision node that splits on attribute X.
• Repeat on the sublists obtained by splitting on attribute X, and add these nodes as
children of the node.

→CART (Classification And Regression Tree):

As the name suggests, the CART algorithm is used to generate both classification and regression
decision trees.

1. Select the root node (S) based on the Gini index or information gain.

2. On each iteration, the algorithm calculates the Gini index and information gain of every
unused attribute.

3. Select the attribute with the lowest Gini index or the highest information gain.

4. Split the set S on that attribute to produce subsets of the data.

5. The algorithm continues to recur on each subset, considering only attributes that have not
been used before, and creates the decision tree.
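
As a rough illustration of the Gini index mentioned in the steps above, the sketch below uses the usual definition (one minus the sum of squared class proportions) and scores a few candidate binary splits on the built-in iris data. The helper names gini and gini_split and the thresholds are assumptions chosen for the example, not part of any particular CART implementation.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_i^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini impurity after a binary split described by a logical vector
gini_split <- function(labels, goes_left) {
  n <- length(labels)
  n_left <- sum(goes_left)
  (n_left / n) * gini(labels[goes_left]) +
    ((n - n_left) / n) * gini(labels[!goes_left])
}

# Compare candidate thresholds on Petal.Length (thresholds are hypothetical)
thresholds <- c(2.5, 4.8, 6.0)
scores <- sapply(thresholds,
                 function(t) gini_split(iris$Species, iris$Petal.Length < t))
data.frame(threshold = thresholds, weighted_gini = scores)
```

A lower weighted Gini value means purer child nodes, so among the candidates the split with the smallest value would be preferred.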

Evaluating a Decision Tree:

Decision trees use greedy algorithms, in that they always choose the option that seems the
best available at that moment. At each step, the algorithm selects which attribute to use for
splitting the remaining records. This selection may not be the best overall, but it is
guaranteed to be the best at that step. This characteristic reinforces the efficiency of
decision trees. However, once a bad split is taken, it is propagated through the rest of the
tree.

There are a few ways to evaluate a decision tree.

First, evaluate whether the splits of the tree make sense. Conduct sanity checks by validating
the decision rules with domain experts and determine if the decision rules are sound.

Next, look at the depth and nodes of the tree. Having too many layers and obtaining nodes with
few members might be signs of overfitting. In overfitting, the model fits the training set well,
but it performs poorly on the new samples in the testing set.

For decision tree learning, overfitting can be caused by either the lack of training data or the
biased data in the training set. Two approaches can help avoid overfitting in decision tree
learning.

• Stop growing the tree early, before it reaches the point where all the training data is
perfectly classified.
• Grow the full tree, and then post-prune the tree with methods such as reduced-error
pruning and rule-based post-pruning.

The x-axis represents the amount of data, and the y-axis represents the errors. The blue curve
is the training set, and the red curve is the testing set. The left side of the gray vertical line
shows that the model predicts well on the testing set. But on the right side of the gray line, the
model performs worse and worse on the testing set as more and more unseen data is introduced.


Last, many standard diagnostic tools that apply to classifiers can help evaluate overfitting.

Decision trees are computationally inexpensive, and it is easy to classify the data.

Decision trees are able to handle both numerical and categorical attributes and are robust
with redundant or correlated variables.

Decision trees are not a good choice if the dataset contains many irrelevant variables.

Although decision trees are able to handle correlated variables, decision trees are not well
suited when most of the variables in the training set are correlated, since overfitting is likely to
occur.

For binary decisions, a decision tree works better if the training dataset consists of records with
an even probability of each result. In other words, the root of the tree has a 50% chance of
either classification.

When using methods such as logistic regression on a dataset with many variables, decision
trees can help determine which variables are the most useful to select based on
information gain. Then these variables can be selected for the logistic regression. Decision
trees can also be used to prune redundant variables.

Decision Tree in R:

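The data set used is the “readingSkills” dataset, which ships with the party package.

library(datasets)
library(caTools)
library(party)
library(dplyr)
library(magrittr)

# Loading data
data("readingSkills", package = "party")

# Splitting the data (sample.split expects the vector of outcome labels)
set.seed(123) # Setting seed so that the split is reproducible
sample_data <- sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)
train_data <- subset(readingSkills, sample_data == TRUE)
test_data <- subset(readingSkills, sample_data == FALSE)

# Creating and plotting the tree
model <- ctree(nativeSpeaker ~ ., train_data)
plot(model)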

Bayes' Theorem:

Bayes' theorem, named after 18th-century British mathematician Thomas Bayes, is a
mathematical formula for determining conditional probability. Conditional probability is the
likelihood of an outcome occurring, based on a previous outcome occurring.
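
For reference, the theorem is usually written as follows, where A and B are events and P(B) is nonzero. The symbols A and B are generic placeholders introduced here for illustration.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) \neq 0
```

Here P(A | B) is the posterior probability of A given B, P(B | A) is the likelihood, P(A) is the prior probability, and P(B) is the marginal probability of the evidence.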

Example:

Applications of Bayes’ Theorem

In the real world, there are plenty of applications of Bayes’ Theorem. Some applications are
given below:

• It can be used as a building block and starting point for more complex
methodologies, for example, the popular Bayesian networks.
• It is used in classification problems and other probability-related questions.
• It underlies Bayesian inference, a particular approach to statistical inference.
• In genetics, Bayes’ theorem can be used to calculate the probability of an individual
having a specific genotype.

Naïve Bayes Classifier:

The Naïve Bayes algorithm is a supervised learning algorithm that is based on Bayes' theorem
and used for solving classification problems.

It is mainly used in text classification, which involves high-dimensional training datasets.

The Naïve Bayes Classifier is one of the simplest and most effective classification algorithms,
and it helps in building fast machine learning models that can make quick predictions.

It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.

Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis,
and classifying articles.

→Why is it called Naïve Bayes?

The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described
as follows:

Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Each feature individually contributes to identifying it as an apple, without depending
on the others.

Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

→Working of Naïve Bayes' Classifier:

The working of the Naïve Bayes classifier can be understood with the help of the example below:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the steps below:

• Convert the given dataset into frequency tables.
• Generate a likelihood table by finding the probabilities of the given features.
• Use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:


Frequency table for the Weather Conditions:

Likelihood table weather condition:

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 0.35

P(Yes) = 0.71

So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 ≈ 0.61

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 0.29

P(Sunny) = 0.35

So P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41

As the calculation above shows, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
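
The arithmetic above can be checked with a short R sketch. It assumes the counts implied by the worked example: 10 "Yes" days of which 3 are sunny, and 4 "No" days of which 2 are sunny, for 14 records in total. The variable names are made up for the illustration. Working with the exact fractions avoids the rounding of the intermediate values used in the text, so the two posteriors add up to exactly 1.

```r
# Counts implied by the worked example (assumed): 14 records in total
n_yes <- 10; n_no <- 4          # days with Play = Yes / Play = No
sunny_yes <- 3; sunny_no <- 2   # sunny days within each class
n <- n_yes + n_no

# Marginal probability of a sunny day
p_sunny <- (sunny_yes + sunny_no) / n

# Bayes' theorem: P(class | Sunny) = P(Sunny | class) * P(class) / P(Sunny)
p_yes_given_sunny <- (sunny_yes / n_yes) * (n_yes / n) / p_sunny
p_no_given_sunny  <- (sunny_no / n_no) * (n_no / n) / p_sunny

round(c(yes = p_yes_given_sunny, no = p_no_given_sunny), 2)
```

With exact fractions the posteriors come out as 0.6 for "Yes" and 0.4 for "No", which agrees with the conclusion that playing is the more probable outcome on a sunny day.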

→Advantages of Naïve Bayes Classifier:

• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a
dataset.
• It can be used for binary as well as multi-class classification.
• It performs well in multi-class predictions compared to other algorithms.
• It is a popular choice for text classification problems.

→Disadvantages of Naïve Bayes Classifier:

• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

→Applications of Naïve Bayes Classifier:

• It is used for credit scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because the Naïve Bayes classifier is an eager learner.
• It is used in text classification, such as spam filtering and sentiment analysis.

Smoothing:

If one of the attribute values does not appear with one of the class labels within the training
set, the corresponding conditional probability will equal zero. When this happens, the posterior
probability that results from multiplying all the conditional probabilities immediately becomes
zero, regardless of how large some of the other conditional probabilities are.
Therefore, overfitting occurs. Smoothing techniques can be employed to adjust these
probabilities and to ensure a nonzero value.

A smoothing technique assigns a small nonzero probability to rare events not included in the
training dataset. Also, the smoothing addresses the possibility of taking the logarithm of zero
that may occur.

There are various smoothing techniques. Among them is the Laplace smoothing (or add-one)
technique, which pretends to see every outcome once more than it actually appears.

One problem with Laplace smoothing is that it may assign too much probability to unseen
events. To address this problem, Laplace smoothing can be generalized to add a value ε instead
of 1, where ε belongs to [0,1].
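
A minimal sketch of this idea, assuming the common additive form in which each probability is estimated as (count + ε) divided by the sum of all (count + ε). The function name smooth_probs and the counts are hypothetical. Setting eps = 1 gives add-one (Laplace) smoothing, and smaller values shift less probability toward unseen outcomes.

```r
# Additive smoothing: (count + eps) / sum(count + eps)
# eps = 1 is add-one (Laplace) smoothing; 0 < eps < 1 is the generalized form
smooth_probs <- function(counts, eps = 1) {
  (counts + eps) / sum(counts + eps)
}

# Hypothetical counts of an attribute value within one class;
# "overcast" was never observed, so its unsmoothed probability would be zero
counts <- c(sunny = 3, overcast = 0, rainy = 7)

counts / sum(counts)              # unsmoothed: contains a zero
smooth_probs(counts, eps = 1)     # add-one smoothing
smooth_probs(counts, eps = 0.1)   # milder smoothing
```

For standard use, the naiveBayes() function in the e1071 package exposes a laplace argument that applies this kind of smoothing to categorical predictors.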
Smoothing techniques are available in most standard software packages for naïve Bayes
classifiers. However, if for some reason (like performance concerns) the naïve Bayes classifier
needs to be coded directly into an application, the smoothing and logarithm calculations should
be incorporated into the implementation.

Naïve Bayes in R:

The dataset used here is the same Iris dataset used for K-Means.

# Loading data
data(iris)

# Structure
str(iris)

# Installing Packages
install.packages("e1071")
install.packages("caTools")
install.packages("caret")

# Loading package
library(e1071)
library(caTools)
library(caret)

# Splitting data into train and test data
# (sample.split expects the vector of class labels)
set.seed(120) # Setting seed before the random split
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)

# Feature Scaling (computed here but not used by the model below)
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])

# Fitting Naive Bayes Model to training dataset
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)
classifier_cl

# Predicting on test data
y_pred <- predict(classifier_cl, newdata = test_cl)
Output:
• Model classifier_cl:

The conditional probabilities for each feature or variable are computed by the model
separately. The a priori probabilities are also calculated, which indicate the
distribution of our data across the classes.
