Types of Classification Algorithm
Nearest Neighbor:
The k-nearest-neighbors algorithm is a supervised classification algorithm: it takes a set of
labelled points and uses them to learn how to label other points. To label a new point, it looks
at the labelled points closest to that new point (its nearest neighbors) and has those neighbors
vote, so whichever label most of the neighbors have becomes the label of the new point (the “k”
is the number of neighbors it checks).
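As a minimal sketch of this voting idea (the function, data and names below are illustrative, not from the text), the neighbour search and vote can be written directly in base R:
# Minimal k-NN vote in base R (illustrative sketch, not a library call)
knn_vote <- function(train, labels, new_point, k = 3) {
  # Euclidean distance from the new point to every labelled point
  dists <- apply(train, 1, function(p) sqrt(sum((p - new_point)^2)))
  # the k closest labelled points vote; the most common label wins
  nearest <- order(dists)[1:k]
  names(which.max(table(labels[nearest])))
}

# toy usage: two clusters of labelled points
train  <- rbind(c(1, 1), c(1, 2), c(8, 8), c(9, 8))
labels <- c("a", "a", "b", "b")
knn_vote(train, labels, new_point = c(1.5, 1.5), k = 3)   # votes: a, a, b -> "a"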
Decision Trees:
A decision tree builds classification or regression models in the form of a tree structure. It
breaks a data set down into smaller and smaller subsets while, at the same time, an associated
decision tree is incrementally developed. The final result is a tree with decision nodes and leaf
nodes. A decision node has two or more branches, and a leaf node represents a classification or
decision. The topmost decision node in a tree, which corresponds to the best predictor, is called
the root node. Decision trees can handle both categorical and numerical data.
Random Forest:
Random forests, or random decision forests, are an ensemble learning method for classification,
regression and other tasks that operates by constructing a multitude of decision trees at training
time and outputting the class that is the mode of the classes (classification) or the mean
prediction (regression) of the individual trees. Random decision forests correct for decision
trees’ habit of overfitting to their training set.
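As an illustrative sketch (not part of the original text), a random forest can be fitted in R with the randomForest package; x_train, y_train and x_test are the same hypothetical objects assumed by the other R snippets in this document:
# Sketch: random forest classifier with the randomForest package
# (y_train is assumed to be a factor, so a classification forest is grown)
library(randomForest)
x <- cbind(x_train, y_train)
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
print(fit)                                # out-of-bag error estimate and confusion matrix
predicted <- predict(fit, x_test)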
Neural Network:
A neural network consists of units (neurons), arranged in layers, which
convert an input vector into some output. Each unit takes an input, applies an (often nonlinear)
function to it, and then passes the output on to the next layer. Generally, the networks are
defined to be feed-forward: a unit feeds its
output to all the units on the next layer, but there is no feedback to the
previous layer. Weightings are applied to the signals passing from one unit
to another, and it is these weightings which are tuned in the training phase
to adapt a neural network to the particular problem at hand.
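As an illustrative sketch (again assuming the hypothetical x_train, y_train and x_test objects used elsewhere in this document), a single-hidden-layer feed-forward network can be fitted in R with the nnet package:
# Sketch: single-hidden-layer feed-forward network with nnet
# (y_train is assumed to be a factor for classification)
library(nnet)
x <- cbind(x_train, y_train)
fit <- nnet(y_train ~ ., data = x, size = 5, maxit = 200)   # 5 hidden units
predicted <- predict(fit, x_test, type = "class")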
What is Classification?
We use the training dataset to obtain the boundary conditions that can be used to determine each
target class. Once the boundary conditions are determined, the next task is to predict the target class.
This whole process is known as classification.
Let’s understand the concept of classification algorithms with gender classification using hair
length (by no means am I trying to stereotype by gender, this is only an example). To classify
gender (target class) using hair length as feature parameter we could train a model using any
classification algorithms to come up with some set of boundary conditions which can be used to
differentiate the male and female genders using hair length as the training feature. In the gender
classification case, the boundary condition could be a suitable hair-length value. Suppose the
boundary hair length turns out to be 15.0 cm; then we can say that if the hair length is less
than 15.0 cm the gender could be male, or else female.
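The learned boundary in this toy example is just a single threshold, so the resulting classifier can be written as a one-line rule (an illustrative sketch; the hair_length values are made up, and the 15.0 cm cut-off is the one from the text above):
# Toy decision rule from the example above: classify by the 15.0 cm threshold
hair_length <- c(3.5, 12.0, 18.2, 25.0)                 # made-up measurements in cm
predicted_gender <- ifelse(hair_length < 15.0, "male", "female")
predicted_gender                                        # "male" "male" "female" "female"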
Clustering, by contrast, groups similar items without predefined labels. For example:
While grouping similar-language documents (documents in the same language form one group).
While categorizing news articles (articles in the same news category, such as Sport, form one group).
Let’s understand the concept with the example of clustering genders based on hair length. To determine
gender, a similarity measure could be used to separate the male and female groups. This could be done
by finding the similarity between two hair lengths and keeping them in the same group if the difference
between the hair lengths is small. The same process continues until all the hair lengths are properly
grouped into two categories.
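A minimal sketch of this grouping idea, using k-means with two clusters on a made-up vector of hair lengths (the data and variable names are illustrative, not from the text):
# Group hair lengths into two clusters with k-means (unsupervised, no labels used)
hair_length <- c(2.1, 4.0, 5.5, 16.0, 22.3, 30.1)       # made-up lengths in cm
set.seed(42)                                            # k-means uses random starting centers
clusters <- kmeans(hair_length, centers = 2)
clusters$cluster                                        # cluster assignment for each observation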
The main families of classification algorithms include:
Linear Classifiers
o Logistic regression
o Naive Bayes classifier
o Fisher’s linear discriminant
Support vector machines
o Least squares support vector machines
Quadratic classifiers
Kernel estimation
o k-nearest neighbor
Decision trees
o Random forests
Neural networks
Learning vector quantization
Logistic Regression
As confusing as the name might be, you can rest assured. Logistic Regression is a classification
and not a regression algorithm. It estimates discrete values (Binary values like 0/1, yes/no,
true/false) based on a given set of independent variable(s). Simply put, it predicts the
probability of occurrence of an event by fitting data to a logit function; hence, it is also known
as logit regression. Since it predicts a probability, the values obtained always lie between
0 and 1.
Let’s say there’s a sum on your math test. It can only have 2 outcomes, right? Either you solve it or
you don’t (and let’s not assume points for method here). Now imagine that you are being given a
wide range of sums in an attempt to understand which chapters you have understood well. The
outcome of this study would be something like this – if you are given a trigonometry based problem,
you are 70% likely to solve it. On the other hand, if it is an arithmetic problem, the probability of you
getting an answer is only 30%. This is what Logistic Regression provides you.
If I had to do the math, I would model the log odds of the outcome as a linear combination of the
predictor variables.
R-Code:
x <- cbind(x_train, y_train)
# Train the model using the training set and inspect the fit
logistic <- glm(y_train ~ ., data = x, family = "binomial")
summary(logistic)
# Predict output (type = "response" returns probabilities rather than log-odds)
predicted <- predict(logistic, x_test, type = "response")
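A common follow-up step (not part of the original snippet) is to turn the predicted probabilities into class labels; the 0.5 cut-off below is an illustrative choice:
# Convert predicted probabilities into 0/1 class labels (0.5 is an illustrative threshold)
predicted_class <- ifelse(predicted > 0.5, 1, 0)
table(predicted_class)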
There are many different steps that could be tried in order to improve the model, for example including interaction terms, removing features, or applying regularization.
Decision Trees
Now, the decision tree is by far one of my favorite algorithms. It is a type of supervised learning
algorithm, mostly used for classification problems, that can handle both categorical and continuous
dependent variables. What this algorithm does is split the population into two or more homogeneous
sets based on the most significant attributes, making the groups as distinct as possible.
In the image above, you can see that the population is classified into four different groups based on
multiple attributes to identify ‘if they will play or not’.
R-Code:
library(rpart)
x <- cbind(x_train, y_train)
# Grow the tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
# Predict output (type = "class" returns the predicted class labels)
predicted <- predict(fit, x_test, type = "class")
Naive Bayes Classifier
This is a classification technique based on Bayes’ theorem, with an assumption of independence
between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features, a
Naive Bayes Classifier would consider all of these properties to independently contribute to the
probability that this fruit is an apple.
A Naive Bayes model is simple to build and particularly useful for very large data sets. Along
with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated
classification methods.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c).
The expression for the posterior probability is:
P(c|x) = P(x|c) * P(c) / P(x)
Here, P(c|x) is the posterior probability of class c given predictor x, P(c) is the prior probability
of the class, P(x|c) is the likelihood (the probability of the predictor given the class), and P(x) is
the prior probability of the predictor.
Example: Let’s work through an example to understand this better. So, here I have a training data
set of weather, namely sunny, overcast and rainy, and a corresponding binary variable ‘Play’. Now, we
need to classify whether players will play or not based on the weather condition. Let’s follow the
steps below to perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities, for example the probability of
Overcast = 0.29 and the probability of playing = 0.64.
Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the outcome of the prediction.
Problem: Players will play if the weather is sunny, is this statement correct?
We can solve it using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny).
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability, so the
prediction is ‘Yes’.
Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.
R-Code:
library(e1071)
x <- cbind(x_train, y_train)
# Fit the model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
KNN (k- Nearest Neighbors)
K nearest neighbors is a simple algorithm used for both classification and regression problems. It
stores all available cases and classifies a new case by a majority vote of its k nearest neighbors:
the new case is assigned to the class that is most common among those neighbors, measured by a
distance function (Euclidean, Manhattan, Minkowski or Hamming).
While the first three distance functions are used for continuous variables, the Hamming distance is
used for categorical variables. If K = 1, the case is simply assigned to the class of its nearest
neighbor. At times, choosing K turns out to be a challenge while performing kNN modeling.
You can understand KNN easily by taking an example of our real lives. If you have a crush on a
girl/boy in class, of whom you have no information, you might want to talk to their friends and social
circles to gain access to their information!
R-Code:
library(class)
# knn() from the 'class' package takes the training matrix, the test matrix,
# the training labels and k directly; there is no formula interface
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
Things to consider before selecting KNN: it is computationally expensive at prediction time, variables should be normalized so that features with larger ranges do not dominate the distance, and outliers or noise should be dealt with in pre-processing.
SVM (Support Vector Machine)
For example, if we only had two features like Height and Hair length of an individual, we’d first plot
these two variables in two-dimensional space, where each point has two coordinates. The points lying
closest to the separating boundary are known as support vectors.
Now, we will find some line that splits the data between the two differently classified groups of data.
This will be the line such that the distance from it to the closest point in each of the two groups is
as large as possible.
In the example shown above, the line which splits the data into two differently classified groups is
the blue line, since the two closest points are farthest away from it. This line is our classifier.
Then, depending on which side of the line a test point lands, that is the class we assign to the new
data.
R-Code:
library(e1071)
x <- cbind(x_train, y_train)
# Fit the model
fit <- svm(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
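When the two groups cannot be separated by a straight line, svm() from e1071 also accepts a kernel argument; a hedged variation of the fit above (the cost and gamma values are illustrative, not tuned):
# Variation on the fit above: a radial (RBF) kernel for non-linearly separable data
fit_rbf <- svm(y_train ~ ., data = x, kernel = "radial", cost = 1, gamma = 0.5)
predicted_rbf <- predict(fit_rbf, x_test)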
DECISION TREE
A Decision Tree is considered to be one of the most useful Machine Learning algorithms, since it can be
used to solve a variety of problems. A few reasons why you should use a Decision Tree: it is easy to
understand and interpret, it requires relatively little data preparation, and it can handle both
numerical and categorical data.
Let’s say that you hosted a huge party and you want to know how many of your guests were non-
vegetarians. To solve this problem, let’s create a simple Decision Tree.
(Figure: Decision Tree example – Edureka)
In the above illustration, I’ve created a Decision tree that classifies a guest as either vegetarian or
non-vegetarian. Each node represents a predictor variable that will help to conclude whether or not a
guest is a non-vegetarian. As you traverse down the tree, you must make decisions at each node,
until you reach a dead end.
Now that you know the logic of a Decision Tree, let’s define a set of terms related to a Decision
Tree.
Root Node: The root node is the starting point of a tree. At this point, the first split is performed.
Internal Nodes: Each internal node represents a decision point (predictor variable) that
eventually leads to the prediction of the outcome.
Leaf/ Terminal Nodes: Leaf nodes represent the final class of the outcome and therefore
they’re also called terminating nodes.
Branches: Branches are connections between nodes; they’re represented as arrows. Each
branch represents a response such as yes or no.
So that is the basic structure of a Decision Tree. Now let’s try to understand the workflow of a Decision
Tree.
Step 1: Select the feature (predictor variable) that best classifies the data set into the desired classes
and assign that feature to the root node.
Step 2: Traverse down from the root node, whilst making relevant decisions at each internal node
such that each internal node best classifies the data.
Step 3: Route back to step 1 and repeat until you assign a class to the input data.
The above-mentioned steps represent the general workflow of a Decision Tree used for classification
purposes.
ID3 or the Iterative Dichotomiser 3 algorithm is one of the most effective algorithms used to build a
Decision Tree. It uses the concept of Entropy and Information Gain to generate a Decision Tree for a
given set of data.
ID3 Algorithm:
The ID3 algorithm follows this workflow in order to build a Decision Tree: select the best attribute,
assign it as the decision variable for the current node, split the data on its values, and repeat
until the data at a node is correctly classified (that node then becomes a leaf).
The first step in this algorithm states that we must select the best attribute. What does that mean?
The best attribute (predictor variable) is the one that separates the data set into the different
classes most effectively; in other words, it is the feature that best splits the data set.
Now the next question in your head must be, “How do I decide which variable/feature best splits the
data?” Two measures are used for this:
1. Information Gain
2. Entropy
What Is Entropy?
Entropy measures the impurity or uncertainty present in the data. For class proportions p(i) it is
computed as Entropy = −Σ p(i) log2 p(i); a node whose samples all belong to one class has an entropy
of 0, while a node with an even 50/50 split of two classes has an entropy of 1.
What Is Information Gain?
Information Gain (IG) is the most significant measure used to build a Decision
Tree. It indicates how much “information” a particular feature/ variable gives us
about the final outcome.
Information Gain is important because it is used to choose the variable that best splits the data at each
node of a Decision Tree. The variable with the highest IG is used to split the data at the root node.
To better understand how Information Gain and Entropy are used to create a Decision Tree, let’s look
at an example. The below data set represents the speed of a car based on certain parameters.
Your problem statement is to study this data set and create a Decision Tree that classifies the speed
of a car (response variable) as either slow or fast, depending on the following predictor variables:
Road type
Obstruction
Speed limit
We’ll be building a Decision Tree using these variables in order to predict the speed of a car. Like I
mentioned earlier we must first begin by deciding a variable that best splits the data set and assign
that particular variable to the root node and repeat the same thing for the other nodes as well.
At this point, you might be wondering how you know which variable best separates the data. The
answer is that the variable with the highest Information Gain best divides the data into the desired
output classes.
So, let’s begin by calculating the Entropy and Information Gain (IG) for each of the predictor variables,
starting with ‘Road type’.
In our data set, there are four observations in the ‘Road type’ column that correspond to four labels
in the ‘Speed of car’ column. We shall begin by calculating the entropy of the parent node (Speed of
car).
Step one is to find out the fraction of the two classes present in the parent node. We know that there
are a total of four values present in the parent node, out of which two samples belong to the ‘slow’
class and the other 2 belong to the ‘fast’ class, therefore:
p(slow) = no. of ‘slow’ outcomes in the parent node / total number of outcomes = 2/4 = 0.5
p(fast) = no. of ‘fast’ outcomes in the parent node / total number of outcomes = 2/4 = 0.5
Substituting these fractions into the entropy formula gives Entropy(parent) = −(0.5 × log2 0.5 + 0.5 × log2 0.5) = 1.
Now that we know that the entropy of the parent node is 1, let’s see how to calculate the Information
Gain for the ‘Road type’ variable. Remember that if the Information Gain of the ‘Road type’ variable
is greater than the Information Gain of all the other predictor variables, only then can the root node
be split using the ‘Road type’ variable.
In order to calculate the Information Gain of ‘Road type’ variable, we first need to split the root node
by the ‘Road type’ variable.
(Figure: the parent node split on ‘Road type’ – Edureka)
In the above illustration, we’ve split the parent node using the ‘Road type’ variable; the child nodes
denote the corresponding responses as shown in the data set. Now, we need to measure the entropy
of the child nodes.
The entropy of the right-hand side child node (fast) is 0, because all of the outcomes in this node
belong to one class (fast). In a similar manner, we must find the entropy of the left-hand side node
(slow, slow, fast).
In this node there are two types of outcomes (fast and slow), so we first need the fraction of slow
and fast outcomes for this particular node: p(slow) = 2/3 and p(fast) = 1/3, giving
Entropy(left child) = −(2/3 × log2(2/3) + 1/3 × log2(1/3)) ≈ 0.918. The weighted average entropy of
the child nodes is then (3/4) × 0.918 + (1/4) × 0 ≈ 0.689.
Our final step is to substitute this weighted average into the IG formula in order to calculate the
final IG of the ‘Road type’ variable:
IG(Road type) = Entropy(parent) − weighted entropy of children = 1 − 0.689 ≈ 0.311.
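The same arithmetic can be checked with a few lines of R; this is just an illustrative check of the worked example above (the class counts 2/2, 2/1 and 1 come from that example):
# Check of the worked example: entropy and Information Gain for 'Road type'
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

parent <- entropy(c(2/4, 2/4))            # 2 slow, 2 fast   -> 1
left   <- entropy(c(2/3, 1/3))            # slow, slow, fast -> ~0.918
right  <- entropy(c(1))                   # fast             -> 0

weighted_children <- (3/4) * left + (1/4) * right
ig_road_type      <- parent - weighted_children       # ~0.311
ig_road_type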
Like I mentioned earlier, the Decision Tree Algorithm selects the variable with the highest Information
Gain to split the Decision Tree. Therefore, by using the above method you need to calculate the
Information Gain for all the predictor variables to check which variable has the highest IG.
So, by using the above methodology, you can compute the Information Gain for each predictor variable.
In this example, the ‘Speed limit’ variable turns out to have the highest Information Gain, therefore
the final Decision Tree for this dataset is built using the ‘Speed limit’ variable.
Now that you know how a Decision Tree is created, let’s run a short demo that solves a real-world
problem by implementing Decision Trees.
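The original demo is not reproduced here, but a self-contained sketch along the same lines, using rpart on R’s built-in iris data set (my own choice of data, not the demo’s), could look like this:
# Illustrative stand-in for the demo: a classification tree on R's built-in iris data
library(rpart)

set.seed(1)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))        # 70/30 train-test split
train <- iris[idx, ]
test  <- iris[-idx, ]

fit  <- rpart(Species ~ ., data = train, method = "class")
printcp(fit)                                         # variables actually used in the splits
pred <- predict(fit, test, type = "class")
table(predicted = pred, actual = test$Species)       # confusion matrix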
Naive Bayes Classifier
What is a classifier?
A classifier is a machine learning model that is used to discriminate
different objects based on certain features.
Bayes Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Using Bayes’ theorem, we can find the probability of A happening given that B has occurred. Here, B is
the evidence and A is the hypothesis, and the naive assumption is that the predictors/features are
independent.
Example:
Let us take an example to get some better intuition. Consider the problem
of playing golf. The dataset is represented as below.
We classify whether the day is suitable for playing golf, given the features of
the day. The columns represent these features and the rows represent
individual entries. If we take the first row of the dataset, we can observe that the day is not
suitable for playing golf if the outlook is rainy, the temperature is hot, the humidity is high and
it is not windy. We make two assumptions here. One, as stated above, is that these predictors are
independent: if the temperature is hot, it does not necessarily mean that the humidity is high. The
other assumption is that all the predictors have an equal effect on the outcome: the day being windy
does not have more importance in deciding whether to play golf or not.
X is given as X = (x1, x2, ..., xn), where x1, ..., xn are the features; here they map to outlook,
temperature, humidity and windy. Substituting X and applying the independence assumption, Bayes’
theorem becomes
P(y | x1, ..., xn) = P(x1|y) * P(x2|y) * ... * P(xn|y) * P(y) / (P(x1) * P(x2) * ... * P(xn)).
Now, you can obtain the values for each term by looking at the dataset and substituting them into the
equation. For all entries in the dataset, the denominator does not change; it remains constant.
Therefore, the denominator can be removed and a proportionality introduced:
P(y | x1, ..., xn) ∝ P(y) * P(x1|y) * ... * P(xn|y).
In our case, the class variable (y) has only two outcomes, yes or no. There could be cases where the
classification is multi-class, i.e. y has more than two outcomes. In either case, we need to find the
class y with the maximum probability:
y = argmax over y of P(y) * P(x1|y) * ... * P(xn|y).
Using the above function, we can obtain the class, given the predictors.
When a feature is continuous rather than categorical, the way its values are present in the dataset
changes, and the formula for the conditional probability changes to a Gaussian (normal) density
estimated per class:
P(xi | y) = (1 / sqrt(2π σy²)) * exp(−(xi − μy)² / (2 σy²)),
where μy and σy are the mean and standard deviation of feature xi computed over the training samples
of class y.
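As an illustrative sketch (not from the original text), the Gaussian class-conditional likelihood for a single continuous feature can be evaluated in R with dnorm; the temperatures below are made up:
# Sketch: Gaussian class-conditional likelihood P(xi | y) for one continuous feature
temp_yes <- c(21, 24, 26, 23, 25)          # made-up temperatures on 'play = yes' days
mu    <- mean(temp_yes)                    # class-conditional mean
sigma <- sd(temp_yes)                      # class-conditional standard deviation
dnorm(27, mean = mu, sd = sigma)           # likelihood of observing 27 degrees given y = yes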
Conclusion:
Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation
systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement
that the predictors be independent. In most real-life cases the predictors are dependent, and this
hinders the performance of the classifier.
Standardization VS Normalization
Standardization
Standardization (or Z-score normalization) is the process of rescaling the features so that they have
the properties of a standard normal distribution with a mean (μ) of 0 and a standard deviation (σ)
of 1: z = (x − μ) / σ.
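A quick illustrative sketch in R (the numbers are made up), using the built-in scale() function:
# Standardization (z-scores): subtract the mean, divide by the standard deviation
x <- c(10, 20, 30, 40, 50)               # made-up feature values
z <- as.vector(scale(x))                 # same as (x - mean(x)) / sd(x)
c(mean(z), sd(z))                        # mean 0, standard deviation 1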
Normalization
Normalization, often also simply called Min-Max scaling, shrinks the range of the data so that it is
fixed between 0 and 1 (or −1 to 1 if there are negative values). It works better in cases where
standardization might not work so well: if the distribution is not Gaussian or the standard deviation
is very small, the min-max scaler is the better choice.
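The corresponding min-max rescaling can be written by hand, since base R has no built-in helper for it (again an illustrative sketch with the same made-up numbers):
# Min-max scaling: shrink the range of the data to [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
x <- c(10, 20, 30, 40, 50)               # made-up feature values
min_max(x)                               # 0.00 0.25 0.50 0.75 1.00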