Unit 2 ML Q&A


UNIT-II

Supervised Learning (Regression/Classification): Distance based Methods


1. What is the Role of Distance Measures in ML Algorithms? Illustrate. [7M]
July – 2023 Set -1[Remember]
Distance-based Methods
Machine learning offers an exciting way to establish meaningful relationships within complex
datasets, and distance measurements play a pivotal role in this process. Distance-based methods
help determine similarities and differences between data points, enabling algorithms to
perform more accurately and effectively. The fundamental distance measures commonly used in
machine learning are described below.
1. Euclidean Distance
Euclidean distance represents the shortest straight-line distance between two points. It is used
to calculate the distance between two points in two-dimensional, three-dimensional, or higher-dimensional
space. It is typically favoured for continuous data.

2. Manhattan Distance:
Named after New York City’s grid-like street layout, Manhattan distance calculates the
distance between two points using only horizontal and vertical movements. It is especially
used in analyses of categorical and ordinal data.
3. Minkowski Distance

Minkowski distance generalizes the Euclidean and Manhattan distances. It allows for different
distance calculations based on a special parameter ‘p’, offering flexibility for various data
types.

4. Cosine Similarity

Cosine similarity measures the similarity between two vectors based on the cosine of the angle
between them. This method is particularly used in areas like text similarity and document
classification.

5. Hamming Distance
Hamming distance is primarily used with binary data. It calculates the distance by counting
the number of differing characters between two strings. This plays a significant role in data
transmission and error correction coding.
6. Jaccard Similarity
Jaccard similarity measures the ratio of the intersection to the union of two sets. This is a
useful method for understanding similarities and differences in set-based data.

Each distance measurement offers tailored solutions for different data types and analysis
requirements. As a data analyst, knowing when and how to use these measurements can lead
to success in your machine learning projects.
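As an illustration, the following short NumPy sketch (the example vectors are made up for illustration, and NumPy is assumed to be installed) computes each of the measures above:

```python
# Minimal sketch of the distance/similarity measures described above.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # straight-line distance
manhattan = np.sum(np.abs(x - y))                   # horizontal + vertical moves only
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)   # generalizes Euclidean (p=2) and Manhattan (p=1)
cosine_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # angle-based similarity

a = np.array([1, 0, 1, 1, 0])   # binary vectors for Hamming / Jaccard
b = np.array([1, 1, 1, 0, 0])
hamming = np.sum(a != b)                                                   # count of differing positions
jaccard_sim = np.sum((a == 1) & (b == 1)) / np.sum((a == 1) | (b == 1))    # |intersection| / |union|

print(euclidean, manhattan, minkowski, cosine_sim, hamming, jaccard_sim)
```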
Nearest Neighbours
2. Explain KNN algorithm with an example. [7M] July – 2023 Set -
1[Understand]

The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method


employed to tackle classification and regression problems. Evelyn Fix and Joseph Hodges
developed this algorithm in 1951, which was subsequently expanded by Thomas Cover.
This answer covers the fundamentals, workings, and implementation of the KNN
algorithm.

What is the K-Nearest Neighbors Algorithm?

KNN is one of the most basic yet essential classification algorithms in machine learning. It
belongs to the supervised learning domain and finds intense application in pattern
recognition, data mining, and intrusion detection. It is widely applicable in real-life
scenarios since it is non-parametric, meaning it does not make any underlying assumptions
about the distribution of data (as opposed to other algorithms such as GMM, which assume
a Gaussian distribution of the given data).

We are given some prior data (also called training data), which classifies coordinates into
groups identified by an attribute.

KNN Algorithm working visualization


Why do we need a KNN algorithm?
The K-NN algorithm is a versatile and widely used machine learning algorithm that is
valued primarily for its simplicity and ease of implementation. It does not require any
assumptions about the underlying data distribution. It can also handle both numerical and
categorical data, making it a flexible choice for various types of datasets in classification
and regression tasks. It is a non-parametric method that makes predictions based on the
similarity of data points in a given dataset. K-NN is less sensitive to outliers compared to
other algorithms.
Workings of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it
predicts the label or value of a new data point by considering the labels or values of its K
nearest neighbors in the training dataset.
Let X be the training dataset with n data points, where each data point is represented by a d-
dimensional feature vector xᵢ, and let Y be the corresponding labels or values for each data
point in X. Given a new data point x, the algorithm calculates the distance between x and
each data point xᵢ in X using a distance metric, such as the Euclidean distance:
d(x, xᵢ) = √( Σⱼ (xⱼ − xᵢⱼ)² )

Step-by-Step explanation of how KNN works is discussed below:

Step 1: Selecting the optimal value of K


 K represents the number of nearest neighbors that needs to be considered while
making the prediction.
Step 2: Calculating distance
 To measure the similarity between target and training data points, Euclidean
distance is used. Distance is calculated between each of the data points in the
dataset and the target point.
Step 3: Finding Nearest Neighbors
 The k data points with the smallest distances to the target point are the nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
 In a classification problem, the class label is determined by performing
majority voting. The class with the most occurrences among the neighbors
becomes the predicted class for the target data point.
 In a regression problem, the prediction is calculated by taking the average of the
target values of the K nearest neighbors. The calculated average value becomes the
predicted output for the target data point.
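The four steps above can be sketched in a few lines of NumPy (a minimal illustration with made-up toy data, not a production implementation; in practice a library such as scikit-learn's KNeighborsClassifier is normally used):

```python
# From-scratch sketch of the KNN classification steps described above.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 2: Euclidean distance from the query point to every training point
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Step 3: indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbours' labels (classification)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two clusters labelled 0 and 1
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([7, 8]), k=3))  # -> 1
```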
Advantages of the KNN Algorithm:
 Easy to Implement – The algorithm is simple and its complexity is not high.
 Adapts Easily – Because KNN stores all the training data in memory, whenever a
new example or data point is added the algorithm automatically takes it into account,
and the new point contributes to future predictions as well.
 Few Hyperparameters – The only parameters required when training a KNN model
are the value of k and the choice of the distance metric.
Disadvantages of the KNN Algorithm:
 Does not scale – KNN is considered a lazy algorithm: it defers all computation to
prediction time, so it requires a lot of computing power and data storage. This makes
the algorithm both time-consuming and resource-exhausting.
 Curse of Dimensionality – Due to the peaking phenomenon, the KNN algorithm is
affected by the curse of dimensionality, which means it has a hard time classifying
data points properly when the dimensionality is too high.
 Prone to Overfitting – Because the algorithm is affected by the curse of
dimensionality, it is also prone to overfitting. Hence, feature selection and
dimensionality reduction techniques are generally applied to deal with this problem.

Decision Trees

3. What is the decision tree? How to choose attribute selection in decision


tree? [7M] July – 2023 Set -2[Remember]

4. Explain about Decision tree classifier with an example. [7M] July – 2023
Set - 2[Understand]

Decision Tree

Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.

In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple branches, whereas
Leaf nodes are the outputs of those decisions and do not contain any further branches.

It is a graphical representation for getting all the possible solutions to a problem/decision


based on given conditions.

In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.

A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the
tree into subtrees.

Below diagram explains the general structure of a decision tree:


Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that splits into sub-nodes is called the parent node of those sub-nodes,
and the sub-nodes are its child nodes.

How does the Decision Tree algorithm Work?

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).

Step-3: Divide S into subsets that contain possible values for the best attribute.

Step-4: Generate the decision tree node, which contains the best attribute.

Step-5: Recursively make new decision trees using the subsets of the dataset created in Step 3.
Continue this process until a stage is reached where the nodes cannot be classified further;
such final nodes are called leaf nodes.
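A minimal scikit-learn sketch of this procedure is shown below (it assumes scikit-learn is installed and uses the built-in Iris data purely for illustration); DecisionTreeClassifier implements CART and carries out the attribute selection and recursive splitting internally:

```python
# Sketch: train and evaluate a decision tree classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion can be "entropy" (information gain) or "gini" (Gini index); see ASM below
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```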
Attribute Selection Measures

While implementing a Decision tree, the main issue that arises is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. Using this measure, we can easily select
the best attribute for the nodes of the tree. There are two popular techniques for ASM, which
are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies


randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)

Where,

S= Total number of samples

P(yes)= probability of yes

P(no)= probability of no

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σⱼ Pⱼ²
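As a minimal illustration (the class counts are made up, and NumPy is assumed to be installed), both measures, plus information gain, can be computed directly from class counts:

```python
# Sketch: entropy, information gain, and Gini index from class counts.
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def gini(counts):
    p = np.array(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

parent = [9, 5]                 # parent node: 9 "yes" and 5 "no" samples
children = [[6, 2], [3, 3]]     # a candidate attribute splits them into two child nodes

n_total = sum(parent)
weighted_child_entropy = sum(sum(c) / n_total * entropy(c) for c in children)
info_gain = entropy(parent) - weighted_child_entropy

print("Entropy(S):", round(entropy(parent), 3))       # ~0.940
print("Information Gain:", round(info_gain, 3))       # ~0.048
print("Gini(S):", round(gini(parent), 3))             # ~0.459
```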


Pruning: Getting an Optimal Decision Tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A too-large tree increases the risk of overfitting, while a small tree may not capture all the
important features of the dataset. A technique that decreases the size of the learning
tree without reducing accuracy is therefore known as Pruning. There are mainly two types of
tree pruning techniques used:

o Cost Complexity Pruning


o Reduced Error Pruning.

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process that a human follows while
making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o A decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o With more class labels, the computational complexity of the decision tree may increase.

Naive Bayes

5. What is Bayes theorem? Explain Naïve bayes with an example. [7M] July
– 2023 Set -3[Remember]

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes


theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms, and it helps in building fast machine learning models that can make
quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be
described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple,
without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the below
steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

Frequency table for the Weather Conditions:

Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4

Likelihood table for the weather conditions:

Weather     No            Yes           P(Weather)
Overcast    0             5             5/14 = 0.35
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
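The same calculation can be reproduced in plain Python from the frequency table above (a minimal sketch; the small difference from the 0.41 obtained by hand is only due to rounding in the manual calculation):

```python
# Sketch: Naive Bayes posterior for the "Play" example, using the table counts.
freq = {"Overcast": {"Yes": 5, "No": 0},
        "Rainy":    {"Yes": 2, "No": 2},
        "Sunny":    {"Yes": 3, "No": 2}}

total_yes = sum(v["Yes"] for v in freq.values())   # 10
total_no  = sum(v["No"] for v in freq.values())    # 4
n = total_yes + total_no                           # 14

def posterior(weather):
    p_weather = (freq[weather]["Yes"] + freq[weather]["No"]) / n
    p_yes = (freq[weather]["Yes"] / total_yes) * (total_yes / n) / p_weather
    p_no  = (freq[weather]["No"]  / total_no)  * (total_no  / n) / p_weather
    return p_yes, p_no

p_yes, p_no = posterior("Sunny")
print(f"P(Yes|Sunny) = {p_yes:.2f}, P(No|Sunny) = {p_no:.2f}")  # 0.60 and 0.40
print("Play" if p_yes > p_no else "Don't play")
```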

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification problems, it
means a particular document belongs to which category such as Sports, Politics,
education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular
word is present or not in a document. This model is also famous for document
classification tasks.
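A minimal scikit-learn sketch of the three variants is shown below (it assumes scikit-learn is installed; the toy feature matrices are made up for illustration only):

```python
# Sketch: the three Naive Bayes variants in scikit-learn on tiny toy data.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Gaussian NB: continuous features assumed to be normally distributed per class
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
print(GaussianNB().fit(X_cont, y).predict([[3.1, 4.1]]))

# Multinomial NB: word-count style features (e.g. document classification)
X_counts = np.array([[2, 0, 1], [3, 0, 0], [0, 4, 1], [0, 3, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[0, 2, 1]]))

# Bernoulli NB: binary presence/absence features
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[0, 1, 1]]))
```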

Linear Models:

Linear Regression

6. Discuss about Linear regression with an example. [7M] July – 2023 Set -
4[Understand]

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or
more independent (x) variables, hence it is called linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables.

Mathematically, we can represent a linear regression as:

Y = a0 + a1X + ε
Here,

Y = Dependent Variable (Target Variable)
X = Independent Variable (Predictor Variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error

The values for X and Y variables are training datasets for Linear Regression model
representation.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.

Examples of Linear Regression

The weight of a person is linearly related to their height, so this shows a linear relationship
between the height and weight of the person. According to this, as the height increases, the
weight of the person will also increase. It is not necessary that one variable is dependent on
the other, or that one causes the other, but there is some critical relationship between the two
variables. In such cases, we use a scatter plot to visualize the strength of the relationship
between the variables. If there is no relation or link between the variables, then the scatter
plot does not indicate any increasing or decreasing pattern. In such cases, linear
regression is not suitable for the given data.

Linear Regression Formula


As we know, linear regression shows the linear relationship between two variables. The
equation of linear regression is similar to the slope formula we have learned before in earlier
classes, such as a linear equation in two variables. The Linear Regression Formula is given by the equation
Y = a + bX
We find the values of a and b using the formulas below:

b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
a = (Σy − bΣx) / n

Find a linear regression equation for the following two sets of data:

x 2 4 6 8

y 3 7 5 10
Sol: To find the linear regression equation we need to find the values of Σx, Σy, Σx², and Σxy.

Construct the table and find the value

x y x² xy

2 3 4 6

4 7 16 28

6 5 36 30

8 10 64 80

Σx = 20 Σy = 25 Σx² = 120 Σxy = 144

The formula of the linear equation is y = a + bx. Using the formulas above, we find the values of a
and b:

b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = (4 × 144 − 20 × 25) / (4 × 120 − 20²) = (576 − 500) / (480 − 400) = 76/80 = 0.95
a = (Σy − bΣx) / n = (25 − 0.95 × 20) / 4 = 6/4 = 1.5

Hence, we get the values a = 1.5 and b = 0.95.
The linear equation is given by
Y = a + bX
Now put the values of a and b into the equation.

Hence, the equation of linear regression is y = 1.5 + 0.95x
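The worked example can be checked with a short NumPy sketch (assuming NumPy is available); np.polyfit fits the same least-squares line:

```python
# Sketch: verify the hand-computed regression line y = 1.5 + 0.95x.
import numpy as np

x = np.array([2, 4, 6, 8])
y = np.array([3, 7, 5, 10])

b, a = np.polyfit(x, y, 1)          # slope first, then intercept
print(f"a = {a:.2f}, b = {b:.2f}")  # a = 1.50, b = 0.95
print("prediction at x = 5:", a + b * 5)
```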

Logistic Regression
7. Can Logistic regression be used for classification or regression? Discuss about
Logistic Regression algorithm. [7M] July – 2023 Set -2[Understand]

o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except in how it is used.
Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The logistic regression model uses the logistic function (sigmoid function) to transform a
linear combination of input features into values between 0 and 1. The logistic function is
defined as:

σ(z) = 1 / (1 + e^(−z))

The general form of the logistic regression equation for a binary classification problem
with one independent variable is:

P(Y=1) = 1 / (1 + e^(−(β0 + β1X)))
Here:
 P(Y=1) is the probability of the dependent variable (Y) being 1 (belonging to the
positive class).
 e is the base of the natural logarithm.
 β0 is the intercept term.
 β1 is the coefficient associated with the independent variable (X).
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the
Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and values
below the threshold tend to 0.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
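A minimal scikit-learn sketch of binomial logistic regression is shown below (it assumes scikit-learn is installed; the hours-studied data is made up for illustration):

```python
# Sketch: binomial logistic regression with probability output and hard class output.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature (e.g. hours studied) and a binary outcome (pass = 1 / fail = 0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print("Intercept (β0):", model.intercept_)
print("Coefficient (β1):", model.coef_)
print("P(Y=1) for x=2.2:", model.predict_proba([[2.2]])[0, 1])  # probability between 0 and 1
print("Predicted class for x=2.2:", model.predict([[2.2]])[0])  # thresholded at 0.5
```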

8. What is the purpose of sigmoid function in Logistic Regression? Explain. [7M] July –
2023 Set -3,4[Remember]
The sigmoid function in logistic regression serves a crucial purpose by transforming the
linear combination of input features into a probability value. The logistic regression model
predicts the probability that an instance belongs to a particular class, and the sigmoid function
is instrumental in achieving this. Here are the key purposes of the sigmoid function in logistic
regression:
1. Outputs a Probability between 0 and 1:
- The sigmoid function ensures that the output of the logistic regression model is bounded
between 0 and 1. The formula σ(z) = 1 / (1 + e^(−z)) maps any real-valued input z to the range [0, 1].
- This is crucial because the output is interpreted as the probability that the instance belongs
to the positive class (class 1).
2. Interpretation of Probabilities:
- The output of the sigmoid function can be interpreted as the estimated probability that a
given instance belongs to the positive class. For example, if P(Y=1) is 0.8, it implies an 80%
probability that the instance is in class 1.
3. Log-Odds Transformation:
- The sigmoid function is used to transform the linear combination of input features and
coefficients in the logistic regression equation into log-odds. The log-odds represent the
logarithm of the odds of the event occurring (in this case, the odds of belonging to class 1).
- This transformation allows the linear model to capture non-linear relationships between
features and the log-odds of the dependent variable.
4. Threshold for Classification:
- The probabilities predicted by the sigmoid function can be used to make binary
classifications. By choosing a threshold (commonly 0.5), instances with predicted
probabilities above the threshold are classified as belonging to the positive class, and those
below are classified as belonging to the negative class.
5. Smooth Gradient Descent:
- When training the logistic regression model using techniques like gradient descent, the
sigmoid function provides smooth derivatives. This smoothness aids in stable and efficient
convergence during the optimization process.
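A tiny NumPy sketch of points 1 and 4 above (the input values are arbitrary and NumPy is assumed to be installed):

```python
# Sketch: the sigmoid maps any real value into (0, 1); thresholding at 0.5 gives a class.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # any real-valued inputs
p = sigmoid(z)                               # mapped into (0, 1)
labels = (p >= 0.5).astype(int)              # threshold at 0.5 for classification
print(p)       # approx [0.018 0.269 0.5 0.731 0.982]
print(labels)  # [0 0 1 1 1]
```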
Generalized Linear Models
9. What are General Linear Models? Give their parametric equations. [7M]
July – 2023 Set -1[Remember]
The Generalized Linear Model (GLM) is a unified procedure for modeling and fitting
distributions to data with systematic differences. It unifies various statistical models,
including linear regression, logistic regression, and Poisson regression.

It was introduced by John Nelder and Robert Wedderburn in 1972.


The GLM is a larger class of models. Some examples of GLM statistics models include
ANOVA, T-test, Chi-Square, and Linear Regression.
The general form of the parametric equation for a Generalized Linear Model (GLM) can be
expressed as:
g(μ)=β0+β1x1+β2x2+…+βnxn
Here, the terms in the equation:
 g(μ): The link function, which relates the expected value (μ) of the response variable
to the linear combination of predictor variables. The link function is chosen based on
the distribution of the response variable and the modeling assumptions.
 μ: The expected value of the response variable. The link function is applied to μ to
transform the linear predictor into a scale that is appropriate for the specific
distribution of the response variable.
 β0,β1,…,βn: The parameters or coefficients associated with the predictor variables x1
,x2,…,xn. These are estimated during the model fitting process.
The specific form of the link function g(⋅) depends on the distribution of the response
variable. Here are a few common link functions for different distributions:
1. Linear Regression (Normal distribution):
 Link Function: Identity Link
 Equation: μ=β0+β1x1+β2x2+…+βnxn
2. Logistic Regression (Binomial distribution for binary outcomes):
 Link Function: Logit Link
 Equation: log(μ / (1 − μ)) = β0 + β1x1 + β2x2 + … + βnxn
3. Poisson Regression (Poisson distribution for count data):
 Link Function: Log Link
 Equation: log(μ) = β0 + β1x1 + β2x2 + … + βnxn
4. Gamma Regression (Gamma distribution for continuous, positive outcomes):
 Link Function: Inverse Link
 Equation: 1/μ = β0 + β1x1 + β2x2 + … + βnxn

In these equations, μ is the expected value of the response variable, and the link function g(⋅)
transforms the linear predictor to accommodate the specific distributional characteristics of
the response variable. The parameters β0,β1,…,βn are estimated during the model fitting
process, typically using Maximum Likelihood Estimation (MLE).
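As an illustration, the following statsmodels sketch fits a Poisson GLM with a log link (it assumes statsmodels is installed; the data is simulated purely for illustration):

```python
# Sketch: fitting a Poisson GLM (log link) by maximum likelihood with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
mu = np.exp(0.5 + 1.2 * x)          # log link: log(mu) = b0 + b1*x
y = rng.poisson(mu)

X = sm.add_constant(x)                                # adds the intercept column
model = sm.GLM(y, X, family=sm.families.Poisson())    # other families: Gaussian(), Binomial(), Gamma()
result = model.fit()                                  # fitted by maximum likelihood (IRLS)
print(result.params)                                  # estimates of b0 and b1
```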

10. Explain about ANOVA in detail. [7M] July – 2023 Set -1[Understand]
Analysis of Variance (ANOVA) is a statistical method used in machine learning to compare
the means of more than two groups. It's used to determine if there's a difference in the mean
somewhere in the model, but it doesn't indicate where the difference is.
An ANOVA test checks whether there is a difference in the averages somewhere in the model
(i.e., whether there is an overall effect or not); however, this method does not tell us where the
difference lies (if there is one). We can locate the difference between groups by conducting
post hoc tests.
However, in order to perform any tests, we first have to define the null and alternate
hypotheses:
1. Null Hypothesis: There is no significant difference between the groups.
2. Alternate Hypothesis: There is a significant difference between the groups.
We can perform an ANOVA Test by comparing two types of variations. The First variation is
between the sample means and the other one within each of the samples. The formula shown
below describes one-way ANOVA Test statistics.
The output of the ANOVA formula, the F statistic (also known as the F-ratio), enables the
analysis of the multiple sets of data in order to determine the variability among the samples
and within samples.
We can write the formula for the one-way ANOVA test statistic as:

F = MSB / MSW
MSB = Σᵢ nᵢ(ȳᵢ − ȳ)² / (k − 1)   (mean sum of squares between the groups)
MSW = Σᵢ Σⱼ (yᵢⱼ − ȳᵢ)² / (N − k)   (mean sum of squares within the groups)

Where,
ȳᵢ - Sample mean of the i-th group
nᵢ - Number of observations in the i-th group
ȳ - Overall mean of the data
k - Total number of groups
yᵢⱼ - j-th observation in the i-th of the k groups
N - Overall sample size
These components are typically summarized in an ANOVA table listing the source of variation,
the sum of squares, the degrees of freedom, the mean square, and the F value.
ANOVA Test Assumptions
Before performing an ANOVA test, we must make certain assumptions, as shown below:
1. We can obtain observations randomly and independently from the population defined
by the factor levels.
2. The data for every level of the factor is approximately normally distributed.
3. Case Independent: The sample cases must be independent of each other.
4. Variance Homogeneity: Homogeneity signifies that the variances of the groups
need to be approximately equal.
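Once these assumptions hold, a one-way ANOVA test can be run with SciPy, as in the following minimal sketch (it assumes SciPy is installed; the three groups are made-up sample measurements):

```python
# Sketch: one-way ANOVA comparing the means of three groups.
from scipy import stats

group_a = [23, 25, 27, 22, 26]
group_b = [30, 31, 29, 32, 28]
group_c = [24, 26, 25, 27, 23]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print("F statistic:", f_stat)
print("p-value:", p_value)
# If the p-value < 0.05 we reject the null hypothesis that all group means are equal;
# post hoc tests would then be needed to locate where the difference lies.
```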

Understanding the Types of ANOVA Tests


The ANOVA Tests can be classified into three major types. These types are shown below:

1. One-Way ANOVA Test


2. Two-Way ANOVA Test
3. n-Way ANOVA Test

One-Way ANOVA Test


An Analysis of Variance Test that has only one independent variable is known as the One-way
ANOVA Test.
For instance, we could assess the differences in Coronavirus cases across countries, where Country
is the single independent variable with multiple categories for comparison.

Two-Way ANOVA Test


An Analysis of Variance Test that has two independent variables is known as a Two-way
ANOVA test. This test is also known as Factorial ANOVA Test.

For example, expanding the above example, a two-way ANOVA can examine the difference
in the cases of Coronavirus (the dependent variable) by Age Group (the first independent
variable) and Gender (the second independent variable).

n-Way ANOVA Test


An Analysis of Variance Test is considered an n-way ANOVA Test if a researcher uses
more than two independent variables; here n represents the number of independent variables.
When multiple dependent variables are analysed together, the test is known as a MANOVA Test.

For example, we can examine potential differences in cases of Coronavirus using


independent variables like Country, Age group, Gender, Ethnicity, and a lot more
simultaneously.

An ANOVA Test will provide us a single (univariate) F-value; however, a MANOVA Test
will provide us a multivariate F-value.
Support Vector Machines
11. What is Overfitting? Explain about SVM algorithm to overcome it?
[7M] July – 2023 Set -4[Remember]
Overfitting in Machine Learning
A statistical model is said to be overfitted when the model does not make accurate predictions
on testing data. When a model is trained too closely on the training data, it starts learning from
the noise and inaccurate data entries in the data set; when tested with test data, this results in
high variance. The model then fails to categorize the data correctly, because of too many details
and noise.
Overfitting is most often caused by non-parametric and non-linear methods, because these types
of machine learning algorithms have more freedom in building the model from the
dataset and can therefore build unrealistic models.
A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to use
parameters such as the maximal depth if we are using decision trees.
In short, overfitting is a problem where the performance of a machine learning algorithm on
training data differs from its performance on unseen data.

Reasons for Overfitting:


1. High variance and low bias.
2. The model is too complex.
3. The size of the training data.
Techniques to Reduce Overfitting
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (monitor the loss during training and stop
as soon as the loss begins to increase).
4. Ridge Regularization and Lasso Regularization.
5. Use dropout for neural networks to tackle overfitting.

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector
Machine. The two different categories are classified using a decision boundary, or hyperplane.

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:


o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can
be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a
dataset cannot be classified by using a straight line, then such data is termed non-linear
data and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane:

There can be multiple lines/decision boundaries to segregate the classes in n-dimensional


space, but we need to find out the best decision boundary that helps to classify the data
points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset, which means
that if there are 2 features, the hyperplane will be a straight line, and if
there are 3 features, the hyperplane will be a two-dimensional plane.

We always create a hyperplane that has the maximum margin, i.e., the maximum
distance between the data points of the two classes.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.
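A minimal scikit-learn sketch contrasting a linear and a non-linear (RBF-kernel) SVM is shown below (it assumes scikit-learn is installed and uses a synthetic two-class dataset; limiting C and choosing a sensible kernel are the usual levers against overfitting):

```python
# Sketch: linear vs RBF-kernel SVM on data that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # not linearly separable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:   ", rbf_svm.score(X_test, y_test))
print("Support vectors per class (RBF):", rbf_svm.n_support_)
```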

Binary Classification: Multiclass/Structured outputs


12. Discuss about multi class classification technique. [7M] July – 2023 Set -
3[Understand]
Classification means categorizing data and forming groups based on similarities. In a
dataset, the independent variables or features play a vital role in classifying our data. When
we talk about multiclass classification, we have more than two classes in our dependent or
target variable. For example, in the Iris dataset the target variable has
three categories, i.e., Virginica, Setosa, and Versicolor, which are three species of the Iris plant.
We will use this dataset below as an example for a conceptual understanding of multiclass
classification.

Confusion Matrix in Multi-class Classification


A confusion matrix is a table that is used in every classification problem to describe the
performance of a model on test data.
As with the confusion matrix in binary classification, in multiclass classification we can
also compute precision and recall for each class.

Let's take an example to get a better idea of the confusion matrix in multiclass classification,
using the Iris dataset introduced above.
Finding precision and recall from the confusion matrix:
Precision for the Virginica class is the number of correctly predicted Virginica species out of all
the predicted Virginica species, which is 4/7 = 57.1%. This means that only 4/7 of the species
that our predictor classifies as Virginica are actually Virginica. Similarly, for the other
species, i.e. Setosa and Versicolor, precision is 20% and 62.5% respectively.
Recall for the Virginica class is the number of correctly predicted Virginica species out
of the actual Virginica species, which is 50%. This means that our classifier classified half of the
Virginica species as Virginica. Similarly, for the other species, i.e. Setosa and
Versicolor, recall is 20% and 71.4% respectively.
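The per-class precision and recall described above can be computed with scikit-learn as in the following sketch (it assumes scikit-learn is installed; the classifier and split are illustrative, so the numbers will not exactly match the percentages quoted above):

```python
# Sketch: multiclass confusion matrix and per-class precision/recall on Iris.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

y_pred = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict(X_test)

print(confusion_matrix(y_test, y_pred))                       # rows: actual, columns: predicted
print("Precision per class:", precision_score(y_test, y_pred, average=None))
print("Recall per class:   ", recall_score(y_test, y_pred, average=None))
```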
Multiclass Vs Multi-label
People often get confused between multiclass and multi-label classification. But these two
terms are very different and cannot be used interchangeably. We have already understood
what multiclass is all about. Let’s discuss in brief how multi-label is different from
multiclass.
Multi-label refers to a data point that may belong to more than one class. For example, you
wish to watch a movie with your friends but you have a different choice of genres that you all
enjoy. Some of your friends like comedy and others are more into action and thrill. Therefore,
you search for a movie that fulfills both the requirements and here, your movie is supposed to
have multiple labels. Whereas, in multiclass or binary classification, your data point can
belong to only a single class. Some more examples of the multi-label dataset could be protein
classification in the human body, or music categorization according to genres. It is also one
of the concepts widely used in photo classification.

MNIST
13. Explain about MNIST dataset. Describe the procedure to apply
classification technique. [7M] July – 2023 Set -2[Understand]
The MNIST database (Modified National Institute of Standards and Technology database) is
a large database of handwritten digits that is commonly used for training various image
processing systems. The database is also widely used for training and testing in the field of
machine learning.
It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that
since NIST's training dataset was taken from American Census Bureau employees, while the
testing dataset was taken from American high school students, it was not well-suited for
machine learning experiments. Furthermore, the black and white images from NIST were
normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced
grayscale levels.
The MNIST database contains 60,000 training images and 10,000 testing images. Half of the
training set and half of the test set were taken from NIST's training dataset, while the other
half of the training set and the other half of the test set were taken from NIST's testing
dataset. The original creators of the database keep a list of some of the methods tested on it.
In their original paper, the creators used a support-vector machine to get an error rate of 0.8%. An
extended dataset similar to MNIST, called EMNIST, was published in 2017; it contains 240,000
training images and 40,000 testing images of handwritten digits and characters.

What is MNIST used for?


MNIST provides a baseline for testing image processing systems. You could consider it as the
“hello world” of machine learning. Data scientists will train an algorithm on the MNIST
dataset simply to test a new architecture or framework, to ensure that they work.
Because MNIST is a labeled dataset that pairs images of hand-written numerals with the
name of the respective numeral, it can be used in supervised learning to train classifiers. It is
a good example, alongside Fei Fei Li’s ImageNet, of how a good, labeled dataset can advance
the cause of machine learning more broadly.

Is MNIST data binary?


The MNIST database was constructed from NIST's Special Database 3 and Special Database
1 which contain binary images of handwritten digits. NIST originally designated SD-3 as
their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to
recognize than SD-1. The reason for this can be found in the fact that SD-3 was collected
among Census Bureau employees, while SD-1 was collected among high-school students.
Drawing sensible conclusions from learning experiments requires that the result be
independent of the choice of training set and test among the complete set of samples.
Therefore it was necessary to build a new database by mixing NIST's datasets.

How MNIST dataset is created?


The MNIST handwritten digit classification problem is a standard dataset used in computer
vision and deep learning.
Although the dataset is effectively solved, it can be used as the basis for learning and
practicing how to develop, evaluate, and use convolutional deep learning neural networks for
image classification from scratch. This includes how to develop a robust test harness for
estimating the performance of the model, how to explore improvements to the model, and
how to save the model and later load it to make predictions on new data.
Although the MNIST dataset is effectively solved, it can be a useful starting point for
developing and practicing a methodology for solving image classification tasks using
convolutional neural networks.
Instead of reviewing the literature on well-performing models on the dataset, we can develop
a new model from scratch.
The dataset already has a well-defined train and test dataset that we can use.
In order to estimate the performance of a model for a given training run, we can further split
the training set into a train and validation dataset. Performance on the train and validation
dataset over each run can then be plotted to provide learning curves and insight into how well
a model is learning the problem.

How to Develop a Baseline Model?


The first step is to develop a baseline model.
This is critical as it both involves developing the infrastructure for the test harness so that any
model we design can be evaluated on the dataset, and it establishes a baseline in model
performance on the problem, by which all improvements can be compared.
The design of the test harness is modular, and we can develop a separate function for each
piece. This allows a given aspect of the test harness to be modified or inter-changed, if we
desire, separately from the rest.
We can develop this test harness with five key elements. They are the loading of the dataset,
the preparation of the dataset, the definition of the model, the evaluation of the model, and
the presentation of results.
Load Dataset
We know some things about the dataset.
For example, we know that the images are all pre-aligned (e.g. each image only contains a
hand-drawn digit), that the images all have the same square size of 28×28 pixels, and that the
images are grayscale.
Therefore, we can load the images and reshape the data arrays to have a single color channel.
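A minimal Keras sketch of this loading step is shown below (it assumes TensorFlow/Keras is installed and can download MNIST):

```python
# Sketch: load MNIST, add a single color channel, scale pixels, one-hot encode labels.
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(train_x, train_y), (test_x, test_y) = mnist.load_data()   # 60,000 train / 10,000 test images

# reshape to have a single color channel and scale pixel values to [0, 1]
train_x = train_x.reshape((train_x.shape[0], 28, 28, 1)).astype("float32") / 255.0
test_x = test_x.reshape((test_x.shape[0], 28, 28, 1)).astype("float32") / 255.0

# one-hot encode the digit labels 0-9 for use with a classifier
train_y = to_categorical(train_y)
test_y = to_categorical(test_y)

print(train_x.shape, train_y.shape)   # (60000, 28, 28, 1) (60000, 10)
```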
Why is MNIST a good dataset?
MNIST dataset is an extremely good database for people who want to try machine
learning techniques and pattern recognition methods on real-world data while spending
minimal time and effort on data preprocessing and formatting. Its simplicity and ease of use
are what make this dataset so widely used and deeply understood.

Ranking
14. What is ranking in binary classification in Machine Learning? What is
the best algorithm for ranking? [7M] July – 2023 Set -3[Remember]
In this question, by “ranking” we mean sorting documents by relevance to find contents of
interest with respect to a query. This is a fundamental problem of Information Retrieval, but
this task also arises in many other applications:
1. Search Engines — Given a user profile (location, age, sex, …) a textual query, sort
web pages results by relevance.
2. Recommender Systems — Given a user profile and purchase history, sort the other
items to find new potentially interesting products for the user.
3. Travel Agencies — Given a user profile and filters (check-in/check-out dates, number
and age of travelers, …), sort available rooms by relevance.

Ranking models typically work by predicting a relevance score s = f(x) for each input x = (q,
d) where q is a query and d is a document. Once we have the relevance of each document, we
can sort (i.e. rank) the documents according to those scores.
Ranking models rely on a scoring function.
The scoring model can be implemented using various approaches.
 Vector Space Models – Compute a vector embedding (e.g. using Tf-Idf or BERT) for
each query and document, and then compute the relevance score f(x) = f(q, d) as the
cosine similarity between the vectors embeddings of q and d.
 Learning to Rank – The scoring model is a Machine Learning model that learns to
predict a score s given an input x = (q, d) during a training phase where some sort of
ranking loss is minimized.
Here we focus on the latter approach, and we show how to implement Machine
Learning models for Learning to Rank.

Ranking Evaluation Metrics


Before analyzing various ML models for Learning to Rank, we need to define which metrics
are used to evaluate ranking models. These metrics are computed on the predicted documents
ranking, i.e. the k-th top retrieved document is the k-th document with highest predicted
score s.
Mean Average Precision (MAP)

MAP — Mean Average Precision.

Mean Average Precision is used for tasks with binary relevance, i.e. when the true score y of a
document d can be only 0 (non relevant) or 1 (relevant).
For a given query q and corresponding documents D = {d₁, …, dₙ}, we check how many of
the top k retrieved documents are relevant (y=1) or not (y=0), in order to
compute precision Pₖ and recall Rₖ. For k = 1…n we get different Pₖ and Rₖ values that define
the precision-recall curve: the area under this curve is the Average Precision (AP).
Finally, by computing the average of AP values for a set of m queries, we obtain the Mean
Average Precision (MAP).
Discounted Cumulative Gain (DCG)

DCG — Discounted Cumulative Gain.

Discounted Cumulative Gain is used for tasks with graded relevance, i.e. when the true
score y of a document d is a discrete value in a scale measuring the relevance w.r.t. a query q.
A typical scale is 0 (bad), 1 (fair), 2 (good), 3 (excellent), 4 (perfect).
For a given query q and corresponding documents D = {d₁, …, dₙ}, we consider the k-th
top retrieved document. The gain Gₖ = 2^yₖ – 1 measures how useful this document is (we
want documents with high relevance!), while the discount Dₖ = 1/log(k+1) penalizes
documents that are retrieved with a lower rank (we want relevant documents in the top
ranks!).
The sum of the discounted gain terms GₖDₖ for k = 1…n is the Discounted Cumulative Gain
(DCG). To make sure that this score is bound between 0 and 1, we can divide the measured
DCG by the ideal score IDCG obtained if we ranked documents by the true value yₖ. This
gives us the Normalized Discounted Cumulative Gain (NDCG), where NDCG = DCG/IDCG.
Finally, as for MAP, we usually compute the average of DCG or NDCG values for a set
of m queries to obtain a mean value.
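The DCG and NDCG definitions above translate directly into a short NumPy sketch (the relevance scores are made up for illustration):

```python
# Sketch: DCG/NDCG with gain 2^y - 1 and discount 1/log2(k+1), as defined above.
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    k = np.arange(1, len(relevances) + 1)
    gains = 2 ** relevances - 1
    discounts = 1.0 / np.log2(k + 1)
    return np.sum(gains * discounts)

def ndcg(relevances):
    ideal = np.sort(relevances)[::-1]   # documents ranked by their true relevance
    return dcg(relevances) / dcg(ideal)

# true relevance of documents in the order our model ranked them (scale 0-4)
ranked_relevances = [3, 2, 3, 0, 1, 2]
print("DCG: ", round(dcg(ranked_relevances), 3))
print("NDCG:", round(ndcg(ranked_relevances), 3))
```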
Machine Learning Models for Learning to Rank
To build a Machine Learning model for ranking, we need to define inputs, outputs and loss
function.
 Input – For a query q we have n documents D = {d₁, …, dₙ} to be ranked by
relevance. The elements xᵢ = (q, dᵢ) are the inputs to our model.
 Output – For a query-document input xᵢ = (q, dᵢ), we assume there exists a
true relevance score yᵢ. Our model outputs a predicted score sᵢ = f(xᵢ).
All Learning to Rank models use a base Machine Learning model (e.g. Decision
Tree or Neural Network) to compute s = f(x). The choice of the loss function is the distinctive
element for Learning to Rank models. In general, we have 3 approaches, depending on how
the loss is computed.
1. Pointwise Methods – The total loss is computed as the sum of loss terms defined
on each document dᵢ (hence pointwise) as the distance between the predicted
score sᵢ and the ground truth yᵢ, for i=1…n. By doing this, we transform our task into
a regression problem, where we train a model to predict y.
2. Pairwise Methods – The total loss is computed as the sum of loss terms defined
on each pair of documents dᵢ, dⱼ (hence pairwise) , for i, j=1…n. The objective on
which the model is trained is to predict whether yᵢ > yⱼ or not, i.e. which of two
documents is more relevant. By doing this, we transform our task into a binary
classification problem.
3. Listwise Methods – The loss is directly computed on the whole list of documents
(hence listwise) with corresponding predicted ranks. In this way, ranking metrics can
be more directly incorporated into the loss.
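As a concrete illustration of the pairwise idea, the following NumPy sketch computes a RankNet-style logistic loss over all pairs where one document is more relevant than the other (the scores and relevances are made up; a real implementation would train a scoring model against this loss):

```python
# Sketch: pairwise logistic (RankNet-style) ranking loss.
import numpy as np

def pairwise_logistic_loss(scores, relevances):
    scores = np.asarray(scores, dtype=float)
    relevances = np.asarray(relevances, dtype=float)
    loss, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if relevances[i] > relevances[j]:   # document i should be ranked above j
                # penalty is large when s_i - s_j is small or negative
                loss += np.log1p(np.exp(-(scores[i] - scores[j])))
                n_pairs += 1
    return loss / max(n_pairs, 1)

scores = [2.0, 0.5, 1.0]    # predicted scores s = f(x)
relevances = [3, 1, 2]      # ground-truth relevance y
print(pairwise_logistic_loss(scores, relevances))
```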

Machine Learning approaches to Learning to Rank: pointwise, pairwise, listwise.

A first approach is to use an iterative method where ranking metrics are used to re-
weight instances at each iteration. This is the approach used
by LambdaRank and LambdaMART, which are indeed between the pairwise and the listwise
approach.
A second approach is to approximate the objective to make it differentiable, which is the idea
behind SoftRank. Instead of predicting a deterministic score s = f(x), we predict
a smoothened probabilistic score s~ 𝒩(f(x), σ²). The ranks k are non-continuous functions
of predicted scores s, but thanks to the smoothening we can compute probability distributions
for the ranks of each document. Finally, we optimize SoftNDCG, the expected NDCG over
this rank distribution, which is a smooth function.
A third approach is to consider that each ranked list corresponds to a permutation, and to define
a loss over the space of permutations. In ListNet, given a list of scores s we define the
probability of any permutation using the Plackett-Luce model. Then, our loss is easily
computed as the Binary Cross-Entropy distance between true and predicted probability
distributions over the space of permutations.

Probability of various permutations using the Plackett-Luce model in ListNet.

Finally, the LambdaLoss paper introduced a new perspective on this problem, and created
a generalized framework to define new listwise loss functions and achieve state-of-the-art
accuracy. The main idea is to frame the problem in a rigorous and general way, as a mixture
model where the ranked list π is treated as a hidden variable. Then, the loss is defined as the
negative log likelihood of such model.

LambdaLoss loss function.



The authors of the LambdaLoss framework proved two essential results.
1. All other listwise methods (RankNet, LambdaRank, SoftRank, ListNet, …)
are special configurations of this general framework. Indeed, their losses are obtained
by accurately choosing the likelihood p(y | s, π) and the ranked list distribution p(π |
s).
2. This framework allows us to define metric-driven loss functions directly connected to
the ranking metrics that we want to optimize. This makes it possible to significantly improve
the state of the art on Learning to Rank tasks.
