Unit 2 ML Q&A
2. Manhattan Distance:
Named after New York City’s grid-like street layout, Manhattan distance calculates the
distance between two points using only horizontal and vertical movements. It is especially
used in analyses involving categorical and ordinal data.
3. Minkowski Distance
Minkowski distance generalizes the Euclidean and Manhattan distances. It allows for different
distance calculations based on a special parameter ‘p’, offering flexibility for various data
types.
4. Cosine Similarity
Cosine similarity measures the similarity between two vectors based on the angle between
them: it is computed as the cosine of that angle. This method is particularly used in areas like
text similarity and document classification.
5. Hamming Distance
Hamming distance is primarily used with binary data. It calculates the distance by counting
the number of differing characters between two strings. This plays a significant role in data
transmission and error correction coding.
6. Jaccard Similarity
Jaccard similarity measures the ratio of the intersection to the union of two sets. This is a
useful method for understanding similarities and differences in set-based data.
Each distance measurement offers tailored solutions for different data types and analysis
requirements. As a data analyst, knowing when and how to use these measurements can lead
to success in your machine learning projects.
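As a quick illustration (not part of the original notes), the sketch below computes each of these measures with NumPy and SciPy; the sample vectors are made-up values chosen only for demonstration.

import numpy as np
from scipy.spatial import distance

a = np.array([1, 0, 2, 3])   # example vectors (illustrative values)
b = np.array([2, 1, 0, 3])

print(distance.cityblock(a, b))          # Manhattan: sum of |a_i - b_i|
print(distance.minkowski(a, b, p=3))     # Minkowski with p = 3 (p=1 Manhattan, p=2 Euclidean)
print(1 - distance.cosine(a, b))         # cosine similarity = 1 - cosine distance
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # fraction of differing positions
print(distance.jaccard([1, 0, 1, 1], [1, 1, 0, 1]))  # Jaccard distance on binary vectors (similarity = 1 - distance)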
Nearest Neighbours
2. Explain KNN algorithm with an example. [7M] July – 2023 Set -
1[Understand]
KNN is one of the most basic yet essential classification algorithms in machine learning. It
belongs to the supervised learning domain and finds intense application in pattern
recognition, data mining, and intrusion detection. It is widely applicable in real-life
scenarios since it is non-parametric, meaning it does not make any underlying assumptions
about the distribution of data (as opposed to other algorithms such as GMM, which assume
a Gaussian distribution of the given data).
We are given some prior data (also called training data), which classifies coordinates into
groups identified by an attribute. A new point is then assigned to the group that is most
common among its K nearest neighbours, as in the sketch below.
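A minimal sketch of the idea, assuming scikit-learn and a toy two-group dataset (both are illustrative choices, not from the original answer):

from sklearn.neighbors import KNeighborsClassifier

# Toy training data: 2-D coordinates labelled into two groups (illustrative values)
X_train = [[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 7]]
y_train = ['A', 'A', 'A', 'B', 'B', 'B']

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3 nearest neighbours
knn.fit(X_train, y_train)

# The new point (2, 2) lies closest to the 'A' cluster, so it is predicted as 'A'
print(knn.predict([[2, 2]]))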
Decision Trees
4. Explain about Decision tree classifier with an example. [7M] July – 2023
Set - 2[Understand]
Decision Tree
Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the outputs of those decisions and do not contain any further branches.
In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the
tree into subtrees.
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes;
the final node is then called a leaf node.
Attribute Selection Measures
While implementing a Decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems, a technique called
Attribute Selection Measure (ASM) is used. With this measurement, we can easily select
the best attribute for the nodes of the tree. There are two popular techniques for ASM, which
are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,
S = total number of samples
P(yes) = probability of yes
P(no) = probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σj (Pj)²

Where Pj is the probability of an element being classified into class j.
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
tree without reducing accuracy is known as Pruning. There are mainly two types of
tree pruning techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning
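A minimal scikit-learn sketch of a CART-style decision tree is given below; the Iris dataset, the Gini criterion, and the depth limit are assumptions made only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion='gini' uses the Gini index as the ASM; criterion='entropy' uses information gain.
# max_depth limits tree size, acting like pre-pruning to reduce overfitting.
tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # accuracy on held-out data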
Naive Bayes
5. What is Bayes theorem? Explain Naïve bayes with an example. [7M] July
– 2023 Set -3[Remember]
The Naïve Bayes algorithm consists of two words, Naïve and Bayes, which can be
described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple
without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,
P(A|B) is Posterior probability: probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: probability of the evidence given that the hypothesis is true.
P(A) is Prior probability: probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the below
steps:
1. Convert the given dataset into frequency tables.
2. Generate a Likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the Player play or not?

Frequency table for the Weather conditions:
Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4

Likelihood table of the Weather conditions:
Weather    No          Yes
Overcast   0           5           5/14 = 0.35
Rainy      2           2           4/14 = 0.29
Sunny      2           3           5/14 = 0.35
All        4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30*0.71/0.35 = 0.60

P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50*0.29/0.35 = 0.41

Since P(Yes|Sunny) > P(No|Sunny), the player can play the game on a sunny day.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomially distributed. It is primarily used for document classification problems,
i.e. determining which category a particular document belongs to, such as Sports, Politics,
Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular
word is present or not in a document. This model is also well known for document
classification tasks.
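For illustration (an assumed example, not from the original notes), scikit-learn offers GaussianNB, MultinomialNB, and BernoulliNB for these three model types; the sketch below applies GaussianNB to the Iris dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()            # assumes each feature follows a normal distribution per class
nb.fit(X_train, y_train)     # learns class priors P(y) and per-class feature likelihoods
print(nb.score(X_test, y_test))        # accuracy on held-out data
print(nb.predict_proba(X_test[:1]))    # posterior probabilities P(y | x) for one sample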
Linear Models:
Linear Regression
6. Discuss about Linear regression with an example. [7M] July – 2023 Set -
4[Understand]
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) and one or
more independent (x) variables, hence it is called linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes
according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Mathematically, we can represent linear regression as:

Y = a0 + a1X + ε

Here,
Y = dependent variable (target variable)
X = independent variable (predictor variable)
a0 = intercept of the line
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values for the X and Y variables are training datasets for the Linear Regression model
representation.
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression: a single independent variable is used to predict the value of a
numerical dependent variable.
o Multiple Linear Regression: more than one independent variable is used to predict the value
of a numerical dependent variable.
For example, the weight of a person is linearly related to their height, so there is a linear
relationship between the height and weight of the person. According to this, as the height
increases, the weight of the person also increases. It is not necessary that one variable is
dependent on the other, or that one causes the other, but there is some critical relationship
between the two variables. In such cases, we use a scatter plot to visualize the strength of
the relationship between the variables. If there is no relation or link between the variables,
the scatter plot does not indicate any increasing or decreasing pattern; in that case, linear
regression is not suitable for the given data.
Find a linear regression equation for the following two sets of data:

x: 2 4 6 8
y: 3 7 5 10

Sol: To find the linear regression equation we need to find the values of Σx, Σy, Σx² and Σxy.

x     y     x²    xy
2     3     4     6
4     7     16    28
6     5     36    30
8     10    64    80
Σx=20 Σy=25 Σx²=120 Σxy=144

The formula of the linear equation is y = a + bx. Using the least-squares formulas, we find
the values of a and b (with n = 4):

b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = (4×144 − 20×25) / (4×120 − 20²) = 76/80 = 0.95
a = (Σy − bΣx) / n = (25 − 0.95×20) / 4 = 1.5

Hence, we get the values a = 1.5 and b = 0.95.
The linear equation is given by
y = a + bx
Now put the values of a and b in the equation:
y = 1.5 + 0.95x
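The same least-squares result can be checked with a short NumPy sketch; the use of np.polyfit here is just one convenient way to fit the line.

import numpy as np

x = np.array([2, 4, 6, 8])
y = np.array([3, 7, 5, 10])

# np.polyfit returns the slope b and intercept a of the least-squares line y = a + b*x
b, a = np.polyfit(x, y, deg=1)
print(a, b)            # expected: a = 1.5, b = 0.95
print(a + b * 10)      # predicted y for x = 10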
Logistic Regression
7. Can Logistic regression be used for classification or regression? Discuss about
Logistic Regression algorithm. [7M] July – 2023 Set -2[Understand]
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is much similar to Linear Regression except for how it is used.
Linear Regression is used for solving regression problems, whereas Logistic Regression
is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The logistic regression model uses the logistic function (sigmoid function) to transform a
linear combination of input features into values between 0 and 1. The logistic function is
defined as:

σ(z) = 1 / (1 + e^(−z))

The general form of the logistic regression equation for a binary classification problem
with one independent variable is:

P(Y=1) = 1 / (1 + e^(−(β0 + β1X)))
Here:
P(Y=1) is the probability of the dependent variable (Y) being 1 (belonging to the
positive class).
e is the base of the natural logarithm.
β0 is the intercept term.
β1 is the coefficient associated with the independent variable (X).
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the
Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
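A minimal scikit-learn sketch of logistic regression is shown below; the breast-cancer dataset and the solver settings are assumptions made for illustration only.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)   # larger max_iter so the solver converges
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test[:1]))  # probabilities between 0 and 1 for each class
print(clf.predict(X_test[:1]))        # class label after applying the 0.5 threshold
print(clf.score(X_test, y_test))      # classification accuracy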
8. What is the purpose of sigmoid function in Logistic Regression? Explain. [7M] July –
2023 Set -3,4[Remember]
The sigmoid function in logistic regression serves a crucial purpose by transforming the
linear combination of input features into a probability value. The logistic regression model
predicts the probability that an instance belongs to a particular class, and the sigmoid function
is instrumental in achieving this. Here are the key purposes of the sigmoid function in logistic
regression:
1. Outputs Probability between 0 and 1:
- The sigmoid function ensures that the output of the logistic regression model is bounded
between 0 and 1. The formula σ(z) = 1 / (1 + e^(−z)) maps any real-valued input z to the
range [0, 1].
- This is crucial because the output is interpreted as the probability that the instance belongs
to the positive class (class 1).
2. Interpretation of Probabilities:
- The output of the sigmoid function can be interpreted as the estimated probability that a
given instance belongs to the positive class. For example, if P(Y=1) is 0.8, it implies an 80%
probability that the instance is in class 1.
3. Log-Odds Transformation:
- In the logistic regression equation, the linear combination of input features and coefficients
represents the log-odds, and the sigmoid function converts these log-odds into a probability.
The log-odds represent the logarithm of the odds of the event occurring (in this case, the
odds of belonging to class 1).
- This transformation allows the linear model to capture non-linear relationships between
features and the log-odds of the dependent variable.
4. Threshold for Classification:
- The probabilities predicted by the sigmoid function can be used to make binary
classifications. By choosing a threshold (commonly 0.5), instances with predicted
probabilities above the threshold are classified as belonging to the positive class, and those
below are classified as belonging to the negative class.
5. Smooth Gradient Descent:
- When training the logistic regression model using techniques like gradient descent, the
sigmoid function provides smooth derivatives. This smoothness aids in stable and efficient
convergence during the optimization process.
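The role of the sigmoid can be seen directly in a small NumPy sketch; the example scores and the 0.5 threshold are assumed values for demonstration.

import numpy as np

def sigmoid(z):
    # maps any real value z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # example linear scores beta0 + beta1*x
p = sigmoid(z)
print(p)                        # approx [0.018, 0.269, 0.5, 0.731, 0.982]
print((p >= 0.5).astype(int))   # apply a 0.5 threshold to obtain class labels 0/1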
Generalized Linear Models
9. What are General Linear Models? Give their parametric equations. [7M]
July – 2023 Set -1[Remember]
Generalized Linear Models (GLMs) provide a unified procedure for modeling and fitting
distributions to data with systematic differences. The framework unifies various statistical
models, including linear regression, logistic regression, and Poisson regression. Each model
is defined by a response distribution and a link function g(⋅) that relates the expected
response μ to the linear predictor:

1. Linear Regression (Gaussian distribution for continuous outcomes):
Link Function: Identity Link
Equation: μ = β0 + β1X1 + … + βnXn
2. Logistic Regression (Binomial distribution for binary outcomes):
Link Function: Logit Link
Equation: log(μ / (1 − μ)) = β0 + β1X1 + … + βnXn
3. Poisson Regression (Poisson distribution for count outcomes):
Link Function: Log Link
Equation: log(μ) = β0 + β1X1 + … + βnXn
4. Gamma Regression (Gamma distribution for continuous, positive outcomes):
Link Function: Inverse Link
Equation: 1/μ = β0 + β1X1 + … + βnXn
In these equations, μ is the expected value of the response variable, and the link function g(⋅)
transforms the linear predictor to accommodate the specific distributional characteristics of
the response variable. The parameters β0,β1,…,βn are estimated during the model fitting
process, typically using Maximum Likelihood Estimation (MLE).
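As a hedged illustration of fitting a GLM by MLE, the sketch below uses statsmodels to fit a Poisson GLM with the default log link on made-up count data; the coefficients and sample size are arbitrary assumptions.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)          # one predictor (made-up data)
mu = np.exp(0.5 + 1.2 * x)               # true mean under the log link: log(mu) = 0.5 + 1.2*x
y = rng.poisson(mu)                      # Poisson-distributed counts

X = sm.add_constant(x)                   # adds the intercept column for beta0
model = sm.GLM(y, X, family=sm.families.Poisson())  # Poisson family, default log link
result = model.fit()                     # maximum likelihood estimation
print(result.params)                     # estimates of beta0 and beta1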
10. Explain about ANOVA in detail. [7M] July – 2023 Set -1[Understand]
Analysis of Variance (ANOVA) is a statistical method used in machine learning to compare
the means of more than two groups. It's used to determine if there's a difference in the mean
somewhere in the model, but it doesn't indicate where the difference is.
An ANOVA test checks whether there is a difference in the averages somewhere in the model
(i.e., whether there was an overall effect); however, this method does not tell us where the
difference is (if there is one). We can find the location of the difference between the groups
by conducting post hoc tests.
However, in order to perform any tests, we first have to define the null and alternate
hypotheses:
1. Null Hypothesis:There is no noteworthy difference between the groups.
2. Alternate Hypothesis:There is a noteworthy difference between the groups.
We can perform an ANOVA Test by comparing two types of variations. The First variation is
between the sample means and the other one within each of the samples. The formula shown
below describes one-way ANOVA Test statistics.
The output of the ANOVA formula, the F statistic (also known as the F-ratio), enables the
analysis of the multiple sets of data in order to determine the variability among the samples
and within samples.
We can write the formula for the one-way ANOVA test as illustrated below:

F = [Σ ni (yi − y)² / (k − 1)] / [Σi Σj (yij − yi)² / (N − k)]

Where,
yi - sample mean in the ith group
ni - number of observations in the ith group
y - total mean of the data
k - total number of groups
yij - jth observation in the ith of the k groups
N - overall sample size
Whenever we draw up the ANOVA table, we can see all the above components in the following
format:

Source of Variation     Sum of Squares   Degrees of Freedom   Mean Square         F
Between groups          SSB              k − 1                MSB = SSB/(k − 1)   F = MSB/MSW
Within groups (error)   SSW              N − k                MSW = SSW/(N − k)
Total                   SST              N − 1
ANOVA Test Assumptions
Before performing an ANOVA test, we must make certain assumptions, as shown below:
1. We can obtain observations randomly and independently from the population defined
by the factor levels.
2. The data for every level of the factor is normally distributed.
3. Case Independent: The sample cases must be independent of each other.
4. Variance Homogeneity: Homogeneity signifies that the variance between the group
needs to be around equal.
While a one-way ANOVA uses a single independent variable, a two-way ANOVA examines
two. For example, a two-way ANOVA can examine the difference in the cases of Coronavirus
(the dependent variable) by Age Group (the first independent variable) and Gender (the
second independent variable).
An ANOVA Test will provide us a single (univariate) F-value; however, a MANOVA Test
will provide us a multivariate F-value.
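A one-way ANOVA can be run with SciPy as in the sketch below; the three groups contain made-up measurements used only for illustration.

from scipy import stats

# Three independent groups (illustrative measurements)
group1 = [23, 25, 27, 22, 26]
group2 = [30, 31, 29, 32, 28]
group3 = [24, 26, 25, 27, 23]

# f_oneway returns the F statistic and the p-value for the null hypothesis
# that all group means are equal.
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)   # a small p-value suggests at least one group mean differs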
Support Vector Machines
11. What is Overfitting? Explain about SVM algorithm to overcome it?
[7M] July – 2023 Set -4[Remember]
Overfitting in Machine Learning
A statistical model is said to be overfitted when the model does not make accurate predictions
on testing data. When a model is trained on too much data, it starts learning from the noise
and inaccurate data entries in the data set, so testing with test data results in high variance.
The model then does not categorize the data correctly, because of too many details and too
much noise.
A common cause of overfitting is the use of non-parametric and non-linear methods, because
these types of machine learning algorithms have more freedom in building the model based on
the dataset and can therefore build unrealistic models.
A solution to avoid overfitting is using a linear algorithm if we have linear data or using the
parameters like the maximal depth if we are using decision trees.
Overfitting is a problem where the evaluation of machine learning algorithms on training data
is different from unseen data.
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put a new data point in the
correct category in the future. This best decision boundary is called a hyperplane. Because
SVM chooses the hyperplane with the maximum margin, it tends to generalize well to unseen
data, which helps it overcome overfitting.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector
Machine. Consider, for example, two different categories that are separated by a decision
boundary or hyperplane.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
o Linear SVM: used for linearly separable data, i.e. when a dataset can be classified into two
classes by a single straight line.
o Non-linear SVM: used for non-linearly separable data, i.e. when a dataset cannot be
classified by a straight line; a kernel is used to map the data into a higher-dimensional space.
Hyperplane:
The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features, the hyperplane will be a straight line, and if there are 3 features,
the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane,
they are called support vectors.
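A minimal scikit-learn SVM sketch follows; the Iris data, the RBF kernel, and the value of C are assumptions for illustration. A smaller C allows a wider margin, which also helps against overfitting.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C controls the trade-off between a wide margin and training errors;
# kernel='linear' gives a Linear SVM, kernel='rbf' a Non-linear SVM.
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)

print(svm.score(X_test, y_test))      # accuracy on unseen data
print(svm.support_vectors_.shape)     # the support vectors found by the algorithm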
Let’s take an example to get a better idea of the confusion matrix in multiclass classification,
using the Iris dataset discussed above.
Finding precision and recall from the confusion matrix (Table 1):
Precision for Virginica class is the number of correctly predicted virginica species out of all
the predicted virginica species, which is 4/7 = 57.1%. This means that only 4/7 of the species
that our predictor classifies as Virginica are actually virginica. Similarly, we can find for other
species i.e. for Setosa and Versicolor, precision is 20% and 62.5% respectively.
Whereas, Recall for Virginica class is the number of correctly predicted virginica species out
of actual virginica species, which is 50%. This means that our classifier classified half of the
virginica species as virginica. Similarly, we can find for other species i.e. for Setosa and
Versicolor, recall is 20% and 71.4% respectively.
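Per-class precision and recall of this kind can be computed with scikit-learn as in the sketch below; the label arrays are hypothetical stand-ins, not the actual Table 1 results.

from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem (not the real Iris results)
y_true = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred = ['setosa', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica']

print(confusion_matrix(y_true, y_pred))
# precision = TP / (TP + FP) per class, recall = TP / (TP + FN) per class
print(classification_report(y_true, y_pred))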
Multiclass Vs Multi-label
People often get confused between multiclass and multi-label classification. But these two
terms are very different and cannot be used interchangeably. We have already understood
what multiclass is all about. Let’s discuss in brief how multi-label is different from
multiclass.
Multi-label refers to a data point that may belong to more than one class. For example, you
wish to watch a movie with your friends but you have a different choice of genres that you all
enjoy. Some of your friends like comedy and others are more into action and thrill. Therefore,
you search for a movie that fulfills both the requirements and here, your movie is supposed to
have multiple labels. Whereas, in multiclass or binary classification, your data point can
belong to only a single class. Some more examples of the multi-label dataset could be protein
classification in the human body, or music categorization according to genres. It is also one
of the concepts highly used in photo classification.
MNIST
13. Explain about MNIST dataset. Describe the procedure to apply
classification technique. [7M] July – 2023 Set -2[Understand]
The MNIST database (Modified National Institute of Standards and Technology database) is
a large database of handwritten digits that is commonly used for training various image
processing systems. The database is also widely used for training and testing in the field of
machine learning.
It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that
since NIST's training dataset was taken from American Census Bureau employees, while the
testing dataset was taken from American high school students, it was not well-suited for
machine learning experiments. Furthermore, the black and white images from NIST were
normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced
grayscale levels.
The MNIST database contains 60,000 training images and 10,000 testing images. Half of the
training set and half of the test set were taken from NIST's training dataset, while the other
half of the training set and the other half of the test set were taken from NIST's testing
dataset. The original creators of the database keep a list of some of the methods tested on it.
In their original paper, they use a support-vector machine to get an error rate of 0.8%. An
extended dataset similar to MNIST, called EMNIST, was published in 2017; it contains
240,000 training images and 40,000 testing images of handwritten digits and characters.
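One possible classification procedure on MNIST (load, split, scale, train, evaluate) is sketched below; the choice of fetch_openml and a logistic regression baseline is an assumption, not the only option.

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Download the 70,000 28x28 digit images as 784-dimensional feature vectors
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Keep the conventional split: 60,000 training and 10,000 test images
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression(max_iter=200)   # a simple multiclass baseline
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))         # accuracy on the 10,000 test digits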
Ranking
14. What is ranking in binary classification in Machine Learning? What is
the best algorithm for ranking? [7M] July – 2023 Set -3[Remember]
In this question, by “ranking” we mean sorting documents by relevance to find contents of
interest with respect to a query. This is a fundamental problem of Information Retrieval, but
this task also arises in many other applications:
1. Search Engines — Given a user profile (location, age, sex, …) and a textual query, sort
web page results by relevance.
2. Recommender Systems — Given a user profile and purchase history, sort the other
items to find new potentially interesting products for the user.
3. Travel Agencies — Given a user profile and filters (check-in/check-out dates, number
and age of travelers, …), sort available rooms by relevance.
Ranking models typically work by predicting a relevance score s = f(x) for each input x = (q,
d) where q is a query and d is a document. Once we have the relevance of each document, we
can sort (i.e. rank) the documents according to those scores.
Ranking models rely on a scoring function.
The scoring model can be implemented using various approaches.
Vector Space Models – Compute a vector embedding (e.g. using Tf-Idf or BERT) for
each query and document, and then compute the relevance score f(x) = f(q, d) as the
cosine similarity between the vectors embeddings of q and d.
Learning to Rank – The scoring model is a Machine Learning model that learns to
predict a score s given an input x = (q, d) during a training phase where some sort of
ranking loss is minimized.
In this article we focus on the latter approach, and we show how to implement Machine
Learning models for Learning to Rank.
Mean Average Precision is used for tasks with binary relevance, i.e. when the true score y of a
document d can be only 0 (non relevant) or 1 (relevant).
For a given query q and corresponding documents D = {d₁, …, dₙ}, we check how many of
the top k retrieved documents are relevant (y=1) or not (y=0), in order to
compute precision Pₖ and recall Rₖ. For k = 1…n we get different Pₖ and Rₖ values that define
the precision-recall curve: the area under this curve is the Average Precision (AP).
Finally, by computing the average of AP values for a set of m queries, we obtain the Mean
Average Precision (MAP).
Discounted Cumulative Gain (DCG)
Discounted Cumulative Gain is used for tasks with graded relevance, i.e. when the true
score y of a document d is a discrete value in a scale measuring the relevance w.r.t. a query q.
A typical scale is 0 (bad), 1 (fair), 2 (good), 3 (excellent), 4 (perfect).
For a given query q and corresponding documents D = {d₁, …, dₙ}, we consider the k-th
top retrieved document. The gain Gₖ = 2^yₖ – 1 measures how useful this document is (we
want documents with high relevance!), while the discount Dₖ = 1/log₂(k+1) penalizes
documents that are retrieved with a lower rank (we want relevant documents in the top
ranks!).
The sum of the discounted gain terms GₖDₖ for k = 1…n is the Discounted Cumulative Gain
(DCG). To make sure that this score is bound between 0 and 1, we can divide the measured
DCG by the ideal score IDCG obtained if we ranked documents by the true value yₖ. This
gives us the Normalized Discounted Cumulative Gain (NDCG), where NDCG = DCG/IDCG.
Finally, as for MAP, we usually compute the average of DCG or NDCG values for a set
of m queries to obtain a mean value.
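The DCG/NDCG computation can be written in a few lines of NumPy, as in this illustrative sketch; the graded relevance values are made up, and a base-2 logarithm is assumed for the discount.

import numpy as np

def dcg(relevances):
    # gain 2^y - 1 discounted by 1 / log2(k + 1), for ranks k = 1..n
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum((2 ** relevances - 1) / np.log2(ranks + 1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))   # DCG of the ideal ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of documents in the order our model ranked them (made-up values)
ranked_relevances = [3, 2, 3, 0, 1]
print(ndcg(ranked_relevances))   # equals 1.0 only if the ranking matches the ideal ordering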
Machine Learning Models for Learning to Rank
To build a Machine Learning model for ranking, we need to define inputs, outputs and loss
function.
Input – For a query q we have n documents D = {d₁, …, dₙ} to be ranked by
relevance. The elements xᵢ = (q, dᵢ) are the inputs to our model.
Output – For a query-document input xᵢ = (q, dᵢ), we assume there exists a
true relevance score yᵢ. Our model outputs a predicted score sᵢ = f(xᵢ).
All Learning to Rank models use a base Machine Learning model (e.g. Decision
Tree or Neural Network) to compute s = f(x). The choice of the loss function is the distinctive
element for Learning to Rank models. In general, we have 3 approaches, depending on how
the loss is computed.
1. Pointwise Methods – The total loss is computed as the sum of loss terms defined
on each document dᵢ (hence pointwise) as the distance between the predicted
score sᵢ and the ground truth yᵢ, for i=1…n. By doing this, we transform our task into
a regression problem, where we train a model to predict y.
2. Pairwise Methods – The total loss is computed as the sum of loss terms defined
on each pair of documents dᵢ, dⱼ (hence pairwise), for i, j=1…n. The objective on
which the model is trained is to predict whether yᵢ > yⱼ or not, i.e. which of two
documents is more relevant. By doing this, we transform our task into a binary
classification problem.
3. Listwise Methods – The loss is directly computed on the whole list of documents
(hence listwise) with corresponding predicted ranks. In this way, ranking metrics can
be more directly incorporated into the loss.
A first approach is to use an iterative method where ranking metrics are used to re-
weight instances at each iteration. This is the approach used
by LambdaRank and LambdaMART, which are in fact hybrids of the pairwise and listwise
approaches.
A second approach is to approximate the objective to make it differentiable, which is the idea
behind SoftRank. Instead of predicting a deterministic score s = f(x), we predict
a smoothened probabilistic score s~ 𝒩(f(x), σ²). The ranks k are non-continuous functions
of predicted scores s, but thanks to the smoothening we can compute probability distributions
for the ranks of each document. Finally, we optimize SoftNDCG, the expected NDCG over
this rank distribution, which is a smooth function.
A third approach is to consider that each ranked list corresponds to a permutation, and to
define a loss over the space of permutations. In ListNet, given a list of scores s, we define the
probability of any permutation using the Plackett-Luce model. Then, the loss is computed as
the cross-entropy between the true and predicted probability distributions over the space of
permutations.
Finally, the LambdaLoss paper introduced a new perspective on this problem, and created
a generalized framework to define new listwise loss functions and achieve state-of-the-art
accuracy. The main idea is to frame the problem in a rigorous and general way, as a mixture
model where the ranked list π is treated as a hidden variable. Then, the loss is defined as the
negative log likelihood of such model.