Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester
Unit III
Course Objective: To familiarize students with the knowledge of machine learning and enable
them to apply suitable machine learning techniques for data handling, to gain knowledge from
it, to evaluate the performance of algorithms, and to provide solutions for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO3): Identify and integrate more than one technique to enhance the
performance of learning.
ENSEMBLE LEARNING
Ensemble learning is the process by which multiple models, such as classifiers or
experts, are strategically generated and combined to solve a particular computational
intelligence problem. Ensemble learning is primarily used to improve the performance of
a model on a task (classification, prediction, function approximation, etc.).
Ensemble methods are techniques that create multiple models and then combine them
to produce improved results. Ensemble methods usually produce more accurate
solutions than a single model would. This has been the case in a number of machine
learning competitions, where the winning solutions used ensemble methods.
A model is the output of an algorithm trained with data; the model is then used for
making predictions. The algorithm can be any machine learning algorithm, such as
logistic regression, a decision tree, etc. These models, when used as inputs of ensemble
methods, are called "base models".
There are different ways the multiple base-learners are combined to generate the final
output:
• In the global approach, also called learner fusion, given an input, all base-
learners generate an output, and all these outputs are used. Examples are
voting and stacking.
• Multistage combination methods use a serial approach where the next base-
learner is trained with or tested on only the instances where the previous base-
learners are not accurate enough. The idea is that the base-learners (or the
different representations they use) are sorted in increasing complexity so that a
complex base-learner is not used (or its complex representation is not extracted)
unless the preceding simpler base-learners are not confident. An example is
cascading.
Figure 3.2: Base-learners are d_j and their outputs are combined using f(·)
Let us say that we have L base-learners. We denote by d_j(x) the prediction of base-
learner M_j given the arbitrary-dimensional input x. In the case of multiple
representations, each M_j uses a different input representation x_j. The final prediction is
calculated from the predictions of the base-learners:

y = f(d_1, d_2, ..., d_L | Φ)

where f(·) is the combining function with Φ denoting its parameters. When there are K
outputs, each learner produces d_ji(x), i = 1, ..., K, j = 1, ..., L, and, combining them, we
also generate K values, y_i, i = 1, ..., K. Then, for example in classification, we choose the
class with the maximum y_i value:

Choose C_i if y_i = max_k y_k
VOTING
The simplest way to combine multiple classifiers is by voting, which corresponds to
taking a linear combination of the learners (see figure 3.2):

y_i = Σ_j w_j d_ji, where w_j ≥ 0 and Σ_j w_j = 1

This is also known as ensembles and linear opinion pools. In the simplest case, all
learners are given equal weight and we have simple voting that corresponds to taking
an average. Still, taking a (weighted) sum is only one of the possibilities and there are
also other combination rules, as shown in table 3.1. If the outputs are not posterior
probabilities, these rules require that the outputs be normalized to the same scale.
An example of the use of these rules is shown in table 3.2, which demonstrates the
effects of different rules. Sum rule is the most intuitive and is the most widely used in
practice. Median rule is more robust to outliers; minimum and maximum rules are
pessimistic and optimistic, respectively.
Table 3.2: Example of combination rules on three learners and three classes
With the product rule, each learner has veto power; regardless of the other ones, if one
learner has an output of 0, the overall output goes to 0. Note that after applying these
combination rules, the y_i do not necessarily sum up to 1.
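As an illustration, the fixed combination rules can be written in a few lines of Python. This is a minimal sketch (not from the original notes); the scores matrix is a hypothetical L × K array of normalized outputs d_ji for L = 3 learners and K = 3 classes:

import numpy as np

# Hypothetical outputs d_ji of L = 3 learners for K = 3 classes,
# each row normalized to the same scale (e.g., posterior probabilities).
scores = np.array([[0.2, 0.5, 0.3],
                   [0.0, 0.6, 0.4],
                   [0.4, 0.4, 0.2]])

rules = {
    "sum":     scores.mean(axis=0),        # average vote, most widely used
    "median":  np.median(scores, axis=0),  # more robust to outliers
    "minimum": scores.min(axis=0),         # pessimistic
    "maximum": scores.max(axis=0),         # optimistic
    "product": scores.prod(axis=0),        # each learner has veto power
}
for name, y in rules.items():
    print(name, y, "-> choose class", int(np.argmax(y)))

Note how the zero output of the second learner for the first class drives the product rule's score for that class to zero, illustrating the veto effect described above.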
In weighted sum, d_ji is the vote of learner j for class C_i and w_j is the weight of its vote.
Simple voting is a special case where all voters have equal weight, namely, w_j = 1/L. In
classification, this is called plurality voting where the class having the maximum
number of votes is the winner. When there are two classes, this is majority voting where
the winning class gets more than half of the votes. If the voters can also supply the
additional information of how much they vote for each class (e.g., by the posterior
probability), then after normalization, these can be used as weights in a weighted voting
scheme. Equivalently, if the d_ji are the class posterior probabilities, P(C_i | x, M_j), then we
can just sum them up (w_j = 1/L) and choose the class with maximum y_i.
In the case of regression, simple or weighted averaging or median can be used to fuse
the outputs of base-regressors. Median is more robust to noise than the average.
Another possible way to find the w_j is to assess the accuracies of the learners (regressors
or classifiers) on a separate validation set and use that information to compute the weights,
so that we give more weight to more accurate learners. These weights can also be
learned from data.
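A minimal sketch of this idea, assuming scikit-learn and a synthetic dataset (all names and parameter values below are arbitrary choices for illustration): the weight w_j of each learner is set proportional to its validation accuracy and used in a weighted soft vote.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

learners = [LogisticRegression(max_iter=1000),
            DecisionTreeClassifier(max_depth=3),
            GaussianNB()]
for m in learners:
    m.fit(X_train, y_train)

# Weight w_j proportional to validation accuracy, then normalized.
acc = np.array([m.score(X_val, y_val) for m in learners])
w = acc / acc.sum()

# Weighted soft vote: y_i = sum_j w_j * P(C_i | x, M_j).
probs = sum(w_j * m.predict_proba(X_val) for w_j, m in zip(w, learners))
y_pred = probs.argmax(axis=1)
print("ensemble accuracy:", (y_pred == y_val).mean())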
Voting can also be viewed as an approximation to Bayesian model combination, with the
weights w_j approximating the prior model probabilities P(M_j). We cannot integrate over
all models; we only choose a subset for which we believe P(M_j) is high, or we can have
another Bayesian step and calculate P(M_j | X), the probability of a model given the
sample, and sample highly probable models from this density.
Let us assume that the d_j are i.i.d. with expected value E[d_j] and variance Var(d_j).
Then, when we take a simple average with w_j = 1/L, the expected value and variance of
the output are

E[y] = E[(1/L) Σ_j d_j] = (1/L) · L · E[d_j] = E[d_j]

Var(y) = Var((1/L) Σ_j d_j) = (1/L²) · L · Var(d_j) = Var(d_j)/L

The expected value does not change, so the bias does not change. But the variance, and
therefore the mean square error, decreases as the number of independent voters, L,
increases. In the general case,

Var(y) = (1/L²) [ Σ_j Var(d_j) + 2 Σ_j Σ_{i<j} Cov(d_j, d_i) ]

which implies that if the learners are positively correlated, the variance (and error) increases.
We can thus view using different algorithms and input features as efforts to decrease, if
not eliminate, the positive correlation.
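A small simulation (a sketch, not from the original notes; all numbers are arbitrary) can verify both results: averaging L independent voters divides the variance by L, while positive pairwise correlation rho keeps the variance of the average high.

import numpy as np

rng = np.random.default_rng(0)
L, n = 10, 100_000

# Independent voters: Var(mean) = Var(d)/L.
d = rng.normal(0.0, 1.0, size=(n, L))
print("independent:", d.mean(axis=1).var())       # about 1/L = 0.1

# Positively correlated voters with pairwise correlation rho:
# Var(mean) = 1/L + (1 - 1/L) * rho, larger than 1/L.
rho = 0.5
common = rng.normal(0.0, 1.0, size=(n, 1))
d_corr = np.sqrt(rho) * common + np.sqrt(1 - rho) * d
print("correlated :", d_corr.mean(axis=1).var())  # about 0.55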
Further decrease in variance is possible if the voters are not independent but negatively
correlated. The error then decreases as long as the accompanying increase in bias is not
higher. Note, however, that these aims are contradictory; we cannot have a number of
classifiers that are all accurate and negatively correlated. In mixture of experts, for
example, where learners are localized, the experts are negatively correlated but biased.
ERROR-CORRECTING OUTPUT CODES
In error-correcting output codes (ECOC), the base-learners are binary classifiers having
output −1/+1, and there is a code matrix W of K × L whose K rows are the binary codes of
the classes in terms of the L base-learners d_j. For example, if the second row of W is
[−1, +1, +1, −1], this means that for us to say an instance belongs to C2, the instance
should be on the negative side of d1 and d4, and on the positive side of d2 and d3.
Similarly, the columns of the code matrix define the tasks of the base-learners. For
example, if the third column is [−1, +1, +1]^T, we understand that the task of the third
base-learner, d3, is to separate the instances of C1 from the instances of C2 and C3
combined. This is how we form the training sets of the base-learners. For example, in this
case, all instances labeled with C2 and C3 form X_3^+ and instances labeled with C1 form
X_3^−, and d3 is trained so that instances x^t ∈ X_3^+ give output +1 and x^t ∈ X_3^−
give output −1.
The code matrix thus allows us to define a polychotomy (K > 2 classification problem) in
terms of dichotomies (K = 2 classification problem), and it is a method that is applicable
using any learning algorithm to implement the base-learners—for example, linear or
multilayer perceptrons (with a single output), decision trees, or SVMs whose original
definition is for two-class problems.
The typical one-discriminant-per-class setting corresponds to the diagonal code matrix
where L = K. For example, for K = 4, we have

W = [ +1 −1 −1 −1
      −1 +1 −1 −1
      −1 −1 +1 −1
      −1 −1 −1 +1 ]
The problem here is that if there is an error with one of the base-learners, there may be
a misclassification because the class code words are so similar. So the approach in
error-correcting codes is to have L > K and increase the Hamming distance between the
code words. One possibility is pairwise separation of classes where there is a separate
base-learner to separate C_i from C_j, for i < j. In this case, L = K(K − 1)/2
and with K = 4, the code matrix is

W = [ +1 +1 +1  0  0  0
      −1  0  0 +1 +1  0
       0 −1  0 −1  0 +1
       0  0 −1  0 −1 −1 ]
where a 0 entry denotes “don’t care.” That is, d1 is trained to separate C1 from C2 and
does not use the training instances belonging to the other classes. Similarly, we say that
an instance belongs to C2 if d1 = −1 and d4 = d5 = +1, and we do not consider the values
of d2, d3, and d6. The problem here is that L is O(K²), and for large K pairwise
separation may not be feasible.
If we can have L high, we can just randomly generate the code matrix with −1/+1 entries
and this will work fine, but if we want to keep L low, we need to optimize W. The approach is
to set L beforehand and then find W such that the distances between rows, and at the
same time the distances between columns, are as large as possible in terms of Hamming
distance. With K classes, there are 2^(K−1) − 1 possible columns, namely, two-class
problems. This is because K bits can be written in 2^K different ways; complements
(e.g., "0101" and "1010," from our point of view, define the same discriminant) divide
the possible combinations by 2, and we then subtract 1 because a column of all 0s (or all 1s)
is useless. For example, when K = 4, fixing the first row to +1 to account for complements,
we have

W = [ +1 +1 +1 +1 +1 +1 +1
      −1 −1 −1 −1 +1 +1 +1
      −1 −1 +1 +1 −1 −1 +1
      −1 +1 −1 +1 −1 +1 −1 ]
When K is large, for a given value of L, we look for L columns out of the 2^(K−1) − 1. We
would like these columns of W to be as different as possible so that the tasks to be
learned by the base-learners are as different from each other as possible. At the same
time, we would like the rows of W to be as different as possible so that we can have
maximum error correction in case one or more base-learners fail. ECOC can be written
as a voting scheme where the entries of W, w_ij, are considered as vote weights:

y_i = Σ_{j=1}^{L} w_ij d_j
and then we choose the class with the highest y_i. Taking a weighted sum and then
choosing the maximum, instead of checking for an exact match, allows the d_j to no longer
need to be binary; they can take a value between −1 and +1, carrying soft certainties
instead of hard decisions. Note that a value p_j between 0 and 1, for example, a posterior
probability, can be converted to a value d_j between −1 and +1 simply as d_j = 2p_j − 1.
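To make the decoding concrete, here is a minimal Python sketch (not part of the original notes) that applies the pairwise code matrix for K = 4 above as a voting scheme; the soft outputs d_j are hypothetical values chosen for illustration:

import numpy as np

# Pairwise code matrix for K = 4 classes and L = 6 base-learners
# (0 = "don't care"); rows are the class code words.
W = np.array([[+1, +1, +1,  0,  0,  0],
              [-1,  0,  0, +1, +1,  0],
              [ 0, -1,  0, -1,  0, +1],
              [ 0,  0, -1,  0, -1, -1]])

# Hypothetical soft outputs d_j in [-1, +1] of the six base-learners
# for one instance (e.g., d_j = 2 p_j - 1 from posterior probabilities).
d = np.array([-0.9, 0.1, 0.3, 0.8, 0.7, -0.2])

# ECOC as a voting scheme: y_i = sum_j w_ij * d_j.
y = W @ d
print("class scores:", y, "-> choose C%d" % (np.argmax(y) + 1))

Here d1 is strongly negative and d4, d5 are strongly positive, so the highest score goes to C2, matching that class's code word.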
BAGGING
Bagging is a voting method whereby base-learners are made different by training them
over slightly different training sets. Bagging often considers homogeneous weak
learners, learns them independently from each other in parallel, and combines them
following a deterministic averaging process.
In parallel methods we fit the different learners independently of each other, so it is
possible to train them concurrently. The most famous such approach is "bagging"
(standing for "bootstrap aggregating"), which aims at producing an ensemble model that
is more robust than the individual models composing it.
Because a training dataset is only one sample drawn from the true (underlying
distribution), the fitted model is also subject to variability: if another dataset
had been observed, we would have obtained a different model.
The idea of bagging is then simple: we want to fit several independent models and
“average” their predictions in order to obtain a model with a lower variance. However,
we can’t, in practice, fit fully independent models because it would require too much
data. So, we rely on the good “approximate properties” of bootstrap samples
(representativity and independence) to fit models that are almost independent.
First, we create multiple bootstrap samples so that each new bootstrap sample will act
as another (almost) independent dataset drawn from the true distribution. Then, we can fit
a weak learner for each of these samples and finally aggregate them such that we kind
of "average" their outputs and so obtain an ensemble model with less variance than its
components. Roughly speaking, as the bootstrap samples are approximately
independent and identically distributed (i.i.d.), so are the learned base models. Then,
"averaging" the weak learners' outputs does not change the expected answer but reduces
its variance (just like averaging i.i.d. random variables preserves the expected value but
reduces the variance).
So, assuming we fit L almost independent weak learners w_1(·), ..., w_L(·) on these
bootstrap samples, we can then aggregate them with some kind of averaging process in
order to get an ensemble model with a lower variance. For example, for a regression
problem we can define our strong model as the simple average

s_L(·) = (1/L) Σ_{l=1}^{L} w_l(·)
There are several possible ways to aggregate the multiple models fitted in parallel. For a
regression problem, the outputs of the individual models can literally be averaged to obtain
the output of the ensemble model. For a classification problem, the class output by
each model can be seen as a vote, and the class that receives the majority of the votes is
returned by the ensemble model (this is called hard voting). Still for a classification
problem, we can also consider the probabilities of each class returned by all the models,
average these probabilities, and keep the class with the highest average probability (this
is called soft voting). Averages or votes can either be simple or weighted if any relevant
weights can be used.
Finally, we can mention that one of the big advantages of bagging is that it can be
parallelized. As the different models are fitted independently of each other,
intensive parallelization techniques can be used if required.
Figure 3.3: Bagging consists of fitting several base models on different bootstrap
samples
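A minimal sketch of bagging by hand, assuming scikit-learn and a synthetic dataset (scikit-learn's BaggingClassifier packages the same idea):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
L = 25
models = []
for _ in range(L):
    # Bootstrap sample: draw n training instances with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    models.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Soft voting: average the class probabilities of the L trees.
probs = np.mean([m.predict_proba(X_test) for m in models], axis=0)
print("bagged accuracy:", (probs.argmax(axis=1) == y_test).mean())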
Random forests go one step further: in addition to sampling over observations (bagging),
each tree is built by sampling over the features as well. Sampling over features has the
effect that the trees do not all look at the exact same information to make their decisions,
which reduces the correlation between their outputs. Another advantage of sampling over
the features is that it makes the decision-making process more robust to missing data:
observations (from the training dataset or not) with missing data can still be regressed or
classified based on the trees that take into account only features where data are not
missing. Thus, the random forest algorithm combines the concepts of bagging and random
feature subspace selection to create more robust models.
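As a sketch, scikit-learn's RandomForestClassifier exposes both ingredients directly: bootstrap sampling of the observations (bagging) and a random subset of features considered at each split. The dataset below is synthetic and the parameter values are arbitrary illustrations.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 100 trees, each fit on a bootstrap sample and restricted to a random
# subset of sqrt(n_features) candidate features at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())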
BOOSTING
Boosting methods work in the same spirit as bagging methods: we build a family of
models that are aggregated to obtain a strong learner that performs better. However,
unlike bagging, which mainly aims at reducing variance, boosting is a technique that
consists in fitting multiple weak learners sequentially in a very adaptive way: each
model in the sequence is fitted giving more importance to observations in the dataset
that were badly handled by the previous models in the sequence. Intuitively, each new
model focuses its efforts on the most difficult observations to fit so far, so that we
obtain, at the end of the process, a strong learner with lower bias (even though boosting
can also have the effect of reducing variance). Boosting, like bagging, can be used for
regression as well as for classification problems.
Being mainly focused on reducing bias, the base models that are often considered for
boosting are models with low variance but high bias. For example, if we want to use
trees as our base models, we will most of the time choose shallow decision trees with
only a few levels of depth. Another important reason that motivates the use of low-variance
but high-bias models as weak learners for boosting is that these models are in general less
computationally expensive to fit (few degrees of freedom when parametrized). Indeed,
as the computations to fit the different models can't be done in parallel (unlike bagging), it
could become too expensive to fit several complex models sequentially.
Once the weak learners have been chosen, we still need to define how they will be
sequentially fitted (what information from previous models do we consider when fitting
the current model?) and how they will be aggregated (how do we aggregate the current
model to the previous ones?). We will discuss these questions below, describing in
particular two important boosting algorithms: AdaBoost and gradient boosting.
In a nutshell, these two meta-algorithms differ in how they create and aggregate the
weak learners during the sequential process. Adaptive boosting updates the weights
attached to each of the training dataset observations, whereas gradient boosting updates
the values of these observations (each new learner is fitted to the residual errors of the
current ensemble). This main difference comes from the way both methods try to solve
the optimization problem of finding the best model that can be written as a
weighted sum of weak learners.
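To make the contrast concrete, here is a minimal gradient-boosting-for-regression sketch (an illustration under simplifying assumptions, not a full treatment): with the squared loss, "updating the values of the observations" amounts to fitting each new shallow tree to the current residuals.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, size=200)

lr, trees = 0.1, []
pred = np.zeros_like(y)
for _ in range(100):
    # Each shallow tree is fit to the residuals, i.e., the negative
    # gradient of the squared loss at the current ensemble predictions.
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += lr * tree.predict(X)

print("training MSE:", np.mean((y - pred) ** 2))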
ADABOOST
In adaptive boosting (often called "AdaBoost"), we try to define our ensemble model as
a weighted sum of L weak learners:

s_L(·) = Σ_{l=1}^{L} c_l × w_l(·)

where the c_l are coefficients and the w_l are weak learners.
Finding the best ensemble model with this form is a difficult optimization problem.
Instead of trying to solve it in one single shot (finding all the coefficients and weak
learners that give the best overall additive model), we make use of an iterative
optimization process that is much more tractable, even if it can lead to a sub-optimal
solution. More especially, we add the weak learners one by one, looking at each iteration
for the best possible pair (coefficient, weak learner) to add to the current ensemble
model. In other words, we define the s_l recurrently such that

s_l(·) = s_(l-1)(·) + c_l × w_l(·)

where c_l and w_l are chosen such that s_l is the model that best fits the training data
and so is the best possible improvement over s_(l-1). We can then denote

(c_l, w_l) = argmin_(c,w) E(s_(l-1)(·) + c × w(·)) = argmin_(c,w) Σ_n e(y_n, s_(l-1)(x_n) + c × w(x_n))

where E(·) is the fitting error of the given model and e(·,·) is the loss/error function.
Thus, instead of optimizing "globally" over all the L models in the sum, we approximate
the optimum by optimizing "locally", building and adding the weak learners to the
strong model one by one.
More specifically, when considering binary classification, we can show that the
AdaBoost algorithm can be re-written as a process that proceeds as follows. First, it
updates the observation weights in the dataset and trains a new weak learner with a
special focus given to the observations misclassified by the current ensemble model.
Second, it adds the weak learner to the weighted sum according to an update coefficient
that expresses the performance of this weak model: the better a weak learner
performs, the more it contributes to the strong learner.
So, assume that we are facing a binary classification problem, with N observations in
our dataset, and we want to use the AdaBoost algorithm with a given family of weak models.
At the very beginning of the algorithm (first model of the sequence), all the observations
have the same weight, 1/N. Then, we repeat L times (for the L learners in the sequence)
the following steps (implemented in the sketch at the end of this section):
• fit the best possible weak model with the current observation weights
• compute the value of the update coefficient, some kind of scalar evaluation
metric of the weak learner that indicates how much this weak learner should be
taken into account in the ensemble model
• update the strong learner by adding the new weak learner multiplied by its
update coefficient
• compute new observation weights that express which observations we would
like to focus on at the next iteration (weights of observations wrongly predicted
by the aggregated model increase and weights of the correctly predicted
observations decrease)
By repeating these steps, we have then built our L models sequentially and aggregated them
into a simple linear combination weighted by coefficients expressing the performance of
each learner. Notice that there exist variants of the initial AdaBoost algorithm, such as
LogitBoost (classification) or L2Boost (regression), that mainly differ by their choice of
loss function.
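The four steps above can be written as a short program. The following is a minimal sketch of the discrete AdaBoost variant for binary classification, assuming scikit-learn decision stumps as weak learners and a synthetic dataset; the update coefficient c_l is the usual 0.5 · log((1 − err)/err).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = 2 * y01 - 1                  # labels in {-1, +1}
N, L = len(y), 50
w = np.full(N, 1.0 / N)          # all observations start with weight 1/N
F = np.zeros(N)                  # strong learner (weighted sum) on X

for _ in range(L):
    # 1. fit the best possible weak model with the current weights
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    h = stump.predict(X)
    # 2. update coefficient from the weighted error of the weak learner
    err = w[h != y].sum()
    c = 0.5 * np.log((1 - err) / max(err, 1e-10))
    # 3. add the weak learner multiplied by its update coefficient
    F += c * h
    # 4. re-weight: misclassified observations gain weight
    w *= np.exp(-c * y * h)
    w /= w.sum()

print("training accuracy:", (np.sign(F) == y).mean())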
STACKING
Stacking mainly differs from bagging and boosting on two points. First, stacking often
considers heterogeneous weak learners (different learning algorithms are combined),
whereas bagging and boosting mainly consider homogeneous weak learners. Second,
stacking learns to combine the base models using a meta-model, whereas bagging and
boosting combine weak learners following deterministic algorithms.
The idea of stacking is to learn several different weak learners and combine them by
training a meta-model to output predictions based on the multiple predictions returned
by these weak models. So, we need to define two things in order to build our stacking
model: the L learners we want to fit and the meta-model that combines them.
For example, for a classification problem, we can choose as weak learners a KNN
classifier, a logistic regression, and an SVM, and decide to learn a neural network as meta-
model. Then, the neural network will take as inputs the outputs of our three weak
learners and will learn to return final predictions based on them.
So, assume that we want to fit a stacking ensemble composed of L weak learners. Then
we have to follow these steps:
• split the training data into two folds
• choose L weak learners and fit them to the data of the first fold
• for each of the L weak learners, make predictions for the observations in the
second fold
• fit the meta-model on the second fold, using the predictions made by the weak
learners as inputs
Thus, an obvious drawback of this split of our dataset in two parts is
that we only have half of the data to train the base models and half of the data to train
the meta-model. In order to overcome this limitation, we can however follow some kind
of "k-fold cross-training" approach (similar to what is done in k-fold cross-validation)
such that all the observations can be used to train the meta-model: for any observation,
the predictions of the weak learners are made with instances of these weak learners
trained on the k−1 folds that do not contain that observation. In other words, it
consists in training on k−1 folds in order to make predictions on the remaining fold,
iteratively, so as to obtain predictions for the observations in every fold. Doing so, we
can produce relevant predictions for each observation of our dataset and then train our
meta-model on all these predictions.
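A minimal sketch of this k-fold cross-training, assuming scikit-learn and the weak learners of the example above (with a logistic regression standing in for the neural-network meta-model, for simplicity): scikit-learn's cross_val_predict produces, for every observation, predictions from weak learners trained on the other folds, and scikit-learn's StackingClassifier implements the same scheme directly.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, random_state=0)

weak = [KNeighborsClassifier(),
        LogisticRegression(max_iter=1000),
        SVC(probability=True)]

# k-fold cross-training: each weak learner's prediction for an
# observation comes from a model trained on the other k-1 folds.
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in weak
])

# The meta-model learns to combine the weak learners' outputs.
meta = LogisticRegression().fit(meta_X, y)
print("meta-model accuracy:", meta.score(meta_X, y))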
Multi-level Stacking
A possible extension of stacking is multi-level stacking, where models are stacked in
several layers: the predictions of the models at one level become the inputs of the
meta-models at the next level. For each meta-model of the different levels of a
multi-level stacking ensemble model, we have to choose a learning algorithm, which can be
almost whatever we want (even algorithms already used at lower levels). We can also
mention that adding levels can either be data expensive (if a k-fold-like technique is not
used and more data are then needed) or time expensive (if a k-fold-like technique is used
and a lot of models then need to be fitted).