Unit - 3 ML
Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and Regression Trees
– Ensemble Learning – Boosting – Bagging – Different ways to Combine Classifiers – Probability and
Learning – Data into Probabilities – Basic Statistics – Gaussian Mixture Models – Nearest Neighbor
Methods – Unsupervised Learning – K means Algorithms – Vector Quantization – Self Organizing Feature
Map
There are a few different decision tree algorithms, but they are almost all variants of the same
principle: the algorithms build the tree in a greedy manner starting at the root, choosing the most
informative feature at each step. We are going to start by focusing on the most common:
Quinlan’s ID3, although we’ll also mention its extension, known as C4.5, and another known as
CART.
The important idea is to work out how much the entropy of the whole training set would
decrease if we choose each particular feature for the next classification step. This is known as the
information gain, and it is defined as the entropy of the whole set minus the entropy when a
particular feature is chosen.
As an example, suppose that we have data (with outcomes) S = {s1 = true, s2 = false, s3 = false,
s4 = false} and one feature F that can have values {f1, f2, f3}. In the example, the feature value
for s1 could be f2, for s2 it could be f2, for s3 it could be f3, and for s4 it could be f1. We can then
calculate the entropy of S as (where ⊕ means true, of which we have one example, and ⊖ means
false, of which we have three examples):

Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖ = −(1/4) log2(1/4) − (3/4) log2(3/4) ≈ 0.811
The function Entropy(Sf) is similar, but only computed with the subset of data where feature F
has value f.
If you were trying to follow those calculations on a calculator, you might be wondering how to
compute log2 p. The answer is to use the identity log2 p = ln p/ ln(2), where ln is the natural
logarithm, which your calculator can produce. NumPy has the np.log2() function.
We now want to compute the information gain of F, so we need to compute each of the
values inside the summation in Equation (12.2), (|Sf|/|S|) Entropy(Sf) (in the fuller example, the features
are ‘Deadline’, ‘Party’, and ‘Lazy’). For the toy example above this gives
Gain(S, F) = 0.811 − (2/4 × 1 + 1/4 × 0 + 1/4 × 0) ≈ 0.311.
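To make the calculation concrete, here is a minimal NumPy sketch of the entropy and information-gain computations for the toy example above (the function names are just for illustration).

import numpy as np

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(labels, feature_values):
    """Information gain = Entropy(S) - sum_f (|S_f|/|S|) * Entropy(S_f)."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    remainder = 0.0
    for f in np.unique(feature_values):
        subset = labels[feature_values == f]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Toy example from the text: S = {true, false, false, false},
# with feature values f2, f2, f3, f1 for s1..s4.
S = ["true", "false", "false", "false"]
F = ["f2", "f2", "f3", "f1"]
print(entropy(S))       # about 0.811 bits
print(info_gain(S, F))  # about 0.311 bits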
CLASSIFICATION AND REGRESSION TREES (CART)
There is another well-known tree-based algorithm, CART, whose name indicates that it can be
used for both classification and regression. Classification is not wildly different in CART,
although it is usually constrained to construct binary trees.
Gini Impurity
The entropy that was used in ID3 as the information measure is not the only way to pick features.
Another possibility is something known as the Gini impurity. The ‘impurity’ in the name
suggests that the aim of the decision tree is to have each leaf node represent a set of data points
that are in the same class, so that there are no mismatches.
Misclassification costs can also be taken into account. The cost is typically labelled as λij and is
presented as a matrix, with element λij representing the cost of misclassifying i as j. Using it is
simple: the Gini impurity is modified so that each term in the sum is weighted by the corresponding cost.
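As an illustration, the sketch below computes the plain Gini impurity, 1 − Σi pi², and one common way of weighting it with a misclassification-cost matrix λij; treat the weighted form as an assumption rather than the book's exact formula.

import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2 (equivalently sum over i != j of p_i * p_j)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(labels, classes, risk):
    """Cost-weighted impurity: sum over i != j of risk[i, j] * p_i * p_j,
    where risk[i, j] is the cost of classifying class i as class j."""
    labels = np.asarray(labels)
    p = np.array([np.mean(labels == c) for c in classes])
    total = 0.0
    for i in range(len(classes)):
        for j in range(len(classes)):
            if i != j:
                total += risk[i, j] * p[i] * p[j]
    return total

labels = ["yes", "yes", "no", "no", "no"]
print(gini(labels))   # 1 - (0.4^2 + 0.6^2) = 0.48
risk = np.array([[0.0, 1.0],
                 [5.0, 0.0]])   # hypothetical costs: mislabelling a 'no' is five times worse
print(weighted_gini(labels, ["yes", "no"], risk))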
Regression in Trees
The new part about CART is its application in regression. While it might seem strange to use
trees for regression, it turns out to require only a simple modification to the algorithm. Suppose
that the outputs are continuous, so that a regression model is appropriate.
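One standard modification, which the text does not spell out here, is to choose the split that minimises the sum-of-squared errors about the mean output on each side; a minimal single-feature sketch (the helper names are illustrative):

import numpy as np

def sse(y):
    """Sum of squared errors about the mean; 0 for an empty set."""
    y = np.asarray(y, dtype=float)
    return 0.0 if y.size == 0 else np.sum((y - y.mean()) ** 2)

def best_split(x, y):
    """Scan thresholds on a single feature x and return the split value
    that minimises SSE(left) + SSE(right)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    best_t, best_err = None, np.inf
    for t in np.unique(x)[:-1]:            # candidate thresholds
        err = sse(y[x <= t]) + sse(y[x > t])
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))   # splits at x = 3, where the two sides are nearly constant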
BOOSTING
The principal algorithm of boosting is named AdaBoost. In an early version of boosting, the training set was
split into three. A classifier was trained on the first third, and then tested on the second third. All
of the data that was misclassified during that testing was used to form a new dataset, along with
an equally sized random selection of the data that was correctly classified.
A second classifier was trained on this new dataset, and then both of the classifiers were tested
on the final third of the dataset. If they both produced the same output, then that data point was
ignored; otherwise the data point was added to yet another new dataset, which formed
the training set for a third classifier. Rather than looking further at this version, we will look at the
more common algorithm.
AdaBoost
The AdaBoost algorithm is conceptually very simple. At each iteration a new classifier is trained
on the training set, with the weights that are applied to the training set for each data point being
modified at each iteration according to how successfully that data point has been classified in the
past.
AdaBoost Algorithm
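Below is a minimal NumPy sketch of AdaBoost, using one-feature decision stumps as the weak learners; the choice of weak learner, the number of rounds, and all of the helper names are assumptions made for illustration rather than details taken from the text.

import numpy as np

def stump_train(X, y, w):
    """Find the best one-feature threshold classifier under weights w."""
    n, d = X.shape
    best = (np.inf, None)   # (weighted error, (feature, threshold, polarity))
    for f in range(d):
        for t in np.unique(X[:, f]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, f] - t) >= 0, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, (f, t, polarity))
    return best

def stump_predict(X, params):
    f, t, polarity = params
    return np.where(polarity * (X[:, f] - t) >= 0, 1, -1)

def adaboost(X, y, rounds=10):
    """AdaBoost: re-weight points according to how they were classified in the past."""
    n = len(y)
    w = np.ones(n) / n                          # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        err, params = stump_train(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this classifier in the vote
        pred = stump_predict(X, params)
        w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, params))
    return ensemble

def adaboost_predict(X, ensemble):
    scores = sum(a * stump_predict(X, p) for a, p in ensemble)
    return np.sign(scores)

# Tiny illustration with labels in {-1, +1}
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
model = adaboost(X, y, rounds=5)
print(adaboost_predict(X, model))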
BAGGING
The simplest method of combining classifiers is known as bagging, which stands for bootstrap
aggregating, the statistical description of the method.
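As a sketch of the idea, the following draws bootstrap samples (sampling with replacement), trains one classifier per sample, and combines them by majority vote; the tiny 'nearest class mean' learner is just a stand-in base classifier, not something prescribed by the text.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    """Draw N points *with* replacement from an N-point dataset."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def bag(X, y, train_fn, n_models=10):
    """Train one classifier per bootstrap sample."""
    return [train_fn(*bootstrap_sample(X, y)) for _ in range(n_models)]

def majority_vote(models, predict_fn, X):
    """Combine the classifiers by taking the most common predicted label
    (assumes non-negative integer class labels)."""
    votes = np.array([predict_fn(m, X) for m in models])   # shape (n_models, n_points)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Stand-in base learner: nearest class mean.
def train_fn(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_fn(model, X):
    classes = sorted(model)
    dists = np.array([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
models = bag(X, y, train_fn, n_models=5)
print(majority_vote(models, predict_fn, X))   # typically [0 0 1 1]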
The name bootstrap is more popular in computer science than anywhere else, since there is also a
bootstrap loader, which is the first program to run when a computer is turned on. It comes from
the nonsensical idea of ‘picking yourself up by your bootstraps,’ which means lifting yourself up
by your shoelaces, and is meant to imply starting from nothing.
Subagging
The method of subagging wins the prize for the oddest sounding word. It is a combination of
‘subsample’ and ‘bagging,’ and it is the fairly obvious idea that you don’t need to produce
samples that are the same size as the original data. If you make smaller datasets, then it makes
sense to sample without replacement, but otherwise the implementation is only very slightly
different from the bagging one, except that in NumPy you use np.random.shuffle() to produce
the samples. It is common to use a dataset size that is half that of the original data, and the results
of this can often be comparable to a full bagging simulation.
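A minimal sketch of the subagging sampling step, using np.random.shuffle as suggested above, with the common half-size choice as the default.

import numpy as np

def subsample(X, y, fraction=0.5):
    """Sample *without* replacement; half the original size is a common choice."""
    idx = np.arange(len(X))
    np.random.shuffle(idx)                  # shuffle the indices, as suggested in the text
    keep = idx[: int(fraction * len(X))]
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2
Xs, ys = subsample(X, y)
print(Xs.shape)   # (5, 2)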
Bagging puts most of its effort into ensuring that the different classifiers see different data, since
they see different samples of the data.
This is different than boosting, where the data stays the same, but the importance of each data
point changes for the different classifiers, since they each get different weights according to how
well the previous classifiers have performed.
Both boosting and bagging take a vote from amongst the classifiers, although they do it in
different ways: boosting takes a weighted vote, while bagging simply takes the majority vote.
There are other alternatives to these methods, as well.
Some classification systems will only produce an output where all the classifiers agree, or more
than half of them agree, whereas others simply take the most common output, which is what we
usually mean by majority voting.
The idea of not always producing an output is to ensure that the ensemble does not produce
outputs that are contentious, because they are probably difficult data points.
If the number of classifiers is odd and the classifiers are each independent of each other, then
majority voting will return the correct label if more than half of the classifiers agree. Assuming
that each individual classifier has a success rate of p, the probability of the ensemble getting the
correct answer is a binomial distribution of the form:

Σ (from k = (T+1)/2 to T) C(T, k) p^k (1 − p)^(T−k)

where T is the number of classifiers. If p > 0.5, then this sum approaches 1 as T → ∞.

The hierarchical mixture of experts network consists of a set of classifiers (experts) with
gating systems that also use the inputs to decide which classifiers to trust.
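A small calculation of the sum above, assuming T is odd so that there are no ties; it shows the ensemble accuracy climbing towards 1 as T grows whenever p > 0.5.

from math import comb

def ensemble_accuracy(p, T):
    """Probability that a majority of T independent classifiers, each correct
    with probability p, gives the right answer (T assumed odd, so no ties)."""
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(T // 2 + 1, T + 1))

for T in (1, 5, 15, 51):
    print(T, round(ensemble_accuracy(0.6, T), 4))
# the accuracy climbs towards 1 as T grows, provided p > 0.5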
The Mixture of Experts Algorithm
It is perfectly possible to use any other probability distribution instead of a Gaussian, but
Gaussians are by far the most common choice. Then the output for any particular data point that
is input to the algorithm will be the sum of the values expected by all of the M Gaussians:

f(x) = Σ (from m = 1 to M) αm φ(x; μm, Σm)

where φ(x; μm, Σm) is a Gaussian function with mean μm and covariance matrix Σm, and the αm
are mixing weights that sum to one.

Histograms of training data from a mixture of two Gaussians and two fitted models, shown as
line plots: the model on the left fits well, but the one on the right puts two Gaussians right on
top of each other and does not fit the data well.
The algorithm that is used is an example of a very general one known as the expectation-
maximization (or more compactly, EM) algorithm.
The basic idea of the EM algorithm is that sometimes it is easier to add extra variables that are
not actually known (called hidden or latent variables) and then to maximize the function over
those variables.
The assumption now is that data were created by randomly choosing one of two possible
Gaussians, and then creating a sample from that Gaussian. If the probability of picking Gaussian
one is p, then the entire model looks like this (where N (μ,σ2) specifies a Gaussian distribution
with mean μ and standard deviation σ):

y ~ p N(μ1, σ1²) + (1 − p) N(μ2, σ2²)

A hidden variable f is then added to say which Gaussian each point came from: if f = 0 then the
data came from Gaussian one, and if f = 1 then it came from Gaussian two.
This is the typical initial step of an EM algorithm: adding latent variables. Now we just need to
work out how to optimise over them. This is the time when the reason for the algorithm being
called expectation-maximisation becomes clear.
where D denotes the data. Note that since we have set f = 1 this means that we are choosing
Gaussian two.
Computing the value of this expectation is known as the E-step. Then this estimate of the
expectation is maximised over the model parameters (the parameters of the two Gaussians and
the mixing parameter π), the M-step.
This requires differentiating the expectation with respect to each of the model parameters. These
two steps are simply iterated until the algorithm converges. Note that the estimate never gets any
smaller, and it turns out that EM algorithms are guaranteed to reach a local maximum.
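A minimal 1D sketch of the E-step and M-step for a two-Gaussian mixture; the initialisation and the fixed number of iterations are assumptions made for illustration, not details taken from the text.

import numpy as np

def em_two_gaussians(x, iterations=100):
    """EM for a 1D mixture of two Gaussians.
    E-step: responsibility of Gaussian two for each point (the f = 1 case).
    M-step: re-estimate the means, variances, and the mixing parameter."""
    x = np.asarray(x, dtype=float)
    mu1, mu2 = x.min(), x.max()          # crude initialisation (an assumption)
    var1 = var2 = x.var()
    p = 0.5                              # probability of picking Gaussian one
    normal = lambda x, mu, var: np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    for _ in range(iterations):
        # E-step: posterior probability that each point came from Gaussian two
        p1 = p * normal(x, mu1, var1)
        p2 = (1 - p) * normal(x, mu2, var2)
        gamma = p2 / (p1 + p2)
        # M-step: maximise the expected log-likelihood
        mu1 = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * x) / np.sum(gamma)
        var1 = np.sum((1 - gamma) * (x - mu1) ** 2) / np.sum(1 - gamma)
        var2 = np.sum(gamma * (x - mu2) ** 2) / np.sum(gamma)
        p = (1 - gamma).mean()
    return mu1, var1, mu2, var2, p

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 100)])
print(em_two_gaussians(data))   # means near -2 and 3, p (weight of Gaussian one) near 2/3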
One of the simplest decision procedures that can be used for classification is the nearest neighbour (NN)
rule. It classifies a sample based on the category of its nearest neighbour.
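A minimal sketch of the rule, with an optional k so that the same function also gives the k-nearest neighbours majority vote discussed below.

import numpy as np

def nearest_neighbour(train_X, train_y, test_X, k=1):
    """Classify each test point by a majority vote amongst its k nearest
    training points (k = 1 gives the plain nearest-neighbour rule)."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    test_y = []
    for x in np.asarray(test_X, dtype=float):
        dists = np.sum((train_X - x) ** 2, axis=1)
        nearest = np.argsort(dists)[:k]
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        test_y.append(labels[counts.argmax()])
    return np.array(test_y)

train_X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
train_y = np.array([0, 0, 1, 1])
print(nearest_neighbour(train_X, train_y, [[0.2, 0.5], [4.8, 5.5]]))   # [0 1]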
This method suffers from the curse of dimensionality. First, as shown above, the computational
costs get higher as the number of dimensions grows. This is not as bad as it might appear at first:
there are sets of methods such as KD-Trees that compute this in O(N log N) time.
The nearest neighbours decision boundary with left: one neighbour and right: two neighbours.
For the k-nearest neighbours algorithm the bias-variance decomposition can be computed as:

E[(y − f̂(x0))²] = σ² + [f(x0) − (1/k) Σ (from l = 1 to k) f(x(l))]² + σ²/k

where x(1), …, x(k) are the k nearest neighbours of x0, f is the true function, and σ² is the noise variance.
The way to interpret this is that when k is small, so that there are few neighbours considered, the
model has flexibility and can represent the underlying model well, but that it makes mistakes
(has high variance) because there is relatively little data. As k increases, the variance decreases,
but at the cost of less flexibility and so more bias.
Both of these kernels are designed to give more weight to points that are closer to the current
input, with the weights decreasing smoothly to zero as they pass out of the range of the current
input, the range being specified by a parameter λ. One of them is the Epanechnikov quadratic kernel:

K(x0, x) = 0.75 (1 − t²) if |t| ≤ 1, and 0 otherwise, where t = (x − x0)/λ

Large values of λ average over more data points, and therefore produce lower variance, but at the
cost of higher bias.
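A minimal sketch of a kernel smoother built on the Epanechnikov kernel; combining the outputs as a kernel-weighted average (Nadaraya-Watson style) is one standard choice, assumed here for illustration.

import numpy as np

def epanechnikov(x0, x, lam):
    """Epanechnikov quadratic kernel: 0.75 * (1 - t^2) for |t| <= 1, zero outside,
    where t = (x - x0) / lam."""
    t = (x - x0) / lam
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def kernel_smooth(x_train, y_train, x_query, lam=1.0):
    """Weighted average of the training outputs, with weights given by the kernel."""
    y_query = []
    for x0 in np.atleast_1d(x_query):
        w = epanechnikov(x0, x_train, lam)
        y_query.append(np.sum(w * y_train) / np.sum(w) if w.sum() > 0 else np.nan)
    return np.array(y_query)

x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x) + 0.1 * np.random.randn(50)
print(kernel_smooth(x, y, [0.5, np.pi, 5.0], lam=0.8))   # roughly sin() at those points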
The construction of the KD-tree is O(N log² N), with much of the computational cost being in the
computation of the median, which with a naïve algorithm requires a sort and is therefore
O(N log N), or can be computed with a randomised algorithm in O(N) time.
Output of the nearest neighbour method and two kernel smoothers
Distance Measures
You can measure the distance between two points, say my house and the shop, on the map by just
using a ruler, but it essentially consists of measuring the distance in one direction (we’ll call it
north-south) and then the distance in another direction that is perpendicular to the first (let’s call
it east-west), squaring them, adding them together, and then taking the square root of that. This is
the Euclidean distance.
Alternatively, I could measure the distance between my house and the shop in the ‘north-south’
direction and the distance in the ‘east-west’ direction, and then simply add the two distances
together. This is the city-block (or Manhattan) distance.
A metric function or norm takes two inputs and gives a scalar (the distance) back. This scalar is
positive, and 0 if and only if the two points are the same; it is symmetric (so that the distance to the
shop is the same as the distance back); and it obeys the triangle inequality, which says that the
distance from a to b plus the distance from b to c should not be less than the direct distance from
a to c.
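A minimal sketch of the two distances described above.

import numpy as np

def euclidean(a, b):
    """Straight-line distance: square the differences, sum, take the square root."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

def city_block(a, b):
    """City-block (Manhattan) distance: add the distances along each axis."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b))

house, shop = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euclidean(house, shop))   # 5.0
print(city_block(house, shop))  # 7.0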
The idea of invariant metrics is to find measures that ignore changes that you don’t want. So if
you want to be able to rotate shapes around and still recognize them, you need a metric that is
invariant to rotation.
THE K-MEANS ALGORITHM
The number of clusters found from data by the method is denoted by the letter 'K' in K-means. In
this method, data points are assigned to clusters in such a way that the sum of the squared
distances between the data points and the centroid is as small as possible.
• Initialisation
– choose a value for k
– choose k random positions in the input space
– assign the cluster centres μj to those positions
• Learning
– repeat:
∗ for each datapoint:
· compute the distance to each cluster centre
· assign the datapoint to the nearest cluster centre
∗ for each cluster centre:
· move the position of the centre to the mean of the points assigned to it
– until the cluster centres stop moving
• Usage
– for each test point:
∗ compute the distance to each cluster centre
∗ assign the test point to the nearest cluster centre
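A minimal NumPy sketch of the whole algorithm, filling in the learning and usage steps outlined above. Initialising the centres at randomly chosen data points is a common variant of ‘choose k random positions in the input space’, and swapping the mean for the median gives the robust version discussed below.

import numpy as np

def kmeans(data, k, iterations=100):
    """Plain k-means: assign each point to its nearest centre, then move
    each centre to the mean of the points assigned to it."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(0)
    centres = data[rng.choice(len(data), k, replace=False)]   # random initial positions
    for _ in range(iterations):
        # distances from every point to every centre
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        new_centres = np.array([
            data[assignment == j].mean(axis=0) if np.any(assignment == j) else centres[j]
            for j in range(k)])
        if np.allclose(new_centres, centres):   # the centres have stopped moving
            break
        centres = new_centres
    return centres, assignment

data = np.vstack([np.random.randn(50, 2) + [0, 0],
                  np.random.randn(50, 2) + [5, 5]])
centres, assignment = kmeans(data, k=2)
print(centres)   # roughly [0, 0] and [5, 5], in some order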
There are certainly cases where we don’t know in advance how many clusters we will see in the
data, but the k-means algorithm doesn’t deal with this at all well.
There are lots of reasons for performing clustering, but one of the more common ones is to deal
with noisy data readings. These might be slightly corrupted, or occasionally just plain wrong.
Unfortunately, the mean average, which is central to the k-means algorithm, is very susceptible
to outliers, i.e., very noisy measurements. One way to avoid the problem is to replace the mean
average with the median, which is what is known as a robust statistic, meaning that it is not
affected by outliers.
The only change that is needed to the algorithm is to replace the computation of the mean with
the computation of the median. This is computationally more expensive, as we’ve discussed
previously, but it does remove noise effectively.
The k-means algorithm clearly works, despite its problems with noise and the difficulty with
choosing the number of clusters. Interestingly, while it might seem a long way from neural
networks, it isn’t. If we think about the cluster centres that we optimise the positions of as
locations in weight space, then we could position neurons in those places and use neural network
training.
The On-Line k-Means Algorithm
• Initialisation
– choose a value for k, which corresponds to the number of output nodes
– initialise the weights to have small random values
• Learning
– normalise the data so that all the points lie on the unit sphere
– repeat:
∗ for each datapoint:
· compute the activations of all the nodes
· pick the winner as the node with the highest activation
· update the weights using Equation (14.7)
– until the number of iterations is above a threshold
• Usage
– for each test point:
∗ compute the activations of all the nodes
∗ pick the winner as the node with the highest activation
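A minimal sketch of the on-line algorithm above. The weight update for the winning node, w += η(x − w), is an assumed standard competitive-learning rule, since Equation (14.7) itself is not reproduced in this excerpt.

import numpy as np

def online_kmeans(data, k, eta=0.1, epochs=20):
    """On-line (competitive) k-means: winner-takes-all weight updates."""
    data = np.asarray(data, dtype=float)
    data = data / np.linalg.norm(data, axis=1, keepdims=True)   # normalise to the unit sphere
    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.1, size=(k, data.shape[1]))    # small random weights
    for _ in range(epochs):
        for x in data:
            activations = weights @ x                  # activation of each output node
            winner = activations.argmax()              # node with the highest activation
            weights[winner] += eta * (x - weights[winner])   # assumed form of Equation (14.7)
    return weights

def online_kmeans_predict(weights, data):
    data = np.asarray(data, dtype=float)
    data = data / np.linalg.norm(data, axis=1, keepdims=True)
    return (data @ weights.T).argmax(axis=1)           # winning node for each test point

data = np.vstack([np.random.randn(50, 2) + [3, 0],
                  np.random.randn(50, 2) + [0, 3]])
w = online_kmeans(data, k=2)
labels = online_kmeans_predict(w, data)
print(labels[:5], labels[-5:])   # usually the two blobs map to different nodes, although
                                 # winner-takes-all can leave a node unused on a bad initialisation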
VECTOR QUANTISATION
We’ve already discussed using competitive learning for removing noise. There is a
related application, data compression, which is used both for storing data and for the
transmission of speech and image data.
The reason that the applications are related is that both replace the current input by the
cluster centre that it belongs to.
For noise reduction we do this to replace the noisy input with a cleaner one, while for
data compression we do it to reduce the number of data points that we send.
Both of these things can be understood by considering them as examples of data communication.
This application is called learning vector quantization because we are learning an
efficient vector quantisation.
THE SELF-ORGANISING FEATURE MAP
A self-organizing map (SOM) or self-organizing feature map (SOFM) is an unsupervised machine
learning technique used to produce a low-dimensional (typically two-dimensional) representation of a
higher dimensional data set while preserving the topological structure of the data. There are two novel
departures in this for us: firstly, the relative locations of the neurons in the network matter.
This topology preservation is not necessarily possible, because the SOM typically uses a 1D
or 2D array of neurons, and most of our input spaces are of much higher dimensionality than
that. This means that the ordering cannot be preserved.
The Self-Organising Map network. As usual, input nodes (on the left) do no computation,
and the weights are modified to change the activations of the neurons (weights are only
shown to two nodes for clarity). However, the nodes within the SOM affect each other in that
the winning node also changes the weights of neurons that are close to it.
This is known as the ‘Mexican Hat’ form of lateral connections, for reasons that should be
clear from the picture. We can then just use ordinary competitive learning, just like we did
for the k-means network. The Self-Organising Map does pretty much exactly this.
The SOM Algorithm
In Kohonen’s SOM algorithm, the weight update rule is modified instead, so that information
about neighbouring neurons is included in the learning rule, which makes the algorithm simpler.
The algorithm is a competitive learning algorithm, so that one neuron is chosen as the winner,
but when its weights are updated, so are those of its neighbours, although to a lesser extent.
Neurons that are not within the neighbourhood are ignored, not repelled.
We will now look at the SOM algorithm before examining some of the details further.
Learning
– repeat:
∗ for each datapoint:
· select the best-matching neuron nb using the minimum Euclidean distance between the
weights and the input
· update the weight vector of the best-matching node using Δwb = η(t)(x − wb), and the
weight vectors of its neighbours using Δwj = ηn(t) h(nb, t)(x − wj)
where ηn(t) is the learning rate for neighbourhood nodes, and h(nb, t) is the neighbourhood
function, which decides whether each neuron should be included in the neighbourhood of the
winning neuron (so h = 1 for neighbours and h = 0 for non-neighbours)
∗ reduce the learning rates and adjust the neighbourhood function, typically
by η(t+1) = αη(t)^(k/kmax), where 0 ≤ α ≤ 1 decides how fast the size decreases, k is the number
of iterations the algorithm has been running for, and kmax is when you want the learning to stop.
The same equation is used for both learning rates (η, ηn) and the neighbourhood function h(nb, t).
– until the map stops changing or some maximum number of iterations is exceeded
Usage
– for each test point:
∗ select the best-matching neuron nb using the minimum Euclidean distance
between the weights and the input.
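Finally, a minimal sketch of the SOM algorithm above on a 2D grid of neurons; the hard 0/1 neighbourhood function and the geometric decay of the learning rates and radius are simple choices made for illustration.

import numpy as np

def som_train(data, grid=(10, 10), epochs=50, eta=0.5, eta_n=0.1, radius=3.0, alpha=0.9):
    """A minimal 2D Self-Organising Map. Each iteration: find the best-matching
    neuron by Euclidean distance, then pull it (at rate eta) and its grid
    neighbours (at the smaller rate eta_n) towards the input."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(0)
    rows, cols = grid
    weights = rng.random((rows * cols, data.shape[1]))
    # (row, col) grid coordinates of every neuron, for the neighbourhood function
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for _ in range(epochs):
        for x in data:
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # best-matching unit
            grid_dist = np.linalg.norm(coords - coords[bmu], axis=1)
            h = (grid_dist <= radius).astype(float)   # 1 for neighbours, 0 otherwise
            rates = eta_n * h                         # smaller learning rate for neighbours
            rates[bmu] = eta                          # full learning rate for the winner
            weights += rates[:, None] * (x - weights)
        eta *= alpha                                  # reduce the learning rates
        eta_n *= alpha
        radius *= alpha                               # shrink the neighbourhood
    return weights

def som_map(weights, data):
    """Usage: return the index of the best-matching neuron for each point."""
    data = np.asarray(data, dtype=float)
    return np.array([np.argmin(np.linalg.norm(weights - x, axis=1)) for x in data])

data = np.random.rand(200, 3)        # e.g. random RGB colours
w = som_train(data)
print(som_map(w, data[:5]))          # grid indices of the best-matching neurons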