ML - Mid2


1a) LINEAR REGRESSION WITH LEAST SQUARE ERROR CRITERION

Linear regressor functions have a variety of good analytical properties. Linear regressor
functions are comparatively simple to compute, and if there is no information to suggest
otherwise, linear regressors are ideal candidates for early, trial regression.
The problem of finding a linear regressor function will be formulated as a problem of
minimizing a criterion function. The widely-used criterion function for regression purposes is
the sum-of-error-squares.
In general, regression methods are used to predict the value of response (dependent) variable
from attribute (independent) variables, where the variables have continuous numeric values.
The linear regressor model fits a linear function (relationship) between the dependent (output) variable and the independent (input) variables; it expresses the output $\hat{y}$ as a linear combination of the attributes x1, x2, …, xn, given the N data points $(x_j^{(i)}, y^{(i)})$; i = 1, …, N; j = 1, …, n. The linear regressor model, thus, has the form
$$\hat{y}^{(i)} = w_0 + w_1 x_1^{(i)} + \cdots + w_n x_n^{(i)}; \quad i = 1, \ldots, N$$
where {w0, w1, …, wn} are the parameters of the model.
Of interest is the difference between the predicted value $\hat{y}$ and the actual value y. The
method of linear regression is to choose the (n + 1) coefficients w0, w1, …, wn, to minimize the
residual sum of squares of these differences over all the N training instances. The performance
criterion is thus the sum-of-error-squares:
Residual error
$$e^{(i)} = y^{(i)} - \hat{y}^{(i)} = y^{(i)} - \sum_{j=0}^{n} w_j x_j^{(i)}; \quad x_0^{(i)} = 1$$
Residual sum-of-error-squares
$$E = \sum_{i=1}^{N} \left(e^{(i)}\right)^2 = \sum_{i=1}^{N} \left(y^{(i)} - \sum_{j=0}^{n} w_j x_j^{(i)}\right)^2; \quad x_0^{(i)} = 1$$
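Setting the gradient of E with respect to each wj to zero yields the normal equations, which can be solved directly. A minimal sketch in Python/NumPy (the data values here are hypothetical, used only to show the mechanics):

```python
import numpy as np

# Hypothetical training data: N = 5 points, n = 2 attributes.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

# Prepend x0 = 1 to each pattern so that w[0] plays the role of w0.
Xa = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve the normal equations (Xa^T Xa) w = Xa^T y for w = [w0, w1, ..., wn].
w = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

y_hat = Xa @ w                      # predicted values
E = np.sum((y - y_hat) ** 2)        # residual sum-of-error-squares
print(w, E)
```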
Least Mean Square (LMS) Algorithm
So far we have discussed deterministic algorithms to determine the optimum setting of the weights $\bar{w}$ that minimizes the criterion function given by Eqn (3.75). Now we explore a digression from this error function. Consider the problem of computing the weight vector $\bar{w}$ so as to minimize the Mean Square Error (MSE) between desired and true outputs, defined as follows:
$$E(\bar{w}) = E\!\left[\sum_{i=1}^{N}\left(y^{(i)} - \bar{w}^T \bar{x}^{(i)}\right)^2\right]$$
where E[·] is the statistical expectation operator.
The solution to this problem requires the computation of the autocorrelation matrix $E[xx^T]$ of the set of feature vectors, and the cross-correlation matrix $E[xy]$ between the desired response and the feature vector.
The Least Mean Square (LMS) algorithm, originally formulated by Widrow and Hoff, is a stochastic gradient algorithm that iterates the weight vector $\bar{w}$ in the regressor in the direction of the gradient of the squared amplitude of the error signal with respect to that weight vector. LMS is called a stochastic gradient algorithm because the gradient vector is chosen at 'random' and not, as in the steepest descent case, precisely derived from the shape of the total error surface. Random here means the instantaneous value of the gradient, which is then used as the estimator of the true quantity.
In practice, this simply means that LMS uses at each iteration step the actual value of the error
after a single data pair is presented, not the full sum-of-error-squares function, which requires
presentation of the entire set of training patterns for calculation of the gradient. Thus, LMS is
used in an online mode, described earlier in this section, where the weights are updated after
each data pair is presented.
The design of the LMS algorithm is very simple, yet a detailed analysis of its convergence
behavior is a challenging mathematical task. It turns out that under mild conditions, the
solution provided by the LMS algorithm converges in probability to the solution of MSE
optimization problem.
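As a brief illustration (a sketch, not from the text), a single LMS step on one data pair $(\bar{x}^{(i)}, y^{(i)})$ moves the weights along the instantaneous gradient of the squared error for that pair alone:

```python
import numpy as np

def lms_step(w, x, y, eta=0.01):
    """One LMS update on a single data pair (x includes the bias component x0 = 1)."""
    e = y - w @ x           # instantaneous error for this pair
    return w + eta * e * x  # move along the instantaneous (stochastic) gradient

# Hypothetical single pair: x0 = 1 prepended for the bias weight.
w = np.zeros(3)
w = lms_step(w, np.array([1.0, 2.0, 3.0]), y=1.5)
print(w)
```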

1b) MINIMUM DESCRIPTION LENGTH PRINCIPLE


Another framework to discuss hypothesis complexity versus prediction accuracy issue is offered
by the Minimum Description Length (MDL) principle. Though conceptually quite different, this
principle results in a formalism which is identical to the Bayesian one [52]. Suppose there is a 'sender' who desires to transmit output values (y) corresponding to the learning cases (training data D) to a 'receiver' who is already aware of the input values (x), with the use of a message of the shortest possible length (wherein the length of the message is measured by the number of bits, for instance). There are many options for this transmission. One approach would be simply to transmit a suitably encoded form of the output values of each case in D using some fixed coding scheme. In another approach, an accurate model able to recompute the output values from the inputs (the model is expected to be a complex one that makes no errors) is transmitted. Still another approach is to send an approximate (less complex) model together with error corrections. The MDL principle states that the best model for the learning sample is the one which minimizes the length of the message which has to be sent to the receiver:
$$\underbrace{K(h, D)}_{\text{description length}} = \underbrace{K(h)}_{\text{complexity}} + \underbrace{K(D \text{ using } h)}_{\text{error}}$$
(3.101)

where K(h) is the minimal number of bits necessary to describe the model to the receiver, and K(D using h) is the minimal number of bits necessary to describe the errors made by the model on the learning sample. Intuitively, the lower the empirical risk of the model (achieved with increasing model complexity), the lower will be the length K(D using h); in the limit, if the model perfectly fits the data D, this term vanishes. On the other hand, the number of bits used to code a model is certainly an increasing function of the number of parameters used to represent the model, and hence of its complexity. Thus the two terms making up the description length essentially represent a trade-off between the empirical risk and the model complexity.
The two terms in Eqn (3.101) defining K(h, D) are, in theory, defined as Kolmogorov complexities. These quantities are essentially non-computable, and hence various approximations have been proposed in the literature, leading to various practical versions of the MDL principle.
Bayesian Perspective
The shortest description length is expected when the hypothesis h results in a precise representation of the statistical process which generated the data, and it is also expected that, on average, this hypothesis will have the ideal generalization properties.
For the discrete hypothesis space and data, we will write P(h) to express the prior probability that hypothesis h holds; it may be based on the background knowledge we possess about the probability that h is the right hypothesis. In the same way, we will write P(D|h) to imply the probability of observing data D given a world wherein hypothesis h holds; and P(D) to express the prior probability that training data D will be observed with no knowledge about which hypothesis holds. We are interested in the posterior probability P(h|D) that h holds given the observed data D. Bayes formula states
$$P(h \mid D) = \frac{P(h)\, P(D \mid h)}{P(D)}$$
(3.102)
The optimal hypothesis h* is the one yielding highest posterior probability, that is,
$$h^* = \arg\max_h \left[P(h)\, P(D \mid h)\right]$$
(3.103)
When h* in Eqn (3.103) is equivalently expressed in terms of log2, we have
$$h^* = \arg\max_h \left[\log_2 P(h) + \log_2 P(D \mid h)\right]$$
(3.104a)
or alternatively, minimizing the negative of this quantity:
$$h^* = \arg\min_h \left[-\log_2 P(h) - \log_2 P(D \mid h)\right]$$
(3.104b)
Shannon and Weaver (1949) showed that the optimal code (i.e., the code that minimizes the expected message length) assigns −log2 P(x) bits to encode message x. Equation (3.104b) can thus be interpreted in terms of the MDL principle, linking minimum description length to the Bayesian approach.
Probability distributions are, however, mostly unknown in real learning problems.
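As a small illustration of Eqn (3.104b), the following sketch picks the hypothesis with the minimum total description length −log2 P(h) − log2 P(D|h); the priors and likelihoods are hypothetical numbers chosen only to show the mechanics:

```python
import math

# Hypothetical priors P(h) and likelihoods P(D|h) for three candidate hypotheses.
candidates = {
    "h1": (0.6, 0.10),   # simple model: high prior, poor fit
    "h2": (0.3, 0.40),   # moderate complexity, decent fit
    "h3": (0.1, 0.90),   # complex model: low prior, excellent fit
}

def description_length(prior, likelihood):
    # K(h) + K(D using h)  ~  -log2 P(h) - log2 P(D|h)   (Eqn 3.104b)
    return -math.log2(prior) - math.log2(likelihood)

for h, (p, l) in candidates.items():
    print(h, round(description_length(p, l), 3), "bits")
best = min(candidates, key=lambda h: description_length(*candidates[h]))
print("MDL choice:", best)   # h2 wins the complexity/error trade-off here
```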
2a) PERCEPTRON ALGORITHM
Let us assume that we have a set of N samples x(1), x(2), …, x(N); some labeled Class 1 and some labeled Class 2:
$$D: \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$$
(4.10)
with x(i) ∊ ℜⁿ, and y(i) ∊ {+1, −1}.
We want to use these samples to determine the weights w and w0 in a linear discriminant
function
$$g(x) = w^T x + w_0$$
(4.11)
Let us say there is a reason behind the belief that there is a solution for which the likelihood of
error is quite low. This leads to the desire to seek a weight vector that correctly classifies all the
samples. If there does exist such a weight vector, the samples are said to be linearly
separable (Fig. 4.5a).
Figure 4.5

Mostly, the classes are overlapped and the genuine separability is given by nonlinear decision boundaries (Fig. 4.5b). The samples in these cases are said to be linearly inseparable.
Consider now the problem of constructing a criterion function for solving the linear
classification problem to determine the weights w and w0. The criterion function, which is the
most obvious for the purpose of classification, is the number of samples misclassified by the
weight vector. But since this function is piecewise constant, it is naturally a weak candidate for
gradient search. The Perceptron Algorithm seems to be a better alternative for this criterion
function.
In the following, we describe perceptron training algorithm and its limitations for classification
problems.
A perceptron takes a vector of real-valued inputs xj; j = 1, …, n, calculates a linear combination of these inputs $\left(\sum_{j=1}^{n} w_j x_j = w^T x\right)$, then outputs +1 if the result is greater than the threshold (−w0) and −1 if the result is less than the threshold (Fig. 4.3). The perceptron
algorithm tests the decision function g(x) on each element in the training set, and if the test
fails, it adjusts the free parameters w and w0 incrementally. This process continues until all the
elements of the training set are perfectly classified.
At the heart of the algorithm are two update rules (here y(i) is the target output for the current training sample x(i), i = 1, …, N, and $\hat{y}^{(i)}$ is the output generated by the perceptron):
$$w \leftarrow w + \Delta w = w + \eta\, y^{(i)} x^{(i)}$$
(4.12a)
$$w_0 \leftarrow w_0 + \Delta w_0 = w_0 + \eta\, y^{(i)} R^2$$
(4.12b)
The quantity R is called the radius of the training data: the radius of the hypersphere, centered at the origin of our coordinate system, that encloses all the points of the dataset. In our case, where the data universe is ℜⁿ, this is simply the position vector length of the training set point located farthest from the origin [53]:
$$R = \max_{1 \le i \le N} \left\|x^{(i)}\right\|$$
(4.13)
where || · || stands for the Euclidean norm.
The quantity η is a positive scale factor (0 < η ≤ 1) that sets the step size. Called the learning
rate, it controls the convergence speed of the search heuristic. Note that if η is too small,
convergence is needlessly slow, whereas if η is too large, the correction process will overshoot,
and can even diverge. So the choice of η is crucial.
The intuition behind the update rules is that if $\hat{y}^{(i)} = y^{(i)}$ (the point is correctly classified), there is no weight update (Δw = 0, Δw0 = 0). In the case of a misclassified point ($\hat{y}^{(i)} \ne y^{(i)}$), the rules attempt to correct the position of the decision surface in such a way that the point is no longer misclassified.
One may come across different expressions of the weight changes Δw, Δw0 in the literature
compared to the ones given in (4.12). However, the conceptual framework of the weight
updates remains the same. For example, the following update rules are probably most
commonly employed:
$$w \leftarrow w + \Delta w = w + \eta\,(y^{(i)} - \hat{y}^{(i)})\, x^{(i)}$$
$$w_0 \leftarrow w_0 + \Delta w_0 = w_0 + \eta\,(y^{(i)} - \hat{y}^{(i)})$$
(4.14)
Here, we describe the conceptual framework with respect to the rules given in (4.12).
Consider a training dataset point (x(i), y(i)) with y(i) = +1. If this point is correctly classified by the decision surface ($\hat{y}^{(i)} = +1$), Δw and Δw0 are zero and no weights are updated.
Suppose the perceptron outputs a –1, when the target output is +1. The update
rule (4.12a) attempts to correct this misclassification by rotating the decision surface in the
direction of x(i). The rotation is accomplished by adding a scaled version of x(i) to the normal
vector w (refer to Fig. 4.6). An analogous computation can be performed for a misclassified
point with a target value of –1. In this case, the adjustment term will be subtracted from the
normal vector, causing the rotation in the opposite direction.
Figure 4.6 The point x(i) is no longer misclassified after the rotation of the hyperplane

The second update rule (4.12b) attempts to correct a misclassification by translating the
decision surface. We may rewrite the rule as,
$$-w_0 \leftarrow -w_0 - \eta\, y^{(i)} R^2$$
or
$$b \leftarrow b - \eta\, y^{(i)} R^2$$
(4.15)
For a misclassified point with y(i) = +1, b is reduced and we need to translate the decision
surface in the direction opposite to the normal vector (refer to Fig. 4.7).
Figure 4.7 Translation of hyperplane

The overall effect of the two update rules is demonstrated in Fig. 4.8. A square point is
misclassified at time step t. Decision surface is rotated and translated at time step t+1. The
square point is now correctly classified but overcompensation results in misclassification of a
circle point. This misclassification in turn forces the perceptron algorithm to apply the update
rules in the opposite direction during the next iteration, leading to a decision surface at time
step t+2. This process continues. If one episode is completed (all data used), another episode
starts from the first sample. The algorithm terminates when all the points are classified
correctly.
Figure 4.8 Rotation and translation of decision surface

The solution is nonunique because there is more than one hyperplane separating two linearly separable classes (refer to Fig. 4.9). The decision surface search stops as soon as some
surface is found that separates the training set. This can lead to decision surfaces that are
positioned close to training set points. Considering that the training dataset is only an
approximate representation of the rest of the data universe, such solutions can lead to
misclassifications of unseen points.
Attractiveness of the perceptron algorithm lies in its simplicity. There is, however, a major
problem associated with this algorithm for real-world solutions: datasets are almost
certainly not linearly separable, while the algorithm finds a separating hyperplane only for
linearly separable data. When the dataset is linearly inseparable, the test of the decision
surface will always fail for some subset of training points, regardless of the adjustments we
make to the free parameters, and the algorithm will loop forever. There may, however, be
situations when a linear separating hyperplane can be a good solution even when the data are
overlapped; an upper bound needs to be imposed on the number of iterations when this
method is applied in practice.
Figure 4.9 Nonunique solution
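Putting the pieces together, the following is a minimal sketch of the perceptron training loop, implementing update rules (4.12) with R computed as in Eqn (4.13) and an upper bound on the number of epochs, as suggested above; the toy dataset is hypothetical:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Train w, w0 with update rules (4.12); y entries are +1/-1."""
    N, n = X.shape
    w, w0 = np.zeros(n), 0.0
    R2 = max(np.linalg.norm(x) ** 2 for x in X)   # R^2, from Eqn (4.13)
    for _ in range(max_epochs):                   # upper bound on iterations
        errors = 0
        for x, t in zip(X, y):
            if t * (w @ x + w0) <= 0:             # misclassified (or on boundary)
                w = w + eta * t * x               # rotate decision surface (4.12a)
                w0 = w0 + eta * t * R2            # translate decision surface (4.12b)
                errors += 1
        if errors == 0:                           # an episode with no mistakes: done
            break
    return w, w0

# Hypothetical linearly separable samples.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, w0 = perceptron_train(X, y)
print(w, w0, np.sign(X @ w + w0))
```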

2b) LINEAR NEURON AND THE WIDROW-HOFF LEARNING RULE


The perceptron ('softer' version; sigmoid unit) has been a fundamental building block in the
present-day neural models. Another important building block has been ADALINE (ADAptive LINear Element), developed by Bernard Widrow and Marcian Hoff in 1959. It originated from the field of signal processing, or more specifically, from the adaptive noise cancellation problem. It resulted from the solution of a regression problem, the regressor having the properties of a noise canceller (linear filter). All its power in the linear domain is still in full service, and despite being a simple neuron, it is present (without a thresholding device) in almost all the neural models for regression functions. The words ADALINE and linear neuron are
both used here for a neural processing unit with a linear activation function shown in Fig. 5.12a.
The neuron labeled with summation sign only (Fig. 5.12b) is equivalent to linear neuron of Fig.
5.12a.
Figure 5.12 Neural processing units with a linear activation function
Note that this 'sum of error squares' criterion function is deterministic, and the gradient descent scheme gives a deterministic algorithm for minimization of this function. Now we explore a digression from this criterion function. Consider the problem of computing the weights (w, w0) so as to minimize the Mean Square Error (MSE) between desired and true outputs, defined as follows:
$$E(w, w_0) = E\!\left[\frac{1}{2}\sum_{i=1}^{N}\left(y^{(i)} - (w^T x^{(i)} + w_0)\right)^2\right]$$
(5.27)
where E is the statistical expectation operator.
The solution to this problem requires the computation of autocorrelation matrix E[x xT] of the
set of feature vectors, and cross-correlation matrix E[xy] between the desired response and the
feature vector. This presupposes knowledge of the underlying distributions which, in general, are not known. Thus, our major goal becomes to see if it is possible to solve this optimization problem without having this statistical information.
The Least Mean Square (LMS) algorithm, originally formulated by Widrow and Hoff, is
a stochastic gradient algorithm that iterates weights (w, w0) in the regressor after each
presentation of a data sample, unlike the standard gradient descent that iterates weights after
presentation of the whole training dataset. That is, the kth iteration in standard gradient
descent means the kth epoch, or the kth presentation of the whole training dataset,
while the kth iteration in stochastic gradient descent means the presentation of the kth single training data pair (drawn in sequence or randomly). Thus, the calculation of the weight change $\Delta\bar{w}$, or the gradient needed for this, is pattern-based, not epoch-based ($\Delta\bar{w} = -\eta\,\nabla E(\bar{w})$; Eqn (5.23b)).
LMS is called a stochastic gradient algorithm because the gradient vector is chosen at 'random'
and not, as in steepest descent case, precisely derived from the shape of the total error surface.
Random here means the instantaneous value of the gradient. This is then used as the estimator
of the true quantity.
The design of the LMS algorithm is very simple, yet a detailed analysis of its convergence
behavior is a challenging mathematical task. It turns out that under mild conditions, the
solution provided by the LMS algorithm converges in probability to the solution of the sum-of-
error-squares optimization problem.
Stochastic Gradient Descent
While the standard gradient descent training rule of Eqn (5.26) calculates weight updates after
summing errors over all the training examples in the given dataset D; the concept behind
stochastic gradient descent is to approximate this gradient descent search by updating
weights incrementally, following the calculation of the error for each individual example. This
modified training rule is like the training rule given by Eqn (5.26), except that as we iterate
through each training example, we update the weights according to the gradient with respect
to the distinct error function,
$$E(k) = \frac{1}{2}\left[y^{(i)} - \hat{y}(k)\right]^2 = \frac{1}{2}\left[e(k)\right]^2$$
(5.28a)
$$\hat{y}(k) = \sum_{j=1}^{n} w_j(k)\, x_j^{(i)} + w_0(k)$$
(5.28b)
where k is the iteration index. Note that the input components xj(i) and the desired
output y(i) are not functions of the iteration index. Training pairs (x(i), y(i)), drawn in sequence
or randomly, are presented to the network at each iteration. The gradients with respect to
weights and bias are computed as follows:
$$\frac{\partial E(k)}{\partial w_j(k)} = e(k)\frac{\partial e(k)}{\partial w_j(k)} = -e(k)\frac{\partial \hat{y}(k)}{\partial w_j(k)} = -e(k)\, x_j^{(i)}; \qquad \frac{\partial E(k)}{\partial w_0(k)} = -e(k)$$
The stochastic gradient descent algorithm becomes,
$$w_j(k+1) = w_j(k) + \eta\, e(k)\, x_j^{(i)}$$
(5.29a)
$$w_0(k+1) = w_0(k) + \eta\, e(k)$$
(5.29b)
In terms of vectors, this algorithm may be expressed as,
$$w(k+1) = w(k) + \eta\, e(k)\, x^{(i)}$$
(5.30a)
$$w_0(k+1) = w_0(k) + \eta\, e(k)$$
(5.30b)
The stochastic gradient training algorithm iterates over the training examples i = 1, 2, …, N (drawn in sequence or randomly), at each iteration altering the weights as per the above equations. The sequence of these weight updates, iterated over all the training examples, gives rise to a reasonable approximation to the gradient with respect to the entire set of training data.
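A minimal sketch of this stochastic gradient (LMS) training loop for the linear neuron, implementing Eqns (5.29a) and (5.29b) over randomly drawn training pairs; the data are hypothetical, generated from a known linear relationship so convergence can be checked:

```python
import numpy as np

def adaline_sgd(X, y, eta=0.01, epochs=50, seed=0):
    """Stochastic gradient descent on the linear neuron, Eqns (5.29a)-(5.29b)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in rng.permutation(N):             # pairs drawn randomly
            e = y[i] - (w @ X[i] + w0)           # e(k) = y(i) - y_hat(k)
            w = w + eta * e * X[i]               # w_j(k+1) = w_j(k) + eta e(k) x_j(i)
            w0 = w0 + eta * e                    # w_0(k+1) = w_0(k) + eta e(k)
    return w, w0

# Hypothetical regression data from a known linear relationship.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 + 0.01 * rng.normal(size=100)
print(adaline_sgd(X, y))   # weights approach (3.0, -2.0) and bias 0.5
```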
3a) LARGE MARGIN CLASSIFIER FOR LINEARLY SEPARABLE DATA
Let the set of training (data) examples D be
$$D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\}$$
(4.16)
where x = [x1 x2 … xn]T is an n-dimensional input vector (pattern with n features) for the ith example in a real-valued space X ⊆ ℜⁿ; y is its class label (output value), and y ∊ {+1, −1}. +1 denotes Class 1 and −1 denotes Class 2.


To build a classifier, SVM finds a linear function of the form
$$g(x) = w^T x + w_0$$
(4.17)
so that the input vector x(i) is assigned to Class 1 if g(x(i)) > 0, and to Class 2 if g(x(i)) < 0, i.e.,
$$y^{(i)} = \begin{cases} +1 & \text{if } w^T x^{(i)} + w_0 > 0 \\ -1 & \text{if } w^T x^{(i)} + w_0 < 0 \end{cases}$$
(4.18)
Hence, g(x) is a real-valued function; g: X ⊆ ℜⁿ → ℜ.
w = [w1 w2 … wn]T ∊ℜn is called the weight vector and w0 ∊ℜ is called the bias.
In essence, SVM finds a hyperplane
$$w^T x + w_0 = 0$$
(4.19)
that separates Class 1 and Class 2 training examples. This hyperplane is called the decision
boundary or decision surface. Geometrically, the hyperplane (4.19) divides the input space into
two half spaces: one half for Class 1 examples and the other half for Class 2 examples. Note that
hyperplane (4.19) is a line in a two-dimensional space and a plane in a three-dimensional space.
For linearly separable data, there are many hyperplanes (lines in two-dimensional feature
space; Fig. 4.9) that can perform separation. How can one find the best one? The SVM
framework provides a good answer to this question. Among all the hyperplanes that minimize
the training error, find the one with the largest margin—the gap between the data points of the
two classes. This is an intuitively acceptable approach: select the decision boundary that is far
away from both the classes (Fig. 4.10). Large-margin separation is expected to yield good
classification on previously unseen data, i.e., good generalization.
Figure 4.10

From Section 4.2, we know that in wTx + w0 = 0, w defines a direction perpendicular to the
hyperplane. w is called the normal vector (or simply normal) of the hyperplane. Without
changing the normal vector w, varying w0 moves the hyperplane parallel to itself. Note also that wTx + w0 = 0 has an inherent degree of freedom: we can rescale the hyperplane to KwTx + Kw0 = 0 for K ∊ ℜ⁺ (positive real numbers) without changing the hyperplane.
Since SVM maximizes the margin between Class 1 and Class 2 data points, let us find the
margin. The linear function g(x) = wTx + w0 gives an algebraic measure of the
distance r from x to the hyperplane wTx + w0 = 0. We have seen earlier in Section 4.2 that this
distance is given by (Eqn (4.7))
$$r = \frac{g(x)}{\|w\|}$$
(4.20)
Now consider a Class 1 data point (x(i), +1) that is closest to the hyperplane wTx + w0 = 0 (Fig.
4.11).
Figure 4.11 Geometric interpretation of algebraic distances of points to a hyperplane for two-
dimensional case
The distance d1 of this data point from the hyperplane is
$$d_1 = \frac{g(x^{(i)})}{\|w\|} = \frac{w^T x^{(i)} + w_0}{\|w\|}$$
(4.21a)
Similarly,
$$d_2 = \frac{g(x^{(k)})}{\|w\|} = \frac{w^T x^{(k)} + w_0}{\|w\|}$$
(4.21b)

where (x(k), −1) is a Class 2 data point closest to the hyperplane wTx + w0 = 0. We define two parallel hyperplanes ℋ1 and ℋ2 that pass through x(i) and x(k), respectively. ℋ1 and ℋ2 are also parallel to the hyperplane wTx + w0 = 0. We can rescale w and w0 to obtain (this rescaling, as we shall see later, simplifies the quest for significant patterns, called support vectors)
$$\mathcal{H}_1: w^T x + w_0 = +1; \qquad \mathcal{H}_2: w^T x + w_0 = -1$$
(4.22)
such that
$$w^T x^{(i)} + w_0 \ge +1 \;\text{ if } y^{(i)} = +1; \qquad w^T x^{(i)} + w_0 \le -1 \;\text{ if } y^{(i)} = -1$$
(4.23a)
or equivalently
$$y^{(i)}(w^T x^{(i)} + w_0) \ge 1$$

(4.23b)
which indicates that no training data fall between hyperplanes ℋ1 and ℋ2. The distance between the two hyperplanes is the margin M. In the light of the rescaling given by (4.22),
$$d_1 = \frac{1}{\|w\|}; \qquad d_2 = -\frac{1}{\|w\|}$$
(4.24)
where the '−' sign indicates that x(k) lies on the side of the hyperplane wTx + w0 = 0 opposite to
that where x(i) lies. From Fig. 4.11, it follows that
$$M = d_1 - d_2 = \frac{2}{\|w\|}$$
(4.25)
Equation (4.25) states that maximizing the margin of separation between classes is equivalent
to minimizing the Euclidean norm of the weight vector w.
Our interest here is in the following nonlinear optimization problem with inequality constraints:
$$\text{minimize } f(x) \quad \text{subject to } g_i(x) \ge 0;\; i = 1, \ldots, m$$
(4.26)
where x = [x1 x2 … xn]T, and the functions f and gi are continuously differentiable.
The optimality conditions are expressed in terms of the Lagrangian function
$$L(x, \lambda) = f(x) - \sum_{i=1}^{m} \lambda_i\, g_i(x)$$
(4.27)
where λ = [λ1 … λm]T is a vector of Lagrange multipliers.
An optimal solution to the problem (4.26) must satisfy the following necessary conditions,
called Karush-Kuhn-Tucker (KKT) conditions:
$$\text{(i)}\;\; \frac{\partial L(x, \lambda)}{\partial x_j} = 0;\; j = 1, \ldots, n \qquad \text{(ii)}\;\; g_i(x) \ge 0;\; i = 1, \ldots, m$$
$$\text{(iii)}\;\; \lambda_i \ge 0;\; i = 1, \ldots, m \qquad \text{(iv)}\;\; \lambda_i\, g_i(x) = 0;\; i = 1, \ldots, m$$
(4.28)
In view of condition (iii), the vector of Lagrange multipliers belongs to the set {λ ∊ ℜᵐ, λ ≥ 0}. Also note that condition (ii) is the original set of constraints.


For this class of optimization problems, when there exist vectors x0 and λ0 such that the point
(x0, λ0) satisfies the KKT conditions (4.28), then x0 gives the global minimum of the
function f(x), with the constraint given in (4.26).

Let
$$L^*(x) = \max_{\lambda \in \Re^m} L(x, \lambda), \quad \text{and} \quad L^*(\lambda) = \min_{x \in \Re^n} L(x, \lambda)$$
It is clear from these equations that for any x ∊ ℜⁿ and λ ∊ ℜᵐ,
$$L^*(\lambda) \le L(x, \lambda) \le L^*(x)$$
and thus, in particular,
$$L^*(\lambda) \le L^*(x)$$
This holds for any x ∊ ℜⁿ and λ ∊ ℜᵐ; so it holds for the λ that maximizes the left-hand side, and the x that minimizes the right-hand side. Thus,
$$\max_{\lambda \in \Re^m} \min_{x \in \Re^n} L(x, \lambda) \;\le\; \min_{x \in \Re^n} \max_{\lambda \in \Re^m} L(x, \lambda)$$
The two problems, min-max and max-min, are said to be dual to each other. We refer to the
min-max problem as the primal problem. The objective to be minimized, L*(x), is referred to as
the primal function. The max-min problem is referred to as the dual problem, and L*(λ) as the dual function. The optimal primal and dual function values are equal when f is a convex function and the gi are linear functions. The concept of duality is widely used in the optimization
literature. The aim is to provide an alternative formulation of the problem which is more
convenient to solve computationally and/or has some theoretical significance. In the context of
SVM, the dual problem is not only easy to solve computationally, but also crucial for
using kernel functions to deal with nonlinear decision boundaries. This will be clear later.
The nonlinear optimization problem defined in (4.26) can be represented as min-max problem,
as follows:
For the Lagrangian (4.27), we have
$$L^*(x) = \max_{\lambda \in \Re^m} \left[f(x) - \sum_{i=1}^{m} \lambda_i\, g_i(x)\right]$$
Since gi(x) ≥ 0 for all i, λi = 0 (i = 1, …, m) would maximize the Lagrangian; thus,
$$L^*(x) = f(x)$$
Therefore, our original constrained problem (4.26) becomes the min-max primal problem:
$$\underset{x \in \Re^n}{\text{minimize}}\; L^*(x) \quad \text{subject to}\; g_i(x) \ge 0;\; i = 1, \ldots, m$$
The concept of duality gives the following formulation for the max-min dual problem:
$$\underset{\lambda \in \Re^m,\, \lambda \ge 0}{\text{maximize}}\; L^*(\lambda)$$
More explicitly, this nonlinear optimization problem with dual variables λ can be written in the form:
$$\underset{\lambda \ge 0}{\text{maximize}}\; \min_{x \in \Re^n} \left[f(x) - \sum_{i=1}^{m} \lambda_i\, g_i(x)\right]$$
(4.29)
Let us now state the learning problem in SVM.
Given a set of linearly separable training examples,
$$D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\},$$
the learning problem is to solve the following constrained minimization problem:
$$\text{minimize}\;\; f(w) = \frac{1}{2} w^T w \quad \text{subject to}\;\; y^{(i)}(w^T x^{(i)} + w_0) \ge 1;\; i = 1, \ldots, N$$
(4.30)
This formulation is called the primal formulation of hard-margin SVM. Solving this problem will
produce the solutions for w and w0 which in turn, give us the maximal margin
hyperplane wTx + w0 = 0 with the margin 2/||w||.
The objective function is quadratic and convex in parameters w, and the constraints are linear
in parameters w and w0. The dual formulation of this constrained optimization problem is
obtained as follows.
First we construct the Lagrangian:
$$L(w, w_0, \lambda) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \lambda_i \left[y^{(i)}(w^T x^{(i)} + w_0) - 1\right]$$
(4.31)
The KKT conditions are as follows:
i. $\frac{\partial L}{\partial w} = 0$; which gives $w = \sum_{i=1}^{N} \lambda_i\, y^{(i)} x^{(i)}$
$\frac{\partial L}{\partial w_0} = 0$; which gives $\sum_{i=1}^{N} \lambda_i\, y^{(i)} = 0$
(4.32)
ii. y(i) (wTx(i) + w0) – 1 ≥ 0; i = 1, …, N
iii. λi ≥ 0; i = 1, …, N
iv. λi[y(i)(wTx(i) + w0) – 1] = 0; i = 1, …, N
From condition (i) of KKT conditions (4.32), we observe that the solution vector has an
expansion in terms of training examples. Note that although the solution w is unique (due to
the strict convexity of the function f(w)), the dual variables λi need not be. There is a dual variable λi for each training data point. Condition (iv) of KKT conditions (4.32) shows that for data points not on the margin hyperplanes (i.e., ℋ1 and ℋ2), λi = 0:
$$y^{(i)}(w^T x^{(i)} + w_0) - 1 > 0 \;\Rightarrow\; \lambda_i = 0$$
For data points on the margin hyperplanes, λi ≥ 0:
$$y^{(i)}(w^T x^{(i)} + w_0) - 1 = 0 \;\Rightarrow\; \lambda_i \ge 0$$
However, the data points on the margin hyperplanes with λi = 0 do not contribute to the
solution w, as is seen from condition (i) of KKT conditions (4.32). The data points on the margin
hyperplanes with associated dual variables λi > 0 are called support vectors, which give the
name to the algorithm, support vector machines.
To postulate the dual problem, we first expand Eqn (4.31), term by term, as follows:
$$L(w, w_0, \lambda) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \lambda_i\, y^{(i)} w^T x^{(i)} - w_0 \sum_{i=1}^{N} \lambda_i\, y^{(i)} + \sum_{i=1}^{N} \lambda_i$$
(4.33)
Transformation from the primal to the corresponding dual is carried out by setting the partial
derivatives of the Lagrangian (4.33) with respect to the primal variables (i.e., w and w0) to zero,
and substituting the resulting relations back into the Lagrangian. The objective is to merely
substitute condition (i) of KKT conditions (4.32) into the Lagrangian (4.33) to remove the primal
variables; which gives us the dual objective function.
The third term on the right-hand side of Eqn (4.33) is zero by virtue of condition (i) of KKT
conditions (4.32). Furthermore, from this condition we have,
$$w^T w = \sum_{i=1}^{N} \lambda_i\, y^{(i)} w^T x^{(i)} = \sum_{i=1}^{N} \sum_{k=1}^{N} \lambda_i \lambda_k\, y^{(i)} y^{(k)}\, x^{(i)T} x^{(k)}$$
Accordingly, minimization of function L in Eqn (4.33) with respect to primal variables w and w0,
gives us the following dual objective function:
$$L^*(\lambda) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{N} \lambda_i \lambda_k\, y^{(i)} y^{(k)}\, x^{(i)T} x^{(k)}$$
(4.34)
We may now state the dual optimization problem.
Given a set of linearly separable training examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, find the dual variables $\{\lambda_i\}_{i=1}^{N}$ that maximize the objective function (4.34) subject to the constraints
$$\sum_{i=1}^{N} \lambda_i\, y^{(i)} = 0; \qquad \lambda_i \ge 0;\; i = 1, \ldots, N$$
(4.35)
This formulation is the dual formulation of the hard-margin SVM.
Having solved the dual problem numerically (using MATLAB's quadprog function, for example),
the resulting optimum λi values are then used to compute w and w0. w is computed using
condition (i) of KKT conditions (4.32):
$$w = \sum_{i=1}^{N} \lambda_i\, y^{(i)} x^{(i)}$$
(4.36)
and w0 is computed using condition (iv) of KKT conditions (4.32):
$$\lambda_i\left[y^{(i)}(w^T x^{(i)} + w_0) - 1\right] = 0;\; i = 1, \ldots, N$$
(4.37)
Note that though there are N values of λi in Eqn (4.36), most vanish with λi = 0 and only a small percentage have λi > 0. The set of x(i) whose λi > 0 are the support vectors, and as we see in Eqn (4.36), w is the weighted sum of these training instances that are selected as the support vectors:
$$w = \sum_{i \in \text{svindex}} \lambda_i\, y^{(i)} x^{(i)}$$
(4.38)
where svindex denotes the set of indices of support vectors.
From Eqn (4.38), we see that the support vectors x(i); i ∊ svindex, satisfy
$$y^{(i)}(w^T x^{(i)} + w_0) = 1$$
and lie on the margin. We can use this fact to calculate w0 from any support vector as,
$$w_0 = \frac{1}{y^{(i)}} - w^T x^{(i)}$$
For y(i) ∊ {+1, −1}, we can equivalently express this equation as,
$$w_0 = y^{(i)} - w^T x^{(i)}$$
(4.39)
Instead of depending on one support vector to compute w0, in practice, all support vectors are used to compute w0, and then their average is taken for the final value of w0. This is because the values of λi are computed numerically and can have numerical errors.
$$w_0 = \frac{1}{|\text{svindex}|} \sum_{i \in \text{svindex}} \left(y^{(i)} - w^T x^{(i)}\right)$$
(4.40)
where |svindex| corresponds to the total number of indices in the set svindex, i.e., the total number of support vectors.
The majority of λi are 0, for which y(i)(wTx(i) + w0) > 1. These are the x(i) points that lie more than adequately far from the discriminant, and they have zero effect on the hyperplane. The instances that are not support vectors carry no information; the same solution will be obtained on removing any subset of them. From this viewpoint, the SVM algorithm can be said to be similar to the k-NN algorithm (Section 3.4), which stores only the instances neighboring the class discriminant.
During testing, we do not enforce a margin. We calculate
$$g(x) = w^T x + w_0$$
(4.41a)
and choose the class according to the sign of g(x): sgn(g(x)), which we call the indicator function $i_F$,
$$i_F = \hat{y} = \text{sgn}(w^T x + w_0)$$
(4.41b)
Choose Class 1 ($\hat{y} = +1$) if wTx + w0 > 0, and Class 2 ($\hat{y} = -1$) otherwise.
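A minimal end-to-end sketch in Python, using scipy.optimize.minimize as a stand-in for MATLAB's quadprog: it maximizes the dual objective (4.34) subject to the constraints (4.35), then recovers w via Eqn (4.36) and w0 via Eqn (4.40). The toy data are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Matrix H[i,k] = y(i) y(k) x(i)^T x(k) appearing in the dual objective (4.34).
H = (y[:, None] * X) @ (y[:, None] * X).T

neg_dual = lambda lam: 0.5 * lam @ H @ lam - lam.sum()   # minimize -L*(lambda)
cons = [{"type": "eq", "fun": lambda lam: lam @ y}]      # sum_i lambda_i y(i) = 0
res = minimize(neg_dual, np.zeros(N), bounds=[(0, None)] * N, constraints=cons)

lam = res.x
sv = lam > 1e-6                                          # support vector indices
w = ((lam * y)[:, None] * X).sum(axis=0)                 # Eqn (4.36)
w0 = np.mean(y[sv] - X[sv] @ w)                          # Eqn (4.40), averaged over SVs
print(w, w0, np.sign(X @ w + w0))                        # indicator function (4.41b)
```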
4b) MEASURES OF IMPURITY FOR EVALUATING SPLITS IN DECISION TREES
An impurity measure is a heuristic for selection of the splitting criterion that best separates a given dataset D of class-labeled training tuples into individual classes. If we divide D into smaller partitions as per the outcome of the splitting criterion, each partition should ideally be pure, with all the tuples falling into each partition belonging to the same class. We choose the criterion that is closest in terms of outcome to this ideal scenario.
Evaluation of potential splits may be done with the help of various criteria. Alternative splitting criteria may result in trees appearing very different from one another, but having similar performance. Diverse impurity measures choose different splits, but since all the measures attempt to capture the same idea, the resulting models end up behaving in a similar manner. In this section, we explain three popular impurity measures: information gain, gain ratio, and Gini index.
INFORMATION GAIN/ENTROPY REDUCTION
Information gain uses a clever idea for defining impurity, borrowed from the world of
information theory (revisiting Section 3.9 will be helpful). If a leaf is entirely pure, then the
classes in the leaf can be described very easily—there is only one. On the other hand, if a leaf is
highly impure, then describing the classes in the leaf is much more complicated. Information
theory has a measure for this, called entropy, which measures how disorganized a system is.
The root node holds the entire dataset D, which describes the patterns s(1), s(2), …, s(N) with their corresponding classes y1 or y2 (for a binary classification task). Imagine selecting a pattern at random from the dataset D and announcing that it belongs to class yq. This message has the probability
$$P_q = \frac{\text{freq}(y_q, D)}{|D|}$$
(8.1)
where freq(yq, D) stands for the number of patterns in D that belong to class yq, and |D| denotes the total number of patterns in D (|D| = N).
The expected information needed to classify a pattern in D is given by
$$\text{Info}(D) = -\sum_{q=1}^{2} P_q \log_2 (P_q)$$
(8.2)
A log function to the base 2 is used because information is encoded in bits (refer to Section 3.9). Info(D) is just the average amount of information needed to identify the class label of a pattern in D. Note that at this point, the information we have is solely based on the proportions of patterns in each class. Info(D) can also be expressed as the entropy of D, denoted as Entropy(D):
$$\text{Entropy}(D) = -\sum_{q=1}^{2} P_q \log_2 P_q$$
(8.3)
Associated with the root node of the decision tree, Info(D) represents the expected amount of information that would be needed to specify whether a new instance should be classified as y1 or y2, given that the example reached the node. Info(D) is 0 if all patterns in D belong to the same class (P1 = 0, P2 = 1): −P1 log2 P1 − P2 log2 P2 = 0 (note that 0 log2 0 = 0). Info(D) is 1 when the collection D contains an equal number of Class 1 and Class 2 patterns (P1 = 1/2, P2 = 1/2), representing maximum heterogeneity (randomness) in the dataset: −P1 log2 P1 − P2 log2 P2 = 1. If the collection D contains an unequal number of Class 1 and Class 2 patterns, Info(D) is between 0 and 1. It is, thus, a measure of the impurity of the collection of examples.
To illustrate, we consider the training set of Table 8.1 (Weather Data). It has nine examples of class Yes and five examples of class No. Therefore,
$$\text{Info}(D) = \text{Entropy}(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94 \text{ bits}$$
The root node with dataset D will therefore be a highly impure node.
The training set D contains instances that belong to a mixture of classes (high entropy). In this situation, the idea of the 'divide-and-conquer' strategy is to divide D into subsets of instances that are, or seem to be, heading towards single-class collections of instances.
If we partition D on attribute xj, with values $v_l^{x_j}$; l = 1, …, dj, into dj subsets {D1, D2, …, Ddj}, the expected information required to classify a pattern from D based on this partitioning is
$$\text{Info}(D, x_j) = \sum_{l=1}^{d_j} \frac{|D_l|}{|D|} \times \text{Info}(D_l)$$
(8.4)
The term |Dl|/|D| acts as the weight of the lth partition. Info(Dl) is given by
$$\text{Info}(D_l) = -\sum_{q=1}^{2} P_{ql} \log_2 P_{ql}$$
(8.5)
where Pql is the probability that an arbitrary sample in subset Dl belongs to class yq, and is estimated as
$$P_{ql} = \frac{\text{freq}(y_q, D_l)}{|D_l|}$$
(8.6)
Info(D, xj) is the expected information required to classify a pattern from D based on the partitioning by xj. The smaller the expected information (still) required, the greater the purity of the patterns.
Reconsider the dataset of Table 8.1: x1 = Outlook, x2 = Temperature, x3 = Humidity, x4 = Wind; $v_1^{x_1}$ = sunny, $v_2^{x_1}$ = overcast, $v_3^{x_1}$ = rain; $v_1^{x_2}$ = hot, $v_2^{x_2}$ = mild, $v_3^{x_2}$ = cool; $v_1^{x_3}$ = high, $v_2^{x_3}$ = normal; $v_1^{x_4}$ = weak, $v_2^{x_4}$ = strong. For these four choices of attribute at the root node, the tree stumps for the weather data are shown in Fig. 8.2.
Consider the attribute x1 = Outlook (Tree stump of Fig. 8.2a). Refer to Figs (8.3) and (8.2a). Five
patterns belong to the value sunny; out of which two belong to Yes class and three belong
to No. Four patterns belong to the value overcast; all of them belong to Yes class. Five patterns
belong to the value rain; out of which three belong to Yes class and two belong to No.
$$\text{Info}(D, x_1) = \sum_{l=1}^{d_1} \frac{|D_l|}{|D|} \times \text{Info}(D_l); \quad d_1 = 3$$
$$\text{Info}(D_l) = -\sum_{q=1}^{2} P_{ql} \log_2 P_{ql}; \quad P_{ql} = \frac{\text{freq}(y_q, D_l)}{|D_l|}$$
Therefore,
$$\text{Info}(D_1) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.97; \quad \text{Info}(D_2) = 0; \quad \text{Info}(D_3) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.97$$
$$\text{Info}(D, x_1) = \frac{5}{14} \times 0.97 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.97 = 0.693$$
Similarly, for other tree stumps of Fig. 8.2, we obtain
Info(D, x2) = 0.911 (Temperature); Info(D, x3) = 0.788 (Humidity); Info(D, x4) = 0.892 (Wind)
We see that the expected information required for classification is the least if we select the attribute Outlook for the root node. Humidity is the next best choice.
Information gain is defined as the difference between the original information requirement
(i.e., based on the partition of classes in the entire dataset DD) and the new requirement (i.e.,
obtained after partitioning on xj). That is,
$$\text{Gain}(D, x_j) = \text{Info}(D) - \text{Info}(D, x_j)$$
(8.7)
In other words, Gain(D, xj) tells us how much would be gained by branching on xj. It is the expected reduction in the information requirement (expected reduction in entropy) by partitioning on xj. The attribute xj with the highest information gain, Gain(D, xj), is chosen as the splitting attribute at the root node. This is equivalent to saying that we want to partition on the attribute xj that would do the best classification, so that the amount of information still required (i.e., Info(D, xj)) to finish the classification task is minimal.
We have selected x1 = Outlook as the splitting attribute at the root node, for which
Gain(D, x1) = 0.94 − 0.693 = 0.247
Gains for other attributes are:
Gain(D, x2) = 0.029; Gain(D, x3) = 0.152; Gain(D, x4) = 0.048
Obviously, Outlook provides the maximum gain.
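These numbers can be reproduced with a short computation over class-count pairs; a sketch (the counts are the Yes/No tallies for the Outlook partitions described above):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class counts, Eqn (8.3)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Weather data: 9 Yes / 5 No overall; Outlook partitions sunny/overcast/rain.
D = [9, 5]
partitions = [[2, 3], [4, 0], [3, 2]]

info_D = entropy(D)                                                # 0.940
info_D_x1 = sum(sum(p) / sum(D) * entropy(p) for p in partitions)  # 0.693
print(round(info_D - info_D_x1, 3))                                # Gain = 0.247
```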
The same strategy is applied recursively to each subset of training instances. Figure 8.4 shows
the possibilities for a further branching at the node reached when Outlook is sunny.
The information gain for the three attributes at the daughter nodes are:
Gain(Temperature) = 0.571; Gain(Humidity) = 0.971; Gain(Wind) = 0.020
Therefore, we select Humidity as the splitting attribute at this point; the resulting daughter nodes are pure, so there is no need to split them any further, and this branch has reached its leaf nodes. Continued application of the same idea leads to the decision tree of Fig. 8.1 for the weather data.
Note that information gain, Gain(D, xj), measures the expected reduction in entropy caused by partitioning the patterns in dataset D according to the attribute xj.
$$\text{Gain}(D, x_j) = \text{Entropy}(D) - \text{Entropy}(D, x_j)$$
(8.8a)
$$= \text{Entropy}(D) - \sum_{l=1}^{d_j} \frac{|D_l|}{|D|} \times \text{Entropy}(D_l)$$
(8.8b)
where
$$\text{Entropy}(D) = -\sum_{q=1}^{2} P_q \log_2 P_q; \quad P_q = \frac{\text{freq}(y_q, D)}{|D|}$$
(8.8c)
and
$$\text{Entropy}(D_l) = -\sum_{q=1}^{2} P_{ql} \log_2 P_{ql}; \quad P_{ql} = \frac{\text{freq}(y_q, D_l)}{|D_l|}$$
(8.8d)
When the output in dataset D belongs to M distinct classes, i.e., y takes on values yq; q = 1, 2, …, M, the defining equations for Gain/Entropy Reduction become:
$$\text{Gain}(D, x_j) = \text{Info}(D) - \sum_{l=1}^{d_j} \frac{|D_l|}{|D|} \times \text{Info}(D_l)$$
(8.9a)
where
$$\text{Info}(D) = -\sum_{q=1}^{M} P_q \log_2 P_q; \quad P_q = \frac{\text{freq}(y_q, D)}{|D|}$$
(8.9b)
and
$$\text{Info}(D_l) = -\sum_{q=1}^{M} P_{ql} \log_2 P_{ql}; \quad P_{ql} = \frac{\text{freq}(y_q, D_l)}{|D_l|}$$
(8.9c)
The first term in Eqn (8.8a) is just the entropy of the original dataset D, which represents the level of randomness in the dataset with respect to the target variable. The second term in Eqn (8.8a) is the expected value of the entropy after D is partitioned using attribute xj. The expected entropy described by this term is simply the sum of the entropies of each subset Dl, weighted by the fraction of patterns |Dl|/|D| that belong to Dl (Eqn (8.8b)). Gain(D, xj) is, therefore, the expected reduction in entropy caused by the partitioning.
$$\text{EntropyReduction}(D, x_j) = \text{Entropy}(D) - \sum_{l=1}^{d_j} \frac{|D_l|}{|D|} \times \text{Entropy}(D_l)$$
(8.10a)
where
$$\text{Entropy}(D) = -\sum_{q=1}^{M} P_q \log_2 P_q; \quad P_q = \frac{\text{freq}(y_q, D)}{|D|}$$
(8.10b)
and
$$\text{Entropy}(D_l) = -\sum_{q=1}^{M} P_{ql} \log_2 P_{ql}; \quad P_{ql} = \frac{\text{freq}(y_q, D_l)}{|D_l|}$$
(8.10c)
Gain Ratio
For ID3, a decision-tree tool developed by Quinlan, the selection of partitioning was made on the basis of information gain/entropy reduction. It gave quite good results and became part of several commercial data mining software packages. It, however, ran into trouble for applications having some attributes xj with a large number of possible values $v_l^{x_j}$; l = 1, ..., dj; giving rise to multiway splitting with many daughter nodes. Just by breaking the larger dataset into a large number of small subsets, the number of classes represented in each node tends to go down, so each daughter node increases in purity. The information gain criterion has, thus, a serious deficiency: it has a strong bias in favor of attributes with a large number of values. The attribute with a large number of values will get selected at the root itself and may lead directly to leaf nodes, resulting in a too simple hypothesis model unable to capture the structure of the data.
C4.5, a successor of ID3, uses an extension of information gain, known as gain ratio, which attempts to overcome this bias. It applies a kind of normalization to information gain using a 'split information' value defined analogously with Info(D, xj) as
$$\text{SplitInfo}(D, x_j) = -\sum_{l=1}^{d_j} \frac{|D_l|}{|D|} \log_2 \frac{|D_l|}{|D|}$$
(8.11)
This value is representative of the potential information derived from the division of the dataset D into dj partitions matching the dj values of the attribute xj. For each value of xj, the number of tuples possessing that value is considered with respect to the total number of tuples in D. This is different from information gain, which measures the information with respect to the classification obtained on the basis of the same partitioning.
The gain ratio is defined as
$$\text{GainRatio}(D, x_j) = \frac{\text{Gain}(D, x_j)}{\text{SplitInfo}(D, x_j)}$$
(8.12)
The attribute with the maximum gain ratio is selected as the splitting attribute. Note
that SplitInfo term discourages the selection of attributes with many uniformly distributed
values dj. SplitInfo is high when dj is large.

One practical issue that arises in using GainRatio in place of Gain to select attributes is that the denominator in Eqn (8.12) can be zero or very small when |Dl| ≃ |D| for one of the Dl. This either makes the GainRatio undefined or very large for such attributes. A standard fix is to choose the attribute that maximizes the gain ratio, provided that the information gain for that attribute is at least as great as the average information gain over all the attributes examined. That is, a constraint is added whereby the information gain of the attribute selected must be large.
Returning to the tree stumps of the weather data in Fig. 8.2, x1 = Outlook splits the dataset into
three subsets of size 5, 4, and 5 (refer to Fig. 8.3); and thus the SplitInfo is given by
$$\text{SplitInfo}(D, x_1) = -\sum_{l=1}^{3} \frac{|D_l|}{|D|} \log_2 \frac{|D_l|}{|D|} = -\frac{5}{14}\log_2\frac{5}{14} - \frac{4}{14}\log_2\frac{4}{14} - \frac{5}{14}\log_2\frac{5}{14} = 1.577$$
without paying any attention to the classes involved in the subsets. We can normalize
the information gain by dividing by the split info value to get the gain ratio:
$$\text{GainRatio}(D, x_1) = \frac{\text{Gain}(D, x_1)}{\text{SplitInfo}(D, x_1)} = \frac{0.247}{1.577} = 0.156$$
The results of these calculations for the tree stumps of Fig. 8.2 are summarized as follows:
Outlook: Gain(D, x1) = 0.247, SplitInfo(D, x1) = 1.577, GainRatio(D, x1) = 0.156
Temperature: Gain(D, x2) = 0.029, SplitInfo(D, x2) = 1.362, GainRatio(D, x2) = 0.019
Humidity: Gain(D, x3) = 0.152, SplitInfo(D, x3) = 1.000, GainRatio(D, x3) = 0.152
Wind: Gain(D, x4) = 0.048, SplitInfo(D, x4) = 0.985, GainRatio(D, x4) = 0.049
Outlook still comes out on top, but Humidity now is a much closer contender because it splits
the data into two subsets instead of three.
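Continuing the sketch above, SplitInfo and GainRatio for Outlook follow directly from the partition sizes (5, 4, 5), reproducing the values 1.577 and 0.156:

```python
from math import log2

def split_info(sizes):
    """SplitInfo over partition sizes, Eqn (8.11)."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s > 0)

gain_outlook = 0.247                  # computed earlier for x1 = Outlook
si = split_info([5, 4, 5])            # sunny/overcast/rain partition sizes
print(round(si, 3), round(gain_outlook / si, 3))   # 1.577, 0.156
```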
Gini Index
Another popular splitting criterion is named Gini, after the 20th century Italian statistician and
economist Corrado Gini. Gini index is used in CART.
$$\text{Gini}(D) = 1 - \sum_{q=1}^{M} P_q^2$$
(8.13)
where Pq is the probability that a tuple in D belongs to class yq, and is estimated by
$$P_q = \frac{\text{freq}(y_q, D)}{|D|}$$
Gini index considers a binary split for each attribute. Let us first consider the case where xj is a continuous-valued attribute having dj distinct values $v_l^{x_j}$; l = 1, 2, …, dj. It is common to take the mid-point between each pair of (sorted) adjacent values as a possible split-point (a simple policy, although something might be gained by adopting a more sophisticated one; one such policy is discussed in the next section). The point giving the minimum Gini index for the attribute xj is taken as its split-point.
For a possible split-point of xj, D1 is the set of tuples in D satisfying xj ≤ split-point, and D2 is the set of tuples satisfying xj > split-point. The reduction in impurity that would be incurred by a binary split on xj is
$$\Delta\text{Gini}(x_j) = \text{Gini}(D) - \text{Gini}(D, x_j)$$
(8.14a)
$$\text{Gini}(D, x_j) = \frac{|D_1|}{|D|}\,\text{Gini}(D_1) + \frac{|D_2|}{|D|}\,\text{Gini}(D_2)$$
(8.14b)
The attribute that maximizes the reduction in impurity (or equivalently, has the minimum Gini index) is selected as the splitting attribute. Then one of these two parts (D1, D2) is divided in a similar manner by choosing a variable and a split value for the variable again. This process is continued till we get pure leaf nodes (refer to Example 8.3).
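A sketch of this binary-split search for a single continuous-valued attribute (the values and labels below are hypothetical): midpoints of sorted adjacent values are scanned, and the split with the minimum weighted Gini index is returned:

```python
def gini(labels):
    """Gini index of a collection of class labels, Eqn (8.13)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Scan midpoints of sorted adjacent values; return (split, weighted Gini)."""
    pairs = sorted(zip(values, labels))
    vals = [v for v, _ in pairs]
    labs = [l for _, l in pairs]
    best = (None, float("inf"))
    for i in range(1, len(vals)):
        if vals[i] == vals[i - 1]:
            continue
        split = (vals[i] + vals[i - 1]) / 2          # candidate mid-point
        left, right = labs[:i], labs[i:]
        g = (len(left) / len(labs)) * gini(left) + (len(right) / len(labs)) * gini(right)
        if g < best[1]:                              # keep the minimum Gini split
            best = (split, g)
    return best

# Hypothetical continuous attribute with binary class labels.
print(best_binary_split([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                        ["A", "A", "A", "B", "B", "B"]))  # split 3.5, Gini 0.0
```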
4a) C4.5 AND CART DECISION TREES
The C4.5 algorithm, a successor and refinement of ID3, is the most popular of the tree-based classification methods. In C4.5, multiway splits for categorical variables are treated the same way as in ID3. Continuous-valued attributes have been incorporated by dynamically defining new discrete-valued attributes that partition the continuous-attribute values into a binary set of intervals. In particular, a new attribute zj is created that is true if $x_j < v_{th}^{x_j}$ and false otherwise. The only question is how to select the best value for the threshold $v_{th}^{x_j}$.
It has been found that an entropy impurity works acceptably in most cases and is a natural
default. However, CART, which uses Gini index as impurity measure, provides a general
framework that can be instantiated in various ways to produce different classification (and
regression) trees.
The CART approach restricts the splits to binary values for both categorical and continuous-
valued attributes. Thus, CART is a binary tree.
Example 8.2
The C4.5 Decision Tree
To illustrate the process of implementation of C4.5 decision tree, we consider the (toy) training
set of Table 7.3, in which there are four
attributes: x1 (Outlook), x2 (Temperature), x3 (Humidity), x4 (Wind); and two classes for the
output variable PlayTennis: Yes, No. The attributes x1 and x4 have categorical values, and the
attributes x2 and x3 have continuous numeric values. Attribute x1 (Outlook) has three
categorical values: sunny, overcast, and rain, and therefore the node labeled Outlook will have
three branches. The node for categorical variable x4 (Wind) will have two branches. The other
two variables are continuous-valued, and as per the strategy followed in C4.5, the
corresponding nodes will have binary splits. This will require discretization of these variables.
We will use here the entropy-based method for attribute discretization discussed earlier
in Section 7.8.
There are four choices of attribute at the root node: Outlook, Temperature, Humidity,
and Wind. Tree stumps for the attribute Outlook (x1) are shown in Fig. 8.7(a); a three-way split
corresponding to the three categorical values of x1.
Figure 8.7 Tree stumps for weather data of Table 7.3

Temperature (x2) has continuous numeric values. For this attribute, we will select the best cut-point $T_{x_2}$ from its range of values by evaluating every candidate cut-point. Examples are first sorted by increasing value of the attribute, and the interval between each successive pair of values in the sorted sequence gives a potential cut-point. However, as stated in Section 7.8, the cut-point that minimizes the entropy will never occur between two patterns of the same class.
Therefore, it is necessary to consider potential divisions that separate patterns of different
classes. For the weather data of Table 7.3, this gives us eight potential cut-points: {64, 65, 70,
71, 72, 75, 80, 83}. Note that boundary points of the intervals between classes have been taken
as the potential cut-points (Example 7.3). Entropy for each of these cut-points is evaluated and
the one that results in maximum information gain/gain ratio is selected as the split-value for
the attribute x2. It follows from Example 7.3 that x2 = 83 is the selected split-value.
Entropy(D) = 0.9402; Entropy(D, $T_{x_2}$ = 83) = 0.8268
Gain(D, $T_{x_2}$ = 83) = 0.9402 − 0.8268 = 0.1134
SplitInfo(D, $T_{x_2}$ = 83) = 0.3711; GainRatio(D, $T_{x_2}$ = 83) = 0.1134/0.3711 = 0.3053
Tree stump for x2 (Temperature) is shown in Fig. 8.7(b). Evaluating all the four candidate root
node variables, Outlook turns out to be the best choice with respect to entropy reduction/gain
ratio measure.
Table 8.3 shows the dataset for the branch sunny, obtained from the data D of Table 7.3. Repeating the process described above on this dataset, we select Humidity as the daughter node with split-value = 70.
Table 8.3 Dataset corresponding to the branch sunny.
s(i) Temperature Humidity Wind Class
1 75 70 strong Yes
2 80 90 strong No
3 85 85 weak No
4 72 95 weak No
5 69 70 weak Yes

The fully-grown C4.5 decision tree from the data of Table 7.3 is shown in Fig. 8.8.
Figure 8.8 Decision tree for the weather data of Table 7.3

Example 8.3
The CART Decision Tree
Consider the data sample shown in Table 8.4, for training a binary CART tree. Note that there
are N = 24 points in two dimensions given by the data sample; a pilot random sample of
households in a city with respect to ownership of a lawn tractor [20]. The dataset is set up for a
predictive model—a lawn-tractor manufacturer would like to find a way of classifying
households in a city into those likely to purchase a lawn tractor and those not likely to buy one.
Input variables, x1 = Income and x2 = Lawn size, are recorded for 24 households, and the target
variable y = Ownership of a lawn tractor, is assigned to each household. The dataset is
balanced, containing equal numbers of Owner/Nonowner households.
Table 8.4 A pilot random sample of households in a city with respect to ownership of a lawn
tractor
Household s(i) Income ($ thousands) x1 Lawn size (thousands ft²) x2 Ownership of a lawn tractor y
1 60 18.4 Owner
2 75 19.6 Nonowner
3 85.5 16.8 Owner
4 52.8 20.8 Nonowner
5 64.8 21.6 Owner
6 64.8 17.2 Nonowner
7 61.5 20.8 Owner
8 43.2 20.4 Nonowner
9 87 23.6 Owner
10 84 17.6 Nonowner
11 110.1 19.2 Owner
12 49.2 17.6 Nonowner
13 108 17.6 Owner
14 59.4 16 Nonowner
15 82.8 22.4 Owner
16 66 18.4 Nonowner
17 69 20 Owner
18 47.4 16.4 Nonowner
19 93 20.8 Owner
20 33 18.8 Nonowner
21 51 22 Owner
22 51 14 Nonowner
23 81 20 Owner
24 63 14.8 Nonowner
When searching for a binary split on a continuous-valued input variable, midpoints between the
consecutive values may be treated as candidate values for the split. The candidate split points
for the variable x1 (Income) are {38.1, 45.3, 50.1, …, 109.5}, and those for x2 (Lawn size) are
{14.4, 15.4, 16.2, …, 23}. We need to rank the candidate split points according to how much
they reduce impurity (heterogeneity) in the resulting subsets after the split. Note that the total dataset D given in Table 8.4 has the highest impurity. With respect to the Gini index as impurity measure,
$$\text{Gini}(D) = 1 - \sum_{q=1}^{2} P_q^2; \quad P_q = \frac{\text{freq}(y_q, D)}{|D|}$$
$$= 1 - (0.5)^2 - (0.5)^2 = 0.5$$
It can easily be verified that the Gini index impurity measure is at its peak when Pq = 0.5, i.e.,
when the data is perfectly balanced.
Calculating Gini index for all the candidate split points for both x1 and x2 variables, and ranking
them according to how much they reduce impurity, we choose x2 (Lawn size) for the first split
with a splitting value of 19. The (x1, x2) space is now divided into two rectangles, one with x2 ≤
19 and the other with x2 > 19. This is illustrated in Fig. 8.9.
$$\text{Gini}(D, x_2) = \frac{|D_1|}{|D|}\,\text{Gini}(D_1) + \frac{|D_2|}{|D|}\,\text{Gini}(D_2) = \frac{12}{24}\left(1 - \left(\tfrac{3}{12}\right)^2 - \left(\tfrac{9}{12}\right)^2\right) + \frac{12}{24}\left(1 - \left(\tfrac{9}{12}\right)^2 - \left(\tfrac{3}{12}\right)^2\right) = \frac{1}{2}(0.375) + \frac{1}{2}(0.375) = 0.375$$
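The arithmetic can be verified directly; a short check of Eqn (8.14b), reading the class counts (3 and 9 in D1; 9 and 3 in D2) from the expression above:

```python
def gini_counts(counts):
    """Gini index of a node from per-class counts, Eqn (8.13)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

g1 = gini_counts([3, 9])         # x2 <= 19: 3 of one class, 9 of the other
g2 = gini_counts([9, 3])         # x2 >  19: the reverse proportions
print(12/24 * g1 + 12/24 * g2)   # 0.375, matching Gini(D, x2)
```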
