BITS F464 ML Lecture Notes


Course Logistics

AIM OF THE COURSE


1. Design, implementation & applications of supervised and unsupervised ML algorithms
2. Classification algorithms: Perceptron, Logistic Regression, SVM, Neural Networks,
Decision Trees, Ensembles

COURSE STRUCTURE AND EVALUATION


36-42 lectures over 12-14 weeks (3 hrs per week, Mon/Wed/Fri 9 am to 10 am) at F106

REFERENCES
1. Pattern Recognition and Machine Learning, Christopher M Bishop
2. Machine Learning, Tom M Mitchell
3. The Elements of Statistical Learning, Trevor Hastie et al.
4. Python Machine Learning, Sebastian Raschka

OBJECTIVES
1. Learn theoretical and practical aspects of linear models for regression & classification
2. Understand probabilistic, discriminative & generative models for classification
3. Learn theoretical and practical aspects of SVM and ANN
4. Understand decision tree learning and ensemble methods
Essential Readings

CS229 Lecture Notes, Andrew Ng and Tengyu Ma
- Linear regression - Chapters 1.1, 1.2, 1.3
- Support Vector Machines - Chapter 6

Pattern Recognition and Machine Learning, Christopher M Bishop
- Introduction - Chapters 1.1, 1.2
- Linear Models for Regression - Chapter 3.1
- Linear Models for Classification - Chapters 4.1, 4.2, 4.3
- Neural Networks - Chapters 5.1, 5.2, 5.3

Introduction to Data Mining, Pang Ning Tan, Michael Steinbach, Vipin Kumar
- Classification: Basic Concepts - Chapters 4.3, 4.4
- Classification: Alternative Techniques - Chapters 5.4, 5.5

STAT 479 Lecture Notes, Sebastian Raschka
- Ensemble Methods - Chapter 7
Chapter-1 Models for Regression
1.1 Definitions, Applications and Types of ML
For many problems, it's difficult to program the correct behavior by hand
- recognizing people & objects
- understanding human speech

ML approach - program an algorithm to automatically learn from data or experience

We ask the question "How can we build computer systems that automatically improve with
experience and what are the fundamental laws that govern all learning processes?"

The term Machine Learning was first coined by Arthur Lee Samuel in 1959.
He defined ML as follows:
- ML is the field of study that gives computers the ability to learn without being explicitly
programmed.

A FORMAL DEFINITION
A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.

Example: Handwritten Digit Recognition

T: Classifying handwritten digits from images


P: Percentage of digits classified correctly
E: Dataset of digits given classification (MNIST)

APPLICATIONS

Personal Assistants
Advertising & Business Intelligence
News Feeds/ Filtering Algorithms
Recommendation Engines
....and so on

CATEGORIES OF ML

1. Supervised Learning - Labeled Data, Direct Feedback, Predict Outcome/Future


2. Unsupervised Learning - Unlabeled Data, No Feedback, Find Hidden Structure in Data
3. Reinforcement Learning - Decision Process, Reward System, Learn Series of Actions

Classification: Predict likelihood that something belongs to some class. E.g. Spam E-Mail
Detection

Regression: Predicts a numerical value. E.g. Predict amount of rain on a certain day.
Clustering: Forms clusters based on similarity and finds patterns in data. E.g. Fake News
Detection

Reinforcement Learning: Make predictions by getting rewards or penalties based on actions


performed in an environment. E.g. Robotics

1.2 Supervised Learning


Goal: Given a training set, learn a function

h : X → Y

so that h(x) is a good predictor for y. This function is called a hypothesis.

For a classification task, we define the hypothesis as

h : X → Y

where

X = R^m, Y = {1, …, K} with K class labels

For binary classification, we consider

Y = {0, 1} or {−1, 1}

For regression, we define our hypothesis as

h : R^m → R

Algorithm Categorization Schemes

The supervised learning algorithms can be categorized as follows:

1. Eager vs Lazy - process data immediately vs defer the processing until the prediction
2. Batch vs Online - learn on the entire data at once vs learn from one example at a time.
3. Discriminative vs Generative - model the posterior p(y|x) directly vs model the class-conditional densities and priors

Modular Approach
Data is the driving force
Model is the mathematical relationship from input features to output labels
Loss function is the difference between predicted and actual values
Optimization updates the model parameters to find the best solution
Evaluation to determine the performance of the model

1.3 Regression Analysis - House Price Prediction

Plot of the data

Question: Given data like this, how can we learn to predict the prices of other houses, as a
function of the size of their living areas?

Notation
x^(i) → input features
y^(i) → output/target variable

(x^(i), y^(i)) → training example

{(x^(i), y^(i)) : i = 1, …, n} → training set

Given a training set we wish to learn the hypothesis

h : X → Y

so that h(x) is a good predictor for the corresponding value of y.

Example: We consider two input features, living area and number of bedrooms and wish to
predict price of the house.

The x^(i) are two-dimensional vectors in R^2

x_1^(i) denotes living area of the ith house

x_2^(i) denotes number of bedrooms of the ith house

We try to approximate y as a linear function of x

h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2

Here the θ's are parameters or weights that parameterize the space of linear functions mapping
from X to Y.

To simplify the notation we set x0 = 1 and rewrite the hypothesis as

h(x) = ∑_{i=0}^{m} θ_i x_i = θ^T x

A Simpler Model
Essentials of Linear Algebra - Cheat Sheet by Ivan Savov

A single feature x1

y = θ0 + θ1 x1

How do we learn model parameters?


- make h(x) close to y, at least for the available training examples.
- define a function that measures, for each value of the θ's, how close the h(x^(i))'s are to the corresponding y^(i)'s.

Cost function:
J(θ) = (1/2) ∑_{i=1}^{n} (h_θ(x^(i)) − y^(i))²

A cost function is used to gauge the performance of the ML model; compares the predicted
values with actual values.
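To make the cost function concrete, here is a minimal NumPy sketch (the toy design matrix X, targets y and parameter vector theta are our own illustrative values, not data from the course):

import numpy as np

def cost(theta, X, y):
    # J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

# Toy data: first column is x_0 = 1 (intercept), second is a feature
X = np.array([[1.0, 2.1], [1.0, 1.6], [1.0, 2.4]])
y = np.array([400.0, 330.0, 369.0])
print(cost(np.array([0.0, 150.0]), X, y))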

1.4 Least Mean Squares (LMS) Algorithm


Want to choose θ so as to minimize J(θ)
Start with an initial guess for θ and repeatedly change θ to make J(θ) smaller, until
convergence to a value of θ that minimizes J(θ)

Gradient Descent Algorithm


A generic optimization algorithm that can be used to learn weight vectors of most of the ML
models.
The gradient always points in the direction of steepest increase in the loss function. The
algorithm takes a step in the direction of the negative gradient to reduce the loss.

We define the LMS update rule as follows:


θ_j := θ_j − α ∂J(θ)/∂θ_j

Computing the partial derivative on the RHS for a single training example gives ∂J(θ)/∂θ_j = (h_θ(x) − y) x_j. Hence, for a single training example, the update rule takes the form:

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

This rule is called the LMS or Widrow-Hoff learning rule.

For more than one training example, we replace the rule with the following algorithm:

    repeat until convergence:
        θ_j := θ_j + α ∑_{i=1}^{n} (y^(i) − h_θ(x^(i))) x_j^(i)    (for every j)

This is known as Batch Gradient Descent. It looks at every example in the entire training set
on every step.

Another alternative is to use the following algorithm:

This is known as Stochastic Gradient Descent. We repeatedly run through the training set,
and each time we encounter a training example, we update the parameters according to the
gradient of the error with respect to that single training example only.
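As a rough sketch of how the two variants differ in code (the learning rate, iteration counts and NumPy data layout are our own assumptions):

import numpy as np

def batch_gd(X, y, alpha=0.01, iters=1000):
    # Batch gradient descent: every step sums the gradient over all examples
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)
        theta -= alpha * grad
    return theta

def stochastic_gd(X, y, alpha=0.01, epochs=50):
    # Stochastic gradient descent: update after each single example (LMS rule)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(X.shape[0]):
            err = y[i] - X[i] @ theta
            theta += alpha * err * X[i]
    return theta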

1.5 Normal Equations


A second way to minimize J.
Explicitly take the derivatives with respect to θj's and set them to zero.

Since we know

h_θ(x^(i)) = (x^(i))^T θ

it can be shown that

J(θ) = (1/2) (Xθ − y)^T (Xθ − y)

To minimize J, we set its gradient to zero:

∇_θ J(θ) = X^T Xθ − X^T y = 0

Thus, the value of θ that minimizes J(θ) is given by

θ = (X^T X)^{−1} X^T y
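A quick sketch of the closed-form solution in NumPy (the house sizes and prices are made-up illustrations); solving the normal equations with np.linalg.solve is preferred over forming the inverse explicitly:

import numpy as np

X = np.array([[1.0, 2104.0], [1.0, 1600.0], [1.0, 2400.0], [1.0, 1416.0]])
y = np.array([400.0, 330.0, 369.0, 232.0])

# Solve X^T X theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)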

1.6 Polynomial Regression


Many times the relationship between input features and output labels is non-linear.
Simple linear models are unable to learn such mappings.

Clearly, this is not a great fit.

We create polynomial features by combining existing input features.


Then apply linear regression model on this representation.

We define the error function as follows:

E(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}²
Here, xn are the data points, tn are the corresponding target values and y(xn,w) are the
predicted values.
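A minimal sketch of this idea, creating polynomial features and fitting them by least squares; the sine-plus-noise dataset below stands in for the figure's data and is our own assumption:

import numpy as np

def poly_features(x, M):
    # Map scalar inputs x to [1, x, x^2, ..., x^M]
    return np.vander(x, M + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

Phi = poly_features(x, M=3)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # minimizes E(w)
print(w)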

Model Selection

Choosing the order M of the polynomial.

M = 0 and M = 1 give poor fits to the data and hence poor representations of the function.
M = 3 seems to give the best fit to the function.
For M = 9, we obtain an excellent fit to the training data and E(w*) = 0. The curve, however,
oscillates wildly and gives a poor representation of the function. This is known as overfitting.

The goal is to achieve good generalization by making accurate predictions for new data.
The test set error is a measure of how well we are doing in predicting the values of t for
new data observations of x.
Small values of M give relatively large values of the test set error.
Values of M in the range 3 ≤ M ≤ 8 give small values for the test set error, and these also
give reasonable representations.
For M = 9, the training set error goes to zero and the test set error has become very large.

The issue with Polynomial Regression

Higher order polynomials are very flexible and more prone to overfitting.
This issue is resolved using Regularization.

1.7 Regularization
Controlling model complexity to fit complex models on relatively small dataset without
overfitting.
Idea is to add a penalty term in the loss function for the large weights.
Penalty is a function of the weight vector.
λ is a regularization rate that controls the amount of penalty to be added.

f(b) = E(b) + (λ/n) b^T b

E(b) is the error function
b is the weight vector
b^T b is the penalty

Bias-Variance Tradeoff

Low complexity models -> Low variance, High Bias


High complexity models -> High variance, Low Bias

Instead of minimizing E(b) unconditionally, we impose a restriction on the squared magnitude of the regression coefficients b_1, …, b_p:

∑_{j=1}^{p} b_j² ≤ c

We can think of c as a budget for how much we can spend on the size of all the coefficients.

Now, we have to following constrained minimization:

min_b { (1/n) (Xb − y)^T (Xb − y) }    s.t.    ∥b∥_2^2 = b^T b ≤ c

This can be rewritten as:

min_b { E(b) + (λ/n) b^T b }

Ridge Regression Estimate

b̂_ridge = (X^T X + λI)^{−1} X^T y

λ = 0 causes the ridge solution to be the same as linear regression
λ → ∞ causes the ridge coefficients to collapse to zero.
λ is a tuning parameter, i.e. it cannot be found analytically.
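A minimal sketch of the ridge estimate above (for simplicity it penalizes every coefficient, including any intercept column, which practical implementations usually exclude):

import numpy as np

def ridge_fit(X, y, lam):
    # b = (X^T X + lam * I)^(-1) X^T y, computed via a linear solve
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# lam = 0 recovers ordinary least squares; large lam shrinks b toward zero.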

LASSO (Least Absolute Shrinkage and Selection Operator) Regression

The mathematical setup is given as follows:

min_b { (1/n) (Xb − y)^T (Xb − y) }    s.t.    ∥b∥_1 ≤ c

This is equivalent to:

min_b { E(b) + (λ/n) ∥b∥_1 }

L1 & L2 norms

∥w∥_1 = |w_1| + |w_2| + ⋯ + |w_N|

∥w∥_2 = (|w_1|^2 + |w_2|^2 + ⋯ + |w_N|^2)^(1/2)
Error Surface

Consider two inputs x_1 and x_2 and corresponding parameters b_1 and b_2. The error function E(b) generates a bowl shaped (paraboloid) convex error surface. On the b_1, b_2 plane the locus of points such that b^T b ≤ c is a disk of radius √c, centered at the origin.

1.8 Essentials of Probability Theory


Provides a consistent framework for the quantification and manipulation of uncertainty and
forms one of the central foundations for pattern recognition.

A simple example

We have two boxes, one red and one blue, and in the red box we have 2 apples and 6
oranges, and in the blue box we have 3 apples and 1 orange.
We randomly pick one of the boxes and from that box we randomly select an item of fruit,
and having observed which sort of fruit it is we replace it in the box from which it came. We
could imagine repeating this process many times.
Let us suppose that in so doing we pick the red box 40% of the time and we pick the blue
box 60% of the time, and that when we remove an item of fruit from a box we are equally
likely to select any of the pieces of fruit in the box.
The identity of the box that will be chosen is a random variable, which we shall denote by
B.
Similarly, the identity of the fruit is also a random variable and will be denoted by F.

Now we can compute the following probabilities as follows:

p(B = r) = 4/10 ; probability of selecting the red box

p(B = b) = 6/10 ; probability of selecting the blue box
We can see that p(B = r) + p(B = b) = 1

Supposing we pick a box at random, and it turns out to be the blue box. Then the probability of
selecting an apple in the blue box is given as p(F = a | B = b) = 3/4.

Similarly, we can write all four conditional probabilities as follows:

p(F = a|B = r) = 1/4    p(F = o|B = r) = 3/4
p(F = a|B = b) = 3/4    p(F = o|B = b) = 1/4

Using these, we now compute the overall probability of choosing an apple as:

p(F = a) = p(F = a|B = r)p(B = r) + p(F = a|B = b)p(B = b) = (1/4)(4/10) + (3/4)(6/10) = 11/20

from this it follows that:

p(F = o) = 1 − 11/20 = 9/20

Now, suppose instead we are told that a piece of fruit has been selected and it is an orange,
and we would like to know which box it came from. This requires that we evaluate the
probability distribution over boxes conditioned on the identity of the fruit. To solve this, we use
Bayes' Theorem as follows:

p(B = r|F = o) = p(F = o|B = r)p(B = r) / p(F = o) = (3/4)·(4/10)·(20/9) = 2/3

and it clearly follows that:

p(B = b|F = o) = 1 − 2/3 = 1/3

p(B) is called the prior probability because it is the probability available before we
observe the identity of the fruit.
p(B|F) is called the posterior probability because it is the probability obtained after we have
observed F.
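To make the arithmetic concrete, the following sketch reproduces these numbers with exact fractions (the dictionary encoding of the priors and conditionals is our own):

from fractions import Fraction as F

p_box = {'r': F(4, 10), 'b': F(6, 10)}                 # priors p(B)
p_fruit = {('a', 'r'): F(1, 4), ('o', 'r'): F(3, 4),   # conditionals p(F|B)
           ('a', 'b'): F(3, 4), ('o', 'b'): F(1, 4)}

# Sum and product rules: p(F=o) = sum_B p(F=o|B) p(B)
p_o = sum(p_fruit[('o', box)] * p_box[box] for box in p_box)

# Bayes' theorem: p(B=r|F=o)
posterior_r = p_fruit[('o', 'r')] * p_box['r'] / p_o
print(p_o, posterior_r)   # 9/20 and 2/3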

A summary of the basic rules


1. Sum Rule:

p(X) = ∑_Y p(X, Y)

2. Product Rule:

p(X, Y) = p(Y|X) p(X)

3. Bayes' Theorem:

p(Y|X) = p(X|Y) p(Y) / p(X)

If the joint distribution of two variables factorizes into the product of the marginals, so that
p(X,Y) = p(X)p(Y), then X and Y are said to be independent.

Gaussian/Normal Distribution

For a single real-valued variable x, we define the Gaussian distribution as

N(x|μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))

where µ is the mean, σ² is the variance and σ is the standard deviation. We also define
precision as the reciprocal of the variance, i.e. β = 1/σ².

The univariate Gaussian distribution:

Suppose that we have a data set of observations x = (x1,...,xN )T, representing N


observations of the scalar variable x.
The observations are drawn independently from a Gaussian distribution whose mean µ
and variance σ2 are unknown, and we would like to determine these parameters from the
data set.
Data points that are drawn independently from the same distribution are said to be
independent and identically distributed, which is often abbreviated to i.i.d.

We can write the probability of the dataset, given µ and σ², as:

p(x|μ, σ²) = ∏_{n=1}^{N} N(x_n|μ, σ²)

This is the likelihood function of the Gaussian.

We denote it as

L(μ, σ; data) = P (data; μ, σ)

LHS: The likelihood of the parameters µ and σ taking certain values given that data is
observed.
RHS: Probability density of observing the data with model parameters µ and σ.

1.9 Maximum Likelihood Estimation


We want to know the values of µ and σ that result in the curve that best fits the data.

In practice, we observe that maximizing the log of the likelihood function is easier. Hence, we maximize

ln p(x|μ, σ²) = −(1/(2σ²)) ∑_{n=1}^{N} (x_n − μ)² − (N/2) ln σ² − (N/2) ln(2π)

and obtain

μ_ML = (1/N) ∑_{n=1}^{N} x_n        σ²_ML = (1/N) ∑_{n=1}^{N} (x_n − μ_ML)²

which are nothing but the sample mean and sample variance respectively.
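As a quick numerical sanity check of these estimators on synthetic data (the true parameters 5 and 2 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_ml = np.mean(x)                      # sample mean
var_ml = np.mean((x - mu_ml) ** 2)      # biased (1/N) sample variance
print(mu_ml, var_ml)                    # close to 5 and 4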

Derivation of the result


1.10 Probabilistic Interpretation of Regression
We assume that the target variables and inputs are related as
y^(i) = θ^T x^(i) + ε^(i)

ε^(i) is an error term that captures either unmodeled effects or random noise.
We also assume that the ε^(i) are i.i.d. according to a Gaussian distribution with µ = 0 and variance σ².

The density of ε^(i) is given by

p(ε^(i)) = (1/√(2πσ²)) exp(−(ε^(i))²/(2σ²))

and it implies that

p(y^(i)|x^(i); θ) = (1/√(2πσ²)) exp(−(y^(i) − θ^T x^(i))²/(2σ²))


We can express our uncertainty over the value of the target variable using a probability
distribution.
For this purpose, we shall assume that, given the value of x, the corresponding value of t
has a Gaussian distribution with a mean equal to the value y(x, w) of the polynomial curve.
Given X (the design matrix, containing all x(i)'s) and θ, we wish to know the distribution of
y(i)'s.

The probability of the data is given by


p(y|X; θ)

Since we want to view it as a function of θ, we define it as the likelihood function


L(θ) = L(θ; X, y) = p(y|X; θ)

This can be re-written as follows


L(θ) = ∏_{i=1}^{n} p(y^(i)|x^(i); θ) = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(y^(i) − θ^T x^(i))²/(2σ²))

Now, given this probabilistic model, we want to choose our best guess of the parameters θ. The principle of maximum likelihood says that we should choose θ so as to make the data as probable as possible, i.e. we should choose θ to maximize L(θ).

To make our computations simpler, we choose to maximize the log likelihood instead:

ℓ(θ) = ln L(θ) = n ln(1/√(2πσ²)) − (1/(2σ²)) ∑_{i=1}^{n} (y^(i) − θ^T x^(i))²

Hence, maximizing ℓ(θ) gives the same answer as minimizing the least-squares cost J(θ).
Chapter-2 Linear Models for Classification
2.1 The Classification Problem
Goal: Take an input vector x and assign it to one of K discrete classes Ck where k = 1,...,K.
The classes are usually taken as disjoint, so each input is assigned to only one class.
The input space is thus divided into decision regions whose boundaries are called decision
boundaries or decision surfaces.
The decision surfaces are linear functions of the input vector x, defined by (D−1)-dimensional hyperplanes within the D-dimensional input space.
The datasets whose classes can be separated exactly by linear decision surfaces are said
to be linearly separable.

In the case of two-class problems, we use the binary representation in which there is a
single target variable t ∈ {0, 1} such that t = 1 represents class C1 and t = 0 represents
class C2.
We can interpret the value of t as the probability that the class is C1, with the values of
probability taking only the extreme values of 0 and 1.
For K > 2 classes, it is convenient to use a 1-of-K coding scheme in which t is a vector of
length K such that if the class is Cj , then all elements tk of t are zero except element tj,
which takes the value 1.
For instance, if we have K = 5 classes, then a pattern from class 2 would be given the
target vector t = (0, 1, 0, 0, 0)T.
Again, we can interpret the value of tk as the probability that the class is Ck.
For classification problems, we wish to predict discrete class labels, or more generally
posterior probabilities that lie in the range (0, 1). To achieve this, we consider a
generalization of this model in which we transform the linear function of w using a nonlinear
function f(·) so that
y(x) = f(w^T x + w_0)

This is known as an activation function or a link function.


The decision surfaces correspond to y(x) = constant, so that wTx + w0 = constant and
hence the decision surfaces are linear functions of x, even if the function f(·) is nonlinear.
For this reason, this class of models are called generalized linear models.

Approaches to the classification problem

1. Constructing a discriminant function that directly assigns each vector x to a specific class.
2. Model the conditional probabilities in an inference stage, and then use it to make decisions.
1. Model them directly by representing them as parametric models
2. Generative approach, i.e. model the class-conditional densities with prior probabilities

2.2. Discriminant Functions

It takes an input vector x and assigns it to one of the K classes, denoted Ck

For Two Classes (K=2)

The simplest representation can be given as


y(x) = w^T x + w_0

where w is the weight vector and w0 is a bias.

An input vector x is assigned to class C1 if y(x) ≥ 0 and to class C2 otherwise.


The corresponding decision boundary is defined by the relation y(x) = 0, which
corresponds to a (D−1)-dimensional hyperplane within the D-dimensional input space.
If we consider two points xA and xB lying on the decision surface, because y(xA) = y(xB) =
0, we get wT(xA-xB) = 0.
Hence, vector w is orthogonal to every vector lying within the decision surface, and so w
determines the orientation of the surface.

Similarly, if x is a point on the decision surface, then y(x) = 0, and the normal distance from
the origin to the decision surface is given by

w^T x / ∥w∥ = − w_0 / ∥w∥

y(x) gives the signed measure of the perpendicular distance r of point x from decision
surface.
Consider an arbitrary point x and let x⊥ be its orthogonal projection onto the decision
surface, so that

x = x⊥ + r · w/∥w∥

Also we know

w^T x + w_0 = y(x)

and

w^T x⊥ + w_0 = y(x⊥) = 0

Finally, we obtain

r = y(x)/∥w∥

For Multiple Classes (K>2)

1. One vs Rest - build K-1 discriminant functions. Each solves a two class classification.
2. One vs One - one discriminant function per pair of classes. K(K-1)/2 functions.

These approaches lead to the problem of ambiguous regions.

To avoid this we consider a single K-class discriminant comprising K linear functions of the form

y_k(x) = w_k^T x + w_k0

Then we assign a point x to class Ck if yk(x) > yj(x) for all j ≠ k

The decision boundary between class Ck and class Cj is therefore given by yk(x) = yj(x) and
hence corresponds to a (D − 1)-dimensional hyperplane defined by

(w_k − w_j)^T x + (w_k0 − w_j0) = 0

2.3 Least Squares Classification


Consider a K-class problem with 1-of-K binary coding scheme for target vector t.
Each class Ck has its own linear model

y_k(x) = w_k^T x + w_k0

This can be rewritten as

y(x) = W̃^T x̃

where x̃ = (1, x^T)^T is the augmented input vector and W̃ is the (D+1) × K matrix whose kth column is w̃_k = (w_k0, w_k^T)^T:

       ⎡ w_10  ⋯  w_K0 ⎤
  W̃ = ⎢  ⋮    ⋱   ⋮   ⎥
       ⎣ w_1D  ⋯  w_KD ⎦

A new input x is assigned to the class for which the output yk is largest.
Now we determine the parameter matrix by minimizing a sum of squares error function.
Consider a training data set {x_n, t_n} where n = 1,...,N and define the matrix T whose nth row is the vector t_n^T, together with the matrix X̃ whose nth row is x̃_n^T. The error function can then be written as

E_D(W̃) = (1/2) Tr{ (X̃W̃ − T)^T (X̃W̃ − T) }

Setting the derivative to zero and rearranging we get


W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃† T

Here, X̃† is the pseudo-inverse of the matrix X̃


The least squares solutions lack robustness to outliers and can give poor results.

2.4 Fisher's Linear Discriminant


Take a D-dimensional input vector and project it down to 1-dimension using y = wTx
Idea: Find the projection that maximizes the class separation.

We consider a 2-class problem with N1 points of class C1 and N2 points of class C2, so that
the mean vectors are given as

m_1 = (1/N_1) ∑_{n∈C_1} x_n        m_2 = (1/N_2) ∑_{n∈C_2} x_n

The simplest measure of separation of the classes, when projected onto w, is the separation of the projected class means. That is, choose w to maximize

m_2 − m_1 = w^T(m_2 − m_1)

where m_k = w^T m_k is the mean of the projected data from C_k


The projection y = w^T x transforms the set of labelled data points in x into a labelled set in the 1-dimensional space y.
The within class variance of transformed data from Ck is

s_k² = ∑_{n∈C_k} (y_n − m_k)²

where yn = wT xn
The total within class variance for the whole dataset can be defined to be
s_1² + s_2²

Fisher's Criterion
The ratio of between class variance to the within class variance, i.e.
J(w) = (m_2 − m_1)² / (s_1² + s_2²)

We can rewrite the criterion


J(w) = (w^T S_B w) / (w^T S_W w)

where SB is the between class covariance matrix given by

S_B = (m_2 − m_1)(m_2 − m_1)^T

and SW is the total within class covariance matrix given by

S_W = ∑_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + ∑_{n∈C_2} (x_n − m_2)(x_n − m_2)^T

Differentiating with respect to w, we see that J(w) is maximized when


(w^T S_B w) S_W w = (w^T S_W w) S_B w

SBw is always in the direction of (m2 - m1) and since we do not care about the magnitude
of w, the scalar factors can be dropped and after multiplying both sides by SW-1 we obtain

w ∝ S_W^{−1} (m_2 − m_1)

This result is known as Fisher's Linear Discriminant.


Although strictly it is not a discriminant but rather a specific choice of direction for projection
of the data down to one dimension.
However, the projected data can subsequently be used to construct a discriminant, by
choosing a threshold y0 so that we classify a new point as belonging to C1 if y(x) ≥ y0 and
classify it as belonging to C2 otherwise.
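A minimal NumPy sketch of computing the Fisher direction for two classes (the synthetic Gaussian blobs are our own illustration):

import numpy as np

def fisher_direction(X1, X2):
    # w ∝ S_W^{-1} (m2 - m1), normalized to unit length
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
X2 = rng.normal([3.0, 1.0], 1.0, size=(50, 2))
print(fisher_direction(X1, X2))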

2.5 Perceptron
Another example of a linear discriminant model is the perceptron of Rosenblatt (1962),
which occupies an important place in the history of pattern recognition algorithms.
It corresponds to a two-class model in which the input vector x is first transformed using a
fixed nonlinear transformation to give a feature vector φ(x), and this is then used to
construct a generalized linear model of the form
y(x) = f(w^T φ(x))

The non-linear activation function f(·) is given by a step function of the form

f(a) = +1 if a ≥ 0, and −1 if a < 0

For the perceptron, it is more convenient to use target values t = +1 for class C1 and t = -1
for class C2, which matches the choice of the activation function.

Perceptron Learning Algorithm

The error function is known as the perceptron criterion.


We are seeking a weight vector w such that patterns xn in class C1 will have wTφ(xn) > 0,
whereas patterns xn in class C2 have wTφ(xn) < 0.
Using the t ∈ {−1, +1} target coding scheme it follows that we would like all patterns to
satisfy wTφ(xn)tn > 0.
The perceptron criterion associates zero error with any pattern that is correctly classified,
whereas for a misclassified pattern xn it tries to minimize the quantity −wTφ(xn)tn.
The perceptron criterion is therefore given by

E_P(w) = − ∑_{n∈M} w^T φ_n t_n

where M denotes the set of all misclassified patterns.


The contribution to the error associated with a particular misclassified pattern is a linear
function of w in regions of w space where the pattern is misclassified and zero in regions
where it is correctly classified.
The total error function is therefore piecewise linear.

Optimization

The perceptron learning algorithm has a simple interpretation, as follows.


We cycle through the training patterns in turn, and for each pattern xn we evaluate the
perceptron function. If the pattern is correctly classified, then the weight vector remains
unchanged, whereas if it is incorrectly classified, then for class C1 we add the vector φ(xn)
onto the current estimate of weight vector w while for class C2 we subtract the vector φ(xn)
from w.
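A minimal sketch of this learning rule, assuming the identity basis functions phi(x) = x and a bias handled via a leading column of ones (both our own choices):

import numpy as np

def perceptron_train(X, t, epochs=100, eta=1.0):
    # Targets t are in {-1, +1}; update w only on misclassified patterns
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:        # misclassified (or on the boundary)
                w += eta * t_n * x_n         # add/subtract phi(x_n) = x_n
                errors += 1
        if errors == 0:                      # all patterns correct: converged
            break
    return w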

2.6 Probabilistic Generative Models


Model class conditional densities p(x | Ck) as well as priors p(Ck) and then compute
posterior probabilities p(Ck | x) through Bayes' Theorem.
For the case of 2 classes, the probability for class C1 can be written as

p(C_1|x) = p(x|C_1)p(C_1) / [p(x|C_1)p(C_1) + p(x|C_2)p(C_2)] = 1/(1 + exp(−a)) = σ(a)

where

a = ln [ p(x|C_1)p(C_1) / (p(x|C_2)p(C_2)) ]

σ(a) is termed as the logistic sigmoid function. The term sigmoid means 'S-shaped'.
This is sometimes also called squashing function as it maps the whole real axis into a finite
interval.
It maps the feature space into a probability function.

Properties of the sigmoid function

1. Symmetry: σ(−a)=1 − σ(a)


2. Inverse: a = ln (σ/(1-σ)) known as the logit function. Represents log of the ratio of
probabilities.

Log Odds: ln(p(C1 | x)/p(C2 | x)) for two classes


For probability p, sigmoid(logit(p)) = p
The logit function is given as

logit(p(x)) = log [ p(x) / (1 − p(x)) ]

The logistic function is the inverse of the logit (it maps values from the range (−∞, +∞) into (0, 1))

For K > 2, we define the normalized exponential

p(C_k|x) = exp(a_k) / ∑_j exp(a_j)

which can be regarded as a multiclass generalization of the logistic sigmoid.

This is also known as the SoftMax function. It turns logits into probabilities by taking the exponential of each output and then normalizing each number by the sum of those exponentials so that the entire output vector adds up to one.
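A small sketch of these three functions (the max-subtraction in softmax is a standard numerical-stability trick, not something the notes discuss):

import numpy as np

def sigmoid(a):
    # Maps the whole real axis into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def logit(p):
    # Inverse of the sigmoid: log of the ratio of probabilities
    return np.log(p / (1.0 - p))

def softmax(a):
    # Normalized exponential; the output vector sums to one
    e = np.exp(a - np.max(a))
    return e / e.sum()

print(sigmoid(logit(0.3)))                  # 0.3
print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1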

Derivative of the sigmoid

dσ/da = σ(a)(1 − σ(a))


Continuous Inputs

Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. To start with, we shall assume that all classes share the same covariance matrix. Thus the density for class C_k is given by

p(x|C_k) = (1/((2π)^{D/2} |Σ|^{1/2})) exp{ −(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k) }

For a D-dimensional vector x, the multivariate Gaussian distribution takes the above form, where μ_k is a D-dimensional mean vector, Σ is a D×D covariance matrix, and |Σ| denotes its determinant.
We consider the case of 2 classes

Maximum Likelihood Solution

Once we have specified a parametric functional form for the class-conditional densities
p(x|Ck), we can then determine the values of the parameters, together with the prior class
probabilities p(Ck), using maximum likelihood.
This requires a data set comprising observations of x along with their corresponding class
labels.
We consider the case of two classes, each having a Gaussian class-conditional density
with a shared covariance matrix, and suppose we have a data set {xn, tn} where n = 1,...,N.
Here tn = 1 denotes class C1 and tn = 0 denotes class C2. We denote the prior class
probability p(C1) = π, so that p(C2)=1 − π. For a data point xn from class C1, we have tn =
1 and hence

p(x_n, C_1) = p(C_1) p(x_n|C_1) = π N(x_n|μ_1, Σ)

Similarly for class C2, we have tn = 0 and hence

p(x_n, C_2) = p(C_2) p(x_n|C_2) = (1 − π) N(x_n|μ_2, Σ)

Thus the likelihood function is given by


p(t, X|π, μ_1, μ_2, Σ) = ∏_{n=1}^{N} [π N(x_n|μ_1, Σ)]^{t_n} [(1 − π) N(x_n|μ_2, Σ)]^{1−t_n}

The log likelihood is given by

ln p(t, X|π, μ_1, μ_2, Σ) = ∑_{n=1}^{N} [ t_n ln π + t_n ln N(x_n|μ_1, Σ) + (1 − t_n) ln(1 − π) + (1 − t_n) ln N(x_n|μ_2, Σ) ]

Maximizing with respect to π we get

π = (1/N) ∑_{n=1}^{N} t_n = N_1/(N_1 + N_2)

Maximizing with respect to μ1 we get

μ_1 = (1/N_1) ∑_{n=1}^{N} t_n x_n
Maximizing with respect to μ2 we get

μ_2 = (1/N_2) ∑_{n=1}^{N} (1 − t_n) x_n

Maximizing with respect to Σ we get

Σ = (N_1/N) S_1 + (N_2/N) S_2,    where S_k = (1/N_k) ∑_{n∈C_k} (x_n − μ_k)(x_n − μ_k)^T

The maximum likelihood solution represents a weighted average of the covariance matrices associated with each of the two classes.

Bernoulli Distribution - Discrete Features

We begin by considering a single binary random variable x ∈ {0, 1}.


For example, x might describe the outcome of flipping a coin, with x = 1 representing
‘heads’, and x = 0 representing ‘tails’.
We can imagine that this is a damaged coin so that the probability of landing heads is not
necessarily the same as that of landing tails. The probability of x = 1 will be denoted by the
parameter µ so that p(x = 1 | µ) = µ, where 0 ≤ µ ≤ 1, from which it follows p(x = 0 | µ)=1 −
µ. The probability distribution over x can therefore be written in the form
Bern(x|μ) = μ^x (1 − μ)^{1−x}

This is known as Bernoulli Distribution.


Now suppose we have a data set D = {x1,...,xN } of observed values of x. We can construct
the likelihood function, which is a function of µ, on the assumption that the observations are
drawn independently from p(x|µ), so that
p(D|μ) = ∏_{n=1}^{N} p(x_n|μ) = ∏_{n=1}^{N} μ^{x_n} (1 − μ)^{1−x_n}

We can estimate a value for µ by maximizing the likelihood function, or equivalently by


maximizing the logarithm of the likelihood. In the case of the Bernoulli distribution, the log
likelihood function is given by
ln p(D|μ) = ∑_{n=1}^{N} ln p(x_n|μ) = ∑_{n=1}^{N} { x_n ln μ + (1 − x_n) ln(1 − μ) }

Now consider the case of discrete feature values xi. For simplicity, we begin by looking at
binary feature values xi ∈ {0, 1}. Here we will make the naive Bayes assumption in which
the feature values are treated as independent, conditioned on the class Ck. Thus we have
class-conditional distributions of the form
p(x|C_k) = ∏_{i=1}^{D} μ_ki^{x_i} (1 − μ_ki)^{1−x_i}

which contain D independent parameters for each class.


Substituting into a_k = ln p(x|C_k) p(C_k) we get

a_k = ∑_{i=1}^{D} { x_i ln μ_ki + (1 − x_i) ln(1 − μ_ki) } + ln p(C_k)

which again are linear functions of the input values xi.

Example

The data tuples are described by the attributes age, income, student and credit_rating.
The class label attribute buys_computer, has two distinct values (namely, {yes, no}).
Let C1 correspond to the class buys_computer = yes and C2 correspond to buys_computer
= no.
We wish to classify X = (age = youth, income = medium, student = yes, credit_rating = fair)
The prior probability of each class, can be computed based on the training tuples:

P(C1)=P(buys_computer =yes) =9/14 =0.643


P(C2)=P(buys_computer =no) =5/14 =0.357.
To compute P(X|Ci), for i=1, 2, we compute the following conditional probabilities:

P(age =youth | buys_computer =yes) = 2/9 = 0.222


P(age =youth | buys_computer =no) =3/5 =0.600
P(income =medium | buys_computer =yes) =4/9 =0.444
P(income =medium | buys_computer =no) =2/5 =0.400
P(student =yes | buys_computer =yes) =6/9 =0.667
P(student =yes | buys_computer =no) =1/5 =0.200
P(credit_rating =fair | buys_computer =yes) =6/9 =0.667
P(credit_rating =fair | buys_computer =no) =2/5 =0.400
Using these probabilities we obtain
P(X | buys_computer = yes) = P(age = youth | buys_computer = yes) × P(income = medium | buys_computer = yes) × P(student = yes | buys_computer = yes) × P(credit_rating = fair | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
Similarly, P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
To find the class, Ci, that maximizes P(X|Ci) × P(Ci), we compute

P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028

P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007

Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X.
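The worked example can be reproduced in a few lines (using the rounded probabilities from the table above):

# Priors
p_yes, p_no = 9/14, 5/14

# Naive Bayes: multiply the per-attribute conditional probabilities
cond_yes = (2/9) * (4/9) * (6/9) * (6/9)    # P(X | yes), about 0.044
cond_no = (3/5) * (2/5) * (1/5) * (2/5)     # P(X | no), about 0.019

print(cond_yes * p_yes, cond_no * p_no)     # about 0.028 vs 0.007
print('yes' if cond_yes * p_yes > cond_no * p_no else 'no')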

2.7 Probabilistic Discriminative Models


For the two-class classification problem, we have seen that the posterior probability of
class C1 can be written as a logistic sigmoid acting on a linear function of x, for a wide
choice of class-conditional distributions p(x|Ck).
Similarly, for the multiclass case, the posterior probability of class Ck is given by a softmax
transformation of a linear function of x.
For specific choices of the class-conditional densities p(x|Ck), we have used maximum
likelihood to determine the parameters of the densities as well as the class priors p(Ck) and
then used Bayes’ theorem to find the posterior class probabilities.
However, an alternative approach is to use the functional form of the generalized linear
model explicitly and to determine its parameters directly by using maximum likelihood.
The indirect approach to finding the parameters of a generalized linear model, by fitting
class-conditional densities and class priors separately and then applying Bayes’ theorem,
represents an example of generative modelling, because we could take such a model and
generate synthetic data by drawing values of x from the marginal distribution p(x).
In the direct approach, we are maximizing a likelihood function defined through the
conditional distribution p(Ck|x), which represents a form of discriminative training.
One advantage of the discriminative approach is that there will typically be fewer adaptive
parameters to be determined, as we shall see shortly. It may also lead to improved
predictive performance, particularly when the class-conditional density assumptions give a
poor approximation to the true distributions.

Fixed Basis Functions

We have considered classification models that work directly with the original input vector x.
However, all of the algorithms are equally applicable if we first make a fixed nonlinear
transformation of the inputs using a vector of basis functions φ(x).
The resulting decision boundaries will be linear in the feature space φ, and these
correspond to nonlinear decision boundaries in the original x space.
Classes that are linearly separable in the feature space φ(x) need not be linearly separable
in the original observation space x.

Logistic Regression

Consider the two-class classification problem. The posterior probability of class C1 can be
written as a logistic sigmoid function:

p(C_1|x) = σ(w^T x) = 1 / (1 + exp(−w^T x))
where, p(C2|x) = 1 - p(C1|x) and we omit the bias term for clarity.
This model is known as logistic regression.

Maximum Likelihood for Logistic Regression

We observed a training set {xn, tn}, n = 1,...,N; t∈{0,1}


We want to maximize the probability of getting the label right, so the likelihood function is:
p(t|X, w) = ∏_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}

where y_n = p(C_1|x_n) = σ(w^T x_n)


Taking the negative log of the likelihood, we define the cross-entropy error function:

E(w) = − ln p(t|X, w) = − ∑_{n=1}^{N} [ t_n ln y_n + (1 − t_n) ln(1 − y_n) ] = ∑_{n=1}^{N} E_n

Differentiating and using the chain rule we get

dE_n/dy_n = (y_n − t_n) / (y_n(1 − y_n))

dE_n/dw = (dE_n/dy_n)(dy_n/dw) = (y_n − t_n) x_n

Therefore, we obtain

∇E(w) = ∑_{n=1}^{N} (y_n − t_n) x_n

This takes exactly the same form as the gradient of the sum-of-squares error function for
the linear regression model.
Unlike linear regression, there is no closed form solution, due to non-linearity of the logistic
sigmoid function.
The error function is convex and can be optimized using standard gradient-based or more
advanced optimization techniques.
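A minimal gradient-descent sketch for this model (averaging the gradient over n and the fixed step size are our own practical choices; X is assumed to carry a leading column of ones for the bias):

import numpy as np

def logreg_fit(X, t, eta=0.1, iters=5000):
    # Minimize the cross-entropy error E(w); gradient is sum_n (y_n - t_n) x_n
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = 1.0 / (1.0 + np.exp(-(X @ w)))   # y_n = sigmoid(w^T x_n)
        w -= eta * X.T @ (y - t) / len(t)
    return w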
Chapter-3 Decision Trees & Ensembles
3.1 Decision Tree Learning
Construction of decision tree from class-labeled training tuples
It is a flow-chart like structure, where each internal node (non-leaf) denotes a test on an
attribute, each branch represents the outcome of the test, and each leaf node (terminal)
holds a class label.
The population or sample is split into two or more homogeneous sets (sub-populations)
based on most significant splitter/differentiator in input variables.

Example

Given a dataset of two inputs (x) of height in centimeters and weight in kilograms, the
output is the sex as male or female.
The tree can be stored into a file as a graph or a set of rules.

For example, below is the above decision tree as a set of rules.

- If Height > 180 cm Then Male


- If Height <= 180 cm AND Weight > 80 kg Then Male
- If Height <= 180 cm AND Weight <= 80 kg Then Female

For example, given the input of [height = 160 cm, weight = 65 kg], we would traverse the
above tree as follows:

Height > 180 cm: No


Weight > 80 kg: No
Therefore: Female
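The same tree written directly as code, to emphasize that a stored tree is just nested rules:

def predict_sex(height_cm, weight_kg):
    # The height/weight decision tree above, as nested if-statements
    if height_cm > 180:
        return 'Male'
    if weight_kg > 80:
        return 'Male'
    return 'Female'

print(predict_sex(160, 65))   # Female, matching the traversal above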

Important Terminology

1. Root Node: It is the topmost node in the tree. It represents the entire population or sample
and this further gets divided into two or more homogeneous sets.
2. Splitting: It is a process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, then it is called the
decision node.
4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Example of a decision tree

The classification problem can be solved by asking a series of carefully crafted questions
about the attributes of the test record.
The series of questions and their possible answers can be organized in the form of a
decision tree.
There could be more than one tree that fits the same data.
Applying Model to Test Data

Starting from the root node of the tree, we apply the test condition to the record and follow
the appropriate branch based on the outcome of the test.
This will lead us either to another internal node, for which a new test is applied, or to a leaf
node.
The class label associated with the leaf node is then assigned to the record.

Decision Tree Classification Task

3.2 Decision Tree Induction


Many Algorithms:

Hunt’s Algorithm (one of the earliest)


CART
ID3, C4.5
SLIQ,SPRINT

Hunt's Algorithm
In Hunt's Algorithm, a decision tree is grown in a recursive fashion by partitioning the
training records into successively purer subsets.
Let Dt be the set of training records that reach a node t and y = {y1,...yc} be the class
labels.
If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
If Dt contains records that belong to more than one class, use an attribute test to split the
data into smaller subsets. Recursively apply the procedure to each subset.

Consider the problem of predicting whether a loan applicant will repay their loan obligations
or become delinquent, subsequently defaulting on their loan.

Design Issues of Decision Tree Induction

Greedy Strategy: Split records based on an attribute test that optimizes certain criterion.
Issues:
Determine how to split the records
How to specify the attribute test condition
How to determine the best split
Determine when to stop splitting
How to Specify Attribute Test Condition

Depends on attribute types


Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split

Splitting Based on Nominal Attributes

1. Multi-way Split: Use as many partitions as distinct values

2. Binary Split: Divides values into two subsets and need to find optimal partitioning

Splitting Based on Ordinal Attributes

1. Multi-way Split: Use as many partitions as distinct values

2. Binary Split: Divides values into two subsets and need to find optimal partitioning
Splitting Based on Continuous Attributes

Different ways of handling:


Binary decision: (A<v) or (A≥v)
Consider all possible splits and find the best cut
Can be more compute intensive
For multiway split:
Consider all possible ranges of continuous values and apply the discretization
strategies
After discretization, a new ordinal value will be assigned to each discretized
interval

Determining Best Split

Greedy Approach: Nodes with homogeneous class distribution are preferred.


Need a measure of node impurity
Measures of Node Impurity

1. Gini Index

Gini index = 1 − ∑_{i=0}^{c−1} p_i(t)²

where pi(t) is the frequency of class i at node t, and c is the total number of classes

2. Entropy

Entropy = − ∑_{i=0}^{c−1} p_i(t) log_2 p_i(t)

3. Misclassification Error

Classification Error = 1 − max_i [p_i(t)]

Compute impurity measure (P) before splitting


Compute impurity measure (M) after splitting
Compute impurity measure of each child node
M is the weighted impurity of child nodes
Choose the attribute test condition that produces the highest gain

Gain = P − M

or equivalently, lowest impurity measure after splitting (M)

3.3 Measures of Impurity


1. GINI Index

For a node t,

Gini index = 1 − ∑_{i=0}^{c−1} p_i(t)²

For a 2 class problem (p, 1-p):


GINI = 1 − p² − (1 − p)² = 2p(1 − p)
Computing Gini Index on a single node

Computing Gini Index for a collection of nodes

When a node p is split into k partitions (children)

GINI_split = ∑_{i=1}^{k} (n_i/n) GINI(i)

where,
ni = number of records at child i
n = number of records at parent node p

Binary Attributes - Computing Gini Index

Splits into two partitions (child nodes)


Effect of Weighing partitions – Larger and purer partitions are sought

Categorical Attributes - Computing Gini Index

For each distinct value, gather counts for each class in the dataset
Use the count matrix to make decisions
2. Entropy

Entropy at a given node t

Entropy = − ∑_{i=0}^{c−1} p_i(t) log_2 p_i(t)

where pi(t) is the frequency of class i at node t, and c is the total number of classes

Computing Entropy of a single node

Computing Information Gain after splitting

Gain_split = Entropy(p) − ∑_{i=1}^{k} (n_i/n) Entropy(i)

Parent Node, p is split into k partitions (children)


ni is number of records in child node i
Choose the split that achieves most reduction (maximizes GAIN)
Used in the ID3 and C4.5 decision tree algorithms

3. Classification Error

Classification Error at a node t,

Classification Error = 1 − max_i [p_i(t)]

Computing Error of a single node


Comparison among Impurity Measures

For 2-class problem

All three measures attain their maximum value when the class distribution is uniform (p =
0.5)
The minimum values for the measures are attained when all the records belong to the
same class

Decision Tree Based Classification

Advantages
Relatively inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Robust to noise (especially when methods to avoid overfitting are employed)
Can easily handle redundant attributes
Can easily handle irrelevant attributes (unless the attributes are interacting)
Disadvantages
Due to the greedy nature of splitting criterion, interacting attributes (that can
distinguish between classes together but not individually) may be passed over in favor
of other attributes that are less discriminating.
Each decision boundary involves only a single attribute

3.4 Model Overfitting and Underfitting


Errors committed by a classification model are generally divided into
Training Errors: misclassification on training set records
Generalization Errors (Testing Errors): errors made on testing set
Good model has low training error and low generalization error
Overfitting: model has a low training error rate, but high generalization error
The training and test error rates of the model are large when the size of the tree is very
small. This situation is known as model underfitting.
Underfitting occurs because the model has yet to learn the true structure of the data.
As a result, it performs poorly on both the training and test sets.
As the number of nodes increases, the tree will have fewer training and test errors.
However, once the tree becomes too large, its test error begins to increase even though its
training error rate continues to decrease. This is known as model overfitting.

The training error of a model can be reduced by increasing the model complexity.
For example, the leaf nodes of the tree can be expanded until it perfectly fits the training
data.
Although the training error for such a complex tree is zero, the test error can be large
because the tree may contain nodes that accidentally fit some of the noise points in the
training data.
Such nodes can degrade the performance of the tree because they do not generalize
well to the test examples.
The tree that contains the smaller number of nodes has a higher training error rate, but a
lower test error rate compared to the more complex tree.

Reasons for Overfitting

1. Presence of noise
2. Lack of representative samples

Overfitting due to presence of noise


Tree perfectly fits the training data.
Test error rate is 30% -> Both humans and dolphins were misclassified as nonmammals
because their attribute values for Body Temperature, Gives Birth, and Four-legged are
identical to the mislabeled records in the training set.

M1 has overfitted the training data because there is a simpler model with lower error rate
on the test set.
The Four-legged attribute test condition in model M1 is spurious because it fits the
mislabeled training records, which leads to the misclassification of records in the test set

Overfitting due to lack of representative samples

Models that make their classification decisions based on a small number of training records
are also susceptible to overfitting.

Humans, elephants, and dolphins are misclassified because the decision tree classifies all
warm-blooded vertebrates that do not hibernate as non-mammals.
The tree arrives at this classification decision because there is only one training record,
which is an eagle, with such characteristics.
Handling Overfitting in Decision Trees

Two strategies for avoiding model overfitting in the context of decision tree induction.
Pre-pruning (Early Stopping)
Post-pruning

Pre-pruning

The tree-growing algorithm is halted before generating a fully grown tree that perfectly fits
the entire training data.
To do this, a more restrictive stopping condition must be used; e.g., stop expanding a leaf
node when the observed gain in impurity measure (or improvement in the estimated
generalization error) falls below a certain threshold.
The advantage of this approach is that it avoids generating overly complex subtrees that
overfit the training data.
Nevertheless, it is difficult to choose the right threshold for early termination.

Post-pruning

In this approach, the decision tree is initially grown to its maximum size.
This is followed by a tree-pruning step, which proceeds to trim the fully grown tree in a
bottom-up fashion.
Trimming can be done by replacing a subtree with
a new leaf node whose class label is determined from the majority class of records
affiliated with the subtree, or
the most frequently used branch of the subtree.
The tree-pruning step terminates when no further improvement is observed.
Post-pruning tends to give better results than pre-pruning because it makes pruning
decisions based on a fully grown tree, unlike pre-pruning, which can suffer from premature
termination of the tree-growing process.
However, for post-pruning, the additional computations needed to grow the full tree may be
wasted when the subtree is pruned.

3.5 Bias-Variance Tradeoff


The goal of any supervised machine learning algorithm is to best estimate the mapping
function (f) for the output variable (Y) given the input data (X).
The mapping function is often called the target function because it is the function that a
given supervised machine learning algorithm aims to approximate.
The prediction error for any machine learning algorithm can be broken down into three
parts:
Bias Error
Variance Error
Irreducible Error
The irreducible error cannot be reduced regardless of what algorithm is used.
It is the error introduced from the chosen framing of the problem and may be caused
by factors like unknown variables that influence the mapping of the input variables to
the output variable.

Bias and Variance as a Function of Model Complexity

Bias Error

Bias refers to the simplifying assumptions made by a model to make the target function easier to learn.
Generally, linear algorithms have a high bias making them fast to learn and easier to
understand but generally less flexible.
In turn, they have lower predictive performance on complex problems that fail to meet the
simplifying assumptions of the algorithm's bias.
Low Bias: Suggests less assumptions about the form of the target function.
High-Bias: Suggests more assumptions about the form of the target function.
Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest
Neighbors and Support Vector Machines.
Examples of high-bias machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.

Variance Error
Variance is the amount that the estimate of the target function will change if different
training data was used.
The target function is estimated from the training data by a machine learning algorithm, so
we should expect the algorithm to have some variance.
Ideally, it should not change too much from one training dataset to the next, meaning that
the algorithm is good at picking out the hidden underlying mapping between the inputs and
the output variables.
Low Variance: Suggests small changes to the estimate of the target function with changes
to the training dataset.
High Variance: Suggests large changes to the estimate of the target function with changes
to the training dataset.
Generally, nonlinear machine learning algorithms that have a lot of flexibility have a high
variance. For example, decision trees have a high variance, that is even higher if the trees
are not pruned before use.
Examples of low-variance machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.
Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest
Neighbors and Support Vector Machines.

Graphical Definition

We can create a graphical visualization of bias and variance using a bulls-eye diagram.
Imagine that the center of the target is a model that perfectly predicts the correct values. As
we move away from the bulls-eye, our predictions get worse and worse. Imagine we can
repeat our entire model building process to get a number of separate hits on the target.
Each hit represents an individual realization of our model, given the chance variability
in the training data we gather.
Sometimes we will get a good distribution of training data so we predict very well and
we are close to the bulls-eye, while sometimes our training data might be full of
outliers or non-standard values resulting in poorer predictions.
These different realizations result in a scatter of hits on the target.
Underfitting happens when a model is unable to capture the underlying pattern of the data.
These models usually have high bias and low variance. It happens when we have too little
data to build an accurate model or when we try to build a linear model with nonlinear data.
Overfitting happens when our model captures the noise along with the underlying pattern in
data. It happens when we train our model a lot over noisy dataset. These models have low
bias and high variance.
The bias-variance tradeoff is a central problem in supervised learning. Ideally, one wants to
choose a model that both accurately captures the regularities in its training data, but also
generalizes well to unseen data.
Unfortunately, it is typically impossible to do both simultaneously.
High-variance learning methods may be able to represent their training set well, but
are at risk of overfitting to noisy or unrepresentative training data.
In contrast, algorithms with high bias typically produce simpler models that may fail to
capture important regularities (i.e. underfit) in the data.
The goal of any supervised machine learning algorithm is to achieve low bias and low
variance. In turn the algorithm should achieve good prediction performance.
Usually there is a trend as follows:
Linear machine learning algorithms often have a high bias but a low variance.
Nonlinear machine learning algorithms often have a low bias but a high variance.
The parameterization of machine learning algorithms is often a battle to balance out bias
and variance.

3.6 Ensembles
Ensemble learning refers to algorithms that combine the predictions from two or more
models.
In broad terms, using ensemble methods is about combining models such that the
ensemble has a better performance than an individual model on average.
The three main classes of ensemble learning methods are bagging, stacking, and boosting.

Different Ensemble Models

Bagging involves fitting many decision trees on different samples of the same dataset and
averaging the predictions.
Stacking involves fitting many different model types on the same data and using another
model to learn how to best combine the predictions.
Boosting involves adding ensemble members sequentially that correct the predictions
made by prior models and outputs a weighted average of the predictions.

Bagging

The name Bagging came from the abbreviation of Bootstrap AGGregatING.


This typically involves using a single machine learning algorithm, almost always an
unpruned decision tree, and training each model on a different sample of the same training
dataset.
The predictions made by the ensemble members are then combined using simple
statistics, such as voting or averaging.

Bootstrap sampling

Bagging adopts the bootstrap distribution for generating different base learners. In other
words, it applies bootstrap sampling to obtain the data subsets for training the base
learners.
A bootstrap sample is a sample of size n drawn with replacement from an original training
set D with |D| = n.
Consequently, some training examples are duplicated in each bootstrap sample, and some
other training examples do not appear in a given bootstrap sample at all. Usually, we refer
to these examples as "out-of-bag sample"
In the limit, there are approximately 63.2% unique examples in a given bootstrap sample. Consequently, about 36.8% of the examples from the original training set will not appear in a given bootstrap sample at all.
If we sample from a uniform distribution, we can compute the probability that a given example from a dataset of size n is not drawn as a bootstrap sample as

P(not chosen) = (1 − 1/n)^n → 1/e ≈ 0.368 as n → ∞
Bootstrapping in the context of Bagging


Stacking

Stacking is a general procedure where a learner is trained to combine the individual


learners. Here, the individual learners are called the first-level learners, while the combiner
is called the second-level learner, or meta-learner.
Boosting

In boosting, the training dataset for each subsequent classifier increasingly focuses on
instances misclassified by previously generated classifiers.
The key property of boosting ensembles is the idea of correcting prediction errors.
The models are fit and added to the ensemble sequentially such that the second model
attempts to correct the predictions of the first model, the third corrects the second model,
and so on.
This typically involves the use of very simple decision trees that only make a single or a
few decisions, referred to in boosting as weak learners.
The predictions of the weak learners are combined using simple voting or averaging,
although the contributions are weighed proportional to their performance or capability. The
objective is to develop a so-called "strong-learner" from many purpose-built "weak-
learners."
AdaBoost, or Adaptive Boost

The way boosting generally works is you build multiple models on subsets of the dataset in
a sequential manner, where each model learns from the previous models' mistakes. After
you build all the weak models, you then combine their predictions to get a "strong model".
AdaBoost, short for Adaptive Boosting is one of the early successful algorithms within the
Boosting branch of machine learning, and is used specifically for binary classification.

Adaboost Algorithm
AdaBoost for classification is a supervised machine learning problem.
It consists of iteratively training multiple stumps using feature data (x) and target labels (y).
After training, the model's overall predictions come from a weighted sum of the stumps'
predictions.

Step 1 - Initialization

Step 1 of AdaBoost is initializing a weight of 1/N for each datapoint. Again, the weight of a
datapoint represents the probability of selecting it during sampling.

Step 2 - Training Weak Classifiers


After sampling the dataset to get the training set for the current iteration, a weak classifier
K_m is then trained using this training set. In our case, K_m is just a decision tree stump.
A decision tree stump is a decision tree with just one decision, leading to two or more leaves.

Step 3 - Calculating Update Parameters from Error

Stumps in AdaBoost progressively learn from previous stumps' mistakes through adjusting
the weights of the dataset to accommodate the mistakes made by stumps at each iteration.
The weights of misclassified data will be increased, and weights of correctly classified data
will be decreased. As a result, as we go into further iterations, the training data will mostly
comprise data that is often misclassified by our stumps.
Intuitively, if one stump is unable to classify certain data properly, then you just need to
train more stumps on the data that is "hard to classify". Hence, taking a weighted sum of
these stumps lets AdaBoost perform very well.

Example

The weight updates at iteration m are calculated as the product of the current weights
with the exponential of −α_m y K_m(x):

w_i ← w_i · exp(−α_m y_i K_m(x_i))

where
y represents the true values of the target feature (in our toy dataset case, this would
be the 'species' column),
K_m(x) is the prediction made by the stump we trained at iteration m, and
α_m is the confidence we place on the predictive power of stump m.
y and K_m(x) can each take values −1 or 1. If y is 1 and K_m(x) is 1, or both are −1, then
training sample x has been correctly classified, and y · K_m(x) equals 1.
Otherwise, y · K_m(x) equals −1.
This corresponds with the "learning from mistakes" concept:
When data is misclassified, y · K_m(x) = −1, so the exponent is positive and the
weights increase exponentially for that data.
When data is correctly classified, y · K_m(x) = 1, and the opposite happens.
Epsilon (ε_m) is the ratio of the sum of weights for misclassified samples to the total sum of
weights for all samples; the standard AdaBoost confidence is

α_m = ½ ln((1 − ε_m) / ε_m)

Large epsilon values, which mean a large misclassification percentage, make α_m decrease.
The confidence we have in stump m's predictions decreases accordingly. Intuitively this
makes sense, because more errors = less confidence.
Hence stump m's predictions will only have a very small "say" in the overall prediction.
Naturally, the opposite follows for small epsilon values.
The overall prediction of AdaBoost at iteration m, denoted by C_m(x), is

C_m(x) = ∑_{i=1}^{m} α_i K_i(x)

AdaBoost trains a sequence of models with augmented sample weights, generating
'confidence' coefficients alpha for the individual classifiers based on their errors.
Low errors lead to large alpha, which means higher importance in the voting.
When making overall predictions for AdaBoost, we use alpha_m as the weight of each
stump's prediction:

prediction = α_1 · stump_1 + α_2 · stump_2 + …

Step 4 - Making New Overall Predictions

The final step of the AdaBoost algorithm involves making overall predictions on new data.

New predictions are computed by

K(x) = sign[ ∑_{m=1}^{M} α_m K_m(x) ]

where M is the total number of stumps.

The overall predictions are just the sums of each stump's predictions, weighted by each
stump's alpha (confidence that AdaBoost places on each stump's ability to classify the
data).
Since AdaBoost only predicts labels in {−1, 1}, a weighted sum above 0 means AdaBoost
predicts 1.
The opposite happens when the weighted sum is below 0: AdaBoost predicts −1.
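The four steps above can be condensed into a short sketch. Note this sketch uses the common reweighting variant (passing sample weights to each stump) rather than literal resampling, and the function names and the 1e-10 guard are our own choices:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_stumps=50):
    """Minimal AdaBoost sketch for labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # Step 1: initialize weights to 1/N
    stumps, alphas = [], []
    for _ in range(n_stumps):
        stump = DecisionTreeClassifier(max_depth=1)    # Step 2: weak learner
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum() / w.sum()             # weighted error
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))  # Step 3: confidence
        w = w * np.exp(-alpha * y * pred)   # raise weights of misclassified data
        w = w / w.sum()                     # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # Step 4: sign of the alpha-weighted sum of stump predictions
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))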

Random Forests

Random forests are among the most widely used machine learning algorithms, probably due
to their relatively good performance "out of the box" and ease of use.
The random forest algorithm can be understood as bagging with decision trees, but instead
of growing the decision trees by basing the splitting criterion on the complete feature set,
we use random feature subsets.
While the size of the feature subset to consider at each node is a hyperparameter that we
can tune, a common heuristic is NumFeatures = log2(m) + 1, where m is the total number
of features.
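A minimal scikit-learn sketch (the dataset and values are illustrative; max_features='log2' uses log2(m) rather than log2(m) + 1, so it only roughly matches the heuristic above):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features controls the size of the random feature subset at each split
forest = RandomForestClassifier(n_estimators=100, max_features='log2',
                                random_state=0)
forest.fit(X, y)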
Chapter-4 Support Vector Machines
4.1 Support Vector Machine
Basic Idea

Optimal hyperplane for linearly separable patterns


Extend to patterns that are not linearly separable by transformations of original data to map
into new space - the Kernel function

Linear Discriminant Function or a Linear Classifier

Given the data and two classes, learn a function of the form:

g(x) = wᵀx + b

This is a hyperplane in the feature space.
Decide class = 1 if g(x) > 0 and class = −1 otherwise.
Infinite number of answers! Which one is the best?

Large Margin Linear Classifier

The linear discriminant function (classifier) with the maximum margin is the best.
Margin is defined as the width that the boundary could be increased by before hitting a
data point.
Which points should influence optimality?
All points? That is the approach of linear regression.
Or only "difficult points" close to the decision boundary? That is the approach of
support vector machines.

Support Vectors
Support vectors are the data points that lie closest to the decision surface (or hyperplane)
They are the data points most difficult to classify
They have direct bearing on the optimum location of the decision surface

General input/output for SVMs

Input: set of (input, output) training pair samples.
Call the input sample features x1, x2, ..., xn and the output result y.
Typically, there can be lots of input features xi
Output: set of weights w (or wi ), one for each feature, whose linear combination predicts
the value of y.
Important difference: We use the optimization of maximizing the margin ('street width') to
reduce the number of weights that are nonzero to just a few that correspond to the
important features that 'matter' in deciding the separating line (hyperplane).
These nonzero weights correspond to the support vectors (because they 'support' the
separating hyperplane)
Support vectors are the elements of the training set that would change the position of the
dividing hyperplane if removed.
Support vectors are the critical elements of the training set
The problem of finding the optimal hyper plane is an optimization problem and can be
solved by optimization techniques (we use Lagrange multipliers to get this problem into a
form that can be solved analytically)

Aim: Learn a large margin classifier

Mathematical Formulation:

maximize  2 / ||w||

such that
for y_i = +1:  wᵀx_i + b ≥ 1
for y_i = −1:  wᵀx_i + b ≤ −1

Equivalently (maximizing 2/||w|| is the same as minimizing ½||w||²):

minimize  ½ ||w||²

such that
for y_i = +1:  wᵀx_i + b ≥ 1
for y_i = −1:  wᵀx_i + b ≤ −1

Given a set of data points, these constraints say that positive examples lie on or beyond
the plane wᵀx + b = 1 and negative examples on or beyond the plane wᵀx + b = −1; the
margin is the gap between these two parallel hyperplanes.

Algebraic Expression for Width of a Margin

Given 2 parallel lines with equations

ax + by + c_1 = 0  and  ax + by + c_2 = 0

the distance between them is given by:

d = |c_2 − c_1| / √(a² + b²)

Our lines in 2-D are:

w_1 x_1 + w_2 x_2 + b − 1 = 0  and  w_1 x_1 + w_2 x_2 + b + 1 = 0

Distance = |(b + 1) − (b − 1)| / √(w_1² + w_2²) = 2 / ||w||

Formulation:

minimize  ½ ||w||²

such that
y_i (wᵀx_i + b) ≥ 1

This is a quadratic programming problem with linear constraints; the surface is a
paraboloid with just a single global minimum.
However, we will convert it to the Lagrangian dual in order to use the kernel trick!

Intuition: find the intersection of two functions f, g at a tangent point (intersection = both
constraints satisfied; tangent = derivative is 0); this will be a min (or max) for f such that
the constraint is satisfied.
Rewriting the conditions as: y_i (wᵀx_i + b) − 1 ≥ 0
Solving the Optimization Problem

This indicates the primal form of the optimization problem.
We will actually solve the optimization problem by solving the dual of this original
problem.
The Lagrangian dual problem: instead of minimizing over w, b subject to constraints
involving the α's, we can maximize over α (the dual variables) subject to the relations
obtained previously for w and b.
This is a constrained optimization problem.
Optimization — because, we are to find the line from which the support vectors are
maximally separated and
Constrained — because, the support vectors should be away from the road and not on
the road. We will use Lagrange Multipliers to solve this problem

4.2 SVM Lagrange Problem


Lagrange stated that if we want to find the minimum of f under the equality constraint g, we
need to solve for:

∇f(x) − α ∇g(x) = 0

α is called the Lagrange multiplier

In terms of the SVM optimization problem,

f(w) = ½ ||w||²

g(w, b) = y_i (wᵀx_i + b) − 1,  i = 1, ..., m

The Lagrangian function is then

L(w, b, α) = ½ ||w||² − ∑_{i=1}^{m} α_i [y_i (wᵀx_i + b) − 1]
The Dual problem

For any convex optimization problem, there always exist settings of the dual variables such
that the unconstrained minimum of the Lagrangian with respect to the primal variables
(keeping the dual variables fixed) coincides with the solution of the original constrained
minimization problem.
Kuhn-Tucker theorem: The solution we find here will be the same as the solution to the
original problem
The advantage of the dual problem over the Lagrangian primal problem is that the objective
function now depends only on the Lagrange multipliers, which is easier to solve
analytically.
It will also let us solve the problem by computing just the inner products x_iᵀx_j.
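For reference, substituting the stationarity conditions for w and b back into the Lagrangian yields the standard (hard-margin) dual:

maximize over α:  ∑_{i=1}^{m} α_i − ½ ∑_{i=1}^{m} ∑_{j=1}^{m} α_i α_j y_i y_j x_iᵀx_j

subject to  α_i ≥ 0  and  ∑_{i=1}^{m} α_i y_i = 0

Note that the training data enter only through the inner products x_iᵀx_j, which is what makes the kernel trick possible later.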

Solving the Optimization Problem


The linear discriminant function is:

g(x) = wᵀx + b = ∑_{i∈SV} α_i y_i x_iᵀx + b

Notice it relies on a dot product between the test point x and the support vectors x_i.
Also keep in mind that solving the optimization problem involved computing the dot
products x_iᵀx_j between all pairs of training points.

Soft Margin Formulation

This idea is based on a simple premise: allow SVM to make a certain number of mistakes and
keep margin as wide as possible, so that other points can still be classified correctly. This can
be done simply by modifying the objective of SVM.
Here the red decision boundary perfectly separates all the training points.
The green decision boundary has a wider margin that would allow it to generalize well on
unseen data.
In that sense, soft margin formulation would also help in avoiding the overfitting problem.
So in such cases, where the data is not linearly separable (noisy data, outliers, etc.),
slack variables ξ_i can be added to allow misclassification of difficult or noisy data points.
In the problem posed here, we soften the constraint:

y_i (x_iᵀw + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

The new constraint permits a functional margin that is less than 1


We thus state a preference for margins that classify the training data correctly, but soften
the constraints to allow for non-separable data with a penalty proportional to the amount by
which the example is misclassified.

The objective contains a penalty of cost Cξi, for any data point that
falls within the margin on the correct side of the separating hyperplane (i.e., when 0 <
ξi ≤1)
or on the wrong side of the separating hyperplane (i.e., when ξi > 1)
The value of ξi is the distance of xi from the corresponding class's margin if xi is on the
wrong side of the margin, otherwise zero.
Thus the points that are far away from the margin on the wrong side would get more
penalty.
C is a hyperparameter that decides the trade-off between maximizing the margin and
minimizing the mistakes. When C is small, classification mistakes are given less
importance and focus is more on maximizing the margin and vice versa.
We would therefore like to minimize the sum total of the penalties ξi over all i (this is an
upper bound on the training classification errors). Thus, the new well-posed optimization
problem (using L1 regularization) becomes:

min  ½ wᵀw + C ∑_{i=1}^{m} ξ_i

s.t.  y_i (x_iᵀw + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
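A minimal scikit-learn sketch of the effect of C (the dataset and the two C values are illustrative):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Noisy, roughly linearly separable data
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, random_state=0)

# Small C: mistakes are cheap, so the margin stays wide (more support vectors)
# Large C: mistakes are expensive, so the margin narrows
soft = SVC(kernel='linear', C=0.01).fit(X, y)
hard = SVC(kernel='linear', C=100.0).fit(X, y)
print(len(soft.support_), len(hard.support_))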

4.3 Non Linear SVM


Datasets that are linearly separable with noise work out great
But what are we going to do if the dataset is just too hard?
Kernel Trick!!!
SVM = Linear SVM + Kernel Trick

Kernel Trick Motivation

Linear classifiers are well understood, widely-used and efficient.


How to use linear classifiers to build non-linear ones?
Neural networks: Construct non-linear classifiers by using a network of linear classifiers
(perceptrons).
Kernels:
Map the problem from the input space to a new higher-dimensional space (called the
feature space) by doing a non-linear transformation using a special function called the
kernel.
Then use a linear model in this new high-dimensional feature space.
The linear model in the feature space corresponds to a non-linear model in the input
space.

Feature Space

General idea: the original input space can be mapped to some higher-dimensional feature
space where the training set is separable:
Proper feature mapping can make non-linear to linear!

The Kernel Trick

With this mapping, our discriminant function is now:

g(x) = wᵀϕ(x) + b = ∑_{i∈SV} α_i y_i ϕ(x_i)ᵀϕ(x) + b

No need to know this mapping explicitly, because we only use the dot product of feature
vectors in both the training and test.
A kernel function is defined as a function that corresponds to a dot product of two feature
vectors in some expanded feature space:

K(x_i, x_j) ≡ ϕ(x_i)ᵀϕ(x_j)
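A standard worked example (not from these notes): for x, z ∈ ℝ², the degree-2 polynomial kernel computes a dot product in a 3-dimensional feature space without ever forming ϕ explicitly:

K(x, z) = (xᵀz)² = (x_1 z_1 + x_2 z_2)²
        = x_1² z_1² + 2 x_1 x_2 z_1 z_2 + x_2² z_2²
        = ϕ(x)ᵀϕ(z),  where ϕ(x) = (x_1², √2·x_1 x_2, x_2²)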

Solving the Optimization Problem

One particularly interesting consequence of strong duality for convex optimization problems
is a property known as complementary slackness
According to the complementary slackness condition:

α_i (y_i (wᵀx_i + b) − 1) = 0

Thus, only support vectors have α_i ≠ 0.


The solution has the form:

w = ∑_{i=1}^{n} α_i y_i x_i = ∑_{i∈SV} α_i y_i x_i

Get b from y_i (wᵀx_i + b) − 1 = 0, where x_i is any support vector.


Now, knowing the α_i's, we can find the weights w for the maximal margin separating
hyperplane:

w = ∑_{i∈SV} α_i y_i x_i

and now, after training and finding w by this method, given an unknown point u measured
on features x_i, we can classify it by looking at the sign of:

f(u) = wᵀu + b = (∑_{i∈SV} α_i y_i x_iᵀu) + b
Examples of Commonly used Kernel Functions

Linear: K(x_i, x_j) = x_iᵀx_j
Polynomial of degree d: K(x_i, x_j) = (x_iᵀx_j + 1)^d
Gaussian (RBF): K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²)
Sigmoid: K(x_i, x_j) = tanh(κ x_iᵀx_j + θ)

Algorithm

Choose a kernel function.
Choose a value for C.
Solve the quadratic programming problem.
Construct the discriminant function from the support vectors.
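These steps map directly onto a scikit-learn call (a minimal sketch; the dataset and hyperparameters are illustrative):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the input space
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

clf = SVC(kernel='rbf', C=1.0)   # steps 1 and 2: pick the kernel and C
clf.fit(X, y)                    # step 3: the QP is solved internally
print(clf.score(X, y))           # step 4: discriminant built from the SVs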
Chapter-5 Artificial Neural Networks
5.1 Perceptron
Neural networks, a biologically-inspired programming paradigm which enables a computer
to learn from observational data
Neural networks arise from attempts to model human/animal brains
We will focus on multi-layer perceptrons
The human brain is made up of about 100 billion neurons
Neurons receive electric signals at the dendrites and send them to the axon

The axon of the neuron is connected to the dendrites of many other neurons

The human brain consists primarily of nerve cells called neurons, linked together with other
neurons via strands of fiber called axons.
Axons are used to transmit nerve impulses from one neuron to another whenever the
neurons are stimulated. A neuron is connected to the axons of other neurons via dendrites,
which are extensions from the cell body of the neuron.
The contact point between a dendrite and an axon is called a synapse.
Neurologists have discovered that the human brain learns by changing the strength of the
synaptic connection between neurons upon repeated stimulation by the same impulse.

Artificial neuron - biological motivation

Brain vs. Artificial Neural Networks

Similarities
Neurons, connections between neurons
Learning = change of connections, not change of neurons
Massive parallel processing
But artificial neural networks are much simpler
computation within neuron vastly simplified
discrete time steps
typically some form of supervised learning with massive number of stimuli

Basic Architecture of Perceptron


Perceptron Example

Learning Perceptron Model

In a perceptron, each input node is connected via a weighted link to the output node.
The weighted link is used to emulate the strength of synaptic connection between neurons.
As in biological neural systems, training a perceptron model amounts to adapting the
weights of the links until they fit the input-output relationships of the underlying data.
During the training phase of a perceptron model, the weight parameters w are adjusted
until the outputs of the perceptron become consistent with the true outputs of training
examples.

Perceptron Learning Rule


Since ŷ is a linear combination of the input variables, the decision boundary is linear.
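The update rule itself is not reproduced above; in the standard formulation (e.g., as given in Tan et al.), each weight is nudged in proportion to the prediction error:

w_j ← w_j + λ (y_i − ŷ_i) x_{ij}

where λ is the learning rate, y_i the true label, ŷ_i the perceptron's prediction, and x_{ij} the j-th feature of example i. When the prediction is correct the update is zero; otherwise the weights move so as to reduce the error on that example.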

Nonlinearly Separable Data

For a XOR classification problem, there is no linear hyperplane that can separate the two
classes.

Multi-layer Neural Network


An ANN has a more complex structure than that of a perceptron model
More than one hidden layer of computing nodes
Every node in a hidden layer operates on activations from preceding layer and transmits
activations forward to nodes of next layer
Also referred to as "feedforward neural networks"

Multi-layer neural networks are able to model more complex relationships between input
and output variables
We can think of each hidden node as a perceptron that tries to construct one of the two
hyperplanes, while the output node simply combines the results of the perceptrons to yield
the decision boundary
Our goal is to extend this model by making the basis functions φj(x) depend on parameters
and then to allow these parameters to be adjusted, along with the coefficients wj during
training.
Neural networks use basis functions such that each basis function is itself a nonlinear
function of a linear combination of the inputs, where the coefficients in the linear
combination are adaptive parameters.

5.2 Feed-Forward Networks


We have looked at generalized linear models of the form:

y(x, w) = f( ∑_{j=1}^{M} w_j ϕ_j(x) )

for fixed nonlinear basis functions ϕ(·), where f(·) is a nonlinear activation function in the
case of classification.
We now extend this model by allowing adaptive basis functions, and learning their
parameters.
In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be
another nonlinear function of a linear combination of the inputs:

ϕ_j(x) = h( ∑_i w_{ji} x_i + w_{j0} )

Network diagram for a Two Layer Neural Network

Connect together a number of units into a feed-forward network (DAG)


Above shows a network with one layer of hidden units
Starting with input x = (x_1, ..., x_D), construct M linear combinations:

a_j = ∑_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}

where j = 1, ..., M and the superscript (1) indicates that the corresponding parameters are
in the first 'layer' of the network. These a_j's are known as activations.
Each of them is then transformed using a differentiable, nonlinear activation function
h(·) to give

z_j = h(a_j)
The nonlinear functions h(.) are generally chosen to be sigmoidal functions such as the
logistic sigmoid or the 'tanh' function.
These values are again linearly combined to give output unit activations

a_k = ∑_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}

where k = 1, ..., K and K is the total number of outputs.


This transformation corresponds to the second layer of the network, and again the w_{k0}^{(2)}
are bias parameters.
Finally, the output unit activations are transformed using an appropriate activation function
to give a set of network outputs y_k.
We can combine these various stages to give the overall network function that, for
sigmoidal output unit activation functions, takes the form

y_k(x, w) = σ( ∑_{j=1}^{M} w_{kj}^{(2)} h( ∑_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} ) + w_{k0}^{(2)} )

Thus the neural network model is simply a nonlinear function from a set of input variables xi
to a set of output variables yk controlled by a vector w of adjustable parameters.
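A minimal NumPy sketch of this overall network function (the dimensions and random initialization are arbitrary choices, not from the notes):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """y_k(x, w) = sigma( W2 @ h(W1 @ x + b1) + b2 ) with h = tanh."""
    a_hidden = W1 @ x + b1     # first-layer activations a_j
    z = np.tanh(a_hidden)      # hidden unit outputs z_j = h(a_j)
    a_out = W2 @ z + b2        # second-layer activations a_k
    return sigmoid(a_out)      # network outputs y_k

D, M, K = 3, 4, 2              # inputs, hidden units, outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
print(forward(rng.normal(size=D), W1, b1, W2, b2))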

Different Activation Functions

Activation function of a node in an artificial neural network is a function that calculates the
output of the node (based on its inputs and the weights on individual inputs).

Parameter Optimization

We turn next to the task of finding a weight vector w which minimizes the chosen function
E(w).
Because the error E(w) is a smooth continuous function of w, its smallest value will occur at
a point in weight space such that the gradient of the error function vanishes, so that ∇E(w)
= 0; otherwise we could make a small step in the direction of −∇E(w) and thereby further
reduce the error.
Points at which the gradient vanishes are called stationary points, and may be further
classified into minima, maxima, and saddle points.
A minimum that corresponds to the smallest value of the error function for any weight
vector is said to be a global minimum.
Any other minima corresponding to higher values of the error function are said to be local
minima.
For a successful application of neural networks, it may not be necessary to find the global
minimum (and in general it will not be known whether the global minimum has been found),
but it may be necessary to compare several local minima in order to find a sufficiently good
solution.

5.3 Error Backpropagation


The goal is to find an efficient technique for evaluating the gradient of an error function
E(w) for a feed-forward neural network. This can be achieved using a local message
passing scheme in which information is sent alternately forwards and backwards through
the network and is known as error backpropagation, or sometimes simply as backprop.

Network Training

Given a specified network structure, how do we set its parameters (weights)?


As usual, we define a criterion to measure how well our network performs, optimize against
it.
For regression, the training data comprise a set of input vectors {x_n}, where n = 1, ..., N,
together with a corresponding set of target vectors {t_n}:

E(w) = ∑_{n=1}^{N} {y(x_n, w) − t_n}²

For binary classification, we consider a discriminative model, where the conditional
distribution of targets given inputs is a Bernoulli distribution of the form:

p(t|w) = ∏_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}

E(w) = − ∑_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}

We now derive the backpropagation algorithm for a general network having arbitrary feed-
forward topology, arbitrary differentiable nonlinear activation functions, and a broad class of
error function.
The sum is transformed by a nonlinear activation function h( . ) to give the activation zj of
unit j in the form zj = h(aj)
For each pattern in the training set, we shall suppose that we have supplied the
corresponding input vector to the network and calculated the activations of all of the hidden
and output units in the network by successive application of the two equations.
This process is often called forward propagation because it can be regarded as a forward
flow of information through the network.
We shall now see how this simple result extends to the more complex setting of multilayer
feed-forward networks.
In a general feed-forward network, each unit computes a weighted sum of its inputs of the
form

a_j = ∑_i w_{ji} z_i

where z_i is the activation of a unit, or input, that sends a connection to unit j, and w_{ji} is
the weight associated with that connection.
We now introduce a useful notation, where the δ's are often referred to as errors:

δ_j ≡ ∂E_n / ∂a_j

We can write

∂a_j / ∂w_{ji} = z_i   because  a_j = ∑_i w_{ji} z_i

and then we can obtain

∂E_n / ∂w_{ji} = δ_j z_i

The required derivative is obtained simply by multiplying the value of ẟ for the unit at the
output end of the weight by the value of z for the unit at the input end of the weight.
This takes the same form as for the simple linear model.
Now consider the evaluation of the derivative of En with respect to a weight wji
The outputs of the various units will depend on the particular input pattern n.
First we note that En depends on the weight wji only via the summed input aj to unit j.
We can therefore apply the chain rule for partial derivatives to give

∂E_n / ∂w_{ji} = (∂E_n / ∂a_j) (∂a_j / ∂w_{ji})

As we have seen already, for the output units, we have

δk = yk − tk

To evaluate the ẟ's for hidden units, we again make use of the chain rule for partial
derivatives, where the sum runs over all units k to which unit j sends connections.

δ_j = ∂E_n / ∂a_j = ∑_k (∂E_n / ∂a_k) (∂a_k / ∂a_j)

Illustration of the calculation of ẟj for hidden unit j by backpropagation of the ẟ's from those
units k to which unit j sends connections.
The blue arrow denotes the direction of information flow during forward propagation, and
the red arrows indicate the backward propagation of error information.

To evaluate the ẟ's for hidden units, we again make use of the chain rule for partial
derivatives, where the sum runs over all units k to which unit j sends connections.
We obtain the following backpropagation formula which tells us that the value of ẟ for a
particular hidden unit can be obtained by propagating the ẟ's backwards from units higher
up in the network

δ_k ≡ ∂E_n / ∂a_k

δ_j = ∂E_n / ∂a_j = ∑_k (∂E_n / ∂a_k) (∂a_k / ∂a_j)

a_k = ∑_j w_{kj} z_j = ∑_j w_{kj} h(a_j)

∂a_k / ∂a_j = w_{kj} h′(a_j)

δ_j = h′(a_j) ∑_k w_{kj} δ_k

Apply an input vector x_n to the network and forward propagate through the network using

a_j = ∑_i w_{ji} z_i   and   z_j = h(a_j)

to find the activations of all the hidden and output units.
Evaluate the δ_k for all the output units using

δ_k = y_k − t_k

Backpropagate the δ's using

δ_j = h′(a_j) ∑_k w_{kj} δ_k

to obtain δ_j for each hidden unit in the network.
Use

∂E_n / ∂w_{ji} = δ_j z_i

to evaluate the required derivatives.

Example

Let's consider a two-layer network, together with a sum-of-squares error, in which the
output units have linear activation functions, so that y_k = a_k, while the hidden units have
sigmoidal activation functions given by

h(a) ≡ tanh(a)

tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})

A useful feature of this function is that its derivative can be expressed in a particularly
simple form:

h′(a) = 1 − h(a)²

We also consider a standard sum-of-squares error function, so that for pattern n the error is
given by

E_n = ½ ∑_{k=1}^{K} (y_k − t_k)²

where yk is the activation of output unit k, and tk is the corresponding target, for a particular
input pattern xn
For each pattern in the training set in turn, we first perform a forward propagation using

a_j = ∑_{i=0}^{D} w_{ji}^{(1)} x_i

z_j = tanh(a_j)

y_k = ∑_{j=0}^{M} w_{kj}^{(2)} z_j

Next we compute the δ's for each output unit using

δ_k = y_k − t_k

Then we backpropagate these to obtain δ's for the hidden units using

δ_j = (1 − z_j²) ∑_{k=1}^{K} w_{kj} δ_k

Finally, the derivatives with respect to the first-layer and second-layer weights are given by

∂E_n / ∂w_{ji}^{(1)} = δ_j x_i

∂E_n / ∂w_{kj}^{(2)} = δ_k z_j
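Putting the formulas together, a minimal NumPy sketch of one backward pass for this example network (the bias handling via fixed inputs x_0 = z_0 = 1 is our convention, matching the i = 0 and j = 0 terms in the sums above):

import numpy as np

def backprop_single(x, t, W1, W2):
    """Gradients of E_n = 0.5 * sum((y - t)**2) for a tanh hidden layer
    and linear output units. W1 has shape (M, D+1), W2 has shape (K, M+1)."""
    x = np.append(1.0, x)              # x_0 = 1 absorbs the biases w_j0
    a = W1 @ x                         # forward: a_j = sum_i w_ji x_i
    z = np.append(1.0, np.tanh(a))     # z_0 = 1 absorbs the biases w_k0
    y = W2 @ z                         # linear output units: y_k = a_k

    delta_k = y - t                    # output errors: delta_k = y_k - t_k
    # delta_j = (1 - z_j^2) * sum_k w_kj * delta_k  (skip the bias column)
    delta_j = (1.0 - z[1:] ** 2) * (W2[:, 1:].T @ delta_k)

    dE_dW2 = np.outer(delta_k, z)      # dE_n/dw_kj = delta_k * z_j
    dE_dW1 = np.outer(delta_j, x)      # dE_n/dw_ji = delta_j * x_i
    return dE_dW1, dE_dW2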
