BITS F464 ML Lecture Notes
REFERENCES
1. Pattern Recognition and Machine Learning, Christopher M Bishop
2. Machine Learning, Tom M Mitchell
3. The Elements of Statistical Learning, Trevor Hastie et al.
4. Python Machine Learning, Sebastian Raschka
OBJECTIVES
1. Learn theoretical and practical aspects of linear models for regression & classification
2. Understand probabilistic, discriminative & generative models for classification
3. Learn theoretical and practical aspects of SVM and ANN
4. Understand decision tree learning and ensemble methods
Essential Readings
CS229 Lecture Notes, Andrew Ng and Tengyu Ma
Linear regression - Chapters 1.1, 1.2, 1.3
Support Vector Machines - Chapter 6
Pattern Recognition and Machine Learning, Christopher M Bishop
Introduction - Chapters 1.1, 1.2
Linear Models for Regression - Chapter 3.1
Linear Models for Classification - Chapters 4.1, 4.2, 4.3
Neural Networks - Chapters 5.1, 5.2, 5.3
Introduction to Data Mining, Pang Ning Tan, Michael Steinbach, Vipin Kumar
Classification: Basic Concepts - Chapters 4.3, 4.4
Classification: Alternative Techniques - Chapters 5.4, 5.5
STAT 479 Lecture Notes, Sebastian Raschka
Ensemble Methods - Chapter 7
Chapter-1 Models for Regression
1.1 Definitions, Applications and Types of ML
For many problems, it's difficult to program the correct behavior by hand
- recognizing people & objects
- understanding human speech
We ask the question "How can we build computer systems that automatically improve with
experience and what are the fundamental laws that govern all learning processes?"
The term Machine Learning was first coined by Arthur Lee Samuel in 1959.
He defined ML as follows:
- ML is the field of study that gives computers the ability to learn without being explicitly
programmed.
A FORMAL DEFINITION
A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.
APPLICATIONS
Personal Assistants
Advertising & Business Intelligence
News Feeds/ Filtering Algorithms
Recommendation Engines
....and so on
CATEGORIES OF ML
Classification: Predict likelihood that something belongs to some class. E.g. Spam E-Mail
Detection
Regression: Predicts a numerical value. E.g. Predict amount of rain on a certain day.
Clustering: Forms clusters based on similarity and finds patterns in data. E.g. Fake News
Detection
h : X → Y
where X = ℝ^m and, for classification, Y = {1, …, K} with K class labels; for the binary case, Y = {0, 1} or {−1, 1}.
1. Eager vs Lazy - process the data immediately vs defer processing until prediction time
2. Batch vs Online - learn on the entire data at once vs learn from one example at a time.
3. Discriminative vs Generative - model the conditional p(y|x) (or a direct input-to-label mapping) vs model the joint distribution p(x, y) via class-conditional densities and priors
Modular Approach
Data is the driving force
Model is the mathematical relationship from input features to output labels
Loss function measures the difference between predicted and actual values
Optimization updates the model parameters so as to reach the best solution
Evaluation determines the performance of the model
Question: Given data like this, how can we learn to predict the prices of other houses, as a
function of the size of their living areas?
Notation
x^{(i)} → input features
y^{(i)} → output (target) value
(x^{(i)}, y^{(i)}) → training example
{(x^{(i)}, y^{(i)}) : i = 1, …, n} → training set
h : X → Y → hypothesis
Example: We consider two input features, living area and number of bedrooms and wish to
predict price of the house.
h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2
Here the θ's are parameters or weights that parameterize the space of linear functions mapping
from X to Y.
h(x) = Σ_{i=0}^{d} θ_i x_i = θ^T x
where x_0 = 1 by convention, so that θ_0 acts as the intercept.
A Simpler Model
Essentials of Linear Algebra - Cheat Sheet by Ivan Savov
A single feature x_1:
y = θ_0 + θ_1 x_1
Cost function:
J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²
A cost function is used to gauge the performance of the ML model; compares the predicted
values with actual values.
θ_j := θ_j − α ∂J(θ)/∂θ_j
Hence, for a single training example, the update rule (the LMS rule) takes the form:
θ_j := θ_j + α (y^{(i)} − h_θ(x^{(i)})) x_j^{(i)}
For more than one training example, we replace the rule with the following algorithm:
This is known as Batch Gradient Descent. It looks at every example in the entire training set
on every step.
This is known as Stochastic Gradient Descent. We repeatedly run through the training set,
and each time we encounter a training example, we update the parameters according to the
gradient of the error with respect to that single training example only.
Since we know
h_θ(x^{(i)}) = (x^{(i)})^T θ
the least-squares solution can also be written in closed form (the normal equation):
θ = (X^T X)^{−1} X^T y⃗
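To make the two solution strategies concrete, here is a minimal NumPy sketch (the synthetic data and variable names are made up for illustration) that fits θ both by batch gradient descent and by the normal equation:

```python
import numpy as np

# Illustrative synthetic data; features in [0, 1] so a fixed learning rate behaves well
rng = np.random.default_rng(0)
n = 100
X = np.c_[np.ones(n), rng.uniform(size=(n, 2))]   # column of 1s plays the role of x_0
theta_true = np.array([1.0, 2.5, -1.5])
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Batch gradient descent: every step uses the entire training set
# (stochastic gradient descent would instead update on one example at a time)
theta = np.zeros(3)
alpha = 0.5                              # learning rate
for _ in range(2000):
    grad = X.T @ (X @ theta - y) / n     # gradient of J(theta), averaged over examples
    theta -= alpha * grad

# Normal equation: theta = (X^T X)^(-1) X^T y, solved without an explicit inverse
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(theta, theta_closed)               # both should be close to theta_true
```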
E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}²
Here, xn are the data points, tn are the corresponding target values and y(xn,w) are the
predicted values.
Model Selection
M = 0 and M = 1 give poor fits to the data and hence poor representations of the function.
M = 3 seems to give the best fit to the function.
For M = 9, we obtain an excellent fit to the training data and E(w*) = 0. The curve, however, oscillates wildly and gives a poor representation of the function. This is known as overfitting.
The goal is to achieve good generalization by making accurate predictions for new data.
The test set error is a measure of how well we are doing in predicting the values of t for
new data observations of x.
Small values of M give relatively large values of the test set error.
Values of M in the range 3 ≤ M ≤ 8 give small values for the test set error, and these also
give reasonable representations.
For M = 9, the training set error goes to zero and the test set error has become very large.
Higher order polynomials are very flexible and more prone to overfitting.
This issue is resolved using Regularization.
1.7 Regularization
Controlling model complexity to fit complex models on relatively small dataset without
overfitting.
Idea is to add a penalty term in the loss function for the large weights.
Penalty is a function of the weight vector.
λ is a regularization rate that controls the amount of penalty to be added.
f(b) = E(b) + (λ/n) b^T b
Bias-Variance Tradeoff
Σ_j b_j² ≤ c
We can think of c as a budget for how much we can spend on the size of all the coefficients.
min_b { (1/n)(Xb − y)^T(Xb − y) }  s.t.  ∥b∥₂² = b^T b ≤ c
which, in penalized (Lagrangian) form, becomes
min_b { E(b) + (λ/n) b^T b }
L1 & L2 norms
∥w∥₁ = |w₁| + |w₂| + ⋯ + |w_N|
∥w∥₂ = (|w₁|² + |w₂|² + ⋯ + |w_N|²)^{1/2}
Error Surface
Consider two inputs x_1 and x_2 and corresponding parameters b_1 and b_2. The error function E(b) generates a bowl-shaped (paraboloid) convex error surface. On the (b_1, b_2) plane, the locus of points such that b^T b ≤ c is a disk of radius √c centered at the origin.
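The L2-penalized (ridge) problem has a closed-form solution, b = (X^T X + λI)^{−1} X^T y. A minimal NumPy sketch (synthetic data; for simplicity the intercept is penalized here too, which practical implementations usually avoid):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimizes (1/n)||Xb - y||^2 + (lam/n) b^T b,
    whose solution is b = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(50)

for lam in (0.0, 1.0, 100.0):
    b = ridge_fit(X, y, lam)
    print(lam, np.linalg.norm(b))   # the weight norm shrinks as lambda grows
```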
A simple example
We have two boxes, one red and one blue, and in the red box we have 2 apples and 6
oranges, and in the blue box we have 3 apples and 1 orange.
We randomly pick one of the boxes and from that box we randomly select an item of fruit,
and having observed which sort of fruit it is we replace it in the box from which it came. We
could imagine repeating this process many times.
Let us suppose that in so doing we pick the red box 40% of the time and we pick the blue
box 60% of the time, and that when we remove an item of fruit from a box we are equally
likely to select any of the pieces of fruit in the box.
The identity of the box that will be chosen is a random variable, which we shall denote by
B.
Similarly, the identity of the fruit is also a random variable and will be denoted by F.
Supposing we pick a box at random, and it turns out to be the blue box. Then the probability of
selecting an apple in the blue box is given as p(F = a | B = b) = 3/4.
p(F = a) = p(F = a|B = r)p(B = r) + p(F = a|B = b)p(B = b) = (1/4)(4/10) + (3/4)(6/10) = 11/20
p(F = o) = 1 − 11/20 = 9/20
Now, suppose instead we are told that a piece of fruit has been selected and it is an orange,
and we would like to know which box it came from. This requires that we evaluate the
probability distribution over boxes conditioned on the identity of the fruit. To solve this, we use
Bayes' Theorem as follows:
p(B = r|F = o) = p(F = o|B = r)p(B = r) / p(F = o) = (3/4 · 4/10) / (9/20) = 2/3
and hence
p(B = b|F = o) = 1 − 2/3 = 1/3
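A few lines of Python reproduce these numbers (just a sketch of the computation):

```python
# Box-and-fruit example: priors and class-conditional fruit probabilities
p_box = {"r": 4/10, "b": 6/10}
p_apple_given_box = {"r": 2/8, "b": 3/4}   # red: 2 apples of 8 fruits; blue: 3 of 4

# Sum rule + product rule: p(F=a) = sum_B p(F=a|B) p(B)
p_apple = sum(p_apple_given_box[b] * p_box[b] for b in p_box)
p_orange = 1 - p_apple
print(p_apple, p_orange)                    # 0.55 = 11/20 and 0.45 = 9/20

# Bayes' theorem: p(B=r|F=o) = p(F=o|B=r) p(B=r) / p(F=o)
p_red_given_orange = (1 - p_apple_given_box["r"]) * p_box["r"] / p_orange
print(p_red_given_orange, 1 - p_red_given_orange)  # 2/3 and 1/3
```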
p(B) is called the prior probability because it is the probability available before we
observe the identity of the fruit.
p(B|F) is called the posterior probability because it is the probability obtained after we have
observed F.
1. Sum Rule:
p(X) = Σ_Y p(X, Y)
2. Product Rule:
p(X, Y) = p(Y|X) p(X)
3. Bayes' Theorem:
p(Y|X) = p(X|Y) p(Y) / p(X)
If the joint distribution of two variables factorizes into the product of the marginals, so that
p(X,Y) = p(X)p(Y), then X and Y are said to be independent.
Gaussian/Normal Distribution
N(x | µ, σ²) = (1/√(2πσ²)) exp{ −(x − µ)²/(2σ²) }
where µ is the mean, σ² is the variance and σ is the standard deviation. We also define precision as the reciprocal of the variance, i.e. β = 1/σ².
We denote it as N(x | µ, σ²).
LHS: The likelihood of the parameters µ and σ taking certain values given that data is
observed.
RHS: Probability density of observing the data with model parameters µ and σ.
In practice, we observe that maximizing the log of the likelihood function is easier. Hence, we maximize
ln p(x | µ, σ²) = −(1/(2σ²)) Σ_{n=1}^{N} (x_n − µ)² − (N/2) ln σ² − (N/2) ln(2π)
and obtain
µ_ML = (1/N) Σ_{n=1}^{N} x_n   and   σ²_ML = (1/N) Σ_{n=1}^{N} (x_n − µ_ML)²
which are nothing but the sample mean and sample variance respectively.
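As a quick numerical sanity check (a sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # true mu = 5, sigma = 2

mu_ml = x.mean()                       # sample mean
var_ml = ((x - mu_ml) ** 2).mean()     # ML variance (divides by N, not N-1)
print(mu_ml, np.sqrt(var_ml))          # close to 5 and 2
```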
ε^{(i)} is an error term that captures either unmodeled effects or random noise.
We also assume that the ε^{(i)} are i.i.d. according to a Gaussian distribution with mean 0 and variance σ².
This gives the likelihood of the parameters:
L(θ) = L(θ; X, y⃗) = p(y⃗ | X; θ)
Now, given this probabilistic model, we want to choose our best guess of the parameters θ. The principle of maximum likelihood says that we should choose θ so as to make the data as probable as possible, i.e. we should choose θ to maximize L(θ).
To make our computations simpler, we instead maximize the log likelihood ℓ(θ) = log L(θ).
Chapter-2 Linear Models for Classification
2.1 The Classification Problem
Goal: Take an input vector x and assign it to one of K discrete classes Ck where k = 1,...,K.
The classes are usually taken as disjoint, so each input is assigned to only one class.
The input space is thus divided into decision regions whose boundaries are called decision
boundaries or decision surfaces.
The decision surfaces are linear functions of the input vector x and defined by (D-1) -
dimensional hyperplanes within the D-dimensional input space.
The datasets whose classes can be separated exactly by linear decision surfaces are said
to be linearly separable.
In the case of two-class problems, we use the binary representation in which there is a
single target variable t ∈ {0, 1} such that t = 1 represents class C1 and t = 0 represents
class C2.
We can interpret the value of t as the probability that the class is C1, with the values of
probability taking only the extreme values of 0 and 1.
For K > 2 classes, it is convenient to use a 1-of-K coding scheme in which t is a vector of
length K such that if the class is Cj , then all elements tk of t are zero except element tj,
which takes the value 1.
For instance, if we have K = 5 classes, then a pattern from class 2 would be given the
target vector t = (0, 1, 0, 0, 0)T.
Again, we can interpret the value of tk as the probability that the class is Ck.
For classification problems, we wish to predict discrete class labels, or more generally
posterior probabilities that lie in the range (0, 1). To achieve this, we consider a
generalization of this model in which we transform the linear function of w using a nonlinear
function f(·) so that
y(x) = f(w^T x + w_0)
There are two broad approaches:
1. Construct a discriminant function that directly assigns each vector x to a specific class.
2. Model the conditional probabilities in an inference stage, and then use them to make decisions.
The conditional probabilities themselves can be obtained in two ways:
1. Model them directly by representing them as parametric models (the discriminative approach)
2. The generative approach, i.e. model the class-conditional densities together with the prior probabilities
Similarly, if x is a point on the decision surface, then y(x) = 0, and the normal distance from
the origin to the decision surface is given by
w^T x / ∥w∥ = −w_0 / ∥w∥
y(x) gives the signed measure of the perpendicular distance r of point x from decision
surface.
Consider an arbitrary point x and let x⊥ be its orthogonal projection onto the decision
surface, so that
x = x_⊥ + r (w / ∥w∥)
Also we know
w^T x + w_0 = y(x)
and
w^T x_⊥ + w_0 = y(x_⊥) = 0
Finally, we obtain
r = y(x) / ∥w∥
1. One vs Rest - build K-1 discriminant functions. Each solves a two class classification.
2. One vs One - one discriminant function per pair of classes. K(K-1)/2 functions.
To avoid this we consider a single K-class discriminant comprising K linear functions of the form
y_k(x) = w_k^T x + w_{k0}
The decision boundary between class C_k and class C_j is therefore given by y_k(x) = y_j(x) and hence corresponds to a (D − 1)-dimensional hyperplane defined by
(w_k − w_j)^T x + (w_{k0} − w_{j0}) = 0
y_k(x) = w_k^T x + w_{k0}
We can group the parameters into a matrix W̃ whose k-th column is w̃_k = (w_{k0}, w_k^T)^T:
W̃ = ⎡ w_1^{(0)} ⋯ w_K^{(0)} ⎤
    ⎢     ⋮     ⋱      ⋮    ⎥
    ⎣ w_1^{(D)} ⋯ w_K^{(D)} ⎦
A new input x is assigned to the class for which the output yk is largest.
Now we determine the parameter matrix by minimizing a sum-of-squares error function.
Consider a training data set {x_n, t_n} where n = 1,...,N and define a matrix T whose nth row is the vector t_n^T, together with a matrix X̃ whose nth row is x̃_n^T. The error function can be written as
E_D(W̃) = (1/2) Tr{ (X̃W̃ − T)^T (X̃W̃ − T) }
We consider a 2-class problem with N1 points of class C1 and N2 points of class C2, so that
the mean vectors are given as
m_1 = (1/N_1) Σ_{n∈C_1} x_n   and   m_2 = (1/N_2) Σ_{n∈C_2} x_n
The simplest measure of the separation of the classes, when projected onto w, is the separation of the projected class means. That is, choose w to maximize
m_2 − m_1 = w^T (m_2 − m_1)
s_k² = Σ_{n∈C_k} (y_n − m_k)²
where y_n = w^T x_n
The total within-class variance for the whole dataset can be defined to be s_1² + s_2².
Fisher's Criterion
The ratio of the between-class variance to the within-class variance, i.e.
J(w) = (m_2 − m_1)² / (s_1² + s_2²)
S_B = (m_2 − m_1)(m_2 − m_1)^T
S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T
S_B w is always in the direction of (m_2 − m_1), and since we do not care about the magnitude of w, the scalar factors can be dropped; after multiplying both sides by S_W^{−1} we obtain
w ∝ S_W^{−1} (m_2 − m_1)
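A minimal NumPy sketch of Fisher's direction on synthetic two-class data (the class means and covariances are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=100)  # class C1
X2 = rng.multivariate_normal([2, 2], [[1, 0.8], [0.8, 1]], size=100)  # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)        # class means
# Within-class scatter S_W = sum of the per-class scatter matrices
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(S_W, m2 - m1)                # w proportional to S_W^{-1}(m2 - m1)
w /= np.linalg.norm(w)

# Project both classes onto w; the projected means should separate well
print((X1 @ w).mean(), (X2 @ w).mean())
```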
2.5 Perceptron
Another example of a linear discriminant model is the perceptron of Rosenblatt (1962),
which occupies an important place in the history of pattern recognition algorithms.
It corresponds to a two-class model in which the input vector x is first transformed using a
fixed nonlinear transformation to give a feature vector φ(x), and this is then used to
construct a generalized linear model of the form
y(x) = f(w^T φ(x))
The non-linear activation function f(·) is given by a step function of the form
f(a) = { +1, a ≥ 0
         −1, a < 0
For the perceptron, it is more convenient to use target values t = +1 for class C1 and t = -1
for class C2, which matches the choice of the activation function.
The perceptron criterion minimizes the total error over the set M of misclassified patterns:
E_P(w) = − Σ_{n∈M} w^T φ(x_n) t_n
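A minimal sketch of the perceptron learning rule in NumPy (assuming the identity feature map φ(x) = x and synthetic linearly separable data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.r_[rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))]
X = np.c_[np.ones(100), X]                 # bias absorbed as x_0 = 1
t = np.r_[-np.ones(50), np.ones(50)]       # targets in {-1, +1}

w = np.zeros(3)
for _ in range(100):                       # epochs
    updated = False
    for x_n, t_n in zip(X, t):
        if np.sign(w @ x_n) != t_n:        # misclassified pattern: n is in M
            w += x_n * t_n                 # perceptron update: w <- w + phi(x_n) t_n
            updated = True
    if not updated:                        # converged: no misclassifications left
        break
print(w)
```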
Optimization
p(C_1|x) = p(x|C_1)p(C_1) / [ p(x|C_1)p(C_1) + p(x|C_2)p(C_2) ] = 1/(1 + exp(−a)) = σ(a)
where
a = ln [ p(x|C_1)p(C_1) / (p(x|C_2)p(C_2)) ]
σ(a) is termed as the logistic sigmoid function. The term sigmoid means 'S-shaped'.
This is sometimes also called squashing function as it maps the whole real axis into a finite
interval.
It maps real-valued inputs to probabilities.
logit(p(x)) = log [ p(x) / (1 − p(x)) ]
The logistic function is the inverse of the logit (it maps values from the range (−∞, +∞) into (0, 1)).
For K > 2, we define the normalized exponential
p(C_k|x) = exp(a_k) / Σ_j exp(a_j)
which can be regarded as a multiclass generalization of the logistic sigmoid.
This is also known as the SoftMax function. It turns logits into probabilities by taking the exponential of each output and then normalizing each number by the sum of those exponentials so that the entire output vector adds up to one.
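The two squashing functions in a few lines of NumPy (a sketch; subtracting max(a) is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: maps (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Normalized exponential: turns a vector of logits into probabilities."""
    e = np.exp(a - np.max(a))   # shift by max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                          # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))    # non-negative, sums to 1
```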
Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. To start with, we shall assume that all classes share the same covariance matrix. Thus the density for class C_k is given by
p(x|C_k) = 1/((2π)^{D/2} |Σ|^{1/2}) exp{ −(1/2)(x − µ_k)^T Σ^{−1} (x − µ_k) }
For a D-dimensional vector x, the multivariate Gaussian distribution takes this form, where µ is a D-dimensional mean vector, Σ is a D×D covariance matrix, and |Σ| denotes its determinant.
We consider the case of 2 classes
Once we have specified a parametric functional form for the class-conditional densities
p(x|Ck), we can then determine the values of the parameters, together with the prior class
probabilities p(Ck), using maximum likelihood.
This requires a data set comprising observations of x along with their corresponding class
labels.
We consider the case of two classes, each having a Gaussian class-conditional density
with a shared covariance matrix, and suppose we have a data set {xn, tn} where n = 1,...,N.
Here tn = 1 denotes class C1 and tn = 0 denotes class C2. We denote the prior class
probability p(C1) = π, so that p(C2)=1 − π. For a data point xn from class C1, we have tn =
1 and hence
p(t, X | π, µ_1, µ_2, Σ) = ∏_{n=1}^{N} [π N(x_n|µ_1, Σ)]^{t_n} [(1 − π) N(x_n|µ_2, Σ)]^{1−t_n}
Maximizing the log likelihood with respect to π we get
π = (1/N) Σ_{n=1}^{N} t_n = N_1/(N_1 + N_2)
Similarly, maximizing with respect to µ_1 we get
µ_1 = (1/N_1) Σ_{n=1}^{N} t_n x_n
Maximizing with respect to μ2 we get
µ_2 = (1/N_2) Σ_{n=1}^{N} (1 − t_n) x_n
The maximum likelihood solution represents a weighted average of the covariance matrix
associated with each of the two classes.
p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1 − µ)^{1−x_n}
Now consider the case of discrete feature values xi. For simplicity, we begin by looking at
binary feature values xi ∈ {0, 1}. Here we will make the naive Bayes assumption in which
the feature values are treated as independent, conditioned on the class Ck. Thus we have
class-conditional distributions of the form
p(x|C_k) = ∏_{i=1}^{D} µ_{ki}^{x_i} (1 − µ_{ki})^{1−x_i}
Example
The data tuples are described by the attributes age, income, student and credit_rating.
The class label attribute buys_computer, has two distinct values (namely, {yes, no}).
Let C1 correspond to the class buys_computer = yes and C2 correspond to buys_computer
= no.
We wish to classify X = (age = youth, income = medium, student = yes, credit_rating = fair)
The prior probability of each class, can be computed based on the training tuples:
Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X.
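A sketch of how such a categorical naive Bayes prediction could be computed. The tuples below are placeholders, since the full training table from the example is not reproduced here:

```python
from collections import defaultdict

# Hypothetical training tuples (placeholders, not the full example table).
# Each tuple: (age, income, student, credit_rating, class)
data = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "medium", "yes", "fair", "yes"),
    ("middle", "high", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "excellent", "no"),
]

# Count class frequencies and per-class attribute-value frequencies
class_counts = defaultdict(int)
value_counts = defaultdict(int)   # keyed by (class, attribute index, value)
for *x, c in data:
    class_counts[c] += 1
    for i, v in enumerate(x):
        value_counts[(c, i, v)] += 1

def predict(x):
    """Pick the class maximizing p(C) * prod_i p(x_i|C), the naive Bayes rule."""
    best_c, best_p = None, -1.0
    n = len(data)
    for c, nc in class_counts.items():
        p = nc / n                                 # prior p(C)
        for i, v in enumerate(x):
            p *= value_counts[(c, i, v)] / nc      # likelihood p(x_i|C)
        if p > best_p:
            best_c, best_p = c, p
    return best_c

print(predict(("youth", "medium", "yes", "fair")))
```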
We have considered classification models that work directly with the original input vector x.
However, all of the algorithms are equally applicable if we first make a fixed nonlinear
transformation of the inputs using a vector of basis functions φ(x).
The resulting decision boundaries will be linear in the feature space φ, and these
correspond to nonlinear decision boundaries in the original x space.
Classes that are linearly separable in the feature space φ(x) need not be linearly separable
in the original observation space x.
Logistic Regression
Consider the two-class classification problem. The posterior probability of class C1 can be
written as a logistic sigmoid function:
p(C_1|x) = 1 / (1 + exp(−w^T x)) = σ(w^T x)
where, p(C2|x) = 1 - p(C1|x) and we omit the bias term for clarity.
This model is known as logistic regression.
p(t|X, w) = ∏_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}
where y_n = σ(w^T x_n). Taking the negative logarithm gives the cross-entropy error function
E(w) = − Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) }
dE_n/dy_n = (y_n − t_n) / (y_n (1 − y_n))
dE_n/dw = (dE_n/dy_n)(dy_n/dw) = (y_n − t_n) x_n
Therefore, we obtain
∇E(w) = Σ_{n=1}^{N} (y_n − t_n) x_n
This takes exactly the same form as the gradient of the sum-of-squares error function for
the linear regression model.
Unlike linear regression, there is no closed form solution, due to non-linearity of the logistic
sigmoid function.
The error function is convex and can be optimized using standard gradient-based or more
advanced optimization techniques.
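A minimal gradient-descent sketch for logistic regression in NumPy (synthetic data; in practice one would use a library optimizer or Newton/IRLS):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(5)
X = np.r_[rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))]
X = np.c_[np.ones(200), X]          # bias column
t = np.r_[np.zeros(100), np.ones(100)]

w = np.zeros(3)
eta = 0.1                           # learning rate
for _ in range(2000):
    y = sigmoid(X @ w)              # y_n = sigma(w^T x_n)
    grad = X.T @ (y - t) / len(t)   # gradient of the cross-entropy error
    w -= eta * grad
print(w)
```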
Chapter-3 Decision Trees & Ensembles
3.1 Decision Tree Learning
Construction of decision tree from class-labeled training tuples
It is a flow-chart like structure, where each internal node (non-leaf) denotes a test on an
attribute, each branch represents the outcome of the test, and each leaf node (terminal)
holds a class label.
The population or sample is split into two or more homogeneous sets (sub-populations) based on the most significant splitter/differentiator among the input variables.
Example
Given a dataset of two inputs (x) of height in centimeters and weight in kilograms, the
output is the sex as male or female.
The tree can be stored into a file as a graph or a set of rules.
For example, given the input of [height = 160 cm, weight = 65 kg], we would traverse the
above tree as follows:
Important Terminology
1. Root Node: It is the topmost node in the tree. It represents the entire population or sample
and this further gets divided into two or more homogeneous sets.
2. Splitting: It is a process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, then it is called the
decision node.
4. Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
Example of a decision tree
The classification problem can be solved by asking a series of carefully crafted questions
about the attributes of the test record.
The series of questions and their possible answers can be organized in the form of a
decision tree.
There could be more than one tree that fits the same data.
Applying Model to Test Data
Starting from the root node of the tree, we apply the test condition to the record and follow
the appropriate branch based on the outcome of the test.
This will lead us either to another internal node, for which a new test is applied, or to a leaf
node.
The class label associated with the leaf node is then assigned to the record.
Hunt's Algorithm
In Hunt's Algorithm, a decision tree is grown in a recursive fashion by partitioning the
training records into successively purer subsets.
Let Dt be the set of training records that reach a node t and y = {y1,...yc} be the class
labels.
If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
If Dt contains records that belong to more than one class, use an attribute test to split the
data into smaller subsets. Recursively apply the procedure to each subset.
Consider the problem of predicting whether a loan applicant will repay their loan obligations
or become delinquent, subsequently defaulting on their loan.
Greedy Strategy: Split records based on an attribute test that optimizes certain criterion.
Issues:
Determine how to split the records
How to specify the attribute test condition
How to determine the best split
Determine when to stop splitting
How to Specify Attribute Test Condition
1. Multi-way Split: Uses as many partitions as there are distinct attribute values
2. Binary Split: Divides values into two subsets; need to find the optimal partitioning
Splitting Based on Continuous Attributes
1. Gini Index
Gini index = 1 − Σ_{i=0}^{c−1} p_i(t)²
where pi(t) is the frequency of class i at node t, and c is the total number of classes
2. Entropy
Entropy = − Σ_{i=0}^{c−1} p_i(t) log₂ p_i(t)
3. Misclassification Error
Classification error = 1 − max_i [p_i(t)]
The gain of a split is Gain = P − M, where P is the impurity of the parent node and M is the weighted impurity of its children.
For a node t,
Gini index = 1 − Σ_{i=0}^{c−1} p_i(t)²
GINI_split = Σ_{i=1}^{k} (n_i/n) GINI(i)
where,
ni = number of records at child i
n = number of records at parent node p
For each distinct value, gather counts for each class in the dataset
Use the count matrix to make decisions
2. Entropy
Entropy(t) = − Σ_{i=0}^{c−1} p_i(t) log₂ p_i(t)
where pi(t) is the frequency of class i at node t, and c is the total number of classes
Information Gain:
Gain_split = Entropy(p) − Σ_{i=1}^{k} (n_i/n) Entropy(i)
3. Classification Error
Classification error(t) = 1 − max_i [p_i(t)]
All three measures attain their maximum value when the class distribution is uniform (e.g., p = 0.5 in the two-class case)
The minimum values for the measures are attained when all the records belong to the
same class
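The three impurity measures in a short Python sketch:

```python
import numpy as np

def gini(p):
    """Gini index: 1 - sum_i p_i^2."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum_i p_i log2 p_i (with 0 log 0 taken as 0)."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def classification_error(p):
    """Misclassification error: 1 - max_i p_i."""
    return 1.0 - np.max(np.asarray(p))

# Uniform distribution gives the maximum; a pure node gives the minimum
for dist in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(dist, gini(dist), entropy(dist), classification_error(dist))
```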
Advantages
Relatively inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Robust to noise (especially when methods to avoid overfitting are employed)
Can easily handle redundant attributes
Can easily handle irrelevant attributes (unless the attributes are interacting)
Disadvantages
Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
Each decision boundary involves only a single attribute
The training error of a model can be reduced by increasing the model complexity.
For example, the leaf nodes of the tree can be expanded until it perfectly fits the training
data.
Although the training error for such a complex tree is zero, the test error can be large because the tree may contain nodes that accidentally fit some of the noise points in the training data.
Such nodes can degrade the performance of the tree because they do not generalize
well to the test examples.
The tree that contains the smaller number of nodes has a higher training error rate, but a
lower test error rate compared to the more complex tree.
1. Presence of noise
2. Lack of representative samples
M1 has overfitted the training data because there is a simpler model with lower error rate
on the test set.
The Four-legged attribute test condition in model M1 is spurious because it fits the
mislabeled training records, which leads to the misclassification of records in the test set
Models that make their classification decisions based on a small number of training records
are also susceptible to overfitting.
Humans, elephants, and dolphins are misclassified because the decision tree classifies all
warm-blooded vertebrates that do not hibernate as non-mammals.
The tree arrives at this classification decision because there is only one training record,
which is an eagle, with such characteristics.
Handling Overfitting in Decision Trees
Two strategies for avoiding model overfitting in the context of decision tree induction.
Pre-pruning (Early Stopping)
Post-pruning
Pre-pruning
The tree-growing algorithm is halted before generating a fully grown tree that perfectly fits
the entire training data.
To do this, a more restrictive stopping condition must be used; e.g., stop expanding a leaf
node when the observed gain in impurity measure (or improvement in the estimated
generalization error) falls below a certain threshold.
The advantage of this approach is that it avoids generating overly complex subtrees that
overfit the training data.
Nevertheless, it is difficult to choose the right threshold for early termination.
Post-pruning
In this approach, the decision tree is initially grown to its maximum size.
This is followed by a tree-pruning step, which proceeds to trim the fully grown tree in a
bottom-up fashion.
Trimming can be done by replacing a subtree with
a new leaf node whose class label is determined from the majority class of records
affiliated with the subtree, or
the most frequently used branch of the subtree.
The tree-pruning step terminates when no further improvement is observed.
Post-pruning tends to give better results than pre-pruning because it makes pruning
decisions based on a fully grown tree, unlike pre-pruning, which can suffer from premature
termination of the tree-growing process.
However, for post-pruning, the additional computations needed to grow the full tree may be
wasted when the subtree is pruned.
Bias Error
Bias is the set of simplifying assumptions made by a model to make the target function easier to learn.
Generally, linear algorithms have a high bias making them fast to learn and easier to
understand but generally less flexible.
In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm's bias.
Low Bias: Suggests less assumptions about the form of the target function.
High-Bias: Suggests more assumptions about the form of the target function.
Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest
Neighbors and Support Vector Machines.
Examples of high-bias machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.
Variance Error
Variance is the amount that the estimate of the target function will change if different
training data was used.
The target function is estimated from the training data by a machine learning algorithm, so
we should expect the algorithm to have some variance.
Ideally, it should not change too much from one training dataset to the next, meaning that
the algorithm is good at picking out the hidden underlying mapping between the inputs and
the output variables.
Low Variance: Suggests small changes to the estimate of the target function with changes
to the training dataset.
High Variance: Suggests large changes to the estimate of the target function with changes
to the training dataset.
Generally, nonlinear machine learning algorithms that have a lot of flexibility have a high variance. For example, decision trees have a high variance, which is even higher if the trees are not pruned before use.
Examples of low-variance machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.
Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest
Neighbors and Support Vector Machines.
Graphical Definition
We can create a graphical visualization of bias and variance using a bulls-eye diagram.
Imagine that the center of the target is a model that perfectly predicts the correct values. As
we move away from the bulls-eye, our predictions get worse and worse. Imagine we can
repeat our entire model building process to get a number of separate hits on the target.
Each hit represents an individual realization of our model, given the chance variability
in the training data we gather.
Sometimes we will get a good distribution of training data so we predict very well and
we are close to the bulls-eye, while sometimes our training data might be full of
outliers or non-standard values resulting in poorer predictions.
These different realizations result in a scatter of hits on the target.
Underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data.
Overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model extensively on a noisy dataset. These models have low bias and high variance.
The bias-variance tradeoff is a central problem in supervised learning. Ideally, one wants to
choose a model that both accurately captures the regularities in its training data, but also
generalizes well to unseen data.
Unfortunately, it is typically impossible to do both simultaneously.
High-variance learning methods may be able to represent their training set well, but
are at risk of overfitting to noisy or unrepresentative training data.
In contrast, algorithms with high bias typically produce simpler models that may fail to
capture important regularities (i.e. underfit) in the data.
The goal of any supervised machine learning algorithm is to achieve low bias and low
variance. In turn the algorithm should achieve good prediction performance.
Usually there is a trend as follows:
Linear machine learning algorithms often have a high bias but a low variance.
Nonlinear machine learning algorithms often have a low bias but a high variance.
The parameterization of machine learning algorithms is often a battle to balance out bias
and variance.
3.6 Ensembles
Ensemble learning refers to algorithms that combine the predictions from two or more
models.
In broad terms, using ensemble methods is about combining models such that the
ensemble has a better performance than an individual model on average.
The three main classes of ensemble learning methods are bagging, stacking, and boosting.
Bagging involves fitting many decision trees on different samples of the same dataset and
averaging the predictions.
Stacking involves fitting many different model types on the same data and using another model to learn how best to combine the predictions.
Boosting involves sequentially adding ensemble members that correct the predictions made by prior models, and outputting a weighted average of the predictions.
Bagging
Bootstrap sampling
Bagging adopts the bootstrap distribution for generating different base learners. In other
words, it applies bootstrap sampling to obtain the data subsets for training the base
learners.
A bootstrap sample is a sample of size n drawn with replacement from an original training
set D with |D| = n.
Consequently, some training examples are duplicated in each bootstrap sample, and some other training examples do not appear in a given bootstrap sample at all. Usually, we refer to these left-out examples as the "out-of-bag sample".
In the limit, a given bootstrap sample contains approximately 63.2% of the unique examples; consequently, about 36.8% of the examples from the original training set will not appear in a given bootstrap sample at all.
If we sample from a uniform distribution, the probability that a given example from a dataset of size n is never drawn into a bootstrap sample is
(1 − 1/n)^n → 1/e ≈ 0.368 as n → ∞
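A quick simulation (sketch) confirming the ~63.2% figure:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
sample = rng.integers(0, n, size=n)        # bootstrap: draw n indices with replacement
unique_frac = len(np.unique(sample)) / n
print(unique_frac)                          # close to 1 - 1/e, i.e. about 0.632
```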
In boosting, the training dataset for each subsequent classifier increasingly focuses on
instances misclassified by previously generated classifiers.
The key property of boosting ensembles is the idea of correcting prediction errors.
The models are fit and added to the ensemble sequentially such that the second model attempts to correct the predictions of the first model, the third corrects the second model, and so on.
This typically involves the use of very simple decision trees that only make a single or a
few decisions, referred to in boosting as weak learners.
The predictions of the weak learners are combined using simple voting or averaging, although the contributions are weighted in proportion to their performance or capability. The objective is to develop a so-called "strong learner" from many purpose-built "weak learners."
AdaBoost, or Adaptive Boosting
The way boosting generally works is you build multiple models on subsets of the dataset in
a sequential manner, where each model learns from the previous models' mistakes. After
you build all the weak models, you then combine their predictions to get a "strong model".
AdaBoost, short for Adaptive Boosting, is one of the early successful algorithms within the Boosting branch of machine learning, and is used specifically for binary classification.
Adaboost Algorithm
AdaBoost for classification is a supervised machine learning algorithm.
It consists of iteratively training multiple stumps using feature data (x) and target labels (y).
After training, the model's overall predictions come from a weighted sum of the stumps'
predictions.
Step 1 - Initialization
Step 1 of AdaBoost is initializing a weight of 1/N for each datapoint. Again, the weight of a datapoint represents the probability of selecting it during sampling.
Stumps in AdaBoost progressively learn from previous stumps' mistakes through adjusting
the weights of the dataset to accommodate the mistakes made by stumps at each iteration.
The weights of misclassified data will be increased, and weights of correctly classified data
will be decreased. As a result, as we go into further iterations, the training data will mostly
comprise data that is often misclassified by our stumps.
Intuitively, if one stump is unable to classify certain data properly, then you just need to train more stumps on the data that is "hard to classify". Hence, taking a weighted sum of these stumps lets AdaBoost perform very well.
Example
The weight updates at a given iteration m are calculated as the product of the current weights with exp(−α_m y K_m(x)), where
y represents the true values of the target feature (in our toy dataset case, this would
be the 'species' column).
Km(x) is the prediction made by the stump we trained at iteration m.
α_m is the confidence we place on the predictive power of stump m.
y and Km(x) can each take values -1 or 1. If y is 1 and Km(x) is 1, or both are -1, then
this means that training sample x has been correctly classified. y * Km(x) will then be
equal to 1.
Otherwise, y * Km(x) will be equal to -1.
This corresponds with the "learning from mistakes" concept
When data is misclassified, y · K_m(x) = −1, so the weight update is positive and the weights increase exponentially for that data.
When data is correctly classified, y · K_m(x) = 1, and the opposite happens.
Epsilon (ε_m) is the ratio of the sum of weights of misclassified samples to the total sum of sample weights. The confidence is then computed as α_m = (1/2) ln((1 − ε_m)/ε_m).
Large epsilon values, which mean a large misclassification percentage, make α_m decrease sharply.
The confidence we have in stump m's predictions decreases accordingly. Intuitively this makes sense, because more errors mean less confidence.
Hence stump m's predictions will only have a very small "say" in the overall prediction.
Naturally, the opposite follows for small epsilon values.
The overall prediction of AdaBoost at iteration m is denoted by
C_m(x) = Σ_{i=1}^{m} α_i K_i(x)
Example
Step 4 - Making New Overall Predictions
The final step of the AdaBoost algorithm involves making overall predictions on new data:
C(x) = sign( Σ_{m=1}^{M} α_m K_m(x) )
The overall predictions are just the sums of each stump's predictions, weighted by each
stump's alpha (confidence that AdaBoost places on each stump's ability to classify the
data).
Since AdaBoost only predicts {−1, 1}, the sign of the weighted sum determines the class: as long as the weighted sum is above 0, AdaBoost predicts 1; when the weighted sum is below 0, AdaBoost predicts −1.
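A compact AdaBoost sketch using scikit-learn decision stumps as the weak learners (the dataset here is synthetic and the choice of 20 rounds is arbitrary; a from-scratch stump search would work equally well):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)     # labels in {-1, +1}

n, M = len(y), 20
w = np.full(n, 1 / n)                          # Step 1: uniform weights 1/N
stumps, alphas = [], []
for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)           # train the stump on weighted data
    pred = stump.predict(X)
    eps = w[pred != y].sum() / w.sum()         # weighted error epsilon_m
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
    w *= np.exp(-alpha * y * pred)             # up-weight mistakes, down-weight hits
    w /= w.sum()                               # renormalize to a distribution
    stumps.append(stump); alphas.append(alpha)

# Overall prediction: sign of the alpha-weighted sum of stump predictions
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print((np.sign(F) == y).mean())                # training accuracy
```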
Random Forests
Random forests are among the most widely used machine learning algorithms, probably due to their relatively good performance "out of the box" and ease of use.
The random forest algorithm can be understood as bagging with decision trees, but instead
of growing the decision trees by basing the splitting criterion on the complete feature set,
we use random feature subsets.
While the size of the feature subset to consider at each node is a hyperparameter that we
can tune, we can use NumFeatures = log2 m + 1.
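A usage sketch with scikit-learn (the hyperparameter values are illustrative only; max_features controls the random feature subset considered at each split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "log2" selects log2(m) features per split, in the spirit of NumFeatures = log2 m + 1
rf = RandomForestClassifier(n_estimators=100, max_features="log2", random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))
```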
Chapter-4 Support Vector Machines
4.1 Support Vector Machine
Basic Idea
Given the data and two classes, learn a function of the form:
g(x) = w^T x + b
The linear discriminant function (classifier) with the maximum margin is the best
Margin is defined as the width that the boundary could be increased by, before hitting a
data point
Which points should influence optimality?
All points? That is linear regression.
Or only the "difficult points" close to the decision boundary? That is support vector machines.
Support Vectors
Support vectors are the data points that lie closest to the decision surface (or hyperplane)
They are the data points most difficult to classify
They have direct bearing on the optimum location of the decision surface
Mathematical Formulation:
maximize 2/∥w∥
such that
for y_i = +1, w^T x_i + b ≥ 1
for y_i = −1, w^T x_i + b ≤ −1
Equivalently,
minimize (1/2)∥w∥²
such that
for y_i = +1, w^T x_i + b ≥ 1
for y_i = −1, w^T x_i + b ≤ −1
Recall that the distance between two parallel lines ax + by + c_1 = 0 and ax + by + c_2 = 0 is
d = |c_2 − c_1| / √(a² + b²)
Our lines in 2-D are:
w_1 x_1 + w_2 x_2 + b − 1 = 0 and w_1 x_1 + w_2 x_2 + b + 1 = 0
Distance = |(b + 1) − (b − 1)| / √(w_1² + w_2²) = 2/∥w∥
Formulation:
minimize (1/2)∥w∥²
such that
y_i(w^T x_i + b) ≥ 1
∇f(x) − α∇g(x) = 0
f(w) = (1/2)∥w∥²
g(w, b) = y_i(w^T x_i + b) − 1, i = 1, ..., m
For any convex optimization problem, there always exist settings of the dual variables such
that the unconstrained minimum of the Lagrangian with respect to the primal variables
(keeping the dual variables fixed) coincides with the solution of the original constrained
minimization problem.
Kuhn-Tucker theorem: The solution we find here will be the same as the solution to the
original problem
The advantage of the dual problem over the Lagrangian primal problem is that the objective function now depends only on the Lagrange multipliers, which is easier to solve analytically.
It also lets us solve the problem by computing just the inner products of x_i, x_j:
g(x) = w^T x + b = Σ_{i∈SV} α_i y_i x_i^T x + b
Notice it relies on a dot product between the test point x and the support vectors xi
Also keep in mind that solving the optimization problem involved computing the dot products x_i^T x_j between all pairs of training points.
This idea is based on a simple premise: allow SVM to make a certain number of mistakes and
keep margin as wide as possible, so that other points can still be classified correctly. This can
be done simply by modifying the objective of SVM.
Here the red decision boundary perfectly separates all the training points.
The green decision boundary has a wider margin that would allow it to generalize well on
unseen data.
In that sense, soft margin formulation would also help in avoiding the overfitting problem.
So in cases where the data is not linearly separable (noisy data, outliers, etc.), slack variables ξ_i can be added to allow misclassification of difficult or noisy data points.
In the problem posed here, we soften the constraint:
y_i(x_i^T w + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
The objective contains a penalty of cost Cξi, for any data point that
falls within the margin on the correct side of the separating hyperplane (i.e., when 0 <
ξi ≤1)
or on the wrong side of the separating hyperplane (i.e., when ξi > 1)
The value of ξi is the distance of xi from the corresponding class's margin if xi is on the
wrong side of the margin, otherwise zero.
Thus the points that are far away from the margin on the wrong side would get more
penalty.
C is a hyperparameter that decides the trade-off between maximizing the margin and
minimizing the mistakes. When C is small, classification mistakes are given less
importance and focus is more on maximizing the margin and vice versa.
We would therefore like to minimize the sum total of the penalties ξi over all i (this is an
upper bound on the training classification errors). Thus, the new well-posed optimization
problem (using L1 regularization) becomes:
min_{w, b, ξ} (1/2) w^T w + C Σ_{i=1}^{m} ξ_i
s.t. y_i(x_i^T w + b) ≥ 1 − ξ_i, ξ_i ≥ 0
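A usage sketch of the soft-margin trade-off with scikit-learn (the C values and blob data are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wider margin, more margin violations tolerated.
    # Large C: mistakes penalized heavily, narrower margin.
    print(C, len(clf.support_), clf.score(X, y))
```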
Feature Space
General idea: the original input space can be mapped to some higher-dimensional feature
space where the training set is separable:
Proper feature mapping can turn a non-linear problem into a linear one!
In the feature space the discriminant becomes
g(x) = Σ_{i∈SV} α_i y_i φ(x_i)^T φ(x) + b
No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.
A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)
One particularly interesting consequence of strong duality for convex optimization problems
is a property known as complementary slackness
According to the complementary slackness condition:
α_i (y_i(w^T x_i + b) − 1) = 0
w = Σ_{i=1}^{m} α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i
and now, after training and finding w by this method, given an unknown point u we can classify it by looking at the sign of:
f(u) = w^T u + b = ( Σ_{i∈SV} α_i y_i x_i^T u ) + b
Examples of Commonly used Kernel Functions
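Commonly used choices include the linear, polynomial, and Gaussian (RBF) kernels; a sketch of each as a plain NumPy function (the parameter defaults are illustrative):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                   # K(x, z) = x^T z

def polynomial_kernel(x, z, degree=3, c=1.0):
    return (x @ z + c) ** degree                   # K(x, z) = (x^T z + c)^d

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))   # K(x, z) = exp(-gamma ||x - z||^2)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```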
The human brain consists primarily of nerve cells called neurons, linked together with other neurons via strands of fiber called axons.
Axons are used to transmit nerve impulses from one neuron to another whenever the neurons are stimulated. A neuron is connected to the axons of other neurons via dendrites, which are extensions from the cell body of the neuron; the axon of a neuron is thus connected to the dendrites of many other neurons.
The contact point between a dendrite and an axon is called a synapse.
Neurologists have discovered that the human brain learns by changing the strength of the
synaptic connection between neurons upon repeated stimulation by the same impulse.
Similarities
Neurons, connections between neurons
Learning = change of connections, not change of neurons
Massive parallel processing
But artificial neural networks are much simpler
computation within neuron vastly simplified
discrete time steps
typically some form of supervised learning with massive number of stimuli
In a perceptron, each input node is connected via a weighted link to the output node.
The weighted link is used to emulate the strength of synaptic connection between neurons.
As in biological neural systems, training a perceptron model amounts to adapting the
weights of the links until they fit the input-output relationships of the underlying data.
During the training phase of a perceptron model, the weight parameters w are adjusted
until the outputs of the perceptron become consistent with the true outputs of training
examples.
For an XOR classification problem, there is no linear hyperplane that can separate the two classes.
Multi-layer neural networks are able to model more complex relationships between input
and output variables
We can think of each hidden node as a perceptron that tries to construct one of the two
hyperplanes, while the output node simply combines the results of the perceptrons to yield
the decision boundary
Our goal is to extend this model by making the basis functions φj(x) depend on parameters
and then to allow these parameters to be adjusted, along with the coefficients wj during
training.
Neural networks use basis functions such that each basis function is itself a nonlinear
function of a linear combination of the inputs, where the coefficients in the linear
combination are adaptive parameters.
y(x, w) = f( Σ_{j=1}^{M} w_j φ_j(x) )
for fixed non-linear basis functions φ_j(·), where f(·) is a nonlinear activation function in the case of classification.
We now extend this model by allowing adaptive basis functions, and learning their
parameters.
In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be
another non-linear function of linear combination of the inputs.
φ_j(x) = f( Σ_i w_{ji} x_i + w_{j0} )
a_j = Σ_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}
where j = 1, ..., M and the superscript (1) indicates that the corresponding parameters are in the first 'layer' of the network. These a_j are known as activations.
Each of them is then transformed using a differentiable, nonlinear activation function
h(.) to give
z j = h(a j )
The nonlinear functions h(.) are generally chosen to be sigmoidal functions such as the
logistic sigmoid or the 'tanh' function.
These values are again linearly combined to give output unit activations
a_k = Σ_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}
y_k(x, w) = σ( Σ_{j=1}^{M} w_{kj}^{(2)} h( Σ_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} ) + w_{k0}^{(2)} )
Thus the neural network model is simply a nonlinear function from a set of input variables xi
to a set of output variables yk controlled by a vector w of adjustable parameters.
Activation function of a node in an artificial neural network is a function that calculates the
output of the node (based on its inputs and the weights on individual inputs).
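The complete two-layer forward pass in a few lines of NumPy (a sketch; the dimensions and the tanh/sigmoid choices are illustrative):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward network: y = sigma(W2 h(W1 x + b1) + b2)."""
    a = W1 @ x + b1                    # first-layer activations a_j
    z = np.tanh(a)                     # hidden units z_j = h(a_j)
    a2 = W2 @ z + b2                   # output-unit activations a_k
    return 1.0 / (1.0 + np.exp(-a2))   # sigma(.) for classification outputs

rng = np.random.default_rng(8)
D, M, K = 4, 5, 2                      # input, hidden, output dimensions
W1, b1 = rng.standard_normal((M, D)), np.zeros(M)
W2, b2 = rng.standard_normal((K, M)), np.zeros(K)
print(forward(rng.standard_normal(D), W1, b1, W2, b2))
```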
Parameter Optimization
We turn next to the task of finding a weight vector w which minimizes the chosen function
E(w).
Because the error E(w) is a smooth continuous function of w, its smallest value will occur at a point in weight space such that the gradient of the error function vanishes, i.e. ∇E(w) = 0, since otherwise we could make a small step in the direction of −∇E(w) and thereby further reduce the error.
Points at which the gradient vanishes are called stationary points, and may be further classified into minima, maxima, and saddle points.
A minimum that corresponds to the smallest value of the error function for any weight
vector is said to be a global minimum.
Any other minima corresponding to higher values of the error function are said to be local
minima.
For a successful application of neural networks, it may not be necessary to find the global minimum (and in general it will not be known whether the global minimum has been found), but it may be necessary to compare several local minima in order to find a sufficiently good solution.
Network Training
E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}²
p(t|w) = ∏_{n=1}^{N} y_n^{t_n} {1 − y_n}^{1−t_n}
We now derive the backpropagation algorithm for a general network having arbitrary feed-forward topology, arbitrary differentiable nonlinear activation functions, and a broad class of error functions.
The sum is transformed by a nonlinear activation function h(·) to give the activation z_j of unit j in the form z_j = h(a_j)
For each pattern in the training set, we shall suppose that we have supplied the
corresponding input vector to the network and calculated the activations of all of the hidden
and output units in the network by successive application of the two equations.
This process is often called forward propagation because it can be regarded as a forward
flow of information through the network.
We shall now see how this simple result extends to the more complex setting of multilayer
feed-forward networks.
In a general feed-forward network, each unit computes a weighted sum of its inputs of the
form
a_j = Σ_i w_{ji} z_i
where z_i is the activation of a unit, or input, that sends a connection to unit j, and w_{ji} is the weight associated with that connection.
We now introduce a useful notation, where the δ's are often referred to as errors:
δ_j ≡ ∂E_n/∂a_j
We can write
∂a_j/∂w_{ji} = z_i  (because a_j = Σ_i w_{ji} z_i)
and hence
∂E_n/∂w_{ji} = δ_j z_i
The required derivative is obtained simply by multiplying the value of ẟ for the unit at the
output end of the weight by the value of z for the unit at the input end of the weight.
This takes the same form as for the simple linear model.
Now consider the evaluation of the derivative of En with respect to a weight wji
The outputs of the various units will depend on the particular input pattern n.
First we note that En depends on the weight wji only via the summed input aj to unit j.
We can therefore apply the chain rule for partial derivatives to give
∂E_n/∂w_{ji} = (∂E_n/∂a_j)(∂a_j/∂w_{ji})
For the output units, δ_k = y_k − t_k
To evaluate the ẟ's for hidden units, we again make use of the chain rule for partial
derivatives, where the sum runs over all units k to which unit j sends connections.
δ_j ≡ ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j)
Illustration of the calculation of ẟj for hidden unit j by backpropagation of the ẟ's from those
units k to which unit j sends connections.
The blue arrow denotes the direction of information flow during forward propagation, and
the red arrows indicate the backward propagation of error information.
We obtain the following backpropagation formula which tells us that the value of ẟ for a
particular hidden unit can be obtained by propagating the ẟ's backwards from units higher
up in the network
δ_k ≡ ∂E_n/∂a_k
δ_j = ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j)
Since
a_k = Σ_i w_{ki} z_i = Σ_i w_{ki} h(a_i)
the derivative with respect to a_j picks out the single term containing a_j:
∂a_k/∂a_j = w_{kj} h′(a_j)
δ_j = h′(a_j) Σ_k w_{kj} δ_k
1. Apply an input vector x_n to the network and forward propagate through the network using a_j = Σ_i w_{ji} z_i and z_j = h(a_j)
2. Evaluate δ_k = y_k − t_k for all the output units
3. Backpropagate the δ's using δ_j = h′(a_j) Σ_k w_{kj} δ_k to obtain δ_j for each hidden unit
4. Evaluate the required derivatives using ∂E_n/∂w_{ji} = δ_j z_i
Example
Let's consider a two-layer network, together with a sum-of-squares error, in which the output units have linear activation functions, so that y_k = a_k, while the hidden units have sigmoidal activation functions given by
h(a) ≡ tanh(a),  where  tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})
A useful feature of this function is that its derivative can be expressed in a particularly simple form:
h′(a) = 1 − h(a)²
We also consider a standard sum-of-squares error function, so that for pattern n the error is
given by
E_n = (1/2) Σ_{k=1}^{K} (y_k − t_k)²
where yk is the activation of output unit k, and tk is the corresponding target, for a particular
input pattern xn
For each pattern in the training set in turn, we first perform a forward propagation using
a_j = Σ_{i=0}^{D} w_{ji}^{(1)} x_i
z_j = tanh(a_j)
y_k = Σ_{j=0}^{M} w_{kj}^{(2)} z_j
Next we compute the δ's for each output unit using
δ_k = y_k − t_k
Then we backpropagate these to obtain ẟ's for the hidden units using
δ_j = (1 − z_j²) Σ_{k=1}^{K} w_{kj} δ_k
Finally, the derivatives with respect to the first-layer and second-layer weights are given by
∂E_n/∂w_{ji}^{(1)} = δ_j x_i
∂E_n/∂w_{kj}^{(2)} = δ_k z_j
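A from-scratch sketch of this exact example (tanh hidden units, linear outputs, sum-of-squares error) in NumPy; the data, dimensions, and learning rate are made up for illustration, and bias units are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(9)
D, M, K, N = 3, 8, 2, 100
X = rng.standard_normal((N, D))
T = rng.standard_normal((N, K))            # dummy targets for the sketch

W1 = 0.1 * rng.standard_normal((M, D))     # first-layer weights w_ji^(1)
W2 = 0.1 * rng.standard_normal((K, M))     # second-layer weights w_kj^(2)
eta = 0.01                                 # learning rate

for epoch in range(200):
    for x, t in zip(X, T):                 # one pattern at a time (on-line updates)
        # Forward propagation
        a = W1 @ x                         # a_j = sum_i w_ji^(1) x_i
        z = np.tanh(a)                     # z_j = tanh(a_j)
        y = W2 @ z                         # y_k = sum_j w_kj^(2) z_j (linear outputs)
        # Backward propagation of errors
        delta_k = y - t                    # output-unit deltas
        delta_j = (1 - z**2) * (W2.T @ delta_k)  # hidden deltas via the backprop formula
        # Gradient-descent updates: dEn/dW2 = delta_k z_j, dEn/dW1 = delta_j x_i
        W2 -= eta * np.outer(delta_k, z)
        W1 -= eta * np.outer(delta_j, x)
```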