Day 2
Morten Hjorth-Jensen 1,2

1 Department of Physics and Center for Computing in Science Education, University of Oslo, Norway
2 Department of Physics and Astronomy and Facility for Rare Ion Beams and National Superconducting Cyclotron Laboratory, Michigan State University, USA
Logistic Regression
In linear regression our main interest was centered on learning the coefficients of
a functional fit (say a polynomial) in order to be able to predict the response of a
continuous variable on some unseen data. The fit to the continuous variable yi is
based on some independent variables xi . Linear regression resulted in analytical
expressions for standard ordinary Least Squares or Ridge regression (in terms of
matrices to invert) for several quantities, ranging from the variance and thereby
the confidence intervals of the parameters β to the mean squared error. If we can
invert the product of the design matrix with itself (that is, X^T X), linear regression then gives
us a simple recipe for fitting our data.
Basics
We consider the case where the dependent variables, also called the responses or
the outcomes, yi are discrete and only take values from k = 0, . . . , K − 1 (i.e. K
classes).
The goal is to predict the output classes from the design matrix X ∈ Rn×p
made of n samples, each of which carries p features or predictors. The primary
goal is to identify the classes to which new unseen samples belong.
Let us specialize to the case of two classes only, with outputs yi = 0 and
yi = 1. Our outcomes could represent the status of a credit card user that could
default or not on her/his credit card debt. That is
y_i = 0 (no) or y_i = 1 (yes).
Linear classifier
Before moving to the logistic model, let us try to use our linear regression model
to classify these two outcomes. We could for example fit a linear model and assign
the default class when the predicted value satisfies ŷ_i > 0.5 and the no-default class when ŷ_i ≤ 0.5.
We would then have our weighted linear combination, namely

y = Xβ + ε,    (1)
where y is a vector representing the possible outcomes, X is our n × p design
matrix and β represents our estimators/predictors.
One simple way to get a discrete output is to have sign functions that map
the output of a linear regressor to values {0, 1}, f(s_i) = sign(s_i) = 1 if s_i ≥ 0
and 0 otherwise. We will encounter this model in our first demonstration of
neural networks. Historically it is called the “perceptron" model in the machine
learning literature. This model is extremely simple. However, in many cases it
is more favorable to use a “soft" classifier that outputs the probability of a given
category. This leads us to the logistic function.
import numpy
import matplotlib.pyplot as plt
import math as mt
z = numpy.arange(-5, 5, .1)
sigma_fn = numpy.vectorize(lambda z: 1/(1+numpy.exp(-z)))
sigma = sigma_fn(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, sigma)
ax.set_ylim([-0.1, 1.1])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('sigmoid function')
plt.show()
"""Step Function"""
z = numpy.arange(-5, 5, .02)
step_fn = numpy.vectorize(lambda z: 1.0 if z >= 0.0 else 0.0)
step = step_fn(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, step)
ax.set_ylim([-0.5, 1.5])
ax.set_xlim([-5,5])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('step function')
plt.show()
"""tanh Function"""
z = numpy.arange(-2*mt.pi, 2*mt.pi, 0.1)
t = numpy.tanh(z)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(z, t)
ax.set_ylim([-1.0, 1.0])
ax.set_xlim([-2*mt.pi,2*mt.pi])
ax.grid(True)
ax.set_xlabel('z')
ax.set_title('tanh function')
plt.show()
Two parameters
We assume now that we have two classes with yi either 0 or 1. Furthermore we
assume also that we have only two parameters β in our fitting of the Sigmoid
function, that is we define the probabilities

p(y_i = 1|x_i, β) = exp(β_0 + β_1 x_i) / [1 + exp(β_0 + β_1 x_i)],
p(y_i = 0|x_i, β) = 1 − p(y_i = 1|x_i, β),

where β are the weights we wish to extract from data, in our case β_0 and β_1.
Note that we used the Sigmoid (logistic) function above to define these probabilities.
Maximum likelihood
In order to define the total likelihood for all possible outcomes from a dataset
D = {(y_i, x_i)}, with the binary labels y_i ∈ {0, 1} and where the data points
are drawn independently, we use the so-called Maximum Likelihood Estimation
(MLE) principle. We aim thus at maximizing the probability of seeing the
observed data. We can then approximate the likelihood in terms of the product
of the individual probabilities of a specific outcome y_i, that is

P(D|β) = ∏_{i=1}^{n} [p(y_i = 1|x_i, β)]^{y_i} [1 − p(y_i = 1|x_i, β)]^{1−y_i}.

The negative logarithm of this expression gives our cost function, known in statistics as the cross entropy. Finally, we note that
just as in linear regression, in practice we often supplement the cross-entropy
with additional regularization terms, usually L1 and L2 regularization as we did
for Ridge and Lasso regression.
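As a small illustration (a sketch with our own synthetic data and helper names, not code from these notes), the cross-entropy cost for a given β can be evaluated directly from the definitions above:

import numpy as np

def sigmoid(t):
    return 1.0/(1.0 + np.exp(-t))

def cross_entropy(beta, X, y):
    # p_i = p(y_i = 1 | x_i, beta) for a design matrix X whose first column is ones
    p = sigmoid(X @ beta)
    # negative log-likelihood, the cross entropy
    return -np.sum(y*np.log(p) + (1 - y)*np.log(1 - p))

# simple synthetic example with one feature plus intercept
n = 100
x = np.random.randn(n)
X = np.c_[np.ones(n), x]
y = (x + 0.5*np.random.randn(n) > 0).astype(float)
print(cross_entropy(np.array([0.0, 1.0]), X, y))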
A more compact expression
Let us now define a vector y with n elements yi , an n × p matrix X which
contains the xi values and a vector p of fitted probabilities p(yi |xi , β). We can
rewrite in a more compact form the first derivative of the cost function as

∂C(β)/∂β = −X^T (y − p).

If we in addition define a diagonal matrix W with elements p(y_i|x_i, β)(1 − p(y_i|x_i, β)), we can obtain a compact expression of the second derivative as

∂²C(β)/∂β∂β^T = X^T W X.
Extending to p predictors, the log-odds generalize to

log[ p(βx)/(1 − p(βx)) ] = β_0 + β_1 x_1 + β_2 x_2 + · · · + β_p x_p,

that is

p(βx) = exp(β_0 + β_1 x_1 + β_2 x_2 + · · · + β_p x_p) / [1 + exp(β_0 + β_1 x_1 + β_2 x_2 + · · · + β_p x_p)].

For more than two classes, with class K as the reference class, we define

log[ p(C = 1|x)/p(K|x) ] = β_10 + β_11 x_1,

log[ p(C = 2|x)/p(K|x) ] = β_20 + β_21 x_1,

and so on till the class C = K − 1 class

log[ p(C = K − 1|x)/p(K|x) ] = β_(K−1)0 + β_(K−1)1 x_1.
More classes
In our discussion of neural networks we will encounter the above again in terms
of a slightly modified function, the so-called Softmax function.
The softmax function is used in various multiclass classification methods, such
as multinomial logistic regression (also known as softmax regression), multiclass
linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.
Specifically, in multinomial logistic regression and linear discriminant analysis,
the input to the function is the result of K distinct linear functions, and the
predicted probability for the k-th class given a sample vector x and a weighting
vector β is (with two predictors)

p(C = k|x) = exp(β_k0 + β_k1 x_1) / [1 + Σ_{l=1}^{K−1} exp(β_l0 + β_l1 x_1)].
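As a small sketch (not taken from these notes), a numerically stable softmax can be written as follows; the input array z stands for the K linear combinations β_k0 + β_k1 x_1 of one sample:

import numpy as np

def softmax(z):
    # subtracting the maximum keeps the exponentials numerically stable
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([0.5, 2.0, -1.0])
print(softmax(z), softmax(z).sum())  # probabilities summing to one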
More preprocessing
The Normalizer scales each data point such that the feature vector has a
Euclidean length of one. In other words, it projects a data point onto the circle
(or sphere in the case of higher dimensions) with a radius of 1. This means
every data point is scaled by a different number (by the inverse of its length).
This normalization is often used when only the direction (or angle) of the data
matters, not the length of the feature vector.
The RobustScaler works similarly to the StandardScaler in that it ensures
statistical properties for each feature that guarantee that they are on the same
scale. However, the RobustScaler uses the median and quartiles, instead of mean
and variance. This makes the RobustScaler ignore data points that are very
different from the rest (like measurement errors). These odd data points are also
called outliers, and might often lead to trouble for other scaling techniques.
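A minimal sketch (our own toy data, not from these notes) of how these scikit-learn scalers are used:

import numpy as np
from sklearn.preprocessing import Normalizer, RobustScaler, StandardScaler

X = np.random.randn(5, 3) + 10.0  # small synthetic data set

print(Normalizer().fit_transform(X))      # each row scaled to unit Euclidean length
print(RobustScaler().fit_transform(X))    # centers with the median, scales with the quartiles
print(StandardScaler().fit_transform(X))  # subtracts the mean, divides by the standard deviation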
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
fig, axes = plt.subplots(15, 2, figsize=(10, 20))
malignant = cancer.data[cancer.target == 0]
benign = cancer.data[cancer.target == 1]
ax = axes.ravel()
for i in range(30):
    _, bins = np.histogram(cancer.data[:,i], bins = 50)
    ax[i].hist(malignant[:,i], bins = bins, alpha = 0.5)
    ax[i].hist(benign[:,i], bins = bins, alpha = 0.5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["Malignant", "Benign"], loc ="best")
fig.tight_layout()
plt.show()
In the above example we note two things. In the first plot we display the
overlap of benign and malignant tumors as functions of the various features in
the Wisconsin breast cancer data set. We see that for some of the features we
can distinguish clearly the benign and malignant cases, while for other features
we cannot. This can indicate which features may be of greater interest when
we wish to classify a tumor as benign or malignant.
In the second figure we have computed the so-called correlation matrix, which
in our case with thirty features becomes a 30 × 30 matrix.
We constructed this matrix using pandas via the statements
cancerpd = pd.DataFrame(cancer.data, columns=cancer.feature_names)
and then
correlation_matrix = cancerpd.corr().round(1)
Diagonalizing this matrix we can in turn say something about which features
are of relevance and which are not. This leads us to the classical Principal
Component Analysis (PCA) theorem with applications. This topic is covered in
the PCA material and additional topics on dimensionality reduction.
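A minimal sketch (not code from these notes, assuming the cancerpd DataFrame and correlation_matrix defined above) of how one could visualize this matrix and inspect its eigenvalues:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 10))
im = ax.matshow(correlation_matrix.to_numpy(), cmap='coolwarm')  # 30 x 30 heatmap
fig.colorbar(im)
plt.show()

# The eigenvalues of the symmetric correlation matrix indicate how many directions
# carry most of the variation, which is the idea behind PCA
print(np.sort(np.linalg.eigvalsh(correlation_matrix.to_numpy()))[::-1])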
Ideally we would be able to solve for β analytically; however, this is
not possible in general and we must use some approximate numerical method
to compute the minimum.

As a reminder, with two parameters the probabilities read

p(y_i = 1|x_i, β) = exp(β_0 + β_1 x_i) / [1 + exp(β_0 + β_1 x_i)],
p(y_i = 0|x_i, β) = 1 − p(y_i = 1|x_i, β),

where β are the weights we wish to extract from data, in our case β_0 and β_1.
∂C(β)/∂β = −X^T (y − p).

If we in addition define a diagonal matrix W with elements p(y_i|x_i, β)(1 − p(y_i|x_i, β)), we can obtain a compact expression of the second derivative as

∂²C(β)/∂β∂β^T = X^T W X.

This defines what is called the Hessian matrix. The Newton-Raphson update can then be written in matrix form as

β^new = β^old − (X^T W X)^{-1} (−X^T (y − p)).
The right-hand side is computed with the old values of β.
If we can compute these matrices, in particular the Hessian, the above is
often the easiest method to implement.
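A minimal sketch (with our own synthetic data for the two-parameter model above, not code from these notes) of this Newton-Raphson iteration for logistic regression:

import numpy as np

def sigmoid(t):
    return 1.0/(1.0 + np.exp(-t))

# synthetic data drawn from p(y=1|x) = sigmoid(-0.5 + 1.2*x)
n = 200
x = np.random.randn(n)
X = np.c_[np.ones(n), x]
y = (np.random.rand(n) < sigmoid(-0.5 + 1.2*x)).astype(float)

beta = np.zeros(2)
for iteration in range(20):
    p = sigmoid(X @ beta)
    gradient = -X.T @ (y - p)        # first derivative of the cost
    W = np.diag(p*(1.0 - p))         # diagonal weight matrix
    hessian = X.T @ W @ X            # second derivative of the cost
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step
    if np.linalg.norm(step) < 1e-10:
        break
print(beta)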
The equations
The Newton-Raphson formula consists geometrically of extending the tangent
line at a current point until it crosses zero, then setting the next guess to the
abscissa of that zero-crossing. The mathematics behind this method is rather
simple. Employing a Taylor expansion for x sufficiently close to the solution s,
we have

f(s) = 0 = f(x) + (s − x) f'(x) + (s − x)²/2 f''(x) + . . . .

For small enough values of the function and for well-behaved functions, the
terms beyond linear are unimportant, hence we obtain

f(x) + (s − x) f'(x) ≈ 0,

yielding s ≈ x − f(x)/f'(x) and thereby the iterative scheme x_{n+1} = x_n − f(x_n)/f'(x_n).
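A minimal sketch (not code from these notes) of this one-dimensional Newton-Raphson iteration:

import numpy as np

def newton_raphson(f, df, x, tol=1e-12, maxiter=100):
    # iterate x_{n+1} = x_n - f(x_n)/f'(x_n) until |f(x)| is small
    for _ in range(maxiter):
        x = x - f(x)/df(x)
        if abs(f(x)) < tol:
            break
    return x

# example: the positive root of x**2 - 2
print(newton_raphson(lambda x: x**2 - 2.0, lambda x: 2.0*x, x=1.0))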
Extending to more than one variable
Newton’s method can be generalized to systems of several non-linear equations
and variables. Consider the case with two equations
f_1(x_1, x_2) = 0,
f_2(x_1, x_2) = 0.
Steepest descent
The basic idea of gradient descent is that a function F (x), x ≡ (x1 , · · · , xn ),
decreases fastest if one goes from x in the direction of the negative gradient
−∇F (x).
It can be shown that if

x_{k+1} = x_k − γ_k ∇F(x_k),

with γ_k > 0 small enough, then F(x_{k+1}) ≤ F(x_k). This means that for a sufficiently
small γ_k we are always moving towards smaller function values, i.e. a minimum.
More on Steepest descent
The previous observation is the basis of the method of steepest descent, which is
also referred to as just gradient descent (GD). One starts with an initial guess
x0 for a minimum of F and computes new approximations according to
xk+1 = xk − γk ∇F (xk ), k ≥ 0.
The parameter γk is often referred to as the step length or the learning rate
within the context of Machine Learning.
The ideal
Ideally the sequence {x_k}_{k≥0} converges to a global minimum of the function F.
In general we do not know if we are in a global or local minimum. In the special
case when F is a convex function, all local minima are also global minima, so in
this case gradient descent can converge to the global solution. The advantage of
this scheme is that it is conceptually simple and straightforward to implement.
However the method in this form has some severe limitations:
In machine learning we are often faced with non-convex high-dimensional
cost functions with many local minima. Since GD is deterministic, we will get
stuck in a local minimum, if the method converges, unless we have a very good
initial guess. This also implies that the scheme is sensitive to the chosen initial
condition.
Note that the gradient is a function of x = (x1 , · · · , xn ) which makes it
expensive to compute numerically.
Convex functions
Ideally we want our cost/loss function to be convex (or concave).
First we give the definition of a convex set: A set C in Rn is said to be convex
if, for all x and y in C and all t ∈ (0, 1) , the point (1 − t)x + ty also belongs to
C. Geometrically this means that every point on the line segment connecting x
and y is in C as discussed below.
The convex subsets of R are the intervals of R. Examples of convex sets of
R2 are the regular polygons (triangles, rectangles, pentagons, etc...).
Convex function
Convex function: Let X ⊂ R^n be a convex set. Assume that the function
f : X → R is continuous; then f is said to be convex if

f((1 − t)x_1 + t x_2) ≤ (1 − t) f(x_1) + t f(x_2)

for all x_1, x_2 ∈ X and for all t ∈ [0, 1]. If ≤ is replaced with a strict inequality
in the definition, and we demand x_1 ≠ x_2 and t ∈ (0, 1), then f is said to be strictly
convex. For a single variable function, convexity means that if you draw a
straight line connecting f(x_1) and f(x_2), the value of the function on the interval
[x_1, x_2] is always below the line as illustrated below.
Second order condition. Assume that f is twice differentiable, i.e the Hessian
matrix exists at each point in Df . Then f is convex if and only if Df is a convex
set and its Hessian is positive semi-definite for all x ∈ Df . For a single-variable
function this reduces to f 00 (x) ≥ 0. Geometrically this means that f has
nonnegative curvature everywhere.
This condition is particularly useful since it gives us a procedure for de-
termining if the function under consideration is convex, apart from using the
definition.
Ideally we want the global minimum (for high-dimensional models it is hard
to know if we have a local or global minimum). However, if the cost/loss function
is convex the following result provides invaluable information: any local minimum
of a convex function is also a global minimum.
2. Using the second order condition show that the following functions are
convex on the specified domain.
• f (x) = ex is convex for x ∈ R.
• g(x) = − ln(x) is convex for x ∈ (0, ∞).
3. Let f(x) = x² and g(x) = e^x. Show that f(g(x)) and g(f(x)) are convex for
x ∈ R. Also show that if f(x) is any convex function, then h(x) = e^{f(x)} is
convex.
4. A norm is any function that satisfies the following properties:
• f(αx) = |α| f(x) for all α ∈ R.
• f(x + y) ≤ f(x) + f(y)
• f(x) ≥ 0 for all x ∈ R^n, with equality if and only if x = 0
Using the definition of convexity, try to show that a function satisfying the
properties above is convex (the third condition is not needed to show this).
We revisit the example from homework set 1 where we had

h_β(x) = y = β_0 + β_1 x,

such that

ŷ_i = β_0 + β_1 x_i,

with cost function C(β) = (1/n) Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i)². The second derivative (Hessian) of this cost function with respect to β is (2/n) X^T X.
This result implies that C(β) is a convex function since the matrix X^T X always
is positive semi-definite.
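As a small numerical check (a sketch, not part of the original notes), one can verify that this Hessian is positive semi-definite by inspecting its eigenvalues:

import numpy as np

n = 100
x = 2*np.random.rand(n, 1)
X = np.c_[np.ones((n, 1)), x]          # design matrix with intercept column
H = (2.0/n) * X.T @ X                  # Hessian of the mean squared error cost
print(np.linalg.eigvalsh(H))           # all eigenvalues are non-negative, hence C(beta) is convex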
Simple program
We can now write a program that minimizes C(β) using the gradient descent
method with a constant learning rate γ according to
We can use the expression we computed for the gradient, start from a randomly
chosen β_0, and let γ = 0.001. We stop iterating when ||∇_β C(β_k)|| ≤ ε = 10^{-8}.
Finally we can compare our solution for β with the analytic result given
by β = (X^T X)^{-1} X^T y.
import numpy as np
import matplotlib.pyplot as plt
n = 100
x = 2*np.random.rand(n,1)
y = 4 + 3*x + np.random.randn(n,1)
xb = np.c_[np.ones((n,1)), x]
beta_linreg = np.linalg.inv(xb.T @ xb) @ (xb.T @ y)
beta = np.random.randn(2,1)
gamma = 0.001
for iteration in range(100000):
    gradient = 2.0*xb.T @ (xb @ beta - y)
    if np.linalg.norm(gradient) <= 1e-8: break
    beta -= gamma*gradient
print(beta)
xnew = np.array([[0],[2]])
xbnew = np.c_[np.ones((2,1)), xnew]
ypredict = xbnew.dot(beta)
ypredict2 = xbnew.dot(beta_linreg)
plt.plot(xnew, ypredict, "r-")
plt.plot(xnew, ypredict2, "b-")
plt.plot(x, y ,'ro')
plt.axis([0,2.0,0, 15.0])
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Gradient descent example')
plt.show()
And a corresponding example using scikit-learn
# Importing various packages
from random import random, seed
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
x = 2*np.random.rand(100,1)
y = 4+3*x+np.random.randn(100,1)
xb = np.c_[np.ones((100,1)), x]
beta_linreg = np.linalg.inv(xb.T.dot(xb)).dot(xb.T).dot(y)
print(beta_linreg)
sgdreg = SGDRegressor(max_iter = 50, penalty=None, eta0=0.1)
sgdreg.fit(x,y.ravel())
print(sgdreg.intercept_, sgdreg.coef_)
In order to minimize Cridge (β) using GD we only have adjust the gradient as
follows
∇_β C_ridge(β) = 2 ( Σ_{i=1}^{100} (β_0 + β_1 x_i − y_i), Σ_{i=1}^{100} x_i(β_0 + β_1 x_i − y_i) )^T + 2λ (β_0, β_1)^T = 2 (X^T (Xβ − y) + λβ).

We can easily extend our program to minimize C_ridge(β) using gradient
descent and compare with the analytical solution given by

β_ridge = (X^T X + λ I_{2×2})^{-1} X^T y.
import numpy as np
import matplotlib.pyplot as plt
n = 100
x = 2*np.random.rand(n,1)
y = 4 + 3*x + np.random.randn(n,1)
xb = np.c_[np.ones((n,1)), x]
XT_X = xb.T @ xb
#Ridge parameter lambda
lmbda = 0.001
Id = lmbda* np.eye(XT_X.shape[0])
# analytical Ridge solution
beta_linreg = np.linalg.inv(XT_X + Id) @ (xb.T @ y)
beta = np.random.randn(2,1)
# small learning rate since the gradient 2(X^T(X beta - y) + lambda beta) is not scaled by 1/n
eta = 0.001
Niterations = 10000
for iteration in range(Niterations):
    gradients = 2.0*(xb.T @ (xb @ beta - y) + lmbda*beta)
    beta -= eta*gradients
print(beta)
ypredict = xb @ beta
ypredict2 = xb @ beta_linreg
plt.plot(x, ypredict, "r-")
plt.plot(x, ypredict2, "b-")
plt.plot(x, y ,'ro')
plt.axis([0,2.0,0, 15.0])
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Gradient descent example for Ridge')
plt.show()
• GD is very sensitive to choices of learning rates. If the learning rate is very small,
the training process takes an extremely long time. For larger learning rates,
GD can diverge and give poor results. Furthermore, depending on what
the local landscape looks like, we have to modify the learning rates to
ensure convergence. Ideally, we would adaptively choose the learning rates
to match the landscape.
• GD treats all directions in parameter space uniformly. Another
major drawback of GD is that unlike Newton’s method, the learning rate
for GD is the same in all directions in parameter space. For this reason,
the maximum learning rate is set by the behavior of the steepest direction
and this can significantly slow down training. Ideally, we would like to
take large steps in flat directions and small steps in steep directions. Since
we are exploring rugged landscapes where curvatures change, this requires
us to keep track of not only the gradient but second derivatives. The
ideal scenario would be to calculate the Hessian but this proves to be too
computationally expensive.
• GD can take exponential time to escape saddle points, even with random
initialization. As we mentioned, GD is extremely sensitive to initial
condition since it determines the particular local minimum GD would
eventually reach. However, even with a good initialization scheme, through
the introduction of randomness, GD can still take exponential time to
escape saddle points.
Computation of gradients
In stochastic gradient descent the cost function is typically a sum over the n data points,
C(β) = Σ_{i=1}^{n} c_i(x_i, β). This in turn means that the gradient can be computed as a sum over i-gradients,

∇_β C(β) = Σ_{i=1}^{n} ∇_β c_i(x_i, β).
SGD example
As an example, suppose we have 10 data points (x_1, · · · , x_10) and we choose
to have M = 5 minibatches; then each minibatch contains two data points. In
particular we have B_1 = (x_1, x_2), · · · , B_5 = (x_9, x_10). Note that if you choose
M = 1 you have only a single batch with all data points, and on the other
extreme, you may choose M = n, resulting in a minibatch for each data point, i.e.
B_k = x_k.

The idea is now to approximate the gradient by replacing the sum over all
data points with a sum over the data points in one of the minibatches, picked at
random in each gradient descent step:

∇_β C(β) = Σ_{i=1}^{n} ∇_β c_i(x_i, β) → Σ_{i∈B_k} ∇_β c_i(x_i, β).
n_epochs = 10   # number of epochs
m = 5           # number of minibatches
j = 0
for epoch in range(1,n_epochs+1):
    for i in range(m):
        k = np.random.randint(m) #Pick the k-th minibatch at random
        #Compute the gradient using the data in minibatch Bk
        #Compute new suggestion for beta
        j += 1
Taking the gradient only on a subset of the data has two important benefits.
First, it introduces randomness which decreases the chance that our optimization
scheme gets stuck in a local minimum. Second, if the size of the minibatches
is small relative to the number of data points (M < n), the computation of
the gradient is much cheaper since we sum over the data points in the k-th
minibatch and not all n data points.
When do we stop?
A natural question is when do we stop the search for a new minimum? One
possibility is to compute the full gradient after a given number of epochs and
check if the norm of the gradient is smaller than some threshold and stop if true.
However, the condition that the gradient is zero is valid also for local minima, so
this would only tell us that we are close to a local/global minimum. However, we
could also evaluate the cost function at this point, store the result and continue
the search. If the test kicks in at a later stage we can compare the values of the
cost function and keep the β that gave the lowest value.
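A minimal sketch (our own toy data and parameter choices, not code from these notes) of this bookkeeping, where we monitor the full gradient after every epoch and keep the best β seen so far:

import numpy as np

n = 100
x = 2*np.random.rand(n, 1)
y = 4 + 3*x + np.random.randn(n, 1)
X = np.c_[np.ones((n, 1)), x]

def cost(beta):
    return np.mean((X @ beta - y)**2)

def full_gradient(beta):
    return 2.0/n * X.T @ (X @ beta - y)

best_beta, best_cost = None, np.inf
beta = np.random.randn(2, 1)
n_epochs, m, eta = 50, 20, 0.01
for epoch in range(n_epochs):
    for i in range(m):
        idx = np.random.randint(0, n, size=n//m)   # random minibatch of size n/m
        xi, yi = X[idx], y[idx]
        beta -= eta * 2.0/len(idx) * xi.T @ (xi @ beta - yi)
    # after each epoch: store the beta with the lowest cost and test the stopping criterion
    if cost(beta) < best_cost:
        best_cost, best_beta = cost(beta), beta.copy()
    if np.linalg.norm(full_gradient(beta)) < 1e-3:
        break
print(best_cost, best_beta.T)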
In practice one also often lets the learning rate decay with the iteration number, for example as γ_j = t_0/(t + t_1):

def step_length(t, t0, t1):
    return t0/(t+t1)

n_epochs = 500  # number of epochs
t0, t1 = 1.0, 10
m = 5           # number of minibatches
gamma_j = t0/t1
j = 0
for epoch in range(1,n_epochs+1):
    for i in range(m):
        k = np.random.randint(m) #Pick the k-th minibatch at random
        #Compute the gradient using the data in minibatch Bk
        #Compute new suggestion for beta
        t = epoch*m+i
        gamma_j = step_length(t,t0,t1)
        j += 1
Program for stochastic gradient
# Importing various packages
from math import exp, sqrt
from random import random, seed
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
m = 100
x = 2*np.random.rand(m,1)
y = 4+3*x+np.random.randn(m,1)
xb = np.c_[np.ones((m,1)), x]
theta_linreg = np.linalg.inv(xb.T.dot(xb)).dot(xb.T).dot(y)
print("Own inversion")
print(theta_linreg)
sgdreg = SGDRegressor(max_iter = 50, penalty=None, eta0=0.1)
sgdreg.fit(x,y.ravel())
print("sgdreg from scikit")
print(sgdreg.intercept_, sgdreg.coef_)
theta = np.random.randn(2,1)
eta = 0.1
Niterations = 1000
# plain gradient descent, gradient scaled by 1/m
for iteration in range(Niterations):
    gradients = 2.0/m*xb.T @ (xb @ theta - y)
    theta -= eta*gradients
print("theta from own gradient descent")
print(theta)
xnew = np.array([[0],[2]])
xbnew = np.c_[np.ones((2,1)), xnew]
ypredict = xbnew.dot(theta)
ypredict2 = xbnew.dot(theta_linreg)
n_epochs = 50
t0, t1 = 5, 50
def learning_schedule(t):
    return t0/(t+t1)
theta = np.random.randn(2,1)
# stochastic gradient descent over minibatches of size 1
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = xb[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T @ ((xi @ theta) - yi)
        eta = learning_schedule(epoch*m+i)
        theta = theta - eta*gradients
print("theta from own sgd")
print(theta)
ypredict = xbnew.dot(theta)
ypredict2 = xbnew.dot(theta_linreg)
plt.plot(xnew, ypredict, "r-")
plt.plot(xnew, ypredict2, "b-")
plt.plot(x, y ,'ro')
plt.axis([0,2.0,0, 15.0])
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Random numbers ')
plt.show()
Momentum based GD
The stochastic gradient descent (SGD) is almost always used with a momentum
or inertia term that serves as a memory of the direction we are moving in
parameter space. This is typically implemented as follows
v_t = γ v_{t−1} + η_t ∇_θ E(θ_t),
θ_{t+1} = θ_t − v_t,    (2)

where η_t is the learning rate and 0 ≤ γ < 1 is the momentum parameter. To gain
some intuition, consider a particle of mass m moving in a viscous medium with drag
coefficient μ in a potential E(w). Its equation of motion reads

m d²w/dt² + μ dw/dt = −∇_w E(w).

We can discretize this equation in the usual way to get

m (w_{t+Δt} − 2w_t + w_{t−Δt})/(Δt)² + μ (w_{t+Δt} − w_t)/Δt = −∇_w E(w).

Rearranging this equation, we can rewrite it as

Δw_{t+Δt} = − (Δt)²/(m + μΔt) ∇_w E(w) + m/(m + μΔt) Δw_t.
Momentum parameter
Notice that this equation is identical to previous one if we identify the position of
the particle, w, with the parameters θ. This allows us to identify the momentum
parameter and learning rate with the mass of the particle and the viscous drag
as:
γ = m/(m + μΔt),    η = (Δt)²/(m + μΔt).
Thus, as the name suggests, the momentum parameter is proportional to
the mass of the particle and effectively provides inertia. Furthermore, in the
large viscosity/small learning rate limit, our memory time scales as (1 − γ)−1 ≈
m/(µ∆t).
Why is momentum useful? SGD momentum helps the gradient descent
algorithm gain speed in directions with persistent but small gradients even in
the presence of stochasticity, while suppressing oscillations in high-curvature
directions. This becomes especially important in situations where the landscape
is shallow and flat in some directions and narrow and steep in others. It has
been argued that first-order methods (with appropriate initial conditions) can
perform comparable to more expensive second order methods, especially in the
context of complex deep learning models.
These beneficial properties of momentum can sometimes become even more
pronounced by using a slight modification of the classical momentum algorithm
called Nesterov Accelerated Gradient (NAG).
In the NAG algorithm, rather than calculating the gradient at the current
parameters, ∇θ E(θt ), one calculates the gradient at the expected value of the
parameters given our current momentum, ∇_θ E(θ_t + γ v_{t−1}). This yields the
NAG update rule

v_t = γ v_{t−1} + η_t ∇_θ E(θ_t + γ v_{t−1}),
θ_{t+1} = θ_t − v_t.

One of the major advantages of NAG is that it allows for the use of a larger
learning rate than GDM for the same choice of γ.
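A minimal sketch of the momentum update of Eq. (2) on a toy quadratic cost (the cost, η and γ are our own illustrative choices, not taken from these notes):

import numpy as np

def gradE(theta):
    # gradient of a quadratic toy cost E(theta) = 0.5*theta^T A theta
    A = np.array([[3.0, 0.0], [0.0, 50.0]])
    return A @ theta

eta, gamma = 0.01, 0.9
theta = np.array([1.0, 1.0])
v = np.zeros(2)
for t in range(300):
    v = gamma*v + eta*gradE(theta)   # the momentum term remembers previous directions
    theta = theta - v
print(theta)   # approaches the minimum at (0, 0)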
A natural idea, borrowed from Newton's method, is to adjust the learning rate by the curvature. However, this is very computationally expensive
for extremely large models. Ideally, we would like to be able to adaptively change
the step size to match the landscape without paying the steep computational
price of calculating or approximating Hessians.
Recently, a number of methods have been introduced that accomplish this
by tracking not only the gradient, but also the second moment of the gradient.
These methods include AdaGrad, AdaDelta, RMS-Prop, and ADAM.
RMS prop
In RMS prop, in addition to keeping a running average of the first moment of
the gradient, we also keep track of the second moment denoted by st = E[gt2 ].
The update rule for RMS prop is given by

g_t = ∇_θ E(θ),    (4)
s_t = β s_{t−1} + (1 − β) g_t²,
θ_{t+1} = θ_t − η_t g_t / √(s_t + ε),

where β controls the averaging time of the second moment and is typically
taken to be about β = 0.9, η_t is a learning rate typically chosen to be 10^{−3}, and
ε ∼ 10^{−8} is a small regularization constant to prevent divergences. Multiplication
and division by vectors is understood as an element-wise operation. It is clear
from this formula that the learning rate is reduced in directions where the norm
of the gradient is consistently large. This greatly speeds up the convergence by
allowing us to use a larger learning rate for flat directions.
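A minimal sketch of this update rule (the toy gradient and default parameters below are our own choices, not from these notes):

import numpy as np

def rmsprop_update(theta, s, grad, eta=1e-3, beta=0.9, eps=1e-8):
    # running average of the squared gradient (second moment)
    s = beta*s + (1 - beta)*grad**2
    # element-wise rescaling of the step by the root of the second moment
    theta = theta - eta*grad/np.sqrt(s + eps)
    return theta, s

theta = np.array([1.0, 1.0])
s = np.zeros(2)
for t in range(2000):
    grad = np.array([3.0, 50.0])*theta     # gradient of a toy quadratic cost
    theta, s = rmsprop_update(theta, s, grad)
print(theta)   # ends up close to the minimum at (0, 0), within roughly eta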
ADAM optimizer
A related algorithm is the ADAM optimizer. In ADAM, we keep a running
average of both the first and second moment of the gradient and use this
information to adaptively change the learning rate for different parameters. In
addition to keeping a running average of the first and second moments of the
gradient (i.e. mt = E[gt ] and st = E[gt2 ], respectively), ADAM performs an
additional bias correction to account for the fact that we are estimating the first
two moments of the gradient using a running average (denoted by the hats in the
update rule below). The update rule for ADAM is given by (where multiplication
and division are once again understood to be element-wise operations below)
g_t = ∇_θ E(θ),    (5)
m_t = β_1 m_{t−1} + (1 − β_1) g_t,
s_t = β_2 s_{t−1} + (1 − β_2) g_t²,
m̂_t = m_t / (1 − β_1^t),
ŝ_t = s_t / (1 − β_2^t),
θ_{t+1} = θ_t − η_t m̂_t / (√ŝ_t + ε),    (6)

where β_1 and β_2 set the memory lifetime of the first and second moment and
are typically taken to be 0.9 and 0.99 respectively, and η and ε are identical to
RMSprop.

Like in RMSprop, the effective step size of a parameter depends on the
magnitude of its gradient squared. To understand this better, let us rewrite this
expression in terms of the variance σ_t² = ŝ_t − (m̂_t)². Consider a single parameter
θ_t. The update rule for this parameter is given by

Δθ_{t+1} = −η_t m̂_t / √(σ_t² + (m̂_t)² + ε).
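A minimal sketch of the ADAM update (again with a toy quadratic cost and our own default parameters, not code from these notes):

import numpy as np

def adam_update(theta, m, s, grad, t, eta=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    # running averages of the first and second moment of the gradient
    m = beta1*m + (1 - beta1)*grad
    s = beta2*s + (1 - beta2)*grad**2
    # bias corrections for the running averages (t starts at 1)
    m_hat = m/(1 - beta1**t)
    s_hat = s/(1 - beta2**t)
    theta = theta - eta*m_hat/(np.sqrt(s_hat) + eps)
    return theta, m, s

theta = np.array([1.0, 1.0])
m = np.zeros(2); s = np.zeros(2)
for t in range(1, 3001):
    grad = np.array([3.0, 50.0])*theta     # gradient of the same toy quadratic cost
    theta, m, s = adam_update(theta, m, s, grad, t)
print(theta)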
Practical tips
• Randomize the data when making mini-batches. It is always impor-
tant to randomly shuffle the data when forming mini-batches. Otherwise,
the gradient descent method can fit spurious correlations resulting from
the order in which data is presented.
• Transform your inputs. Learning becomes difficult when our landscape
has a mixture of steep and flat directions. One simple trick for minimizing
these situations is to standardize the data by subtracting the mean and
normalizing the variance of input variables. Whenever possible, also
decorrelate the inputs. To understand why this is helpful, consider the
case of linear regression. It is easy to show that for the squared error cost
function, the Hessian of the energy matrix is just the correlation matrix
between the inputs. Thus, by standardizing the inputs, we are ensuring
that the landscape looks homogeneous in all directions in parameter space.
Since most deep networks can be viewed as linear transformations followed
by a non-linearity at each layer, we expect this intuition to hold beyond
the linear case.
• Monitor the out-of-sample performance. Always monitor the perfor-
mance of your model on a validation set (a small portion of the training
data that is held out of the training process to serve as a proxy for the test
set). If the validation error starts increasing, then the model is beginning to
overfit. Terminate the learning process. This early stopping significantly
improves performance in many settings.
• Adaptive optimization methods don’t always have good general-
ization. Recent studies have shown that adaptive methods such as ADAM,
RMSProp, and AdaGrad tend to have poor generalization compared to
SGD or SGD with momentum, particularly in the high-dimensional limit
(i.e. the number of parameters exceeds the number of data points). Al-
though it is not clear at this stage why these methods perform so well
in training deep neural networks, simpler procedures like properly-tuned
SGD may work as well or better in these applications.
Automatic differentiation
Automatic differentiation (AD), also called algorithmic differentiation or com-
putational differentiation, is a set of techniques to numerically evaluate the
derivative of a function specified by a computer program. AD exploits the fact
that every computer program, no matter how complicated, executes a sequence
of elementary arithmetic operations (addition, subtraction, multiplication, divi-
sion, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the
chain rule repeatedly to these operations, derivatives of arbitrary order can be
computed automatically, accurately to working precision, and using at most a
small constant factor more arithmetic operations than the original program.

Automatic differentiation is neither symbolic differentiation nor numerical differentiation (finite differences).
Symbolic differentiation can lead to inefficient code and faces the difficulty
of converting a computer program into a single expression, while numerical
differentiation can introduce round-off errors in the discretization process and
cancellation errors.
Python has tools for so-called automatic differentiation. Consider the
following example
f(x) = sin(2πx + x²).
import autograd.numpy as np
# To do elementwise differentiation:
from autograd import elementwise_grad as egrad
# To plot:
import matplotlib.pyplot as plt
def f(x):
    return np.sin(2*np.pi*x + x**2)

def f_grad_analytic(x):
    return np.cos(2*np.pi*x + x**2)*(2*np.pi + 2*x)
# Do the comparison:
x = np.linspace(0,1,1000)
f_grad = egrad(f)
computed = f_grad(x)
analytic = f_grad_analytic(x)
plt.plot(x, computed, label='autograd')
plt.plot(x, analytic, label='analytic')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
Using autograd
Here we experiment with what kind of functions Autograd is capable of finding
the gradient of. The following Python functions are just meant to illustrate what
Autograd can do, but please feel free to experiment with other, possibly more
complicated, functions as well.
import autograd.numpy as np
from autograd import grad
def f1(x):
    return x**3 + 1
f1_grad = grad(f1)
# remember to send in a float as argument to the computed gradient
x = 1.0
print("The gradient of f1 evaluated at x = %g is: %g" % (x, f1_grad(x)))
Autograd with more complicated functions

To differentiate with respect to two (or more) arguments of a Python function,
Autograd needs to know which variable the function is being differentiated
with respect to.

import autograd.numpy as np
from autograd import grad
def f2(x1,x2):
    return 3*x1**3 + x2*(x1 - 5) + 1
# By sending the argument 0, Autograd will compute the derivative w.r.t the first variable, in this case x1
f2_grad_x1 = grad(f2,0)
# By sending the argument 1, Autograd will compute the derivative w.r.t the second variable, x2
f2_grad_x2 = grad(f2,1)
x1 = 1.0
x2 = 3.0
print("The derivative of f2 w.r.t x1: %g"%( f2_grad_x1(x1,x2) ))
print("The analytical derivative of f2 w.r.t x1: %g"%( 9*x1**2 + x2 ))
print()
print("The derivative of f2 w.r.t x2: %g"%( f2_grad_x2(x1,x2) ))
print("The analytical derivative of f2 w.r.t x2: %g"%( x1 - 5 ))
Note that the grad function will not produce the true gradient of the function.
The true gradient of a function with two or more variables will produce a vector,
where each element is the function differentiated w.r.t a variable.
import autograd.numpy as np
from autograd import grad
def f3(x): # Assume x is an array with 5 elements
    return 2*x[0] + 3*x[1] + 5*x[2] + 7*x[3] + 11*x[4]**2
f3_grad = grad(f3)
x = np.linspace(0,4,5)
# The analytical gradient is: (2, 3, 5, 7, 22*x[4])
f3_grad_analytical = np.array([2, 3, 5, 7, 22*x[4]])
Note that in this case, when sending an array as input argument, the output
from Autograd is another array. This is the true gradient of the function, as
opposed to the single partial derivatives in the previous example. By using arrays to represent
the variables, the output from Autograd might be easier to work with, as the
output is closer to what one could expect from a gradient-evaluating function.
def f4(x):   # an illustrative stand-in composite of elementary functions
    return np.sqrt(1.0 + x**2) + np.exp(x) + np.sin(2.0*x)
f4_grad = grad(f4)
x = 2.7
More autograd
import autograd.numpy as np
from autograd import grad
def f5(x):
    if x >= 0:
        return x**2
    else:
        return -3*x + 1
f5_grad = grad(f5)
x = 2.7
print("The computed derivative of f5 at x = %g is: %g" % (x, f5_grad(x)))
def f6_for(x):
    val = 0
    for i in range(10):
        val = val + x**i
    return val

def f6_while(x):
    val = 0
    i = 0
    while i < 10:
        val = val + x**i
        i = i + 1
    return val
f6_for_grad = grad(f6_for)
f6_while_grad = grad(f6_while)
x = 0.5
import autograd.numpy as np
from autograd import grad
# Both of the functions are implementation of the sum: sum(x**i) for i = 0, ..., 9
# The analytical derivative is: sum(i*x**(i-1))
f6_grad_analytical = 0
for i in range(10):
    f6_grad_analytical += i*x**(i-1)
Using recursion
import autograd.numpy as np
from autograd import grad
def f7(n): # assume n is a positive integer
    if n == 1 or n == 0:
        return 1
    return n*f7(n-1)
f7_grad = grad(f7)
n = 2.0
print("The computed derivative of f7 at n = %d is: %g"%(n,f7_grad(n)))
f7_grad_analytical = 0
for i in range(int(n)-1):
    tmp = 1
    for k in range(int(n)-1):
        if k != i:
            tmp *= (n - k)
    f7_grad_analytical += tmp
Note that if n is equal to zero or one, Autograd will give an error message. This
message appears when the output is independent of the input.
Unsupported functions
Autograd supports many features. However, there are some functions that are not
supported (yet) by Autograd.
Assigning a value to the variable being differentiated with respect to
import autograd.numpy as np
from autograd import grad
def f8(x): # Assume x is an array
    x[2] = 3
    return x*2
f8_grad = grad(f8)
x = np.array([1.0, 5.0, 7.0, 8.0])
print("The derivative of f8 is:", f8_grad(x))
Here, Autograd tells us that an ’ArrayBox’ does not support item assignment.
The item assignment is done when the program tries to assign x[2] to the value
3. However, Autograd has implemented the computation of the derivative such
that this assignment is not possible.
import autograd.numpy as np
from autograd import grad
def f9(a): # Assume a is an array with 2 elements
    b = np.array([1.0,2.0])
    return a.dot(b)
f9_grad = grad(f9)
x = np.array([1.0,0.0])
print("The derivative of f9 is:", f9_grad(x))

Here we are told that the 'dot' function does not belong to Autograd's version
of a Numpy array. To overcome this, an alternative syntax which also computes
the dot product can be used:
import autograd.numpy as np
from autograd import grad
def f9_alternative(x): # Assume x is an array with 2 elements
    b = np.array([1.0,2.0])
    return np.dot(x,b) # The same as x_1*b_1 + x_2*b_2
f9_alternative_grad = grad(f9_alternative)
x = np.array([3.0,0.0])
print("The derivative of f9_alternative is:", f9_alternative_grad(x))
# The analytical gradient of the dot product of vectors x and b with two elements (x_1,x_2) and (b_1,b_2)
# w.r.t x is (b_1, b_2).
Recommended to avoid
The documentation recommends avoiding in-place operations such as
a += b
a -= b
a *= b
a /= b
Suppose we want to solve a linear system Ax = b iteratively. At each step we monitor

r = b − Ax,

where r is the so-called residual or error in the iterative process.
When we have found the exact solution, r = 0.
Gradient method
The residual is zero when we reach the minimum of the quadratic equation
P(x) = (1/2) x^T A x − x^T b,
with the constraint that the matrix A is positive definite and symmetric.
This defines also the Hessian and we want it to be positive definite.
Steepest descent method
One can show that the solution x is also the unique minimizer of the quadratic
form

f(x) = (1/2) x^T A x − x^T b,   x ∈ R^n.

This suggests taking the first basis vector r_1 (see below for definition) to be the
negative of the gradient of f at x = x_0, which equals

b − A x_0,
Final expressions
We can compute the residual iteratively as

r_{k+1} = b − A x_{k+1},

which equals

b − A(x_k + α_k r_k),

or

(b − A x_k) − α_k A r_k.

The optimal step length along r_k is

α_k = r_k^T r_k / (r_k^T A r_k),

leading to the iterative scheme

x_{k+1} = x_k + α_k r_k.
#include <cmath>
#include <iostream>
#include <fstream>
#include <iomanip>
#include "vectormatrixclass.h"
using namespace std;
// Main function begins here
int main(int argc, char * argv[]){
int dim = 2;
Vector x(dim),xsd(dim), b(dim),x0(dim);
Matrix A(dim,dim);
// Set our initial guess
x0(0) = x0(1) = 0;
// Set the matrix
A(0,0) = 3; A(1,0) = 2; A(0,1) = 2; A(1,1) = 6;
b(0) = 2; b(1) = -8;
cout << "The Matrix A that we are using: " << endl;
A.Print();
cout << endl;
xsd = SteepestDescent(A,b,x0);
cout << "The approximate solution using Steepest Descent is: " << endl;
xsd.Print();
cout << endl;
}
import numpy as np
import scipy.optimize as sopt
import matplotlib.pyplot as pt
from mpl_toolkits.mplot3d import axes3d

def f(x):
    return 0.5*x[0]**2 + 2.5*x[1]**2

def df(x):
    return np.array([x[0], 5*x[1]])

fig = pt.figure()
ax = fig.add_subplot(projection="3d")
xmesh, ymesh = np.mgrid[-2:2:50j,-2:2:50j]
fmesh = f(np.array([xmesh, ymesh]))
ax.plot_surface(xmesh, ymesh, fmesh)
Find guesses

guesses = [np.array([2.0, 2.0/5.0])]   # initial guess
x = guesses[-1]
s = -df(x)
Run it!
def f1d(alpha):
    return f(x + alpha*s)
alpha_opt = sopt.golden(f1d)
next_guess = x + alpha_opt * s
guesses.append(next_guess)
print(next_guess)
What happened?
pt.axis("equal")
pt.contour(xmesh, ymesh, fmesh, 50)
it_array = np.array(guesses)
pt.plot(it_array.T[0], it_array.T[1], "x-")
Two vectors s and t are said to be conjugate with respect to a symmetric positive-definite matrix A if

s^T A t = 0,

and more generally a set of vectors {x_i} is mutually conjugate if

x_i^T A x_j = 0 for all i ≠ j.

Two vectors are conjugate if they are orthogonal with respect to this inner
product. Being conjugate is a symmetric relation: if s is conjugate to t, then t
is conjugate to s.
Conjugate gradient method
Assume now that we have a symmetric positive-definite matrix A of size n × n.
At each iteration i + 1 we move along the conjugate direction p_i,

x_{i+1} = x_i + α_i p_i,

with

α_k = p_k^T b / (p_k^T A p_k).
Conjugate gradient method
One can show that the solution x is also the unique minimizer of the quadratic
form

f(x) = (1/2) x^T A x − x^T b,   x ∈ R^n.

This suggests taking the first basis vector p_1 to be the negative of the gradient of f at x = x_0,
which equals b − A x_0; with x_0 = 0 it is equal to b. The other vectors in the basis will be conjugate to
the gradient, hence the name conjugate gradient method.

Let r_k denote the residual at step k,

r_k = b − A x_k.

The next search direction is obtained from the residual by removing its component along the previous direction,

p_{k+1} = r_k − (p_k^T A r_k / p_k^T A p_k) p_k.

The residual itself can be updated iteratively,

r_{k+1} = b − A x_{k+1},

which equals

b − A(x_k + α_k p_k),

or

(b − A x_k) − α_k A p_k,

which gives

r_{k+1} = r_k − α_k A p_k.
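A minimal sketch (not code from these notes, using the standard mathematically equivalent form of the direction update) of the conjugate gradient iteration for a small symmetric positive-definite system, here the same system as in the C++ example above:

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, maxiter=100):
    x = x0.copy()
    r = b - A @ x            # initial residual
    p = r.copy()             # first search direction
    for _ in range(maxiter):
        Ap = A @ p
        alpha = (r @ r)/(p @ Ap)
        x = x + alpha*p
        r_new = r - alpha*Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new)/(r @ r)   # standard update coefficient for the new direction
        p = r_new + beta*p
        r = r_new
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b, np.zeros(2)))   # compare with np.linalg.solve(A, b)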
Broyden–Fletcher–Goldfarb–Shanno algorithm
The optimization problem is to minimize f (x) where x is a vector in Rn , and
f is a differentiable scalar function. There are no constraints on the values that
x can take.
The algorithm begins at an initial estimate for the optimal value x0 and
proceeds iteratively to get a better estimate at each stage.
The search direction pk at stage k is given by the solution of the analogue of
the Newton equation
Bk pk = −∇f (xk ),
where Bk is an approximation to the Hessian matrix, which is updated
iteratively at each stage, and ∇f (xk ) is the gradient of the function evaluated
at xk . A line search in the direction pk is then used to find the next point xk+1
by minimising
f (xk + αpk ),
over the scalar α > 0.
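In practice one rarely implements BFGS by hand. A minimal sketch using scipy (an assumption that scipy is available; the cost function below is our own toy example, not part of these notes):

import numpy as np
from scipy.optimize import minimize

def f(x):
    return 0.5*x[0]**2 + 2.5*x[1]**2

def df(x):
    return np.array([x[0], 5*x[1]])

# BFGS builds up an approximation to the (inverse) Hessian from gradient evaluations
res = minimize(f, x0=np.array([2.0, 2.0]), jac=df, method='BFGS')
print(res.x, res.nit)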