Lec 2
1 The ingredients
The primary ingredients in a standard optimization-based supervised learning problem are:
• Data: (xi, yi) pairs, where xi is the input/covariates, yi is the label/output, and the
index i = 1, . . . , n, with n the size of the training data.
• Model: a parameterized function fθ that maps an input x to a prediction, where θ are the
parameters we train.
• Loss Function: ℓ(·, ·), which assigns a cost to a prediction given the true label.
• Optimization Algorithm: the procedure (e.g. gradient descent) used to find the parameters
that minimize the training loss.
We will now expand on these ingredients and highlight complications which arise along with
the standard solutions to address them.
2 The model
Training the model means choosing the parameters θ. We do so by using empirical risk
minimization. One example of this is to choose θ which minimizes average loss:
θ̂ = argmin_θ (1/n) Σ_{i=1}^{n} ℓtrain(yi, fθ(xi))        (1)
For our setup, model performance is based on the average loss evaluated on our training
data. However, this is not reflective of our true goal, which is to ensure the model performs
well when deployed in the real world, since real-world data could look different from our
training data. Because of this, we need proxies that give us a better understanding of the
true performance of our model.
A mathematical proxy is obtained by making an assumption about the probability distribution
P(X, Y) of the underlying data, that is, by assuming what the data in the real world would
look like. With this distribution assumption, we can evaluate the expected loss
EX,Y[ℓ(Y, fθ(X))], which we want to be as low as possible. Making such an assumption about
the underlying distribution causes a few complications.
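As a concrete illustration (not part of the original notes), here is a minimal sketch of
empirical risk minimization as in Eq. (1). The linear model fθ(x) = θᵀx, the squared-error
loss, and the synthetic data are all assumptions made for the example.

import numpy as np

# Hypothetical training data: n pairs (x_i, y_i), generated for illustration only.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                     # inputs x_i (one per row)
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)   # labels y_i

def empirical_risk(theta, X, y):
    """Average training loss (1/n) * sum_i l_train(y_i, f_theta(x_i)), squared-error loss."""
    preds = X @ theta                           # f_theta(x_i) = theta^T x_i
    return np.mean((y - preds) ** 2)

# For this particular model and loss, the minimizer in Eq. (1) has a closed form (least squares).
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("empirical risk at theta_hat:", empirical_risk(theta_hat, X, y))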
3 Loss Functions
A loss function maps a set of values to a real number which, in theory, should reflect some
sort of cost associated with the event we are trying to model. There are many different loss
functions which we may have seen in previous coursework, including hinge loss and logistic
loss for binary classifiers, and cross-entropy loss for multi-class classifiers. With many
different loss functions in mind, we must decide which loss function is best for our use case.
This decision comes with some complications.
Complication 2: The loss function ℓtrue(·, ·) that we actually care about is incompatible
with our optimizer, e.g. our loss function is non-differentiable but the optimizer requires
its derivatives.
Solution: Use a surrogate loss, ℓtrain(·, ·), that satisfies the conditions of the optimizer
to train our model, but still evaluate the performance of the model (that is, calculate test
error) with ℓtrue(·, ·), since at evaluation time we no longer have to deal with the
constraints of the optimizer.
e.g. y ∈ {cat, dog} with ℓtrue the Hamming loss; we cannot take derivatives with respect to
cat and dog, so we use a surrogate: map the labels to R, with cat → −1 and dog → +1 in the
training data, and choose ℓtrain to be the squared-error loss, which is differentiable.
A side note: (1/n) Σ_{i=1}^{n} ℓtrain(yi, fθ̂(xi)) and (1/n) Σ_{i=1}^{n} ℓtrue(yi, fθ̂(xi))
are two different quantities, since we are computing the error using different loss functions.
You might ask: why do we need to find the training error using ℓtrue when we can just
evaluate the model on test data instead? We use this as a debugging method, since evaluating
the training error with the true loss function helps us understand whether the surrogate loss
is doing an adequate job of training our model.
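To make this concrete, here is a small sketch (an illustration, not from the notes) of the
cat/dog example: labels are encoded as ±1, a hypothetical linear score is trained with the
squared-error surrogate, and the training error is then reported under both ℓtrain and
ℓtrue (the 0/1, i.e. Hamming, loss) for debugging.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D features with a bias column; labels in {cat, dog} encoded as -1/+1.
n = 200
X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])
y_pm1 = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=n) > 0, 1.0, -1.0)

# Train with the surrogate l_train = squared error (differentiable): a least-squares fit.
theta_hat, *_ = np.linalg.lstsq(X, y_pm1, rcond=None)

scores = X @ theta_hat
train_err_surrogate = np.mean((y_pm1 - scores) ** 2)   # (1/n) sum l_train
train_err_true = np.mean(np.sign(scores) != y_pm1)     # (1/n) sum l_true (0/1 / Hamming)

print("training error under l_train (squared error):", train_err_surrogate)
print("training error under l_true  (0/1 loss):     ", train_err_true)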
4 Overfitting and Hyperparameters
4.1 Overfitting
Complication 3: We get ‘crazy’ values for θ̂ and/or we get really bad test performance.
One reason this could happen is model overfitting.
Solution: Add an explicit regularizer during training, that is,
θ̂ = argmin_θ (1/n) Σ_{i=1}^{n} ℓtrain(yi, fθ(xi)) + R(θ)        (3)
where R(θ) = λ‖θ‖² is an example of a regularizer we could choose and is known as ridge
regularization.
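As an illustrative sketch (assuming a linear model, squared-error loss, and a hand-picked λ,
none of which are specified in the notes), the regularized objective in Eq. (3) with ridge
R(θ) = λ‖θ‖² can be minimized in closed form for this particular model:

import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n) * sum_i (y_i - theta^T x_i)^2 + lam * ||theta||^2."""
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X / n + lam * I) theta = X^T y / n.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Hypothetical ill-conditioned data where unregularized least squares gives 'crazy' theta values.
rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=n)])   # nearly collinear columns
y = x1 + 0.1 * rng.normal(size=n)

print("without regularization:", ridge_fit(X, y, lam=0.0))    # large, unstable values
print("with ridge (lambda=1e-2):", ridge_fit(X, y, lam=1e-2)) # small, stable values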
Side note: An alternative solution is to simplify your model or reduce the model order (e.g.
the depth of the model). Simplifying the model is often suggested in statistics; however, in
deep learning it is uncommon to simplify the model when crazy values of θ̂ arise from
training.
5 Optimization Algorithm
There are many different optimization algorithms when it comes to minimizing a loss func-
tion. A commonly used optimization algorithm is Gradient Descent. Gradient descent is
an iterative optimization algorithm which changes the parameter of interest θ a little bit at
a time by looking at the local neighborhood of loss around θt . We look at this neighborhood
by taking the first-order Taylor expansion:

Ltrain(θt + ∆θ) ≈ Ltrain(θt) + (∂Ltrain/∂θ)|θt ∆θ

[Figure 1: Data partitions for fitting parameters and hyperparameters]

We want to move in the direction that decreases the loss Ltrain(θt + ∆θ) as much as possible,
which means moving in the negative direction of the gradient. So gradient descent updates θ
by the following:

θt+1 = θt − η (∂Ltrain/∂θ)|θt        (4)
We can interpret Eq. 4 as a dynamical system in discrete time, which means we can ask
questions about the stability of the system. In our case, the step size η controls the
stability of the system: if η is too large, the dynamics become unstable (we could diverge),
but if η is too small, then practically speaking it could take too long to reach the minimum.
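A minimal sketch of the update in Eq. (4), showing how the choice of η affects stability
(the quadratic objective and the specific step sizes are assumptions made for illustration):

import numpy as np

def grad_descent(grad, theta0, eta, steps):
    """Iterate theta_{t+1} = theta_t - eta * grad(theta_t), as in Eq. (4)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Toy objective L_train(theta) = theta^2, whose gradient is 2 * theta; minimum at theta = 0.
grad = lambda theta: 2.0 * theta

print(grad_descent(grad, theta0=1.0, eta=0.1, steps=50))    # converges toward 0
print(grad_descent(grad, theta0=1.0, eta=1.5, steps=50))    # eta too large: iterates diverge
print(grad_descent(grad, theta0=1.0, eta=1e-4, steps=50))   # eta too small: barely moves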
Time Consideration: There are different interpretations of time when training models. On the
one hand, we could look at how many training iterations have passed, or how much data we have
ingested (e.g. epochs). Another important consideration is wall-clock time, which is simply
the amount of real-world time used in training. When training our model, both interpretations
of time must be considered, since we want to train the model for enough iterations while not
using too much wall-clock time (depending on our budget).
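As a small illustrative sketch (the loop body, the iteration budget, and the wall-clock
budget are placeholders, not values from the notes), one way to track both notions of time
during training:

import time

wall_clock_budget_s = 5.0          # hypothetical real-world time budget (seconds)
max_iterations = 10_000            # hypothetical iteration budget

start = time.perf_counter()
for iteration in range(max_iterations):
    # ... one training step (e.g. one gradient descent update) would go here ...
    elapsed = time.perf_counter() - start
    if elapsed > wall_clock_budget_s:
        print(f"stopping at iteration {iteration}: wall-clock budget exhausted")
        break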