Unit 3

Deep Neural Networks

Contents
• Backpropagation
• Setup and Initialization Issues
• Gradient-Descent strategies
• Bias-variance trade-off
• Generalization Issues in Model Tuning and Evaluation
• Ensemble Methods
Backpropagation
• Backpropagation is a specific technique for implementing gradient
descent in weight space for a multilayer perceptron.
• The basic idea is to efficiently compute the partial derivatives of an
approximating function F(w, x) realized by the network with respect to
all the elements of the adjustable weight vector w, for a given value of the input
vector x.
Backpropagation: Two passes of computation
• In the application of the back-propagation algorithm, two different
passes of computation are distinguished.
• The first pass is referred to as the forward pass, and the second is
referred to as the backward pass.
1. In the forward pass, the synaptic weights remain unaltered
throughout the network, and the function signals (activation values)
of the network are computed on a neuron-by-neuron basis.
2. The backward pass starts at the output layer by passing the error
signals backward (leftward) through the network, layer by layer, and recursively
computing the local gradient for each neuron.
• This local gradient is then used to update the weights.
Backpropagation: Local gradient computation
• The local gradient δj(n) depends on whether neuron j is an output node or a
hidden node:
1. If neuron j is an output node, δj(n) equals the product of the derivative
φ′j(vj(n)) and the error signal ej(n), both of which are associated with
neuron j.
2. If neuron j is a hidden node, δj(n) equals the product of the associated
derivative φ′j(vj(n)) and the weighted sum of the δs computed for the
neurons in the next hidden or output layer that are connected to neuron j.
Backpropagation with Pre-activation Variables
The backpropagation process can now be described as follows:
1. Use a forward pass to compute the values of all hidden units, the output o, and the loss L
for a particular input-output pattern (X, y).
2. Initialize δ(o, o) = ∂L/∂ao to ∂L/∂o · Φ′(ao).
3. Compute each δ(hr, o) in the backward direction. After each such computation,
compute the gradients with respect to the incident weights as follows:
∂L/∂w(hr−1, hr) = δ(hr, o) · hr−1
• The partial derivatives with respect to the incident biases can be computed by using
the fact that bias neurons are always activated at a value of +1.
4. Use the computed partial derivatives of the loss function with respect to the weights
to perform stochastic gradient descent for the input-output pattern (X, y).
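The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the slides' own implementation: the sigmoid activation, the squared-error loss, the layer sizes, and all function names are assumptions chosen for the example. The local gradient `delta` is the derivative of the loss with respect to the pre-activation value of each layer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    """Step 1: forward pass. Returns the activations of every layer (input first)."""
    hs = [x]
    for W, b in zip(weights, biases):
        a = W @ hs[-1] + b          # pre-activation value
        hs.append(sigmoid(a))       # post-activation value
    return hs

def backward(y, weights, hs):
    """Steps 2-3: gradients of L = 0.5 * ||o - y||^2 w.r.t. weights and biases."""
    grads_W, grads_b = [], []
    o = hs[-1]
    # Step 2: delta at the output = dL/do * Phi'(a_o); for sigmoid, Phi'(a) = o * (1 - o)
    delta = (o - y) * o * (1.0 - o)
    for r in range(len(weights) - 1, -1, -1):
        # Step 3: dL/dW = delta times the activation feeding into this layer
        grads_W.append(np.outer(delta, hs[r]))
        grads_b.append(delta.copy())            # bias neuron is fixed at +1
        if r > 0:
            h = hs[r]
            # weighted sum of the deltas of the next layer, times Phi' of this layer
            delta = (weights[r].T @ delta) * h * (1.0 - h)
    return grads_W[::-1], grads_b[::-1]

def loss(x, y, weights, biases):
    return 0.5 * np.sum((forward(x, weights, biases)[-1] - y) ** 2)

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                # assumed toy architecture
weights = [rng.normal(0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
x, y = rng.normal(size=3), np.array([0.0, 1.0])

grads_W, grads_b = backward(y, weights, forward(x, weights, biases))

# Finite-difference check of one weight gradient
eps = 1e-6
perturbed = [W.copy() for W in weights]
perturbed[0][1, 2] += eps
numeric = (loss(x, y, perturbed, biases) - loss(x, y, weights, biases)) / eps

# Step 4: one stochastic gradient-descent update for this pattern
alpha = 0.1
loss_before = loss(x, y, weights, biases)
new_weights = [W - alpha * g for W, g in zip(weights, grads_W)]
new_biases = [b - alpha * g for b, g in zip(biases, grads_b)]
loss_after = loss(x, y, new_weights, new_biases)
```

The finite-difference value `numeric` should agree closely with `grads_W[0][1, 2]`, which is a quick way to check a hand-written backward pass.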
Local Gradient Updates for Various Activation Functions
[ReLU]
Handling Shared Weights
1. In an autoencoder simulating PCA, the weights in the input
layer and the output layer are shared.
2. In a recurrent neural network for text, the weights in
different temporal layers are shared, because it is assumed
that the language model at each time-stamp is the same.
3. In a convolutional neural network, the same grid of weights
(corresponding to a visual field) is used over the entire
spatial extent of the neurons.
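The slide lists where weights are shared but not how backpropagation copes with them. A standard way to handle this (stated here as background, not taken from the slide) is to pretend the shared copies are independent weights, compute each gradient as usual, and then add the gradients. The toy two-layer linear chain and the function names below are assumptions for illustration only.

```python
def loss(w1, w2, x, y):
    """Two-layer linear chain o = w2 * (w1 * x) with squared loss."""
    return 0.5 * (w2 * w1 * x - y) ** 2

def shared_weight_gradient(w, x, y):
    """Treat the two uses of w as independent weights w1 and w2, then add the gradients."""
    w1 = w2 = w
    err = w2 * w1 * x - y
    grad_w1 = err * w2 * x      # dL/dw1 with w2 held fixed
    grad_w2 = err * w1 * x      # dL/dw2 with w1 held fixed
    return grad_w1 + grad_w2    # gradient with respect to the shared weight

w, x, y = 0.7, 1.5, 2.0
analytic = shared_weight_gradient(w, x, y)

# Central finite difference, moving both copies of the weight together
eps = 1e-6
numeric = (loss(w + eps, w + eps, x, y) - loss(w - eps, w - eps, x, y)) / (2 * eps)
```

The summed gradient `analytic` matches the finite-difference estimate `numeric` obtained by perturbing the shared weight directly.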
Setup and Initialization Issues
• Tuning Hyperparameters:
• Neural networks have a large number of hyperparameters
• The term “hyperparameter” refers specifically to the
parameters that regulate the design of the model, such as:
• learning rate
• number of layers
• nodes per layer
• regularization parameter
• Hyperparameter tuning based on validation set: a portion of the data is held
out as validation data, and the performance of the model is tested on the
validation set with various choices of hyperparameters.
• Grid-based hyperparameter exploration: select a set of values for each
hyperparameter in some reasonable range.
• Test over all combinations of values.
• With 10 hyperparameters, choosing just 3 values for each leads to 3^10 = 59,049 combinations.
• https://www.section.io/engineering-education/grid-search/#grid-search
• Sampling the logarithm of hyperparameters: search uniformly over a reasonable
range of log-values and then exponentiate.
• Example: uniformly sample the log-learning rate between −3 and −1, and then raise 10 to
that power, which gives learning rates between 0.001 and 0.1.
• In many cases, multiple threads of the process with different hyperparameters
can be run, and one can successively terminate or add new sampled runs.
• In the end, only one winner is allowed to train to completion.
• Sometimes a few winners may be allowed to train to completion, and their predictions
will be averaged as an ensemble.
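The grid-based exploration and the log-uniform sampling described above can be sketched as follows. The hyperparameter names and candidate values are hypothetical; only the counting argument and the −3 to −1 log range come from the slides.

```python
import itertools
import numpy as np

# Grid search: every combination of the candidate values (hypothetical grid)
grid = {
    "learning_rate": [1e-3, 1e-2, 1e-1],
    "num_layers": [2, 3, 4],
    "reg_lambda": [1e-4, 1e-3, 1e-2],
}
combinations = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# With 10 hyperparameters and 3 values each, the grid grows to 3^10 settings
num_combinations_10_params = 3 ** 10

# Log-uniform sampling: sample the exponent uniformly, then raise 10 to that power
rng = np.random.default_rng(0)
log_lr = rng.uniform(-3.0, -1.0, size=5)
learning_rates = 10.0 ** log_lr
```

Sampling the exponent spreads the trials evenly across orders of magnitude, whereas sampling the learning rate itself uniformly in [0.001, 0.1] would place most trials near 0.1.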
• Feature preprocessing:
• Additive Preprocessing and mean-centering: It can be useful to
mean-center the data in order to remove certain types of bias
effects.
• In such cases, a vector of column-wise means is subtracted from each data
point.
• Non-negative features: A second type of pre-processing is used when
it is desired for all feature values to be non-negative.
• In such a case, the absolute value of the most negative entry of a feature is
added to the corresponding feature value of each data point.
• Feature Normalization: A common type of normalization is to
divide each feature value by its standard deviation.
• When this type of feature scaling is combined with mean-centering,
the data is said to have been standardized.
• Each feature is presumed to have been drawn from a standard
normal distribution with zero mean and unit variance.
• Min-max normalization: useful when the data needs to be scaled in
the range (0,1)
• Whitening: The axis system is rotated to create a new set of decorrelated
features, each of which is scaled to unit variance.
• Typically, principal component analysis is used to achieve this goal.
• Principal component analysis can be viewed as the application of singular
value decomposition after mean-centering a data matrix (i.e., subtracting
the mean from each column).
• Let D be an n × d data matrix that has already been mean-centered, and let P be the
d × k matrix whose columns are the top-k eigenvectors of the covariance matrix of D.
• Whitening steps used for each data point:
i. The mean of each column is subtracted from the corresponding
feature;
ii. Each d-dimensional row vector representing a training data
point (or test data point) is post-multiplied with P to create a
k-dimensional row vector;
iii. Each feature of this k-dimensional representation is divided by
the square root of the corresponding eigenvalue.
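The three whitening steps can be sketched in NumPy as below. This is an illustrative sketch under assumptions: the function names are hypothetical, the covariance matrix is taken as DᵀD/n after mean-centering, and a small `eps` is added to guard against division by a near-zero eigenvalue.

```python
import numpy as np

def fit_whitening(D, k):
    """Learn column means, the top-k eigenvectors P, and their eigenvalues from training data."""
    mean = D.mean(axis=0)
    centered = D - mean
    cov = centered.T @ centered / D.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    return mean, eigvecs[:, order], eigvals[order]

def whiten(X, mean, P, eigvals, eps=1e-12):
    """Steps i-iii: subtract the mean, post-multiply with P, divide by sqrt(eigenvalue)."""
    return (X - mean) @ P / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 3))                  # makes the toy features correlated
D = rng.normal(size=(500, 3)) @ mixing + 5.0
mean, P, eigvals = fit_whitening(D, k=3)
Z = whiten(D, mean, P, eigvals)
cov_Z = Z.T @ Z / Z.shape[0]                      # close to the identity matrix
```

The same `mean`, `P`, and `eigvals` learned on the training data are reused for test points, so that both are mapped into the same whitened space.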
Initialization Issues
• Initialization is particularly important in neural networks because of
the stability issues associated with neural network training.
• One possible approach to initialize the weights is to generate random
values from a Gaussian distribution with zero mean and a small
standard deviation.
• The problem with this initialization is that it is not sensitive to the number of inputs
to a specific neuron, so a neuron with many inputs accumulates a much larger total than a neuron with few inputs.
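The effect of ignoring the number of inputs can be checked numerically. The sketch below compares a fixed small standard deviation with a standard deviation of 1/√r for a neuron with r inputs. The 1/√r rule is a common remedy and is an assumption here, since the slide that presumably stated the remedy did not survive extraction; the function name and the trial counts are also illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def preactivation_std(fan_in, weight_std, trials=2000):
    """Empirical std of a = sum_i w_i * x_i for unit-variance inputs x_i."""
    x = rng.normal(size=(trials, fan_in))
    w = rng.normal(0.0, weight_std, size=(trials, fan_in))
    return (x * w).sum(axis=1).std()

# Fixed small std: the pre-activation scale grows with the number of inputs
fixed_small = preactivation_std(10, 0.05)
fixed_large = preactivation_std(1000, 0.05)

# Std scaled as 1/sqrt(fan_in): the pre-activation scale stays near 1
scaled_small = preactivation_std(10, 1.0 / np.sqrt(10))
scaled_large = preactivation_std(1000, 1.0 / np.sqrt(1000))
```

With the fixed standard deviation, the neuron with 1000 inputs has a pre-activation roughly ten times larger than the neuron with 10 inputs, which is the sensitivity problem the bullet describes.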
Gradient Descent Strategies
Gradient Descent Algorithm
Gradient Descent Algorithm with Backpropagation
• A low learning rate used early on will cause the algorithm to take
too long to come even close to an optimal solution.
• A large initial learning rate will allow the algorithm to come
reasonably close to a good solution at first, but it may then oscillate around
that solution or diverge.
• In either case, maintaining a constant learning rate is not ideal.
• Allowing the learning rate to decay over time can naturally achieve
the desired learning-rate adjustment and avoid these challenges.
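Two commonly used decay schedules illustrate the idea. The slides do not name a specific schedule, so the exponential and inverse forms below, the decay constant `k`, and the function names are assumptions for illustration.

```python
import math

def exponential_decay(alpha0, k, t):
    """alpha_t = alpha_0 * exp(-k * t)"""
    return alpha0 * math.exp(-k * t)

def inverse_decay(alpha0, k, t):
    """alpha_t = alpha_0 / (1 + k * t)"""
    return alpha0 / (1.0 + k * t)

# Learning rate at a few epochs, starting from alpha_0 = 0.1
schedule_exp = [exponential_decay(0.1, 0.01, t) for t in (0, 50, 100)]
schedule_inv = [inverse_decay(0.1, 0.01, t) for t in (0, 50, 100)]
```

Both schedules start at the large initial rate and shrink it over time, which matches the bullets above: fast progress early, smaller steps near a solution.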
First-order GD optimizer methods
• Momentum
• Nesterov Accelerated Gradient Momentum
• AdaGrad
• RMSProp
• AdaDelta
• Adam
Momentum-based Gradient Descent
• Update rule: uses an exponentially weighted average of the gradients.
Momentum
• Due to the momentum, the optimizer may overshoot a bit, then come
back, overshoot again, and oscillate like this many times before
stabilizing at the minimum.
• This is why it is good to have a bit of friction in the system: it gets rid of
these oscillations and thus speeds up convergence.
• So the momentum hyperparameter β, which acts as friction, must be set between 0 (high
friction) and 1 (low friction).
• Typically, β = 0.9.
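A minimal sketch of a momentum update follows, using the commonly published form v ← βv − α∇L(w), w ← w + v. The slide's own update rule did not survive extraction, so this form, the toy quadratic loss, and the step size are assumptions.

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = 0.5 * (10 * w[0]**2 + w[1]**2)
    return np.array([10.0 * w[0], w[1]])

def momentum_descent(w0, alpha=0.02, beta=0.9, steps=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)                  # velocity: decaying sum of past gradient steps
    for _ in range(steps):
        v = beta * v - alpha * grad(w)    # beta = 0 (high friction) gives plain gradient descent
        w = w + v
    return w

w_momentum = momentum_descent([1.0, 1.0])            # beta = 0.9
w_plain = momentum_descent([1.0, 1.0], beta=0.0)     # no momentum
```

On this toy problem the run with β = 0.9 ends much closer to the minimum at the origin than the run with β = 0 after the same number of steps, because the velocity keeps building along the gently sloping direction.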
Nesterov Momentum/Nesterov Accelerated Gradient (NAG)
[Look ahead before you leap]
Ill-conditioning
• Ill-conditioning is a situation where the loss function has an inherent
tendency to be more sensitive to some parameters than others
• for instance, after feature normalization
• Extreme ill-conditioning:
• the partial derivatives of the loss are wildly different with respect to
the different parameters.
• Clever learning strategies exist that work well in these ill-conditioned settings.
AdaGrad
• Intuition:
• Decay the learning rate for each parameter in proportion to its update
history (more updates means more decay).
• Scaling the derivative inversely with √Ai, where Ai is the aggregate squared
partial derivative with respect to parameter wi, encourages faster relative
movements along gently sloping directions.
– Absolute movements tend to slow down prematurely, so AdaGrad may not converge.
• Cons: AdaGrad decays the learning rate very aggressively (as the
denominator grows).
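A minimal sketch of the AdaGrad update, in its commonly published form: Ai accumulates the squared partial derivatives, and each step is divided by √Ai. The ill-conditioned toy loss, the learning rate, and the small `eps` for numerical safety are assumptions.

```python
import numpy as np

def grad(w):
    # Gradient of the ill-conditioned toy loss L(w) = 0.5 * (100 * w[0]**2 + w[1]**2)
    return np.array([100.0 * w[0], w[1]])

def adagrad(w0, alpha=0.5, eps=1e-8, steps=200):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)                       # aggregate squared partial derivatives
    for _ in range(steps):
        g = grad(w)
        A += g ** 2                            # A_i only ever grows
        w -= alpha * g / np.sqrt(A + eps)      # per-parameter scaled step
    return w, A

w_final, A_final = adagrad([1.0, 1.0])
```

Because each partial derivative is divided by the root of its own accumulated history, both coordinates make nearly identical progress here even though their curvatures differ by a factor of 100. The ever-growing `A` is also what causes the aggressive learning-rate decay noted above.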
AdaDelta Optimizer
• Intuition: RMSProp variant
• α is replaced with a value (δ) that depends on the previous incremental updates. In each update,
the value of Δwi is the increment in the value of wi.
• Update Rule:
• Note: The AdaDelta update does not depend on a learning rate (α).
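The update rule itself did not survive extraction from the slide. The sketch below follows the commonly published form of AdaDelta, which is an assumption here: the step is the gradient scaled by the ratio of the smoothed past increments (δ) to the smoothed squared gradients (A). The values of ρ and `eps` and the toy loss are also illustrative.

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, eps=1e-6, steps=300):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)          # exponentially smoothed squared gradients
    delta = np.zeros_like(w)      # exponentially smoothed squared increments (delta w)^2
    for _ in range(steps):
        g = grad(w)
        A = rho * A + (1 - rho) * g ** 2
        # No learning rate: sqrt(delta) plays the role that alpha plays in RMSProp
        dw = -np.sqrt(delta + eps) / np.sqrt(A + eps) * g
        delta = rho * delta + (1 - rho) * dw ** 2
        w = w + dw
    return w

w_start = np.array([1.0, -2.0])
w_final = adadelta(lambda w: 2.0 * w, w_start)   # toy loss L(w) = sum(w**2)
```

The constant `eps` is what gets the first step moving, since `delta` starts at zero. Progress is slow at first and speeds up as the increment history builds.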

Adam Optimizer
• Intuition: RMSProp variant
• Exponential smoothing with ρ, ρf ∈ (0, 1)
• Update Rule: The update below is used at learning rate αt in the t-th iteration:
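The update rule itself is missing from the extracted slide. The sketch below is one commonly published form of Adam, written with the ρ and ρf notation used above. The bias-corrected learning rate αt, the hyperparameter values, and the toy loss are assumptions, not the slide's own formula.

```python
import numpy as np

def adam(grad, w0, alpha=0.1, rho=0.999, rho_f=0.9, eps=1e-8, steps=500):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)      # exponentially smoothed squared gradient (decay rho)
    F = np.zeros_like(w)      # exponentially smoothed gradient (decay rho_f)
    for t in range(1, steps + 1):
        g = grad(w)
        A = rho * A + (1 - rho) * g ** 2
        F = rho_f * F + (1 - rho_f) * g
        # Learning rate at iteration t, corrected for the zero initialization of A and F
        alpha_t = alpha * np.sqrt(1 - rho ** t) / (1 - rho_f ** t)
        w = w - alpha_t * F / (np.sqrt(A) + eps)
    return w

w_adam = adam(lambda w: 2.0 * w, [1.0, -3.0])    # toy loss L(w) = sum(w**2)
```

Compared with RMSProp, the smoothed gradient `F` replaces the raw gradient in the numerator, which adds a momentum-like effect to the per-parameter scaling.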

Second-order GD optimizer methods
• Conjugate Gradients and Hessian-Free Optimization
• Quasi-Newton Methods and BFGS
[Figure: loss surface with different curvatures along the axes (curvature along x is high, curvature along y is low); a small step does not overshoot, whereas a large step may overshoot.]
[Equation: Taylor expansion of the loss, with its zero-order, first-order, and second-order terms labeled.]
• Challenges
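The overshoot behaviour and the role of the second-order term can be sketched on a quadratic loss whose curvature is high along x and low along y. The Hessian values and step sizes are assumptions chosen to make the effect visible. The Newton step shown is the generic second-order update w − H⁻¹∇L, not one of the specific methods listed above.

```python
import numpy as np

H = np.array([[100.0, 0.0],      # high curvature along x
              [0.0, 1.0]])       # low curvature along y

def loss(w):
    return 0.5 * w @ H @ w

def grad(w):
    return H @ w

w0 = np.array([1.0, 1.0])

small_step = w0 - 0.001 * grad(w0)     # no overshoot, but almost no progress along y
large_step = w0 - 0.03 * grad(w0)      # good progress along y, overshoots along x

# Newton step: rescales the gradient by the inverse Hessian, i.e. by the curvature
newton_step = w0 - np.linalg.solve(H, grad(w0))
```

For a quadratic loss the second-order expansion is exact, so the Newton step lands on the minimum in one update. A first-order step must use one step size for both axes, and that step size is either too small for y or too large for x.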
Bias and Variance
Bias
Variance
The Bias and Variance Trade-off
Generalization Issues in Model Tuning and Evaluation
• Model tuning and Hyperparameter choice:
• If one tuned the neural network with the same data that were used to train it, one would not
obtain very good results because of overfitting.
• The hyperparameters (e.g., regularization parameter) are tuned on a held-out set that is separate from
the one on which the weight parameters of the neural network are learned.
• Issues with training at large scale for different hyperparameter settings:
• A common strategy is to run the training process of each setting for a fixed number of epochs.
• Multiple runs are executed over different choices of hyperparameters in different threads of
execution.
• How to detect the need to collect more data:
• High bias ⇒ underfitting; high variance ⇒ overfitting
• With increased training data, the training accuracy will decrease, whereas the test/validation accuracy will
increase.
• A given data set should always be divided into three parts defined according to
the way in which the data are used:
1. Training data: the part of the data used to build the training model
2. Validation data: the part of the data used for model tuning
• Strictly speaking, the validation data is also a part of the training data, because it influences
the final model
3. Testing data: part of the data used to test the accuracy of the final (tuned)
model.
• It is important that the testing data are not even looked at during the process
of parameter tuning and model selection to prevent overfitting.
• The testing data are used only once at the very end of the process.
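A three-way split along these lines can be sketched as follows. The 60/20/20 proportions, the function name, and the fixed seed are assumptions for illustration.

```python
import numpy as np

def three_way_split(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Return disjoint index arrays for training, validation, and testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                  # shuffle once, then cut
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]                       # touched only once, at the very end
    val = idx[n_test:n_test + n_val]          # used for hyperparameter tuning
    train = idx[n_test + n_val:]              # used to learn the weights
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(100)
```

Keeping the test indices in a separate array from the start makes it easier to guarantee that they are never consulted during tuning or model selection.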
Holdout
• In the hold-out method, a fraction of the instances are used to build
the training model.
• The remaining instances, which are also referred to as the held-out
instances, are used for testing.
• The accuracy of predicting the labels of the held-out instances is then
reported as the overall accuracy.
• Pros:
• Such an approach ensures that the reported accuracy is not a result of
overfitting to the specific data set, because different instances are used for
training and testing.
• Simple and efficient
• Cons:
• Underestimates the true accuracy
• Pessimistic bias in evaluation when the class distribution of the held-out instances
differs from that of the training instances (class imbalance)
Cross Validation
• In the cross-validation method, the labeled data is divided into q
equal segments.
• One of the q segments is used for testing, and the remaining (q−1)
segments are used for training.
• This process is repeated q times by using each of the q segments as
the test set.
• The average accuracy over the q different test sets is reported.
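The q-fold procedure above can be sketched as follows. The helper names and the `fit`/`predict` callables are hypothetical placeholders for whatever model is being evaluated.

```python
import numpy as np

def q_fold_indices(n, q, seed=0):
    """Yield (train, test) index arrays; each segment serves as the test set exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), q)     # q (nearly) equal segments
    for i in range(q):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(q) if j != i])
        yield train, test

def cross_validate(X, y, q, fit, predict):
    """Average accuracy over the q test segments."""
    accuracies = []
    for train, test in q_fold_indices(len(X), q):
        model = fit(X[train], y[train])
        accuracies.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(accuracies))

splits = list(q_fold_indices(10, 5))
```

With n = 10 and q = 5, each of the five splits trains on 8 instances and tests on the remaining 2, and every instance appears in exactly one test segment.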
Ensemble Methods
• Ensemble methods derive their inspiration from the bias-variance
trade-off.
• One way of reducing the error of a classifier is to find a way to reduce
either its bias or the variance without affecting the other component.
• Ensemble methods are commonly used in machine learning.
• Examples:
• Bagging: variance reduction
• Boosting: bias reduction
• Most ensemble methods in neural networks are focused on variance
reduction.
• as neural networks are valued for their ability to build arbitrarily complex
models in which the bias is relatively low.
• However, operating at the complex end of the bias variance trade-off almost
always leads to higher variance, which is manifested as overfitting.
• Therefore, the goal of most ensemble methods in the neural network
setting is variance reduction (i.e., better generalization).
Ensemble Methods: Bagging
• In bagging, the training data is sampled with replacement.
• The sample size s may be different from the training data size n,
although it is common to set s to n.
• The resampled data will contain duplicates, and when s = n, about a fraction (1 − 1/n)^n ≈ 1/e of
the original data set will not be included at all. Here, e denotes the
base of the natural logarithm.
• A model is constructed on the resampled training data set, and each test instance
is predicted with this model.
• The entire process of resampling and model building is repeated m times.
• For a given test instance, each of these m models is applied to the test data.
• The predictions from different models are then averaged to yield a single robust
prediction.
• In bagging, the best results are often obtained by choosing values of s much smaller than n (s ≪ n).
Challenges:
• The main challenge in directly using bagging for neural networks is that one must construct
multiple training models, which is highly inefficient unless training is on multiple GPU processors.
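The resampling, model building, and averaging steps can be sketched as below. To keep the example short, a linear least-squares model stands in for the neural network. That substitution, the function names, and the values of m and n are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_indices(n, s):
    """Sample s indices from 0..n-1 with replacement."""
    return rng.integers(0, n, size=s)

def bagged_predict(X, y, X_test, m=25, s=None):
    """Fit m models on resampled data and average their predictions."""
    n = len(X)
    s = n if s is None else s
    predictions = []
    for _ in range(m):
        idx = bootstrap_indices(n, s)
        coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)   # stand-in base model
        predictions.append(X_test @ coef)
    return np.mean(predictions, axis=0)

# Fraction of the original points missing from one bootstrap sample with s = n
n = 10000
left_out = 1.0 - len(np.unique(bootstrap_indices(n, n))) / n     # close to 1/e

# Toy regression data: y = 1 + 2x + noise
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, size=200)
X_test = np.array([[1.0, 0.0], [1.0, 1.0]])
y_bagged = bagged_predict(X, y, X_test)
```

Each of the m models is independent of the others, so the loop can be run in parallel, which is why the cost concern in the challenge above is eased when several GPUs are available.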
Ensemble Methods: Subsampling
• Subsampling is similar to bagging, except that the different models are
constructed on the samples of the data created without replacement.
• The predictions from the different models are averaged.
• In this case, it is essential to choose s < n.
Ensemble Methods: Parametric Model Selection and Averaging
• The presence of a large number of hyperparameters creates problems in model
construction, because the performance might be sensitive to the particular
configuration used.
• One possibility is to hold out a portion of the training data and try different
combinations of parameters and model choices.
• The selection that provides the highest accuracy on the held-out portion of the
training data is then used for prediction.
• This is the standard approach used for parameter tuning in all machine learning
models, and is referred to as model selection/bucket-of-models.
Ensemble Methods: Randomized Connection Dropping
• The random dropping of connections between different layers in a multilayer
neural network often leads to diverse models in which different combinations of
features are used to construct the hidden variables.
• The dropping of connections between layers does tend to create less powerful
models because of the addition of constraints to the model-building process.
• However, since different random connections are dropped from different
models, the predictions from different models are very diverse.
• The averaged prediction from these different models is often highly accurate.
Note: The weights of different models are not shared in this approach
