Unit 3
Handling Shared Weights
1. In an autoencoder simulating PCA, the weights in the input
layer and the output layer are shared.
2. In a recurrent neural network for text, the weights in
different temporal layers are shared, because it is assumed
that the language model at each time-stamp is the same.
3. In a convolutional neural network, the same grid of weights
(corresponding to a visual field) is used over the entire
spatial extent of the neurons.
Setup and Initialization Issues
• Tuning Hyperparameters:
• Neural networks have a large number of hyperparameters
• The term “hyperparameter” refers specifically to the
parameters regulating the design of the model, such as:
• learning rate
• number of layers
• nodes per layer
• regularization parameter
Setup and Initialization Issues
• Tuning Hyperparameters:
• Hyperparameter tuning based on validation set: a portion of the data is held
out as validation data, and the performance of the model is tested on the
validation set with various choices of hyperparameters.
• Grid-based hyperparameter exploration: select a set of values for each
parameter in some reasonable range.
• Test over all combinations of values.
• With 10 parameters, choosing just 3 values for each parameter leads to 3^10 = 59,049 combinations.
• https://www.section.io/engineering-education/grid-search/#grid-search
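A minimal sketch of grid-based exploration, assuming a hypothetical grid of three hyperparameters and a placeholder `validation_score` function that stands in for training a model and scoring it on the validation set. Neither the grid values nor the scoring function come from the slides.

```python
import itertools

# Hypothetical grid: three candidate values for each of three hyperparameters.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_layers": [1, 2, 3],
    "reg_lambda": [0.0, 0.01, 0.1],
}

def validation_score(config):
    """Placeholder for 'train with config, then evaluate on the validation set'."""
    # Toy score that peaks at learning_rate=0.01, num_layers=2, reg_lambda=0.01.
    return -(abs(config["learning_rate"] - 0.01)
             + abs(config["num_layers"] - 2)
             + abs(config["reg_lambda"] - 0.01))

names = list(grid)
# Every combination of values: the Cartesian product of the per-parameter lists.
configs = [dict(zip(names, values)) for values in itertools.product(*grid.values())]
best = max(configs, key=validation_score)

print(len(configs))  # 3**3 = 27 combinations for this small grid
print(3 ** 10)       # 59049 combinations with 10 hyperparameters of 3 values each
print(best)
```

The exponential growth in the number of combinations is what makes exhaustive grids impractical beyond a handful of hyperparameters.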
Setup and Initialization Issues
• Tuning Hyperparameters:
• Sampling logarithm of hyperparameters: Search uniformly over a reasonable
range of log-values and then exponentiate.
• Example: Uniformly sample the log-learning rate between −3 and −1, and then raise 10 to the
power of the sampled value, giving a learning rate between 0.001 and 0.1.
• In many cases, multiple threads of the process with different hyperparameters
can be run, and one can successively terminate or add new sampled runs.
• In the end, only one winner is allowed to train to completion.
• Sometimes a few winners may be allowed to train to completion, and their predictions
will be averaged as an ensemble.
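The log-scale sampling described above can be sketched as follows. The number of samples and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the exponent uniformly in [-3, -1], then exponentiate with base 10.
log_lr = rng.uniform(-3.0, -1.0, size=5)
learning_rates = 10.0 ** log_lr

print(learning_rates)  # values spread evenly across orders of magnitude in [0.001, 0.1]
```

Sampling the exponent rather than the rate itself gives the interval 0.001 to 0.01 the same weight as 0.01 to 0.1, which uniform sampling of the raw rate would not.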
Setup and Initialization Issues
• Feature preprocessing:
• Additive Preprocessing and mean-centering: It can be useful to
mean-center the data in order to remove certain types of bias
effects.
• In such cases, a vector of column-wise means is subtracted from each data
point.
• Non-negative features: A second type of pre-processing is used when
it is desired for all feature values to be non-negative.
• In such a case, the absolute value of the most negative entry of a feature is
added to the corresponding feature value of each data point.
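Both preprocessing steps can be sketched on a small made-up data matrix `X` (rows are data points, columns are features). The values are illustrative only.

```python
import numpy as np

X = np.array([[1.0, -4.0],
              [3.0,  0.0],
              [5.0,  1.0]])

# Mean-centering: subtract the vector of column-wise means from every data point.
X_centered = X - X.mean(axis=0)

# Non-negative features: add the absolute value of the most negative entry
# of each feature to that feature (features with no negative entry are unchanged).
shift = np.abs(np.minimum(X.min(axis=0), 0.0))
X_nonneg = X + shift

print(X_centered)
print(X_nonneg)
```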
Setup and Initialization Issues
• Feature preprocessing:
• Feature Normalization: A common type of normalization is to
divide each feature value by its standard deviation.
• When this type of feature scaling is combined with mean-centering,
the data is said to have been standardized.
• Each feature is presumed to have been drawn from a standard
normal distribution with zero mean and unit variance.
• Min-max normalization: useful when the data needs to be scaled in
the range (0,1)
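A short sketch of standardization and min-max normalization on a made-up matrix `X`, assuming no feature is constant (otherwise the divisions below would be by zero).

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Standardization: mean-center each feature, then divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_standardized = (X - mean) / std

# Min-max normalization: map each feature linearly so its minimum becomes 0
# and its maximum becomes 1.
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

print(X_standardized)
print(X_minmax)
```

In practice the statistics (`mean`, `std`, `x_min`, `x_max`) would be computed on the training data and reused unchanged for validation and test points.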
Setup and Initialization Issues
• Feature preprocessing:
• Whitening: The axis-system is rotated to create a new set of de-correlated
features, each of which is scaled to unit variance.
• Typically, principal component analysis is used to achieve this goal.
• Principal component analysis can be viewed as the application of singular
value decomposition after mean-centering a data matrix (i.e., subtracting
the mean from each column).
• Let D be an n × d data matrix that has already been mean-centered.
Setup and Initialization Issues
• Feature preprocessing:
• Whitening Steps used for each data point:
i. The mean of each column is subtracted from the corresponding
feature;
ii. Each d-dimensional row vector representing a training data
point (or test data point) is post-multiplied with P (the d × k matrix whose
columns are the top-k eigenvectors of the covariance matrix of D) to create a
k-dimensional row vector;
iii. Each feature of this k-dimensional representation is divided by
the square-root of the corresponding eigenvalue.
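The three whitening steps can be sketched as below. The definition of P is not visible in the extracted slides, so this sketch assumes the usual PCA construction: P holds the top-k eigenvectors of the covariance matrix C = DᵀD / n of the mean-centered data. The toy data and the choice k = 2 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated data: n = 500 points, d = 3 features.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.2, 0.3]])
D = rng.normal(size=(500, 3)) @ A

k = 2
mu = D.mean(axis=0)
Dc = D - mu                                  # mean-centered data matrix
C = Dc.T @ Dc / Dc.shape[0]                  # d x d covariance matrix (assumed definition)
eigvals, eigvecs = np.linalg.eigh(C)         # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
P = eigvecs[:, order]                        # d x k matrix of top-k eigenvectors
lam = eigvals[order]

def whiten(x):
    """Apply steps (i)-(iii) to one row vector or to a matrix of row vectors."""
    centered = x - mu                        # (i) subtract the column means
    projected = centered @ P                 # (ii) post-multiply with P
    return projected / np.sqrt(lam)          # (iii) divide by sqrt of each eigenvalue

W = whiten(D)
print(np.round(W.T @ W / W.shape[0], 6))     # covariance of the whitened training data
```

The same `mu`, `P`, and `lam` computed from the training data are applied to test points.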
Initialization Issues
• Initialization is particularly important in neural networks because of
the stability issues associated with neural network training.
• One possible approach to initialize the weights is to generate random
values from a Gaussian distribution with zero mean and a small
standard deviation.
• The problem with this initialization is that it is not sensitive to the number of inputs
to a specific neuron.
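A small experiment illustrating the problem, under stated assumptions: inputs are drawn with unit variance, and the fixed standard deviation 0.05 is an arbitrary "small" value. The comparison with a standard deviation of 1/√r (r = number of inputs) is a commonly used remedy that is not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.05  # fixed small standard deviation (illustrative value)

def preactivation_std(num_inputs, weight_std, trials=2000):
    """Empirical std of a neuron's weighted input sum over random inputs and weights."""
    x = rng.normal(size=(trials, num_inputs))                    # unit-variance inputs
    w = rng.normal(scale=weight_std, size=(trials, num_inputs))  # Gaussian weights
    return (x * w).sum(axis=1).std()

for r in (10, 100, 1000):
    fixed = preactivation_std(r, sigma)               # grows roughly like sigma * sqrt(r)
    scaled = preactivation_std(r, 1.0 / np.sqrt(r))   # stays roughly constant
    print(r, round(float(fixed), 3), round(float(scaled), 3))
```

With a fixed standard deviation, neurons with many inputs receive much larger sums than neurons with few inputs, which is the sensitivity issue the bullet refers to.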
Gradient Descent Strategies
Gradient Descent Algorithm
Gradient Descent Algorithm with
Backpropagation
Gradient Descent Strategies
• A low learning rate used early on will cause the algorithm to take
too long to come even close to an optimal solution.
• A large initial learning rate will allow the algorithm to come
reasonably close to a good solution at first, but it may then oscillate
around that point or diverge if the high rate is maintained.
• In either case, maintaining a constant learning rate is not ideal.
• Allowing the learning rate to decay over time can naturally achieve
the desired learning-rate adjustment to avoid these challenges.
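Two commonly used decay schedules, given here as a sketch. The slides do not name a specific schedule, so the function names, the initial rate `alpha0`, and the decay constant `k` are assumptions for illustration.

```python
import math

def exponential_decay(alpha0, k, t):
    """Learning rate at step t: alpha0 * exp(-k * t)."""
    return alpha0 * math.exp(-k * t)

def inverse_decay(alpha0, k, t):
    """Learning rate at step t: alpha0 / (1 + k * t)."""
    return alpha0 / (1.0 + k * t)

for t in (0, 10, 100):
    print(t, exponential_decay(0.1, 0.05, t), inverse_decay(0.1, 0.05, t))
```

Both start at `alpha0` and shrink over time, so early steps are large and later steps are small.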
First-order GD optimizer methods
• Momentum
• Nesterov Accelerated Gradient Momentum
• AdaGrad
• RMSProp
• AdaDelta
• Adam
Considering an exponentially weighted average
Momentum-based Gradient Descent
Update Rule:
• Due to the momentum, the optimizer may overshoot a bit, then come
back, overshoot again, and oscillate like this many times before
stabilizing at the minimum.
• This is why it is good to have a bit of friction in the system: it gets rid of
these oscillations and thus speeds up convergence.
• So the momentum hyperparameter β must be set between 0 (high
friction) and 1 (low friction).
• Typically, β = 0.9.
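The update rule itself did not survive in the extracted slides. The sketch below uses one common form of momentum, v ← βv − α∇L(w) followed by w ← w + v, which may differ in detail from the form on the original slide. The quadratic test function, the learning rate, and the step count are illustrative assumptions.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([1.0, 25.0]) * w

def momentum_gd(w0, alpha=0.02, beta=0.9, steps=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)                  # velocity: decayed sum of past gradient steps
    for _ in range(steps):
        v = beta * v - alpha * grad(w)    # beta acts as friction on the old velocity
        w = w + v
    return w

w_momentum = momentum_gd([1.0, 1.0])              # beta = 0.9
w_plain = momentum_gd([1.0, 1.0], beta=0.0)       # beta = 0 reduces to plain gradient descent

print(np.linalg.norm(w_momentum), np.linalg.norm(w_plain))
```

On this toy problem the momentum run ends closer to the minimum at the origin than plain gradient descent with the same learning rate and step budget.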
[Look ahead before Leap]
Nesterov Momentum/Nesterov Accelerated Gradient (NAG)
Ill-conditioning
• AdaGrad intuition:
• Decay the learning rate for parameters in proportion to their update
history (more updates means more decay).
• Scaling the derivative inversely with √A_i, where A_i is the aggregate squared
partial derivative for parameter i, encourages faster relative
movements along gently sloping directions.
• Absolute movements tend to slow down prematurely. As a result, AdaGrad may not converge.
• Cons: AdaGrad decays the learning rate very aggressively (as the
denominator grows).
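A minimal AdaGrad sketch consistent with the intuition above: A_i accumulates squared partial derivatives and each derivative is scaled by 1/√A_i. The small constant `eps`, the learning rate, and the toy quadratic loss are assumptions added for illustration.

```python
import numpy as np

def adagrad(grad_fn, w0, alpha=0.5, eps=1e-8, steps=100):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)                    # A_i: running sum of squared partial derivatives
    for _ in range(steps):
        g = grad_fn(w)
        A += g ** 2                         # the denominator only ever grows
        w -= alpha * g / np.sqrt(A + eps)   # per-parameter scaled step
    return w, A

# Toy loss L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2): steep along w[1], gentle along w[0].
w_final, A_final = adagrad(lambda w: np.array([1.0, 25.0]) * w, [1.0, 1.0])

print(w_final, A_final)
```

The steep direction accumulates a much larger A_i, so its step is scaled down more, which lets the gently sloping direction make comparable relative progress.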
AdaDelta Optimizer
• The learning rate α is replaced with a value (δ) that depends on the previous incremental updates. In each update,
the value of Δw_i is the increment in the value of w_i.
• Update Rule:
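The equation itself is missing from the extracted slides. The sketch below follows one widely used formulation of AdaDelta, which may differ in detail from the original slide: a decayed average A of squared gradients, a decayed average δ of squared increments, and the increment Δw = −√(δ + ε) / √(A + ε) · gradient. The decay rate `rho`, the constant `eps`, and the toy loss are assumptions.

```python
import numpy as np

def adadelta(grad_fn, w0, rho=0.95, eps=1e-6, steps=500):
    w = np.array(w0, dtype=float)
    A = np.zeros_like(w)        # decayed average of squared gradients
    delta = np.zeros_like(w)    # decayed average of squared increments (replaces alpha)
    for _ in range(steps):
        g = grad_fn(w)
        A = rho * A + (1 - rho) * g ** 2
        dw = -np.sqrt(delta + eps) / np.sqrt(A + eps) * g   # increment for this update
        delta = rho * delta + (1 - rho) * dw ** 2
        w = w + dw
    return w

# Toy loss L(w) = 0.5 * (w[0]**2 + 25 * w[1]**2), starting from (1, 1).
w_final = adadelta(lambda w: np.array([1.0, 25.0]) * w, [1.0, 1.0])
print(w_final)
```

No global learning rate appears in the update: the step size is set entirely by the ratio of the two running averages.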
Generalization Issues in Model Tuning and Evaluation
• A given data set should always be divided into three parts defined according to
the way in which the data are used:
1. Training data: the part of the data used to build the training model.
2. Validation data: the part of the data used for model tuning.
• Strictly speaking, the validation data is also a part of the training data, because it influences
the final model.
3. Testing data: the part of the data used to test the accuracy of the final (tuned)
model.
• To prevent overfitting, it is important that the testing data are not even looked at during the process
of parameter tuning and model selection.
• The testing data are used only once at the very end of the process.
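The three-way division of a data set into training, validation, and testing parts can be sketched as an index split. The function name and the 60/20/20 proportions are illustrative assumptions, not values from the slides.

```python
import numpy as np

def three_way_split(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Return disjoint index arrays (train, validation, test) over n data points."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                 # shuffle once so the parts are random
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test_idx = idx[:n_test]                  # set aside; used once at the very end
    val_idx = idx[n_test:n_test + n_val]     # used for hyperparameter tuning
    train_idx = idx[n_test + n_val:]         # used to fit the model
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```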
Holdout
• In the hold-out method, a fraction of the instances are used to build
the training model.
• The remaining instances, which are also referred to as the held-out
instances, are used for testing.
• The accuracy of predicting the labels of the held-out instances is then
reported as the overall accuracy.
Holdout
• Pros:
• Such an approach ensures that the reported accuracy is not a result of
overfitting to the specific data set, because different instances are used for
training and testing.
• Simple and efficient
• Cons:
• Underestimates the true accuracy.
• Pessimistic bias in evaluation when the class distribution differs between the training and held-out instances.
Cross Validation
• In the cross-validation method, the labeled data is divided into q
equal segments.
• One of the q segments is used for testing, and the remaining (q−1)
segments are used for training.
• This process is repeated q times by using each of the q segments as
the test set.
• The average accuracy over the q different test sets is reported.
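A sketch of q-fold cross-validation. The callbacks `fit` and `accuracy` are hypothetical placeholders for whatever model and metric are being evaluated.

```python
import numpy as np

def q_fold_indices(n, q, seed=0):
    """Yield (train_idx, test_idx) for each of the q folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), q)   # q (nearly) equal segments
    for i in range(q):
        test_idx = folds[i]                          # segment i is the test set
        train_idx = np.concatenate([folds[j] for j in range(q) if j != i])
        yield train_idx, test_idx

def cross_validate(X, y, q, fit, accuracy):
    """Average accuracy over the q test segments."""
    scores = []
    for train_idx, test_idx in q_fold_indices(len(X), q):
        model = fit(X[train_idx], y[train_idx])
        scores.append(accuracy(model, X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```

Every labeled instance is used for testing exactly once and for training q−1 times.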
Ensemble Methods
• Ensemble methods derive their inspiration from the bias-variance
trade-off.
• One way of reducing the error of a classifier is to find a way to reduce
either its bias or the variance without affecting the other component.
• Ensemble methods are used commonly in machine learning.
• Examples:
• Bagging: variance reduction
• Boosting: bias reduction
Ensemble Methods
• Most ensemble methods in neural networks are focused on variance
reduction.
• This is because neural networks are valued for their ability to build arbitrarily complex
models in which the bias is relatively low.
• However, operating at the complex end of the bias-variance trade-off almost
always leads to higher variance, which is manifested as overfitting.
• Therefore, the goal of most ensemble methods in the neural network
setting is variance reduction (i.e., better generalization).
Ensemble Methods: Bagging
• In bagging, the training data is sampled with replacement.
• The sample size s may be different from the training data size n,
although it is common to set s to n.
• When s = n, the resampled data will contain duplicates, and a fraction of about (1 − 1/n)^n ≈ 1/e of
the original data set will not be included at all. Here, the notation e denotes the
base of the natural logarithm.
• A model is constructed on the resampled training data set, and each test instance
is predicted with the model built on the resampled data.
• The entire process of resampling and model building is repeated m times.
Ensemble Methods: Bagging
• For a given test instance, each of these m models is applied to the test data.
• The predictions from different models are then averaged to yield a single robust
prediction.
• In bagging, the best results are often obtained by choosing values of s much smaller than n (s ≪ n).
Challenges:
• The main challenge in directly using bagging for neural networks is that one must construct
multiple training models, which is highly inefficient unless the models are trained in parallel on multiple GPUs.
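A sketch of bagging under simplifying assumptions: a linear least-squares fit stands in for the (much more expensive) neural network, and the function name `bagged_predict` is hypothetical. The first part checks empirically that about 1/e of the points are left out of a bootstrap sample of size s = n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Bootstrap sample of size s = n, drawn with replacement.
sample = rng.integers(0, n, size=n)
left_out = 1.0 - np.unique(sample).size / n
print(left_out, 1 / np.e)   # both should be close to 0.37

def bagged_predict(X_train, y_train, X_test, m=25, s=None):
    """Average the predictions of m models, each fit on its own bootstrap sample."""
    s = len(X_train) if s is None else s
    preds = []
    for _ in range(m):
        idx = rng.integers(0, len(X_train), size=s)        # sample with replacement
        coef, *_ = np.linalg.lstsq(X_train[idx], y_train[idx], rcond=None)
        preds.append(X_test @ coef)                        # prediction of this model
    return np.mean(preds, axis=0)                          # averaged prediction
```

The m model fits are independent of one another, which is why they can be run in parallel when the hardware allows it.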
Ensemble Methods: Subsampling
• Subsampling is similar to bagging, except that the different models are
constructed on the samples of the data created without replacement.
• The predictions from the different models are averaged.
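• In this case, it is essential to choose s < n, because sampling all n points without replacement would reproduce the full training set every time.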
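A brief sketch of why the sample size matters when sampling without replacement. The sizes n = 1000 and s = 600 are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 1000, 600

# Subsample of size s < n: no duplicates, and different runs give different subsets.
subsample_a = rng.choice(n, size=s, replace=False)
subsample_b = rng.choice(n, size=s, replace=False)

# With s = n, sampling without replacement is just a reordering of the full data set.
full = rng.choice(n, size=n, replace=False)

print(np.unique(subsample_a).size)                    # s distinct indices
print(np.array_equal(np.sort(full), np.arange(n)))    # the whole data set, every time
```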
Ensemble Methods: Parametric Model
Selection and Averaging
• The presence of a large number of hyperparameters creates problems in model
construction, because the performance might be sensitive to the particular
configuration used.
• One possibility is to hold out a portion of the training data and try different
combinations of parameters and model choices.
• The selection that provides the highest accuracy on the held-out portion of the
training data is then used for prediction.
• This is the standard approach used for parameter tuning in all machine learning
models, and is referred to as model selection/bucket-of-models.
Ensemble Methods: Randomized Connection
Dropping
• The random dropping of connections between different layers in a multilayer
neural network often leads to diverse models in which different combinations of
features are used to construct the hidden variables.
• The dropping of connections between layers does tend to create less powerful
models because of the addition of constraints to the model-building process.
• However, since different random connections are dropped from different
models, the predictions from different models are very diverse.
• The averaged prediction from these different models is often highly accurate.
Note: The weights of different models are not shared in this approach.