Artificial Neural Network Notes
Activation Functions
Activation functions are critical components in artificial neural networks (ANNs).
They introduce non-linearity into the network, enabling it to learn and model complex
data patterns. They determine whether a neuron should be activated based on the
weighted sum of its inputs. The following are some activation functions:
Sigmoid Activation Function
The sigmoid activation function converts input values into a probability
between 0 and 1, making it suitable for binary classification problems. The
Sigmoid function provides a smooth gradient, which prevents abrupt changes in
the output. However, it has a significant drawback: the output values saturate and
kill gradients for very high or low input values, leading to the vanishing gradient
problem. This issue makes it challenging for the network to learn effectively
during training. Despite this, Sigmoid is often used in the output layer of binary
classification networks but is rarely used in hidden layers. It is defined by the
equation below:
σ(x) = 1 / (1 + e^(-x))        (1)
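As a quick illustration, a minimal NumPy sketch of Equation (1); the function name and sample inputs are illustrative, not part of the original notes:

import numpy as np

def sigmoid(x):
    # Sigmoid activation, a direct transcription of Equation (1).
    return 1.0 / (1.0 + np.exp(-x))

# Inputs far below zero map near 0, inputs far above zero map near 1, and 0 maps to 0.5.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]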
Tanh (Hyperbolic Tangent) Activation Function
This activation function has output values in the range of -1 to 1. This
function is centered around zero, making it easier to model inputs with strongly
negative, neutral, or strongly positive values. The zero-centered nature of Tanh
can lead to faster convergence during training. However, similar to the Sigmoid
function, Tanh is susceptible to the gradient vanishing problem for very high or
low input values. The Tanh function is often used in hidden layers of neural
networks, especially in problems where the inputs have a mean close to zero.
The Tanh activation function is defined as:
tanh(x) = 2 / (1 + e^(-2x)) - 1        (2)
where:
x is the input to the neuron.
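A corresponding sketch of Equation (2), with NumPy's built-in tanh used only as a cross-check; the sample inputs are illustrative:

import numpy as np

def tanh(x):
    # Tanh activation as written in Equation (2): 2 / (1 + e^(-2x)) - 1.
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))     # ~[-0.964, 0.0, 0.964]
print(np.tanh(x))  # NumPy's built-in tanh gives the same values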
Swish Activation Function
The Swish activation function combines linearity and non-linearity, providing
smooth gradients. It has been shown to outperform ReLU in many deep learning
tasks. Despite introducing a small computational overhead due to the sigmoid
component, Swish demonstrates superior performance in various applications.
This function is increasingly used in deep learning models, particularly in
architectures like MobileNetV3 and EfficientNet, due to its ability to enhance the
learning capability of neural networks. The equation used by this activation
function is shown below:
f(x) = x ∗ σ(x)        (3)
where:
σ(x) is the sigmoid function given in Equation (1).
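A minimal sketch of Equation (3); the function names and sample inputs are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # Swish activation, Equation (3): the input scaled by its own sigmoid.
    return x * sigmoid(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))  # strongly negative inputs are damped toward 0, positives pass nearly unchanged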
Loss Function
The loss function, or cost function, measures the difference between the
predicted values and the actual target values. It quantifies how well the network
is performing. The goal of training is to minimize the value of the loss function,
indicating that the network's predictions are close to the actual values. Common
loss functions include:
a. Mean Squared Error (MSE), which is used for regression tasks. It calculates
the average of the squares of the errors (the differences between predicted
and actual values).
b. Cross-Entropy Loss, which is used for classification tasks. It measures the
performance of a classification model whose output is a probability value
between 0 and 1.
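The two losses can be sketched in a few lines of NumPy (here, cross-entropy is shown in its binary form); the sample targets and predicted probabilities are made-up values for illustration:

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared differences (regression).
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy for predicted probabilities in (0, 1).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])   # actual targets (made-up example)
y_pred = np.array([0.9, 0.2, 0.8, 0.6])   # predicted probabilities
print(mse(y_true, y_pred))                   # 0.0625
print(binary_cross_entropy(y_true, y_pred))  # ~0.266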
Backpropagation
Backpropagation is the process of updating the weights in the network to
minimize the loss. It involves calculating the gradient of the loss function with
respect to each weight using the chain rule of calculus. The steps are:
1. Calculate the Loss Gradient to determine how much the loss function
would change if each weight were adjusted.
2. Propagate the Gradient Backward: starting from the output layer,
propagate the error gradients backward through the network layers.
3. Update the Weights: adjust the weights in the direction that reduces the
loss. This adjustment is proportional to the learning rate, a
hyperparameter that controls the size of the steps taken in gradient descent.
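These three steps can be traced in a toy example. The sketch below (illustrative, not taken from the notes) trains a small 2-4-1 sigmoid network on XOR with MSE loss and plain gradient descent; the layer sizes, learning rate, and epoch count are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy 2-4-1 network trained on XOR.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(0, 1, (2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0, 1, (4, 1))
b2 = np.zeros((1, 1))
lr = 1.0

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)       # hidden activations
    y_hat = sigmoid(h @ W2 + b2)   # network output

    # Steps 1 and 2: gradient of the loss, propagated backward with the chain rule
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # gradient at the output pre-activation
    d_hid = (d_out @ W2.T) * h * (1 - h)        # gradient at the hidden pre-activation

    # Step 3: adjust weights against the gradient, scaled by the learning rate
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_hid)
    b1 -= lr * d_hid.sum(axis=0, keepdims=True)

print(np.round(y_hat, 2))  # typically close to [[0], [1], [1], [0]] after training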
Optimization Algorithms
Optimization algorithms are methods used to update the weights to
minimize the loss function. Common optimization algorithms include:
1. Gradient Descent updates weights by moving in the direction of the
negative gradient of the loss function. Its main variants are Stochastic
Gradient Descent (SGD), which updates weights using one training
sample at a time and thus introduces randomness into the process, and
Mini-Batch Gradient Descent, which uses a small random subset of the
data (a mini-batch) for each update, balancing the efficiency of batch
gradient descent with the noise of stochastic gradient descent.
2. RMSprop (Root Mean Square Propagation) adapts the learning rate for
each parameter by dividing the gradient by a moving average of recent
gradients' magnitudes.
3. Adam (Adaptive Moment Estimation) combines the advantages of two
other extensions of stochastic gradient descent, AdaGrad and RMSProp.
It maintains an adaptive learning rate for each parameter by computing
first and second moments of the gradients.
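As a concrete illustration of the adaptive-moment idea, a minimal sketch of one Adam update applied to a one-dimensional quadratic; the hyperparameter defaults follow common practice and the test function is made up:

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update: first moment m, second moment v, both bias-corrected.
    m = beta1 * m + (1 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w) starting from w = 3.0
w, m, v = 3.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(round(w, 4))  # ends up close to 0, the minimum of f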
Hyperparameters
Hyperparameters are critical configuration settings in artificial neural networks
(ANNs) that are set before training and remain constant throughout the training process.
They control the learning process and significantly impact the model's performance.
Unlike model parameters, which are learned during training, hyperparameters need to
be specified by the practitioner. Key hyperparameters include learning rate, batch size,
number of epochs, and others such as momentum and regularization terms.
Learning Rate
The learning rate determines the step size at each iteration while moving
toward a minimum of the loss function. It controls how much to adjust the weights
with respect to the loss gradient. A high learning rate leads to faster convergence
but risks overshooting the optimal solution, causing the loss to oscillate or
diverge. A low learning rate allows more precise adjustments but can
result in slow convergence, getting stuck in local minima, or prolonged training
times. Choosing an appropriate learning rate is crucial for efficient and effective training.
Techniques like learning rate schedules, which reduce the learning rate as
training progresses, and adaptive learning rate methods (e.g., Adam, RMSprop)
are commonly used.
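A learning rate schedule can be as simple as the sketch below, which shrinks the rate by a fixed factor every few epochs; the decay constants are illustrative, not recommended values:

def exponential_decay(initial_lr, epoch, decay_rate=0.96, decay_every=10):
    # Shrink the learning rate by decay_rate once every decay_every epochs.
    return initial_lr * (decay_rate ** (epoch // decay_every))

for epoch in (0, 10, 50, 100):
    print(epoch, round(exponential_decay(0.1, epoch), 5))
# 0 0.1 -> 10 0.096 -> 50 0.08154 -> 100 0.06648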
Batch Size
Batch size is the number of training samples processed before the model's
internal parameters are updated. It influences the stability and efficiency of the
training process. A small batch size provides more frequent updates, which can
lead to faster learning and better generalization, but it introduces more noise
into the gradient estimate, which can make training less stable. A large batch
size reduces the variance in gradient estimates, leading to more stable and
accurate updates, but it requires more memory and can result in slower
convergence due to less frequent updates. Common practice involves
experimenting with different batch sizes to find a balance between training speed
and stability.
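A minimal sketch of how mini-batches are typically drawn from a dataset; the array shapes and batch size are arbitrary examples:

import numpy as np

def minibatches(X, y, batch_size, rng):
    # Yield shuffled mini-batches; each has `batch_size` samples (the last may be smaller).
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)
for xb, yb in minibatches(X, y, batch_size=4, rng=rng):
    print(xb.shape, yb.shape)  # batches of 4, 4, and 2 samples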
Number of Epochs
Epochs represent the number of complete passes through the entire
training dataset. Each epoch involves training the model on all the training data
once. Using too few epochs can result in an underfit model that fails to capture
the underlying patterns in the data, while using too many epochs may lead to an
overfitted model that learns noise and details in the training data that do not
generalize well to new data. Techniques like early stopping, where training is
halted once the performance on a validation set stops improving, help prevent
overfitting.
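A minimal sketch of an early-stopping check based on a recorded validation-loss history; the patience value and loss numbers are made up for illustration:

def should_stop(val_losses, patience=3):
    # Stop once the validation loss has not improved for `patience` consecutive epochs.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

history = [0.90, 0.70, 0.55, 0.52, 0.53, 0.54, 0.55]  # made-up validation losses
print(should_stop(history))  # True: no improvement over the last 3 epochs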
Momentum
Momentum is a hyperparameter that helps accelerate gradient descent
algorithms by dampening oscillations. It achieves this by adding a fraction of the
previous weight update to the current update. A high momentum value leads to
faster convergence by maintaining the direction of previous updates, reducing the
likelihood of getting stuck in local minima, while a low momentum value gives
smoother updates but may slow down convergence. Typical values for
momentum are around 0.9, but it can be fine-tuned for the specific problem.
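A minimal sketch of an SGD-with-momentum update applied to a one-dimensional quadratic; the learning rate and momentum values are illustrative:

def momentum_step(w, grad, velocity, lr=0.05, momentum=0.9):
    # Keep a fraction of the previous update direction and add the new gradient step.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Minimize f(w) = w^2 (gradient 2w); the velocity term smooths the descent.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)
print(round(w, 6))  # close to 0 after 200 steps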
Regularization
Regularization techniques are used to prevent overfitting by adding a
penalty to the loss function based on the complexity of the model. Common
regularization methods include L1 regularization (Lasso), which adds the
absolute values of the weights to the loss function and encourages sparsity,
driving many weights to exactly zero, and L2 regularization (Ridge), which adds
the squared magnitudes of the weights to the loss function and encourages
smaller weights, leading to simpler models. The regularization strength is
controlled by a hyperparameter that determines the extent of the penalty applied;
balancing this hyperparameter is crucial to prevent underfitting or overfitting.
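Both penalties can be written in a couple of lines; the weight vector, data loss, and regularization strength below are made-up values for illustration:

import numpy as np

def l1_penalty(weights, lam=0.01):
    # L1 (Lasso): lambda times the sum of absolute weight values.
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam=0.01):
    # L2 (Ridge): lambda times the sum of squared weight values.
    return lam * np.sum(weights ** 2)

w = np.array([0.5, -1.5, 0.0, 2.0])   # example weight vector
data_loss = 0.30                      # made-up loss from the data term
print(round(data_loss + l1_penalty(w), 3))  # 0.34
print(round(data_loss + l2_penalty(w), 3))  # 0.365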
Dropout
Dropout is a regularization technique where randomly selected neurons
are ignored during training. This prevents neurons from co-adapting too much.
The dropout rate is the fraction of neurons to drop during training, with common
values ranging from 0.2 to 0.5. A high dropout rate creates a strong
regularization effect, which can lead to underfitting if too many neurons are
dropped, while a low dropout rate applies less regularization, potentially leading
to overfitting. Dropout is typically applied to hidden layers and sometimes the
input layer, but not to the output layer.
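A minimal sketch of (inverted) dropout, the common variant that rescales the surviving activations so their expected value is unchanged; the dropout rate and array shape are illustrative:

import numpy as np

def dropout(activations, rate, rng, training=True):
    # Inverted dropout: zero out a fraction `rate` of units and rescale the rest
    # so the expected activation value is unchanged; a no-op at inference time.
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((2, 8))                   # pretend hidden-layer activations
print(dropout(h, rate=0.5, rng=rng))  # roughly half the entries are 0, the rest are 2.0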
Weight Initialization
Initial values of the weights can significantly impact the training process.
Proper initialization helps in achieving faster convergence and avoiding issues
like exploding or vanishing gradients. Random initialization assigns small random
values to the weights, often scaled based on the number of input and output
neurons (e.g., Xavier or He initialization). Zero initialization is generally avoided
because it leads to symmetry problems where all neurons in a layer learn the
same features.
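Minimal sketches of Xavier and He initialization for a fully connected layer; the layer sizes are arbitrary:

import numpy as np

def xavier_init(n_in, n_out, rng):
    # Xavier/Glorot uniform initialization: limit depends on fan-in and fan-out.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out, rng):
    # He initialization (commonly paired with ReLU): std of sqrt(2 / fan-in).
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
print(xavier_init(256, 128, rng).std())  # ~sqrt(2 / (256 + 128)) ~ 0.072
print(he_init(256, 128, rng).std())      # ~sqrt(2 / 256) ~ 0.088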
ANN Architectures
Artificial Neural Networks (ANNs) come in various architectures, each suited for
different types of tasks and data. Understanding the different architectures and their
applications is essential for designing effective neural network models.
Autoencoders
Autoencoders are a type of neural network used for unsupervised
learning. They aim to learn a compressed representation of input data. An
autoencoder consists of an encoder that maps input data to a latent-space
representation and a decoder that reconstructs the input data from the latent
representation. Autoencoders are characterized by their ability to perform
dimensionality reduction, feature learning, and data denoising. They have a
symmetric structure with encoder and decoder layers. Applications of
autoencoders include anomaly detection, data compression, and feature
extraction, as well as image denoising and generation of new data similar to the
training set.
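A minimal sketch of an autoencoder in NumPy, kept linear for brevity (a practical autoencoder would normally use non-linear activations); the dimensions, learning rate, and epoch count are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

# Linear autoencoder: 8-dimensional inputs compressed to a 2-dimensional latent code.
n_in, n_latent = 8, 2
X = rng.normal(size=(200, n_in))               # toy dataset

W_enc = rng.normal(0, 0.1, (n_in, n_latent))   # encoder weights
W_dec = rng.normal(0, 0.1, (n_latent, n_in))   # decoder weights
lr = 0.05

for _ in range(1000):
    Z = X @ W_enc              # encode: map input to the latent space
    X_hat = Z @ W_dec          # decode: reconstruct the input from the latent code
    err = X_hat - X            # reconstruction error
    # Gradients of the mean squared reconstruction loss for both weight matrices
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(np.mean((X - (X @ W_enc) @ W_dec) ** 2))  # reconstruction MSE after training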