Artificial Neural Network Notes


Artificial Neural Network

Artificial Neural Networks (ANNs) are computing systems inspired by the
biological neural networks that constitute animal brains. They are designed to recognize
patterns, learn from data, and make predictions. ANNs are a subset of machine learning
algorithms and form the basis of deep learning.

Neurons and Layers


Neurons, or nodes, are the fundamental units of an ANN, analogous to biological
neurons. Each neuron consists of input values, weights, a bias, and an activation function.
It receives inputs, multiplies them by their weights, adds the bias, passes the resulting
weighted sum through the activation function, and produces an output.
Neurons are organized into layers. There are three main types of layers:
• Input Layer – The first layer, which receives the initial input data. Each neuron
in this layer represents a feature of the input data.
• Hidden Layers – One or more intermediate layers where data transformation
occurs. The network learns to recognize patterns and features through these
layers. Adding hidden layers can increase learning capacity; networks with
many hidden layers are referred to as "deep" networks.
• Output Layer – The final layer, which produces the output of the network. The
number of neurons in this layer corresponds to the number of desired output
values.
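
The computation performed by a single neuron, as described above, can be sketched in a few lines of Python. This is a minimal illustration only: the function name `neuron_output`, the example numbers, and the choice of sigmoid as the activation are assumptions made for the sketch, not part of the original notes.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values: three input features feeding one neuron.
out = neuron_output([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], bias=0.1)
print(out)
```

Here the weighted sum is 0.2, so the neuron outputs roughly 0.55.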

Activation Functions
Activation functions are critical components in artificial neural networks (ANNs).
They introduce non-linearity into the network, enabling it to learn and model complex
data patterns. They determine whether a neuron should be activated based on the
weighted sum of its inputs. The following are some activation functions:
Sigmoid Activation Function
The sigmoid activation function maps any real-valued input to a value
between 0 and 1 that can be interpreted as a probability, making it suitable for binary
classification problems. The Sigmoid function has a smooth gradient, which prevents abrupt changes in
the output. However, it has a significant drawback: the output values saturate and
kill gradients for very high or low input values, leading to the vanishing gradient
problem. This issue makes it challenging for the network to learn effectively
during training. Despite this, Sigmoid is often used in the output layer of binary
classification networks but is rarely used in hidden layers. It is defined by the
equation below:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
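
The saturation described above can be seen numerically. The sketch below assumes the standard derivative of the sigmoid, s(x)(1 - s(x)), and the helper names `sigmoid` and `sigmoid_grad` are illustrative. The two-branch form is one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """Sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at 0 and shrinks quickly as |x| grows.
for x in (0.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The gradient peaks at 0.25 for x = 0 and is already below 0.0001 at x = 10, which is the vanishing gradient effect.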
Tanh (Hyperbolic Tangent) Activation Function
This activation function has output values in the range of -1 to 1. This
function is centered around zero, making it easier to model inputs with strongly
negative, neutral, or strongly positive values. The zero-centered nature of Tanh
can lead to faster convergence during training. However, similar to the Sigmoid
function, Tanh is susceptible to the vanishing gradient problem for very high or
low input values. The Tanh function is often used in hidden layers of neural
networks, especially in problems where the inputs have a mean close to zero.
The Tanh activation function is defined as:
$$\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{2}$$
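
As a quick check, the exponential form of Tanh, 2 / (1 + e^(-2x)) - 1, can be compared with Python's built-in `math.tanh`. The function name `tanh_from_exp` is an illustrative assumption.

```python
import math

def tanh_from_exp(x):
    """tanh written as 2 / (1 + e^(-2x)) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

# Outputs lie in (-1, 1) and are symmetric around zero.
for x in (-2.0, 0.0, 2.0):
    print(x, tanh_from_exp(x), math.tanh(x))
```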

ReLU (Rectified Linear Unit) Activation Function


The ReLU activation function introduces non-linearity while preserving
some properties of linear models. ReLU is simple to compute, leading to efficient
training. Because its gradient is 1 for positive inputs, it also helps mitigate the vanishing
gradient problem, enabling the training of deeper networks. However, ReLU can cause
"dead neurons," which stop learning when their input is always negative, because the
gradient there is zero. Despite this, ReLU is
widely used in hidden layers of most modern neural networks due to its
computational efficiency and ability to scale with network depth. It is defined as
follows:
$$f(x) = \max(0, x) \tag{3}$$
Leaky ReLU Activation Function
The Leaky ReLU activation function addresses the "dying ReLU" problem
by allowing a small, non-zero gradient when the unit is not active. This ensures
that neurons can still learn even when they receive negative inputs. Leaky ReLU
retains many benefits of ReLU while mitigating the issue of dead neurons. It is
commonly used in hidden layers to ensure all neurons continue learning
throughout the training process. The equation used by Leaky ReLU is shown
below, where alpha ($\alpha$) is typically a small constant (e.g., 0.01).
$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases} \tag{4}$$
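
The difference between ReLU and Leaky ReLU is easiest to see in their gradients for negative inputs. The sketch below is illustrative; the function names are assumptions, and the gradient at exactly x = 0 is taken as the negative-side value by convention.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def relu_grad(x):
    # Zero gradient for non-positive inputs: the source of "dead neurons".
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    # A small non-zero gradient keeps the neuron learning for negative inputs.
    return 1.0 if x > 0 else alpha

for x in (-3.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), relu_grad(x), leaky_relu_grad(x))
```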
Softmax Activation Function
The Softmax activation function converts logits (raw model outputs) into
probabilities that sum to 1, making it suitable for multi-class classification
problems. Softmax provides a probabilistic interpretation of class membership. It
is typically used in the output layer of neural networks designed for multi-class
classification, enabling the network to assign a probability to each class with a
single model. The activation function is defined by the equation below:
$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \tag{5}$$

where:
$z_i$ is the input to the $i$-th neuron, and the sum is over all neurons in the
output layer.
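
A minimal Softmax sketch follows. Subtracting the largest logit before exponentiating is a common numerical-stability trick that does not change the result; it is an addition to the formula above, and the function name `softmax` and the example logits are illustrative.

```python
import math

def softmax(logits):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # the probabilities sum to 1 (up to rounding)
```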
Swish Activation Function
The Swish activation function combines linear and non-linear behavior and
provides smooth gradients. It has been reported to match or outperform ReLU on a
number of deep learning tasks, at the cost of a small computational overhead from
its sigmoid component. Swish and its variants are used in architectures such as
MobileNetV3 and EfficientNet. The equation used by this activation function is
shown below:
$$f(x) = x \cdot \sigma(x) \tag{6}$$
where:
$\sigma(x)$ is the sigmoid function.
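
Swish can be written in one line, since x * sigmoid(x) equals x / (1 + e^(-x)). The sketch below is illustrative and is only intended for moderate input values: a very large negative x would overflow `math.exp`.

```python
import math

def swish(x):
    """Swish: x * sigmoid(x), computed as x / (1 + e^(-x))."""
    return x / (1.0 + math.exp(-x))

# Unlike ReLU, Swish is slightly negative for small negative inputs
# and approaches x for large positive inputs.
for x in (-1.0, 0.0, 1.0, 10.0):
    print(x, swish(x))
```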

Neural Network Development


Training an artificial neural network (ANN) involves adjusting the weights and
biases of the connections between neurons to minimize prediction error. This process,
also called learning, is iterative and comprises several essential steps: forward
propagation, loss function calculation, backpropagation, and optimization.
Forward Propagation
Forward propagation is the process of passing input data through the
network to generate an output. During this phase, each layer of the network
transforms the data using its weights, biases, and activation functions. The input
layer receives the raw data and passes it to the first hidden layer. The hidden
layers apply weights and biases to the inputs, process the results through
activation functions, and pass the transformed data to the next layer. The output
layer produces the final prediction by transforming the last hidden layer's output.
The result of forward propagation is a set of predicted values, which will be
compared to the actual values to compute the error.
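
Forward propagation through a small network can be sketched as follows. The network shape (2 inputs, 2 hidden ReLU neurons, 1 sigmoid output), the weight values, and the helper names `dense` and `forward` are all hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math

def dense(inputs, weights, biases, activation):
    """One fully connected layer: weights[j] holds the input weights of neuron j."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
W1, b1 = [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.5]

def forward(x):
    hidden = dense(x, W1, b1, relu)        # hidden layer transforms the input
    return dense(hidden, W2, b2, sigmoid)  # output layer produces the prediction

prediction = forward([1.0, 2.0])
print(prediction)
```

For this input the hidden activations are [0.0, 2.0], and the output is sigmoid(-3.5), roughly 0.029.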

Loss Function
The loss function, or cost function, measures the difference between the
predicted values and the actual target values. It quantifies how well the network
is performing. The goal of training is to minimize the value of the loss function,
indicating that the network's predictions are close to the actual values. Common
loss functions include:
a. Mean Squared Error (MSE), which is used for regression tasks. It calculates
the average of the squares of the errors (the differences between predicted
and actual values).
b. Cross-Entropy Loss, which is used for classification tasks. It measures the
performance of a classification model whose output is a probability value
between 0 and 1, penalizing confident wrong predictions most heavily.
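
Both loss functions can be sketched directly from their definitions. The cross-entropy version below is the binary form, and the small `eps` clipping is an added safeguard against taking the logarithm of zero; the function names and sample values are illustrative.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy; eps keeps log() away from 0."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

print(mse([3.0, 5.0], [2.5, 5.5]))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```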

Backpropagation
Backpropagation is the process of working out how each weight contributed
to the loss so that the weights can be updated to reduce it. It involves calculating the
gradient of the loss function with respect to each weight using the chain rule of
calculus. The steps are:
1. Calculate the loss gradient to determine how much the loss function
would change if each weight were adjusted.
2. Propagate the gradient backward: starting from the output layer, pass
the error gradients backward through the network layers.
3. Adjust the weights in the direction that reduces the loss. This
adjustment is proportional to the learning rate, a hyperparameter that
controls the size of the steps taken in gradient descent.
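
The three steps above can be sketched for the smallest possible case: a single sigmoid neuron with a squared-error loss on one training example. The setup, the variable names, and the numbers are assumptions made for illustration. A finite-difference estimate is included only as a sanity check of the chain-rule gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron on one example."""
    return (sigmoid(w * x + b) - y) ** 2

def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw (and likewise for b)."""
    a = sigmoid(w * x + b)
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)
    return dL_da * da_dz * x, dL_da * da_dz

w, b, x, y, lr = 0.5, 0.1, 2.0, 1.0, 0.1
dw, db = gradients(w, b, x, y)

# Compare against a finite-difference estimate of dL/dw.
h = 1e-6
dw_numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)

# Step 3: move the weights against the gradient, scaled by the learning rate.
w_new, b_new = w - lr * dw, b - lr * db
print(dw, dw_numeric, loss(w, b, x, y), loss(w_new, b_new, x, y))
```

In a multi-layer network the same chain-rule product is extended backward through each layer, reusing the gradients already computed for the layer after it.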

Optimization Algorithms
Optimization algorithms are methods used to update the weights to
minimize the loss function. Common optimization algorithms include:
1. Gradient Descent, which updates weights by moving in the direction of
the negative gradient of the loss function. Batch gradient descent computes
the gradient over the entire training set. Its variants are:
a. Stochastic Gradient Descent (SGD), which updates weights using one
training sample at a time, introducing randomness into the process.
b. Mini-Batch Gradient Descent, which uses a small random subset of the
data (mini-batch) to update the weights, balancing the efficiency of
batch gradient descent with the noise of stochastic gradient descent.
2. Adam (Adaptive Moment Estimation), which combines the advantages
of two other extensions of stochastic gradient descent, AdaGrad and
RMSProp. It maintains an adaptive learning rate for each parameter by
computing estimates of the first and second moments of the gradients.
3. RMSprop (Root Mean Square Propagation), which adapts the learning rate
for each parameter by dividing the gradient by the square root of a moving
average of recent squared gradients.
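
To make the update rules concrete, the sketch below applies plain gradient descent and Adam to a toy one-parameter loss, f(w) = (w - 3)^2. The toy loss, the function names, and the step counts are assumptions for illustration; the Adam defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly quoted ones.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def run_sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w -= lr * grad(w)  # step against the gradient
    return w

def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)         # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

print(run_sgd(0.0), run_adam(0.0))  # both should end close to the minimum at w = 3
```

With a single parameter there is no mini-batch noise, so this shows only the update arithmetic, not the stochastic behavior of SGD.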

Practical Steps in Training ANNs


1. Set initial weights, often randomly. Proper initialization helps in faster
convergence.
2. Define learning rate, batch size, number of epochs, and other training
settings.
3. Pass input data through the network to compute predictions through
forward propagation.
4. Measure the difference between predicted and actual values using the
loss function.
5. Calculate gradients and update weights to minimize the loss through
backpropagation.
6. Repeat the forward propagation, loss calculation, and backpropagation
steps for a set number of epochs or until the loss converges to a
satisfactory level.
7. Assess the model using evaluation metrics on a separate validation set to
ensure it generalizes well to unseen data.
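
The steps above can be sketched end to end on a deliberately tiny problem: fitting a single linear neuron to synthetic data. Everything here (the data, the model, the hyperparameter values) is an assumption chosen to keep the example short; a real network would have more layers and would normally be built with a deep learning library.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic data for illustration: y = 2x + 1 plus a little noise.
data = [(x / 50.0, 2.0 * (x / 50.0) + 1.0 + random.gauss(0.0, 0.05)) for x in range(100)]
random.shuffle(data)
train, val = data[:80], data[80:]

def mse(samples, w, b):
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)

# Steps 1-2: initialise the parameters and choose the hyperparameters.
w, b = random.uniform(-0.1, 0.1), 0.0
lr, batch_size, epochs = 0.1, 16, 200

initial_val_loss = mse(val, w, b)
for epoch in range(epochs):  # Step 6: repeat for a set number of epochs
    random.shuffle(train)
    for i in range(0, len(train), batch_size):
        batch = train[i:i + batch_size]
        # Steps 3-4: forward pass and error for each sample in the batch.
        errors = [(w * x + b - y, x) for x, y in batch]
        # Step 5: gradients of the batch MSE, then a gradient-descent update.
        dw = sum(2.0 * e * x for e, x in errors) / len(batch)
        db = sum(2.0 * e for e, _ in errors) / len(batch)
        w, b = w - lr * dw, b - lr * db

# Step 7: evaluate on data the model was not trained on.
final_val_loss = mse(val, w, b)
print(w, b, initial_val_loss, final_val_loss)
```

The learned `w` and `b` should end near the 2 and 1 used to generate the data, and the validation loss should fall well below its starting value.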

Hyperparameters
Hyperparameters are critical configuration settings in artificial neural networks
(ANNs) that are chosen before training rather than learned from the data. Some, such as
the learning rate, may follow a predefined schedule during training. They control the
learning process and significantly impact the model's performance. Unlike model
parameters, which are learned during training, hyperparameters need to be specified by
the practitioner. Key hyperparameters include learning rate, batch size, number of
epochs, and others such as momentum and regularization terms.
Learning Rate
The learning rate determines the step size at each iteration while moving
toward a minimum of the loss function. It controls how much the weights are adjusted
in response to the loss gradient. A high learning rate can speed up convergence
but risks overshooting the optimal solution, causing the loss to oscillate or
diverge. A low learning rate gives more precise adjustments but can
result in slow convergence, prolonged training times, or getting stuck in poor local
minima. Choosing a suitable learning rate is crucial for efficient and effective training.
Techniques like learning rate schedules, which reduce the learning rate as
training progresses, and adaptive learning rate methods (e.g., Adam, RMSprop)
are commonly used.
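
The effect of the learning rate can be seen on a toy loss, f(w) = w^2, whose minimum is at w = 0. The function name and the three learning rates below are illustrative assumptions.

```python
def gradient_descent(lr, w=1.0, steps=20):
    """Minimise f(w) = w^2, whose gradient is 2w, and return the final |w|."""
    for _ in range(steps):
        w -= lr * 2.0 * w
    return abs(w)

# Too low: slow progress. Moderate: fast convergence. Too high: divergence.
for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With a rate of 0.01 the weight has barely moved after 20 steps, with 0.1 it is close to zero, and with 1.1 each step overshoots so that |w| grows instead of shrinking.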

Batch Size
Batch size is the number of training samples processed before the model's
internal parameters are updated. It influences the stability and efficiency of the
training process. A small batch size provides more frequent updates, which can
lead to faster learning and better generalization. However, it introduces more
noise into the gradient estimate, which can make training less stable. A large
batch size reduces the variance in gradient estimates, leading to more stable and
accurate updates, but it requires more memory and can result in slower
convergence due to less frequent updates. Common practice involves
experimenting with different batch sizes to find a balance between training speed
and stability.
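
The link between batch size and the number of updates per epoch can be sketched with a small batching helper. The name `minibatches` and the ten-sample dataset are hypothetical.

```python
import random

def minibatches(samples, batch_size, seed=0):
    """Yield shuffled mini-batches; the last batch may be smaller."""
    order = list(samples)
    random.Random(seed).shuffle(order)
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]

samples = list(range(10))
for size in (1, 4, 10):
    batches = list(minibatches(samples, size))
    print(size, len(batches))  # number of weight updates per epoch
```

A batch size of 1 gives ten updates per epoch, 4 gives three, and 10 (the full dataset) gives a single update.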

Number of Epochs
The number of epochs is the number of complete passes through the entire
training dataset. Each epoch involves training the model on all the training data
once. Using too few epochs can result in an underfit model that fails to
capture the underlying patterns in the data, while using too many epochs may
lead to an overfit model that learns noise and details in the training data
that do not generalize well to new data. Techniques like early stopping,
where training is halted once the performance on a validation set stops
improving, help prevent overfitting.
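
Early stopping is often implemented with a "patience" counter. The sketch below assumes that convention; the function name, the patience value, and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1

# Made-up validation losses: improvement stalls after epoch 3.
history = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43]
print(early_stopping_epoch(history))
```

Here the best loss occurs at epoch 3, and training stops at epoch 5 after two epochs without improvement. In practice the weights from the best epoch are usually the ones kept.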

Momentum
Momentum is a hyperparameter that helps accelerate gradient descent
algorithms by dampening oscillations. It achieves this by adding a fraction of the
previous weight update to the current update. High momentum can speed up
convergence by maintaining the direction of previous updates, which smooths the
updates and reduces the likelihood of getting stuck in local minima, but a value too
close to 1 can overshoot the minimum. Low momentum makes each update follow the
current gradient more closely, which may slow down convergence. Typical values for
momentum are around 0.9, but it can be fine-tuned based on the specific
problem.
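
The momentum update can be sketched on the toy loss f(w) = w^2. The "velocity" formulation below is one common way to write it; the function name, learning rate, and step count are assumptions chosen for this example, and the comparison holds for these settings rather than in general.

```python
def momentum_descent(beta, lr=0.02, steps=50):
    """Gradient descent with momentum on f(w) = w^2, starting from w = 1."""
    w, velocity = 1.0, 0.0
    for _ in range(steps):
        g = 2.0 * w
        velocity = beta * velocity - lr * g  # keep a fraction of the previous update
        w += velocity
    return abs(w)

# beta = 0.0 is plain gradient descent; beta = 0.9 adds momentum.
print(momentum_descent(beta=0.0), momentum_descent(beta=0.9))
```

With this small learning rate, the run with momentum 0.9 ends closer to the minimum than the run without momentum after the same number of steps.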

Regularization
Regularization techniques are used to prevent overfitting by adding a
penalty to the loss function based on the complexity of the model. Common
regularization methods include L1 Regularization (Lasso), which adds the sum of the
absolute values of the weights to the loss function and encourages sparsity, meaning
more weights are driven to exactly zero, and L2 Regularization (Ridge), which adds the
sum of the squared weights to the loss function and encourages smaller
weights, leading to simpler models. Regularization strength is controlled by a
hyperparameter, which determines the extent of the penalty applied. Balancing
this hyperparameter is crucial to prevent underfitting or overfitting.
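
The two penalties can be sketched as follows. The weight values, the placeholder `data_loss`, and the strength `lam` are hypothetical numbers used only to show how the penalty is added to the loss.

```python
def l1_penalty(weights, lam):
    """L1 (Lasso) penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (Ridge) penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.5, 0.0, 2.0]
data_loss = 0.30  # placeholder standing in for the unregularised loss
lam = 0.01        # regularisation strength (a hyperparameter)

print(data_loss + l1_penalty(weights, lam))
print(data_loss + l2_penalty(weights, lam))
```

Note that the L2 penalty grows faster for large weights (2.0 contributes 4.0 to the sum, versus 2.0 under L1), which is why it pushes large weights down most strongly.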

Dropout
Dropout is a regularization technique where randomly selected neurons
are ignored during training. This prevents neurons from co-adapting too much.
The dropout rate is the fraction of neurons to drop during training. Common values
range from 0.2 to 0.5. A high dropout rate creates a strong
regularization effect, which can lead to underfitting if too many neurons are
dropped. A low dropout rate provides less regularization, potentially leading to
overfitting. Dropout is typically applied to hidden layers and sometimes the input
layer, but not to the output layer.
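
A minimal dropout sketch follows. It uses the "inverted dropout" variant, in which surviving activations are scaled up during training so that nothing needs to change at inference time; that scaling is an assumption beyond what the notes describe, as are the function name and the example values.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each activation with probability `rate` during
    training and scale the survivors by 1 / (1 - rate); identity at inference."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, rate=0.5, rng=rng))         # a mix of 0.0 and 2.0
print(dropout(hidden, rate=0.5, training=False))  # unchanged at inference
```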

Weight Initialization
Initial values of the weights can significantly impact the training process.
Proper initialization helps in achieving faster convergence and avoiding issues
like exploding or vanishing gradients. Random initialization assigns
small random values to the weights, often scaled based on the number of input and
output neurons (e.g., Xavier or He initialization). Zero initialization is generally
avoided because it can lead to symmetry problems where all neurons in a layer
learn the same features.
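
The two scaled schemes mentioned above can be sketched as follows, using their commonly cited forms: Xavier (Glorot) uniform with limit sqrt(6 / (n_in + n_out)), and He normal with standard deviation sqrt(2 / n_in). The function names and the layer sizes are illustrative.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng):
    """Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

def he_normal(n_in, n_out, rng):
    """He normal: mean 0, standard deviation sqrt(2 / n_in); often paired with ReLU."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
W = xavier_uniform(n_in=4, n_out=3, rng=rng)  # 3 neurons, each with 4 input weights
print(len(W), len(W[0]))
```

Because every weight is drawn independently, neurons in the same layer start out different, which avoids the symmetry problem of zero initialization.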

ANN Architectures
Artificial Neural Networks (ANNs) come in various architectures, each suited for
different types of tasks and data. Understanding the different architectures and their
applications is essential for designing effective neural network models.

Feedforward Neural Networks (FNNs)


Feedforward Neural Networks (FNNs) are the simplest type of neural
network architecture. In FNNs, the information moves in one direction—from the
input layer through hidden layers to the output layer. There are no cycles or loops
in the network. Each neuron in one layer is connected to every neuron in the next
layer. These networks are straightforward to implement and train. FNNs are
suitable for tasks where inputs and outputs have a fixed size. They are
commonly used for classification and regression problems, providing a
foundational model for many neural network applications.

Convolutional Neural Networks (CNNs)


Convolutional Neural Networks (CNNs) are designed for processing grid-like
data structures, such as images. They use convolutional layers to
automatically learn spatial hierarchies of features. CNNs consist of convolutional
layers that perform convolution operations to detect local patterns, pooling layers
that downsample the feature maps to reduce dimensionality, and fully connected
layers that take the flattened output of the convolutional and pooling layers and
produce the final prediction. CNNs excel in hierarchical feature learning and
parameter sharing, making them highly efficient. They are widely used in image
and video recognition, object detection, and image segmentation, as well as in
natural language processing tasks when applied to text embeddings.
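
The convolution and pooling operations can be sketched in plain Python. This is a toy illustration: the 5x5 image, the edge-detecting kernel, and the function names are assumptions, and real CNN layers learn their kernel values during training rather than having them set by hand.

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution as used in CNNs (technically cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    rows, cols = len(feature_map) // size, len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b] for a in range(size) for b in range(size))
            for j in range(cols)
        ]
        for i in range(rows)
    ]

# Toy 5x5 image with a vertical edge, and a hypothetical vertical-edge kernel.
image = [[0, 0, 1, 1, 1]] * 5
kernel = [[-1, 1], [-1, 1]]
features = conv2d(image, kernel)  # 4x4 feature map
pooled = max_pool2d(features)     # 2x2 after pooling
print(features[0], pooled)
```

The same four kernel weights are reused at every position of the image, which is the parameter sharing referred to above. The feature map responds only where the edge is.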
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) are suited for sequential data. They
have connections that form directed cycles, allowing them to maintain a state and
remember previous inputs. An RNN is composed of units where each unit takes
input from the current time step and the hidden state from the previous time step. This structure
includes feedback connections to capture temporal dependencies. RNNs can
process sequences of varying lengths and maintain a memory of previous inputs,
making them effective for time-dependent data. They are commonly used in time
series forecasting, language modeling, speech recognition, and text generation,
as well as in tasks involving sequential data like stock price prediction and music
composition.
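
The recurrence can be sketched with a single-unit RNN. The tanh activation is a common choice but is an assumption here, as are the weight values and function names.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a single-unit RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same weights at every time step, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

print(run_rnn([1.0, 0.0, 0.0]))       # the first input still influences later states
print(run_rnn([1.0, 0.0, 0.0, 0.0]))  # the same weights handle any sequence length
```

The hidden state stays non-zero after the input returns to zero, which is the "memory" of previous inputs, although it fades a little at each step.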

Long Short-Term Memory Networks (LSTMs)


Long Short-Term Memory Networks (LSTMs) are a special type of RNN
designed to address the vanishing gradient problem. They can capture long-term
dependencies more effectively. LSTMs are composed of units containing a cell
state and three gates (input, forget, and output) that regulate the flow of
information. This enhanced structure allows LSTMs to learn long-term
dependencies while mitigating issues with vanishing and exploding gradients.
LSTMs are suitable for tasks requiring long-range temporal dependencies and
are used in machine translation, speech synthesis, and video analysis, providing
robust performance in scenarios where standard RNNs struggle.
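
The role of the three gates can be sketched with a single-unit LSTM step, following the standard LSTM equations. The parameter layout (a dictionary of `(w_x, w_h, b)` tuples), the parameter values, and the function names are hypothetical and chosen only to make the gate behavior visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM. `p` maps each gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate values to add to the cell state
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Hypothetical parameters; the large forget-gate bias keeps the cell state nearly intact.
params = {
    "input": (1.0, 0.0, 0.0),
    "forget": (0.0, 0.0, 5.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 0.0
for x_t in [1.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

With the forget gate close to 1, the value written at the first time step is still largely present in the cell state three steps later, which is how an LSTM carries long-term information.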

Gated Recurrent Units (GRUs)


Gated Recurrent Units (GRUs) are a simplified version of LSTMs,
retaining the benefits while reducing computational complexity. GRUs consist of
units with two gates (reset and update) instead of three. This structure provides
fewer parameters compared to LSTMs, leading to more efficient training and
faster convergence. GRUs are effective for applications similar to those of LSTMs, such
as sequential data tasks, and their simpler architecture makes them preferable when
computational resources are limited.

Autoencoders
Autoencoders are a type of neural network used for unsupervised
learning. They aim to learn a compressed representation of input data. An
autoencoder consists of an encoder that maps input data to a latent-space
representation and a decoder that reconstructs the input data from the latent
representation. Autoencoders are characterized by their ability to perform
dimensionality reduction, feature learning, and data denoising. They have a
symmetric structure with encoder and decoder layers. Applications of
autoencoders include anomaly detection, data compression, and feature
extraction, as well as image denoising and generation of new data similar to the
training set.

Generative Adversarial Networks (GANs)


Generative Adversarial Networks (GANs) consist of two neural networks, a
generator and a discriminator, that compete against each other. The generator
produces fake data samples, while the discriminator distinguishes between real
and fake data. GANs are characterized by their adversarial training process,
where the generator aims to fool the discriminator and the discriminator aims to
catch it. As both improve, the generator learns to produce realistic data samples.
GANs are used in image generation, style transfer, and data augmentation, as well
as in creating high-quality synthetic data for various applications.

ANN Python File:


https://colab.research.google.com/drive/1K7Z5ZRVif7xvC8Mqgo0eyG6ReviLSohy?usp=sharing
