Unit3 2023 NNDL


UNIT 3

CNN
• Introduction - Components of CNN Architecture - Rectified Linear Unit (ReLU) Layer
• Exponential Linear Unit (ELU, or SELU) - Unique Properties of CNN - Architectures of CNN
• Applications of CNN
CNN
• CNNs are a class of deep neural networks that recognize and classify
features in images and are widely used for analyzing visual data.
Their applications include image and video recognition, image classification,
medical image analysis, computer vision, and natural language processing.
• CNNs achieve high accuracy on visual tasks, which makes them useful for image
recognition. Image recognition is used across many industries, for example in
medical image analysis, security, and recommendation systems.
• The term "convolution" in CNN denotes the mathematical operation of convolution,
a special kind of linear operation in which two functions are combined to
produce a third function that expresses how the shape of one is modified by
the other. In simple terms, a small matrix (the filter) is slid over the image
matrix; at each position the overlapping values are multiplied element-wise and
summed, and the resulting output is used to extract features from the image.
• In the fast-paced world of computer vision and image processing, one problem
consistently stands out: the ability to effectively recognize and classify images.
• As we continue to digitize and automate our world, the demand for systems that
can understand and interpret visual data is growing at an unprecedented rate.
• The challenge is not just about recognizing images; it is about doing so accurately
and efficiently. Traditional machine learning methods often fall short, struggling to
handle the complexity and high dimensionality of image data. This is
where Convolutional Neural Networks (CNNs) come to the rescue.
• CNN architectures are among the most popular deep learning models for vision
tasks. CNNs have shown remarkable success in tackling the problem of image
recognition, bringing a new level of precision and scalability.
How Does a Computer Read an Image?
• Consider an image of the New York skyline. At first glance you see a lot
of buildings and colors. So how does the computer process this image?
• The image is broken down into three color channels: red, green, and blue.
Each pixel of the image has one value in each of these channels. The computer
reads the value associated with each pixel and determines the size of the image.
• For black-and-white (grayscale) images there is only one channel, and the
concept is the same.
Convolutional Neural Networks - CNN
• Convolutional Neural Networks, commonly referred to as CNNs, are a
specialized kind of neural network architecture that is designed to process
data with a grid-like topology. This makes them particularly well-suited
for dealing with spatial and temporal data, like images and videos, that
maintain a high degree of correlation between adjacent elements.
• CNNs are similar to other neural networks, but they add a series of
convolutional layers. Convolutional layers perform a mathematical operation
called convolution on the input data: a small filter is slid over the input and,
at each position, a weighted sum of the values under it is computed.
• The convolution operation helps to preserve the spatial relationship
between pixels by learning image features using small squares of input data.
• [Figure: a typical CNN architecture]
Convolutional operation
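The sliding-window operation can be sketched in a few lines of plain Python. This is only a minimal illustration, not a library implementation: the function name conv2d, the 5x5 input, and the 3x3 filter values are made up for the example. As in most deep learning libraries, the filter is applied without flipping (strictly speaking, this is cross-correlation).

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2D image and return the feature map.

    Only positions where the kernel fits completely are used (no padding).
    """
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply overlapping values element-wise and sum them.
            acc = 0
            for di in range(k):
                for dj in range(k):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 binary image and 3x3 filter, chosen only for illustration.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d(image, kernel)           # 3x3 output
strided_map = conv2d(image, kernel, stride=2)  # 2x2 output
print(feature_map)
print(strided_map)
```

With a stride of 1 the 5x5 input gives a 3x3 feature map; with a stride of 2 the filter visits only every second position and the output shrinks to 2x2.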
Stride and Padding
• Stride: Stride refers to how far the filter "slides" or moves across the input matrix
at each step of the convolution operation.
• With a stride greater than 1, or near the borders of the input, the filter may not fit
completely inside the input matrix at some positions. There are two common approaches
to handle this situation, depending on the desired outcome:
• 1. Valid Padding:
• In this approach, the filter is applied only at positions where it fits fully within the
input matrix. Positions where the filter does not fit are skipped. This reduces
the size of the output feature map. No padding is added in this case.
• 2. Same Padding:
• With same padding, zeros are added around the input matrix before applying the
convolution, so that the filter can also be applied at the borders. With a stride of 1
this keeps the spatial dimensions of the output feature map equal to those of the input.
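The effect of stride and padding on the output size follows the standard formula floor((n + 2p - f) / s) + 1, where n is the input size, f the filter size, p the padding on each side, and s the stride. The sketch below is a small illustration of that formula; the function names are hypothetical, and same_padding assumes an odd filter size and a stride of 1.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output width (or height) for input size n and filter size f."""
    return (n + 2 * padding - f) // stride + 1


def same_padding(f):
    """Zeros to add on each side so that a stride-1 convolution keeps the input size.

    Assumes an odd filter size f (3, 5, 7, ...).
    """
    return (f - 1) // 2


valid_out = conv_output_size(5, 3)                            # valid padding
same_out = conv_output_size(5, 3, padding=same_padding(3))    # same padding
strided_out = conv_output_size(7, 3, stride=2)                # stride 2, no padding
print(valid_out, same_out, strided_out)
```

For a 5x5 input and a 3x3 filter, valid padding shrinks the output to 3x3, while same padding (one ring of zeros) keeps it at 5x5.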
• Advantages of zero padding in convolutional neural networks:
• Preservation of Spatial Dimensions: Zero padding adds extra rows and columns of
zeros around the input image. This helps maintain the spatial dimensions (width
and height) of the feature maps, especially when using larger filters. This can be
crucial when you want to keep the information at the edges of the input image.
• Mitigation of Border Effects: Without padding, the filter's center never lands on
the border pixels, so pixels at the edges contribute to fewer outputs than pixels in
the interior. This leads to a reduction in spatial dimensions and a loss of
information at the edges. Zero padding mitigates these border effects, as the filter
can interact with all parts of the input, allowing the output feature map to capture
more comprehensive information.
• Control of Output Size: Padding allows you to control the output size of the
feature map. By adjusting the amount of padding, you can determine whether the
output size remains the same as the input size or changes based on your
requirements.
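As a small illustration of what zero padding does to the input matrix, the sketch below surrounds a 2D list with p rows and columns of zeros. The function name zero_pad and the 2x2 example values are made up for the example.

```python
def zero_pad(image, p):
    """Return a copy of a 2D image surrounded by p rows/columns of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]            # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)    # left and right border
    padded.extend([0] * width for _ in range(p))        # bottom border
    return padded


padded = zero_pad([[1, 2], [3, 4]], 1)
for row in padded:
    print(row)
```

A 2x2 input padded with p = 1 becomes 4x4, so a 3x3 filter applied with stride 1 produces a 2x2 output, the same size as the original input.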
COMPONENTS OF CONVOLUTIONAL NEURAL NETWORKS
• Input layer
• As the name says, this is the input image, which can be grayscale or RGB.
Every image is made up of pixels whose values range from 0 to 255. We need
to normalize them, i.e. rescale them to the range 0 to 1, before
passing them to the model.
• [Figure: example input image of size 4*4 with 3 channels (RGB) and its
pixel values]
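The normalization step can be sketched as a division by 255. This is a minimal example assuming the image is stored as nested lists of shape height x width x channels with 8-bit values; the pixel values are invented for illustration.

```python
def normalize(image):
    """Rescale 8-bit pixel values (0 to 255) to the range 0 to 1."""
    return [[[value / 255.0 for value in pixel] for pixel in row] for row in image]


# Hypothetical 1x2 RGB image: each pixel is [red, green, blue].
image = [[[0, 128, 255], [51, 102, 204]]]
normalized = normalize(image)
print(normalized)
```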
• Convolutional layer: Convolutional layers operate by sliding a set of
‘filters’ or ‘kernels’ across the input data.
• Each filter is designed to detect a specific feature or pattern, such as
edges, corners, or more complex shapes in the case of deeper layers.
As these filters move across the image, they generate a map that
signifies the areas where those features were found.
• The output of the convolutional layer is a feature map, which is a
representation of the input image with the filters applied.
• Convolutional layers can be stacked to create more complex models,
which can learn more intricate features from images.
• Pooling layer: The pooling layer performs dimensionality reduction on its
input by shrinking the spatial size of the feature maps, which in turn reduces
the number of parameters and computations in later layers. Like a
convolutional layer, it slides a window (filter) over the input.
• The difference is that instead of using a matrix of weights as
convolutional layers do, pooling layers aggregate the input values
inside the window.
• Although this operation results in the loss of some data, the benefits
of pooling layers include reducing the complexity of the features,
making the model less prone to overfitting, speeding up the calculations,
and improving the efficiency of the CNN.
• There are two main types of pooling: max pooling and average pooling.
• Max pooling takes the maximum value from each pooling window of a feature map.
• For example, if the pooling window size is 2×2, it will pick the pixel with the highest value in that
2×2 region. Max pooling effectively captures the most prominent feature or characteristic within
the pooling window.
• Average pooling calculates the average of all values within the pooling window. It provides a
smooth, average feature representation.
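Both pooling types can be sketched with one small function. This is an illustrative example only: the name pool2d, the 2×2 window with stride 2, and the 4×4 feature map values are assumptions made for the sketch.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the values inside the current pooling window.
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [3, 4, 1, 8],
]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)
print(avg_pooled)
```

Each 2×2 window is reduced to a single number, so the 4×4 map becomes 2×2 in both cases; only the aggregation rule differs.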
• The flattening layer in a Convolutional Neural Network (CNN) architecture is
used to transform the 2D or 3D feature maps from the previous
convolutional and pooling layers into a 1D vector. This transformation is
important when transitioning from the convolutional and pooling layers to
the fully connected layers in the network.
• The flattening layer takes the output feature maps from a single sample
(image) and converts them into a 1D vector.
• After flattening, the resulting 1D vector becomes the input to the fully
connected layers. These fully connected layers can then perform the
necessary classification or regression tasks based on the extracted features.
• Fully connected layer: Fully connected layers are one of the most basic
types of layers in a convolutional neural network (CNN).
• As the name suggests, each neuron in a fully connected layer is
connected to every neuron in the previous layer.
• Fully connected layers are typically used towards the end of a CNN, when
the goal is to take the features learned by the convolutional and pooling
layers and use them to make predictions such as classifying the input to a
label.
• For example, if we were using a CNN to classify images of animals, the final
Fully connected layer might take the features learned by the previous layers
and use them to classify an image as containing a dog, cat, bird, etc.
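The hand-off from feature maps to a fully connected layer can be sketched as follows. This is a toy example with made-up numbers: two 2x2 feature maps are flattened into a vector of 8 values, and a hypothetical layer with two neurons computes one weighted sum plus bias per neuron. In a real network the weights are learned, and an activation function follows.

```python
def flatten(feature_maps):
    """Turn a list of 2D feature maps into a single 1D vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: every output neuron sees every input value."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]


# Two hypothetical 2x2 feature maps coming out of the pooling layer.
feature_maps = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
vector = flatten(feature_maps)

# Hypothetical weights: one row of 8 weights per output neuron.
weights = [[1] * 8, [0, 0, 0, 1, 0, 0, 0, 0]]
biases = [0.5, -1.0]
outputs = dense(vector, weights, biases)
print(vector)
print(outputs)
```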
Various kernel filters in CNN
• The purpose of using various kernel filters in Convolutional Neural
Networks (CNNs) is to capture different types of features and patterns
present in the input data, such as images. Each filter specializes in
detecting a specific feature, such as edges, corners, textures, or more
complex shapes. By applying multiple filters of different types, CNNs
can learn a diverse set of features and build a richer representation of
the input data.
• Some common types of kernel filters and their purposes, with
examples:
• Edge Detection: Edge detection filters are used to identify abrupt
intensity changes in an image, which correspond to edges. A common
edge detection filter is the Sobel filter. It has two variations, one for
detecting vertical edges and another for detecting horizontal edges.
These filters highlight regions of rapid intensity change in the input
image.
• Example of a Sobel filter for vertical edge detection:
• Sobel Filter:
• -1 0 1
• -2 0 2
• -1 0 1
• Blur (Gaussian) Filter: Gaussian filters are used for blurring or
smoothing an image. They compute a weighted average of the pixel values
in a neighborhood, with the largest weight at the center, to reduce noise
and fine details. Gaussian filters are often used as pre-processing steps.
• Example of a Gaussian filter (the weights are divided by their sum, 16,
so that the overall brightness is preserved):
• Gaussian Filter:
• 1 2 1
• 2 4 2
• 1 2 1
• Sharpening Filter: Sharpening filters enhance edges and details in an
image. They work by subtracting a blurred version of the image from
the original image. This amplifies high-frequency components, making
edges more pronounced.
• Example of a sharpening filter:
• Sharpening Filter:
• 0 -1 0
• -1 5 -1
• 0 -1 0
• Embossing Filter: Embossing filters create a 3D effect by emphasizing
the differences between adjacent pixels. They simulate the effect of
light and shadow on a textured surface.
• Example of an embossing filter:
• Embossing Filter:
• -2 -1 0
• -1 1 1
• 0 1 2

• Identity Filters:
• Identity filters do not alter the input and serve as baseline filters. They can
be used to visualize the regions where different patterns are detected by
other filters.
• Example: Identity Filter (No Change):
• 0 0 0
• 0 1 0
• 0 0 0
• Custom Filters:
• Custom filters can be designed to detect specific patterns relevant to a particular
task, such as detecting diagonal lines, corners, or texture patterns.
• Example: Custom Filter for Diagonal Line Detection:
• 0 0 -1
• 0 1 0
• -1 0 0
• Using these various kernel filters, CNNs can learn a diverse set of features at
different scales and orientations. By stacking multiple convolutional layers with
different filters, CNNs become capable of recognizing complex visual patterns and
features, making them a powerful tool for a wide range of computer vision tasks.
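To see how a filter responds to a pattern, the sketch below applies the vertical-edge Sobel kernel listed above, and its horizontal-edge counterpart, to a tiny hand-made image whose left part is dark and whose right part is bright. The image values and the helper name apply_filter are illustrative assumptions; the helper uses no padding and a stride of 1.

```python
def apply_filter(image, kernel):
    """Apply a 3x3 kernel at every position where it fits (stride 1, no padding)."""
    k = len(kernel)
    result = []
    for i in range(len(image) - k + 1):
        row = []
        for j in range(len(image[0]) - k + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(k) for dj in range(k)))
        result.append(row)
    return result


sobel_vertical = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
sobel_horizontal = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

# Hypothetical 3x5 image with a vertical edge between columns 2 and 3.
image = [
    [0, 0, 0, 10, 10],
    [0, 0, 0, 10, 10],
    [0, 0, 0, 10, 10],
]
vertical_response = apply_filter(image, sobel_vertical)
horizontal_response = apply_filter(image, sobel_horizontal)
print(vertical_response)
print(horizontal_response)
```

The vertical-edge filter gives a strong response where the window covers the jump in intensity and zero in the flat region, while the horizontal-edge filter stays at zero everywhere because the rows of this image are identical.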

WORKING OF CNN
• Example: Handwritten Digit Recognition
• Imagine you have a CNN trained to recognize handwritten digits (0-9).
The input to the CNN is a grayscale image of a digit.
• Input Image: a grayscale image of the handwritten digit "7".
• Convolutional Layer:
• The first layer in the CNN is a convolutional layer. It consists of
multiple filters that slide over the input image. Each filter detects
specific features like edges, corners, or textures.
• As the filters move across the input image, they perform convolutions
to create feature maps that highlight relevant patterns. Each filter
generates a separate feature map.
• Activation Function: After convolution, an activation function
(typically ReLU) is applied element-wise to each feature map. This
introduces non-linearity to the network and helps capture complex
relationships in the data.

• Pooling Layer:
• The next step is pooling (often max pooling). This layer reduces the spatial dimensions of the
feature maps, reducing computational complexity and making the network more robust to
variations in the input.
• Max pooling takes the maximum value from each pooling region and discards the rest.
• Flattening:
• The pooled feature maps are flattened into a 1D vector. This prepares the data for the fully
connected layers.
• Fully Connected Layers:
• The flattened vector is passed through fully connected layers. These layers learn higher-level
representations of the features, combining information from different parts of the image.
• Output Layer:
• The final fully connected layer leads to the output layer. For digit recognition, this layer typically has
10 neurons (one for each digit). The output values represent the model's confidence in each
possible class (digit).
• Softmax Activation:
• A softmax activation function is applied to the output layer. It converts the output values into a
probability distribution, indicating the likelihood of the input image belonging to each class.
• Prediction:
• The class with the highest probability becomes the predicted label for the input image. In our
example, the CNN predicts that the input image is most likely the digit "7."
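The last two steps, softmax and prediction, can be sketched as follows. The ten output values are invented for illustration (a real network would produce them), and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(values):
    """Convert raw output values into a probability distribution."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the 10 output neurons (digits 0 to 9).
outputs = [0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 0.2, 3.0, 0.4, 0.5]
probabilities = softmax(outputs)

# The predicted label is the class with the highest probability.
predicted_digit = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_digit)
```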
• This entire process of convolution, activation, pooling, flattening, and
fully connected layers constitutes the core working of a Convolutional
Neural Network. The network learns to adjust the filter values during
training to recognize different features in the input data, allowing it to
make accurate predictions for various tasks such as image
recognition.
Various stages of a CNN model with illustrations - Python
• Refer to the Word file "CNN".
ReLU
• Vanishing gradients:
• As the backpropagation algorithm moves backward from the output layer
towards the input layer, the gradients often get smaller and smaller and
approach zero, which leaves the weights of the initial (lower) layers
nearly unchanged. As a result, gradient descent may never converge to a
good solution. This is known as the vanishing gradients problem.
• Exploding gradients:
• In other cases, the gradients keep getting larger and larger as the
backpropagation algorithm progresses. This causes very large weight
updates and makes gradient descent diverge. This is known as the
exploding gradients problem.
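A rough numerical illustration of both problems: during backpropagation the gradient is multiplied by one local factor per layer. The sketch below deliberately simplifies by using the same factor in every layer and ignoring the weights, so it shows only the trend, not a real network. The sigmoid derivative is at most 0.25, which is why sigmoid layers tend to shrink gradients.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_factor(local_grad, depth):
    """Product of the same local gradient across `depth` layers (simplified)."""
    return local_grad ** depth


largest_sigmoid_grad = sigmoid_grad(0.0)
vanishing = backprop_factor(largest_sigmoid_grad, 10)  # shrinks towards 0
exploding = backprop_factor(1.5, 10)                   # grows rapidly
print(largest_sigmoid_grad, vanishing, exploding)
```

Even in the best case for sigmoid, ten layers scale the gradient by less than one millionth, while a per-layer factor of 1.5 makes it more than fifty times larger.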
ReLU (Rectified Linear Unit) Activation Function
• ReLU is one of the most widely used activation functions today.
Compared with sigmoid and tanh, it does not saturate for positive inputs,
which greatly reduces the vanishing gradient problem and makes deep
networks easier to train.
• Definition: f(x) = max(0, x)
• Range: 0 to infinity
• Advantages of ReLU:
• All negative values are converted to 0, so the output contains no
negative values.
• The output has no upper limit and the gradient is 1 for every positive
input, so the function does not saturate there. This mitigates the
vanishing gradient problem and helps training efficiency and accuracy.
• It is fast to compute compared to other activation functions.
• Disadvantages of ReLU:
• Dying ReLU Problem: The gradient of ReLU is zero for all negative inputs.
A unit whose input stays negative therefore stops receiving weight updates
and can remain inactive for the rest of training. This is known as the
"dying ReLU" problem, which can slow down learning.
• Unbounded Activation: ReLU has no upper bound on its output, so
activations can grow very large, which can contribute to issues like
exploding gradients in deeper networks.
• Output Instability: ReLU's output can be unstable if the input is not
properly normalized.
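A minimal sketch of ReLU and its gradient, using the standard definition f(x) = max(0, x). The gradient at exactly x = 0 is set to 0 here by convention, which is an assumption of this sketch.

```python
def relu(x):
    """ReLU: passes positive values through, clips negative values to 0."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
activations = [relu(x) for x in inputs]
gradients = [relu_grad(x) for x in inputs]
print(activations)
print(gradients)
```

The zero gradient for the negative inputs is what lies behind the dying ReLU problem: no error signal flows back through a unit while its input is negative.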
• Leaky ReLU Function
• Leaky ReLU is an improved version of the ReLU function that addresses
the dying ReLU problem by giving the negative region a small positive slope.
• The advantages of Leaky ReLU are the same as those of ReLU, with the
addition that it allows backpropagation even for negative input values.
• Because of this minor modification for negative input values, the
gradient on the left side of the graph is non-zero. Therefore, we no
longer encounter dead neurons in that region.
• Advantages of Leaky ReLU:
• Mitigates Dying ReLU: Leaky ReLU helps prevent neurons from becoming
completely inactive during training, thereby addressing the "dying ReLU"
problem.
• Simple Implementation: Like ReLU, Leaky ReLU is computationally efficient
and easy to implement.
• Disadvantages of Leaky ReLU:
• Small Negative Gradient: For negative inputs the gradient equals the small
slope, so learning through those units can still be slow, and the slope is an
extra hyperparameter that has to be chosen.
• Limited Improvement: While Leaky ReLU helps with the dying ReLU
problem, it does not consistently outperform ReLU and might not be the
most effective solution in all cases.
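Leaky ReLU can be sketched in the same way as ReLU. The slope value 0.01 used as the default here is a commonly used choice, not something fixed by the definition, so treat it as an assumption of this example.

```python
def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for positive inputs, slope * x for negative inputs."""
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    """Gradient of Leaky ReLU: never exactly zero, even for negative inputs."""
    return 1.0 if x > 0 else slope


positive_out = leaky_relu(3.0)
negative_out = leaky_relu(-2.0)
negative_grad = leaky_relu_grad(-2.0)
print(positive_out, negative_out, negative_grad)
```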
• Randomized ReLU (RReLU): Randomized ReLU, also known as RReLU,
is another variation of the ReLU activation function. It introduces
randomness during training to encourage robustness and reduce
overfitting. RReLU is defined as follows:
• f(x) = x, if x > 0
• f(x) = a * x, if x <= 0
• a is a random value drawn from a uniform distribution within a
predefined range [lower, upper]. During training, a new value of a is
sampled each time an activation is computed, so it varies across neurons
and training iterations. At test time, a is fixed, typically to the
midpoint of the range.
• Advantages of Randomized ReLU (RReLU):
• Regularization: The randomness introduced by RReLU during training
acts as a form of regularization, helping to prevent overfitting and
improve the generalization of the model.
• Flexibility: RReLU allows for adaptive slopes for negative inputs,
which can aid in learning more robust representations.
• Disadvantages of Randomized ReLU (RReLU):
• Increased Complexity: Introducing randomness makes the network's
behavior less deterministic, potentially making it harder to interpret
or debug.
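A minimal sketch of RReLU for a single input value. The range [1/8, 1/3] used as the default is an assumption for illustration, the rng parameter is added only so that the example can be seeded, and using the midpoint of the range at test time is one common convention.

```python
import random


def rrelu(x, lower=1 / 8, upper=1 / 3, training=True, rng=random):
    """Randomized ReLU for one input value.

    Training: the negative slope a is sampled uniformly from [lower, upper].
    Testing: a is fixed to the midpoint of the range.
    """
    if x > 0:
        return x
    a = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return a * x


seeded = random.Random(0)                      # seeded generator for repeatability
train_out = rrelu(-1.0, rng=seeded)            # somewhere in [-1/3, -1/8]
test_out = rrelu(-1.0, training=False)         # always -(1/8 + 1/3) / 2
positive_out = rrelu(2.0)                      # positive inputs pass through
print(train_out, test_out, positive_out)
```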
Exponential Linear Unit activation function
• The ELU activation function is continuous and differentiable at all points.
For positive values of the input x, the function simply outputs x, whereas
if the input is negative, the output is exp(x) − 1. As the input becomes
more and more negative, the output approaches −1. The derivative of the
ELU function is 1 for all positive values and exp(x) for all negative values.
• Mathematically, the ELU activation function can be written as
• y = ELU(x) = exp(x) − 1 ; if x < 0
• y = ELU(x) = x ; if x ≥ 0
• Unlike ReLU, ELU also produces negative values, which pushes the mean of
its activations towards 0.
• Advantages of ELU:
• Smooth Non-Linearity: ELU is smooth and differentiable everywhere, including at the
point where x = 0, making it well-suited for gradient-based optimization.
• Mitigates Dying ReLU: The non-zero gradient for negative inputs helps mitigate the dying
ReLU problem, leading to faster convergence and potentially better generalization.
• Disadvantages of ELU:
• Computation Complexity: ELU involves exponential computations for negative inputs,
which can be computationally more expensive compared to ReLU or its variants.
• Negative Saturation: For large negative inputs ELU saturates at −1, so the
gradient there becomes very small and learning through those units can slow down.
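The ELU formula can be sketched directly. The general form of ELU scales the negative branch by a parameter alpha; the default alpha = 1.0 used below reproduces the formula exp(x) − 1 given in this section, and the parameter itself is an addition of this sketch.

```python
import math


def elu(x, alpha=1.0):
    """ELU: x for x >= 0, alpha * (exp(x) - 1) for x < 0."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for x >= 0, alpha * exp(x) for x < 0."""
    return 1.0 if x >= 0 else alpha * math.exp(x)


positive_out = elu(2.0)
far_negative_out = elu(-10.0)   # close to -1, the saturation value
negative_grad = elu_grad(-1.0)  # exp(-1), small but non-zero
print(positive_out, far_negative_out, negative_grad)
```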
• Scaled Exponential Linear Unit (SELU) Layer:
• SELU is a further extension of the ELU activation function, designed to
promote self-normalization in neural networks. It's defined as follows:
• f(x) = lambda * (x if x > 0 else alpha * (exp(x) - 1))
• lambda and alpha are fixed scaling constants chosen so that the
activations stay close to a mean of 0 and a standard deviation of 1
during training.
• Advantages of SELU:
• Self-Normalization: SELU activations are designed to preserve mean and
variance, promoting self-normalization and potentially stabilizing training
without requiring extensive normalization techniques.
• Stable Training: SELU is specifically designed to avoid the vanishing and
exploding gradient problems, leading to more stable training in deep
networks.
• Dying Unit Prevention: Like ELU, SELU also mitigates the dying unit problem
due to its non-zero gradients for negative inputs.
• Disadvantages of SELU:
• Limited Usage: While SELU has shown promising results, it might not be the
best choice for all types of neural network architectures or tasks.
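The SELU formula above can be sketched as follows. The numeric values of alpha and lambda are the constants commonly quoted for SELU (approximately 1.6733 and 1.0507); they are not given in this section, so they are stated here as an assumption.

```python
import math

# Commonly quoted SELU constants (assumed here, not taken from this section).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x):
    """SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    return LAMBDA * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))


positive_out = selu(1.0)        # scaled by lambda
far_negative_out = selu(-50.0)  # saturates near -lambda * alpha
print(positive_out, far_negative_out)
```

For large negative inputs the output levels off near -lambda * alpha, roughly -1.758, instead of at -1 as with plain ELU.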
Unique properties of CNN
• Weight Sharing
• Translation Invariance
Architectures of CNN
• LeNet
• AlexNet
• ZFNet
• GoogLeNet
• VGGNet
• ResNet
• DenseNet
LeNet
• Introduction
• Overview of LeNet architecture
• Layer by layer explanation in architecture
• Merits and demerits
• Applications of LeNet architecture
LeNet-5 architecture
• The LeNet architecture was introduced in the paper titled
"Gradient-Based Learning Applied to Document Recognition" by Yann
LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, published in
1998. This paper described the development of a convolutional neural
network (CNN) architecture for the purpose of recognizing
handwritten characters and digits in documents, particularly for
applications like reading bank cheques.
• LeNet-5 is called so because it was the fifth neural network model
developed by Yann LeCun and his colleagues during their research in
the 1990s. The name "LeNet" is derived from Yann LeCun's last name,
and the "5" in "LeNet-5" indicates that it was the fifth iteration or
version of the model in their research.
Architecture of LeNet-5
• The first layer is the input layer with feature map size 32X32X1.
• Then we have the first convolution layer with 6 filters of size
5X5 and stride 1. The activation function used at this layer
is tanh. The output feature map is 28X28X6.
• Next, we have an average pooling layer with filter size 2X2
and stride 2. The resulting feature map is 14X14X6, since
the pooling layer doesn't affect the number of channels.
• After this comes the second convolution layer with 16 filters
of 5X5 and stride 1. Again, the activation function is tanh.
Now the output size is 10X10X16.
• Next comes the second average pooling layer of 2X2 with
stride 2. As a result, the size of the feature map is reduced to
5X5X16.
• The final convolution layer has 120 filters of 5X5 with stride 1
and activation function tanh. Now the output size is 1X1X120,
i.e. 120 values.
• The next is a fully connected layer with 84 neurons, which
produces 84 output values; the activation function used here
is again tanh.
• The last layer is the output layer with 10 neurons and the
Softmax function. Softmax gives the probability that a
data point belongs to each class. The class with the highest
probability is then predicted.
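The feature map sizes listed above can be checked with the output-size formula (n - f) / s + 1 for a layer without padding. The sketch below traces the spatial size through the convolution and pooling layers. It assumes a stride of 2 for both pooling layers, which is what the 14X14 and 5X5 sizes imply, and it treats the 120-filter layer as a 5X5 convolution. The layer labels C1 to C5 are the conventional LeNet-5 names.

```python
def out_size(n, f, stride):
    """Output size of a convolution or pooling layer without padding."""
    return (n - f) // stride + 1


# (label, filter size, stride, number of output channels)
lenet5_layers = [
    ("C1 convolution", 5, 1, 6),
    ("S2 average pooling", 2, 2, 6),
    ("C3 convolution", 5, 1, 16),
    ("S4 average pooling", 2, 2, 16),
    ("C5 convolution", 5, 1, 120),
]

size = 32  # the input is 32X32X1
shapes = []
for label, f, stride, channels in lenet5_layers:
    size = out_size(size, f, stride)
    shapes.append((size, size, channels))
    print(f"{label}: {size}X{size}X{channels}")
```

The trace reproduces the sequence 28X28X6, 14X14X6, 10X10X16, 5X5X16, and 1X1X120, after which the 120 values feed the fully connected layers.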
Layers of Lenet
Merits of LeNet Architecture
1. Effective Feature Extraction: LeNet effectively extracts features from images, making it suitable for
tasks like handwritten digit recognition and early computer vision tasks.
2. Simplicity: LeNet is a straightforward architecture with a small number of layers and parameters, making
it easy to understand and train.
3. Low Computational Requirements: Due to its simplicity, LeNet requires fewer computational
resources and can run on less powerful hardware compared to deep neural networks with many layers.
4. Pioneering Architecture: LeNet laid the foundation for modern CNNs and inspired the development of
more advanced architectures, making it a significant contribution to the field of deep learning.
5. Useful for Educational Purposes: LeNet is often used as a teaching tool for understanding the fundamental
concepts of convolutional neural networks due to its simplicity.
6. Efficient on Small-Scale Tasks: LeNet can efficiently handle small-scale image classification tasks, especially
when computational resources are limited.

Demerits of LeNet Architecture
1. Limited Complexity: LeNet is relatively shallow compared to modern architectures, which can limit its
ability to handle complex, high-dimensional data or tasks like object detection.
2. Limited to 2D Grid Data: LeNet is primarily designed for 2D grid data like images and may not be
well-suited for more complex data types, such as sequences or 3D data.
3. Poor Performance on Complex Tasks: While effective for digit recognition, LeNet may not perform
well on more challenging computer vision tasks, such as object detection or image segmentation.
4. Lack of Depth: LeNet's limited depth may result in difficulty capturing hierarchical features or learning
complex data representations, which are important for many modern tasks.
5. Requires Large Datasets: Like many deep learning models, LeNet performs better with larger datasets,
which may not be available for all applications.
6. Limited Regularization Techniques: LeNet lacks modern regularization techniques like dropout and
batch normalization, which help prevent overfitting in deeper networks.
• APPLICATIONS
• Handwritten Digit Recognition: LeNet-5 was originally designed for handwritten digit recognition, and it
continues to be used in this domain. It can effectively classify and recognize handwritten digits in various
contexts, such as reading postal codes on envelopes and processing handwritten forms.
• Optical Character Recognition (OCR): LeNet-5 can be applied to OCR tasks where it recognizes printed or
handwritten characters in documents, cheques, and forms. It's particularly useful for processing printed
characters with high accuracy.
• Bank Cheque Processing: LeNet-5's ability to recognize handwritten digits and characters makes it suitable
for applications involving bank cheque processing. It can help automate the reading of handwritten amounts
and account numbers on cheques.
• Traffic Sign Recognition: LeNet-5 can be used for traffic sign recognition in autonomous vehicles and
advanced driver assistance systems (ADAS). It can classify traffic signs, helping vehicles navigate and follow
traffic regulations.
• Medical Image Analysis: LeNet-5 has been employed in medical imaging tasks, such as detecting and
classifying features in X-rays, CT scans, and histopathology images. It can be used for tasks like identifying
specific structures or anomalies in medical images.
• Face Recognition: While more complex architectures are often used for face recognition, LeNet-5 can be
adapted for simpler face recognition tasks, such as recognizing faces in controlled environments or verifying
identities based on facial features.
• Document Classification: LeNet-5 can be used for document classification tasks,
such as sorting documents based on their content or identifying the type of
document (e.g., invoices, contracts) from scanned images.
• Character Recognition in CAPTCHA: LeNet-5 can be employed to break simple
CAPTCHA (Completely Automated Public Turing test to tell Computers and
Humans Apart) challenges that involve recognizing distorted characters. However,
it's worth noting that CAPTCHAs have become more sophisticated to thwart
automated recognition.
• Handwritten Signature Verification: In applications requiring handwritten
signature verification, LeNet-5 can be used to compare and verify signatures for
security purposes.
• Digit-Based Sorting Systems: In industrial automation, LeNet-5 can be applied to
digit-based sorting systems where it classifies and sorts objects based on printed
or marked digits.
AlexNet
• AlexNet is a deep convolutional neural network (CNN) architecture
that made significant strides in the field of computer vision and deep
learning. It was developed by Alex Krizhevsky, Ilya Sutskever, and
Geoffrey Hinton and won the 2012 ImageNet Large Scale Visual
Recognition Challenge (ILSVRC), marking a breakthrough in image
classification.
AlexNet Architecture
• Architecture Overview:
• AlexNet consists of eight learned layers: five convolutional layers followed by
three fully connected layers. It also utilizes techniques like max-pooling, dropout,
and ReLU activation functions. The architecture was designed to be deep, which was
unusual at the time (2012), and it demonstrated the power of deep neural
networks for image classification.
• Layer by layer (counting the pooling layers separately):
• Layer 1: Convolutional layer with 96 filters of size 11x11 and a
stride of 4. The input shape is (224, 224, 3).
• Layer 2: Max-pooling layer with a pool size of 3x3 and a stride
of 2.
• Layer 3: Convolutional layer with 256 filters of size 5x5 and a
padding of 2.
• Layer 4: Max-pooling layer with a pool size of 3x3 and a stride
of 2.
• Layers 5-7: Three convolutional layers with 3x3 filters (384, 384, and
256 filters respectively), a padding of 1, and ReLU activation.
• Layer 8: Max-pooling layer.
• Layers 9 and 10: Fully connected layers with 4096 neurons
and ReLU activation functions.
• Layer 11: The output layer with 1000 neurons (for the 1000
classes in ImageNet) and a softmax activation function.
Merits of AlexNet Architecture
1. Deep Architecture: AlexNet was one of the first deep neural networks, demonstrating the potential of
deep learning for computer vision tasks.
2. Improved Accuracy: It significantly reduced the top-5 error rate in image classification tasks, setting a
new benchmark in accuracy.
3. ReLU Activation: AlexNet popularized the use of Rectified Linear Units (ReLU) for activation functions,
which helped alleviate the vanishing gradient problem and improved training efficiency.
4. Local Response Normalization (LRN): AlexNet introduced LRN layers to enhance generalization by
normalizing activations across neighboring feature maps.
5. GPU Utilization: It demonstrated the potential of using Graphics Processing Units (GPUs) for training
deep neural networks, which later became a standard practice.
6. Influence on Deep Learning: AlexNet's success played a pivotal role in rekindling interest in neural
networks and paved the way for many subsequent advancements in deep learning.
7. Transfer Learning: Pre-trained AlexNet models can be used as a starting point for various computer vision
tasks, enabling faster development of new applications.

Demerits of AlexNet Architecture
1. Computational Resources: Training and running AlexNet requires significant computational resources,
making it challenging for smaller projects or devices.
2. Complexity: The architecture is relatively complex with a large number of parameters, making it more
challenging to understand and optimize.
3. Susceptible to Overfitting: Due to its depth and the number of parameters, AlexNet can be prone to
overfitting, especially with small datasets. Regularization techniques like dropout are often needed.
4. Large Memory Footprint: The model's size can be a limitation, especially for deployment on
memory-constrained devices or in mobile applications.
5. Specific Data Requirements: AlexNet's success was in part due to its training on large-scale datasets like
ImageNet, which may not be available for all applications.
6. Lack of Interpretable Features: The high-level features learned by AlexNet layers are not easily
interpretable by humans, making it challenging to understand why the network makes certain predictions.
7. Older Architecture: While historically important, AlexNet is now considered somewhat dated compared
to more recent architectures with improved performance.
• Image Classification: AlexNet excels in classifying objects in images, making it suitable for tasks like
identifying objects in photographs or categorizing images into predefined classes.
• Object Detection: While AlexNet itself is not a dedicated object detection model, its features can be
used as a backbone in object detection pipelines. Researchers have integrated AlexNet features into
early object detection models to identify objects and their locations within images.
• Fine-Grained Classification: AlexNet has been applied to fine-grained classification tasks, such as
distinguishing between closely related species of animals or identifying specific makes and models
of cars within a broader vehicle category.
• Scene Understanding: AlexNet's ability to recognize objects within images contributes to scene
understanding applications. It can be used to detect and identify objects in natural scenes, aiding in
applications like autonomous driving or scene analysis.
• Content-Based Image Retrieval: AlexNet features can be used to extract representations of images,
making it possible to search for visually similar images in large datasets. This is valuable for
applications like reverse image search or finding related images in databases.
• Medical Imaging: AlexNet has been applied to medical image analysis tasks, including identifying
and classifying diseases from medical images such as X-rays, CT scans, and MRI scans.
Comparison components | LeNet | AlexNet
Year of Introduction | 1998 | 2012
Depth | Relatively shallow (7 layers) | Deep (8 layers)
Convolutional Layers | Three convolutional layers (C1, C3, and C5) | Five convolutional layers (Conv1 to Conv5)
Pooling Layers | Two average-pooling layers (S2 and S4) | Three max-pooling layers (MaxPool1 to MaxPool3)
Activation Function | Tanh (mainly) | ReLU (Rectified Linear Unit)
Normalization | Not used | Local Response Normalization (LRN); Batch Normalization in some later variants
Dropout | Not originally used | Employed in the fully connected layers
Input Size | Typically 32x32 grayscale images | Typically 224x224 color (RGB) images
Output Layer Activation | Softmax for classification | Softmax for classification
Complexity | Simple compared to modern architectures | Relatively complex for its time, but considered less complex than some modern networks
• Visual Question Answering (VQA): In VQA tasks, AlexNet can be used to extract visual features from
images, which are then combined with textual information to answer questions about the content
of the images.
• Gesture Recognition: For applications involving gesture recognition, such as sign language
interpretation or human-computer interaction, AlexNet can assist in recognizing and interpreting
hand gestures or body poses in images or video frames.
• Emotion Recognition: AlexNet can be used to recognize and classify facial expressions in images or
video frames, enabling emotion recognition in human-computer interaction or sentiment analysis.
• Artificial Intelligence in Games: In the gaming industry, AlexNet can be used for character
recognition, gesture recognition, or identifying objects within the game environment.
• Quality Control in Manufacturing: In manufacturing and industrial settings, AlexNet can be
employed for visual inspection and quality control tasks, such as identifying defects or anomalies in
products.
• Security and Surveillance: AlexNet can be used for object recognition and tracking in surveillance
systems, aiding in security applications.
