UNIT-III DeepLearning Notes
DEFINITION
IBM DEFINITION:
IBM defines Convolutional Neural Networks (CNNs) as a specialized type of artificial neural network
designed to process structured grid-like data, such as images, by leveraging the spatial and temporal
dependencies between pixels. IBM emphasizes CNNs' ability to mimic the human visual system. By
scanning small parts of an image (using filters), CNNs recognize essential features and combine them
to understand the overall image.
A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture
commonly used in Computer Vision. Computer vision is a field of Artificial Intelligence that enables a
computer to understand and interpret images and other visual data.
Artificial Neural Networks perform well on many Machine Learning tasks. They are applied to
many kinds of data, such as images, audio, and text, and different types of Neural Networks are
used for different purposes. For example, to predict a sequence of words we use Recurrent
Neural Networks (more precisely, an LSTM), while for image classification we use Convolutional
Neural Networks.
ARCHITECTURE EXPLANATION:
Neural Networks: Layers and Functionality
In a regular Neural Network there are three types of layers:
1. Input Layer: This is the layer through which we give input to our
model. The number of neurons in this layer is equal to the total number
of features in our data (the number of pixels in the case of an image).
2. Hidden Layer: The output of the input layer is fed into the
hidden layer. There can be many hidden layers, depending on the
model and the size of the data. Each hidden layer can have a different
number of neurons, often (though not necessarily) greater than the
number of features. The output of each layer is computed by multiplying
the output of the previous layer by the learnable weights of that layer,
adding learnable biases, and then applying an activation function,
which makes the network nonlinear.
3. Output Layer: The output of the last hidden layer is fed into a
function such as sigmoid or softmax, which converts the score of
each class into a probability for that class.
The data is fed into the model and the output of each layer is computed
as described above; this step is called feedforward. We then calculate the
error using an error (loss) function; common choices are cross-entropy
and squared error. The error function measures how well the network is
performing. After that, we propagate the error backward through the model
by calculating the derivatives of the error with respect to the weights.
This step is called backpropagation, and the resulting gradients are used
to adjust the weights so that the loss is minimized.
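The feedforward, error, and backpropagation steps described above can be sketched with NumPy. This is a minimal illustration rather than a full training setup: the layer sizes, the random data, the learning rate of 0.1, and the helper names `softmax`, `forward`, and `cross_entropy` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 samples, 4 features, 3 classes (all values are made up).
X = rng.normal(size=(8, 4))
labels = rng.integers(0, 3, size=8)
Y = np.eye(3)[labels]                      # one-hot targets

# Learnable weights and biases of one hidden layer and one output layer.
W1, b1 = rng.normal(scale=0.1, size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(scale=0.1, size=(5, 3)), np.zeros(3)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X):
    h = np.maximum(0, X @ W1 + b1)         # hidden layer: weights, bias, ReLU
    p = softmax(h @ W2 + b2)               # output layer: class probabilities
    return h, p

def cross_entropy(p, Y):
    return -np.mean(np.sum(Y * np.log(p + 1e-12), axis=1))

losses = []
for step in range(50):
    h, p = forward(X)                      # feedforward
    losses.append(cross_entropy(p, Y))     # error
    # Backpropagation: gradients of the loss with respect to each parameter.
    dz2 = (p - Y) / len(X)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dh = dz2 @ W2.T
    dz1 = dh * (h > 0)                     # ReLU passes gradient only where it was active
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    # Gradient descent update (in place) with an assumed learning rate of 0.1.
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= 0.1 * grad

print(losses[0], losses[-1])               # the loss should shrink over the 50 steps
```

In practice a framework such as TensorFlow computes these gradients automatically; writing them out by hand only serves to show what the backward pass does.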
Convolution Neural Network
A Convolutional Neural Network (CNN) is an extension of the artificial
neural network (ANN) that is predominantly used to extract features
from grid-like data, for example visual datasets such as images or
videos, where spatial patterns play an important role.
CNN Architecture
A Convolutional Neural Network consists of multiple layers: the input
layer, convolutional layers, pooling layers, and fully connected layers.
Now imagine taking a small patch of an image and running a small
neural network, called a filter or kernel, on it, with, say, K outputs, and
representing them vertically. Now slide that neural network across the
whole image. As a result, we get another image with a different width,
height, and depth: instead of just the R, G, and B channels we now have
more channels, but a smaller width and height. This operation is
called Convolution. If the patch size were the same as that of the image,
it would be a regular neural network. Because the patch is small, we have
fewer weights.
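To make the "fewer weights" point concrete, the sketch below counts the learnable parameters for a 32 x 32 x 3 image. The 3 x 3 patch size and K = 12 outputs are assumed values for illustration, and the helper names `conv_params` and `dense_params` are hypothetical.

```python
def conv_params(patch_h, patch_w, in_channels, k_outputs):
    # Each of the K filters has one weight per patch pixel per input channel, plus one bias.
    return (patch_h * patch_w * in_channels + 1) * k_outputs

def dense_params(height, width, in_channels, k_outputs):
    # A regular (fully connected) layer is the special case where the patch covers the whole image.
    return conv_params(height, width, in_channels, k_outputs)

small_patch = conv_params(3, 3, 3, 12)       # 3x3 patch slid across the image
whole_image = dense_params(32, 32, 3, 12)    # patch as large as the image

print(small_patch)   # 336
print(whole_image)   # 36876
```

The same 336 weights are reused at every position of the image, which is why the count does not grow with the image size.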
Input Layer: This layer holds the raw input image, here with width 32,
height 32, and depth 3.
Convolutional Layers: These layers extract features from the input.
Each applies a set of learnable filters, known as kernels, to the input
image. The filters/kernels are small matrices, usually of shape 2×2,
3×3, or 5×5. Each filter slides over the input image and computes the
dot product between the kernel weights and the corresponding input
image patch. The outputs of this layer are referred to as feature maps.
Suppose we use a total of 12 filters for this layer; with stride 1 and
padding that preserves the spatial size, we get an output volume of
dimension 32 x 32 x 12.
Activation Layer: Activation layers add nonlinearity to the network
by applying an element-wise activation function to the output of the
preceding convolution layer. Some common activation functions are
ReLU: max(0, x), Tanh, and Leaky ReLU. The volume remains unchanged,
hence the output volume will have dimensions 32 x 32 x 12.
Pooling layer: This layer is periodically inserted in ConvNets. Its
main function is to reduce the size of the volume, which makes the
computation faster, reduces memory use, and helps limit overfitting.
Two common types of pooling layers are max pooling and average
pooling. If we use a max pool with 2 x 2 filters and stride 2, the
resultant volume will be of dimension 16x16x12.
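Flattening: After the convolution and pooling layers, the resulting
feature maps are flattened into a one-dimensional vector so that they can
be passed into a fully connected layer for classification or
regression.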
Fully Connected Layers: These take the input from the previous layer
and compute the final classification or regression output.
Output Layer: For classification tasks, the output from the fully
connected layers is then fed into a function such as sigmoid or
softmax, which converts the score of each class into the probability
of that class.
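The shape changes described in this section can be traced on a random volume. The sketch starts from an assumed convolution output of 32 x 32 x 12 (the convolution itself is not computed here), and the 10-class output layer with random weights is a made-up stand-in for a trained classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

conv_out = rng.normal(size=(32, 32, 12))        # stand-in for the convolution output

# Activation layer: element-wise ReLU, shape is unchanged.
activated = np.maximum(0, conv_out)

# Pooling layer: 2x2 max pool with stride 2, done by grouping pixels into 2x2 blocks.
h, w, c = activated.shape
pooled = activated.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

# Flattening: turn the volume into one long vector.
flat = pooled.reshape(-1)

# Fully connected + output layer: random weights, softmax over 10 assumed classes.
W = rng.normal(scale=0.01, size=(flat.size, 10))
scores = flat @ W
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(activated.shape)  # (32, 32, 12)
print(pooled.shape)     # (16, 16, 12)
print(flat.shape)       # (3072,)
print(probs.shape)      # (10,)
```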
Advantages and Disadvantages of Convolutional
Neural Networks (CNNs)
Advantages of CNNs:
1. Good at detecting patterns and features in images, videos, and
audio signals.
2. Fairly robust to small translations of the input; robustness to
rotation and scaling is more limited and usually relies on data
augmentation.
3. End-to-end training, no need for manual feature extraction.
4. Can handle large amounts of data and achieve high accuracy.
Disadvantages of CNNs:
1. Computationally expensive to train and require a lot of memory.
2. Can be prone to overfitting if not enough data or proper
regularization is used.
3. Requires large amounts of labeled data.
4. Interpretability is limited: it is hard to understand what the
network has learned.
CNN TYPES
CNN Architecture
LeNet-5, introduced by Yann LeCun and his team in the 1990s, was
one of the first successful CNN architectures. Designed for
handwritten digit recognition, it laid the foundation for subsequent
CNN developments. LeNet-5 features convolutional layers,
subsampling layers, and fully connected layers, showcasing the core
elements of modern CNNs.
LeNet CNN
3. VGGNet: The Pursuit of Simplicity
4. GoogLeNet (Inception): Embracing Parallelism
Google Net Like Architecture
Image Classification using CNN
Today, we will create an image classifier of our own that can tell
whether a given picture shows a dog, a cat, or something else, depending
on the data you feed it. To achieve this, we will use one of the
best-known machine learning algorithms for image classification, the
Convolutional Neural Network (CNN).
So what is a CNN? It is a machine learning algorithm that learns the
features of the training images and remembers them, so that it can guess
the label of a new image fed to the machine. Since this is not an article
explaining CNNs, I’ll add some links at the end if you are interested in
how a CNN works and behaves.
Now let us see how to create our very own cat-vs-dog image classifier.
For the dataset we will use the Kaggle dataset of cat-vs-dog:
After getting the dataset, we need to preprocess the data a bit and
provide a label for each training image. The name of each image in the
training set starts with either “cat” or “dog”, so we use that to our
advantage and one-hot encode the labels for the machine to understand
(cat [1, 0] or dog [0, 1]).
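def label_img(img):
    word_label = img.split('.')[-3]  # e.g. "cat.0.jpg" -> "cat"
    # One-hot label as described above: cat -> [1, 0], dog -> [0, 1]
    return [1, 0] if word_label == 'cat' else [0, 1]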
Image classification using Convolutional Neural Networks (CNNs) is a popular approach for
analyzing visual data. Here's how you can implement a basic CNN in Python using
TensorFlow/Keras to classify images.
For demonstration, let's use the CIFAR-10 dataset, which contains 60,000 32x32 color
images across 10 classes.
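import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

# Load CIFAR-10, scale pixels to [0, 1], and one-hot encode the labels
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

class_names = ['Airplane', 'Automobile', 'Bird', 'Cat', 'Deer', 'Dog',
               'Frog', 'Horse', 'Ship', 'Truck']

# Show the first 10 training images with their class names
plt.figure(figsize=(10, 5))
for i in range(10):
    plt.subplot(2, 5, i+1)
    plt.imshow(x_train[i])
    plt.title(class_names[y_train[i].argmax()])
    plt.axis('off')
plt.tight_layout()
plt.show()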
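from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # First Convolutional Layer
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    # The remaining layers are missing from these notes; the two lines
    # below are an assumed minimal classifier head for the 10 classes
    Flatten(),
    Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    x_train, y_train,
    validation_split=0.2,
    epochs=10,
    batch_size=64
)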
import numpy as np
Output
CNN PRACTICAL APPLICATION
Handwritten character recognition using the MNIST dataset involves several steps, from data
preparation to model training and evaluation. Here’s a breakdown of the process:
1. Dataset Overview
2. Data Preprocessing
Normalization: Pixel values are scaled to the range [0, 1] by dividing by 255.0. This
helps improve the training stability and convergence speed.
Reshaping: Since most deep learning frameworks expect input with a channel
dimension, images are reshaped from (28, 28) to (28, 28, 1).
Label Encoding: The labels (digits 0-9) are one-hot encoded. For example, the label
3 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].
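The three preprocessing steps above can be sketched with NumPy alone. The images here are random pixels and the labels are made up, standing in for real MNIST data so that the example runs without downloading anything.

```python
import numpy as np

# Stand-in for a few MNIST images and labels (random pixels, made-up labels).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
labels = np.array([3, 0, 9, 5])

# Normalization: scale pixel values from [0, 255] to [0, 1].
x = images.astype("float32") / 255.0

# Reshaping: add a trailing channel dimension, (28, 28) -> (28, 28, 1).
x = x.reshape(-1, 28, 28, 1)

# Label encoding: one-hot vectors of length 10.
y = np.eye(10, dtype=int)[labels]

print(x.shape)  # (4, 28, 28, 1)
print(y[0])     # [0 0 0 1 0 0 0 0 0 0]
```

With Keras, `to_categorical` performs the same label encoding; the `np.eye` indexing is used here only to keep the sketch dependency-free.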
3. Model Architecture
Convolutional Neural Network (CNN): CNNs are effective for image classification
because they can capture spatial hierarchies and features.
o Convolutional Layers: These layers apply filters to the input images to detect
features like edges and patterns.
o Activation Function: ReLU (Rectified Linear Unit) is commonly used to
introduce non-linearity.
o Pooling Layers: Max pooling layers reduce the spatial dimensions of the
feature maps, retaining the most important features and reducing computation.
o Dense Layers: After flattening the output from the convolutional layers, dense
layers (fully connected layers) are used to make the final classification. The
last layer typically uses a softmax activation function for multi-class
classification.
After training, the model is evaluated on the test set to assess its performance. The test
accuracy indicates how well the model can generalize to unseen data.
The final test accuracy gives a measure of the model’s performance. Commonly,
CNNs achieve accuracies above 98% on the MNIST dataset.
Misclassifications can be analyzed to understand where the model struggles, which
can guide further improvements.
7. Applications
This process provides a solid foundation for understanding how to approach handwritten
character recognition tasks using deep learning. You can further experiment with the model
architecture, hyperparameters, and data augmentation techniques to enhance performance.
Creating a handwritten digit recognition system using the MNIST dataset in Python involves
the following steps:
Prerequisites
2. Dataset: MNIST is a dataset of 70,000 handwritten digits (0-9) with each image being
28x28 pixels. You can easily load it using TensorFlow or Keras.
Code Implementation
2. Load the Dataset
from tensorflow.keras.datasets import mnist  # import needed by load_data below

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
Output
Accuracy: With a simple neural network, you can expect around 96-98% accuracy. To
improve further, you can use convolutional neural networks (CNNs).
Extensions
1. Use CNNs: Leverage layers like Conv2D and MaxPooling2D for better accuracy.
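2. Data Augmentation: Enhance training data with transformations like small rotations and
shifts; flipping, useful for many image tasks, should be used with care on digits because it can
change how a digit reads.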
3. Save the Model: Save your trained model using model.save('mnist_model.h5') for
reuse.
Sample Input
1. Image Input
o The MNIST dataset contains grayscale images of digits. Each image is 28x28
pixels, with pixel values ranging from 0 to 255.
o Below is an example of one such image from the dataset:
Image:
[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0]
...
[ 0 0 0 54 63 156 170 253 253 189 39 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0]
...
]
o The pixel values are normalized to the range [0, 1]:
Sample Output
1. Predicted Class: After feeding the image into the trained model, it outputs a
probability distribution over the 10 classes (digits 0-9). For example:
predictions = model.predict(x_test)
print(predictions[0])
Output:
2. Interpretation: The model predicts that the input image corresponds to the digit 5, as
the 6th value (index 5) has the highest probability (0.995).
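3. Visual Output: Display the result visually:
predicted_digit = predictions[0].argmax()  # index of the highest probability
plt.imshow(x_test[0].squeeze(), cmap='gray')  # squeeze() drops a trailing channel axis, if present
# Assumes y_test still holds integer labels (not one-hot vectors)
plt.title(f"Predicted: {predicted_digit}, Actual: {y_test[0]}")
plt.axis('off')
plt.show()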
Output Image:
o The grayscale image of "5" is shown with the prediction and actual label as
"5".
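The interpretation step in this sample output amounts to taking the index of the largest probability. The sketch below shows this on a hypothetical probability vector: the model's actual output is not reproduced in these notes, so every value is made up except the 0.995 at index 5, which the text states.

```python
import numpy as np

# Hypothetical softmax output for one image; only the 0.995 at index 5 comes from the text.
probabilities = np.array([0.0005, 0.0005, 0.0005, 0.001, 0.0005,
                          0.995, 0.0005, 0.0005, 0.0005, 0.0005])

predicted_digit = int(np.argmax(probabilities))   # index of the largest probability
confidence = float(probabilities[predicted_digit])

print(predicted_digit)  # 5
print(confidence)       # 0.995
```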
APPLICATIONS OF CNN
Convolutional Neural Networks (CNNs) are widely used in various domains due to their ability to
extract features and patterns from images and structured data. Below are some notable applications
of CNNs:
1. Computer Vision
Image Classification
Objective: Categorize images into predefined classes (e.g., cat, dog, airplane).
Examples:
o Handwritten digit recognition (MNIST dataset).
o Classifying medical images (e.g., X-rays, MRIs).
Object Detection
Objective: Identify objects within an image and locate them with bounding boxes.
Examples:
o Autonomous vehicles detecting pedestrians, traffic signs, and other vehicles.
o Face detection in images (used in social media tagging).
Image Segmentation
Image Generation
Facial Recognition
2. Healthcare
Medical Imaging
Drug Discovery
Pathology
3. Autonomous Systems
Self-Driving Cars
Drones
Text Recognition
Objective: Recognize and convert handwritten or printed text into digital format.
Examples:
o Optical Character Recognition (OCR).
o Translating text from images in different languages.
Speech Recognition
Style Transfer
Video Surveillance
Biometric Authentication
7. Robotics
Industrial Automation
Examples:
o Quality control in manufacturing using visual inspection.
o Robots picking and sorting items using visual cues.
Agriculture
8. Astronomy
Satellite Imaging
Visual Search
Product Recommendation
10. Art and Creativity
Image Enhancement
Content Creation
1. Convolutional Layer
Definition: The core building block of CNNs, this layer applies filters (kernels) to the
input data to extract features like edges, textures, and patterns.
Key Terms:
o Filter/Kernel: A small matrix (e.g., 3x3 or 5x5) used to scan the input data
and detect patterns.
o Stride: The step size with which the filter moves across the input. A stride of
1 means the filter moves one pixel at a time.
o Feature Map: The output of the convolution operation, showing detected
features.
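The terms filter, stride, and feature map can be illustrated with a small NumPy sketch. As in most deep learning libraries, the "convolution" here is computed without flipping the kernel (strictly speaking, a cross-correlation). The function name `conv2d`, the 5 x 5 image, and the edge-detecting kernel values are assumptions for the example.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)   # dot product of patch and kernel
    return feature_map

# 5x5 image with a vertical edge: dark (0) on the left, bright (1) on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Simple vertical-edge filter (values chosen for illustration).
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

print(conv2d(image, kernel, stride=1))         # every row is [3, 3, 0]
print(conv2d(image, kernel, stride=2).shape)   # (2, 2)
```

The feature map is large where the patch contains the dark-to-bright edge and zero where the patch is uniform, which is what "detecting a pattern" means here.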
2. Padding
Definition: Adding extra pixels (usually zeros) around the edges of the input to
control the size of the output feature map.
Types:
o Valid Padding: No padding; the feature map shrinks after convolution.
o Same Padding: Padding added to keep the output size the same as the input
size.
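The effect of the two padding types can be checked with the usual output-size formula, floor((n + 2p - k) / s) + 1, for input size n, kernel size k, padding p, and stride s. The helper name `output_size` is hypothetical, and "same" padding keeping the size unchanged is shown for stride 1.

```python
import numpy as np

def output_size(n, k, p, s=1):
    # Standard formula: floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

image = np.ones((5, 5))
# Zero padding of one pixel on every side, as used by "same" padding for a 3x3 kernel.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)              # (7, 7)
print(output_size(5, 3, p=0))    # 3  (valid padding: the map shrinks)
print(output_size(5, 3, p=1))    # 5  (same padding: the size is kept)
print(output_size(32, 3, p=1))   # 32
```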
3. Activation Function
4. Pooling Layer
Definition: Reduces the size of the feature map, making the model faster and less
prone to overfitting.
Types:
o Max Pooling: Takes the maximum value in each patch.
o Average Pooling: Computes the average of each patch.
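The two pooling types can be compared on a small feature map. This is a plain-loop sketch for a single 2D map; the function name `pool2d` and the 4 x 4 values are made up for the example.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map."""
    reduce_fn = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            pooled[i, j] = reduce_fn(patch)
    return pooled

feature_map = np.array([[1, 3, 2, 4],
                        [5, 7, 6, 8],
                        [9, 2, 1, 0],
                        [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")       # rows: [7, 8] and [9, 6]
avg_pooled = pool2d(feature_map, mode="average")   # rows: [4, 5] and [4.5, 3]
print(max_pooled)
print(avg_pooled)
```

Both variants halve the width and height here; max pooling keeps the strongest response in each patch, while average pooling smooths the responses.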
5. Fully Connected Layer
Definition: Connects all neurons from the previous layer to every neuron in this layer.
It is used at the end of the network for classification or regression tasks.
Purpose: Combines features learned by convolutional layers for decision-making.
6. Dropout
7. Flattening
8. Backpropagation
Definition: The process of computing the gradient of the loss function with respect to
the weights, layer by layer from the output back to the input. The optimizer then uses
these gradients to update the weights.
9. Epoch
10. Loss Function
Definition: Measures the error or difference between the predicted output and the
actual target.
Common Loss Functions:
o Cross-Entropy Loss: Used for classification tasks.
o Mean Squared Error (MSE): Used for regression tasks.
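Both loss functions can be written in a few lines of NumPy. The function names and the sample predictions below are made up for illustration; deep learning libraries ship their own, more general versions.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy; y_true is one-hot, y_pred holds probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)     # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Classification: one sample whose true class is index 1.
y_true = np.array([[0.0, 1.0, 0.0]])
confident = np.array([[0.05, 0.90, 0.05]])
unsure = np.array([[0.40, 0.30, 0.30]])
print(cross_entropy(y_true, confident))   # about 0.105
print(cross_entropy(y_true, unsure))      # about 1.204

# Regression: predictions off by 1 and 3 give a mean squared error of (1 + 9) / 2.
print(mean_squared_error(np.array([2.0, 5.0]), np.array([3.0, 2.0])))   # 5.0
```

The confident, correct prediction receives a much smaller cross-entropy than the unsure one, which is the behavior the loss is meant to reward.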
11. Optimizer
Definition: An algorithm used to adjust the model weights to minimize the loss
function.
Popular Optimizers:
o SGD (Stochastic Gradient Descent).
o Adam (Adaptive Moment Estimation).
Definition: The learning rate is a hyperparameter that controls how much the model
weights are updated at each step during training.
Importance: Too high → may not converge; Too low → slow training.
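The effect of the learning rate can be seen on a toy problem. The sketch minimizes f(w) = w², a stand-in for a real loss function, with plain gradient descent; the three learning rates and the function name `gradient_descent` are assumptions chosen to show each regime.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

too_high = gradient_descent(1.1)     # each step overshoots and |w| grows
reasonable = gradient_descent(0.1)   # converges toward the minimum at 0
too_low = gradient_descent(0.001)    # moves in the right direction, but slowly

print(abs(too_high), abs(reasonable), abs(too_low))
```

After 20 steps the high rate has moved further from the minimum than where it started, the moderate rate is close to 0, and the low rate has barely moved from 5.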
Definition: The receptive field is the region of the input image that a particular
feature in the feature map is derived from. It grows as the network deepens.
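How the receptive field grows with depth can be computed layer by layer with a commonly used recurrence: each layer adds (kernel size - 1) times the current spacing between neighboring outputs, and that spacing is multiplied by the layer's stride. The layer stack below is an assumed example, and the function name `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    field = 1   # one input pixel "sees" only itself
    jump = 1    # spacing, in input pixels, between neighboring outputs of the current layer
    sizes = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes

# Assumed stack: 3x3 conv, 3x3 conv, 2x2 pool (stride 2), 3x3 conv.
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))   # [3, 5, 6, 10]
```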
15. Overfitting and Underfitting
Overfitting: The model performs well on training data but poorly on unseen data.
Underfitting: The model fails to capture the underlying patterns in the data.
16. Transfer Learning
Definition: Using a pre-trained CNN (e.g., ResNet, VGG) on a new but similar task
to save time and improve accuracy.
17. Preprocessing
Definition: Steps to prepare data for the model, such as resizing, normalization
(scaling pixel values to 0–1), and data augmentation (rotations, flips).
A confusion matrix is a performance evaluation tool for classification models. It is a table that
summarizes how well the predictions of a model match the actual labels of a dataset. Each row in
the matrix represents the instances of an actual class, while each column represents the instances of
a predicted class.
For a binary classification task (e.g., classifying an email as spam or not spam), the confusion
matrix is typically a 2x2 table:
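A binary confusion matrix can be built by counting (actual, predicted) pairs, with rows for actual classes and columns for predicted classes as described above. The eight email labels below are made up for illustration, and the function name `confusion_matrix` is this sketch's own (scikit-learn offers a function of the same name, which is not used here).

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Made-up labels for eight emails.
actual    = ["spam", "spam", "spam", "not spam", "not spam", "not spam", "not spam", "spam"]
predicted = ["spam", "not spam", "spam", "not spam", "spam", "not spam", "not spam", "spam"]

matrix = confusion_matrix(actual, predicted, classes=["spam", "not spam"])
(tp, fn), (fp, tn) = matrix          # true/false positives and negatives, with "spam" as positive
accuracy = (tp + tn) / len(actual)

print(matrix)     # [[3, 1], [1, 3]]
print(accuracy)   # 0.75
```

The diagonal holds the correct predictions; the off-diagonal cells show one spam email that was missed and one legitimate email that was flagged.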
Image augmentation is a common technique used in training convolutional neural networks (CNNs)
to artificially expand the training dataset by applying random transformations to the images. Below
is an example of Python code for image augmentation using the Keras library:
PYTHON CODE:
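import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative settings; the original arguments are missing from these notes
datagen = ImageDataGenerator(
    rotation_range=15, width_shift_range=0.1,
    height_shift_range=0.1, horizontal_flip=True)
datagen.fit(x_train)

def plot_augmented_images(datagen, images, num_images=5):
    # Draw one augmented batch and show its first num_images images
    augmented_images = next(datagen.flow(images, batch_size=num_images, shuffle=False))
    plt.figure(figsize=(10, 5))
    for i in range(num_images):
        plt.subplot(1, num_images, i + 1)
        plt.imshow(augmented_images[i])
        plt.axis('off')
    plt.show()

plot_augmented_images(datagen, x_train[:10])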
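model = tf.keras.Sequential([
    # The opening layer is missing from these notes; a 32-filter 3x3 convolution is assumed
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
# Assumed compile step; categorical cross-entropy expects one-hot labels
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model using augmented data
history = model.fit(
    datagen.flow(x_train, y_train, batch_size=64),
    validation_data=(x_test, y_test),
    epochs=10
)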