Unit 1 Kumod Deep Learning


Noida Institute of Engineering and Technology, Greater Noida

Deep Learning
(ACSML0702)

Unit: I

INTRODUCTION
Dr. Kumod Kr. Gupta
(Asst. Professor)
AI Department

Course Details
(B. Tech. 7th Sem)



Faculty Introduction

Name Dr. Kumod Kr. Gupta


Qualification Ph.D., M. Tech
Designation Assistant Professor
Department AI
Total Experience 17 years
NIET Experience 12 years
Subjects Taught Python Basics, Advanced Python, ML, DL



Evaluation Scheme
Bachelor of Technology
Computer Science And Engineering (Artificial Intelligence & Machine Learning)
EVALUATION SCHEME
SEMESTER-VI

Sl. No. | Subject Code | Subject Name | L-T-P | CT | TA | Total | PS | TE | PE | Semester Total | Credit
1 | ACSML0602 | Deep Learning | 3-0-0 | 30 | 20 | 50 | - | 100 | - | 150 | 3
2 | ACSML0603 | Advanced Database Management Systems | 3-1-0 | 30 | 20 | 50 | - | 100 | - | 150 | 4
3 | ACSE0603 | Software Engineering | 3-0-0 | 30 | 20 | 50 | - | 100 | - | 150 | 3
4 | | Departmental Elective-III | 3-0-0 | 30 | 20 | 50 | - | 100 | - | 150 | 3
5 | | Departmental Elective-IV | 3-0-0 | 30 | 20 | 50 | - | 100 | - | 150 | 3
6 | | Open Elective-I | 3-0-0 | 30 | 20 | 50 | - | 100 | - | 150 | 3
7 | ACSML0652 | Deep Learning Lab | 0-0-2 | - | - | - | 25 | - | 25 | 50 | 1
8 | ACSML0653 | Advanced Database Management Systems Lab | 0-0-2 | - | - | - | 25 | - | 25 | 50 | 1
9 | ACSE0653 | Software Engineering Lab | 0-0-2 | - | - | - | 25 | - | 25 | 50 | 1
10 | | Mini Project | 0-0-2 | - | - | - | 50 | - | - | 50 | 1
11 | ANC0601 / ANC0602 | Constitution of India, Law and Engineering / Essence of Indian Traditional Knowledge (Non-Credit) | 2-0-0 | 30 | 20 | 50 | - | 50 | - | 100 | -
12 | | MOOCs (For B.Tech. Hons. Degree) | | | | | | | | |

GRAND TOTAL | 1100 | 23



Syllabus
UNIT-I: Model Improvement and Performance

Curse of Dimensionality, Bias and Variance Trade-off, Overfitting and Underfitting,
Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-Value, Classification -
Precision, Recall, F1, Other topics, K-Fold Cross-Validation, ROC curve, Hyper-Parameter
Tuning Introduction - Grid Search, Random Search, Introduction to Deep Learning.

Artificial Neural Network: Neuron, nerve structure and synapse, Artificial Neuron and its
model, activation functions, Neural network architecture: single-layer and multilayer feed-forward
networks, recurrent networks. Various learning techniques; Perceptron and
Convergence rule, Hebb Learning, Perceptrons, Multilayer perceptron, Gradient descent and
the Delta rule, Multilayer networks, Derivation of the Backpropagation Algorithm.



Syllabus

UNIT-II: CONVOLUTIONAL NEURAL NETWORK

What is computer vision? Why Convolutions (CNN)?

Introduction to CNN, Train a simple convolutional neural net, Explore the design space for convolutional nets,
Pooling layer motivation in CNN, Design a convolutional layered application, Understanding and visualizing a
CNN, Transfer learning and fine-tuning CNN, Image classification, Text classification, Image classification and
hyper-parameter tuning, Emerging NN architectures



Syllabus

UNIT-III: DETECTION & RECOGNITION

Padding & Edge Detection, Strided Convolutions, Networks in Networks and
1x1 Convolutions, Inception Network Motivation, Object Detection, YOLO Algorithm.



Syllabus
UNIT-IV: RECURRENT NEURAL NETWORKS

Why use sequence models? Recurrent Neural Network Model, Notation, Back-propagation
through time (BPTT), Different types of RNNs, Language model and sequence generation,
Sampling novel sequences, Vanishing gradients with RNNs, Gated Recurrent Unit (GRU),
Long Short-Term Memory (LSTM), Bidirectional RNN, Deep RNNs



Syllabus

UNIT-V: AUTO ENCODERS IN DEEP LEARNING

Auto-encoders and unsupervised learning, Stacked auto-encoders and semi-supervised learning,

Regularization - Dropout and Batch normalization.



Course Objective

To be able to learn unsupervised techniques and provide continuous improvement in the
accuracy and outcomes of various datasets, with more reliable and concise analysis results.



Course Outcome (CO)

Course Outcome (CO) | At the end of the course, the student will be able to: | Bloom's Knowledge Level (KL)
CO1 | Analyze ANN models and understand the ways of accuracy measurement. | K4
CO2 | Develop a convolutional neural network for multi-class classification in images. | K6
CO3 | Apply deep learning algorithms to detect and recognize an object. | K3
CO4 | Apply RNNs to time series forecasting, NLP, text and image classification. | K4
CO5 | Apply lower-dimensional representation over higher-dimensional data for dimensionality reduction and capture the important features of an object. | K3
Program Outcomes (POs)

Engineering Graduates will be able to:

PO1 : Engineering Knowledge

PO2 : Problem Analysis

PO3 : Design/Development of solutions

PO4 : Conduct Investigations of complex problems

PO5 : Modern tool usage

PO6 : The engineer and society


Program Outcomes (POs)

Engineering Graduates will be able to:

PO7 : Environment and sustainability

PO8 : Ethics

PO9 : Individual and teamwork

PO10 : Communication

PO11 : Project management and finance

PO12 : Life-long learning


CO-PO Mapping

CO.K PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12

CO1 3 3 3 3 2 2 1 - 1 - 2 2

CO2 3 3 3 3 2 2 1 - 1 1 2 2

CO3 3 3 3 3 3 2 2 - 2 1 2 3

CO4 3 3 3 3 3 2 2 1 2 1 2 3

CO5 3 3 3 3 3 2 2 1 2 1 2 2

AVG 3.0 3.0 3.0 3.0 2.6 2.0 1.6 0.4 1.6 0.8 2.0 2.4



Result Analysis 2022-2023 (Even semester )

Institute Result

FACULTY NAME BRANCH/SECTION RESULT



Pattern of Online External Exam Question Paper (100 marks)


Unit I Content

Model Improvement and Performance:
• Curse of Dimensionality
• Bias and Variance Trade-off
• Overfitting and Underfitting
• Regression - MAE, MSE, RMSE, R Squared, Adjusted R Squared, p-Value
• Classification - Precision, Recall, F1
• Other topics: K-Fold Cross-Validation, ROC curve
• Hyper-Parameter Tuning Introduction - Grid Search, Random Search
• Introduction to Deep Learning

Artificial Neural Network:
• Neuron, nerve structure and synapse
• Artificial Neuron and its model
• Activation functions
• Neural network architecture: single-layer and multilayer feed-forward networks, recurrent networks
• Various learning techniques: Perceptron and Convergence rule, Hebb Learning
• Perceptrons, Multilayer perceptron, Gradient descent and the Delta rule
• Multilayer networks
• Derivation of the Backpropagation Algorithm



Unit I Objective

Analyze ANN model and understand the ways of accuracy measurement.



Topic Prerequisites

• Python, Basic Modeling Concepts



Topic Objective

To be able to learn unsupervised techniques and provide continuous improvement in the accuracy
and outcomes of various datasets, with more reliable and concise analysis results.

Analyze ANN models and understand the ways of accuracy measurement.



Curse of Dimensionality (CO1)

• Increasing the number of features will not always improve classification accuracy.

• In practice, the inclusion of more features might actually lead to worse performance.

• The number of training examples required increases exponentially with dimensionality d (i.e., k^d).

[Figure: with k = 3 bins per feature, a 1-D space has 3 bins, a 2-D space has 3^2 bins, and a 3-D space has 3^3 bins.]



Dimensionality Reduction (CO1)

• What is the objective?
– Choose an optimum set of features of lower dimensionality to improve classification accuracy.

• Different methods can be used to reduce dimensionality:
– Feature extraction
– Feature selection



Dimensionality Reduction (CO1)

There are two components of dimensionality reduction:


•Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset which can be used to model the problem. It usually
involves three approaches:
• Filter
• Wrapper
• Embedded
•Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional
space, i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
•Principal Component Analysis (PCA)
•Linear Discriminant Analysis (LDA)
•Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear and non-linear, depending upon the
method used. The prime linear method, called Principal Component Analysis, or
PCA, is discussed below.
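As a quick sketch of feature extraction in practice (assuming scikit-learn; the random 4-feature matrix below is only placeholder data):

# A minimal PCA example: project 4-dimensional data down to 2 dimensions
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)            # 100 samples, 4 features (placeholder data)
pca = PCA(n_components=2)             # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)      # project the data onto those directions

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component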



Dimensionality Reduction (CO1)

Advantages of Dimensionality Reduction


•It helps in data compression, and hence reduced storage space.
•It reduces computation time.
•It also helps remove redundant features, if any.
•Improved Visualization: High dimensional data is difficult to visualize, and dimensionality
reduction techniques can help in visualizing the data in 2D or 3D, which can help in better
understanding and analysis.
•Overfitting Prevention: High dimensional data may lead to overfitting in machine learning
models, which can lead to poor generalization performance. Dimensionality reduction can
help in reducing the complexity of the data, and hence prevent overfitting.
•Feature Extraction: Dimensionality reduction can help in extracting important features
from high dimensional data, which can be useful in feature selection for machine learning
models.
•Data Preprocessing: Dimensionality reduction can be used as a preprocessing step before
applying machine learning algorithms to reduce the dimensionality of the data and hence
improve the performance of the model.
•Improved Performance: Dimensionality reduction can help in improving the performance
of machine learning models by reducing the complexity of the data, and hence reducing
the noise and irrelevant information in the data.
Dimensionality Reduction (CO1)

Disadvantages of Dimensionality Reduction


•It may lead to some amount of data loss.
•PCA tends to find linear correlations between variables, which is sometimes undesirable.
•PCA fails in cases where mean and covariance are not enough to define datasets.
•We may not know how many principal components to keep; in practice, some rules of thumb are applied.
•Interpretability: The reduced dimensions may not be easily interpretable, and it may be difficult to
understand the relationship between the original features and the reduced dimensions.
•Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially when the number
of components is chosen based on the training data.
•Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to outliers, which can result
in a biased representation of the data.
•Computational complexity: Some dimensionality reduction techniques, such as manifold learning, can be
computationally intensive, especially when dealing with large datasets.



Bias-Variance Tradeoff (CO1)

• It is important to understand prediction errors (bias and variance) when it comes to accuracy in any
machine-learning algorithm.
• There is a tradeoff between a model's ability to minimize bias and variance; managing this tradeoff
is central to selecting a good value of the regularization constant.
• A proper understanding of these errors helps to avoid overfitting and underfitting of a data set
while training the algorithm.
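For squared-error loss, this tradeoff can be stated precisely by the standard bias-variance decomposition of the expected prediction error (a known result, stated here for reference):

E[(y - f̂(x))^2] = (Bias[f̂(x)])^2 + Var[f̂(x)] + σ^2

where the last term σ^2 is the irreducible error (noise) that no model can remove.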



Bias(CO1)
What is Bias?
• Bias is the difference between the predictions of the machine learning model
and the correct values.
• High bias gives a large error on training as well as testing data.
• It is recommended that an algorithm always be low-biased, to avoid the problem of underfitting.
• With high bias, the predictions follow a straight-line format and thus do not fit the data in the data
set accurately. Such fitting is known as underfitting of data. This happens when the hypothesis is too simple or
linear in nature.

[Figure: high bias in the model]


Variance(CO1)

What is Variance?
• The variability of model prediction for a given data point, which tells us the spread of our data, is called the
variance of the model.
• A model with high variance has a very complex fit to the training data and thus is not able to fit accurately
to data it hasn't seen before. As a result, such models perform very well on training data but have
high error rates on test data.
• When a model has high variance, it is said to be overfitting the data.
• Overfitting means fitting the training set accurately via a complex curve and a high-order hypothesis, but this is
not the solution, as the error on unseen data is high. While training a model, variance should be kept low.

[Figure: high variance in the model]


Bias-Variance Trade-off (CO1)

[Figure: Bias and Variance Trade-Off]



Bias- Variance trade off (CO1)

[Figure: Bias-Variance Trade-off - both bias and variance should be low]


Underfitting(CO1)

• In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data.
Such models usually have high bias and low variance. It happens when we have too little data to
build an accurate model, or when we try to fit a linear model to nonlinear data. Models that are too
simple to capture complex patterns in the data, like linear and logistic regression, are also prone to it.

Reasons for Underfitting
1. High bias and low variance.
2. The size of the training dataset used is not enough.
3. The model is too simple.
4. Training data is not cleaned and contains noise.

Techniques to Reduce Underfitting
1. Increase model complexity.
2. Increase the number of features, performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or the duration of training to get better results.



Overfitting(CO1)

• In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern
in the data. It happens when we train our model a lot over a noisy dataset. These models have low bias and high
variance. They are very complex models, like decision trees, which are prone to overfitting.

Overfitting is a problem where the evaluation of a machine learning algorithm on training data differs from its
evaluation on unseen data.

Reasons for Overfitting
1. High variance and low bias.
2. The model is too complex.
3. The size of the training data.

Techniques to Reduce Overfitting
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (keep an eye on the loss over the training period; as soon as the loss
begins to increase, stop training).
4. Ridge regularization and Lasso regularization.
5. Use dropout for neural networks to tackle overfitting.
6. Cross-validation (K-Fold Cross-Validation).
7. Batch normalization.
Overfitting(CO1)

Underfitting and Overfitting in Machine Learning



Overfitting(CO1)

Regularization
• The word "regularize" means to make things regular or acceptable.
• This is exactly what we use it for. Regularization is a form of regression used to reduce the error by
fitting a function appropriately on the given training set while avoiding overfitting.
• It discourages the fitting of a complex model, thus reducing the variance and the chances of overfitting. It
is used in the case of multicollinearity (when independent variables are highly correlated).

Recall the equation of linear regression, and let ŷ be the prediction made.

We also introduced the concept of loss functions. We will use one such loss function here:
the Residual Sum of Squares (RSS). It can be mathematically given as:

RSS = Σ (y_i - ŷ_i)^2



Solution of Overfitting(CO1)

Regularization can be of two kinds,


1. Ridge / L2 Regularization
2. Lasso Regression/L1 Regularization

Ridge Regression / L2 Regularization


In this regression, we add a penalty term to the RSS loss function. Our modified loss function now
becomes:

Loss = RSS + λ Σ β_j^2

• Here, λ is called the "tuning parameter", which decides how heavily we want to penalize the
flexibility of our model.
• If we look closely, we might observe that if λ = 0, it performs like plain linear regression.
• As λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates
approach zero.
• As can be seen, selecting a good value of λ is critical. The penalty used by this
method is also known as the "L2 norm".



Solution of Overfitting(CO1)

Lasso Regression / L1 Regularization

This regression adopts the same idea as Ridge Regression, with a change in the penalty term:
instead of Σ β_j^2, we use Σ |β_j|. Thus our new loss function becomes:

Loss = RSS + λ Σ |β_j|

This penalty is sometimes called the "L1 norm".
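As a small sketch of both penalties in code (assuming scikit-learn, where alpha plays the role of λ; X and y stand for a pre-loaded feature matrix and target vector):

# Ridge (L2) vs. Lasso (L1) regularization
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0).fit(X, y)   # penalty: alpha * sum(beta_j ** 2)
lasso = Lasso(alpha=0.1).fit(X, y)   # penalty: alpha * sum(|beta_j|)

print(ridge.coef_)   # coefficients shrunk toward zero, rarely exactly zero
print(lasso.coef_)   # some coefficients driven exactly to zero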



Solution of Overfitting(CO1)

• Note:
• The tuning parameter λ controls the impact on bias and variance.
• As the value of λ rises, it reduces the value of the coefficients, thus reducing the variance.
• Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting)
without losing any important properties in the data.
• But after a certain value, the model starts losing important properties, giving rise to bias in the model and
thus underfitting. Therefore, the value of λ should be carefully selected.
• λ is optimized using cross-validation (K-Fold Cross-Validation).



Solution of Overfitting(CO1)

Regularization:
• The regularization approach promotes smoother functions by creating a new criterion function
that relies not only on the training error, but also on algorithmic complexity.
• In particular, the new criterion function punishes extremely complex hypotheses; seeking
the minimum of this criterion balances error on the training set against complexity.
• Formally, the new criterion can be written as the error on the training set plus
a regularization term, which expresses constraints or sought-after properties of solutions:

Augmented error = training error + λ × (regularization term)

• The second term penalizes complex hypotheses with large variance.

• When we minimize the augmented error function instead of the error on the data alone, we penalize
complex hypotheses and thus decrease variance.
• When λ is taken too large, only very simple functions are allowed and we risk introducing bias. λ
is optimized using cross-validation.



Solution of Overfitting(CO1)

• We consider here the example of the neural network hypothesis class.

• The hypothesis complexity may be expressed through the size of the weights; a common
choice of regularizer is the sum of squared weights, Ω(w) = Σ w_i^2.

• The regularizer encourages smaller weights.

• For small values of the weights, the network mapping is approximately linear.
• Relatively large values of the weights lead to an overfitted mapping with regions of large curvature.



Solution of Overfitting(CO1)

Early Stopping:
• The training of a learning machine corresponds to an iterative decrease in the error function defined
on the training data.
• During a specific training session, this error generally decreases as a function of the number of iterations
of the algorithm.
• Stopping the training before attaining a minimum training error represents a way of restricting
the effective hypothesis complexity.

Pruning:
• An alternative solution that is sometimes more successful than early stopping the growth (complexity)
of the hypothesis is pruning the fully grown hypothesis that is likely to be overfitting the training data.
• Pruning is the basis of search in many decision-tree algorithms; the weakest branches of a large tree
overfitting the training data, which hardly reduce the error rate, are removed.



UNIT-1 Regression
• Regression analysis is a set of statistical methods used for the
estimation of relationships between a dependent variable and
one or more independent variables.
• Regression analysis includes several variations, such as linear,
multiple linear, and nonlinear.
• The most common models are simple linear and multiple
linear.
• Nonlinear regression analysis is commonly used for more
complicated data sets in which the dependent and
independent variables show a nonlinear relationship.
UNIT-1 Regression
• Regression Analysis
– Simple Linear Regression: A model that assesses the relationship
between a dependent variable and an independent variable
Y = mx + c + e
– Where:
• Y – Dependent variable
• x – Independent (explanatory) variable
• c – Intercept
• m – Slope
• e – Residual (error)
UNIT-1 Regression
• Multiple linear regression analysis is essentially similar to the
simple linear model, with the exception that multiple
independent variables are used in the model.
• The mathematical representation of multiple linear regression
is:
Y = a + bX1 + cX2 + dX3 + ϵ
• Where:
• Y – Dependent variable
• X1, X2, X3 – Independent (explanatory) variables
• a – Intercept
• b, c, d – Slopes
• ϵ – Residual (error)
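As a minimal sketch of fitting this model (assuming scikit-learn; the data below is synthetic, generated only to illustrate the API):

# Multiple linear regression: Y = a + b*X1 + c*X2 + d*X3 + error
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(50, 3)                                           # 50 samples of X1, X2, X3
y = 2 + X @ np.array([1.5, -0.7, 3.0]) + 0.1 * np.random.randn(50)  # known a, b, c, d plus noise

model = LinearRegression().fit(X, y)
print(model.intercept_)   # estimate of the intercept a
print(model.coef_)        # estimates of the slopes b, c, d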
UNIT-1 Regression
• Loss Function
– The loss function is a way to measure the performance of a model.
– A high loss indicates a badly trained model; a low loss indicates a well-trained model.
– The loss function should be as small as possible.
– The loss function is calculated over a single training example:
L = (Actual_Value - Predicted_Value)^2
– The loss function is sometimes also known as the error function.
• Cost Function
– The cost function is calculated over a complete batch of data:
C = (1/n) Σ (Actual_Value_i - Predicted_Value_i)^2
UNIT-1 Regression
– Example for Loss and Cost Function

Roll No. | CGPA | IQ | Actual Package | Predicted Package | Loss Function
1 | 5.2 | 100 | 6.3 | 6.4 | 0.01
2 | 4.3 | 91 | 4.5 | 5.3 | 0.64
3 | 8.2 | 83 | 6.5 | 5.2 | 1.69
4 | 8.9 | 102 | 5.5 | 8.9 | 11.56

Cost Function (mean of the four losses) = 3.475

NOTE: The loss function is calculated for individual data points, while the cost
function is calculated for the entire dataset.
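A short check of the table in code (plain NumPy):

# Squared loss per student and the cost (mean loss) over the batch
import numpy as np

actual    = np.array([6.3, 4.5, 6.5, 5.5])
predicted = np.array([6.4, 5.3, 5.2, 8.9])

loss = (actual - predicted) ** 2
print(loss)          # [ 0.01  0.64  1.69 11.56]
print(loss.mean())   # 3.475  -> the cost function value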
UNIT-1 Regression
• MAE (Mean Absolute Error): MAE is a metric that measures
the average absolute difference between the predicted values
and the actual values. It gives an idea of how far off the
predictions are from the true values, regardless of the direction
of the error.

L = |Actual_Value - Predicted_Value|
C = (1/n) Σ |Actual_Value_i - Predicted_Value_i|
UNIT-1 Regression
• Advantages
– Easy to understand.
– Same unit as the unit of Actual_Value.
– It is robust to outliers: outliers do not blow up the error, so if the dataset contains
outliers it is better to use MAE instead of MSE.
• Disadvantages
– The graph is not differentiable (at zero), because of which the Gradient Descent (GD) algorithm
is not easy to implement.
– To implement GD we need to calculate a sub-gradient.
UNIT-1 Regression
• MSE (Mean Squared Error): MSE is a metric that calculates the
average squared difference between the predicted values and
the actual values. Squaring the errors gives more weight to
larger errors, making it useful for penalizing significant
deviations from the true values.

L = (Actual_Value - Predicted_Value)^2
C = (1/n) Σ (Actual_Value_i - Predicted_Value_i)^2
UNIT-1 Regression
• Advantages
– Easy to interpret.
– The loss function is differentiable, which allows GD to be implemented easily.
– One local minimum: the function has a single minimum value that we have to find.

• Disadvantages
– The unit of the error is squared, which creates confusion when interpreting it; to
extract the accurate error we have to take the square root of MSE.
– It is not robust to outliers: if the dataset contains outliers, MSE is not useful.
UNIT-1 Regression
• Huber loss

• Huber loss is useful when a significant fraction of the data (say around 25%) consists of outliers.
With MSE, the fit deviates toward the outliers and misrepresents the 75% of the data that is
correct; with MAE, the 25% of outlier data, which is also significant, is effectively ignored.
In this type of situation, Huber loss, which behaves like MSE for small errors and like MAE for
large ones, is useful.
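A minimal sketch of the standard Huber loss (the threshold delta below is an assumed hyperparameter; the slide does not fix a value):

# Huber loss: quadratic (MSE-like) for small errors, linear (MAE-like) for large ones
import numpy as np

def huber_loss(actual, predicted, delta=1.0):
    error = actual - predicted
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                       # MSE-like branch
    linear = delta * (np.abs(error) - 0.5 * delta)   # MAE-like branch
    return np.where(small, squared, linear).mean()

print(huber_loss(np.array([6.3, 4.5]), np.array([6.4, 9.0])))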
UNIT-1 Regression
• RMSE

• It quantifies the differences between predicted values and actual values by squaring the
errors, taking the mean, and then finding the square root.
• RMSE provides a clear understanding of the model's performance, with
lower values indicating better predictive accuracy.
• RMSE is computed by taking the square root of MSE.
• An RMSE of zero indicates that the model has a perfect fit.
UNIT-1 Regression
• RMSE

• The lower the RMSE, the better the model and its predictions.
• A higher RMSE indicates a large deviation of the residuals from the ground truth.
UNIT-1 Regression
• Pros of the RMSE Evaluation Metric:
– RMSE is easy to understand.
– It serves as a heuristic for training models.
– It is computationally simple and easily differentiable which many
optimization algorithms desire.
– RMSE does not penalize the errors as much as MSE does due to the
square root.
• Cons of the RMSE metric:
– Like MSE, RMSE is dependent on the scale of the data. It increases in
magnitude if the scale of the error increases.
– One major drawback of RMSE is its sensitivity to outliers and the
outliers have to be removed for it to function properly.
UNIT-1 Regression
• R Squared
• R-squared (Coefficient of Determination) is a statistical measure that
quantifies the proportion of the variance in the dependent variable that is
explained by the independent variables in a regression model:

R² = 1 - (SSR / SST)

• Where:
– SSR (Sum of Squares Residual) represents the sum of squared differences between
the observed values and the values predicted by the model.
– SST (Total Sum of Squares) represents the sum of squared differences between the
observed values and the mean of the dependent variable.
UNIT-1 Regression
• R-squared ranges between 0 and 1, with the following
interpretations:
– R² = 0: The model does not explain any of the variability in the dependent
variable. It's a poor fit.
– 0 < R² < 1: The model explains a proportion of the variability. A higher R-squared
indicates a better fit.
– R² = 1: The model perfectly predicts the dependent variable based on the
independent variables, explaining all the variability.
UNIT-1 Regression
• R-squared evaluates regression model fit but has limitations:
• High R-squared doesn't always mean good fit; high value may
imply overfitting, lacking generalization.
• Including more predictors can inflate R-squared, even if they're
weak; adjusted R-squared adjusts for this.
• "Good" R-squared varies by field; lower values acceptable in
data-rich areas.
• R-squared may miss fit quality with nonlinearity or outliers.
UNIT-1 Regression
• Adjusted R Squared

Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]

• Where:
– n = the number of points in your data sample.
– k = the number of independent regressors, i.e. the number of variables
in your model, excluding the constant.
UNIT-1 Regression
• Adjusted R Squared
– Adjusted R-squared adjusts the statistic based on the
number of independent variables in the model
– Adjusted R2 also indicates how well terms fit a curve or line,
but adjusts for the number of terms in a model.
– If you add more and more useless variables to a model,
adjusted r-squared will decrease.
– If you add more useful variables, adjusted r-squared will
increase.
– Adjusted R2 will always be less than or equal to R2
UNIT-1 Regression
• Adjusted R Squared
– Problem Statement:
• A fund has a sample R-squared value close to 0.5, and it is
doubtlessly offering higher risk-adjusted returns, with a
sample size of 50 and 5 predictors. Find the adjusted R-squared
value.
– Sample size = 50, number of predictors = 5, sample R-squared
= 0.5. Substituting these values in the equation:
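Adjusted R² = 1 - [(1 - 0.5)(50 - 1) / (50 - 5 - 1)]
            = 1 - (0.5 × 49 / 44)
            ≈ 1 - 0.557
            ≈ 0.443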
UNIT-1 Regression
• RMSE (Root Mean Squared Error): RMSE is the square root of
the MSE and is commonly used to express the average
magnitude of the prediction errors in the same units as the
dependent variable. It provides a measure of the model's
accuracy, and lower values indicate better performance.
• R Squared (Coefficient of Determination): R-squared is a
statistical measure that represents the proportion of the
variance in the dependent variable that is explained by the
independent variables in the regression model. It ranges from 0
to 1, where 1 indicates that the model explains all the variance,
and 0 indicates that the model doesn't explain any of the variance.
UNIT-1 Regression
• Adjusted R Squared: Adjusted R-squared is a modified version
of R-squared that takes into account the number of
independent variables in the model. It penalizes the addition of
irrelevant variables that might artificially inflate the R-squared
value.
• p-Value: The p-value is a measure of the evidence against a null
hypothesis in a statistical hypothesis test. In the context of
regression analysis, p-values are used to determine whether
the coefficients of the independent variables are statistically
significant. A low p-value (typically below a significance level
like 0.05) suggests that the variable has a significant impact on
the dependent variable.
UNIT-1 Classification
• A Fraud Detection Classifier
• Objective: To detect fraudulent claims
• Assumptions:
– The output of your fraud detection model is the probability [0.0-1.0]
that a transaction is fraudulent.
– If this probability is below 0.5, you classify the transaction as
non-fraudulent; otherwise, you classify the transaction as fraudulent.
• Methodology
– Collect 10,000 manually classified transactions, with 300 fraudulent
transactions and 9,700 non-fraudulent transactions.
– Run your classifier on every transaction, predict the class label
(fraudulent or non-fraudulent), and compare the prediction with the true label.
UNIT-1 Classification

UNIT-1 Classification
• A True Positive (TP=100) is an outcome where the model
correctly predicts the positive (fraudulent) class.
• A True Negative (TN=9,000) is an outcome where the model
correctly predicts the negative (non-fraudulent) class.
• A False Positive (FP=700) is an outcome where the model
incorrectly predicts the positive (fraudulent) class.
• A False Negative (FN=200) is an outcome where the model
incorrectly predicts the negative (non-fraudulent) class.
UNIT-1 Classification
• Accuracy: Correctly predicted values out of total given data.

• Accuracy = ?
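Working it out from the confusion-matrix counts above (TP=100, TN=9,000, FP=700, FN=200):

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 9,000) / 10,000 = 0.91

So the classifier is 91% accurate even though it catches only 100 of the 300 frauds, which is why metrics such as precision and recall also matter.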
UNIT-1 Classification
• Area Under Curve
• Area Under Curve(AUC) is one of the most widely used metrics
for evaluation.
• It is used for binary classification problems.
• AUC of a classifier is equal to the probability that the classifier
will rank a randomly chosen positive example higher than a
randomly chosen negative example.
• Two basic terms used in AUC:
– True Positive Rate (Sensitivity)
– True Negative Rate (Specificity)
UNIT-1 Classification
• Area Under Curve
• Few basic terms used in AUC:
– True Positive Rate (Sensitivity) : True Positive Rate is defined as TP/
(FN+TP). True Positive Rate corresponds to the proportion of positive
data points that are correctly considered as positive, with respect to
all positive data points.

– True Negative Rate (Specificity) : True Negative Rate is defined as
TN / (FP+TN). True Negative Rate corresponds to the proportion of
negative data points that are correctly considered as negative, with
respect to all negative data points.
UNIT-1 Classification
• Area Under Curve
– False Positive Rate : False Positive Rate is defined as FP / (FP+TN).
False Positive Rate corresponds to the proportion of negative data
points that are mistakenly considered as positive, with respect to all
negative data points.

• False Positive Rate and True Positive Rate both have values in
the range [0, 1].
• FPR and TPR are both computed at varying threshold values
such as (0.00, 0.02, 0.04, ..., 1.00) and a graph is drawn.
• AUC is the area under the curve of the plot of False Positive Rate vs.
True Positive Rate.
UNIT-1 Classification
• Area Under Curve
• As evident, AUC has a range of [0, 1]. The greater the value, the
better is the performance of our model.
UNIT-1 Classification
• F1-Score:
– F1 Score is used to measure a test's accuracy.
– F1 Score is the harmonic mean of precision and recall.
– The range of the F1 Score is [0, 1].
– It tells you how precise your classifier is (how many instances it classifies correctly), as well as
how robust it is (it does not miss a significant number of instances).
– High precision but lower recall gives you an extremely accurate classifier, but one that misses a large
number of instances that are difficult to classify.
– The greater the F1 Score, the better the performance of our model.
– Mathematically, it can be expressed as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

– F1 Score tries to find the balance between precision and recall.


UNIT-1 Classification
• Precision: It is the number of correct positive results divided by the number of
positive results predicted by the classifier.

Precision = TP / (TP + FP)

• Recall: It is the number of correct positive results divided by
the number of all relevant samples (all samples that should
have been identified as positive).

Recall = TP / (TP + FN)
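Applying these formulas to the fraud example above (TP=100, FP=700, FN=200):

# Precision, recall, and F1 for the fraud detection classifier
tp, fp, fn = 100, 700, 200

precision = tp / (tp + fp)                          # 100/800  = 0.125
recall    = tp / (tp + fn)                          # 100/300  ~ 0.333
f1 = 2 * precision * recall / (precision + recall)  # ~ 0.182

print(precision, recall, f1)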
Regression - MAE, MSE, RMSE(CO1)

• Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are metrics used to evaluate a
regression model. These metrics tell us how accurate our predictions are and what the amount of
deviation from the actual values is.



MAE(CO1)

• The Mean absolute error represents the average of the absolute difference between the actual and
predicted values in the dataset. It measures the average of the residuals in the dataset.



MSE(CO1)

• Mean Squared Error represents the average of the squared differences between the original and
predicted values in the data set. It measures the variance of the residuals.



RMSE(CO1)

• Root Mean Squared Error is the square root of Mean Squared Error. It measures the standard deviation
of the residuals.



RMSE(CO1)

Technically, RMSE is the Root of the Mean of the Square of Errors and MAE is the Mean of Absolute value
of Errors. Here, errors are the differences between the predicted values (values predicted by our regression model)
and the actual values of a variable. They are calculated as follows :



R-squared(CO1)

• The coefficient of determination or R-squared represents the proportion of the variance in the
dependent variable which is explained by the linear regression model. It is a scale-free score i.e.
irrespective of the values being small or large, the value of R square will be less than one.



Adjusted R-squared(CO1)

• Adjusted R squared is a modified version of R square, adjusted for the number of
independent variables in the model; it will always be less than or equal to R². In the formula
below, n is the number of observations in the data and k is the number of independent variables in
the data.



P-value (CO1)

• The concept of the p-value comes from statistics and is widely used in machine learning and data
science.
• The p-value is used to determine the point of rejection: it provides the
smallest significance level at which the null hypothesis is rejected.
• It is expressed as a level of significance that lies between 0 and 1. A smaller p-value
means stronger evidence to reject the null hypothesis: if the p-value
is very small, the observed output is feasible but doesn't lie under the null
hypothesis conditions (H0).
• A p-value of 0.05 is commonly used as the level of significance (α). It is usually applied via two
rules, which are given below:
– If p-value > 0.05: the large p-value shows that the null hypothesis cannot be rejected.
– If p-value < 0.05: the small p-value shows that the null hypothesis needs to be rejected, and
the result is declared statistically significant.



Precision, Recall, and F1-Score

Some basic terms are Precision, Recall, and F1-Score. These give a finer-grained idea of how well
a classifier is doing, as opposed to just looking at overall accuracy.
We look at a binary classifier here. The same concepts apply more broadly, but require a bit
more consideration for multi-class problems.



Precision

• Precision is a measure of how many of the positive predictions
made are correct (true positives). The formula for it is:

Precision = TP / (TP + FP)



Recall / Sensitivity

Recall is a measure of how many of the positive cases the classifier
correctly predicted, over all the positive cases in the data. It is
sometimes also referred to as Sensitivity. The formula for it is:

Recall = TP / (TP + FN)



F1-Score
• F1-Score is a measure combining both precision and recall. It is generally described
as the harmonic mean of the two.
• Harmonic mean is just another way to calculate an “average” of values, generally
described as more suitable for ratios (such as precision and recall) than the
traditional arithmetic mean.
• The formula used for F1-score in this case is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)



k-Fold Cross-Validation

• Cross-validation is a statistical method used to estimate the skill of machine learning models.
• It is commonly used in applied machine learning to compare and select a model for a given
predictive modeling problem because it is easy to understand, easy to implement, and results in
skill estimates that generally have a lower bias than other methods.
• Cross-validation is a resampling procedure used to evaluate machine learning models on a
limited data sample.
• The procedure has a single parameter called k that refers to the number of groups that a given
data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
• When a specific value for k is chosen, it may be used in place of k in the reference to the
model, such as k=10 becoming 10-fold cross-validation.



k-Fold Cross-Validation

• Suppose a person practices on 70 math questions, all drawn from algebra, but the test has 30
questions, 10 of them from calculus. We cannot judge the person's ability from such a split.
• That's why we go for K-Fold Cross-Validation: to get reliable results.



k-Fold Cross-Validation
Here K=5: the total data is divided into 5 folds. The first time, we use the first fold for testing and the
remaining 80% for training. We repeat this process 5 times, and then take the average of the 5
results.

https://www.youtube.com/watch?v=gJo0uNL-5Qw
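A minimal sketch of 5-fold cross-validation (assuming scikit-learn; the iris data and logistic regression model are placeholders for whatever is being evaluated):

# 5-fold cross-validation: one score per fold, then the average
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 accuracy values, one per fold
print(scores)
print(scores.mean())                          # the averaged result described above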



ROC curve (CO1)

An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve plots two
parameters:
•True Positive Rate
•False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR=TP/(TP+FN)
False Positive Rate (FPR) is defined as follows:
FPR=FP/(FP+TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the
classification threshold classifies more items as positive, thus increasing both False
Positives and True Positives. The following figure shows a typical ROC curve.
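A minimal sketch of computing the curve (assuming scikit-learn; y_true and y_score are placeholder labels and predicted probabilities):

# TPR/FPR pairs at varying thresholds, plus the area under the ROC curve
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)                        # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_score))  # AUC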



ROC curve (CO1)
AUC-ROC curve
Let’s first understand the meaning of the two
terms ROC and AUC.
•ROC: Receiver Operating Characteristics
•AUC: Area Under Curve
ROC Curve
ROC stands for Receiver Operating Characteristics, and the ROC curve is the graphical representation of the
effectiveness of the binary classification model. It plots the true positive rate (TPR) vs the false positive rate (FPR)
at different classification thresholds.
AUC Curve:
AUC stands for Area Under the Curve, and the AUC curve represents the area under the ROC curve. It measures
the overall performance of the binary classification model. As both TPR and FPR range between 0 to 1, So, the area
will always lie between 0 and 1, and A greater value of AUC denotes better model performance. Our main goal is to
maximize this area in order to have the highest TPR and lowest FPR at the given threshold. The AUC measures the
probability that the model will assign a randomly chosen positive instance a higher predicted probability compared
to a randomly chosen negative instance.



ROC curve (CO1)

TPR and FPR


This is the most common definition that you would have encountered when you would
Google AUC-ROC. Basically, the ROC curve is a graph that shows the performance of a
classification model at all possible thresholds( threshold is a particular value beyond which
you say a point belongs to a particular class). The curve is plotted between two parameters
•TPR – True Positive Rate
•FPR – False Positive Rate
ROC curve (CO1)
• Specificity measures the proportion of actual negative instances that are correctly
identified by the model as negative.
• It represents the ability of the model to correctly identify negative instances And as
said earlier ROC is nothing but the plot between TPR and FPR across all possible
thresholds and AUC is the entire area beneath this ROC curve.

Sensitivity versus False Positive Rate plot


ROC curve (CO1)
[Figures: moving the cutoff point one way increases false negatives; moving it the other way increases false positives.]

ROC curve can be used to determine cutoff point, which optimize the sensitivity,
specificity of a given test.

ROC curve (CO1)

AUC measures how well a model is able to distinguish between classes.


An AUC of 0.75 means that if we take two data points belonging to separate classes, there is a 75% chance the
model will segregate or rank-order them correctly, i.e. the positive point gets a higher prediction probability than the
negative one (assuming a higher prediction probability means the point ideally belongs to the positive class). Here is
a small example to make things clearer.

Hyper parameter tuning(CO1)
• Hyperparameters in Machine learning are those parameters that are explicitly defined by the user to control
the learning process.
• These hyperparameters are used to improve the learning of the model, and their values are set before
starting the learning process of the model.
• They are usually fixed before the actual training process begins.
• These parameters express important properties of the model such as its complexity or how fast it should
learn.
• Some examples of model hyperparameters include:
• The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
• The learning rate for training a neural network.
• The C and sigma hyperparameters for support vector machines.
• The k in k-nearest neighbors.

https://www.geeksforgeeks.org/hyperparameter-tuning/



Hyper parameter tuning(CO1)

Models can have many hyperparameters, and finding the best combination of
parameters can be treated as a search problem. The two best strategies for
hyperparameter tuning are:
•GridSearchCV: Grid Search Cross-Validation
•RandomizedSearchCV: Randomized Search Cross-Validation



Grid Search technique (CO1)
Here we cover the most popular hyperparameter optimization techniques used to get the
optimal set of hyperparameters, leading to a robust machine learning model.

In general, if the number of combinations is limited enough, we can use the Grid Search technique. But
when the number of combinations increases, we should try Random Search (or Bayes Search), as they are
not as computationally expensive.

GridSearchCV takes a dictionary that describes the parameters that could be tried on a model to train
it. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values
are the settings to be tested.



Grid Search technique (CO1)
GridSearchCV
In GridSearchCV approach, the machine learning model is evaluated for a range of hyperparameter
values. This approach is called GridSearchCV, because it searches for the best set of hyperparameters
from a grid of hyperparameters values.
For example, if we want to set two hyperparameters C and Alpha of the Logistic Regression Classifier
model, with different sets of values. The grid search technique will construct many versions of the model
with all possible combinations of hyperparameters and will return the best one.
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4]. For a combination
of C=0.3 and Alpha=0.2, the performance score comes out to be 0.726(Highest), therefore it is selected.



Grid Search technique Code (CO1)
# Necessary imports (numpy is needed for np.logspace below)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Creating the hyperparameter grid: 15 values of C between 1e-5 and 1e8
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating the logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object with 5-fold cross-validation
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# X and y are assumed to be a pre-loaded feature matrix and label vector
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))

https://www.geeksforgeeks.org/hyperparameter-tuning/
Grid Search technique (CO1)
Drawback: GridSearchCV will go through all the intermediate combinations of hyperparameters
which makes grid search computationally very expensive.

• RandomizedSearchCV
RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes through only a fixed number
of hyperparameter settings.
• It moves within the grid in a random fashion to find the best set of hyperparameters. This approach
reduces unnecessary computation.



Random Search Code (CO1)
# Necessary imports
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Creating the hyperparameter grid (distributions to sample from)
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiating the Decision Tree classifier
tree = DecisionTreeClassifier()

# Instantiating the RandomizedSearchCV object with 5-fold cross-validation
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# X and y are assumed to be a pre-loaded feature matrix and label vector
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))



Random Search Code (CO1)
Output

Tuned Decision Tree Parameters: {'min_samples_leaf': 5, 'max_depth': 3,
'max_features': 5, 'criterion': 'gini'}
Best score is 0.7265625



Introduction to Deep Learning(CO1)

Deep learning is a class of machine learning algorithms that use several layers of
nonlinear processing units for feature extraction and transformation. Each successive
layer uses the output from the previous layer as input.
Deep neural networks, deep belief networks and recurrent neural networks have been
applied to fields such as computer vision, speech recognition, natural language
processing, audio recognition, social network filtering, machine translation, and
bioinformatics, where they have produced results comparable to, and in some cases better
than, human experts.
Deep Learning algorithms and networks:
• are based on the unsupervised learning of multiple levels of features or representations
of the data: higher-level features are derived from lower-level features to form a
hierarchical representation;
• use some form of gradient descent for training.
Deep Learning Applications (CO1)

Here are just a few examples of deep learning at work:


• A self-driving vehicle slows down as it approaches a
pedestrian crosswalk.
• An ATM rejects a counterfeit bank note.
• A smartphone app gives an instant translation of a
foreign street sign.

• Deep learning is especially well-suited to identification
applications such as face recognition, text translation,
voice recognition, and advanced driver assistance
systems, including lane classification and traffic sign
recognition.



Some other Applications (CO1)

• Speeding up machines
• Digital imaging
• Fraud detection
• Increasing phone efficiency

What Makes Deep Learning State-of-the-Art? (CO1)

In a word: accuracy. Advanced tools and techniques have dramatically improved deep learning algorithms,
to the point where they can outperform humans at classifying images, win against the world's best Go
player, or enable a voice-controlled assistant like Amazon Echo® and Google Home to find and download
that new song you like.



What Makes Deep Learning State-of-the-Art? (CO1)

Three technology enablers make this degree of accuracy possible:

Easy access to massive sets of labeled data: data sets such as
ImageNet and PASCAL VOC are freely available, and are useful for
training on many different types of objects.

Increased computing power: high-performance GPUs accelerate
the training of the massive amounts of data needed for deep
learning, reducing training time from weeks to hours.

Pretrained models built by experts: models such as AlexNet can be retrained to perform new recognition
tasks using a technique called transfer learning. While AlexNet was trained on 1.3 million high-resolution
images to recognize 1000 different objects, accurate transfer learning can be achieved with much smaller
datasets.



Difference between AI, ML, DL (CO1)

[Figure: nested Venn diagram; DL is a subset of ML, which is a subset of AI.]



Why it needed deep learning (CO1)

1. Huge amount of data
(Initially we started with ML; its major drawback is that its efficiency degrades with larger
data sets.)

[Graph: x-axis: amount of data, y-axis: efficiency]

The solution is given by deep learning, which can handle huge amounts of data, whether
structured or unstructured.
2. Complex problems
These include real-time data analysis, medical diagnosis systems, etc., which
are handled by deep learning.



Artificial Neural Network(CO1)

The term "Artificial neural network" refers to a biologically inspired sub-


field of artificial intelligence modeled after the brain. An Artificial neural
network is usually a computational network based on biological neural
networks that construct the structure of the human brain. Similar to a
human brain has neurons interconnected to each other, artificial neural
networks also have neurons that are linked to each other in various layers
of the networks. These neurons are known as nodes.



How do our brains work?(CO1)

 The brain is a massively parallel information processing system.

 Our brains are a huge network of processing elements. A typical brain contains a network of 10 billion
neurons.



How do our brains work?(CO1)
 A processing element

Dendrites: Input
Cell body: Processor
Synaptic: Link
Axon: Output
Synapse: weight



How do our brains work?(CO1)

An artificial neuron is an imitation of a human neuron


Biological neural network and artificial neural network(CO1)

• Dendrites from the Biological Neural Network represent inputs in
Artificial Neural Networks, the cell nucleus represents nodes,
synapses represent weights, and the axon represents output.
• Relationship between biological neural network and artificial
neural network:
Biological Neural Network Artificial Neural Network

Dendrites Inputs

Cell nucleus Nodes

Synapse Weights

Axon Output



The typical Artificial Neural Network looks something like the given figure(CO1)

Our basic computational element (model neuron) is often called a node or unit. It receives input from some other
units, or perhaps from an external source. Each input has an associated weight w, which can be modified so as to
model synaptic learning. The unit computes some function f of the weighted sum of its inputs:



Model of Artificial Neural Network(CO1)

• The following diagram represents the general model of an ANN,
followed by its processing.
• For this general model of an artificial neural network, the
net input can be calculated as follows:

y_in = x1.w1 + x2.w2 + x3.w3 + ... + xm.wm

i.e., net input y_in = Σ_{i=1..m} x_i.w_i

• The output can be calculated by applying the activation function over the net input:

Y = F(y_in), i.e., output = function(net input calculated)
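A small numeric illustration of this model (made-up inputs and weights, sigmoid as the activation function F):

# Net input as a weighted sum, then an activation applied to it
import numpy as np

x = np.array([0.5, 1.0, -0.2])     # inputs x1..x3
w = np.array([0.4, -0.6, 0.9])     # weights w1..w3

y_in = np.dot(x, w)                # net input: sum of x_i * w_i
y = 1.0 / (1.0 + np.exp(-y_in))    # output Y = F(y_in), sigmoid here
print(y_in, y)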



Activation function (CO1)

• Bipolar binary and unipolar binary are called hard-limiting
activation functions, used in discrete neuron models.
• Unipolar continuous and bipolar continuous are called soft-limiting
activation functions; they are said to have sigmoidal
characteristics.



Activation functions (CO1)

Also called the squashing function, as it limits the amplitude of the
output of the neuron.

Many types of activation functions are used:

– linear: a = f(n) = n

– threshold (hard limiting): a = 1 if n >= 0, a = 0 if n < 0

– sigmoid: a = 1 / (1 + e^(-n))

– ...
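Minimal NumPy implementations of these functions, plus the bipolar variants named on the next slides (a sketch, not the deck's own code):

import numpy as np

def linear(n):             return n
def threshold(n):          return np.where(n >= 0, 1, 0)         # unipolar binary (hard limiting)
def sigmoid(n):            return 1.0 / (1.0 + np.exp(-n))       # unipolar continuous
def bipolar_binary(n):     return np.where(n >= 0, 1, -1)
def bipolar_continuous(n): return 2.0 / (1.0 + np.exp(-n)) - 1   # equals tanh(n/2)

n = np.array([-2.0, 0.0, 2.0])
print(sigmoid(n), bipolar_continuous(n))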




Activation Functions (CO1)

[Figures: bipolar continuous and bipolar binary activation functions]


Activation Functions (CO1)

[Figures: unipolar continuous and unipolar binary activation functions]



Activation Functions (CO1)

[Figures: binary perceptrons and continuous perceptrons]



Neural network architecture

Feedforward Network
• It is a non-recurrent network having processing units/nodes in
layers and all the nodes in a layer are connected with the nodes of
the previous layers. The connection has different weights upon
them. There is no feedback loop means the signal can only flow in
one direction, from input to output. It may be divided into the
following two types −



Neural network architecture Cont…(CO1)

•Single layer feedforward network − The concept is of a feedforward
ANN having only one weighted layer. In other words, we can say the
input layer is fully connected to the output layer.



Neural network architecture Cont…(CO1)

Single layer Feedforward Network



Neural network architecture Cont… (CO1)

• Multilayer feedforward network − The concept is of a feedforward ANN having more than one weighted layer. As this network has one or more layers between the input and the output layer, these are called hidden layers.


Multilayer feedforward network (CO1)

Can be used to solve complicated problems. A sketch of the forward pass through such a network follows.
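As an illustration, here is a minimal NumPy sketch of the forward pass through a network with one hidden layer (the 3-4-2 layer sizes, the random seed, and the choice of sigmoid are assumptions made for the example):

import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # 3 inputs (random stand-ins)
W1 = rng.normal(size=(3, 4))  # weights: input -> hidden (4 hidden units)
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 2))  # weights: hidden -> output (2 outputs)
b2 = np.zeros(2)

h = sigmoid(x @ W1 + b1)      # hidden layer activations
y = sigmoid(h @ W2 + b2)      # network output
print(y)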
Neural network architecture Cont… (CO1)

• Feedback Network − As the name suggests, a feedback network has feedback paths, which means the signal can flow in both directions using loops. This makes it a non-linear dynamic system that changes continuously until it reaches a state of equilibrium.
• Recurrent networks − They are feedback networks with closed loops, in which the output goes back to the input again as feedback, as shown in the following diagram.


Feedback network (CO1)

When outputs are directed back as inputs to the same or preceding layer nodes, the result is the formation of a feedback network.


Recurrent network (CO1)

Feedback networks with closed loops are called recurrent networks. The response at the (k+1)-th instant depends on the entire history of the network starting at k = 0.
Automaton: a system with discrete-time inputs and a discrete data representation is called an automaton.

Typical recurrent topologies (a small update sketch follows this list):
• Single node with own feedback
• Competitive nets
• Single-layer recurrent networks
• Multilayer recurrent networks
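For the simplest topology, a single node with its own feedback, the discrete-time behaviour can be sketched as follows (the weights, input, and choice of sigmoid are illustrative assumptions):

import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

w, v = 0.8, 0.5   # hypothetical input weight and self-feedback weight
x = 1.0           # constant external input
y = 0.0           # initial output y(0)
for k in range(5):
    # y(k+1) depends on y(k), and therefore on the entire history since k = 0
    y = sigmoid(w * x + v * y)
    print(k + 1, y)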


Hebbian Learning Rule (CO1)

FEEDFORWARD UNSUPERVISED LEARNING

• The learning signal is equal to the neuron's output.


Features of Hebbian Learning (CO1)

• Feedforward unsupervised learning.
• "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
• If the product oi·xj is positive, the result is an increase in the weight; if it is negative, the weight decreases. An update sketch follows.
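In update form, Hebb's rule increments each weight by the product of a learning constant c, the neuron's output, and the corresponding input. A minimal sketch (all values are made up; sign() stands in for a bipolar binary activation):

import numpy as np

def hebb_update(w, x, c=1.0):
    # learning signal r = o = f(w.x); here f = sign (bipolar binary)
    o = np.sign(np.dot(w, x))
    # delta_w = c * o * x
    return w + c * o * x

w = np.array([1.0, -1.0, 0.0])
x = np.array([1.0, -2.0, 1.5])
w = hebb_update(w, x)  # w.x = 3 > 0, so o = +1 and w moves toward x
print(w)               # [ 2.  -3.   1.5]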


Hebbian Learning Rule (CO1)

• For the same inputs, with a bipolar continuous activation function, the final updated weight is given by the expression in the original slide's figure (the update has the same form, with the continuous output o = f(net) in place of ±1).


Perceptron Learning Rule (CO1)

• The learning signal is the difference between the desired and the actual neuron's response.
• Learning is supervised. An update sketch follows.
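In update form the rule reads Δw = c(d − o)·x, with desired response d and actual response o = sgn(wᵀx). A minimal sketch with made-up values:

import numpy as np

def perceptron_update(w, x, d, c=0.1):
    # learning signal r = d - o, with o = sign(w.x) as a bipolar binary output
    o = 1.0 if np.dot(w, x) >= 0 else -1.0
    return w + c * (d - o) * x

w = np.array([0.5, 0.5])
x = np.array([1.0, 2.0])
d = -1.0                        # desired bipolar response for this pattern
w = perceptron_update(w, x, d)  # o = +1 here, so the weights are corrected
print(w)                        # [0.3 0.1]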




Delta Learning Rule (CO1)

• Only valid for continuous activation functions.
• Used in the supervised training mode.
• The learning signal for this rule is called delta.
• The aim of the delta rule is to minimize the error over all training patterns.


Delta Learning Rule Contd. (CO1)

The learning rule is derived from the condition of least squared error: calculating the gradient vector of the squared error with respect to wi, minimization of the error requires the weight changes to be in the negative gradient direction. The derivation is written out below.
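The derivation itself appears as figures in the original slides; in standard notation (desired response d, actual output o = f(net), learning rate c) it runs:

E = (1/2)(d − o)²,  with  o = f(net)  and  net = ∑i wi·xi
∂E/∂wi = −(d − o)·f′(net)·xi
Δwi = −c·∂E/∂wi = c·(d − o)·f′(net)·xi

Each weight therefore moves opposite to the gradient, by an amount proportional to the error and to the sensitivity f′(net) of the unit.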


MLP training algorithm (CO1)

A Multi-Layer Perceptron (MLP) neural network trained using the backpropagation learning algorithm is one of the most powerful forms of supervised neural network system.
The training of such a network involves three stages:
• feedforward of the input training pattern,
• calculation and backpropagation of the associated error,
• adjustment of the weights.
This procedure is repeated for each pattern over several complete passes (epochs) through the training set.
After training, application of the net involves only the computations of the feedforward phase.


Backpropagation Learning Algorithm (CO1)

Feedforward phase:
• Xi = input[i]
• Yj = f( bj + ∑i Xi·Wij )
• Zk = f( bk + ∑j Yj·Wjk )

Backpropagation of errors:
• δk = Zk(1 − Zk)(dk − Zk)
• δj = Yj(1 − Yj)·∑k δk·Wjk

Weight updating (learning rate η, momentum term α):
• Wjk(t+1) = Wjk(t) + η·δk·Yj + α[Wjk(t) − Wjk(t−1)]
• bk(t+1) = bk(t) + η·δk + α[bk(t) − bk(t−1)]
• Wij(t+1) = Wij(t) + η·δj·Xi + α[Wij(t) − Wij(t−1)]
• bj(t+1) = bj(t) + η·δj + α[bj(t) − bj(t−1)]
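These equations translate almost line-for-line into NumPy. Below is a minimal sketch of a single training step for a 3-4-2 network with sigmoid units (the sizes, seed, learning rate, momentum, pattern, and target are all illustrative assumptions; momentum on the biases is omitted for brevity):

import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

rng = np.random.default_rng(42)
Wij = rng.normal(scale=0.5, size=(3, 4)); bj = np.zeros(4)  # input -> hidden
Wjk = rng.normal(scale=0.5, size=(4, 2)); bk = np.zeros(2)  # hidden -> output
eta, alpha = 0.5, 0.9                        # learning rate and momentum
dWij_prev = np.zeros_like(Wij); dWjk_prev = np.zeros_like(Wjk)

X = np.array([0.1, 0.9, 0.3])                # one training pattern (made up)
d = np.array([1.0, 0.0])                     # its target

# feedforward phase
Y = sigmoid(bj + X @ Wij)
Z = sigmoid(bk + Y @ Wjk)

# backpropagation of errors
delta_k = Z * (1 - Z) * (d - Z)
delta_j = Y * (1 - Y) * (Wjk @ delta_k)

# weight updating with momentum
dWjk = eta * np.outer(Y, delta_k) + alpha * dWjk_prev
dWij = eta * np.outer(X, delta_j) + alpha * dWij_prev
Wjk += dWjk; Wij += dWij
bk += eta * delta_k; bj += eta * delta_j
dWjk_prev, dWij_prev = dWjk, dWij            # remembered for the next step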


Faculty Video Links, YouTube & NPTEL Video Links and Online Courses Details

• 1. https://nptel.ac.in/courses/117/105/117105084/
• 2. https://nptel.ac.in/courses/106/106/106106184/
• 3. https://nptel.ac.in/courses/108/105/108105103/
• 4. https://www.youtube.com/watch?v=DKSZHN7jftI&list=PLZoTAELRMXVPGU70ZGsckrMdr0FteeRUi
• 5. https://www.youtube.com/watch?v=aPfkYu_qiF4&list=PLyqSpQzTE6M9gCgajvQbc68Hk_JKGBAYT


Quiz

1. Which method for computing was started in the early days?
(a) Machine learning (b) Artificial intelligence (c) Deep learning (d) none of these

2. In which method does efficiency grow higher with more data in the data set?
(a) Machine learning (b) Artificial intelligence (c) Deep learning (d) none of these

3. What is TensorFlow?
(a) all the mathematics in the form of a flow chart (b) an artificial intelligence algorithm (c) a deep learning algorithm (d) none of these


Quiz

4. What are the benefits of TensorFlow over other libraries?
(a) Scalability (b) Visualization of data (c) Pipelining (d) all of these

5. What do you mean by pipelining?
(a) The whole work is done at a time (b) the whole work is divided into small segments and then executed in a parallel manner (c) copying work from another processor (d) none of these


Quiz

6. What is API?
(a) A programming interface (b) After programming interface (c) Application Programming Interface (d) none of these

7. What is the main operation in TensorFlow?
(a) Computing (b) calculation (c) pipelining (d) passing values and assigning the output to another tensor

8. TensorFlow is the product of which company?
(a) Google research team (b) Amazon technical team (c) PayPal (d) none of these

9. What is the execution speed of a brain neuron?
(a) … (b) … (c) … (d) none of these


Weekly Assignment

Q1. For which purpose is a Convolutional Neural Network used?
Q2. What is the biggest advantage of utilizing CNN?
Q3. Discuss the history of deep learning.
Q4. What is the difference between Neural Networks and Deep Learning?
Q5. How can a neural network learn by itself?
Q6. Explain the concept of ANN with the help of an example.
Q7. Define the term Gradient Descent. Also discuss its importance.
Q8. Explain the Perceptron Convergence Theorem.
Q9. Define the term Bias.
Q10. Why is the ReLU function required?


MCQ

Q1. Which neural network has only one hidden layer between the input and output?
A. Shallow neural network
B. Deep neural network
C. Feed-forward neural networks
D. Recurrent neural networks

Q2. Which of the following is/are limitations of deep learning?
A. Data labeling
B. Obtaining huge training datasets
C. Both A and B
D. None of the above


MCQ

Q3. Deep learning algorithms are _______ more accurate than machine learning algorithms in image classification.
A. 33%
B. 37%
C. 40%
D. 41%

Q4. Which of the following functions can be used as an activation function in the output layer if we wish to predict the probabilities of n classes (p1, p2, …, pk) such that the sum of p over all n classes equals 1?
A. Softmax
B. ReLu
C. Sigmoid
D. Tanh


MCQ

Q5. Which of the following would have a constant input in each epoch of training a Deep Learning model?
A. Weight between input and hidden layer
B. Weight between hidden and output layer
C. Biases of all hidden layer neurons
D. Activation function of output layer

Q6. If in the training phase we do not obtain the accurate output, then which value does the neural network change to get the accurate output?
(a) bias (b) perceptron (c) weight (d) all values can change

Q7. What are the benefits of using a graph in TensorFlow?
(a) parallelism (b) high execution speed (c) less complexity (d) all of these

Q8. Between a CPU and a GPU, which has the higher execution speed?
(a) GPU (b) CPU (c) Both have the same speed (d) cannot be distinguished


Old Question Papers

(The old question papers appear as images in the original slides.)


Expected Questions for University Exam

Q1. Define Batch Normalization. Why does Batch Normalization help in faster convergence?
Q2. Define Deep Learning. Also discuss its importance.
Q3. Discuss the history of deep learning.
Q4. What is the difference between Neural Networks and Deep Learning?
Q5. How can a neural network learn by itself?
Q6. Explain the concept of ANN with the help of an example.
Q7. Define the term Gradient Descent. Also discuss its importance.
Q8. Explain the Perceptron Convergence Theorem.
Q9. Define the term Bias.
Q10. Why is the ReLU function required?


Summary

 Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.
 If you are just starting out in the field of deep learning, or you had some experience with neural networks some time ago, you may be confused: many practitioners who learned and used neural networks in the 1990s and early 2000s were initially confused by how the field has been redefined.


References

 1. https://www.slideshare.net/lablogga/deep-learning-explained
 2. Qin, T. (2020). Deep Learning Basics. In Dual Learning (pp. 25-46). Springer, Singapore.
 3. http://people.uncw.edu/chenc/STT592_Deep%20Learning/STT592DeepLearning_Index.html
 4. Gulli, Antonio, and Sujit Pal. Deep Learning with Keras. Packt Publishing Ltd, 2017.

Thank You