NNDL


Mod 1

Introduction to Neural Networks: ANN, biological neural network, McCulloch-Pitts neuron, Perceptron: architecture, algorithm, perceptron learning rule convergence theorem
Feedforward Networks: Multilayer Perceptron, Gradient Descent, Backpropagation

https://towardsdatascience.com/learning-process-of-a-deep-neural-network-5a9768d7a651

https://towardsdatascience.com/basic-concepts-of-neural-networks-1a18a7aa2bd2

Neural networks working

● A neural network is made up of neurons connected to each other. Each connection carries a weight that is multiplied by the input value and so dictates how much that input matters to the receiving neuron.

● Each neuron has an activation function that defines the output of the neuron. The activation function introduces non-linearity into the modeling capabilities of the network. Several common activation functions are described below.

● The first phase is forward propagation: the network is exposed to the training data, which passes through the entire network so that predictions (labels) can be calculated.
● Next, a loss function estimates the loss (or error), measuring how good or bad the prediction was relative to the correct result.
● Then, using backpropagation, the loss is propagated backwards. Each neuron in a hidden layer receives only a fraction of the total loss signal, based on its relative contribution to the original output. This is repeated layer by layer until every neuron has received a loss signal describing its relative contribution to the total loss.
● Once this information has been propagated back, the weights of the connections between neurons are adjusted so that the loss is as close as possible to zero the next time the network makes a prediction. The technique used for this is gradient descent.
● The hyperparameters that control training include the number of epochs, the batch size, and the learning rate.
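
The four phases above (forward propagation, loss, backpropagation, weight update) can be sketched for a single sigmoid neuron. This is only an illustrative sketch in plain Python: the OR dataset, the squared-error loss, the learning rate of 0.5 and the 2000 epochs are all assumptions chosen for the example, not values from these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset (assumed for illustration): the logical OR function.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights = [0.0, 0.0]
bias = 0.0
learning_rate = 0.5   # assumed hyperparameter
epochs = 2000         # assumed hyperparameter

def mean_loss():
    """Mean squared error of the current weights over the whole dataset."""
    total = 0.0
    for x, target in data:
        pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total += (pred - target) ** 2
    return total / len(data)

initial_loss = mean_loss()

for _ in range(epochs):
    for x, target in data:  # batch size of 1
        # 1. Forward propagation: compute the prediction.
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        pred = sigmoid(z)
        # 2. Loss: squared error (pred - target) ** 2.
        # 3. Backpropagation: the chain rule gives d(loss)/dz.
        dloss_dz = 2.0 * (pred - target) * pred * (1.0 - pred)
        # 4. Gradient descent: move each weight against its gradient.
        for i in range(len(weights)):
            weights[i] -= learning_rate * dloss_dz * x[i]
        bias -= learning_rate * dloss_dz

final_loss = mean_loss()
print(initial_loss, final_loss)
```

With more layers, step 3 repeats the same chain-rule calculation layer by layer, which is where the "fraction of the loss signal" per hidden neuron comes from.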
Feedforward NN

● Neurons in one layer are connected to neurons in the next layer and in the previous layer.

● The feedforward neural network was the first and simplest type of artificial neural network devised.
● In this network, information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles or loops in the network.
● The more layers a network has, the deeper and more complex it is.
Gradient Descent

Once the loss has been propagated back, the weights of the connections between neurons can be adjusted so that the loss is as close as possible to zero the next time the network makes a prediction. The technique used for this is gradient descent. It changes the weights in small increments using the derivative (or gradient) of the loss function, which shows in which direction “to descend” towards a minimum of the loss (ideally the global minimum). The updates are generally computed on batches of data, and the whole dataset is passed through the network repeatedly in successive iterations (epochs).
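
As a minimal sketch of the update rule described above, the snippet below runs gradient descent on a one-parameter loss. The loss function (w - 3)^2, the starting point, and the learning rate are assumptions picked so the behaviour is easy to follow; a real network has many weights and a loss computed from data.

```python
def loss(w):
    # Assumed toy loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the toy loss with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0               # assumed starting weight
learning_rate = 0.1   # assumed step size
history = [loss(w)]

for step in range(100):
    # Move the weight a small step against the gradient.
    w -= learning_rate * gradient(w)
    history.append(loss(w))

print(w, history[-1])
```

Each step shrinks the distance to the minimum by a constant factor here, so the loss in `history` decreases at every iteration.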
Activation function

● Activation functions are used to propagate the output of a neuron forward.

● This output is received by the neurons of the next layer to which this neuron is connected (up to and including the output layer).

● The activation function serves to introduce non-linearity in the modeling capabilities of the network.


https://www.geeksforgeeks.org/activation-functions/

Step

The step function is one of the simplest kinds of activation functions. We choose a threshold value, and if the net input, say y, is greater than the threshold, then the neuron is activated.

Mathematically, with threshold $\theta$:

$$f(y) = \begin{cases} 1 & \text{if } y > \theta \\ 0 & \text{otherwise} \end{cases}$$
Sigmoid

● The sigmoid function is defined as $\sigma(x) = 1 / (1 + e^{-x})$.

● It is useful because it reduces the effect of extreme or atypical values in valid data without eliminating them: it converts independent variables of almost unbounded range into values between 0 and 1 that can be read as probabilities.

● For inputs of large magnitude, its output is very close to the extremes of 0 or 1.

● It is a smooth function and is continuously differentiable.

● Its main advantage over a linear function is that it is non-linear, and its main advantage over the step function is that it is differentiable everywhere.

● This means that when multiple neurons use the sigmoid as their activation function, the output of the network is non-linear as well.

● The function ranges from 0 to 1 and has an S shape.



Softmax
The softmax activation function generalizes logistic regression: instead of a binary classification, it can handle multiple decision boundaries. The softmax activation function is often found in the output layer of a neural network, where it returns a probability distribution over mutually exclusive output classes.

ReLU

● The rectified linear unit (ReLU) activation function is a transformation that activates a node only if the input is above a certain threshold.

● In the default and most common form, the output is zero as long as the input is below zero, and when the input rises above zero the output is linear in the input, f(x) = x. In short, f(x) = max(0, x).

● The ReLU activation function has proven to work in many different situations and is currently widely used.

● The main advantage of ReLU over other activation functions is that it does not activate all the neurons at the same time.

● This means that if the input to a neuron is negative, ReLU converts it to zero and the neuron is not activated.
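
The four activation functions described above can be written in a few lines of plain Python. This is a sketch for intuition only: the default threshold of 0 for the step function is an assumption, and frameworks such as Keras ship their own vectorized versions.

```python
import math

def step(y, threshold=0.0):
    # Fires (1) only when the net input exceeds the threshold.
    return 1.0 if y > threshold else 0.0

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return max(0.0, x)

def softmax(scores):
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(step(0.5), sigmoid(0.0), relu(-2.0), softmax([1.0, 2.0, 3.0]))
```

Note that step, sigmoid and ReLU act on one value at a time, while softmax needs the whole vector of class scores because its outputs must sum to 1.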

McCulloch-Pitts Neuron
https://towardsdatascience.com/mcculloch-pitts-model-5fdf65ac5dd1

The first computational model of a neuron was proposed by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943.

What is a Perceptron?

The simplest type of perceptron has a single layer of weights connecting the inputs and output.

Formally, the perceptron is defined by

$$y = \operatorname{sign}\left(\sum_{i=1}^{N} w_i x_i - \theta\right) \quad \text{or} \quad y = \operatorname{sign}(w^T x - \theta) \qquad (1)$$

where $w$ is the weight vector and $\theta$ is the threshold. Unless otherwise stated, we will ignore the threshold in the analysis of the perceptron (and other topics), because we can instead view the threshold as an additional synaptic weight, which is given the constant input 1. This is because $w^T x - \theta = [w^T, -\theta] \, [x^T, 1]^T$.

● The Perceptron is sometimes referred to as a threshold logic unit (TLU), since it discriminates the data depending on whether the sum is greater than the threshold value, $\sum_{i=1}^{d} w_i x_i > -w_0$, or less than the threshold value, $\sum_{i=1}^{d} w_i x_i < -w_0$.
● In the above formulation the threshold is $-w_0$, where $w_0$ is imagined as the weight of an additional connection whose input is held constantly at $x_0 = 1$.
● The Perceptron is strictly equivalent to a linear discriminant, and it is often used as a device that decides whether an input pattern belongs to one of two classes.
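
The equivalence between an explicit threshold and an extra weight on a constant input can be checked with a short sketch. The weights and threshold below are hypothetical values that happen to implement a logical AND, and returning -1 when the sum is exactly zero is a convention assumed for this example.

```python
def sign(value):
    # Assumed convention: a sum of exactly zero counts as the negative class.
    return 1 if value > 0 else -1

def perceptron_output(w, x, theta):
    # Explicit threshold: y = sign(w . x - theta)
    return sign(sum(wi * xi for wi, xi in zip(w, x)) - theta)

def perceptron_output_augmented(w, x, theta):
    # Threshold folded in as weight w0 = -theta on constant input x0 = 1.
    w_aug = [-theta] + list(w)
    x_aug = [1.0] + list(x)
    return sign(sum(wi * xi for wi, xi in zip(w_aug, x_aug)))

w = [1.0, 1.0]   # hypothetical weights
theta = 1.5      # hypothetical threshold; together they compute AND
inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

for x in inputs:
    print(x, perceptron_output(w, x, theta), perceptron_output_augmented(w, x, theta))
```

Both functions return the same class for every input, which is why the threshold can be dropped from the analysis once the input vector is augmented.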
https://www.simplilearn.com/tutorials/deep-learning-tutorial/perceptron
Perceptron

The Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a Perceptron learning rule based on the original MCP (McCulloch-Pitts) neuron. A Perceptron is an algorithm for supervised learning of binary classifiers. This algorithm enables neurons to learn and to process elements in the training set one at a time.

There are two types of Perceptrons: single layer and multilayer.

● Single layer: single layer perceptrons can learn only linearly separable patterns.
● Multilayer: multilayer perceptrons, or feedforward neural networks with two or more layers, have greater processing power.

The Perceptron algorithm learns the weights for the input signals in order to draw a linear decision boundary.

This enables you to distinguish between the two linearly separable classes +1 and -1.
Perceptron learning rule convergence theorem

The Perceptron learning rule states that the algorithm automatically learns the optimal weight coefficients. The input features are then multiplied by these weights to determine whether a neuron fires or not. The convergence theorem adds that, if the two classes are linearly separable, the learning rule finds a separating set of weights after a finite number of updates.

The Perceptron receives multiple input signals. If the weighted sum of the input signals exceeds a certain threshold, it outputs a signal; otherwise it does not return an output. In the context of supervised learning and classification, this can then be used to predict the class of a sample.
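
A minimal sketch of the perceptron learning rule on a linearly separable problem is given here. The AND dataset with labels +1 and -1, the learning rate of 1.0, and the epoch limit are assumptions for illustration; the function names are hypothetical.

```python
def predict(weights, x):
    # weights[0] is the bias weight w0, paired with the constant input x0 = 1.
    activation = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if activation > 0 else -1

def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    weights = [0.0] * (len(samples[0][0]) + 1)
    for epoch in range(max_epochs):
        errors = 0
        for x, label in samples:  # one training example at a time
            if predict(weights, x) != label:
                # Misclassified: nudge the weights towards the correct label.
                weights[0] += learning_rate * label
                for i, xi in enumerate(x):
                    weights[i + 1] += learning_rate * label * xi
                errors += 1
        if errors == 0:
            # A full pass without mistakes: the rule has converged.
            return weights, epoch + 1
    return weights, max_epochs

# Logical AND, which is linearly separable.
and_samples = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
weights, epochs_used = train_perceptron(and_samples)
print(weights, epochs_used)
```

Because AND is linearly separable, the loop stops after a finite number of passes, as the convergence theorem predicts. On a problem such as XOR the same loop would simply run until `max_epochs`.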
https://en.wikipedia.org/wiki/Feedforward_neural_network
Gradient Descent

Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning (ML) and deep learning (DL) to minimise a cost/loss function (e.g. in a linear regression).

Mod 2

CNN, RNN, Backpropagation through time, Bidirectional RNN, LSTM, GRU, Bidirectional LSTM

CNN
- Sparse Interaction
  - In a regular fully connected network, every unit in one layer is connected to every unit in the next, which is often unnecessary
  - In a CNN, each unit is connected only to a small local region of the previous layer
- Parameter Sharing
- Equivariant Representation
Layers in CNN

Convolution Layer

Activation Layer

Pooling layer
- Downsamples the feature maps, which reduces the number of values passed to later layers
- Stride: how many positions the pooling window moves at each step
- Depth remains the same after pooling

Fully connected layer
- Applies a hidden layer followed by a softmax activation
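
The convolution and pooling layers can be sketched in plain Python to show sparse interaction (each output depends only on a small window) and parameter sharing (the same kernel is reused at every position). The 5x5 image and the 2x2 kernel are made-up values, and, as in most deep learning libraries, the "convolution" here is implemented without flipping the kernel.

```python
def conv2d(image, kernel, stride=1):
    """Slide one shared kernel over the image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Each output value only looks at a kh x kw window of the input.
            for a in range(kh):
                for b in range(kw):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output

def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each window to downsample the map."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

image = [
    [1, 0, 2, 1, 0],
    [0, 1, 3, 0, 1],
    [2, 1, 0, 1, 2],
    [1, 0, 1, 3, 0],
    [0, 2, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]  # hypothetical 2x2 filter

feature_map = conv2d(image, kernel)   # 4x4 feature map
pooled = max_pool2d(feature_map)      # 2x2 after pooling
print(feature_map)
print(pooled)
```

The kernel has only four parameters regardless of the image size, whereas a fully connected layer between a 5x5 input and a 4x4 output would need 400 weights.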
RNN
● Backpropagation has to be carried out through every time step t, so training takes a long time and is computationally expensive.


https://builtin.com/data-science/recurrent-neural-networks-and-lstm
BPTT is basically just a fancy buzz word for doing backpropagation on an unrolled
RNN. Unrolling is a visualization and conceptual tool, which helps you understand
what’s going on within the network. Most of the time when implementing a
recurrent neural network in the common programming frameworks,
backpropagation is automatically taken care of, but you need to understand how it
works to troubleshoot problems that may arise during the development process.

You can view a RNN as a sequence of neural networks that you train one after
another with backpropagation.
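
The idea of unrolling can be sketched with a scalar RNN: the same weights are applied at every time step, and the hidden state carries information forward. The weights and the input sequence are hypothetical values, and only the forward pass is shown; BPTT would then apply backpropagation through this unrolled loop.

```python
import math

def rnn_forward(inputs, w_x, w_h, bias, h0=0.0):
    """Unrolled forward pass of a one-unit RNN with tanh activation."""
    h = h0
    hidden_states = []
    for x_t in inputs:
        # One "copy" of the network per time step, all sharing
        # the same weights w_x, w_h and bias.
        h = math.tanh(w_x * x_t + w_h * h + bias)
        hidden_states.append(h)
    return hidden_states

# Hypothetical weights; the input is non-zero only at the first step.
states = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, bias=0.0)
print(states)
```

The later hidden states are non-zero even though the later inputs are zero, which is the internal memory of the RNN at work. It also shows why gradients must flow back through every earlier step.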

WHAT IS BACKPROPAGATION?

Backpropagation (BP or backprop, for short) is known as a workhorse algorithm in machine learning. Backpropagation is used for calculating the gradient of an error function with respect to a neural network’s weights. The algorithm works its way backwards through the layers, using the chain rule to find the partial derivatives of the error with respect to the weights. These gradients are then used to update the weights so that the error decreases during training.

Bidirectional RNN
In sequence learning, so far we assumed that our goal is to model the next output given
what we have seen so far, e.g., in the context of a time series or in the context of a
language model. While this is a typical scenario, it is not the only one we might
encounter. To illustrate the issue, consider the following three tasks of filling in the blank
in a text sequence:

● I am ___.
● I am ___ hungry.
● I am ___ hungry, and I can eat half a pig.

Depending on the amount of information available, we might fill in the blanks with very different words such as “happy”, “not”, and “very”. Clearly the end of the phrase (if available) conveys significant information about which word to pick. A sequence model that is incapable of taking advantage of this will perform poorly on related tasks. For instance, to do well in named entity recognition (e.g., to recognize whether “Green” refers to “Mr. Green” or to the color), longer-range context is equally vital. A bidirectional RNN addresses this by processing the sequence in both the forward and the backward direction and combining the two hidden states at each time step.

https://d2l.ai/chapter_recurrent-modern/bi-rnn.html

LSTM
Long short-term memory networks (LSTMs) are an extension of recurrent neural networks that extends their memory. They are therefore well suited to learning from important experiences that have very long time lags in between.

LSTMs enable RNNs to remember inputs over a long period of time. This is because LSTMs contain information in a memory, much like the memory of a computer. The LSTM can read, write and delete information from its memory.

This memory can be seen as a gated cell, with gated meaning the cell decides whether or not to store or delete information (i.e., if it opens the gates or not), based on the importance it assigns to the information. The assigning of importance happens through weights, which are also learned by the algorithm. This simply means that it learns over time what information is important and what is not.

In an LSTM you have three gates: input, forget and output gate. These gates determine whether or not to let new input in (input gate), delete the information because it isn’t important (forget gate), or let it impact the output at the current timestep (output gate).

A recurrent neural network is a type of ANN that is used to make predictions on sequential or time-series data. These layers are commonly used for ordinal or temporal problems such as natural language processing, neural machine translation and automated image captioning. Modern voice assistants such as Google Assistant, Alexa and Siri incorporate these layers.

https://analyticsindiamag.com/lstm-vs-gru-in-recurrent-neural-network-a-comparative-study/
GRU

Bidirectional LSTM

Mod 3
TensorFlow: Introduction, tensor, tensor properties, basic tensor methods

● The inputs, outputs, and transformations within neural networks are all
represented using tensors, and as a result, neural network programming utilizes
tensors heavily.
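
Two basic tensor properties are rank (the number of dimensions) and shape (the size along each dimension). The sketch below illustrates them with nested Python lists. It is plain Python and not the TensorFlow API; the helper names `shape` and `rank` are hypothetical, and the nested lists are assumed to be rectangular.

```python
def shape(tensor):
    """Size along each dimension of a rectangular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]
    return tuple(dims)

def rank(tensor):
    """Number of dimensions: 0 for a scalar, 1 for a vector, 2 for a matrix."""
    return len(shape(tensor))

scalar = 5.0
vector = [1.0, 2.0, 3.0]
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
batch_of_images = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]  # 2 images of 4x4

for t in (scalar, vector, matrix, batch_of_images):
    print(rank(t), shape(t))
```

A batch of inputs to a network is therefore just a tensor with one extra leading dimension for the batch size.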




CNN in Tensorflow

Applying convolution in Tensorflow

Steps involved in ML model


https://analyticsindiamag.com/the-7-key-steps-to-build-your-machine-learning-model/
1. Collect Data
2. Prepare the data
3. Choose the model
4. Train your deep learning model
5. Evaluation of metrics
6. Parameter Tuning
7. Prediction or Inference
RNN using tensorflow

https://www.tensorflow.org/guide/keras/rnn

LSTM using tensorflow


https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM

https://towardsdatascience.com/lstm-by-example-using-tensorflow-feb0c1968537

Mod 4
Out-of-vocabulary (OOV) terms are words that are not part of the normal lexicon (vocabulary) of a natural language processing system. In speech recognition, it is the audio signal that contains these terms. Word vectors are numerical representations of word meaning.
Question paper

Mod 1

Q1. (a)
Neural networks working

● A neural network is made up of neurons connected to each other. Each connection carries a weight that is multiplied by the input value and so dictates how much that input matters to the receiving neuron.

● Each neuron has an activation function that defines the output of the neuron. The activation function introduces non-linearity into the modeling capabilities of the network. Several common activation functions are described below.

● The first phase is forward propagation: the network is exposed to the training data, which passes through the entire network so that predictions (labels) can be calculated.
● Next, a loss function estimates the loss (or error), measuring how good or bad the prediction was relative to the correct result.
● Then, using backpropagation, the loss is propagated backwards. Each neuron in a hidden layer receives only a fraction of the total loss signal, based on its relative contribution to the original output. This is repeated layer by layer until every neuron has received a loss signal describing its relative contribution to the total loss.
● Once this information has been propagated back, the weights of the connections between neurons are adjusted so that the loss is as close as possible to zero the next time the network makes a prediction. The technique used for this is gradient descent.
● The hyperparameters that control training include the number of epochs, the batch size, and the learning rate.

Q1 (b)

Activation functions

Activation functions are used to propagate the output of a neuron forward. This output is received by the neurons of the next layer to which this neuron is connected (up to and including the output layer). The activation function serves to introduce non-linearity in the modeling capabilities of the network. The most commonly used ones are listed below; all of them can be used in a Keras layer (more information is available on the Keras website).


Step

The step function is one of the simplest kinds of activation functions. We choose a threshold value, and if the net input, say y, is greater than the threshold, then the neuron is activated.

Mathematically, with threshold $\theta$:

$$f(y) = \begin{cases} 1 & \text{if } y > \theta \\ 0 & \text{otherwise} \end{cases}$$
Sigmoid

The sigmoid function, $\sigma(x) = 1 / (1 + e^{-x})$, is useful because it reduces the effect of extreme or atypical values in valid data without eliminating them: it converts independent variables of almost unbounded range into values between 0 and 1 that can be read as probabilities. For inputs of large magnitude, its output is very close to the extremes of 0 or 1.


Softmax
The softmax activation function generalizes logistic regression: instead of a binary classification, it can handle multiple decision boundaries. The softmax activation function is often found in the output layer of a neural network, where it returns a probability distribution over mutually exclusive output classes.

ReLU

The rectified linear unit (ReLU) activation function is a transformation that activates a node only if the input is above a certain threshold. In the default and most common form, the output is zero as long as the input is below zero, and when the input rises above zero the output is linear in the input, f(x) = x. In short, f(x) = max(0, x). The ReLU activation function has proven to work in many different situations and is currently widely used.

Mod 2

Q1(a)

CNN
- Sparse Interaction
  - In a regular fully connected network, every unit in one layer is connected to every unit in the next, which is often unnecessary
  - In a CNN, each unit is connected only to a small local region of the previous layer
- Parameter Sharing
- Equivariant Representation
Layers in CNN

Convolution Layer

Activation Layer

Pooling layer
- Downsamples the feature maps, which reduces the number of values passed to later layers
- Stride: how many positions the pooling window moves at each step
- Depth remains the same after pooling

Fully connected layer
- Applies a hidden layer followed by a softmax activation
Q1(b)
RNN
RNNs are a powerful and robust type of neural network, and are among the most promising algorithms in use because they have an internal memory.

Because of their internal memory, RNNs can remember important things about the input they received, which allows them to be very precise in predicting what’s coming next. This is why they're the preferred algorithm for sequential data like time series, speech, text, financial data, audio, video, weather and much more. Recurrent neural networks can form a much deeper understanding of a sequence and its context compared to other algorithms.

To understand RNNs properly, you'll need a working knowledge of "normal" feed-forward neural networks and sequential data.

Sequential data is basically just ordered data in which related things follow each other. Examples are financial data or the DNA sequence. The most popular type of sequential data is perhaps time series data, which is just a series of data points that are listed in time order.

https://builtin.com/data-science/recurrent-neural-networks-and-lstm


Q1(c)
LSTM vs GRU

Working of LSTM

Long Short-Term Memory, in short LSTM, is a special kind of RNN capable of learning long-term sequences. It was introduced by Hochreiter and Schmidhuber in 1997. It is explicitly designed to avoid the long-term dependency problem; remembering information over long sequences is its normal mode of operation.

The popularity of LSTM is due to the gating mechanism involved with each LSTM cell. In a normal RNN cell, the input at the current time step and the hidden state from the previous time step are passed through the activation layer to obtain a new state. In an LSTM the process is slightly more complex: at each time step the cell takes input from three different states, namely the current input, the short-term memory from the previous cell, and the long-term memory.

These cells use gates to decide which information is kept or discarded at each loop operation before passing the long-term and short-term information on to the next cell. We can imagine these gates as filters that remove unwanted and irrelevant information. An LSTM uses three gates in total: the input gate, the forget gate, and the output gate.

Input Gate

The input gate decides what information will be stored in long-term memory. It only works with the information from the current input and the short-term memory from the previous step. At this gate, information that is not useful is filtered out.

Forget Gate

The forget gate decides which information from long-term memory is kept or discarded. This is done by multiplying the incoming long-term memory by a forget vector generated from the current input and the incoming short-term memory.

Output Gate

The output gate takes the current input, the previous short-term memory and the newly computed long-term memory to produce the new short-term memory, which is passed on to the cell in the next time step. The output of the current time step can also be drawn from this hidden state.
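
The three gates can be sketched as one LSTM time step with scalar states. This follows the commonly used LSTM formulation rather than anything specific to these notes; the parameter layout (a dict mapping each gate to an input weight, a hidden weight and a bias) and the numeric values are hypothetical, chosen so that the forget gate is almost fully open and the input gate almost closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step. h is the short-term memory, c the long-term memory."""
    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x_t + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much old long-term memory to keep
    i = gate("input", sigmoid)         # how much new information to store
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much memory to expose as output

    c_t = f * c_prev + i * g           # updated long-term memory
    h_t = o * math.tanh(c_t)           # updated short-term memory
    return h_t, c_t

# Hypothetical parameters: (input weight, hidden weight, bias) per gate.
params = {
    "forget": (0.0, 0.0, 10.0),     # forget gate close to 1: keep old memory
    "input": (0.0, 0.0, -10.0),     # input gate close to 0: ignore new input
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}

h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.7, p=params)
print(h_t, c_t)
```

With these settings the long-term memory passes through the step almost unchanged, which is the mechanism that lets an LSTM carry information across long time lags.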

Working of GRU
The workflow of the Gated Recurrent Unit, in short GRU, is the same as that of the RNN; the difference is in the operations and gates associated with each GRU unit. To solve the problem faced by the standard RNN, GRU incorporates two gating mechanisms called the update gate and the reset gate.
Update gate
The update gate is responsible for determining the amount of previous information that needs to be passed along to the next state. This is really powerful because the model can decide to copy all the information from the past and eliminate the risk of vanishing gradient.

Reset gate
The reset gate is used by the model to decide how much of the past information to neglect; in short, it decides whether the previous cell state is important or not.

First, the reset gate comes into action: it stores relevant information from the past time step into the new memory content. Then the input vector and the hidden state are multiplied by their weights. Next, the element-wise product of the reset gate and the weighted previous hidden state is calculated. After summing up the results of the above steps, the non-linear activation function is applied and the next state is generated.
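
One GRU time step with a scalar hidden state can be sketched as follows. GRU formulations differ on whether the update gate multiplies the old state or the candidate; this sketch assumes the convention in which an update gate near 1 copies the previous state, matching the description of the update gate above. The parameter layout and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps each gate to (input weight, hidden weight, bias)."""
    def affine(name, h):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h + b

    z = sigmoid(affine("update", h_prev))   # how much of the past to keep
    r = sigmoid(affine("reset", h_prev))    # how much of the past to use
    # The reset gate scales the previous hidden state before it
    # enters the candidate (new memory content).
    candidate = math.tanh(affine("candidate", r * h_prev))
    # Assumed convention: z near 1 keeps the old state.
    return z * h_prev + (1.0 - z) * candidate

keep_params = {
    "update": (0.0, 0.0, 10.0),     # update gate close to 1
    "reset": (0.0, 0.0, 0.0),       # reset gate equal to 0.5
    "candidate": (1.0, 1.0, 0.0),
}
overwrite_params = dict(keep_params, update=(0.0, 0.0, -10.0))  # update gate close to 0

h_keep = gru_step(1.0, 0.5, keep_params)
h_new = gru_step(1.0, 0.5, overwrite_params)
print(h_keep, h_new)
```

With the update gate near 1 the previous state is copied almost exactly, and with it near 0 the state is replaced by the candidate. Unlike the LSTM sketch of a cell, there is only one state and no output gate.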
LSTM Vs GRU
We have now seen how both layers operate to combat the vanishing gradient problem. So you might wonder which one to use. As GRU is relatively new, its tradeoffs have not been fully explored yet.

According to empirical evaluation, there is not a clear winner. The basic idea of using a gating mechanism to learn long-term dependencies is the same as in LSTM.

The main points of difference are as follows:

● The GRU has two gates; the LSTM has three gates.

● The GRU does not have a separate internal memory (cell state), and it does not have the output gate that is present in the LSTM.

● In the GRU, the LSTM's input gate and forget gate are coupled into a single update gate, and the reset gate is applied directly to the previous hidden state. In the LSTM, the responsibility of the reset gate is taken by two gates, i.e., the input and forget gates.
Mod 3

Q1.

Q2.
Q3.

https://towardsdatascience.com/learning-process-of-a-deep-neural-network-5a9768d7a651

Deep learning is a type of machine learning and artificial intelligence (AI) that imitates
the way humans gain certain types of knowledge. Deep learning is an important
element of data science, which includes statistics and predictive modeling. It is
extremely beneficial to data scientists who are tasked with collecting, analyzing and
interpreting large amounts of data; deep learning makes this process faster and easier.

At its simplest, deep learning can be thought of as a way to automate predictive analytics. While many traditional machine learning algorithms are linear, deep learning algorithms are stacked in a hierarchy of increasing complexity and abstraction.
https://analyticsindiamag.com/the-7-key-steps-to-build-your-machine-learning-model/
1. Collect Data
2. Prepare the data
3. Choose the model
4. Train your deep learning model
5. Evaluation of metrics
6. Parameter Tuning
7. Prediction or Inference
Q4.

Long short-term memory is a modified RNN architecture that addresses the problem of
training over long sequences and retaining memory.

LSTM is best suited for sequence data. LSTM can predict, classify, and generate
sequence data.

An example of a sequence is a video, which can be considered as a sequence of


images or a sequence of audio clips.

Prediction based on the sequence of data is called the sequence prediction. Sequence
prediction is said to have four types.

● Sequence numeric prediction

● Sequence classification


● Sequence generation


● Sequence-to-sequence prediction


Mod 4

Q1.

Word embeddings are capable of capturing semantic and syntactic relationships between words, as well as the context of words in a document. Word2vec is one technique for implementing word embeddings.

Every word in a sentence depends on another word or on other words. To find similarities and relations between words, we have to capture these word dependencies.
Q2.

When there are a small number of training examples, the model sometimes learns from noise or unwanted details in the training examples, to an extent that it negatively impacts the performance of the model on new examples. This phenomenon is known as overfitting. It means that the model will have a difficult time generalizing to a new dataset.

Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.

There are multiple ways to fight overfitting in the training process. Two common ones are data augmentation and adding Dropout to the model.
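
Dropout can be sketched in plain Python to show what it does to a layer's activations. This uses the "inverted dropout" variant, in which the kept activations are scaled up during training so that nothing needs to change at inference time; the function name, the rate of 0.5 and the fixed random seed are assumptions for the example.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Randomly zero a fraction `rate` of activations during training."""
    if not training or rate == 0.0:
        # At inference time every neuron is kept.
        return list(activations)
    keep = 1.0 - rate
    # Kept activations are scaled by 1 / keep so the expected
    # sum of the layer stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is repeatable
layer_output = [1.0] * 10

dropped = dropout(layer_output, rate=0.5, rng=rng)
unchanged = dropout(layer_output, rate=0.5, training=False)
print(dropped)
print(unchanged)
```

Because a different random subset of neurons is silenced at every training step, the network cannot rely on any single neuron, which limits how much noise it can memorise.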
Q3 .
https://dataaspirant.com/word-embedding-techniques-nlp/#t-1597685144204

“Term frequency–inverse document frequency, is a numerical statistic that is intended


to reflect how important a word is to a document in a collection or corpus.”
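
A small worked sketch of TF-IDF follows. It uses one common textbook variant, tf = (count of term in document) / (document length) and idf = log(N / number of documents containing the term); library implementations often use smoothed or normalised variants, so their numbers will differ. The three documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

A word that appears in every document ("the") gets a score of zero, while a word unique to one document ("dog") scores highest, which is the sense in which TF-IDF reflects how important a word is to a document.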
Q4.

The Word2vec method learns these types of relationships between words while building a model. For this purpose word2vec uses two methods:

1. Skip-gram
2. CBOW (Continuous Bag of Words)

1. Skip-gram
In this method, the center word of the window is taken as the input and the context words (neighboring words) as the outputs. The model predicts the context words of a center word. Skip-gram works well with a small dataset and represents rare words well.

2. Continuous bag of words

CBOW is the reverse of the skip-gram method: the context words are taken as the input and the center word within the window is predicted. Another difference from skip-gram is that CBOW trains faster and gives better representations for more frequent words.
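
The difference between the two methods is easiest to see in the training pairs they generate. The sketch below only builds those (input, output) pairs from a sentence; it does not train any embeddings. The function names, the sentence and the window size of 1 are assumptions for illustration.

```python
def context_window(tokens, index, window):
    """Words within `window` positions of the word at `index`."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def skip_gram_pairs(tokens, window=1):
    # Skip-gram: input is the center word, output is one context word.
    return [
        (center, context)
        for i, center in enumerate(tokens)
        for context in context_window(tokens, i, window)
    ]

def cbow_pairs(tokens, window=1):
    # CBOW: input is all context words, output is the center word.
    return [
        (context_window(tokens, i, window), center)
        for i, center in enumerate(tokens)
    ]

tokens = "the quick brown fox".split()
print(skip_gram_pairs(tokens))
print(cbow_pairs(tokens))
```

Skip-gram produces one training pair per (center, context) combination, while CBOW produces one pair per center word, which is consistent with CBOW being the faster of the two to train.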
