NNDL
https://towardsdatascience.com/learning-process-of-a-deep-neural-network-5a9768d7a651
https://towardsdatascience.com/basic-concepts-of-neural-networks-1a18a7aa2bd2
● Each neuron has an activation function that defines the output of the neuron.
The activation function is used to introduce non-linearity in the modeling
capabilities of the network. Several options for activation functions are
presented in the Activation function section below.
● First phase is the propagation (forward) phase: it occurs when the network is exposed to the
training data, which crosses the entire neural network so that its predictions
(labels) can be calculated.
● Next we use a loss function to estimate the loss (or error), i.e., to compare and
measure how good or bad our prediction result was in relation to the correct
result.
● After this, backpropagation distributes the error: each neuron receives an error signal, but
the neurons of the hidden layers only receive a fraction of the total loss signal,
proportional to the relative contribution each neuron made to the original output.
This process is repeated, layer by layer, until all the neurons in the network have
received a loss signal that describes their relative contribution to the total loss.
● Now that we have propagated this information back, we can adjust the weights of the
connections between neurons, with the aim of making the loss as close as possible
to zero the next time we use the network for a prediction. For this, we use a
technique called gradient descent.
● For model parameterization we tune the epochs, batch size, and learning rate, as in
the sketch below.
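A minimal Keras sketch of where these parameters appear; the layer sizes, the dummy data, and the learning-rate value 0.001 are placeholders for illustration, not values from these notes:

```python
import numpy as np
import tensorflow as tf

# Dummy data just for illustration: 1000 samples, 20 features, binary labels.
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

# A small feedforward model (sizes are arbitrary for this sketch).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# The learning rate is set on the optimizer; the loss function measures the error.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# epochs = full passes over the dataset, batch_size = samples per weight update.
model.fit(X, y, epochs=10, batch_size=32)
```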
Feedforward NN
● Neurons in one layer are connected to neurons in the next layer and in the
previous layer.
● The feedforward neural network was the first and simplest type of
artificial neural network devised.
● In this network, the information moves in only one
direction—forward—from the input nodes, through the hidden nodes
(if any) and to the output nodes. There are no cycles or loops in the
network.
● More layers make the network deeper and able to model more complex functions; a minimal sketch of the forward pass follows.
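A minimal NumPy sketch of this one-directional (forward) flow through a network with one hidden layer; the sizes, weights, and input values are arbitrary placeholders:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary sizes: 3 inputs -> 4 hidden units -> 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

x = np.array([0.5, -1.2, 3.0])

# Information flows forward only: input -> hidden -> output, no cycles or loops.
h = relu(x @ W1 + b1)          # hidden layer activations
y_hat = sigmoid(h @ W2 + b2)   # output layer prediction
print(y_hat)
```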
Gradient Descent
Now that we have propagated this information back, we can adjust the weights of the connections
between neurons, with the aim of making the loss as close as possible to zero the next time
we use the network for a prediction. For this, we will use a technique called
gradient descent. This technique changes the weights in small increments with the help of the
calculation of the derivative (or gradient) of the loss function, which allows us to see in which
direction “to descend” towards the global minimum; in general this is done on batches of data
over the successive passes (epochs) of the whole dataset through the network.
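A minimal sketch of mini-batch gradient descent on a toy one-weight linear model; the data, learning rate, batch size, and epoch count are made-up placeholders:

```python
import numpy as np

# Toy data: y is roughly 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)

w = 0.0              # weight to learn
lr = 0.1             # learning rate (size of each step)
batch_size = 20

for epoch in range(50):                      # passes over the whole dataset
    idx = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        b = idx[start:start + batch_size]
        # Loss: mean squared error; its derivative w.r.t. w tells us
        # in which direction to "descend".
        grad = np.mean(2 * (w * x[b] - y[b]) * x[b])
        w -= lr * grad                       # small step against the gradient

print(w)   # should end up close to 3.0
```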
Activation function
● Each neuron produces an output that is passed forward; this output is received by
the neurons of the next layer to which the neuron is connected.
Step
● Consider a threshold value: if the value of the net input, say y, is greater than the
threshold, the neuron is activated; otherwise it is not.
● Mathematically, f(y) = 1 if y > threshold, and f(y) = 0 otherwise.
Sigmoid
● The sigmoid function squashes its input into the range (0, 1). This essentially means
that when we have multiple neurons with sigmoid activation in the output layer, their
outputs can be interpreted as probabilities for the output classes.
ReLU
● The ReLU activation function only activates a neuron if its input is above a certain
threshold.
● The default and more usual behavior is that, as long as the input has a value
below zero, the output will be zero; when the input rises above zero, the
output is a linear function of the input of the form f(x) = x.
● The main advantage of using the ReLU function over other activation
functions is that it does not activate all the neurons at the same time.
● What does this mean? If the input is negative, ReLU converts it to zero and
the neuron does not get activated. The sketch below implements all three
functions.
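A short NumPy sketch of the three activations described above; using 0 as the default step-function threshold is an assumption for illustration:

```python
import numpy as np

def step(y, threshold=0.0):
    # Fires (1) only when the net input exceeds the threshold.
    return np.where(y > threshold, 1.0, 0.0)

def sigmoid(y):
    # Squashes any input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-y))

def relu(y):
    # Zero for negative inputs, identity (f(x) = x) for positive ones.
    return np.maximum(0.0, y)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z), sigmoid(z), relu(z), sep="\n")
```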
McCulloch-Pitts Neuron
● The McCulloch-Pitts neuron is the earliest mathematical model of a neuron: it takes
binary inputs, sums them, and outputs 1 only if the sum reaches a fixed threshold
(inhibitory inputs can block firing).
https://towardsdatascience.com/mcculloch-pitts-model-5fdf65ac5dd1
Perceptron: what is it?
y = sign(wᵀx − θ)    (1)
The Perceptron algorithm learns the weights for the input signals in order
to draw a linear decision boundary.
This enables you to distinguish between the two linearly separable classes
+1 and -1.
Perceptron learning rule convergence theorem
The convergence theorem states that if the two classes are linearly separable, the
perceptron learning rule is guaranteed to find a separating weight vector in a
finite number of updates.
The Perceptron receives multiple input signals, and if the sum of the input
signals exceeds a certain threshold, it either outputs a signal or does not
return an output. In the context of supervised learning and classification,
this can then be used to predict the class of a sample.
Next, let us focus on the perceptron function; a minimal sketch of its learning rule follows.
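A minimal NumPy sketch of the perceptron learning rule on a toy linearly separable problem; the data, learning rate, and epoch count are placeholders, and the bias b plays the role of −θ in equation (1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data with labels in {+1, -1}.
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(20):
    for xi, yi in zip(X, y):
        # Prediction: y_hat = sign(w.x + b)
        y_hat = 1 if (w @ xi + b) >= 0 else -1
        if y_hat != yi:
            # Update only on mistakes: nudge the boundary toward the sample.
            w += lr * yi * xi
            b += lr * yi

print(w, b)   # a separating linear decision boundary
```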
https://en.wikipedia.org/wiki/Feedforward_neural_network
Mod 2
CNN
- Sparse Interaction
  - In a fully connected NN, every neuron in one layer is connected to every neuron
    in the next layer, but we do not actually need all of those connections.
  - In a CNN, each output unit is connected only to a small local region of the
    input, so only the connections that are actually needed are present.
- Parameter Sharing
  - The same kernel (set of weights) is reused at every position of the input, which
    greatly reduces the number of parameters to learn.
- Equivariant Representation
  - Because the same kernel is applied everywhere, shifting the input shifts the
    resulting feature map in the same way (translation equivariance).
Layers in CNN
Convolution Layer
- A set of learnable filters (kernels) slides over the input and computes dot products
  with local patches of the input.
- Each filter produces a feature map; the number of filters determines the depth of
  the output volume.
- Hyperparameters: filter size, number of filters, stride, and padding.
Activation Layer
- Applies a non-linear function (typically ReLU) element-wise to the feature maps.
Pooling layer
- Downsamples the feature maps, reducing the number of parameters and computation.
- Stride: how many positions the pooling window moves, i.e., how many values are skipped.
- Depth remains the same after pooling; only the spatial dimensions shrink.
- Common choices are max pooling and average pooling (see the sketch after this list).
Fully connected layer
- Flattens the final feature maps and connects every unit to every unit of the next
  layer; usually used at the end of the network for classification.
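A tiny NumPy sketch of a single-channel convolution and 2×2 max pooling, to make sparse interaction and parameter sharing concrete; the kernel values and input are made up:

```python
import numpy as np

def conv2d(image, kernel):
    # The same small kernel (parameter sharing) is slid over every position,
    # and each output value depends only on a local patch (sparse interaction).
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(fmap):
    # Downsamples by taking the max of each non-overlapping 2x2 window (stride 2).
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # arbitrary 2x2 filter
feature_map = conv2d(image, kernel)            # shape (5, 5)
pooled = max_pool2x2(feature_map)              # shape (2, 2)
print(feature_map.shape, pooled.shape)
```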
https://builtin.com/data-science/recurrent-neural-networks-and-lstm
BPTT is basically just a fancy buzz word for doing backpropagation on an unrolled
RNN. Unrolling is a visualization and conceptual tool, which helps you understand
what’s going on within the network. Most of the time when implementing a
recurrent neural network in the common programming frameworks,
backpropagation is automatically taken care of, but you need to understand how it
works to troubleshoot problems that may arise during the development process.
You can view an RNN as a sequence of neural networks that you train one after
another with backpropagation, as in the sketch below.
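A minimal NumPy sketch of a simple RNN unrolled over time: the same weights are reused at every step, which is exactly the "sequence of networks" view that BPTT differentiates through. All sizes and values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes: 3 input features, 5 hidden units, sequence length 4.
Wx = rng.normal(size=(3, 5))   # input-to-hidden weights (shared across steps)
Wh = rng.normal(size=(5, 5))   # hidden-to-hidden weights (shared across steps)
b = np.zeros(5)

xs = rng.normal(size=(4, 3))   # one input vector per time step
h = np.zeros(5)                # initial hidden state

hidden_states = []
for x_t in xs:                 # "unrolling": one copy of the cell per time step
    h = np.tanh(x_t @ Wx + h @ Wh + b)
    hidden_states.append(h)

print(len(hidden_states), hidden_states[-1].shape)
```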
What is backpropagation?
Backpropagation is the algorithm that computes how much each weight contributed to
the loss by applying the chain rule backwards from the output layer to the input
layer; gradient descent then uses those gradients to update the weights.
Bidirectional RNN
In sequence learning, so far we assumed that our goal is to model the next output given
what we have seen so far, e.g., in the context of a time series or in the context of a
language model. While this is a typical scenario, it is not the only one we might
encounter. To illustrate the issue, consider the following three tasks of filling in the blank
in a text sequence:
● I am ___.
● I am ___ hungry.
● I am ___ hungry, and I can eat half a pig.
Depending on the amount of information available, we might fill in the blanks with very
different words such as “happy”, “not”, and “very”. Clearly the end of the phrase (if
available) conveys significant information about which word to pick. A sequence model
that is incapable of taking advantage of this will perform poorly on related tasks. For
instance, in named entity recognition (e.g., recognizing whether “Green” refers to
“Mr. Green” or to the color), longer-range context is equally vital. To get
some inspiration for addressing the problem let us take a detour to probabilistic
graphical models.
https://d2l.ai/chapter_recurrent-modern/bi-rnn.html
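A minimal Keras sketch of a bidirectional recurrent layer, which runs over the sequence both left-to-right and right-to-left so that later context can influence earlier predictions; the vocabulary size, sequence length, and layer sizes are placeholders:

```python
import tensorflow as tf

# Placeholder sizes: 10,000-token vocabulary, sequences of 50 token ids.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,), dtype="int32"),
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    # The Bidirectional wrapper runs one LSTM forward and one backward
    # over the sequence and concatenates their outputs.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```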
LSTM
Long short-term memory networks (LSTMs) are an extension for recurrent neural
networks, which basically extends the memory. Therefore it is well suited to learn
from important experiences that have very long time lags in between.
LSTMs enable RNNs to remember inputs over a long period of time. This is
because LSTMs contain information in a memory, much like the memory of a
computer. The LSTM can read, write and delete information from its memory.
This memory can be seen as a gated cell, with gated meaning the cell decides
whether or not to store or delete information (i.e., if it opens the gates or not),
based on the importance it assigns to the information. The assigning of importance
happens through weights, which are also learned by the algorithm. This simply
means that it learns over time what information is important and what is not.
In an LSTM you have three gates: input, forget and output gate. These gates
determine whether or not to let new input in (input gate), delete the information
because it isn’t important (forget gate), or let it impact the output at the current
timestep (output gate). A minimal sketch of a single LSTM cell step, showing all
three gates, follows.
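This NumPy sketch follows the standard LSTM cell formulation just to make the three gates concrete; all sizes, weights, and values are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One time step of a standard LSTM cell.
    W has shape (input_dim + hidden_dim, 4 * hidden_dim)."""
    hidden = h_prev.shape[0]
    z = np.concatenate([x_t, h_prev]) @ W + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate: what to write
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate: what to erase
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: what to expose
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell content
    c = f * c_prev + i * g                  # new long-term memory (cell state)
    h = o * np.tanh(c)                      # new short-term memory / output
    return h, c

rng = np.random.default_rng(0)
x_t = rng.normal(size=3)                     # current input (3 features)
h_prev, c_prev = np.zeros(5), np.zeros(5)    # previous short/long-term memory
W = rng.normal(size=(3 + 5, 4 * 5))
b = np.zeros(4 * 5)
h, c = lstm_step(x_t, h_prev, c_prev, W, b)
print(h.shape, c.shape)
```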
A recurrent neural network is a type of ANN that is used when users want to
perform predictive operations on sequential or time-series based data.
These deep learning layers are commonly used for ordinal or temporal problems
such as natural language processing, neural machine translation, automated
image captioning, and similar tasks.
Today's voice assistants such as Google Assistant, Alexa, and Siri incorporate
these layers to provide a hassle-free experience for users.
https://analyticsindiamag.com/lstm-vs-gru-in-recurrent-neural-network-a-comparative-study/
GRU
Bidirectional LSTM
Mod 3
TensorFlow: introduction, tensors, tensor properties, basic tensor methods
● The inputs, outputs, and transformations within neural networks are all
represented using tensors, and as a result, neural network programming utilizes
tensors heavily.
● A tensor is a multi-dimensional array with a shape (its size along each dimension),
a rank (its number of dimensions), and a dtype (the type of its elements).
● Basic tensor methods include creating tensors with tf.constant, reshaping with
tf.reshape, changing the dtype with tf.cast, and element-wise and matrix operations;
a short sketch follows.
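A short sketch of basic tensor creation, properties, and methods in TensorFlow; the values are arbitrary:

```python
import tensorflow as tf

# Creating a tensor and inspecting its basic properties.
t = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
print(t.shape)          # (2, 3)
print(t.dtype)          # tf.float32
print(tf.rank(t))       # 2, i.e. a matrix

# A few basic tensor methods / ops.
print(tf.reshape(t, (3, 2)))        # change the shape
print(tf.cast(t, tf.int32))         # change the dtype
print(tf.reduce_sum(t, axis=1))     # row-wise sums
print(t @ tf.transpose(t))          # matrix multiplication
```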
Neural networks working
● First phase is the propagation (forward) phase: it occurs when the network is exposed to the
training data, which crosses the entire neural network so that its predictions
(labels) can be calculated.
● Next we use a loss function to estimate the loss (or error), i.e., to compare and
measure how good or bad our prediction result was in relation to the correct
result.
● After this, backpropagation distributes the error: each neuron receives an error signal, but
the neurons of the hidden layers only receive a fraction of the total loss signal,
proportional to the relative contribution each neuron made to the original output.
This process is repeated, layer by layer, until all the neurons in the network have
received a loss signal that describes their relative contribution to the total loss.
● Now that we have propagated this information back, we can adjust the weights of the
connections between neurons, with the aim of making the loss as close as possible
to zero the next time we use the network for a prediction. For this, we use a
technique called gradient descent.
● For model parameterization we tune the epochs, batch size, and learning rate.
CNN in Tensorflow
Convolution Layer
- A set of learnable filters (kernels) slides over the input and computes dot products
  with local patches of the input.
- Each filter produces a feature map; the number of filters determines the depth of
  the output volume.
- Hyperparameters: filter size, number of filters, stride, and padding.
Activation Layer
- Applies a non-linear function (typically ReLU) element-wise to the feature maps.
Pooling layer
- Downsamples the feature maps, reducing the number of parameters and computation.
- Stride: how many positions the pooling window moves, i.e., how many values are skipped.
- Depth remains the same after pooling; only the spatial dimensions shrink.
- Common choices are max pooling and average pooling.
Fully connected layer
- Flattens the final feature maps and connects every unit to every unit of the next
  layer; usually used at the end of the network for classification.
A Keras sketch of this layer stack follows.
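A minimal Keras sketch stacking the layers listed above (convolution, activation, pooling, fully connected); the input shape assumes something like 28×28 grayscale images, and the filter counts are placeholders:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    # Convolution layer: 32 filters of size 3x3, with ReLU as the activation.
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    # Pooling layer: 2x2 max pooling halves the spatial size, depth stays 32.
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    # Fully connected layers: flatten the feature maps and classify.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```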
https://www.tensorflow.org/guide/keras/rnn
https://towardsdatascience.com/lstm-by-example-using-tensorflow-feb0c1968537
Mod 4
Out-of-vocabulary (OOV) terms are words that are not part of the normal lexicon found in
a natural language processing environment. In speech recognition, OOV terms are words
in the audio signal that are missing from the recognizer's vocabulary. Word vectors are
numerical representations of words that aim to capture their meaning.
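A small sketch of how OOV words are commonly handled when vectorizing text with Keras: any word not seen during adaptation is mapped to a reserved unknown-token index. The example sentences are made up:

```python
import tensorflow as tf

train_texts = ["the cat sat on the mat", "the dog ate my homework"]

# Build a vocabulary from the training texts only.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=100)
vectorizer.adapt(train_texts)

# "giraffe" and "piano" were never seen, so they map to the reserved
# OOV index (1, shown as "[UNK]" in the vocabulary).
print(vectorizer(["the giraffe sat on the piano"]))
print(vectorizer.get_vocabulary()[:5])
```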
Questions paper
Mod 1
Q1. (a)
Neural networks working
● First phase is the propagation (forward) phase: it occurs when the network is exposed to the
training data, which crosses the entire neural network so that its predictions
(labels) can be calculated.
● Next we use a loss function to estimate the loss (or error), i.e., to compare and
measure how good or bad our prediction result was in relation to the correct
result.
● After this, backpropagation distributes the error: each neuron receives an error signal, but
the neurons of the hidden layers only receive a fraction of the total loss signal,
proportional to the relative contribution each neuron made to the original output.
This process is repeated, layer by layer, until all the neurons in the network have
received a loss signal that describes their relative contribution to the total loss.
● Now that we have propagated this information back, we can adjust the weights of the
connections between neurons, with the aim of making the loss as close as possible
to zero the next time we use the network for a prediction. For this, we use a
technique called gradient descent.
● For model parameterization we tune the epochs, batch size, and learning rate.
Q1 (b)
Activation functions
Each neuron has an activation function that defines the output it passes to the next
neuron forward. This output is received by the neurons of the next layer to
which the neuron is connected. The activation function introduces non-linearity into the
modeling capabilities of the network. Below we will list the most used
nowadays; all of them can be used in a layer of Keras.
Step Function:
Consider a threshold value: if the value of the net input, say y, is greater than the
threshold, the neuron is activated; otherwise it is not.
Mathematically, f(y) = 1 if y > threshold, and f(y) = 0 otherwise.
Sigmoid
The sigmoid function squashes its input into the range (0, 1), so its
outputs can be interpreted as probabilities for the output classes.
ReLU
The ReLU activation function only activates a neuron if its input is above a certain
threshold. The default and more usual behavior is that, as long as the input
has a value below zero, the output will be zero; when the input rises
above zero, the output is a linear function of the input of the form
f(x) = x. The ReLU activation function has proven to work in many different
situations and is currently widely used.
Mod 2
Q1(a)
CNN
- Sparse Interaction
  - In a fully connected NN, every neuron in one layer is connected to every neuron
    in the next layer, but we do not actually need all of those connections.
  - In a CNN, each output unit is connected only to a small local region of the
    input, so only the connections that are actually needed are present.
- Parameter Sharing
  - The same kernel (set of weights) is reused at every position of the input, which
    greatly reduces the number of parameters to learn.
- Equivariant Representation
  - Because the same kernel is applied everywhere, shifting the input shifts the
    resulting feature map in the same way (translation equivariance).
Layers in CNN
Convolution Layer
- A set of learnable filters (kernels) slides over the input and computes dot products
  with local patches of the input.
- Each filter produces a feature map; the number of filters determines the depth of
  the output volume.
- Hyperparameters: filter size, number of filters, stride, and padding.
Activation Layer
- Applies a non-linear function (typically ReLU) element-wise to the feature maps.
Pooling layer
- Downsamples the feature maps, reducing the number of parameters and computation.
- Stride: how many positions the pooling window moves, i.e., how many values are skipped.
- Depth remains the same after pooling; only the spatial dimensions shrink.
- Common choices are max pooling and average pooling.
Fully connected layer
- Flattens the final feature maps and connects every unit to every unit of the next
  layer; usually used at the end of the network for classification.
https://builtin.com/data-science/recurrent-neural-networks-and-lstm
Working of LSTM
Input Gate
The input gate decides what information will be stored in long term
memory. It only works with the information from the current input
and short term memory from the previous step. At this gate, it
filters out the information from variables that are not useful.
Forget Gate
The forget gate decides which information in the long term memory should be
kept or discarded, based on the current input and the short term memory from
the previous step.
Output Gate
The output gate will take the current input, the previous short term
memory and newly computed long term memory to produce new
short term memory which will be passed on to the cell in the next
time step. The output of the current time step can also be drawn
from this hidden state.
Working of GRU
The workflow of the Gated Recurrent Unit, in short GRU, is the same as the
RNN but the difference is in the operation and gates associated with each GRU
unit. To solve the problem faced by standard RNN, GRU incorporates the two
gate operating mechanisms called Update gate and Reset gate.
Update gate
The update gate is responsible for determining the amount of previous
information that needs to pass along the next state. This is really powerful
because the model can decide to copy all the information from the past and
eliminate the risk of vanishing gradient.
Reset gate
The reset gate is used from the model to decide how much of the past
information is needed to neglect; in short, it decides whether the previous cell
state is important or not.
First, the reset gate comes into action: it decides how much relevant information
from the past time step is carried into the new memory content. The input vector
and the previous hidden state are multiplied by their weights, the reset gate is
multiplied element-wise with the previous hidden state, and after summing these
terms a non-linear activation (tanh) is applied to produce the candidate state,
from which the next hidden state is generated.
This covers the basic operation of the GRU; a minimal sketch of a single GRU
step follows.
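This NumPy sketch shows one GRU step with its update gate and reset gate; it follows one common formulation of the GRU equations (biases omitted for brevity), and all sizes and values are placeholders:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One time step of a GRU cell (biases omitted)."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(xh @ Wz)      # update gate: how much of the past to keep
    r = sigmoid(xh @ Wr)      # reset gate: how much of the past to neglect
    # Candidate state uses the reset-scaled previous hidden state.
    h_tilde = np.tanh(np.concatenate([x_t, r * h_prev]) @ Wh)
    # One common convention: z weights the previous state, (1 - z) the candidate.
    return z * h_prev + (1 - z) * h_tilde

rng = np.random.default_rng(0)
x_t = rng.normal(size=3)      # current input (3 features)
h_prev = np.zeros(5)          # previous hidden state (5 units)
Wz = rng.normal(size=(8, 5))
Wr = rng.normal(size=(8, 5))
Wh = rng.normal(size=(8, 5))
print(gru_step(x_t, h_prev, Wz, Wr, Wh).shape)
```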
LSTM Vs GRU
Now we have seen how both layers combat the problem of vanishing gradient. So you
might wonder which one to use. As the GRU is relatively new, its tradeoffs have not
been fully explored yet.
According to empirical evaluation, there is not a clear winner. The basic idea of
using a gating mechanism to learn long-term dependencies is the same as in
LSTM.
● GRUs do not possess a separate internal memory (cell state), and they don’t have an
output gate.
● The LSTM’s input and forget gates are coupled into a single update gate in the GRU,
while in the LSTM the responsibility of the GRU’s reset gate is handled by those two
gates (input and forget).
Mod 3
Q1.
Q2.
Q3.
https://towardsdatascience.com/learning-process-of-a-deep-neural-network-5a9768d7a651
Deep learning is a type of machine learning and artificial intelligence (AI) that imitates
the way humans gain certain types of knowledge. Deep learning is an important
element of data science, which includes statistics and predictive modeling. It is
extremely beneficial to data scientists who are tasked with collecting, analyzing and
interpreting large amounts of data; deep learning makes this process faster and easier.
Long short-term memory is a modified RNN architecture that addresses the problem of
training over long sequences and retaining memory.
LSTM is best suited for sequence data. LSTM can predict, classify, and generate
sequence data.
Prediction based on a sequence of data is called sequence prediction. Sequence
prediction is said to have four types:
● Sequence prediction: predicting the next value(s) of a given input sequence.
● Sequence classification: predicting a class label for a given input sequence.
● Sequence generation: generating a new sequence that has the same general
characteristics as other sequences in the corpus.
● Sequence-to-sequence prediction: predicting an output sequence given an input
sequence.
Mod 4
Q1.
When there are a small number of training examples, the model sometimes
learns from noise or unwanted details in the training examples, to an extent that
it negatively impacts the performance of the model on new examples. This
phenomenon is known as overfitting.
It means that the model will have a difficult time generalizing on a new dataset.
Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data
to the extent that it negatively impacts the performance of the model on new
data. This means that the noise or random fluctuations in the training data are
picked up and learned as concepts by the model. The problem is that these
concepts do not apply to new data and negatively impact the model's ability to
generalize.
Overfitting is more likely with nonparametric and nonlinear models that have
more flexibility when learning a target function. As such, many nonparametric
machine learning algorithms also include parameters or techniques to limit and
constrain how much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm that
is very flexible and is subject to overfitting training data. This problem can be
addressed by pruning a tree after it has learned in order to remove some of the
detail it has picked up.
There are multiple ways to fight overfitting in the training process; two common ones
are data augmentation and adding Dropout to the model, as in the sketch below.
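A minimal Keras sketch of these two remedies (random augmentation layers plus Dropout); the layer choices, rates, and input shape are placeholders and assume a recent TF 2.x / Keras version:

```python
import tensorflow as tf

# Data augmentation: random transforms applied to training images on the fly,
# so the model rarely sees exactly the same example twice.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(180, 180, 3)),
    data_augmentation,
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),   # randomly zero 50% of units during training
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```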
Q3.
https://dataaspirant.com/word-embedding-techniques-nlp/#t-1597685144204
The word2vec method learns these kinds of relationships between words while building a
model. For this purpose word2vec uses two methods:
1. Skip-gram
2. CBOW (Continuous Bag of Words)
1. Skip-gram
In this method, we take the center word of the window as the input and the
context words (neighboring words) as the outputs. Word2vec models predict the context
words of a center word using the skip-gram method. Skip-gram works well with small
datasets and identifies rare words really well.
2. CBOW
CBOW is just the reverse of the skip-gram method: here we take the context
words as input and predict the center word within the window. Another difference
from skip-gram is that CBOW trains faster and gives better representations for
more frequent words. A sketch of both variants follows.
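A small sketch using gensim's Word2Vec to train both variants on a toy corpus; the sentences are made up, the hyperparameters are placeholders, and the parameter names follow gensim 4.x:

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens.
sentences = [
    ["deep", "learning", "uses", "neural", "networks"],
    ["word2vec", "learns", "word", "vectors", "from", "text"],
    ["skip", "gram", "predicts", "context", "words"],
]

# sg=1 -> skip-gram (center word predicts context words).
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# sg=0 -> CBOW (context words predict the center word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["word2vec"].shape)          # a 50-dimensional word vector
print(cbow.wv.most_similar("learning", topn=3))
```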
Resources
TF-IDF
TF
IDF
Implementation of TF-IDF by using Sklearn
Word2vec
Skip-Gram
Continuous Bag-of-words
Word2vec implementation
Word embedding model using Pre-trained models
Google word2vec
Stanford Glove Embeddings