Nria20-Dl - Unit-4 Notes-Final
Prepared by: Ch. Santhi, Asst Professor-CSE NRI Institute of Technology (Autonomous)
Recurrent Neuron
RNN Unfolding
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
1. One to One
2. One to Many
3. Many to One
4. Many to Many
One to One
This type of RNN behaves like a simple neural network and is also known as a Vanilla Neural Network. It has a single input and a single output.
One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples of this network is image captioning, where, given an image, we predict a sentence consisting of multiple words.
Many to One
In this type of network, many inputs are fed to the network at successive time steps, and only one output is generated. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem. One example of this is language translation, where we provide multiple words from one language as input and predict multiple words of the second language as output.
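The difference between these input/output patterns is easiest to see in code. Below is a brief sketch, assuming TensorFlow/Keras is installed, that contrasts a many-to-one RNN with a many-to-many RNN via the return_sequences flag; the layer sizes and input shapes are arbitrary placeholders.

```python
# Many-to-one vs. many-to-many with a simple RNN layer (illustrative only).
import tensorflow as tf

x = tf.random.normal((1, 5, 8))      # 1 sequence, 5 time steps, 8 features each

many_to_one = tf.keras.layers.SimpleRNN(16)                          # last output only
many_to_many = tf.keras.layers.SimpleRNN(16, return_sequences=True)  # output per step

print(many_to_one(x).shape)    # (1, 16)    -> one output for the whole sequence
print(many_to_many(x).shape)   # (1, 5, 16) -> one output at every time step
```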
In a Bi-RNN, the input data is passed through two separate RNNs: one processes the data in the forward
direction, while the other processes it in the reverse direction. The outputs of these two RNNs are then
combined in some way to produce the final output.
One common way to combine the outputs of the forward and reverse RNNs is to concatenate them, but other methods, such as element-wise addition or multiplication, can also be used. The choice of combination method can depend on the specific task and the desired properties of the final output.
Consider an example where a recurrent network is used to predict the masked word in a sentence. In the first example sentence, the answer could be a fruit, a company, or a phone, but it cannot be a fruit in the second and third sentences.
A recurrent neural network that can only process the input from left to right may not accurately predict the right answer for such sentences.
To perform well on natural language tasks, the model must be able to process the sequence in both directions.
Bi-directional RNNs
A bidirectional recurrent neural network (Bi-RNN) is a type of RNN that processes input sequences in both forward and backward directions.
This allows the network to capture information from the input sequence that is relevant to the output prediction but would be lost in a traditional RNN that processes the sequence in only one direction.
This allows the network to consider information from the past and future when making predictions rather than
just relying on the input data at the current time step.
This can be useful for tasks such as language processing, where understanding the context of a word or phrase
can be important for making accurate predictions.
In general, bidirectional RNNs can help improve a model's performance on various sequence-based tasks.
These two RNNs are typically called forward and backward RNNs, respectively.
During the forward pass, the forward RNN processes the input sequence in the usual way, taking the input at each time step and using it to update its hidden state; the backward RNN does the same but starts from the last time step and works toward the first. The updated hidden states are then used to predict the output.
There are several ways in which the outputs of the forward and backward RNNs can be merged, depending on
the specific needs of the model and the task it is being used for. Some common merge modes include:
1. Concatenation: In this mode, the outputs of the forward and backward RNNs are concatenated together, resulting in a single output tensor whose feature dimension at each time step is twice that of each individual RNN's output.
2. Sum: In this mode, the outputs of the forward and backward RNNs are added together element-wise, resulting in a single output tensor with the same shape as each individual RNN's output.
3. Average: In this mode, the outputs of the forward and backward RNNs are averaged element-wise, resulting in a single output tensor with the same shape as each individual RNN's output.
4. Maximum: In this mode, the maximum of the forward and backward outputs is taken element-wise at each time step, resulting in a single output tensor with the same shape as each individual RNN's output.
Which merge mode to use will depend on the specific needs of the model and the task it is being used for.
Concatenation is generally a good default choice and works well in many cases, but other merge modes may
be more appropriate for certain tasks.
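As a concrete illustration, the short sketch below (again assuming TensorFlow/Keras; the layer sizes are arbitrary) shows how the chosen merge mode changes the shape of a bidirectional layer's output.

```python
# Effect of the merge mode on a bidirectional RNN's output shape (illustrative).
import tensorflow as tf

x = tf.random.normal((2, 5, 8))   # batch of 2 sequences, 5 steps, 8 features

bi_concat = tf.keras.layers.Bidirectional(
    tf.keras.layers.SimpleRNN(16, return_sequences=True), merge_mode="concat")
bi_sum = tf.keras.layers.Bidirectional(
    tf.keras.layers.SimpleRNN(16, return_sequences=True), merge_mode="sum")

print(bi_concat(x).shape)  # (2, 5, 32) -- concatenation doubles the feature width
print(bi_sum(x).shape)     # (2, 5, 16) -- element-wise sum keeps the width
```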
The encoder produced state C representing the sentence in the source language (English): I love learning.
Then, the decoder unfolded that state C into the target language (Spanish): Amo el aprendizaje.
C can be considered a vectorized representation of the whole sequence; in other words, we could use an encoder as a rough means to obtain embeddings from text of arbitrary length, although this is not the proper way to obtain such embeddings.
Also, in the decoder part of the diagram, there should be a softmax function that finds the word from the
vocabulary with the highest probability for that input and hidden state.
Let’s update our diagram with these additional details:
2. One to Many
The main application of this architecture is text generation. The input is a topic, and the output is the
sequence of words generated for that topic:
Machine Translation: One of the most prominent applications of Seq2Seq models is translating text
from one language to another, such as converting English sentences into French sentences.
Text Summarization: Seq2Seq models can generate concise summaries of longer documents, capturing
the essential information while omitting less relevant details.
Speech Recognition: Converting spoken language into written text. Seq2Seq models can be trained to
map audio signals (sequences of sound) to their corresponding transcriptions (sequences of words).
Chatbots and Conversational AI: These models can generate human-like responses in a conversation,
taking the previous sequence of user inputs and generating appropriate replies.
Image Captioning: Seq2Seq models can describe the content of an image in natural language. The
encoder processes the image (often using Convolutional Neural Networks, CNNs) to produce a context
vector, which the decoder converts into a descriptive sentence.
Video Captioning: Similar to image captioning but with videos, Seq2Seq models generate descriptive
texts for video content, capturing the sequence of actions and scenes.
Time Series Prediction: Predicting the future values of a sequence based on past observations. This application is common in finance (stock prices), meteorology (weather forecasting), and more.
Code Generation: This process generates code snippets or entire programs from natural language
descriptions, which is helpful in programming assistants and automated software engineering tools.
Encoder-Decoder Architecture
The most common architecture used to build Seq2Seq models is Encoder-Decoder architecture. The model
consists of 3 parts: encoder, intermediate (encoder) vector and decoder.
Encoder
Multiple RNN cells can be stacked together to form the encoder. The RNN reads each input sequentially.
For every timestep (each input) t, the hidden state (hidden vector) h is updated according to the input at that timestep, X[t].
After all the inputs are read by the encoder model, the final hidden state represents the context/summary of the whole input sequence.
Example: Consider the input sequence "I am a Student" to be encoded. There will be a total of 4 timesteps (4 tokens) for the encoder model. At each time step, the hidden state h will be updated using the previous hidden state and the current input.
Example: Encoder
At the first timestep t1, the previous hidden state h0 is taken to be zero or randomly initialised, so the first RNN cell updates the current hidden state using the first input and h0. Each cell outputs two things: the updated hidden state and an output for that stage. During encoding, the outputs at each stage are discarded and only the hidden states are propagated to the next stage.
The hidden state h_t is computed using the formula h_t = f(W_hh · h_(t-1) + W_hx · x_t), where W_hh and W_hx are the recurrent and input weight matrices and f is an activation function such as tanh.
At the second timestep t2, the hidden state h1 and the second input X[2] are given as input, and the hidden state h2 is updated according to both. This repeats for all four timesteps of our example.
The encoder is thus a stack of several recurrent units (LSTM or GRU cells for better performance), each of which accepts a single element of the input sequence, collects information for that element, and propagates it forward.
In the question-answering problem, the input sequence is a collection of all words from the question. Each word is
represented as x_i where i is the order of that word.
This simple formula represents the result of an ordinary recurrent neural network. As you can see, we just apply the appropriate weights to the previous hidden state h_(t-1) and the input vector x_t.
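The following minimal NumPy sketch mirrors the encoder loop just described: the hidden state is updated at each timestep from the previous hidden state and the current token vector, and the final state becomes the context vector. The embedding size, hidden size, and random token vectors standing in for "I am a Student" are placeholders chosen only for illustration.

```python
# Encoder loop: h_t = tanh(W_hh @ h_{t-1} + W_hx @ x_t); final h = context vector.
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 4, 6
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))   # recurrent weights
W_hx = rng.normal(scale=0.1, size=(d_hid, d_emb))   # input weights

# Toy vectors standing in for the 4 tokens of "I am a Student".
inputs = [rng.normal(size=d_emb) for _ in range(4)]

h = np.zeros(d_hid)                        # h0 initialised to zero
for x_t in inputs:
    h = np.tanh(W_hh @ h + W_hx @ x_t)     # update hidden state; output discarded

context = h                                # encoder vector handed to the decoder
print(context.shape)                       # (6,)
```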
Encoder Vector
This is the final hidden state produced from the encoder part of the model. It is calculated using the formula
above.
This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate
predictions.
It acts as the initial hidden state of the decoder part of the model.
Decoder
The Decoder generates the output sequence by predicting the next output Yt given the hidden state ht.
The input for the decoder is the final hidden vector obtained at the end of encoder model.
Each decoder cell has three inputs: the hidden vector from the previous timestep h_(t-1), the output of the previous timestep y_(t-1), and the original encoder hidden vector h.
At the first timestep, the encoder output vector, the special START symbol, and an empty initial hidden state h_(t-1) are given as input; the outputs obtained are y1 and the updated hidden state h1 (the information of the output is subtracted from the hidden vector).
The second cell takes the updated hidden state h1, the previous output y1, and the original hidden vector h as its inputs, and produces the hidden vector h2 and output y2.
The output produced at each timestep of the decoder forms the actual output sequence. The model keeps predicting until the END symbol is produced.
The decoder is a stack of several recurrent units, each of which predicts an output y_t at time step t.
Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden
state.
In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is
represented as y_i where i is the order of that word.
Example: Decoder.
Any hidden state h_t is computed using the formula h_t = f(W_hh · h_(t-1)).
As you can see, we are just using the previous hidden state to compute the next one.
Output Layer
We use Softmax activation function at the output layer.
It is used to produce a probability distribution from a vector of values, with the target class receiving the highest probability.
The output y_t at time step t is computed using the formula y_t = softmax(W_S · h_t).
We calculate the outputs using the hidden state at the current time step together with the respective weight
W(S). Softmax is used to create a probability vector that will help us determine the final output (e.g. word in the
question-answering problem).
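Below is a minimal NumPy sketch of this simplified decoder step and softmax output layer; the vocabulary size, weight matrices, and initial state are illustrative placeholders.

```python
# Decoder step h_t = tanh(W_hh @ h_{t-1}) followed by y_t = softmax(W_S @ h_t).
import numpy as np

rng = np.random.default_rng(1)
d_hid, vocab_size = 6, 10
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))      # recurrent weights
W_S = rng.normal(scale=0.1, size=(vocab_size, d_hid))  # output weights

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

h = rng.normal(size=d_hid)         # stand-in for the encoder (context) vector
for t in range(3):                 # generate three output steps
    h = np.tanh(W_hh @ h)          # update hidden state from the previous one
    y = softmax(W_S @ h)           # probability distribution over the vocabulary
    print(t, int(y.argmax()))      # index of the most probable token
```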
BPTT works by unfolding the RNN over time, creating a series of interconnected feedforward networks. Each
time step corresponds to one layer in this unfolded network, and the weights between layers are shared across
time steps. The unfolded network can be thought of as a very deep feedforward network, where the weights
are shared across layers.
During training, the error is backpropagated through the unfolded network, and the weights are updated using
gradient descent. This allows the network to learn to predict the output at each time step based on the input at
that time step as well as the previous time steps.
However, BPTT has some challenges, such as the vanishing gradient problem, where the gradients become
very small as they propagate back in time, making it difficult to learn long-term dependencies. To address this
issue, various modifications of BPTT have been proposed, such as truncated backpropagation through time
and gradient clipping.
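For instance, gradient clipping can be as simple as rescaling the gradient vector whenever its norm exceeds a chosen threshold. The small NumPy illustration below uses a threshold of 5.0, which is an arbitrary choice rather than any library default.

```python
# Gradient norm clipping: rescale the gradient if its L2 norm is too large.
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # keep direction, shrink magnitude
    return grad

g = np.array([30.0, -40.0])               # an "exploded" gradient with norm 50
print(clip_by_norm(g))                    # rescaled to norm 5 -> [ 3. -4.]
```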
Uses of BPTT:
BPTT is a widely used technique for training recurrent neural networks (RNNs) that can be used for various
applications such as speech recognition, language modeling, and time series prediction. Here are some
specific use cases for BPTT:
Speech recognition: BPTT can be used to train RNNs for speech recognition tasks, where the network takes
in a sequence of audio samples and predicts the corresponding text. BPTT allows the network to learn the
temporal dependencies in the audio signal and use them to make accurate predictions.
Language modeling: BPTT can also be used to train RNNs for language modeling tasks, where the network
predicts the probability distribution of the next word in a sequence given the previous words. This can be
useful for applications such as text generation and machine translation.
Time series prediction: BPTT can be used to train RNNs for time series prediction tasks, where the network
takes in a sequence of data points and predicts the next value in the sequence. BPTT allows the network to
learn the temporal dependencies in the data and use them to make accurate predictions.
Overall, BPTT is a powerful tool for training RNNs to model sequential data, and it has been applied
successfully to a wide range of applications in various fields such as speech recognition, natural language
processing, and finance.
Example of BPTT:
Let’s consider a simple example of using BPTT to train a recurrent neural network (RNN) for time series
prediction. Suppose we have a time series dataset that consists of a sequence of data points: {x1, x2, x3, …,
xn}.
The goal is to train an RNN to predict the next value in the sequence, xn+1, given the previous values in the
sequence.
To do this, we can use BPTT to backpropagate errors through time and update the weights of the RNN.
Here’s how the BPTT algorithm might work:
1. Initialize the weights of the RNN randomly.
2. Feed the first input x1 into the RNN and compute the output y1.
3. Compute the loss between the predicted output y1 and the actual output x2.
4. Backpropagate the error through the network using the chain rule, updating the weights at each time
step.
5. Feed the second input x2 into the RNN and compute the output y2.
6. Compute the loss between the predicted output y2 and the actual output x3.
7. Backpropagate the error through the network again, updating the weights at each time step.
8. Repeat steps 5–7 for the entire sequence of inputs {x1, x2, x3, …, xn}.
9. Test the RNN on a separate validation set and adjust the hyperparameters as necessary.
During training, the weights of the RNN are updated based on the gradients computed by backpropagating the
errors through time. This allows the RNN to learn the temporal dependencies in the data and make accurate
predictions for the next value in the sequence.
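The sketch below puts these steps into code for a tiny RNN with one hidden unit trained on a toy sine-wave series. Unlike the step list above, it accumulates the gradients over the whole sequence before updating the weights, which is the more common formulation of BPTT; the data, sizes, learning rate, and clipping threshold are all illustrative placeholders.

```python
# Full BPTT for one-step-ahead prediction with a single-unit RNN (illustrative).
import numpy as np

x = np.sin(np.linspace(0.0, 6.0, 40))        # toy series x1..xn
w_x, w_h, w_y = 0.5, 0.5, 0.5                # input, recurrent, output weights
lr = 0.01

for epoch in range(300):
    # Forward pass: unfold the RNN over time, storing states for backprop.
    h = np.zeros(len(x))
    y = np.zeros(len(x) - 1)
    h_prev = 0.0
    for t in range(len(x) - 1):
        h[t] = np.tanh(w_x * x[t] + w_h * h_prev)    # hidden state update
        y[t] = w_y * h[t]                            # prediction of x[t+1]
        h_prev = h[t]

    # Backward pass: propagate the squared-error gradient back through time.
    dw_x = dw_h = dw_y = 0.0
    dh_next = 0.0
    for t in reversed(range(len(x) - 1)):
        dy = y[t] - x[t + 1]                 # d(loss)/d(y_t)
        dw_y += dy * h[t]
        dh = dy * w_y + dh_next              # gradient flowing into h_t
        dpre = dh * (1.0 - h[t] ** 2)        # through the tanh nonlinearity
        dw_x += dpre * x[t]
        dw_h += dpre * (h[t - 1] if t > 0 else 0.0)
        dh_next = dpre * w_h                 # pass gradient on to h_{t-1}

    # Clip the accumulated gradients to guard against explosion, then update.
    dw_x, dw_h, dw_y = np.clip([dw_x, dw_h, dw_y], -5.0, 5.0)
    w_x -= lr * dw_x
    w_h -= lr * dw_h
    w_y -= lr * dw_y

print("final mean squared error:", float(np.mean((y - x[1:]) ** 2)))
```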
Overall, BPTT is a powerful technique for training RNNs to model sequential data, and it has been successfully
applied to a wide range of applications in various fields.
Limitations of BPTT:
While backpropagation through time (BPTT) is a powerful technique for training recurrent neural networks
(RNNs), it has some limitations:
1. Computational complexity: BPTT requires computing the gradient at each time step, which can be
computationally expensive for long sequences. This can lead to slow training times and may require
specialized hardware to train large-scale models.
2. Vanishing gradients: BPTT is prone to the problem of vanishing gradients, where the gradients
become very small as they propagate back in time. This can make it difficult to learn long-term
dependencies, which are important for many sequential data modeling tasks.
3. Exploding gradients: On the other hand, BPTT is also prone to the problem of exploding gradients,
where the gradients become very large as they propagate back in time. This can lead to unstable training
and can cause the weights of the network to become unbounded, resulting in NaN values.
4. Memory limitations: BPTT requires storing the activations of each time step, which can be memory-intensive for long sequences. This can limit the size of the sequence that can be processed by the network.
network.
5. Difficulty in parallelization: BPTT is inherently sequential, which makes it difficult to parallelize
across multiple GPUs or machines. This can limit the scalability of the training process.
Just like a simple RNN, an LSTM also has a hidden state where H(t-1) represents the hidden state of the
previous timestamp and Ht is the hidden state of the current timestamp. In addition to that, LSTM also
has a cell state represented by C(t-1) and C(t) for the previous and current timestamps, respectively.
Here the hidden state is known as Short term memory, and the cell state is known as Long term memory.
Refer to the following image.
It is interesting to note that the cell state carries the information along with all the timestamps.
Input Gate:
Let’s take another example.
“Bob knows swimming. He told me over the phone that he had served the navy for four long years.”
Both of these sentences are about Bob, but they give different kinds of information about him. The first sentence tells us that he knows swimming, whereas the second tells us that he used the phone and served in the navy for four years.
Now think about it: based on the context given in the first sentence, which piece of information in the second sentence is critical, that he used the phone, or that he served in the navy? In this context, it does not matter whether he used the phone or any other medium of communication to pass on the information. The fact that he was in the navy is the important information, and this is something we want our model to remember for future computation. This is the task of the input gate.
The input gate is used to quantify the importance of the new information carried by the input. Here is the equation of the input gate:
i_t = σ(x_t · U_i + H_(t-1) · W_i)
Here,
Xt: Input at the current timestamp t
Ui: weight matrix of input
Ht-1: A hidden state at the previous timestamp
Wi: Weight matrix of input associated with hidden state
Again, we have applied the sigmoid function over it. As a result, the value of i_t at timestamp t will be between 0 and 1.
New Information
Now, the new information that needs to be passed to the cell state is a function of the hidden state at the previous timestamp t-1 and the input x at timestamp t. The activation function here is tanh. Due to the tanh function, the value of the new information N_t will be between -1 and 1. If the value of N_t is negative, the information is subtracted from the cell state, and if the value is positive, the information is added to the cell state at the current timestamp.
However, the Nt won’t be added directly to the cell state. Here comes the updated equation:
Here, Ct-1 is the cell state at the current timestamp, and the others are the values we have calculated
previously.
Output Gate:
Now consider this sentence.
“Bob single-handedly fought the enemy and died for his country. For his contributions, brave______.”
In this task, we have to complete the second sentence. The minute we see the word brave, we know that we are talking about a person. In the sentence, only Bob is brave; we cannot say the enemy is brave or the country is brave. So, based on the current expectation, we have to supply a relevant word to fill in the blank. That word is our output, and this is the function of our output gate.
Here is the equation of the output gate, which is pretty similar to the two previous gates:
O_t = σ(x_t · U_o + H_(t-1) · W_o)
Its value will also lie between 0 and 1 because of the sigmoid function. Now, to calculate the current hidden state, we use O_t and the tanh of the updated cell state:
H_t = O_t * tanh(C_t)
It turns out that the hidden state is a function of the long-term memory (C_t) and the output gate. If you need the output of the current timestamp, just apply the SoftMax activation on the hidden state H_t.
Here, the token with the maximum score in the output is the prediction.
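Putting the gates together, the following minimal NumPy sketch performs a single LSTM step in the notation of these notes (U matrices act on the input x_t, W matrices act on the previous hidden state H_(t-1)); the weight matrices are random placeholders rather than trained parameters, and f_t denotes the forget gate.

```python
# One LSTM step: gates, cell-state update, and new hidden state (illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
U_f, U_i, U_c, U_o = (rng.normal(size=(d_hid, d_in)) for _ in range(4))
W_f, W_i, W_c, W_o = (rng.normal(size=(d_hid, d_hid)) for _ in range(4))

x_t = rng.normal(size=d_in)      # current input
H_prev = np.zeros(d_hid)         # H_(t-1), short-term memory
C_prev = np.zeros(d_hid)         # C_(t-1), long-term memory

f_t = sigmoid(U_f @ x_t + W_f @ H_prev)   # forget gate
i_t = sigmoid(U_i @ x_t + W_i @ H_prev)   # input gate
N_t = np.tanh(U_c @ x_t + W_c @ H_prev)   # new candidate information
C_t = f_t * C_prev + i_t * N_t            # updated cell state
o_t = sigmoid(U_o @ x_t + W_o @ H_prev)   # output gate
H_t = o_t * np.tanh(C_t)                  # new hidden state

print(H_t.shape, C_t.shape)               # (3,) (3,)
```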
This is a more intuitive diagram of the LSTM network.