
UNIT 4: SEQUENCE MODELLING –RECURRENT AND RECURSIVE NETS

TOPIC-1: RECURRENT NEURAL NETWORKS

What is Recurrent Neural Network (RNN)?


A Recurrent Neural Network (RNN) is a type of neural network in which the output from the previous step is fed
as input to the current step. In traditional neural networks, all inputs and outputs are independent of each
other. However, in tasks such as predicting the next word of a sentence, the previous words are required, and
hence there is a need to remember them. RNNs solve this problem with the help of a hidden layer. The main and
most important feature of an RNN is its hidden state, which remembers some information about the sequence.
This state is also referred to as the memory state, since it retains the previous inputs to the network. An RNN
uses the same parameters at every step, because it performs the same task on all inputs (or hidden states) to
produce the output. This parameter sharing reduces the number of parameters compared with other neural
networks.

Recurrent Neural Network


How RNN differs from Feedforward Neural Network?
Artificial neural networks that do not have looping connections between nodes are called feedforward neural
networks. Because all information is passed only forward, this kind of network is also referred to as a
multi-layer neural network.
In a feedforward neural network, information moves unidirectionally from the input layer, through any hidden
layers, to the output layer. These networks are appropriate for tasks such as image classification, where input
and output are independent of each other. However, their inability to retain previous inputs makes them less
useful for sequential data analysis.


Recurrent Neuron and RNN Unfolding


The fundamental processing unit in a Recurrent Neural Network (RNN) is the recurrent unit (often informally
called a recurrent neuron). This unit has the unique ability to maintain a hidden state, allowing the network to
capture sequential dependencies by remembering previous inputs while processing. Long Short-Term Memory
(LSTM) and Gated Recurrent Unit (GRU) variants improve the RNN’s ability to handle long-term dependencies.

Recurrent Neuron

RNN Unfolding
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
1. One to One
2. One to Many
3. Many to One
4. Many to Many
One to One
This type of RNN behaves like a simple neural network and is also known as a vanilla neural network. In this
network, there is only one input and one output.


One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples
of this network is image captioning, where, given an image, we predict a sentence consisting of multiple words.

Many to One
In this type of network, many inputs are fed to the network at several time steps, and only one output is
generated. This type of network is used in problems such as sentiment analysis, where we give multiple words
as input and predict only the sentiment of the sentence as output.


Many to Many

In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem.
One example of this is language translation: we provide multiple words from one language as input and predict
multiple words in the second language as output.
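
As a rough illustration (assuming a TensorFlow/Keras environment), the many-to-one and many-to-many cases often differ only in whether the recurrent layer returns just its final hidden state or the full sequence of states. All layer sizes below are arbitrary example values, not part of these notes.

import tensorflow as tf

# Many to One: e.g. sentiment analysis -- a sequence of word vectors in, one label out.
many_to_one = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, input_shape=(None, 100)),   # returns only the final hidden state
    tf.keras.layers.Dense(1, activation="sigmoid"),           # single sentiment score
])

# Many to Many: e.g. tagging -- one output per input time step.
many_to_many = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, input_shape=(None, 100), return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(10, activation="softmax")),
])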

Recurrent Neural Network Architecture


RNNs have the same input and output architecture as any other deep neural architecture. However,
differences arise in the way information flows from input to output. Unlike deep feedforward networks,
where each dense layer has its own weight matrix, an RNN shares the same weights across the whole network.
For every input x_t it computes a hidden state h_t using the following formulas:
h_t = σ(U·x_t + W·h_{t-1} + b)
y_t = O(V·h_t + c)
Hence
y_t = f(x_t, h_{t-1}, W, U, V, b, c)
Here h_t is the state of the network at timestep t.
The parameters of the network are W, U, V, b, c, which are shared across timesteps.

Recurrent Neural Architecture


How does RNN work?


The Recurrent Neural Network consists of multiple fixed activation function units, one for each time step.
Each unit has an internal state which is called the hidden state of the unit. This hidden state signifies the
past knowledge that the network currently holds at a given time step. This hidden state is updated at every
time step to signify the change in the knowledge of the network about the past. The hidden state is updated
using the following recurrence relation:

The formula for calculating the current state:
h_t = f(h_{t-1}, x_t)
where h_t is the current state, h_{t-1} is the previous state, and x_t is the input at the current time step.

Formula after applying the activation function (tanh):
h_t = tanh(W·h_{t-1} + U·x_t)
where W is the recurrent (hidden-to-hidden) weight matrix and U is the input-to-hidden weight matrix.

The formula for calculating the output:
y_t = V·h_t
where y_t is the output and V is the hidden-to-output weight matrix.
These parameters are updated using Backpropagation.
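
To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass described by the formulas above. The function and variable names, and all dimensions in the usage example, are illustrative choices rather than part of these notes.

import numpy as np

def rnn_forward(xs, h0, U, W, V, b, c):
    """Unrolled forward pass of a simple RNN.

    h_t = tanh(U @ x_t + W @ h_{t-1} + b)   # hidden-state (memory) update
    y_t = V @ h_t + c                       # output at each time step
    """
    h, hs, ys = h0, [], []
    for x in xs:
        h = np.tanh(U @ x + W @ h + b)      # the same U, W, b are reused at every time step
        y = V @ h + c
        hs.append(h)
        ys.append(y)
    return hs, ys

# Tiny usage example with random parameters (all sizes are arbitrary).
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 4, 2, 5
xs = [rng.normal(size=d_in) for _ in range(T)]
hs, ys = rnn_forward(xs, np.zeros(d_h),
                     rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
                     rng.normal(size=(d_out, d_h)), np.zeros(d_h), np.zeros(d_out))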


TOPIC-2: BI-DIRECTIONAL RNNs


A bi-directional recurrent neural network (Bi-RNN) is a type of recurrent neural network (RNN) that processes
input data in both forward and backward directions. The goal of a Bi-RNN is to capture the contextual
dependencies in the input data by processing it in both directions, which can be useful in various natural
language processing (NLP) tasks.

In a Bi-RNN, the input data is passed through two separate RNNs: one processes the data in the forward
direction, while the other processes it in the reverse direction. The outputs of these two RNNs are then
combined in some way to produce the final output.

One common way to combine the outputs of the forward and reverse RNNs is to concatenate them, but other
methods, such as element-wise addition or multiplication, can also be used. The choice of combination method
can depend on the specific task and the desired properties of the final output.

Need for Bi-directional RNNs


 A uni-directional recurrent neural network (RNN) processes input sequences in a single direction, either from left
to right or right to left.
 This means the network can only use information from earlier time steps when making predictions at later time
steps.
 This can be limiting, as the network may not capture important contextual information relevant to the output
prediction.
 For example, in natural language processing tasks, a uni-directional RNN may not accurately predict the next
word in a sentence if the previous words provide important context for the current word.

Consider an example where we could use the recurrent network to predict the masked word in a sentence.

1. Apple is my favorite _____.


2. Apple is my favorite _____, and I work there.
3. Apple is my favorite _____, and I am going to buy one.

In the first sentence, the answer could be fruit, company, or phone. However, it cannot be a fruit in the second
and third sentences.

A recurrent neural network that can only process the input from left to right may not accurately predict the
correct answer for sentences like these.

To perform well on natural language tasks, the model must be able to process the sequence in both directions.

Bi-directional RNNs

 A bidirectional recurrent neural network (RNN) is a type of recurrent neural network (RNN) that processes input
sequences in both forward and backward directions.
 This allows the RNN to capture information from the input sequence that may be relevant to the output prediction
but that would be lost in a traditional RNN, which processes the input sequence in only one direction.
 This allows the network to consider information from the past and future when making predictions rather than
just relying on the input data at the current time step.


 This can be useful for tasks such as language processing, where understanding the context of a word or phrase
can be important for making accurate predictions.
 In general, bidirectional RNNs can help improve a model's performance on various sequence-based tasks.

This means that the network has two separate RNNs:

1. One that processes the input sequence from left to right


2. Another one that processes the input sequence from right to left.

These two RNNs are typically called forward and backward RNNs, respectively.

During the forward pass, the forward RNN processes the input sequence in the usual way, taking the input at each
time step and using it to update its hidden state; the updated hidden state is then used to predict the output.
The backward RNN does the same, but starts from the last time step and moves towards the first.

Merge Modes in Bidirectional RNN


In a bidirectional recurrent neural network (RNN), two separate RNNs process the input data in opposite
directions (forward and backward). The output of these two RNNs is then combined, or "merged," in some
way to produce the final output of the model.

There are several ways in which the outputs of the forward and backward RNNs can be merged, depending on
the specific needs of the model and the task it is being used for. Some common merge modes include:

1. Concatenation: In this mode, the outputs of the forward and backward RNNs are concatenated at each time step,
producing an output whose feature dimension is twice that of a single RNN's output.
2. Sum: In this mode, the outputs of the forward and backward RNNs are added together element-wise, producing
an output with the same shape as each individual RNN's output.
3. Average: In this mode, the outputs of the forward and backward RNNs are averaged element-wise, producing an
output with the same shape as each individual RNN's output.
4. Maximum: In this mode, the maximum of the forward and backward outputs is taken element-wise at each time
step, producing an output with the same shape as each individual RNN's output.

Which merge mode to use will depend on the specific needs of the model and the task it is being used for.
Concatenation is generally a good default choice and works well in many cases, but other merge modes may
be more appropriate for certain tasks.
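
As a hedged illustration (assuming a TensorFlow/Keras environment), the Keras Bidirectional wrapper exposes these options through its merge_mode argument ('concat', 'sum', 'mul', 'ave'); the layer sizes below are arbitrary example values.

import tensorflow as tf

# A bidirectional wrapper around a recurrent layer; merge_mode selects how the
# forward and backward outputs are combined.
bi_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32),
    merge_mode="concat",   # concatenation (the default) doubles the feature dimension
)

# A small many-to-one model using the bidirectional layer, e.g. for sentiment analysis.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 100)),   # (time steps, feature size) -- example values
    bi_layer,
    tf.keras.layers.Dense(1, activation="sigmoid"),
])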


TOPIC-3: Encoder-Decoder Models in RNN


Encoders with RNNs:
Understanding text is an iterative process for humans: when we read a sentence, we process each word,
accumulating information up to the end of the text.
In deep learning, a system that accumulates information using similar units repeated over time is a Recurrent
Neural Network (RNN).
In general, a text encoder turns text into a numeric representation. This task can be implemented in many
different ways but, here, what we mean by encoders are RNN encoders. Let’s see a diagram:

Depending on the textbook, we can also find it in a rolled representation:

So each block is composed of the following elements at time t:

Block Input:

 Input Vector (encoding the word)


 Hidden state vector (containing the sequence state before the current block)

Block Output:

 Updated hidden state vector (containing the sequence state after processing the current word; it is passed on
to the next block)

Decoders with RNNs:


Unlike encoders, decoders unfold a vector representing the sequence state and return something meaningful for
us, such as text, tags, or labels.
An essential distinction from encoders is that decoders require both the hidden state and the output from the
previous step.
When the decoder starts processing, there is no previous output, so we use a special <start> token in those
cases.
Let’s make it clearer with the example below, which shows how machine translation works:

The encoder produced state C representing the sentence in the source language (English): I love learning.
Then, the decoder unfolded that state C into the target language (Spanish): Amo el aprendizaje.
C could be considered a vectorized representation of the whole sequence; in other words, we could use an
encoder as a rough means of obtaining embeddings from a text of arbitrary length, although this is not the
proper way to do it.

Also, in the decoder part of the diagram, there should be a softmax function that finds the word from the
vocabulary with the highest probability for that input and hidden state.
Let’s update our diagram with these additional details:


Architectures and Their Applications


1. Many to One
This is widely used for classification, typically sentiment analysis or tagging. The input is a sequence of
words, and the output is a category. This output is produced by the last block of the sequence:

2. One to Many
The main application of this architecture is text generation. The input is a topic, and the output is the
sequence of words generated for that topic:


3. Many to Many (1st version)


This is a very popular architecture in Machine Translation (also known as seq2seq). The input is a sequence
of words, and so is the output.
The network “waits” for the encoding step to finish producing the internal state, and only starts decoding once
the encoder has finished:

4. Many to Many (2nd version)


Common applications of this architecture are video captioning and part-of-speech tagging. The captions/tags are
produced as the frames arrive, so there is no waiting for a final state before decoding:

TOPIC-4: Seq2Seq Models in RNN


Seq2Seq (Sequence-to-Sequence) models are a type of neural network, typically built on a Recurrent Neural
Network architecture, designed to transform one data sequence into another. They are handy for tasks where the
input and output are sequences of varying lengths, which traditional neural networks struggle to handle, such as
machine translation, question answering, chatbots, and text summarization.

Use Cases of the Sequence to Sequence Models

 Machine Translation: One of the most prominent applications of Seq2Seq models is translating text
from one language to another, such as converting English sentences into French sentences.
 Text Summarization: Seq2Seq models can generate concise summaries of longer documents, capturing
the essential information while omitting less relevant details.


 Speech Recognition: Converting spoken language into written text. Seq2Seq models can be trained to
map audio signals (sequences of sound) to their corresponding transcriptions (sequences of words).
 Chatbots and Conversational AI: These models can generate human-like responses in a conversation,
taking the previous sequence of user inputs and generating appropriate replies.
 Image Captioning: Seq2Seq models can describe the content of an image in natural language. The
encoder processes the image (often using Convolutional Neural Networks, CNNs) to produce a context
vector, which the decoder converts into a descriptive sentence.
 Video Captioning: Similar to image captioning but with videos, Seq2Seq models generate descriptive
texts for video content, capturing the sequence of actions and scenes.
 Time Series Prediction: Predicting future values of a sequence based on past observations. This application
is common in finance (stock prices), meteorology (weather forecasting), and more.
 Code Generation: Generating code snippets or entire programs from natural language descriptions, which is
helpful in programming assistants and automated software engineering tools.

Encoder-Decoder Architecture
The most common architecture used to build Seq2Seq models is Encoder-Decoder architecture. The model
consists of 3 parts: encoder, intermediate (encoder) vector and decoder.

Encoder
 Multiple RNN cells can be stacked together to form the encoder. The RNN reads each input sequentially.
 For every timestep t (each input), the hidden state (hidden vector) h is updated according to the input at that
timestep, X[t].
 After all the inputs have been read by the encoder model, the final hidden state represents the context/summary
of the whole input sequence.
 Example: Consider the input sequence “I am a Student” to be encoded. There will be a total of 4 timesteps (4
tokens) for the encoder model. At each time step, the hidden state h will be updated using the previous hidden
state and the current input.


Example: Encoder
 At the first timestep t1, the previous hidden state h0 is taken as zero or chosen randomly. The first RNN cell
then updates the current hidden state using the first input and h0. Each cell outputs two things — the updated
hidden state and an output for that stage. In the encoder, the per-stage outputs are discarded and only the hidden
states are propagated to the next step.
 The hidden states h_t are computed using the formula:
h_t = f(W(hh)·h_{t-1} + W(hx)·x_t)
 At the second timestep t2, the hidden state h1 and the second input X[2] are given as input, and the hidden state
h2 is updated according to both. This continues in the same way for all four timesteps of this example.
 A stack of several recurrent units (LSTM or GRU cells for better performance) where each accepts a single
element of the input sequence, collects information for that element, and propagates it forward.
 In the question-answering problem, the input sequence is a collection of all words from the question. Each word is
represented as x_i where i is the order of that word.
This simple formula represents the result of an ordinary recurrent neural network: we just apply the appropriate
weights to the previous hidden state h_(t-1) and the input vector x_t.
Encoder Vector
 This is the final hidden state produced from the encoder part of the model. It is calculated using the formula
above.
 This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate
predictions.
 It acts as the initial hidden state of the decoder part of the model.
Decoder
 The decoder generates the output sequence by predicting the next output Yt given the hidden state ht.
 The input to the decoder is the final hidden vector obtained at the end of the encoder model.
 Each time step has three inputs: the hidden vector from the previous step ht-1, the previous step’s output yt-1,
and the original (encoder) hidden vector h.


 At the first time step, the encoder’s output vector, the special START symbol, and an empty hidden state ht-1
are given as input; the outputs obtained are y1 and the updated hidden state h1 (intuitively, the information
already emitted is removed from the hidden vector).
 The second time step takes the updated hidden state h1, the previous output y1, and the original encoder vector
h as its inputs, and produces the hidden vector h2 and the output y2.
 The outputs produced at each timestep of the decoder form the actual output sequence. The model keeps predicting
outputs until the END symbol occurs.
 A stack of several recurrent units where each predicts an output y_t at a time step t.
 Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden
state.
 In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is
represented as y_i where i is the order of that word.

Example: Decoder.
 Any hidden state h_t is computed using the formula:
h_t = f(W(hh)·h_{t-1})
As you can see, we are just using the previous hidden state to compute the next one.
Output Layer
 We use the Softmax activation function at the output layer.
 It is used to produce a probability distribution over the vocabulary, from which the target class has the
highest probability.
 The output y_t at time step t is computed using the formula:
y_t = Softmax(W(S)·h_t)

We calculate the outputs using the hidden state at the current time step together with the respective weight
W(S). Softmax is used to create a probability vector that will help us determine the final output (e.g. word in the
question-answering problem).
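
Putting the encoder, decoder, and softmax output layer together, here is a minimal NumPy sketch of the computation described above. The weight names, the greedy decoding loop, and all dimensions are illustrative assumptions rather than the notes' exact formulation.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs, W_hh, W_hx, h0):
    """Encoder: fold the input sequence into a single context (encoder) vector."""
    h = h0
    for x in xs:                              # read one token vector at a time
        h = np.tanh(W_hh @ h + W_hx @ x)
    return h                                  # final hidden state = context vector

def decode(context, W_hh, W_hy, start_vec, embed, end_id, max_len=10):
    """Decoder: unfold the context vector into an output token sequence (greedy)."""
    h, y_prev, outputs = context, start_vec, []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + y_prev)        # previous output fed back in
        probs = softmax(W_hy @ h)             # probability distribution over the vocabulary
        token = int(probs.argmax())
        outputs.append(token)
        if token == end_id:                   # stop at the END symbol
            break
        y_prev = embed[token]                 # embedding of the predicted token
    return outputs

# Tiny usage with random weights (all sizes are arbitrary illustrations).
rng = np.random.default_rng(0)
d_h, d_x, vocab = 8, 5, 12
xs = [rng.normal(size=d_x) for _ in range(4)]          # e.g. "I am a Student" -> 4 token vectors
ctx = encode(xs, rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x)), np.zeros(d_h))
tokens = decode(ctx, rng.normal(size=(d_h, d_h)), rng.normal(size=(vocab, d_h)),
                np.zeros(d_h), rng.normal(size=(vocab, d_h)), end_id=0)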


TOPIC-5: Back Propagation through Time (BPTT) Model in RNNs:


Backpropagation through time (BPTT) is a method used in recurrent neural networks (RNNs) to train the
network by backpropagating errors through time. In a traditional feedforward neural network, the data flows
through the network in one direction, from the input layer through the hidden layers to the output layer.
However, in RNNs, there are connections between nodes in different time steps, which means that the output
of the network at one time step depends on the input at that time step as well as the previous time steps.

BPTT works by unfolding the RNN over time, creating a series of interconnected feedforward networks. Each
time step corresponds to one layer in this unfolded network, and the weights between layers are shared across
time steps. The unfolded network can be thought of as a very deep feedforward network, where the weights
are shared across layers.

During training, the error is backpropagated through the unfolded network, and the weights are updated using
gradient descent. This allows the network to learn to predict the output at each time step based on the input at
that time step as well as the previous time steps.

However, BPTT has some challenges, such as the vanishing gradient problem, where the gradients become
very small as they propagate back in time, making it difficult to learn long-term dependencies. To address this
issue, various modifications of BPTT have been proposed, such as truncated backpropagation through time
and gradient clipping.
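
As an illustration of one of these remedies, gradient clipping is available directly in common frameworks. The following is a hedged PyTorch-style sketch; the model, the toy data, and the clipping threshold of 1.0 are arbitrary example values, not part of these notes.

import torch

# Minimal illustration: an RNN step, a loss, then gradient clipping before the update.
rnn = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = torch.nn.Linear(8, 1)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)

x = torch.randn(2, 10, 4)                     # (batch, time steps, features) -- toy data
target = torch.randn(2, 1)

out, h_n = rnn(x)                             # out: (batch, time, hidden)
loss = torch.nn.functional.mse_loss(head(out[:, -1]), target)

opt.zero_grad()
loss.backward()                               # BPTT through the unrolled sequence
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # keep gradients bounded
opt.step()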

Uses of BPTT:

BPTT is a widely used technique for training recurrent neural networks (RNNs) that can be used for various
applications such as speech recognition, language modeling, and time series prediction. Here are some
specific use cases for BPTT:


Speech recognition: BPTT can be used to train RNNs for speech recognition tasks, where the network takes
in a sequence of audio samples and predicts the corresponding text. BPTT allows the network to learn the
temporal dependencies in the audio signal and use them to make accurate predictions.

Language modeling: BPTT can also be used to train RNNs for language modeling tasks, where the network
predicts the probability distribution of the next word in a sequence given the previous words. This can be
useful for applications such as text generation and machine translation.

Time series prediction: BPTT can be used to train RNNs for time series prediction tasks, where the network
takes in a sequence of data points and predicts the next value in the sequence. BPTT allows the network to
learn the temporal dependencies in the data and use them to make accurate predictions.

Overall, BPTT is a powerful tool for training RNNs to model sequential data, and it has been applied
successfully to a wide range of applications in various fields such as speech recognition, natural language
processing, and finance.

Example of BPTT:

Let’s consider a simple example of using BPTT to train a recurrent neural network (RNN) for time series
prediction. Suppose we have a time series dataset that consists of a sequence of data points: {x1, x2, x3, …,
xn}.
The goal is to train an RNN to predict the next value in the sequence, xn+1, given the previous values in the
sequence.
To do this, we can use BPTT to backpropagate errors through time and update the weights of the RNN.
Here’s how the BPTT algorithm might work:
1. Initialize the weights of the RNN randomly.
2. Feed the first input x1 into the RNN and compute the output y1.
3. Compute the loss between the predicted output y1 and the actual output x2.
4. Backpropagate the error through the network using the chain rule, updating the weights at each time
step.
5. Feed the second input x2 into the RNN and compute the output y2.
6. Compute the loss between the predicted output y2 and the actual output x3.
7. Backpropagate the error through the network again, updating the weights at each time step.
8. Repeat steps 5–7 for the entire sequence of inputs {x1, x2, x3, …, xn}.
9. Test the RNN on a separate validation set and adjust the hyperparameters as necessary.
During training, the weights of the RNN are updated based on the gradients computed by backpropagating the
errors through time. This allows the RNN to learn the temporal dependencies in the data and make accurate
predictions for the next value in the sequence.
Overall, BPTT is a powerful technique for training RNNs to model sequential data, and it has been successfully
applied to a wide range of applications in various fields.
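
The steps above can be sketched in code. The following hedged PyTorch example trains a small RNN for next-value prediction on a toy sine-wave series using truncated BPTT (processing the sequence in fixed-length chunks and detaching the hidden state between chunks, rather than a strictly per-step update); all sizes, the chunk length, and the learning rate are arbitrary choices.

import torch

rnn = torch.nn.RNN(input_size=1, hidden_size=16, batch_first=True)
head = torch.nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-2)

series = torch.sin(torch.linspace(0, 20, 401))   # toy time series x1 ... xn
xs = series[:-1].reshape(1, -1, 1)               # inputs  x_t
ys = series[1:].reshape(1, -1, 1)                # targets x_{t+1}

chunk, h = 50, None
for start in range(0, xs.shape[1], chunk):
    x_c, y_c = xs[:, start:start + chunk], ys[:, start:start + chunk]
    out, h = rnn(x_c, h)                          # forward pass through this chunk
    loss = torch.nn.functional.mse_loss(head(out), y_c)
    opt.zero_grad()
    loss.backward()                               # backpropagate through time (this chunk only)
    opt.step()
    h = h.detach()                                # truncate: stop gradients at the chunk boundary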


Limitations of BPTT:
While backpropagation through time (BPTT) is a powerful technique for training recurrent neural networks
(RNNs), it has some limitations:
1. Computational complexity: BPTT requires computing the gradient at each time step, which can be
computationally expensive for long sequences. This can lead to slow training times and may require
specialized hardware to train large-scale models.
2. Vanishing gradients: BPTT is prone to the problem of vanishing gradients, where the gradients
become very small as they propagate back in time. This can make it difficult to learn long-term
dependencies, which are important for many sequential data modeling tasks.
3. Exploding gradients: On the other hand, BPTT is also prone to the problem of exploding gradients,
where the gradients become very large as they propagate back in time. This can lead to unstable training
and can cause the weights of the network to become unbounded, resulting in NaN values.
4. Memory limitations: BPTT requires storing the activations of each time step, which can be memory-
intensive for long sequences. This can limit the size of the sequence that can be processed by the
network.
5. Difficulty in parallelization: BPTT is inherently sequential, which makes it difficult to parallelize
across multiple GPUs or machines. This can limit the scalability of the training process.

TOPIC-6: Long Short Term Memory (LSTM) Model in RNNs:


LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture widely used in Deep
Learning. It excels at capturing long-term dependencies, making it ideal for sequence prediction tasks.
Unlike traditional neural networks, LSTM incorporates feedback connections, allowing it to process
entire sequences of data, not just individual data points. This makes it highly effective in understanding
and predicting patterns in sequential data like time series, text, and speech.
LSTM Architecture
At a high level, an LSTM cell works very much like an RNN cell, but with richer internal functioning. The LSTM
network architecture consists of three parts, as shown in the image below, and each part performs an individual
function.


The Logic Behind LSTM:


The first part chooses whether the information coming from the previous timestamp is to be remembered
or is irrelevant and can be forgotten. In the second part, the cell tries to learn new information from the
input to this cell. At last, in the third part, the cell passes the updated information from the current
timestamp to the next timestamp. This one cycle of LSTM is considered a single-time step.
These three parts of an LSTM unit are known as gates. They control the flow of information into and out of the
memory cell (also called the LSTM cell). The first gate is the Forget gate, the second gate is the Input gate,
and the last one is the Output gate. An LSTM unit consisting of these three gates and a memory cell can be
viewed much like a layer of neurons in a traditional feedforward neural network, with each unit carrying both a
hidden state and a cell state.

Just like a simple RNN, an LSTM also has a hidden state where H(t-1) represents the hidden state of the
previous timestamp and Ht is the hidden state of the current timestamp. In addition to that, LSTM also
has a cell state represented by C(t-1) and C(t) for the previous and current timestamps, respectively.
Here the hidden state is known as Short term memory, and the cell state is known as Long term memory.
Refer to the following image.

It is interesting to note that the cell state carries the information along with all the timestamps.


Example of LSTM Working


Let’s take an example to understand how an LSTM works. Here we have two sentences separated by a full
stop. The first sentence is “Bob is a nice person,” and the second sentence is “Dan, on the other hand, is
evil.” It is very clear that in the first sentence we are talking about Bob, and as soon as we encounter the
full stop (.), we start talking about Dan.
As we move from the first sentence to the second, the network should realize that we are no longer talking
about Bob; the subject is now Dan. The Forget gate of the network allows it to forget the earlier subject.
Let’s understand the roles played by these gates in the LSTM architecture.
Forget Gate:
In a cell of the LSTM network, the first step is to decide whether we should keep the information from the
previous time step or forget it. Here is the equation for the forget gate:
ft = σ(Xt·Uf + Ht-1·Wf)

Let’s try to understand the equation, here


 Xt: input to the current timestamp.
 Uf: weight associated with the input
 Ht-1: The hidden state of the previous timestamp
 Wf: It is the weight matrix associated with the hidden state
A sigmoid function is then applied, which makes ft a number between 0 and 1. This ft is multiplied
element-wise with the cell state of the previous timestamp, as shown below:
Ct-1 × ft   (if ft = 0, everything from the previous cell state is forgotten; if ft = 1, nothing is forgotten)


Input Gate:
Let’s take another example.
“Bob knows swimming. He told me over the phone that he had served the navy for four long years.”
So, in both these sentences, we are talking about Bob. However, both give different kinds of information
about Bob. In the first sentence, we get the information that he knows swimming. Whereas the second
sentence tells, he uses the phone and served in the navy for four years.
Now just think about it, based on the context given in the first sentence, which information in the second
sentence is critical? First, he used the phone to tell, or he served in the navy. In this context, it doesn’t
matter whether he used the phone or any other medium of communication to pass on the information.
The fact that he was in the navy is important information, and this is something we want our model to
remember for future computation. This is the task of the Input gate.
The input gate is used to quantify the importance of the new information carried by the input. Here is the
equation of the input gate:
it = σ(Xt·Ui + Ht-1·Wi)

Here,
 Xt: Input at the current timestamp t
 Ui: weight matrix of input
 Ht-1: A hidden state at the previous timestamp
 Wi: Weight matrix of input associated with hidden state
Again we apply the sigmoid function over it. As a result, the value of it at timestamp t will be
between 0 and 1.
New Information

The new information that needs to be passed to the cell state is a function of the hidden state at the
previous timestamp t-1 and the input x at timestamp t. The activation function here is tanh:
Nt = tanh(Xt·Uc + Ht-1·Wc)
Due to the tanh function, the value of the new information will be between -1 and 1. If the value of Nt is
negative, the information is subtracted from the cell state, and if the value is positive, the information is
added to the cell state at the current timestamp.


However, Nt won’t be added directly to the cell state. Here is the updated cell-state equation:
Ct = ft × Ct-1 + it × Nt
Here, Ct-1 is the cell state at the previous timestamp, and the other terms are the values we calculated
previously.
Output Gate:
Now consider this sentence.
“Bob single-handedly fought the enemy and died for his country. For his contributions, brave______.”
During this task, we have to complete the second sentence. Now, the minute we see the word brave, we
know that we are talking about a person. In the sentence, only Bob is brave, we can not say the enemy is
brave, or the country is brave. So based on the current expectation, we have to give a relevant word to
fill in the blank. That word is our output, and this is the function of our Output gate.
Here is the equation of the Output gate, which is pretty similar to the two previous gates:
Ot = σ(Xt·Uo + Ht-1·Wo)
Its value will also lie between 0 and 1 because of the sigmoid function. To calculate the current hidden state,
we use Ot and the tanh of the updated cell state, as shown below:
Ht = Ot × tanh(Ct)

It turns out that the hidden state is a function of the long-term memory (Ct) and the output gate (Ot). If you
need the output of the current timestamp, just apply the Softmax activation to the hidden state Ht:
Output = Softmax(Ht)
The token with the maximum score in the output is the prediction.
This is a more intuitive diagram of the LSTM network.
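
To summarise the full cell computation, here is a minimal NumPy sketch of one LSTM time step. The weight names Uf, Wf, Ui, Wi, Uc, Wc, Uo, Wo mirror the notation used above, while the shapes and the omission of bias terms are illustrative simplifications, not part of these notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations above (biases omitted for brevity)."""
    f_t = sigmoid(p["Uf"] @ x_t + p["Wf"] @ h_prev)     # forget gate
    i_t = sigmoid(p["Ui"] @ x_t + p["Wi"] @ h_prev)     # input gate
    n_t = np.tanh(p["Uc"] @ x_t + p["Wc"] @ h_prev)     # new candidate information Nt
    c_t = f_t * c_prev + i_t * n_t                      # updated cell state (long-term memory)
    o_t = sigmoid(p["Uo"] @ x_t + p["Wo"] @ h_prev)     # output gate
    h_t = o_t * np.tanh(c_t)                            # updated hidden state (short-term memory)
    return h_t, c_t

# Tiny usage with random weights (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
d_x, d_h = 3, 4
p = {k: rng.normal(size=(d_h, d_x if k.startswith("U") else d_h))
     for k in ["Uf", "Wf", "Ui", "Wi", "Uc", "Wc", "Uo", "Wo"]}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_x), h, c, p)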

Prepared by: Ch. Santhi, Asst Professor-CSE NRI Institute of Technology (Autonomous)
