Advanced Natural Language Processing
In the previous chapter, we covered the basics of natural language processing (NLP). We
covered simple representations of text in the form of the bag-of-words model, and more
advanced word embedding representations that capture the semantic properties of the text.
This chapter aims to build upon word representation techniques by taking a more model-
centric approach to text processing. We will go over some of the core models, such
as recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks.
We will specifically answer the following questions:
What are some core deep learning models for understanding text?
What core concepts form the basis for understanding RNNs?
What core concepts form the basis for understanding LSTMs?
How do you implement basic functionality of an LSTM using TensorFlow?
What are some of the most popular text processing applications of an
RNN/LSTM?
Before we dive into deep learning models for text, let's revisit neural networks for a
moment and understand why they are not well suited to some of the important text
mining applications.
Fixed-sized inputs: A neural network architecture has a fixed number of input
and output nodes. As such, it can only take a fixed-sized input and produce a
fixed-sized output for any task. This is a limiting factor for many pattern
recognition tasks. For example, imagine an image captioning task, where the goal
of the network is to take an image and generate words as a caption. A typical
neural network cannot model this task, because the number of words in the
caption will differ from image to image. Given a fixed output size, it is not
possible for a neural network to efficiently model this task. Another example is
the task of sentiment classification. In this task, a network should take a sentence
as its input and output a single label (for example, positive or negative). Since a
sentence has a varying number of words, the input for this task is of variable size.
Hence, a typical neural network cannot model this task either. This type of task is
often referred to as a sequence classification task.
Lack of memory: Another limitation of neural networks is their lack of memory.
For example, in the task of sentiment classification, it is important to remember
the sequence of words to classify the sentiment of the whole sentence. For a
neural network, each input unit is processed independently of the others. As
such, the next word token in the sentence has no connection to any previous
word token in the sentence, which makes the task of classifying the sentence
extremely difficult. A good model that can perform well at such tasks needs to
maintain context, or memory:
Figure: Fixed sized inputs of neural networks (Source: cs231n.github.io course notes, neural_net.jpeg)
To address these limitations, an RNN is used. This class of deep learning model is our
primary focus in this chapter.
Figure: Neural network horizontally rolled up

Figure: Neural network vertically rolled up
The figure Neural network vertically rolled up is a simple RNN representation, which is a one-
to-one RNN; one input is mapped to one output using one hidden layer.
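In equation form, a simple RNN cell computes its hidden state and output at time step t as follows (standard notation, with the W matrices denoting learned weights):

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)
y_t = W_{hy} h_t

The same weights are reused at every time step, which is what allows the network to process sequences of arbitrary length.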
RNN architectures
RNNs come in many different architectures. In this section, we will go over some
basic architectures of RNNs and discuss how they fit different text mining
applications:
Figure: RNN one-to-many architecture

Figure: RNN many-to-one architecture
Figure: RNN many-to-many architecture
Using this model, we can generate one output vector at each time step. Hence, such
models are widely applicable to a number of time-series or other sequence-based
problems.
Figure: Basic RNN model

Figure: Vanishing gradient problem in an RNN with multiplicative gradients
To understand this concept in more detail, let's take a look at the figure Basic RNN model. It
shows a single layer of hidden neurons across three time steps. To backpropagate the
gradient across these time steps, we need to compute the derivative of a composite
function as follows:
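Written out in standard backpropagation-through-time notation (the symbols are generic rather than taken from the figure), with E_3 the loss at the third time step, h_t the hidden state, and W the recurrent weight matrix:

\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial h_3} \left( \prod_{j=k+1}^{3} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W}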
As you can imagine, each of these per-step gradient factors is typically smaller than one,
and multiplying many such values together diminishes the overall gradient when it is
computed across a large number of time steps. Hence, RNNs cannot be trained in an
efficient manner over longer time spans using this approach. One way to solve this
problem is to use gating logic, as shown in the figure Solution to vanishing gradient problem
with additive gradients:
Figure: Solution to vanishing gradient problem with additive gradients
In this logic, instead of multiplying the gradients together, we add them as follows:
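A simplified sketch of why addition helps: suppose the gated update takes the additive form h_t = h_{t-1} + F(h_{t-1}, x_t), which is a simplification of the actual gating logic. The gradient between distant time steps then becomes:

\frac{\partial h_T}{\partial h_k} = \prod_{j=k+1}^{T} \left( I + \frac{\partial F}{\partial h_{j-1}} \right) = I + \sum_{j=k+1}^{T} \frac{\partial F}{\partial h_{j-1}} + \cdots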
As can be seen from the preceding equation, the overall gradient can be computed as a sum
of smaller gradients, which does not diminish even when passed through longer time steps.
This addition is achieved by the gating logic, which adds the output of the hidden layer
to the original input at every time step, thereby reducing the impact of diminishing
gradients. This gating architecture forms the basis of a new type of RNN known as the
Long Short-Term Memory network, or LSTM. LSTMs are the most popular way to train
RNNs on long sequences of temporal data and have been shown to perform reasonably
well on a wide variety of text mining tasks.
An LSTM cell supports three basic operations on its memory: write to memory, read from memory, and reset memory.

Figure: LSTM: Core idea (Source: https://ayearofai.com, Rohan and Lenny: Recurrent Neural Networks)
The figure LSTM: Core idea illustrates this. As shown in the figure, the value of the
previous LSTM cell state is first passed through a reset gate, which multiplies the previous
state value by a scalar between 0 and 1. If the scalar is close to 1, the value of the previous
cell state is passed on (remembering the previous state). If it is closer to 0, the value of the
previous cell state is blocked (forgetting the previous state). Next, the write gate writes the
transformed output of the reset gate into memory. Finally, the read gate reads a view of
this output from the write gate:
Figure: LSTM: Gating functions in the cell (Source: https://ayearofai.com, Rohan and Lenny: Recurrent Neural Networks)
Understanding the LSTM's gating logic can be quite complex. To describe it more
succinctly, let us take a detailed look into the cell architecture of an LSTM without any
inputs, as shown in the figure LSTM: Basic cell architecture. The gating functions now have
well-defined labels. We can describe them as follows:
1. f gate: Often referred to as the forget gate, this gate applies a sigmoid function to
the cell value from the previous time step. Since a sigmoid function outputs a
value between 0 and 1, this gate amounts to forgetting a portion of the previous
cell state value based on the activation of the sigmoid function.
2. g gate: The primary function of this gate is to regulate the additive factor to the
previous cell state value. In other words, a value controlled by the g gate is
added to the output of the f gate. Typically, a tanh function with outputs between
-1 and 1 is applied in this case. As such, this gate often acts like a counter that
increments or decrements the cell state.
3. i gate: While the g gate regulates the additive value itself, the i gate is a sigmoid
function between 0 and 1 that dictates what portion of the g gate output is
actually added to the output of the f gate.
4. o gate: Also known as the output gate, this gate uses a sigmoid function to
generate a scaled output, which is then passed to the hidden state of the current
time step:
Figure: LSTM: Basic cell architecture (Source: https://ayearofai.com, Rohan and Lenny: Recurrent Neural Networks)
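Putting the four gates together, a standard formulation of the LSTM cell update is the following (W, U, and b are per-gate weights and biases, \sigma is the sigmoid function, and \odot denotes element-wise multiplication; the notation is generic rather than taken from the figure):

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)

Note that the cell state c_t is updated additively, which is exactly the additive gradient path discussed earlier.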
Figure LSTM: Gating functions in the cell shows the full LSTM cell with both inputs and
hidden states from previous and current time steps. As shown here, each of the preceding
four gates receives inputs from both the hidden state from the previous time step as well as
input from the current time step. The output of the cell is passed to the hidden state of the
current time step, as well as carried forward to the next LSTM cell. Figure End-to-end LSTM
network describes this connection visually.
As shown here, each LSTM cell acts as a separate unit between the input neurons and
hidden neurons across all time steps. Each of these cells is connected across time steps
using a two-channel communication mechanism that shares both the LSTM cell output
and the hidden neuron activations across different time steps:
Figure: End-to-end LSTM network (Source: https://ayearofai.com, Rohan and Lenny: Recurrent Neural Networks)
For example, the word Book is input at time step t and is fed to the hidden state h_t:
Figure: Sentiment analysis using LSTM
To implement this model in TensorFlow, we need to first define a few variables as follows:
# The specific values below are illustrative placeholders
batch_size = 32
lstm_units = 64
num_classes = 2               # for example, positive and negative sentiment
max_sequence_length = 50
embedding_dimension = 100
num_iterations = 10000
As shown previously, batch_size dictates how many sequences of tokens we can input in
one batch for training. lstm_units represents the total number of LSTM cells in the
network. max_sequence_length represents the maximum possible length of a given
sequence. Once defined, we now proceed to initialize TensorFlow-specific data structures
for input data as follows:
import tensorflow as tf

labels = tf.placeholder(tf.float32, [batch_size, num_classes])
raw_data = tf.placeholder(tf.int32, [batch_size, max_sequence_length])
Given we are working with word tokens, we would like to represent them using a good
feature representation technique. We propose using the word embedding techniques from
the previous chapter, NLP - Vector Representation, for this task. Let us assume the word
embedding representation takes a word token and projects it onto an embedding space of
dimension embedding_dimension. The two-dimensional input data containing raw word
tokens is now transformed into a three-dimensional word tensor, with the added dimension
representing the word embedding. We also use pre-computed word embeddings, stored in a
word_vectors data structure. We initialize the data structures as follows:
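A minimal sketch of this lookup, assuming word_vectors already holds the pre-computed embedding matrix and raw_data holds the token IDs:

# Look up an embedding for every token ID in raw_data; the result has shape
# [batch_size, max_sequence_length, embedding_dimension]
data = tf.nn.embedding_lookup(word_vectors, raw_data)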
Now that the input data is ready, we can define the LSTM model. As shown previously,
we create a basic LSTM cell with lstm_units units. To regularize the network, we wrap
the LSTM cell with a dropout wrapper. To perform a full temporal pass of the data over the
defined network, we unroll the LSTM using the dynamic_rnn routine of TensorFlow. We
also initialize a random weight matrix and a small constant value as the bias vector, as
follows:
weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes]))
bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))  # 0.1 is an illustrative constant
lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
wrapped_lstm_cell = tf.contrib.rnn.DropoutWrapper(cell=lstm_cell,
                                                  output_keep_prob=0.75)  # illustrative keep probability
output, state = tf.nn.dynamic_rnn(wrapped_lstm_cell, data,
                                  dtype=tf.float32)
Once the output is generated by the dynamically unrolled RNN, we transpose it, gather the
output of the last time step, multiply it by the weight matrix, and add the bias vector to it to
compute the final prediction value:
output = tf.transpose(output, [1, 0, 2])
last = tf.gather(output, int(output.get_shape()[0]) - 1)
weight = tf.cast(weight, tf.float32)
last = tf.cast(last, tf.float32)
bias = tf.cast(bias, tf.float32)
prediction = (tf.matmul(last, weight) + bias)
Since the initial predictions need to be refined through training, we define a cross-entropy
objective function to minimize the loss as follows:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)
After this sequence of steps, we have defined an end-to-end LSTM network for sentiment
classification of arbitrary-length sentences, ready to be trained.
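To actually train the network, we run the optimizer inside a TensorFlow session. A minimal training-loop sketch is shown below; next_training_batch is a hypothetical helper that returns a batch of token IDs and their one-hot sentiment labels:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_iterations):
        # next_training_batch (assumed) returns arrays shaped
        # [batch_size, max_sequence_length] and [batch_size, num_classes]
        batch_tokens, batch_labels = next_training_batch(batch_size)
        sess.run(optimizer, feed_dict={raw_data: batch_tokens,
                                       labels: batch_labels})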
Applications
Today, RNNs (for example, LSTMs) are used in a variety of different applications, ranging
from time-series data modeling and image classification to video captioning and textual
analysis. In this section, we will cover some important applications of RNNs for solving
different natural language understanding problems.
Language modeling
Language modeling is one of the fundamental problems in natural language
understanding (NLU). The core idea of a language model is to model important
distributional properties of the words in a given language. Once such a model is learnt, it
can be applied to a sequence of new words to generate the most likely next word token
given the learned distributional representation. More formally, a language model computes
a joint probability over a sequence of words as follows:
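For a sequence of words w_1, w_2, ..., w_m, the chain rule of probability gives:

P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})

Estimating the full conditional probability for every prefix is difficult in practice, so simplified models are commonly used: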
Unigram model: It assumes that each word token is independent of the sequence
of words before and after it.
Bigram model: It assumes that each word token depends only on the word token
immediately before it, as shown below.
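Under these assumptions, the joint probability is approximated as:

Unigram: P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i)

Bigram: P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})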
We can solve the language model estimation problem efficiently through the use of an
LSTM-based network. The following figure illustrates a specific architecture that estimates a
three-gram language model. As shown here, we take a many-to-many LSTM and chunk the
whole sentence into a running window of three-word tokens each. For example, let us
assume a training sentence is: [What, is, the, problem]. The first input sequence is: [What,
is, the] and the output is [is, the, problem]:
Figure: Language modeling with traditional LSTM
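As a plain-Python illustration of this running window (the variable names are illustrative, not part of the TensorFlow model):

# Build (input, target) pairs for a window of three tokens, where the target
# is the input shifted forward by one position.
sentence = ["What", "is", "the", "problem"]
window = 3
pairs = [(sentence[i:i + window], sentence[i + 1:i + 1 + window])
         for i in range(len(sentence) - window)]
# pairs == [(["What", "is", "the"], ["is", "the", "problem"])]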
Sequence tagging
Sequence tagging can be understood as a problem where the model sees a sequence of
words or tokens and, for each word in the sequence, is expected to emit a label. In other
words, the model is expected to tag the whole sequence of tokens with appropriate labels
from a known label dictionary. Sequence tagging has some very interesting applications in
natural language understanding, such as named entity recognition, part-of-speech tagging,
and so on:
Figure: Sequence tagging with traditional LSTM
As shown in the preceding figure, Sequence tagging with traditional LSTM, one can model
this problem with a simple LSTM. One important point to note from this figure is that the
LSTM can only make use of the previous context of the data. For example, at the hidden
state at time t, the LSTM cell sees the input from time t and the output of the hidden state
from time t-1. There is no way to make use of any future context from times greater than t
in this architecture. This is a strong limitation of traditional LSTM models for the task of
sequence tagging.
To address this issue, the bi-directional LSTM, or B-LSTM, was proposed. The core idea of
a bi-directional LSTM is to have two LSTM layers, one running in the forward direction
and another in the backward direction. With this design change, you can now combine
information from both directions to get the previous context (forward LSTM) and the
future context (backward LSTM).
The figure Sequence tagging with bi-directional LSTM shows this design in more detail.
B-LSTMs are one of the most popular LSTM variants used for sequence tagging tasks today:
Figure: Sequence tagging with bi-directional LSTM
To implement a B-LSTM in TensorFlow, we define two LSTM layers, one for the forward
and one for the backward direction, as follows:
lstm_cell_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_cell_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
In the previous implementation, we unrolled the LSTM using a dynamic RNN. TensorFlow
provides a similar routine for bidirectional LSTMs, which we can use as follows:
(output_fw, output_bw), state = tf.nn.bidirectional_dynamic_rnn(lstm_cell_fw,
                                                                lstm_cell_bw,
                                                                data,
                                                                dtype=tf.float32)
# Concatenate the forward and backward outputs along the feature dimension
context_rep = tf.concat([output_fw, output_bw], axis=-1)
context_rep_flat = tf.reshape(context_rep, [-1, 2 * lstm_units])
Now, we initialize weights and bias like before (note, weight has twice the number of
lstm_units as before, one for each directional layer of LSTM):
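A sketch of that initialization, mirroring the earlier definitions (the 0.1 bias constant is illustrative):

weight = tf.Variable(tf.truncated_normal([2 * lstm_units, num_classes]))
bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))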
Now, you can generate predictions based on the current value of weights and compute a
loss value. In the previous example, we computed a cross entropy loss. For sequence
labeling, it is often useful to have a conditional random field (CRF)-based loss function.
You can define these loss functions as follows:
prediction = tf.matmul(context_rep_flat, weight) + bias
scores = tf.reshape(prediction, [-1, max_sequence_length, num_classes])
# sequence_lengths holds the true length of each sequence in the batch
log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    scores, labels, sequence_lengths)
loss_crf = tf.reduce_mean(-log_likelihood)
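As with the earlier cross-entropy loss, this CRF loss can be minimized directly, and transition_params can later be passed to tf.contrib.crf.crf_decode to recover the best tag sequence at prediction time. A minimal sketch:

optimizer_crf = tf.train.AdamOptimizer().minimize(loss_crf)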
Machine translation
Machine translation is one of the most recent success stories of NLU. The goal of this
problem is to take a text sentence in a source language, such as English, and convert it into
the corresponding sentence in a given target language, such as Spanish. Traditional
methods of solving this problem relied on phrase-based models. These models typically
chunk a sentence into shorter phrases and translate each of these phrases, one by one, into
a phrase of the target language.
Though translation at the phrase level works reasonably well, when you combine these
translated phrases into a fully translated sentence in the target language, you find
occasional choppiness, or disfluency. To avoid this limitation of phrase-based machine
translation models, the neural machine translation (NMT) technique was proposed, which
utilizes a variant of the RNN to solve this problem:
Figure: Neural machine translation core idea (Source: https://github.com/tensorflow/nmt)
The core idea of NMT is described in the figure Neural machine translation core idea. It
consists of two parts: (a) an encoder and (b) a decoder.
The role of the encoder is to take a sentence in the source language and convert it into a
vector representation (also known as a thought vector) that captures the overall semantics
and meaning of the sentence. This vector representation is then fed to a decoder, which
decodes it into a target language sentence. As you can see, this problem is a natural fit for
a many-to-many RNN architecture. In the previous application example of sequence
labeling, we introduced the B-LSTM, which maps each token of an input sequence to an
output label; however, even a B-LSTM cannot map an input sequence to an output
sequence of a different length. Hence, to solve this problem with an RNN architecture, we
introduce another variant of RNN known as the Seq2Seq model.
Figure Neural machine translation architecture with Seq2Seq model describes the core
architecture of a Seq2Seq model applied for the task of NMT:
Figure: Neural machine translation architecture with Seq2Seq model (Source: https://github.com/tensorflow/nmt)
A Seq2Seq model, as shown in the figure Neural machine translation architecture with Seq2Seq
model, is essentially composed of two groups of RNNs: encoders and decoders. Each of
these RNNs may be uni-directional or bi-directional, may consist of multiple hidden
layers, and may use either LSTM or GRU as its basic cell unit type. As shown, the encoder
RNN (on the left) takes the source words as its inputs and projects them onto two hidden
layers. When moving across time steps, these hidden layers feed the decoder RNN (on the
right), whose output is projected onto a projection and loss layer to generate the most
likely candidate words in the target language.
To implement Seq2Seq in TensorFlow, we define a simple LSTM cell for both encoding and
decoding and unroll the encoder with a dynamic RNN module as follows:
lstm_cell_encoder = tf.nn.rnn_cell.BasicLSTMCell(lstm_units)
lstm_cell_decoder = tf.nn.rnn_cell.BasicLSTMCell(lstm_units)

encoder_outputs, encoder_state = tf.nn.dynamic_rnn(lstm_cell_encoder, encoder_data,
                                                   sequence_length=max_sequence_length,
                                                   time_major=True)
In case you want to use bi-directional LSTM for this step, you can do that as follows:
lstm_cell_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_cell_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)

bi_outputs, encoder_state = tf.nn.bidirectional_dynamic_rnn(lstm_cell_fw,
                                                            lstm_cell_bw,
                                                            encoder_data,
                                                            sequence_length=max_sequence_length,
                                                            time_major=True)
encoder_outputs = tf.concat(bi_outputs, -1)
Now we need to perform the decoding step to generate the most likely candidate words
(hypotheses) in the target language. For this step, TensorFlow provides a dynamic_decode
function under the Seq2Seq module. We use it as follows:
# decoder_data holds the embedded target-side inputs, decoder_lengths their true
# lengths, and target_vocabulary_size the number of words in the target vocabulary
decoder_cell = lstm_cell_decoder
training_helper = tf.contrib.seq2seq.TrainingHelper(decoder_data, decoder_lengths,
                                                    time_major=True)
projection_layer = tf.layers.Dense(target_vocabulary_size)
decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, training_helper,
                                          encoder_state,
                                          output_layer=projection_layer)
outputs, state = tf.contrib.seq2seq.dynamic_decode(decoder)
logits = outputs.rnn_output
Lastly, we define a loss function and train the model to minimize the loss:
loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=logits,
                                                             labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)
Seq2Seq inference
During the inference phase, a trained Seq2Seq model gets a source sentence. It uses this to
obtain an encoder_state, which is used to initialize the decoder. The translation process
starts as soon as the decoder receives a special symbol, <s>, denoting the start of the
decoding process.
The decoder RNN now runs for the current time step and computes the probability
distribution over all the words in the target vocabulary, as defined by the
projection_layer. It then employs a greedy strategy, choosing the most likely word from
this distribution and feeding it as the target input word at the next time step. This process
is repeated, one time step at a time, until the decoder RNN chooses a special symbol, </s>,
which marks the end of the translation. The figure Neural machine translation decoding with
greedy search over Seq2Seq model illustrates this greedy search technique with an example:
Figure: Neural machine translation decoding with greedy search over Seq2Seq model (Source: https://github.com/tensorflow/nmt)
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    decoder_data,                           # target-side embedding matrix (or lookup callable)
    tf.fill([batch_size], start_token_id),  # vocabulary ID of the start symbol <s>
    end_token_id)                           # vocabulary ID of the end symbol </s>
decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, encoder_state,
                                          output_layer=projection_layer)
# Dynamic decoding; the factor of 2 is a common heuristic bound on output length
num_iterations = tf.to_int32(tf.round(tf.reduce_max(max_sequence_length) * 2))
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder,
                                               maximum_iterations=num_iterations)
translations = outputs.sample_id
Chatbots
Chatbots are another example of an application that lends itself very well to RNN models.
The figure Chatbot with Seq2Seq LSTM shows an example of a chatbot application built with
the Seq2Seq model, which was described in the preceding section. Chatbots can be
understood as a special case of machine translation, where the target language is replaced
with a vocabulary of responses for each possible question in the chatbot's knowledge base:
Figure: Chatbot with Seq2Seq LSTM
Summary
In this chapter, we introduced some core deep learning models for understanding text. We
described the core concepts behind sequential modeling of textual data, and what network
architectures are more suited to this type of data processing. We introduced basic concepts
of recurrent neural networks (RNNs) and showed why they are difficult to train in practice.
We described LSTMs as a practical form of RNN and sketched their implementation using
TensorFlow. Finally, we covered a number of natural language understanding applications
that can benefit from the application of various RNN architectures.
In the next chapter, Chapter 7, we will look at how deep learning techniques can be applied
to tasks involving both NLP and images.