CS224n: Natural Language Processing With Deep Learning
Lecture Notes: Part V
Language Models, RNN, GRU and LSTM
Winter 2019

Course Instructors: Christopher Manning, Richard Socher
Authors: Milad Mohammadi, Rohit Mundra, Richard Socher, Lisa Wang, Amita Kamath
1 Language Models
1.1 Introduction
Language models compute the probability of occurrence of a number of words in a particular sequence. The probability of a sequence of m words {w_1, ..., w_m} is denoted as P(w_1, ..., w_m). Since the number of words coming before a word, w_i, varies depending on its location in the input document, P(w_1, ..., w_m) is usually conditioned on a window of n previous words rather than all previous words:
P(w_1, ..., w_m) = \prod_{i=1}^{m} P(w_i | w_1, ..., w_{i-1}) \approx \prod_{i=1}^{m} P(w_i | w_{i-n}, ..., w_{i-1})   (1)
These probabilities can be estimated from corpus counts, by comparing the frequency of each n-gram to the frequency of its (n−1)-gram prefix:

p(w_2 | w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)}   (2)

p(w_3 | w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}   (3)
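As an illustrative sketch, not part of the original notes, the count-based estimate in Equation 2 can be computed directly from corpus counts; the toy corpus and the helper name bigram_prob below are hypothetical, and the same pattern extends to the trigram estimate in Equation 3.

from collections import Counter

# A toy corpus; in practice these counts come from a large text collection.
corpus = "the students opened their books because the students opened their exam".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    # Equation 2: p(w2 | w1) = count(w1, w2) / count(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("students", "opened"))  # 1.0 in this toy corpus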
The relationship in Equation 3 makes predictions based on a fixed window of context (i.e. the n previous words) to predict the next word. But how long should the context be? In some cases, a window of the n previous consecutive words may not be sufficient to capture the context. For instance, consider the sentence "As the proctor started the clock, the students opened their ___". If the window only conditions on the previous three words "the students opened their", the probabilities calculated from the corpus may suggest that the next word is "books"; however, if n had been large enough to include the "proctor" context, the probability might have suggested "exam".
This leads us to two main issues with n-gram Language Models:
Sparsity and Storage.
What is interesting here is that the same weights W^(hh) and W^(hx) are applied repeatedly at each timestep. Thus, the number of parameters the model has to learn is smaller and, most importantly, is independent of the length of the input sequence, thus defeating the curse of dimensionality!
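A minimal sketch of this weight sharing, assuming the sigmoid non-linearity and a bias term (the dimensions and the function name rnn_forward are hypothetical, not from the notes):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_forward(xs, W_hh, W_hx, b):
    # The same W_hh (= W^(hh)) and W_hx (= W^(hx)) are applied at every
    # time-step, so the parameter count is independent of sequence length.
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:                      # xs: word vectors x_1, ..., x_T
        h = sigmoid(W_hh @ h + W_hx @ x_t + b)
        states.append(h)
    return states

# Hypothetical dimensions: 4-dimensional word vectors, 3-dimensional hidden state
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(10)]
states = rnn_forward(xs, rng.normal(size=(3, 3)), rng.normal(size=(3, 4)), np.zeros(3))
print(len(states), states[0].shape)  # 10 (3,)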
Below are the details associated with each parameter in the network:
• x_1, ..., x_{t−1}, x_t, x_{t+1}, ..., x_T: the word vectors corresponding to a corpus with T words.

Figure 4: The inputs and outputs to a neuron of a RNN.
J = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \times \log(\hat{y}_{t,j})   (8)

\text{Perplexity} = 2^{J}   (9)
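The relationship between the loss in Equation 8 and the perplexity in Equation 9 can be sketched in a few lines; the predicted distributions below are hypothetical, and base-2 logarithms are assumed so that perplexity is exactly 2^J.

import numpy as np

# Hypothetical predicted distributions (T timesteps x |V| vocabulary items)
# and one-hot targets; in practice these come from the RNN's softmax outputs.
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([[1, 0, 0],
              [0, 1, 0]])

# Equation 8: mean cross-entropy over timesteps (log base 2 assumed here
# so that Equation 9's perplexity is 2**J).
J = -np.mean(np.sum(y * np.log2(y_hat), axis=1))

# Equation 9: perplexity
perplexity = 2 ** J
print(J, perplexity)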
2. The model size does not increase for longer input sequence
lengths
3. Computation for step t can (in theory) use information from many
steps back.
machine translation, and many other tasks). In the latter two applica-
tions, we want a representation for the sentence, which we can obtain
by taking the element-wise max or mean of all hidden states of the
timesteps in that sentence.
Note: Figure 6 is an alternative representation of RNNs used in
some publications. It represents the RNN hidden layer as a loop.
Sentence 1
"Jane walked into the room. John walked in too. Jane said hi to ___"
Sentence 2
"Jane walked into the room. John walked in too. It was late in the
day, and everyone was walking home after a long day at work. Jane
said hi to ___"
In both sentences, given their context, one can tell that the answer to both blanks is most likely "John". It is important that the RNN predicts the next word as "John", the second person, who appeared several time-steps back in both contexts. Ideally, this should be possible given what we know about RNNs so far. In practice, however, it turns out that RNNs are more likely to correctly predict the blank in Sentence 1 than in Sentence 2. This is because during the back-propagation phase, the contribution of gradient values gradually vanishes as they propagate to earlier timesteps, as we will show below. Thus, for long sentences, the probability that "John" would be recognized as the next word decreases as the size of the context grows. Below, we discuss the mathematical reasoning behind the vanishing gradient problem.
Consider Equations 5 and 6 at a time-step t; to compute the RNN error, ∂E/∂W, we sum the error at each time-step. That is, ∂E_t/∂W for every time-step, t, is computed and accumulated.
\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}   (10)
\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \leq \left\| W^{T} \right\| \left\| \text{diag}[f'(h_{j-1})] \right\| \leq \beta_W \beta_h   (15)
The norm of both matrices is computed by taking their L2 operator norm. The norm of diag[f'(h_{j−1})] is at most 1 for the tanh non-linearity (for the sigmoid, the derivative, and hence this norm, is at most 0.25).
\left\| \frac{\partial h_t}{\partial h_k} \right\| = \left\| \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right\| \leq (\beta_W \beta_h)^{t-k}   (16)
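To make the bound concrete with hypothetical numbers (not from the original notes): if \beta_W \beta_h = 0.9 and the relevant word is t - k = 50 steps back, Equation 16 gives

(\beta_W \beta_h)^{t-k} = 0.9^{50} \approx 0.005,

so the gradient contribution from that step is negligible; conversely, if \beta_W \beta_h = 1.1, the same product is roughly 117, illustrating the exploding-gradient case.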
descent step. When the gradient descent model hits the high error wall in the objective function, the gradient is pushed off to a far-away location on the decision surface. The clipping model produces the dashed line where it instead pulls back the error gradient to somewhere close to the original gradient landscape.

Figure 7: Gradient explosion clipping visualization. The error surface of a single-hidden-unit recurrent network, highlighting the existence of high-curvature walls; solid lines depict standard trajectories that gradient descent might follow, while dashed arrows show what happens when the gradient is rescaled to a fixed size whenever its norm exceeds a threshold.

To solve the problem of vanishing gradients, we introduce two techniques. The first technique is that instead of initializing W^(hh) randomly, we start off from an identity matrix initialization.

The second technique is to use the Rectified Linear Unit (ReLU) instead of the sigmoid function. The derivative of the ReLU is either 0 or 1. This way, gradients flow through the neurons whose derivative is 1 without getting attenuated while propagating back through time-steps.
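A minimal sketch of the norm-clipping heuristic described above; the threshold value and gradient shapes are hypothetical, not from the notes.

import numpy as np

def clip_gradient(grad, threshold=5.0):
    # Rescale the gradient to a fixed size whenever its norm exceeds
    # the threshold, as in the clipping heuristic discussed above.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

# Hypothetical exploding gradient for W^(hh)
g = np.random.randn(100, 100) * 10.0
g_clipped = clip_gradient(g)
print(np.linalg.norm(g_clipped))  # <= 5.0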
\hat{y}_t = g(U h_t + c) = g(U[\overrightarrow{h}_t ; \overleftarrow{h}_t] + c)   (19)
RNNs can also be multi-layered. Figure 9 shows a multi-layer bi-directional RNN where each lower layer feeds the next layer. As shown in this figure, in this network architecture, at time-step t each intermediate neuron receives one set of parameters from the previous time-step (in the same RNN layer), and two sets of parameters from the previous RNN hidden layer; one input comes from the left-to-right RNN and the other from the right-to-left RNN.

To construct a Deep RNN with L layers, the above relationships are modified to the relationships in Equations 20 and 21, where the input to each intermediate neuron at level i is the output of the RNN at layer i − 1 at the same time-step, t. The output, ŷ, at each time-step is the result of propagating input parameters through all hidden layers (Equation 22).

Figure 9: A deep bi-directional RNN with three RNN layers. Each memory layer passes an intermediate sequential representation to the next.

\overrightarrow{h}_t^{(i)} = f(\overrightarrow{W}^{(i)} h_t^{(i-1)} + \overrightarrow{V}^{(i)} \overrightarrow{h}_{t-1}^{(i)} + \overrightarrow{b}^{(i)})   (20)
\overleftarrow{h}_t^{(i)} = f(\overleftarrow{W}^{(i)} h_t^{(i-1)} + \overleftarrow{V}^{(i)} \overleftarrow{h}_{t+1}^{(i)} + \overleftarrow{b}^{(i)})   (21)

\hat{y}_t = g(U h_t + c) = g(U[\overrightarrow{h}_t^{(L)} ; \overleftarrow{h}_t^{(L)}] + c)   (22)
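A minimal sketch of the bidirectional combination used in Equations 19 and 22, assuming a plain RNN step with tanh; the function names and dimensions below are hypothetical, not from the notes.

import numpy as np

def rnn_step(W, V, b, x_t, h_prev):
    # One plain RNN step: h_t = f(W x_t + V h_prev + b), with f = tanh
    return np.tanh(W @ x_t + V @ h_prev + b)

def bidirectional_layer(xs, params_fwd, params_bwd, hidden_dim):
    # Run a left-to-right and a right-to-left RNN over the sequence, then
    # concatenate their hidden states at each time-step
    # (the [h_fwd ; h_bwd] in Equations 19 and 22).
    h_f = np.zeros(hidden_dim)
    h_b = np.zeros(hidden_dim)
    fwd, bwd = [], []
    for x_t in xs:
        h_f = rnn_step(*params_fwd, x_t, h_f)
        fwd.append(h_f)
    for x_t in reversed(xs):
        h_b = rnn_step(*params_bwd, x_t, h_b)
        bwd.append(h_b)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Hypothetical dimensions: 4-dimensional inputs, 3-dimensional hidden states
rng = np.random.default_rng(0)
p = lambda: (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))
states = bidirectional_layer([rng.normal(size=4) for _ in range(5)], p(), p(), 3)
print(states[0].shape)  # (6,) = concatenated forward and backward states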
\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y^{(n)} | x^{(n)})   (26)
Extension II: compute every hidden state in the decoder using three different inputs:

• The previous hidden state (standard)

• The last hidden layer of the encoder (c = h_T in Figure 11)

• The previous predicted output word, ŷ_{t−1}

Figure 11: Language model with three inputs to each decoder neuron: (h_{t−1}, c, y_{t−1}).
3 Gated Recurrent Units

Beyond the extensions discussed so far, RNNs have been found to perform better with the use of more complex units for activation. So far, we have discussed methods that transition from hidden state h_{t−1} to h_t using an affine transformation and a point-wise nonlinearity. Here, we discuss the use of a gated activation function, thereby modifying the RNN architecture. What motivates this? Well, although RNNs can theoretically capture long-term dependencies, they are very hard to actually train to do this. Gated recurrent units are designed in a manner to have more persistent memory, thereby making it easier for RNNs to capture long-term dependencies. Let us see mathematically how a GRU uses h_{t−1} and x_t to generate the next hidden state h_t. We will then dive into the intuition of this architecture.
z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})   (Update gate)
r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})   (Reset gate)
\tilde{h}_t = \tanh(r_t \circ U h_{t-1} + W x_t)   (New memory)
h_t = (1 - z_t) \circ \tilde{h}_t + z_t \circ h_{t-1}   (Hidden state)
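A minimal NumPy sketch of one GRU step implementing the four equations above; the weight shapes and random initialization are hypothetical, not from the notes.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(r_t * (U @ h_prev) + W @ x_t)  # new memory
    h_t = (1 - z_t) * h_tilde + z_t * h_prev         # hidden state
    return h_t

# Hypothetical dimensions: 4-dimensional input, 3-dimensional hidden state
rng = np.random.default_rng(0)
Wz, Wr, W = (rng.normal(size=(3, 4)) for _ in range(3))
Uz, Ur, U = (rng.normal(size=(3, 3)) for _ in range(3))
h = gru_step(rng.normal(size=4), np.zeros(3), Wz, Uz, Wr, Ur, W, U)
print(h.shape)  # (3,)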
the different parameters: W, U, W^{(r)}, U^{(r)}, W^{(z)}, U^{(z)}. These follow the same backpropagation procedure we have seen in the past.
4 Long-Short-Term-Memories
i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1})   (Input gate)
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1})   (Forget gate)
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1})   (Output/Exposure gate)
\tilde{c}_t = \tanh(W^{(c)} x_t + U^{(c)} h_{t-1})   (New memory cell)
c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t   (Final memory cell)
h_t = o_t \circ \tanh(c_t)
2. Input Gate: We see that the new memory generation stage doesn't check whether the new word is even important before generating the new memory; this is exactly the input gate's function. The input gate uses the input word and the past hidden state to determine whether or not the input is worth preserving and thus is used to gate the new memory. It thus produces i_t as an indicator of this information.

3. Forget Gate: This gate is similar to the input gate except that it does not make a determination of usefulness of the input word; instead, it assesses whether the past memory cell is useful for the computation of the current memory cell. Thus, the forget gate looks at the input word and the past hidden state and produces f_t.

4. Final memory generation: This stage first takes the advice of the forget gate f_t and accordingly forgets the past memory c_{t−1}. Similarly, it takes the advice of the input gate i_t and accordingly gates the new memory \tilde{c}_t. It then sums these two results to produce the final memory c_t; a minimal code sketch of the full LSTM step is given below.
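A minimal NumPy sketch of one LSTM step implementing the gate equations above; as with the GRU sketch, the weight shapes and initialization are hypothetical, not from the notes.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev)       # input gate
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev)       # forget gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev)       # output/exposure gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)   # new memory cell
    c_t = f_t * c_prev + i_t * c_tilde          # final memory cell
    h_t = o_t * np.tanh(c_t)                    # hidden state
    return h_t, c_t

# Hypothetical dimensions: 4-dimensional input, 3-dimensional hidden state
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)) for _ in range(4)]  # Wi, Wf, Wo, Wc
Us = [rng.normal(size=(3, 3)) for _ in range(4)]  # Ui, Uf, Uo, Uc
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3),
                 Ws[0], Us[0], Ws[1], Us[1], Ws[2], Us[2], Ws[3], Us[3])
print(h.shape, c.shape)  # (3,) (3,)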