

Deep Learning Basics

Lecture 9: Recurrent Neural Networks


Princeton University COS 495
Instructor: Yingyu Liang
Introduction
Recurrent neural networks
• Dates back to (Rumelhart et al., 1986)
• A family of neural networks for handling sequential data, which
involves variable-length inputs or outputs

• Especially useful for natural language processing (NLP)


Sequential data
• Each data point: A sequence of vectors 𝑥 (𝑡) , for 1 ≤ 𝑡 ≤ 𝜏
• Batch data: many sequences with different lengths 𝜏
• Label: can be a scalar, a vector, or even a sequence

• Examples (a toy representation is sketched below):
• Sentiment analysis
• Machine translation
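A toy illustration (hypothetical shapes and names, not from the slides) of how such a batch might be represented: one array of shape (τ_i, d) per sequence, with labels whose form depends on the task.

import numpy as np

# A toy batch of sequential data: each data point is a sequence of
# d-dimensional vectors x^(t), t = 1..tau, and tau varies per sequence.
d = 4                                    # dimension of each vector x^(t)
lengths = [3, 5, 2]                      # different sequence lengths tau
batch = [np.random.randn(tau, d) for tau in lengths]

# Labels can take different forms depending on the task:
scalar_labels = [1, 0, 1]                # e.g., sentiment analysis: one label per sequence
sequence_labels = [np.random.randint(0, 10, size=tau) for tau in lengths]
                                         # e.g., one output token per input position

for x, y in zip(batch, scalar_labels):
    print(x.shape, "->", y)              # (tau, d) -> scalar label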
Example: machine translation

Figure from: devblogs.nvidia.com


More complicated sequential data
• Data point: a two-dimensional sequence, such as an image
• Label: a different type of sequence, such as a text sentence

• Example: image captioning


Image captioning

Figure from the paper “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”,
by Justin Johnson, Andrej Karpathy, Li Fei-Fei
Computational graphs
A typical dynamic system

s^(t+1) = f(s^(t); θ)
Figure from Deep Learning, Goodfellow, Bengio and Courville
A system driven by external data

s^(t+1) = f(s^(t), x^(t+1); θ)
Figure from Deep Learning, Goodfellow, Bengio and Courville
Compact view

s^(t+1) = f(s^(t), x^(t+1); θ)
Figure from Deep Learning, Goodfellow, Bengio and Courville
Square: one-step time delay
Compact view

s^(t+1) = f(s^(t), x^(t+1); θ)
Key: the same f and θ for all time steps
Figure from Deep Learning, Goodfellow, Bengio and Courville
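A minimal sketch of unfolding this recurrence in Python (the particular choice f(s, x) = tanh(W s + U x + b) is an illustrative assumption, not from the slides); the point is that the same f and θ = (W, U, b) are reused at every time step.

import numpy as np

# Parameters theta, shared across all time steps
rng = np.random.default_rng(0)
state_dim, input_dim = 3, 2
W = rng.standard_normal((state_dim, state_dim)) * 0.1
U = rng.standard_normal((state_dim, input_dim)) * 0.1
b = np.zeros(state_dim)

def f(s, x):
    """One step of the system: s^(t+1) = f(s^(t), x^(t+1); theta)."""
    return np.tanh(W @ s + U @ x + b)

# Unfold the recurrence over a sequence of external inputs x^(1..tau)
tau = 5
xs = rng.standard_normal((tau, input_dim))
s = np.zeros(state_dim)                  # initial state s^(0)
for t in range(tau):
    s = f(s, xs[t])                      # same f and theta at every step
    print(t + 1, s)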
Recurrent neural networks (RNN)
Recurrent neural networks
• Use the same computational function and parameters across different
time steps of the sequence
• Each time step: takes the input entry and the previous hidden state to
compute the output entry
• Loss: typically computed every time step
Recurrent neural networks
[Figure: an RNN unrolled in time, showing input, state, output, loss, and label at each time step]
Figure from Deep Learning, by Goodfellow, Bengio and Courville


Recurrent neural networks
Math formula:

Figure from Deep Learning, Goodfellow, Bengio and Courville
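The formula itself appears in the figure; as a reference point, the standard formulation from the Deep Learning book (writing s^(t) for the hidden state as in the earlier slides, and assuming tanh hidden units with a softmax output and cross-entropy loss) is:

a^{(t)} = b + W s^{(t-1)} + U x^{(t)}
s^{(t)} = \tanh\big(a^{(t)}\big)
o^{(t)} = c + V s^{(t)}
\hat{y}^{(t)} = \operatorname{softmax}\big(o^{(t)}\big)
L^{(t)} = -\log \hat{y}^{(t)}_{y^{(t)}}

where U, W, V are the input-to-hidden, hidden-to-hidden, and hidden-to-output weights, shared across all time steps.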
Advantage
• Hidden state: a lossy summary of the past
• Shared functions and parameters: greatly reduce model capacity, which
is good for generalization in learning
• Explicitly uses the prior knowledge that sequential data can be
processed in the same way at different time steps (e.g., in NLP)

• Yet still powerful (actually universal): any function computable by a
Turing machine can be computed by such a recurrent network of
finite size (see, e.g., Siegelmann and Sontag (1995))
Training RNN
• Principle: unfold the computational graph, and use backpropagation
• Called back-propagation through time (BPTT) algorithm
• Can then apply any general-purpose gradient-based techniques

• Conceptually: first compute the gradients of the internal nodes, then
compute the gradients of the parameters (a minimal BPTT sketch follows below)
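A minimal numpy sketch of BPTT under the formulation above (tanh hidden units, softmax output, cross-entropy loss summed over time; all variable names are illustrative): gradients flow first to the internal nodes o^(t) and s^(t), then accumulate into the shared parameters.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, tau = 3, 4, 5, 6
U = rng.standard_normal((n_hid, n_in)) * 0.1   # input-to-hidden
W = rng.standard_normal((n_hid, n_hid)) * 0.1  # hidden-to-hidden
V = rng.standard_normal((n_out, n_hid)) * 0.1  # hidden-to-output
b, c = np.zeros(n_hid), np.zeros(n_out)

xs = rng.standard_normal((tau, n_in))          # input sequence x^(1..tau)
ys = rng.integers(0, n_out, size=tau)          # target class at each step

# Forward pass: unfold the graph and cache the states
s = {0: np.zeros(n_hid)}
y_hat = {}
loss = 0.0
for t in range(1, tau + 1):
    s[t] = np.tanh(b + W @ s[t - 1] + U @ xs[t - 1])
    o = c + V @ s[t]
    e = np.exp(o - o.max())
    y_hat[t] = e / e.sum()                     # softmax output
    loss -= np.log(y_hat[t][ys[t - 1]])        # cross-entropy, summed over t

# Backward pass (BPTT): internal nodes first, then parameters
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
db, dc = np.zeros_like(b), np.zeros_like(c)
ds_next = np.zeros(n_hid)                      # gradient arriving from s^(t+1)
for t in range(tau, 0, -1):
    do = y_hat[t].copy()
    do[ys[t - 1]] -= 1.0                       # gradient at o^(t) for softmax + cross-entropy
    ds = V.T @ do + ds_next                    # gradient at s^(t)
    da = (1.0 - s[t] ** 2) * ds                # back through tanh
    dV += np.outer(do, s[t]);  dc += do
    dW += np.outer(da, s[t - 1]);  dU += np.outer(da, xs[t - 1]);  db += da
    ds_next = W.T @ da                         # pass gradient on to s^(t-1)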
Recurrent neural networks
Math formula:

Figure from Deep Learning, Goodfellow, Bengio and Courville
Recurrent neural networks
Gradient at 𝐿(𝑡) : (total loss
is sum of those at different
time steps)

Figure from Deep Learning, Goodfellow, Bengio and Courville
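A reconstruction following the book's treatment (not copied from the figure): since the total loss is L = \sum_t L^{(t)},

\frac{\partial L}{\partial L^{(t)}} = 1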
Recurrent neural networks
Gradient at 𝑜(𝑡) :

Figure from Deep Learning, Goodfellow, Bengio and Courville
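Assuming a softmax output \hat{y}^{(t)} with cross-entropy loss against the target y^{(t)} (the book's setting), the gradient at the output pre-activation is:

\big(\nabla_{o^{(t)}} L\big)_i = \hat{y}^{(t)}_i - \mathbf{1}_{i = y^{(t)}}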
Recurrent neural networks
Gradient at 𝑠 (𝜏) :

Figure from Deep Learning, Goodfellow, Bengio and Courville
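At the last time step, s^{(\tau)} affects the loss only through o^{(\tau)}, so (book's formulation, with s in place of its h):

\nabla_{s^{(\tau)}} L = V^{\top} \nabla_{o^{(\tau)}} L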
Recurrent neural networks
Gradient at 𝑠 (𝑡) :

Figure from Deep Learning, Goodfellow, Bengio and Courville
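For t < \tau, s^{(t)} affects the loss both directly through o^{(t)} and indirectly through s^{(t+1)}; with tanh hidden units this gives (book's formulation):

\nabla_{s^{(t)}} L = W^{\top} \operatorname{diag}\!\big(1 - (s^{(t+1)})^{2}\big)\, \nabla_{s^{(t+1)}} L + V^{\top} \nabla_{o^{(t)}} L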
Recurrent neural networks
Gradient at parameter 𝑉:

Figure from Deep Learning, Goodfellow, Bengio and Courville
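Since V is shared across time steps, its gradient sums the per-step contributions (book's formulation):

\nabla_{V} L = \sum_{t} \big(\nabla_{o^{(t)}} L\big) \big(s^{(t)}\big)^{\top}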
Variants of RNN
RNN
• Use the same computational function and parameters across different
time steps of the sequence
• Each time step: takes the input entry and the previous hidden state to
compute the output entry
• Loss: typically computed every time step

• Many variants
• Information about the past can be in many other forms
• Only output at the end of the sequence
Example: use the output at the
previous step

Figure from Deep Learning, Goodfellow, Bengio and Courville
Example: only output at the end

Figure from Deep Learning, Goodfellow, Bengio and Courville
Bidirectional RNNs
• Many applications: output at time 𝑡 may depend on the whole input
sequence
• Example in speech recognition: correct interpretation of the current
sound may depend on the next few phonemes, potentially even the
next few words

• Bidirectional RNNs are introduced to address this


BiRNNs

Figure from Deep Learning, Goodfellow, Bengio and Courville
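A minimal numpy sketch of the idea (illustrative tanh cells with made-up parameter names, not the book's exact parameterization): one RNN reads the sequence forward, another reads it backward, and the representation at each step combines both, so it can depend on the whole input sequence.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, tau = 3, 4, 5
# Separate parameters for the forward-in-time and backward-in-time RNNs
Wf = rng.standard_normal((n_hid, n_hid)) * 0.1; Uf = rng.standard_normal((n_hid, n_in)) * 0.1
Wb = rng.standard_normal((n_hid, n_hid)) * 0.1; Ub = rng.standard_normal((n_hid, n_in)) * 0.1

xs = rng.standard_normal((tau, n_in))

# Forward-direction states h^(1..tau): summarize the past
h = np.zeros(n_hid); hs = []
for t in range(tau):
    h = np.tanh(Wf @ h + Uf @ xs[t]); hs.append(h)

# Backward-direction states g^(tau..1): summarize the future
g = np.zeros(n_hid); gs = [None] * tau
for t in reversed(range(tau)):
    g = np.tanh(Wb @ g + Ub @ xs[t]); gs[t] = g

# The output at time t can condition on both hs[t] (past) and gs[t] (future)
features = [np.concatenate([hs[t], gs[t]]) for t in range(tau)]
print(features[0].shape)   # (2 * n_hid,)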
Encoder-decoder RNNs
• RNNs: can map a sequence to one vector, or to a sequence of the
same length
• What about mapping a sequence to a sequence of a different length?
• Examples: speech recognition, machine translation, question
answering, etc.

Figure from Deep Learning, Goodfellow, Bengio and Courville
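A minimal sketch of the encoder-decoder idea (illustrative parameters, greedy decoding): the encoder compresses the whole input sequence into a single context vector, and the decoder unrolls from that context for as many steps as the output needs, so the two lengths can differ.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 6
# Encoder parameters
We = rng.standard_normal((n_hid, n_hid)) * 0.1; Ue = rng.standard_normal((n_hid, n_in)) * 0.1
# Decoder parameters (its input at each step is its previous output token, one-hot)
Wd = rng.standard_normal((n_hid, n_hid)) * 0.1; Ud = rng.standard_normal((n_hid, n_out)) * 0.1
V = rng.standard_normal((n_out, n_hid)) * 0.1

xs = rng.standard_normal((7, n_in))      # input sequence of length 7

# Encoder: read the whole input into a fixed-size context vector
h = np.zeros(n_hid)
for x in xs:
    h = np.tanh(We @ h + Ue @ x)
context = h

# Decoder: generate an output sequence of a different length (here 4 steps)
s = context
prev = np.zeros(n_out)                   # one-hot of the previous output token
outputs = []
for _ in range(4):
    s = np.tanh(Wd @ s + Ud @ prev)
    logits = V @ s
    token = int(np.argmax(logits))       # greedy choice of the next token
    outputs.append(token)
    prev = np.eye(n_out)[token]

print(outputs)                           # 4 output tokens from a length-7 input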
