

Deep Learning Basics

Lecture 9: Recurrent Neural Networks


Princeton University COS 495
Instructor: Yingyu Liang
Introduction
Recurrent neural networks
• Dates back to (Rumelhart et al., 1986)
• A family of neural networks for handling sequential data, which
involves variable-length inputs or outputs

• Especially useful for natural language processing (NLP)


Sequential data
• Each data point: A sequence of vectors 𝑥 (𝑡) , for 1 ≤ 𝑡 ≤ 𝜏
• Batch data: many sequences with different lengths 𝜏
• Label: can be a scalar, a vector, or even a sequence

• Examples (a toy representation is sketched below):
• Sentiment analysis
• Machine translation
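A toy illustration (hypothetical shapes and names, not from the slides) of how such a batch might be represented: one array of shape (τ_i, d) per sequence, with labels whose form depends on the task.

import numpy as np

# A toy batch of sequential data: each data point is a sequence of
# d-dimensional vectors x^(t), t = 1..tau, and tau varies per sequence.
d = 4                                    # dimension of each vector x^(t)
lengths = [3, 5, 2]                      # different sequence lengths tau
batch = [np.random.randn(tau, d) for tau in lengths]

# Labels can take different forms depending on the task:
scalar_labels = [1, 0, 1]                # e.g., sentiment analysis: one label per sequence
sequence_labels = [np.random.randint(0, 10, size=tau) for tau in lengths]
                                         # e.g., one output token per input position

for x, y in zip(batch, scalar_labels):
    print(x.shape, "->", y)              # (tau, d) -> scalar label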
Example: machine translation

Figure from: devblogs.nvidia.com


More complicated sequential data
• Data point: a two-dimensional sequence, such as an image
• Label: a different type of sequence, such as a text sentence

• Example: image captioning


Image captioning

Figure from the paper “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”,
by Justin Johnson, Andrej Karpathy, Li Fei-Fei
Computational graphs
A typical dynamic system

s^(t+1) = f(s^(t); θ)
Figure from Deep Learning, Goodfellow, Bengio and Courville
A system driven by external data

s^(t+1) = f(s^(t), x^(t+1); θ)
Figure from Deep Learning, Goodfellow, Bengio and Courville
Compact view

s^(t+1) = f(s^(t), x^(t+1); θ)
Figure from Deep Learning, Goodfellow, Bengio and Courville
Square: one-step time delay
Compact view

s^(t+1) = f(s^(t), x^(t+1); θ)
Key: the same f and θ for all time steps
Figure from Deep Learning, Goodfellow, Bengio and Courville
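A minimal sketch of unfolding this recurrence in Python (the particular choice f(s, x) = tanh(W s + U x + b) is an illustrative assumption, not from the slides); the point is that the same f and θ = (W, U, b) are reused at every time step.

import numpy as np

# Parameters theta, shared across all time steps
rng = np.random.default_rng(0)
state_dim, input_dim = 3, 2
W = rng.standard_normal((state_dim, state_dim)) * 0.1
U = rng.standard_normal((state_dim, input_dim)) * 0.1
b = np.zeros(state_dim)

def f(s, x):
    """One step of the system: s^(t+1) = f(s^(t), x^(t+1); theta)."""
    return np.tanh(W @ s + U @ x + b)

# Unfold the recurrence over a sequence of external inputs x^(1..tau)
tau = 5
xs = rng.standard_normal((tau, input_dim))
s = np.zeros(state_dim)                  # initial state s^(0)
for t in range(tau):
    s = f(s, xs[t])                      # same f and theta at every step
    print(t + 1, s)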
Recurrent neural networks (RNN)
Recurrent neural networks
• Use the same computational function and parameters across different
time steps of the sequence
• Each time step: takes the input entry and the previous hidden state to
compute the output entry
• Loss: typically computed every time step
Recurrent neural networks
[Figure: an RNN unrolled in time, showing input, state, output, loss, and label at each time step]
Figure from Deep Learning, by Goodfellow, Bengio and Courville


Recurrent neural networks
Math formula:

Figure from Deep Learning, Goodfellow, Bengio and Courville
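The formula itself appears in the figure; as a reference point, the standard formulation from the Deep Learning book (writing s^(t) for the hidden state as in the earlier slides, and assuming tanh hidden units with a softmax output and cross-entropy loss) is:

a^{(t)} = b + W s^{(t-1)} + U x^{(t)}
s^{(t)} = \tanh\big(a^{(t)}\big)
o^{(t)} = c + V s^{(t)}
\hat{y}^{(t)} = \operatorname{softmax}\big(o^{(t)}\big)
L^{(t)} = -\log \hat{y}^{(t)}_{y^{(t)}}

where U, W, V are the input-to-hidden, hidden-to-hidden, and hidden-to-output weights, shared across all time steps.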
Advantage
• Hidden state: a lossy summary of the past
• Shared functions and parameters: greatly reduce model capacity, which
is good for generalization in learning
• Explicitly uses the prior knowledge that sequential data can be
processed in the same way at different time steps (e.g., in NLP)

• Yet still powerful (actually universal): any function computable by a
Turing machine can be computed by such a recurrent network of
finite size (see, e.g., Siegelmann and Sontag (1995))
Training RNN
• Principle: unfold the computational graph, and use backpropagation
• Called back-propagation through time (BPTT) algorithm
• Can then apply any general-purpose gradient-based techniques

• Conceptually: first compute the gradients of the internal nodes, then
compute the gradients of the parameters (a minimal BPTT sketch follows below)
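A minimal numpy sketch of BPTT under the formulation above (tanh hidden units, softmax output, cross-entropy loss summed over time; all variable names are illustrative): gradients flow first to the internal nodes o^(t) and s^(t), then accumulate into the shared parameters.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, tau = 3, 4, 5, 6
U = rng.standard_normal((n_hid, n_in)) * 0.1   # input-to-hidden
W = rng.standard_normal((n_hid, n_hid)) * 0.1  # hidden-to-hidden
V = rng.standard_normal((n_out, n_hid)) * 0.1  # hidden-to-output
b, c = np.zeros(n_hid), np.zeros(n_out)

xs = rng.standard_normal((tau, n_in))          # input sequence x^(1..tau)
ys = rng.integers(0, n_out, size=tau)          # target class at each step

# Forward pass: unfold the graph and cache the states
s = {0: np.zeros(n_hid)}
y_hat = {}
loss = 0.0
for t in range(1, tau + 1):
    s[t] = np.tanh(b + W @ s[t - 1] + U @ xs[t - 1])
    o = c + V @ s[t]
    e = np.exp(o - o.max())
    y_hat[t] = e / e.sum()                     # softmax output
    loss -= np.log(y_hat[t][ys[t - 1]])        # cross-entropy, summed over t

# Backward pass (BPTT): internal nodes first, then parameters
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
db, dc = np.zeros_like(b), np.zeros_like(c)
ds_next = np.zeros(n_hid)                      # gradient arriving from s^(t+1)
for t in range(tau, 0, -1):
    do = y_hat[t].copy()
    do[ys[t - 1]] -= 1.0                       # gradient at o^(t) for softmax + cross-entropy
    ds = V.T @ do + ds_next                    # gradient at s^(t)
    da = (1.0 - s[t] ** 2) * ds                # back through tanh
    dV += np.outer(do, s[t]);  dc += do
    dW += np.outer(da, s[t - 1]);  dU += np.outer(da, xs[t - 1]);  db += da
    ds_next = W.T @ da                         # pass gradient on to s^(t-1)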
Recurrent neural networks
Math formula:

Figure from Deep Learning, Goodfellow, Bengio and Courville
Recurrent neural networks
Gradient at 𝐿(𝑡) : (total loss
is sum of those at different
time steps)

Figure from Deep Learning, Goodfellow, Bengio and Courville
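A reconstruction following the book's treatment (not copied from the figure): since the total loss is L = \sum_t L^{(t)},

\frac{\partial L}{\partial L^{(t)}} = 1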
Recurrent neural networks
Gradient at 𝑜(𝑡) :

Figure from Deep Learning, Goodfellow, Bengio and Courville
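Assuming a softmax output \hat{y}^{(t)} with cross-entropy loss against the target y^{(t)} (the book's setting), the gradient at the output pre-activation is:

\big(\nabla_{o^{(t)}} L\big)_i = \hat{y}^{(t)}_i - \mathbf{1}_{i = y^{(t)}}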
Recurrent neural networks
Gradient at 𝑠 (𝜏) :

Figure from Deep Learning, Goodfellow, Bengio and Courville
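At the last time step, s^{(\tau)} affects the loss only through o^{(\tau)}, so (book's formulation, with s in place of its h):

\nabla_{s^{(\tau)}} L = V^{\top} \nabla_{o^{(\tau)}} L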
Recurrent neural networks
Gradient at 𝑠 (𝑡) :

Figure from Deep Learning, Goodfellow, Bengio and Courville
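For t < \tau, s^{(t)} affects the loss both directly through o^{(t)} and indirectly through s^{(t+1)}; with tanh hidden units this gives (book's formulation):

\nabla_{s^{(t)}} L = W^{\top} \operatorname{diag}\!\big(1 - (s^{(t+1)})^{2}\big)\, \nabla_{s^{(t+1)}} L + V^{\top} \nabla_{o^{(t)}} L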
Recurrent neural networks
Gradient at parameter 𝑉:

Figure from Deep Learning, Goodfellow, Bengio and Courville
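Since V is shared across time steps, its gradient sums the per-step contributions (book's formulation):

\nabla_{V} L = \sum_{t} \big(\nabla_{o^{(t)}} L\big) \big(s^{(t)}\big)^{\top}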
Variants of RNN
RNN
• Use the same computational function and parameters across different
time steps of the sequence
• Each time step: takes the input entry and the previous hidden state to
compute the output entry
• Loss: typically computed every time step

• Many variants
• Information about the past can be in many other forms
• Only output at the end of the sequence
Example: use the output at the
previous step

Figure from Deep Learning, Goodfellow, Bengio and Courville
Example: only output at the end

Figure from Deep Learning, Goodfellow, Bengio and Courville
Bidirectional RNNs
• Many applications: output at time 𝑡 may depend on the whole input
sequence
• Example in speech recognition: correct interpretation of the current
sound may depend on the next few phonemes, potentially even the
next few words

• Bidirectional RNNs are introduced to address this


BiRNNs

Figure from Deep Learning, Goodfellow, Bengio and Courville
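A minimal numpy sketch of the idea (illustrative tanh cells with made-up parameter names, not the book's exact parameterization): one RNN reads the sequence forward, another reads it backward, and the representation at each step combines both, so it can depend on the whole input sequence.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, tau = 3, 4, 5
# Separate parameters for the forward-in-time and backward-in-time RNNs
Wf = rng.standard_normal((n_hid, n_hid)) * 0.1; Uf = rng.standard_normal((n_hid, n_in)) * 0.1
Wb = rng.standard_normal((n_hid, n_hid)) * 0.1; Ub = rng.standard_normal((n_hid, n_in)) * 0.1

xs = rng.standard_normal((tau, n_in))

# Forward-direction states h^(1..tau): summarize the past
h = np.zeros(n_hid); hs = []
for t in range(tau):
    h = np.tanh(Wf @ h + Uf @ xs[t]); hs.append(h)

# Backward-direction states g^(tau..1): summarize the future
g = np.zeros(n_hid); gs = [None] * tau
for t in reversed(range(tau)):
    g = np.tanh(Wb @ g + Ub @ xs[t]); gs[t] = g

# The output at time t can condition on both hs[t] (past) and gs[t] (future)
features = [np.concatenate([hs[t], gs[t]]) for t in range(tau)]
print(features[0].shape)   # (2 * n_hid,)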
Encoder-decoder RNNs
• RNNs: can map a sequence to one vector, or to a sequence of the
same length
• What about mapping a sequence to a sequence of a different length?
• Examples: speech recognition, machine translation, question
answering, etc.

Figure from Deep Learning, Goodfellow, Bengio and Courville
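A minimal sketch of the encoder-decoder idea (illustrative parameters, greedy decoding): the encoder compresses the whole input sequence into a single context vector, and the decoder unrolls from that context for as many steps as the output needs, so the two lengths can differ.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 6
# Encoder parameters
We = rng.standard_normal((n_hid, n_hid)) * 0.1; Ue = rng.standard_normal((n_hid, n_in)) * 0.1
# Decoder parameters (its input at each step is its previous output token, one-hot)
Wd = rng.standard_normal((n_hid, n_hid)) * 0.1; Ud = rng.standard_normal((n_hid, n_out)) * 0.1
V = rng.standard_normal((n_out, n_hid)) * 0.1

xs = rng.standard_normal((7, n_in))      # input sequence of length 7

# Encoder: read the whole input into a fixed-size context vector
h = np.zeros(n_hid)
for x in xs:
    h = np.tanh(We @ h + Ue @ x)
context = h

# Decoder: generate an output sequence of a different length (here 4 steps)
s = context
prev = np.zeros(n_out)                   # one-hot of the previous output token
outputs = []
for _ in range(4):
    s = np.tanh(Wd @ s + Ud @ prev)
    logits = V @ s
    token = int(np.argmax(logits))       # greedy choice of the next token
    outputs.append(token)
    prev = np.eye(n_out)[token]

print(outputs)                           # 4 output tokens from a length-7 input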
