1
A Review on Deep Learning Techniques for
Video Prediction
arXiv:2004.05214v1 [cs.CV] 10 Apr 2020
S. Oprea, P. Martinez-Gonzalez, A. Garcia-Garcia, J.A. Castro-Vargas, S. Orts-Escolano,
J. Garcia-Rodriguez, and A. Argyros
Abstract—The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making
systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising
research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation
learning, as it demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos.
Motivated by the increasing interest in this task, we provide a review on the deep learning methods for prediction in video sequences.
We firstly define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we
carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and
their significance in the field. The summary of the datasets and methods is accompanied with experimental results that facilitate the
assessment of the state of the art on a quantitative basis. The paper is summarized by drawing some general conclusions, identifying
open research challenges and by pointing out future research directions.
Index Terms—Video prediction, future frame prediction, deep learning, representation learning, self-supervised learning
✦
1
I NTRODUCTION
Context Frames
Predicted Frames
W
ILL the car hit the pedestrian? That might be one
of the questions that comes to our minds when we
observe Figure 1. Answering this question might be in
principle a hard task; however, if we take a careful look
into the image sequence we may notice subtle clues that can
help us predicting into the future, e.g., the person’s body
indicates that he is running fast enough so he will be able to
escape the car’s trajectory. This example is just one situation
among many others in which predicting future frames in
video is useful.
In general terms, the prediction and anticipation of
future events is a key component of intelligent decisionmaking systems. Despite the fact that we, humans, solve
this problem quite easily and effortlessly, it is extremely
challenging from a machine’s point of view. Some of the
factors that contribute to such complexity are occlusions,
camera movement, lighting conditions, clutter, or object
deformations. Nevertheless, despite such challenging conditions, many predictive methods have been applied with
•
•
•
•
S. Oprea, P. Martinez-Gonzalez A. Garcia-Garcia, J. A. Castro-Vargas,
and J. Garcia-Rodriguez are with the 3D Perception Lab (3DPL), Department of Computer Technology, University of Alicante, Carrer de San
Vicente del Raspeig s/n, E-03690 San Vicent del Raspeig Spain, Spain.
E-mail: {soprea, pmartinez, jacastro, jgarcia}@dtic.ua.es
A. Garcia-Garcia is with the Institute of Space Sciences (ICE-CSIC),
Campus UAB, Carrer de Can Magrans s/n, E-08193 Barcelona, Spain.
E-mail:
[email protected].
S. Orts-Escolano is with the Department of Computer Science and
Artificial Intelligence (DCCIA), University of Alicante, Carrer de San
Vicente del Raspeig s/n, E-03690 San Vicent del Raspeig Spain, Spain.
E-mail:
[email protected].
A. Argyros is with the Institute of Computer Science, FORTH, Heraklion
GR-700 13, Greece and with the Computer Science Department, University of Crete, Heraklion, Rethimno 741 00, Greece.
E-mail:
[email protected].
...
(Xt−n, . . . , Xt)
Ŷt+1
Ŷt+m
Fig. 1: A pedestrian appeared from behind the white
car with the intention of crossing the street. The autonomous car must make a call: hit the emergency braking routine or not. This all comes down to predict the
next frames (Ŷt+1 , . . . , Ŷt+m ) given a sequence of context
frames (Xt−n , . . . , Xt ), where n and m denote the number
of context and predicted frames, respectively. From these
predictions at a representation level (RGB, high-level semantics, etc.) a decision-making system would make the car
avoid the collision.
a certain degree of success in a broad range of application
domains such as autonomous driving, robot navigation and
human-machine interaction. Some of the tasks in which
future prediction has been applied successfully are: anticipating activities and events [1]–[4], long-term planning [5],
future prediction of object locations [6], video interpolation
[7], predicting instance/semantic segmentation maps [8]–
[10], prediction of pedestrian trajectories in traffic [11],
anomaly detection [12], precipitation nowcasting [13], [14],
and autonomous driving [15].
The great strides made by deep learning algorithms
in a variety of research fields such as semantic segmen-
2
tation [16], human action recognition and prediction [17],
object pose estimation [18] and registration [19] to name
a few, motivated authors to explore deep representationlearning models for future video frame prediction. What
made the deep architectures take a leap over the traditional
approaches is their ability to learn adequate representations from high-dimensional data in an end-to-end fashion without hand-engineered features [20]. Deep learningbased models fit perfectly into the learning by prediction
paradigm, enabling the extraction of meaningful spatiotemporal correlations from video data in a self-supervised
fashion.
In this review, we put our focus on deep learning techniques and how they have been extended or applied to
future video prediction. We limit this review to the future
video prediction given the context of a sequence of previous
frames, leaving aside methods that predict future from a
static image. In this context, the terms video prediction,
future frame prediction, next video frame prediction, future
frame forecasting, and future frame generation are used
interchangeably. To the best of our knowledge, this is the
first review in the literature that focuses on video prediction
using deep learning techniques.
This review is organized as follows. First, Sections 2
and 3 lay down the terminology and explain important
background concepts that will be necessary throughout the
rest of the paper. Next, Section 4 surveys the datasets used
by the video prediction methods that are carefully reviewed
in Section 5, providing a comprehensive description as well
as an analysis of their strengths and weaknesses. Section 6
analyzes typical metrics and evaluation protocols for the
aforementioned methods and provides quantitative results
for them in the reviewed datasets. Section 7 presents a
brief discussion on the presented proposals and enumerates
potential future research directions. Finally, Section 8 summarizes the paper and draws conclusions about this work.
2
V IDEO P REDICTION
The ability to predict, anticipate and reason about future
events is the essence of intelligence [21] and one of the main
goals of decision-making systems. This idea has biological
roots, and also draws inspiration from the predictive coding
paradigm [22] borrowed from the cognitive neuroscience
field [23]. From a neuroscience perspective, the human
brain builds complex mental representations of the physical
and causal rules that govern the world. This is primarily
through observation and interaction [24]–[26]. The common
sense we have about the world arises from the conceptual
acquisition and the accumulation of background knowledge
from early ages, e.g. biological motion and intuitive physics
to name a few. But how can the brain check and refine
the learned mental representations from its raw sensory
input? The brain is continuously learning through prediction, and refines the already understood world models from
the mismatch between its predictions and what actually
happened [27]. This is the essence of the predictive coding
paradigm that early works tried to implement as computational models [22], [28]–[30].
Video prediction task closely captures the fundamentals
of the predictive coding paradigm and it is considered the
intermediate step between raw video data and decision
making. Its potential to extract meaningful representations
of the underlying patterns in video data makes the video
prediction task a promising avenue for self-supervised representation learning.
2.1
Problem Definition
We formally define the task of predicting future
frames in videos, i.e. video prediction, as follows. Let
Xt ∈ Rw×h×c be the t-th frame in the video sequence
X = (Xt−n , . . . , Xt−1 , Xt ) with n frames, where w, h,
and c denote width, height, and number of channels,
respectively. The target is to predict the next frames
Y = (Ŷt+1 , Ŷt+2 , . . . , Ŷt+m ) from the input X.
Under the assumption that good predictions can only be
the result of accurate representations, learning by prediction
is a feasible approach to verify how accurately the system
has learned the underlying patterns in the input data. In
other words, it represents a suitable framework for representation learning [31], [32]. The essence of predictive learning
paradigm is the prediction of plausible future outcomes
from a set of historical inputs. On this basis, the task of video
prediction is defined as: given a sequence of video frames
as context, predict the subsequent frames –generation of
continuing video given a sequence of previous frames. Different from video generation that is mostly unconditioned,
video prediction is conditioned on a previously learned
representation from a sequence of input frames. At a first
glance, and in the context of learning paradigms, we can
think about the future video frame prediction task as a
supervised learning approach because the target frame acts
as a label. However, as this information is already available
in the input video sequence, no extra labels or human
supervision is needed. Therefore, learning by prediction is a
self-supervised task, filling the gap between supervised and
unsupervised learning.
2.2
Exploiting the Time Dimension of Videos
Unlike static images, videos provide complex transformations and motion patterns in the time dimension. At a fine
granularity, if we focus on a small patch at the same spatial
location across consecutive time steps, we could identify
a wide range of local visually similar deformations due
to the temporal coherence. In contrast, by looking at the
big picture, consecutive frames would be visually different
but semantically coherent. This variability in the visual
appearance of a video at different scales is mainly due to,
occlusions, changes in the lighting conditions, and camera
motion, among other factors. From this source of temporally
ordered visual cues, predictive models are able to extract
representative spatio-temporal correlations depicting the
dynamics in a video sequence. For instance, Agrawal et
al. [33] established a direct link between vision and motion,
attempting to reduce supervision efforts when training deep
predictive models.
Recent works study how important is the time dimension for video understanding models [34]. The implicit
temporal ordering in videos, also known as the arrow of
time, indicates whether a video sequence is playing forward
or backward. This temporal direction is also used in the
3
Time
Probabilistic Deterministic
Context Frame
2.4
Fig. 2: At top, a deterministic environment where a geometric object, e.g. a black square, starts moving following a
random direction. At bottom, probabilistic outcome. Darker
areas correspond to higher probability outcomes. As uncertainty is introduced, probabilities get blurry and averaged.
Figure inspired by [38].
literature as a supervisory signal [35]–[37]. This further
encouraged predictive models to implicitly or explicitly
model the spatio-temporal correlations of a video sequence
to understand the dynamics of a scene. The time dimension
of a video reduces the supervision effort and makes the
prediction task self-supervised.
2.3
model is a crucial aspect. Probabilistic approaches dealing
with these issues are discussed in Section 5.6.
Dealing with Stochasticity
Predicting how a square is moving, could be extremely
challenging even in a deterministic environment such as
the one represented in Figure 2. The lack of contextual
information and the multiple equally probable outcomes
hinder the prediction task. But, what if we use two consecutive frames as context? Under this configuration and
assuming a physically perfect environment, the square will
be indefinitely moving in the same direction. This represents
a deterministic outcome, an assumption that many authors
made in order to deal with future uncertainty. Assuming a
deterministic outcome would narrow the prediction space to
a unique solution. However, this assumption is not suitable
for natural videos. The future is by nature multimodal, since
the probability distribution defining all the possible future
outcomes in a context has multiple modes, i.e. there are multiple equally probable and valid outcomes. Furthermore, on
the basis of a deterministic universe, we indirectly assume
that all possible outcomes are reflected in the input data.
These assumptions make the prediction under uncertainty
an extremely challenging task.
Most of the existing deep learning-based models in the
literature are deterministic. Although the future is uncertain,
a deterministic prediction would suffice some easily predictable situations. For instance, most of the movement of a
car is largely deterministic, while only a small part is uncertain. However, when multiple predictions are equally probable, a deterministic model will learn to average between
all the possible outcomes. This unpredictability is visually
represented in the predictions as blurriness, especially on
long time horizons. As deterministic models are unable to
handle real-world settings characterized by chaotic dynamics, authors considered that incorporating uncertainty to the
The Devil is in the Loss Function
The design and selection of the loss function for the video
prediction task is of utmost importance. Pixel-wise losses,
e.g. Cross Entropy (CE), ℓ2 , ℓ1 and Mean-Squared Error
(MSE), are widely used in both unstructured and structured predictions. Although leading to plausible predictions
in deterministic scenarios, such as synthetic datasets and
video games, they struggle with the inherent uncertainty
of natural videos. In a probabilistic environment, with different equally probable outcomes, pixel-wise losses aim to
accommodate uncertainty by blurring the prediction, as we
can observe in Figure 2. In other words, the deterministic
loss functions average out multiple equally plausible outcomes in a single, blurred prediction. In the pixel space,
these losses are unstable to slight deformations and fail to
capture discriminative representations to efficiently regress
the broad range of possible outcomes. This makes difficult
to draw predictions maintaining the consistency with our
visual similarity notion. Besides video prediction, several
studies analyzed the impact of different loss functions in
image restoration [39], classification [40], camera pose regression [41] and structured prediction [42], among others.
This fosters reasoning about the importance of the loss
function, particularly when making long-term predictions
in high-dimensional and multimodal natural videos.
Most of distance-based loss functions, such as based on
ℓp norm, come from the assumption that data is drawn
from a Gaussian distribution. But, how these loss functions address multimodal distributions? Assuming that
a pixel is drawn from a bimodal distribution with two
equally likely modes M o1 and M o2 , the mean value
M o = (M o1 + M o2 )/2 would minimize the ℓp -based losses
over the data, even if M o has very low probability [43]. This
suggests that the average of two equally probable outcomes
would minimize distance-based losses such as, the MSE
loss. However, this applies to a lesser extent when using
ℓ1 norm as the pixel values would be the median of the two
equally likely modes in the distribution. In contrast to the ℓ2
norm that emphasizes outliers with the squaring term, the ℓ1
promotes sparsity thus making it more suitable for prediction in high-dimensional data [43]. Based on the ℓ2 norm,
the MSE is also commonly used in the training of video
prediction models. However, it produces low reconstruction
errors by merely averaging all the possible outcomes in
a blurry prediction as uncertainty is introduced. In other
words, the mean image would minimize the MSE error as
it is the global optimum, thus avoiding finer details such
as facial features and subtle movements as they are noise
for the model. Most of the video prediction approaches
rely on pixel-wise loss functions, obtaining roughly accurate
predictions in easily predictable datasets.
One of the ultimate goals of many video prediction
approaches is to palliate the blurry predictions when it
comes to uncertainty. For this purpose, authors broadly
focused on: directly improving the loss functions; exploring adversarial training; alleviating the training process
by reformulating the problem in a higher-level space; or
4
exploring probabilistic alternatives. Some promising results
were reported by combining the loss functions with sophisticated regularization terms, e.g. the Gradient Difference
Loss (GDL) to enhance prediction sharpness [43] and the
Total Variation (TV) regularization to reduce visual artifacts
and enforce coherence [7]. Perceptual losses were also used
to further improve the visual quality of the predictions [44]–
[48]. However, in light of the success of the Generative Adversarial Networks (GANs), adversarial training emerged as
a promising alternative to disambiguate between multiple
equally probable modes. It was widely used in conjunction with different distance-based losses such as: MSE [49],
ℓ2 [50]–[52], or a combination of them [43], [53]–[57]. To alleviate the training process, many authors reformulated the
optimization process in a higher-level space (see Section 5.5).
While great strides have been made to mitigate blurriness,
most of the existing approaches still rely on distance-based
loss functions. As a consequence, the regress-to-the-mean
problem remains an open issue. This has further encouraged
authors to reformulate existing deterministic models in a
probabilistic fashion.
3
BACKBONE D EEP L EARNING A RCHITECTURES
In this section, we will briefly review the most common
deep networks that are used as building blocks for the video
prediction models discussed in this review: convolutional
neural networks, recurrent networks, and generative models.
3.1
Convolutional Models
Convolutional layers are the basic building blocks of deep
learning architectures designed for visual reasoning since
the Convolutional Neural Networks (CNNs) efficiently
model the spatial structure of images [58]. As we focus
on the visual prediction, CNNs represent the foundation of
predictive learning literature. However, their performance
is limited by the intra-frame and inter-frame dependencies.
Convolutional operations account for short-range intraframe dependencies due to their limited receptive fields,
determined by the kernel size. This is a well-addressed
issue, that many authors circumvented by (1) stacking more
convolutional layers [59], (2) increasing the kernel size (although it becomes prohibitively expensive), (3) by linearly
combining multiple scales [43] as in the reconstruction process of a Laplacian pyramid [60], (4) using dilated convolutions to capture long-range spatial dependencies [61], (5)
enlarging the receptive fields [62], [63], or subsampling, i.e.
using pooling operations in exchange for losing resolution.
The latter could be mitigated by using residual connections [64], [65] to preserve resolution while increasing the
number of stacking convolutions. But even addressing these
limitations, would CNNs be able to predict in a longer time
horizon?
Vanilla CNNs lack of explicit inter-frame modeling capabilities. To properly model inter-frame variability in a video
sequence, 3D convolutions come into play as a promising
alternative to recurrent modeling. Several video prediction
approaches leveraged 3D convolutions to capture temporal consistency [66]–[70]. Also modeling time dimension,
Amersfoort et al. [71] replicated a purely convolutional
approach in time to address multi-scale predictions in the
transformation space. In this case, the learned affine transforms at each time step play the role of a recurrent state.
3.2
Recurrent Models
Recurrent models were specifically designed to model a
spatio-temporal representation of sequential data such as
video sequences. Among other sequence learning tasks,
such as machine translation, speech recognition and video
captioning, to name a few, Recurrent Neural Networks
(RNNs) [72] demonstrated great success in the video prediction scenario [10], [13], [49], [50], [52], [53], [53], [70], [73]–
[85]. Vanilla RNNs have some important limitations when
dealing with long-term representations due to the vanishing
and exploding gradient issues, making the Backpropagation
through time (BPTT) cumbersome. By extending classical
RNNs to more sophisticated recurrent models, such as
Long Short-Term Memory (LSTM) [86] and Gated Recurrent
Unit (GRU) [87], these problems were mitigated. Shi et
al. extended the use of LSTM-based models to the image
space [13]. While some authors explored multidimensional
LSTM (MD-LSTM) [88], others stacked recurrent layers
to capture abstract spatio-temporal correlations [49], [89].
Zhang et al. addressed the duplicated representations along
the same recurrent paths [90].
3.3
Generative Models
Whilst discriminative models learn the decision boundaries between classes, generative models learn the underlying distribution of individual classes. More formally,
discriminative models capture the conditional probability
p(y|x), while generative models capture the joint probability
p(x, y), or p(x) in the absence of labels y . The goal of
generative models is the following: given some training
data, generate new samples from the same distribution. Let
input data ∼ pdata (x) and generated samples ∼ pmodel (x)
where, pdata and pmodel are the underlying input data and
model’s probability distribution respectively. The training
process consists in learning a pmodel (x) similar to pdata (x).
This is done by explicitly, e.g VAEs, or implicitly, e.g. GANs,
estimating a density function from the input data. In the
context of video prediction, generative models are mainly
used to cope with future uncertainty by generating a wide
spectrum of feasible predictions rather than a single eventual outcome.
3.3.1 Explicit Density Modeling
These models explicitly define and solve for pmodel (x).
PixelRNNs and PixelCNNs [91]: These are a type of Fully
Visible Belief Networks (FVBNs) [92], [93] that explicitly
define a tractable density and estimate the joint distribution p(x) as a product of conditional distributions over
the pixels. Informally, they turn pixel generation into a
sequential modeling problem, where next pixel values are
determined by previously generated ones. In PixelRNNs,
this conditional dependency on previous pixels is modeled
using two-dimensional (2d) LSTMs. On the other hand,
dependencies are modeled using convolutional operations
5
over a context region, thus making training faster. In a nutshell, these methods are outputting a distribution over pixel
values at each location in the image, aiming to maximize the
likelihood of the training data being generated. Further improvements of the original architectures have been carried
out to address different issues. The Gated PixelCNN [94] is
computationally more efficient and improves the receptive
fields of the original architecture. In the same work, authors also explored conditional modeling of natural images,
where the joint probability distribution is conditioned on a
latent vector —it represents a high-level image description.
This further enabled the extension to video prediction [95].
Variational Autoencoders (VAEs): These models are an extension of Autoencoders (AEs) that encode and reconstruct
its own input data x in order to capture a low-dimensional
representation z containing the most meaningful factors of
variation in x. Extending this architecture to generation,
VAEs aim to sample new images from a prior over the
underlying latent representation z . VAEs represent a probabilistic spin over the deterministic latent space in AEs.
Instead of directly optimizing the density function, which is
intractable, they derive and optimize a lower bound on the
likelihood. Data is generated from the learned distribution
by perturbing the latent variables. In the video prediction
context, VAEs are the foundation of many probabilistic
models dealing with future uncertainty [9], [38], [55], [81],
[85], [96], [97]. Although these variational approaches are
able to generate various plausible outcomes, the predictions
are blurrier and of lower quality compared to state-of-theart GAN-based models. Many approaches were taken to
leverage the advantages of variational inference: combined
adversarial training with VAEs [55], and others incorporated
latent probabilistic variables into deterministic models, such
as Variational Recurrent Neural Networks (VRNNs) [97],
[98] and Variational Encoder-Decoders (VEDs) [99].
3.3.2 Implicit Density Modeling
These models learn to sample from pmodel without explicitly
defining it.
GANs [100]: are the backbone of many video prediction
approaches [43], [49]–[55], [57], [65], [67], [68], [78], [101]–
[106]. Inspired on game theory, these networks consist of
two models that are jointly trained as a minimax game
to generate new fake samples that resemble the real data.
On one hand, we have the discriminator model featuring
a probability distribution function describing the real data.
On the other hand, we have the generator which tries
to generate new samples that fool the discriminator. In
their original formulation, GANs are unconditioned –the
generator samples new data from a random noise, e.g.
Gaussian noise. Nevertheless, Mirza et al. [107] proposed
the conditional Generative Adversarial Network (cGAN), a
conditional version where the generator and discriminator
are conditioned on some extra information, e.g. class labels,
previous predictions, and multimodal data, among others.
CGANs are suitable for video prediction, since the spatiotemporal coherence between the generated frames and the
input sequence is guaranteed. The use of adversarial training for the video prediction task, represented a leap over
the previous state-of-the-art methods in terms of prediction
quality and sharpness. However, adversarial training is
unstable. Without an explicit latent variable interpretation,
GANs are prone to mode collapse —generator fails to cover
the space of possible predictions by getting stuck into a
single mode [99], [108]. Moreover, GANs often struggle to
balance between the adversarial and reconstruction loss,
thus getting blurry predictions. Among the dense literature
on adversarial networks, we find some other interesting
works addressing GANs limitations [109], [110].
4
DATASETS
As video prediction models are mostly self-supervised, they
need video sequences as input data. However, some video
prediction methods rely on extra supervisory signals, e.g.
segmentation maps, and human poses. This makes outof-domain video datasets perfectly suitable for video prediction. This section describes the most relevant datasets,
discussing their pros and cons. Datasets were organized
according to their main purpose and summarized in Table 1.
4.1
Action and Human Pose Recognition Datasets
KTH [111]: is an action recognition dataset which includes
2391 video sequences of 4 seconds mean duration, each of
them containing an actor performing an action taken with a
static camera, over homogeneous backgrounds, at 25 frames
per second (fps) and with its resolution downsampled to
160 × 120 pixels. Just 6 different actions are performed, but
it was the biggest dataset of this kind at its moment.
Weizmann [112]: is also an action recognition dataset, created for modelling actions as space-time shapes. For this
reason, it was recorded at a higher frame rate (50 fps). It just
includes 90 video sequences, but performing 10 different
actions. It uses a static-camera, homogeneous backgrounds
and low resolution (180 × 144 pixels). KTH and Weizmann
are usually used together due to their similarities in order
to augment the amount of available data.
HMDB-51 [113]: is a large-scale database for human motion
recognition. It claims to represent the richness of human
motion taking profit from the huge amount of video available online. It is composed by 6766 normalized videos
(with mean duration of 3.15 seconds) where humans appear
performing one of the 51 considered action categories. Moreover, a stabilized dataset version is provided, in which camera movement is disabled by detecting static backgrounds
and displacing the action as a window. It also provides
interesting data for each sequence such as body parts visible,
point of view respect the human, and if there is camera
movement or not. It also exists a joint-annotated version
called J-HMBD [114] in which the key points of joints were
mannually added for 21 of the HMDB actions.
UCF101 [115]: is an action recognition dataset of realistic
action videos, collected from YouTube. It has 101 different
action categories, and it is an extension of UCF50, which
has 50 action categories. All videos have a frame rate of 25
fps and a resolution of 320 × 240 pixels. Despite being the
most used dataset among predictive models, a problem it
6
has is that only a few sequences really represent movement,
i.e. they often show an action over a fixed background.
level for both semantic and instance segmentation, following the format proposed by the Cityscapes [121] dataset.
Penn Action Dataset [116]: is an action and human pose
recognition dataset from the University of Pennsylvania. It
contains 2326 video sequences of 15 different actions, and
it also provides human joint and viewpoint (position of the
camera respect the human) annotations for each sequence.
Each action is balanced in terms of different viewpoints
representation.
Cityscapes [121]: is a large-scale database which focuses on
semantic understanding of urban street scenes. It provides
semantic, instance-wise, and dense pixel annotations for 30
classes grouped into 8 categories. The dataset consist of
around 5000 fine annotated images (1 frame in 30) and
20 000 coarse annotated ones (one frame every 20 seconds
or 20 meters run by the car). Data was captured in 50
cities during several months, daytimes, and good weather
conditions. All frames are provided as stereo pairs, and the
dataset also includes vehicle odometry obtained from invehicle sensors, outside temperature, and GPS tracks.
Human3.6M [117]: is a human pose dataset in which 11
actors with marker-based suits were recorded performing
15 different types of actions. It features RGB images, depth
maps (time-of-flight range data), poses and scanned 3D
surface meshes of all actors. Silhouette masks and 2D
bounding boxes are also provided. Moreover, the dataset
was extended by inserting high-quality 3D rigged human
models (animated with the previously recorded actions) in
real videos, to create a realistic and complex background.
THUMOS-15 [118]: is an action recognition challenge that
was celebrated in 2015. It didn’t just focus on recognizing an
action in a video, but also on determining the time span in
which that action occurs. With that purpose, the challenge
provided a dataset that extends UCF101 [115] (trimmed
videos with one action) with 2100 untrimmed videos where
one or more actions take place (with the correspondent temporal annotations) and almost 3000 relevant videos without
any of the 101 proposed actions.
4.2
Driving and Urban Scene Understanding Datasets
CamVid [136]: the Cambridge-driving Labeled Video
Database is a driving/urban scene understanding dataset
which consists of 5 video sequences recorded with a 960 ×
720 pixels resolution camera mounted on the dashboard of
a car. Four of those sequences were sampled at 1 fps, and
one at 15 fps, resulting in 701 frames which were manually
per-pixel annotated for semantic segmentation (under 32
classes). It was the first video sequence dataset of this kind
to incorporate semantic annotations.
CalTech Pedestrian Dataset [119]: is a driving dataset focused on detecting pedestrians, since its unique annotations
are pedestrian bounding boxes. It is conformed of approximately 10 hours of 640 × 480 30fps video taken from a vehicle driving through regular traffic in an urban environment,
making a total of 250 000 annotated frames distributed in
137 approximately minute-long segments. The total pedestrian bounding boxes is 350 000, identifying 2300 unique
pedestrians. Temporal correspondence between bounding
boxes and detailed occlusion labels are also provided.
Kitti [120]: is one of the most popular datasets for mobile
robotics and autonomous driving, as well as a benchmark
for computer vision algorithms. It is composed by hours of
traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, gray-scale stereo cameras, and a 3D laser scanner. Despite its popularity, the
original dataset did not contain ground truth for semantic
segmentation. However, after various researchers manually
annotated parts of the dataset to fit their necessities, in 2015
Kitti dataset was updated with 200 annotated frames at pixel
Comma.ai steering angle [137]: is a driving dataset composed by 7.25 hours of largely highway routes. It was
recorded as 360 × 180 camera images at 20 fps (divided in
11 different clips), and steering angles, among other vehicle
data (speed, GPS, etc.).
Apolloscape [122]: is a driving/urban scene understanding
dataset that focuses on 3D semantic reconstruction of the
environment. It provides highly precise information about
location and 6D camera pose, as well as a much bigger
amount of dense per-pixel annotations than other datasets.
Along with that, depth information is retireved from a
LIDAR sensor, that allows to semantically reconstruct the
scene in 3D as a point cloud. It also provides RGB stereo
pairs as video sequences recorded under various weather
conditions and daytimes. This video sequences and their
per-pixel instance annotations make this dataset very interesting for a wide variety of predictive models.
4.3
Object and Video Classification Datasets
Sports1M [123]: is a video classification dataset that also
consists of annotated YouTube videos. In this case, it is
fully focused on sports: its 487 classes correspond to the
sport label retrieved from the YouTube Topics API. Video
resolution, duration and frame rate differ across all available
videos, but they can be normalized when accessed from
YouTube. It is much bigger than UCF101 (over 1 million
videos), and movement is also much more frequent.
Youtube-8M [124]: Sports1M [123] dataset is, since 2016,
part of a bigger one called YouTube8M, which follows the
same philosophy, but with all kind of videos, not just sports.
Moreover, it has been updated in order to improve the quality and precision of their annotations. In 2019 YouTube-8M
Segments was released with segment-level human-verified
labels on about 237 000 video segments on 1000 different
classes, which are collected from the validation set of the
YouTube-8M dataset. Since YouTube is the biggest video
source on the planet, having annotations for some of their
videos at segment level is great for predictive models.
YFCC100M [125]: Yahoo Flickr Creative Commons 100 Million
Dataset is a collection of 100 million images and videos
uploaded to Flickr between 2004 and 2014. All those media
files were published in Flickr under Creative Commons
license, overcoming one of the biggest issues affecting existing multimedia datasets, licensing and volume. Although
only 0.8% of the elements of the dataset are videos, it is
7
TABLE 1: Summary of the most widely used datasets for video prediction (S/R: Synthetic/Real, st: stereo, de: depth,
ss: semantic segmentation, is: instance segmentation, sem: semantic, I/O: Indoor/Outdoor environment, bb: bounding
box, Act: Action label, ann: annotated, env: environment, ToF: Time of Flight, vp: camera viewpoints respect human).
provided data and ground-truth
name1
year
S/R
#videos
#frames
#ann. frames
resolution
#classes
RGB
st
de
ss
is
other annotations
env.
X
X
X
X
X
X
X
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
ToF
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
Act.
Act.
Act., vp
Act.
Act., Human poses, vp
Act., Human poses & meshes
Act., Time span
O
O
I/O
I/O
I/O
I/O
I/O
X
X
X
X
X
X
✗
✗
X
X
✗
X
✗
✗
LiDAR
stereo
✗
LiDAR
X
✗
X
X
✗
X
✗
✗
X
X
✗
X
✗
Pedestrian bb & occlusions
Odometry
Odometry, temp, GPS
Steering angles & speed
Odometry, GPS
O
O
O
O
O
O
X
X
X
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
Sport label
Topic label, Segment info
User tags, Localization
I/O
I/O
I/O
X
X
X
X
X
X
X
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
✗
X
✗
✗
✗
✗
✗
✗
✗
✗
Arm pose
Arm pose
Arm pose
I/O
I
I
I
I
X
X
X
X
X
X
✗
✗
✗
X
✗
X
✗
✗
✗
✗
X
stereo
✗
✗
✗
✗
X
✗
✗
✗
✗
X
X
✗
User tags, human bb
Object bb
✗
Human poses, bb
Normal maps, 6D poses
✗
I/O
I
I/O
I
O
Action and human pose recognition datasets
KTH [111]
Weizmann [112]
HMDB-51 [113]
UCF101 [115]
Penn Action D. [116]
Human3.6M [117]
THUMOS-15 [118]
2004
2007
2011
2012
2013
2014
2017
R
R
R
R
R
SR
R
2391
90
6766
13 320
2326
40002
18 404
250 0002
90002
639 300
2 000 0002
163 841
3 600 000
3 000 0002
Camvid [77]
CalTech Pedest. [119]
Kitti [120]
Cityscapes [121]
Comma.ai [75]
Apolloscape [122]
2008
2009
2013
2016
2016
2018
R
R
R
R
R
R
5
137
151
50
11
4
18 202
1 000 0002
48 791
7 000 0002
522 0002
200 000
Sports1m [123]
YouTube8M [124]
YFCC100M [125]
2014
2016
2016
R
R
SR
1 133 158
8 200 000
8000
n/a
n/a
n/a
0
0
0
Bouncing balls [126]
Van Hateren [127]
NORBvideos [128]
Moving MNIST [74]
Robotic Pushing [89]
BAIR Robot [129]
RoboNet [130]
2008
2012
2013
2015
2016
2017
2019
S
R
R
SR
R
R
R
4000
56
110 560
custom3
57 000
45 000
161 000
20 000
3584
552 800
custom3
1 500 0002
n/a
15 000 000
0
0
All (is)
0
0
0
0
ViSOR [131]
PROST [132]
Arcade Learning [133]
Inria 3DMovie v2 [134]
Robotrix [16]
UASOL [135]
2010
2010
2013
2016
2018
2019
R
R
S
R
S
R
1529
4 (10)
custom3
27
67
33
1 360 0002
4936 (9296)
custom3
2476
3 039 252
165 365
0
0
0
0
0
0
0
160 × 120
180 × 144
var × 240
320 × 240
480 × 270
1000x1000
320 × 240
6 (action)
10 (action)
51 (action)
101 (action)
15 (action)
15 (action)
101 (action)
Driving and urban scene understanding datasets
701 (ss)
250 000 (bb)
200 (ss)
25 000 (ss)
0
146 997 (ss)
960 × 720
640 × 480
1392 × 512
2048 × 1024
160 × 320
3384 × 2710
32 (sem)
30 (sem)
30 (sem)
25 (sem)
Object and video classification datasets
640 × 360 (var.)
variable
variable
487 (sport)
1000 (topic)
-
Video prediction datasets
150 × 150
128 × 128
640 × 480
64 × 64
640 × 512
n/a
variable
5 (object)
-
Other-purpose and multi-purpose datasets
1
2
3
0
All (bb)
0
235 (is)
All (ss)
0
variable
variable
210 × 160
960 × 540
1920 × 1080
2280 × 1282
39 (sem)
-
some dataset names have been abbreviated to enhance table’s readability.
values estimated based on the framerate and the total number of frames or videos, as the original values are not provided by the authors.
custom indicates that as many frames as needed can be generated. This is related to datasets generated from a game, algorithm or simulation, involving interaction or randomness.
still useful for predictive models due to the great variety of
these, and therefore the challenge that it represents.
4.4
Video Prediction Datasets
Standard bouncing balls dataset [126]: is a common test
set for models that generate high dimensional sequences. It
consists of simulations of three balls bouncing in a box. Its
clips can be generated randomly with custom resolution but
the common structure is composed by 4000 training videos,
200 testing videos and 200 more for validation. This kind of
datasets are purely focused on video prediction.
Van Hateren Dataset of natural videos (version [127]): is
a very small dataset of 56 videos, each 64 frames long, that
has been widely used in unsupervised learning. Original
images were taken and given for scientific use by the
photographer Hans van Hateren, and they feature moving
animals in grasslands along rivers and streams. Its frame
size is 128 × 128 pixels. The version we are reviewing
is the one provided along with the work of Cadieu and
Olshausen [127].
NORBvideos [128]: NORB (NYU Object Recognition Benchmark) dataset [138] is a compilation of static stereo pairs
of 50 homogeneously colored objects from various points
of view and 6 lightning conditions. Those images were
processed to obtain their object masks and even their casted
shadows, allowing them to augment the dataset introducing
random backgrounds. Viewpoints are determined by rotating the camera through 9 elevations and 18 azimuths (every
20 degrees) around the object. NORBvideos dataset was built
by sequencing all these frames for each object.
Moving MNIST [74] (M-MNIST): is a video prediction
dataset built from the composition of 20-frame video sequences where two handwritten digits from the MNIST
database are combined inside a 64 × 64 patch, and moved
with some velocity and direction along frames, potentially
overlapping between them. This dataset is almost infinite (as
new sequences can be generated on the fly), and it also has
interesting behaviours due to occlusions and the dynamics
of digits bouncing off the walls of the patch. For these reasons, this dataset is widely used by many predictive models.
A stochastic variant of this dataset is also available. In the
original M-MNIST the digits move with constant velocity
and bounce off the walls in a deterministic manner. In
contrast, in SM-MNIST digits move with a constant velocity
along a trajectory until they hit at wall at which point they
bounce off with a random speed and direction. In this way,
8
moments of uncertainty (each time a digit hits a wall) are
interspersed with deterministic motion.
Robotic Pushing Dataset [89]: is a dataset created for
learning about physical object motion. It consist on 640×512
pixels image sequences of 10 different 7-degree-of-freedom
robotic arms interacting with real-world physical objects.
No additional labeling is given, the dataset was designed
to model motion at pixel level through deep learning algorithms based on convolutional LSTM (ConvLSTM).
BAIR Robot Pushing Dataset (used in [129]): BAIR (Berkeley Artificial Intelligence Research) group has been working
on robots that can learn through unsupervised training (also
known in this case as self-supervised), this is, learning the
consequences that its actions (movement of the arm and
grip) have over the data it can measure (images from two
cameras). In this way, the robot assimilates physics of the
objects and can predict the effects that its actions will generate on the environment, allowing it to plan strategies to
achieve more general goals. This was improved by showing
the robot how it can grab tools to interact with other objects.
The dataset is composed by hours of this self-supervised
learning with the robotic arm Sawyer.
RoboNet [130]: is a dataset composed by the aggregation of various self-supervised training sequences of seven
robotic arms from four different research laboratories. The
previously described BAIR group is one of them, along
with Stanford AI Laboratory, Grasp Lab of the University of
Pennsylvania and Google Brain Robotics. It was created with
the goal of being a standard, in the same way as ImageNet
is for images, but for robotic self-supervised learning. Several experiments have been performed studying how the
transfer among robotic arms can be achieved.
4.5
Other-purpose and Multi-purpose Datasets
ViSOR [131]: ViSOR (Video Surveillance Online Repository)
is a repository designed with the aim of establishing an open
platform for collecting, annotating, retrieving, and sharing
surveillance videos, as well as evaluating the performance
of automatic surveillance systems. Its raw data could be
very useful for video prediction due to its implicit static
camera.
PROST [132]: is a method for online tracking that used
ten manually annotated videos to test its performance.
Four of them were created by PROST authors, and they
conform the dataset with the same name. The remaining six
sequences were borrowed from other authors, who released
their annotated clips to test their tracking methods. We will
consider both 4-sequences PROST dataset and 10-sequences
aggregated dataset when providing statistics. In each video,
different challenges are presented for tracking methods:
occlusions, 3D motion, varying illumination, heavy appearance/scale changes, moving camera, motion blur, among
others. Provided annotations include bounding boxes for
the object/element being tracked.
Arcade Learning Environment [133]: is a platform that enables machine learning algorithms to interact with the Atari
2600 open-source emulator Stella to play over 500 Atari
games. The interface provides a single 2D frame of 210×160
pixels resolution at 60 fps in real-time, and up to 6000 fps
when it is running at full speed. It also offers the possibility
of saving and restoring the state of a game. Although its
obvious main application is reinforcement learning, it could
also be profitable as source of almost-infinite interactive
video sequences from which prediction models can learn.
Inria 3DMovie Dataset v2 [134]: is a video dataset which
extracted its data from the StreetDance 3D stereo movies.
The dataset includes stereo pairs, and manually generated
ground-truth for human segmentation, poses and bounding
boxes. The second version of this dataset, used in [134],
is composed by 27 clips, which represent 2476 frames, of
which just a sparse subset of 235 were annotated.
RobotriX [16]: is a synthetic dataset designed for assistance
robotics, that consist of sequences where a humanoid robot
is moving through various indoor scenes and interacting
with objects, recorded from multiple points of view, including robot-mounted cameras. It provides a huge variety
of ground-truth data generated synthetically from highlyrealistic environments deployed on the cutting-edge game
engine UnrealEngine, through the also available tool UnrealROX [139]. RGB frames are provided at 1920 × 1080
pixels resolution and at 60 fps, along with pixel-precise
instance masks, depth and normal maps, and 6D poses of
objects, skeletons and cameras. Moreover, UnrealROX is an
open source tool for retrieving ground-truth data from any
simulation running in UnrealEngine.
UASOL [135]: is a large-scale dataset consisting of highresolution sequences of stereo pairs recorded outdoors at
pedestrian (egocentric) point of view. Along with them,
precise depth maps are provided, computed offline from
stereo pairs by the same camera. This dataset is intended to
be useful for depth estimation, both from single and stereo
images, research fields where outdoor and pedestrian-pointof-view data is not abundant. Frames were taken at a
resolution of 2280 × 1282 pixels at 15 fps.
5
V IDEO P REDICTION M ETHODS
In the video prediction literature we find a broad range of
different methods and approaches. Early models focused
on directly predicting raw pixel intensities, by implicitly
modeling scene dynamics and low-level details (Section 5.1).
However, extracting a meaningful and robust representation from raw videos is challenging, since the pixel space
is highly dimensional and extremely variable. From this
point, reducing the supervision effort and the representation
dimensionality emerged as a natural evolution. On the one
hand, the authors aimed to disentangle the factors of variation from the visual content, i.e. factorizing the prediction
space. For this purpose, they: (1) formulated the prediction
problem into an intermediate transformation space by explicitly modeling the source of variability as transformations
between frames (Section 5.2); (2) separated motion from the
visual content with a two-stream computation (Section 5.3).
On the other hand, some models narrowed the output
space by conditioning the predictions on extra variables
(Section 5.4), or reformulating the problem in a higher-level
space (Section 5.5). High-level representation spaces are
increasingly more attractive, since intelligent systems rarely
9
Video Prediction
Through Direct
Pixel Synthesis
Factorizing the
Prediction Space
Narrowing the
Prediction Space
By Incorporating
Uncertainty
Implicit Modeling
of Scene Dynamics
Using Explicit
Transformations
By Conditioning on
Extra Variables
Using Probabilistic
Approaches
With Explicit Motion from Content
Separation
To High-level
Feature Space
Fig. 3: Classification of video prediction models.
rely on raw pixel information for decision making. Besides
simplifying the prediction task, some other works addressed
the future uncertainty in predictions. As the vast majority of
video prediction models are deterministic, they are unable
to manage probabilistic environments. To address this issue,
several authors proposed modeling future uncertainty with
probabilistic models (Section 5.6).
So far in the literature, there is no specific taxonomy that
classifies video prediction models. In this review, we have
classified the existing methods according to the video prediction problem they addressed and following the classification illustrated in Figure 3. For simplicity, each subsection
extends directly the last level in the taxonomy. Moreover,
some methods in this review can be classified in more
than one category since they addressed multiple problems.
For instance, [9], [54], [85] are probabilistic models making
predictions in a high-level space as they addressed both the
future uncertainty and high dimensionality in videos. The
category of these models were specified according to their
main contribution. The most relevant methods, ordered in a
chronological order, are summarized in Table 2 containing
low-level details. Prediction is a widely discussed topic
in different fields and at different levels of abstraction.
For instance, the future prediction from a static image [3],
[106], [140]–[143], vehicle behavior prediction [144] and
human action prediction [17] are a different but inspiring
research fields. Although related, the aforementioned topics
are outside the scope of this particular review, as it focuses
purely on the video prediction methods using a sequence of
previous frames as context and is limited to 2D RGB data.
5.1
Direct Pixel Synthesis
Initial video prediction models attempted to directly predict
future pixel intensities without any explicit modeling of
the scene dynamics. Ranzato et al. [73] discretized video
frames in patch clusters using k-means. They assumed that
non-overlapping patches are equally different in a k-means
discretized space, yet similarities can be found between
patches. The method is a convolutional extension of a RNNbased model [145] making short-term predictions at the
patch-level. As the full-resolution frame is a composition
of the predicted patches, some tilling effect can be noticed.
Predictions of large and fast-moving objects are accurate,
however, when it comes to small and slow-moving objects there is still room for improvement. These are common issues for most methods making predictions at the
patch-level. Addressing longer-term predictions, Srivastava
et al. [74] proposed different AE-based approaches incorporating LSTM units to model the temporal coherence. Using
convolutional [146] and flow [147] percepts alongside RGB
image patches, authors tested the models on multi-domain
tasks and considered both unconditioned and conditioned
decoder versions. The latter only marginally improved the
prediction accuracy. Replacing the fully connected LSTMs
with convolutional LSTMs, Shi et al. proposed an end-to-end
model efficiently exploiting spatial correlations [13]. This
enhanced prediction accuracy and reduced the number of
parameters.
Inspired on adversarial training: Building on the recent
success of the Laplacian Generative Adversarial Network
(LAPGAN), Mathieu et al. proposed the first multi-scale
architecture for video prediction that was trained in an adversarial fashion [43]. Their novel GDL regularization combined with ℓ1 -based reconstruction and adversarial training
represented a leap over the previous state-of-the-art models [73], [74] in terms of prediction sharpness. However, it
was outperformed by the Predictive Coding Network (PredNet) [75] which stacked several ConvLSTMs vertically connected by a bottom-up propagation of the local ℓ1 error computed at each level. Previously to PredNet, the same authors
proposed the Predictive Generative Network (PGN) [49], an
end-to-end model trained with a weighted combination of
adversarial loss and MSE on synthetic data. However, no
tests on natural videos and comparison with state-of-the-art
predictive models were carried out. Using a similar training
strategy as [43], Zhou et al. used a convolutional AE to learn
long-term dependencies from time-lapse videos [103]. Build
on Progressively Growing GANs (PGGANs) [148], Aigner et
al. proposed the FutureGAN [69], a three-dimensional (3d)
convolutional Encoder-decoder (ED)-based model. They
used the Wasserstein GAN with gradient penalty (WGANGP) loss [149] and conducted experiments on increasingly
complex datasets. Extending [13], Zhang et al. proposed
a novel LSTM-based architecture where hidden states are
updated along a z-order curve [70]. Dealing with distortion
and temporal inconsistency in predictions and inspired by
the Human Visual System (HVS), Jin et al. [150] first incorporated multi-frequency analysis into the video prediction
task to decompose images into low and high frequency
bands. This allowed high-fidelity and temporally consistent
predictions with the ground truth, as the model better lever-
10
ages the spatial and temporal details. The proposed method
outperformed previous state-of-the-art in all metrics except
in the Learned Perceptual Image Patch Similarity (LPIPS),
where probabilistic models achieved a better performance
since their predictions are clearer and realistic but less
consistent with the ground truth. Distortion and blurriness
are further accentuated when it comes to predict under fast
camera motions. To this end, Shouno [151] implemented a
hierarchical residual network with top-down connections.
Leveraging parallel prediction at multiple scales, authors
reported finer details and textures under fast and large
camera motion.
Bidirectional flow: Under the assumption that video sequences are symmetric in time, Kwon et al. [101] explored
a retrospective prediction scheme training a generator for
both, forward and backward prediction (reversing the input sequence to predict the past). Their cycle GAN-based
approach ensure the consistency of bidirectional prediction
through retrospective cycle constraints. Similarly, Hu et
al. [57] proposed a novel cycle-consistency loss used to train
a GAN-based approach (VPGAN). Future frames are generated from a sequence of context frames and their variation
in time, denoted as Z . Under the assumption that Z is
symmetric in the encoding space, it is manipulated by the
model manipulates to generate desirable moving directions.
In the same spirit, other works focused on both, forward
and backward predictions [37], [152]. Enabling state sharing
between the encoder and decoder, Oliu et al. proposed the
folded Recurrent Neural Network (fRNN) [153], a recurrent
AE architecture featuring GRUs that implement a bidirectional flow of the information. The model demonstrated a
stratified representation, which makes the topology more
explainable, as well as efficient compared to regular AEs in
terms or memory consumption and computational requirements.
Exploiting 3D convolutions: for modeling short-term features, Wang et al. [66] integrated them into a recurrent
network demonstrating state-of-the-art results in both video
prediction and early activity recognition. While 3D convolutions efficiently preserves local dynamics, RNNs model
the long-range context. The eidetic 3d LSTM (E3d-LSTM)
network, represented in Figure 4, features a gated-controlled
self-attention module, i.e. eidetic 3D memory, that effectively
manages historical memory records across multiple time
steps. This enables long-range video reasoning, outperforming previous approaches. Some other works used 3D
convolutional operations to model the time dimension [69].
Analyzing the previous works, Byeon et al. [76] identified
a lack of spatial-temporal context in the representations,
leading to blurry results when it comes to the future uncertainty. Although authors addressed this contextual limitation with dilated convolutions and multi-scale architectures,
the context representation progressively vanishes in longterm predictions. To address this issue, they proposed a
context-aware model that efficiently aggregates per-pixel
contextual information at each layer and in multiple directions. The core of their proposal is a context-aware layer
consisting of two blocks, one aggregating the information
from multiple directions and the other blending them into a
Fig. 4: Representation of the 3D encoder-decoder architecture of E3d-LSTM [66]. After reducing T consecutive input
frames to high-dimensional feature maps, these are directly
fed into a novel eidetic module for modeling long-term spatiotemporal dependencies. Finally, stacked 3D CNN decoder
outputs the predicted video frames. For classification tasks
the hidden states can be directly used as the learned video
representation. Figure extracted from [66].
It
It+1
It
It+1
(x, y)
(x, y)
P (x, y)
(x, y)
(x + u, y + v)
It+1 (x, y) = f (It (x + u, y + v))
(a) Vector-based.
It+1 (x, y) = K(x, y) ∗ P (x, y)
(b) Kernel-based.
Fig. 5: Representation of transformation-based approaches.
(a) Vector-based with a bilinear interpolation. (b) Applying transformations as a convolutional operation. Figure
inspired by [154].
unified context.
Extracting a robust representation from raw pixel values
is an overly complicated task due to the high-dimensionality
of the pixel space. The per-pixel variability between consecutive frames, causes an exponential growth in the prediction
error on the long-term horizon.
5.2
Using Explicit Transformations
Let X = (Xt−n , . . . , Xt−1 , Xt ) be a video sequence of n
frames, where t denotes time. Instead of learning the visual appearance, transformation-based approaches assume
that visual information is already available in the input
sequence. To deal with the strong similarity and pixel redundancy between successive frames, these methods explicitly
model the transformations that takes a frame at time t to the
frame at t+1. These models are formally defined as follows:
Yt+1 = T (G (Xt−n:t ) , Xt−n:t ) ,
(1)
11
where G is a learned function that outputs future transformation parameters, which applied to the last observed
frame Xt using the function T , generates the future frame
prediction Yt+1 . According to the classification of Reda
et al. [154], T function can be defined as a vector-based
resampling such as bilinear sampling, or adaptive kernelbased resampling, e.g. using convolutional operations. For
instance, a bilinear sampling operation is defined as:
Yt+1 (x, y) = f (Xt (x + u, y + v)) ,
(2)
where f is a bilinear interpolator such as [7], [155], [156],
(u, v) is a motion vector predicted by G , and Xt (x, y)
is a pixel value at (x,y) in the last observed frame Xt .
Approaches following this formulation are categorized as
vector-based resampling operations and are depicted in
Figure 5a.
On the other side, in the kernel-based resampling, the
G function predicts the kernel K(x, y) which is applied as a
convolution operation using T , as depicted in Figure 5b and
is mathematically represented as follows:
Yt+1 (x, y) = K(x, y) ∗ Pt (x, y),
(3)
where K(x, y) ∈ RN xN is the 2D kernel predicted by the
function G and Pt (x, y) is an N × N patch centered at (x, y).
Combining kernel and vector-based resampling into a
hybrid solution, Reda et al. [154] proposed the Spatially
Displaced Convolution (SDC) module that synthesizes highresolution images applying a learned per-pixel motion vector and kernel at a displaced location in the source image.
Their 3D CNN model trained on synthetic data and featuring the SDC modules, reported promising predictions of a
high-fidelity.
5.2.1 Vector-based Resampling
Bilinear models use multiplicative interactions to extract
transformations from pairs of observations in order to relate
images, such as Gated Autoencoders (GAEs) [157]. Inspired
by these models, Michalski et al. proposed the Predictive
Gating Pyramid (PGP) [158] consisting on a recurrent pyramid of stacked GAEs. To the best of our knowledge, this
was the first attempt to predict future frames in the affine
transform space. Multiple GAEs are stacked to represent a
hierarchy of transformations and capture higher-order dependencies. From the experiments on predicting frequency
modulated sin-waves, authors stated that standard RNNs
were outperformed in terms of accuracy. However, no performance comparison was conducted on videos.
Based on the Spatial Transformer (ST) module [159]:
To provide spatial transformation capabilities to existing
CNNs, Jaderberg et al. [159] proposed the ST module represented in Figure 6. It regresses different affine transformation parameters for each input, to be applied as a single
transformation to the whole feature map(s) or image(s).
Moreover, it can be incorporated at any part of the CNNs
and it is fully differentiable. The ST module is the essence of
vector-based resampling approaches for video prediction.
As an extension, Patraucean et al. [77] modified the grid
generator to consider per-pixel transformations instead of a
single dense transformation map for the entire image. They
nested a LSTM-based temporal encoder into a spatial AE,
Fig. 6: A representation of the spatial transformer module
proposed by [159]. First, the localization network regresses
the transformation parameters, denoted as θ, from the input
feature map U . Then, the grid generator creates a sampling
grid from the predicted transformation parameters. Finally,
the sampler produces the output map by sampling the input
at the points defined in the sampling grid. Figure extracted
from [159].
proposing the AE-convLSTM-flow architecture. The prediction is generated by resampling the current frame with the
flow-based predicted transformation. Using the components
of the AE-convLSTM-flow architecture, Lu et al. [78] assembled an extrapolation module which is unfolded in time
for multi-step prediction. Their Flexible Spatio-semporal
Network (FSTN) features a novel loss function using the
DeePSiM perceptual loss [44] in order to mitigate blurriness.
An exhaustive experimentation and ablation study was
carried out, testing multiple combinations of loss functions.
Also inspired on the ST module for the volume sampling
layer, Liu et al. proposed the Deep Voxel Flow (DVF) architecture [7]. It consists of a multi-scale flow-based ED model
originally designed for the video frame interpolation task,
but also evaluated on a predictive basis reporting sharp
results. Liang et al. [55] use a flow-warping layer based on a
bilinear interpolation. Finn et al. proposed the Spatial Transformer Predictor (STP) motion-based model [89] producing
2D affine transformations for bilinear sampling. Pursuing
efficiency, Amersfoort et al. [71] proposed a CNN designed
to predict local affine transformations of overlapping image
patches. Unlike the ST module, authors estimated transformations of input frames off-line and at patch level. As the
model is parameter-efficient, it was unfolded in time for
multi-step prediction. This resembles RNNs as the parameters are shared over time and the local affine transforms
play the role of recurrent states.
5.2.2
Kernel-based Resampling
As a promising alternative to the vector-based resampling,
recent approaches synthesize pixels by convolving input
patches with a predicted kernel. However, convolutional
operations are limited in learning spatial invariant representations of complex transformations. Moreover, due to
their local receptive fields, global spatial information is not
fully preserved. Using larger kernels would help to preserve global features, but in exchange for a higher memory
consumption. Pooling layers are another alternative, but
loosing spatial resolution. Preserving spatial resolution at a
low computational cost is still an open challenge for future
video frame prediction task. Transformation layers used
12
in vector-based resampling [7], [77], [159] enabled CNNs
to be spatially invariant and also inspired kernel-based
architectures.
Inspired on the Convolutional Dynamic Neural Advection (CDNA) module [89]: In addition to the STP vectorbased model, Finn et al. [89] proposed two different kernelbased motion prediction modules outperforming previous
approaches [43], [80], (1) the Dynamic Neural Advection
(DNA) module predicting different distributions for each
pixel and (2) the CDNA module that instead of predicting
different distributions for each pixel, it predicts multiple
discrete distributions applied convolutionally to the input.
While, CDNA and STP mask out objects that are moving
in consistent directions, the DNA module produces perpixel motion. These modules inspired several kernel-based
approaches. Similar to the CDNA module, Klein et al.
proposed the Dynamic Convolutional Layer (DCL) [160]
for short-range weather prediction. Likewise, Brabandere
et al. [161] proposed the Dynamic Filter Networks (DFN)
generating sample (for each image) and position-specific
(for each pixel) kernels. This enabled sophisticated and
local filtering operations in comparison with the ST module,
that is limited to global spatial transformations. Different
to the CDNA model, the DFN uses a softmax layer to
filter values of greater magnitude, thus obtaining sharper
predictions. Moreover, temporal correlations are exploited
using a parameter-efficient recurrent layer, much simpler
than [13], [74]. Exploiting adversarial training, Vondrick et
al. proposed a cGAN-based model [102] consisting of a
discriminator similar to [67] and a CNN generator featuring a transformer module inspired on the CDNA model.
Different from the CDNA model, transformations are not
applied recurrently on a per-frame basis. To deal with inthe-wild videos and make predictions invariant to camera
motion, authors stabilized the input videos. However, no
performance comparison with previous works has been
conducted.
Relying on kernel-based transformations and improving [162], Luc et al. [163] proposed the Transformation-based
& TrIple Video Discriminator GAN (TrIVD-GAN-FP) featuring a novel recurrent unit that computes the parameters
of a transformation used to warp previous hidden states
without any supervision. These Transformation-based Spatial Recurrent Units (TSRUs) are generic modules and can
replace any traditional recurrent unit in currently existent
video prediction approaches.
Object-centric representation: Instead of focusing on the
whole input, Chen et al. [50] modeled individual motion of
local objects, i.e. object-centered representations. Based on
the ST module and a pyramid-like sampling [164], authors
implemented an attention mechanism for object selection.
Moreover, transformation kernels were generated dynamically as in the DFN, to then apply them to the last patch
containing an object. Although object-centered predictions
is novel, performance drops when dealing with multiple
objects and occlusions as the attention module fails to distinguish them correctly.
Fig. 7: MCnet with Multi-scale Motion-Content Residuals.
While the motion encoder captures the temporal dynamics
in a sequence of image differences, the content encoder
extracts meaningful spatial features from the last observed
RGB frame. After that, the network computes motioncontent features that are fed into the decoder to predict the
next frame. Figure extracted from [65].
5.3
Explicit Motion from Content Separation
Drawing inspiration from two-stream architectures for action recognition [165], video generation from a static image [67], and unconditioned video generation [68], authors
decided to factorize the video into content and motion to
process each on a separate pathway. By decomposing the
high-dimensional videos, the prediction is performed on
a lower-dimensional temporal dynamics separately from
the spatial layout. Although this makes end-to-end training
difficult, factorizing the prediction task into more tractable
problems demonstrated good results.
The Motion-content Network (MCnet) [65], represented
in Figure 7 was the first end-to-end model that disentangled scene dynamics from the visual appearance. Authors
performed an in-depth performance analysis ensuring the
motion and content separation through generalization capabilities and stable long-term predictions compared to models that lack of explicit motion-content factorization [43],
[74]. In a similar fashion, yet working in a higher-level pose
space, Denton et al. proposed Disentangled-representation
Net (DRNET) [79] using a novel adversarial loss —it isolates
the scene dynamics from the visual content, considered as
the discriminative component— to completely disentangle
motion dynamics from content. Outperforming [43], [65],
the DRNET demonstrated a clean motion from content
separation by reporting plausible long-term predictions on
both synthetic and natural videos. To improve prediction
variability, Liang et al. [55] fused the future-frame and
future-flow prediction into a unified architecture with a
shared probabilistic motion encoder. Aiming to mitigate
the ghosting effect in disoccluded regions, Gae et al. [166]
proposed a two-staged approach consisting of a separate
computation of flow and pixel predictions. As they focused
on inpainting occluded regions of the image using flow
information, they improved results on disoccluded areas
avoiding undesirable artifacts and enhancing sharpness.
Separating the moving objects and the static background,
13
Wu et al. [167] proposed a two-staged architecture that
firstly predicts the static background to then, using this
information, predict the moving objects in the foreground.
Final results are generated through composition and by
means of a video inpainting module. Reported predictions
are quite accurate, yet performance was not contrasted with
the latest video prediction models.
Although previous approaches disentangled motion
from content, they have not performed an explicit decomposition into low-dimensional components. Addressing this issue, Hsieh et al. proposed the Decompositional
Disentangled Predictive Autoencoder (DDPAE) [168] that
decomposes the high-dimensional video into components
represented with low-dimensional temporal dynamics. On
the Moving MNIST dataset, DDPAE first decomposes images into individual digits (components) to then factorize
each digit into its visual appearance and spatial location,
being the latter easier to predict. Although experiments
were performed on synthetic data, this approach represents
a promising baseline to disentangle and decompose natural
videos. Moreover, it is applicable to other existing models to
improve their predictions.
5.4
Conditioned on Extra Variables
Conditioning the prediction on extra variables such as vehicle odometry or robot state, among others, would narrow
the prediction space. These variables have a direct influence
on the dynamics of the scene, providing valuable information that facilitates the prediction task. For instance, the
motion captured by a camera placed on the dashboard of
an autonomous vehicle is directly influenced by the wheelsteering and acceleration. Without explicitly exploiting this
information, we rely blindly on the model’s capabilities to
correlate the wheel-steering and acceleration with the perceived motion. However, the explicit use of these variables
would guide the prediction.
Following this paradigm, Oh et al. first made longterm video predictions conditioned by control inputs from
Atari games [80]. Although the proposed ED-based models
reported very long-term predictions (+100), performance
drops when dealing with small objects (e.g. bullets in
Space Invaders) and while handling stochasticity due to the
squared error. However, by simply minimizing ℓ2 error can
lead to accurate and long-term predictions for deterministic
synthetic videos, such as those extracted from Atari video
games. Building on [80], Chiappa et al. [169] proposed
alternative architectures and training schemes alongside an
in-depth performance analysis for both short and long-term
prediction. Similar model-based control from visual inputs
performed well in restricted scenarios [170], but was inadequate for unconstrained environments. These deterministic
approaches are unable to deal with natural videos in the
absence of control variables.
To address this limitation, the models proposed by Finn
et al. [89] successfully made predictions on natural images,
conditioned on the robot state and robot-object interactions performed in a controlled scenario. These models
predict per-pixel transformations conditioned by the previous frame, to finally combine them using a composition
mask. They outperformed [43], [80] on both conditioned
and unconditioned predictions, however the quality of
long-term predictions degrades over time because of the
blurriness caused by the MSE loss function. Also, using
high-dimensional sensory such as images, Dosovitskiy et
al. [171] proposed a sensorimotor control model which enables interaction in complex and dynamic 3d environments.
The approach is a reinforcement learning (RL)-based techniques, with the difference that instead of building upon a
monolithic state and a scalar reward, the authors consider
high-dimensional input streams, such as raw visual input,
alongside a stream of measurements or player statistics.
Although the outputs are future measurements instead of
visual predictions, it was proven that using multivariate
data benefits decision-making over conventional scalar reward approaches.
5.5
In the High-level Feature Space
Despite the vast work on video prediction models, there
is still room for improvement in natural video prediction.
To deal with the curse of dimensionality, authors reduced
the prediction space to high-level representations, such
as semantic and instance segmentation, and human pose.
Since the pixels are categorical, the semantic space greatly
simplifies the prediction task, yet unexpected deformations
in semantic maps and disocclusions, i.e. initially occluded
scene entities become visible, induce uncertainty. However,
high-level prediction spaces are more tractable and constitute good intermediate representations. By bypassing the
prediction in the pixel space, models become able to report
longer-term and more accurate predictions.
5.5.1 Semantic Segmentation
In recent years, semantic and instance representations have
gained increasing attention, emerging as a promising avenue for complete scene understanding. By decomposing
the visual scene into semantic entities, such as pedestrians, vehicles and obstacles, the output space is narrowed
to high-level scene properties. This intermediate representation represents a more tractable space as pixel values
of a semantic map are categorical. In other words, scene
dynamics are modeled at the semantic entity level instead
of being modeled at the pixel level. This has encouraged
authors to (1) leverage future prediction to improve parsing
results [51] and (2) directly predict segmentation maps into
the future [8], [56], [172].
Exploring the scene parsing in future frames, Jin et al.
proposed the Parsing with prEdictive feAtuRe Learning
(PEARL) framework [51] which was the first to explore the
potential of a GAN-based frame prediction model to improve per-pixel segmentation. Specifically, this framework
conducts two complementary predictive learning tasks.
Firstly, it captures the temporal context from input data
by using a single-frame prediction network. Then, these
temporal features are embedded into a frame parsing network through a transform layer for generating per-pixel
future segmentations. Although the predictive net was not
compared with existing approaches, PEARL outperforms
the traditional parsing methods by generating temporally
consistent segmentations. In a similar fashion, Luc et al. [56]
extended the msCNN model of [43] to the novel task of
14
Fig. 8: Two-staged method proposed by Chiu et al. [173]. In the upper half, the student network consists on an ED-based
architecture featuring a 3D convolutional forecasting module. It performs the forecasting task guided by an additional loss
generated by the teacher network (represented in the lower half). Figure extracted from [173].
predicting semantic segmentations of future frames, using
softmax pre-activations instead of raw pixels as input. The
use of intermediate features or higher-level data as input
is a common practice in the video prediction performed
in the high-level feature space. Some authors refer to this
type or input data as percepts. Luc et al. explored different
combinations of loss functions, inputs (using RGB information alongside percepts), and outputs (autoregressive and
batch models). Results on short, medium and long-term
predictions are sound, however, the models are not endto-end and they do not capture explicitly the temporal
continuity across frames. To address this limitation and
extending [51], Jin et al. first proposed a model for jointly
predicting motion flow and scene parsing [174]. Flow-based
representations implicitly draw temporal correlations from
the input data, thus producing temporally coherent perpixel segmentations. As in [56], the authors tested different
network configurations, as using Res101-FCN percepts for
the prediction of semantic maps, and also performed multistep prediction up to 10 time-steps into the future. Perpixel accuracy improved when segmenting small objects,
e.g. pedestrians and traffic signs, which are more likely to
vanish in long-term predictions. Similarly, except that time
dimension is modeled with LSTMs instead of motion flow
estimation, Nabavi et al. proposed a simple bidirectional EDLSTM [82] using segmentation masks as input. Although
the literature on knowledge distillation [175], [176] stated
that softmax pre-activations carry more information than
class labels, this model outperforms [56], [174] on short-term
predictions.
Another relevant idea is to use both motion flow estimation alongside LSTM-based temporal modeling. In this
direction, Terwilliger et al. [10] proposed a novel method
performing a LSTM-based feature-flow aggregation. Authors also tried to further simplify the semantic space by
disentangling motion from semantic entities [65], achieving low overhead and efficiency. The prediction problem
was decomposed into two subtasks, that is, current frame
segmentation and future optical flow prediction, which are
finally combined with a novel end-to-end warp layer. An
improvement on short-term predictions were reported over
previous works [56], [174], yet performing worse on midterm predictions.
A different approach was proposed by Vora et al. [83]
which first incorporated structure information to predict
future 3D segmented point clouds. Their geometry-based
model consists of several derivable sub-modules: (1) the
pixel-wise segmentation and depth estimation modules
which are jointly used to generate the 3d segmented point
cloud of the current RGB frame; and (2) an LSTM-based
module trained to predict future camera ego-motion trajectories. The future 3d segmented point clouds are obtained
by transforming the previous point clouds with the predicted ego-motion. Their short-term predictions improved
the results of [56], however, the use of structure information
for longer-term predictions is not clear.
The main disadvantage of two-staged, i.e. not end-toend, approaches [10], [56], [82], [83], [174] is that their
performance is constrained by external supervisory signals,
e.g. optical flow [177], segmentation [178] and intermediate features or percepts [61]. Breaking this trend, Chiu et
al. [173] first solved jointly the semantic segmentation and
forecasting problems in a single end-to-end trainable model
by using raw pixels as input. This ED architecture is based
on two networks, with one performing the forecasting task
(student) and the other (teacher) guiding the student by
means of a novel knowledge distillation loss. An in-depth
ablation study was performed, validating the performance
of the ED architectures as well as the 3D convolution used
for capturing temporal scale instead of a LSTM or ConvLSTM, as in previous works.
Avoiding the flood of deterministic models, Bhattacharyya et al. proposed a Bayesian formulation of the
ResNet model in a novel architecture to capture model
and observation uncertainty [9]. As main contribution, their
dropout-based Bayesian approach leverages synthetic likelihoods [179], [180] to encourage prediction diversity and deal
with multi-modal outcomes. Since Cityscapes sequences
have been recorded in the frame of reference of a moving vehicle, authors conditioned the predictions on vehicle
odometry.
5.5.2
Instance Segmentation
While great strides have been made in predicting future
segmentation maps, the authors attempted to make predictions at a semantically richer level, i.e. future prediction of
semantic instances. Predicting future instance-level segmentations is a challenging and weakly unexplored task. This
15
is because instance labels are inconsistent and variable in
number across the frames in a video sequence. Since the
representation of semantic segmentation prediction models
is of fixed-size, they cannot directly address semantics at the
instance level.
To overcome this limitation and introducing the novel
task of predicting instance segmentations, Luc et al. [8]
predict fixed-sized feature pyramids, i.e. features at multiple
scales, used by the Mask R-CNN [181] network. The combination of dilated convolutions and multi-scale, efficiently
preserve high-resolution details improving the results over
previous methods [56]. To further improve predictions, Sun
et al. [84] focused on modeling not only the spatio-temporal
correlations between the pyramids, but also the intrinsic
relations among the feature layers inside them. By enriching the contextual information using the proposed Context
Pyramid ConvLSTMs (CP-ConvLSTM), an improvement in
the prediction was noticed. Although the authors have
not shown any long-term predictions nor compared with
semantic segmentation models, their approach is currently
the state of the art in the task of predicting instance segmentations, outperforming [8].
5.5.3
Other High-level Spaces
Although semantic and instance segmentation spaces were
the most used in video prediction, other high-level spaces
such as human pose and keypoints represent a promising
avenue.
Human Pose: As the human pose is a low-dimensional and
interpretable structure, it represents a cheap supervisory
signal for predictive models. This fostered pose-guided prediction methods, where future frame regression in the pixel
space is conditioned by intermediate prediction of human
poses. However, these methods are limited to videos with
human presence. As this review focuses on video prediction,
we will briefly review some of the most relevant methods
predicting human poses as an intermediate representation.
From a supervised prediction of human poses, Villegas et
al. [53] regress future frames through analogy making [182].
Although background is not considered in the prediction,
authors compared the model against [13], [43] reporting
long-term results. To make the model unsupervised on the
human pose, Wichers et al. [52] adopted different training
strategies: end-to-end prediction minimizing the ℓ2 loss,
and through analogy making, constraining the predicted
features to be close to the outputs of the future encoder.
Different from [53], in this work the predictions are made
in the feature space. As a probabilistic alternative, Walker et
al. [54] fused a conditioned Variational Autoencoder (cVAE)based probabilistic pose predictor with a GAN. While the
probabilistic predictor enhances the diversity in the predicted poses, the adversarial network ensures prediction
realism. As this model struggles with long-term predictions,
Fushishita et al. [183] addressed long-term video prediction of multiple outcomes avoiding the error accumulation
and vanishing gradients by using a unidimensional CNN
trained in an adversarial fashion. To enable multiple predictions, they have used additional inputs ensuring trajectory
and behavior variability at a human pose level. To better
preserve the visual appearance in the predictions than [53],
[65], [108], Tang et al. [184] firstly predict human poses using
a LSTM-based model to then synthesize pose-conditioned
future frames using a combination of different networks: a
global GAN modeling the time-invariant background and
a coarse human pose, a local GAN refining the coarsepredicted human pose, and a 3D-AE to ensure temporal
consistency across frames.
Keypoints-based representations: The keypoint coordinate
space is a meaningful, tractable and structured representation for prediction, ensuring stable learning. It enforces
model’s internal representation to contain object-level information. This leads to better results on tasks requiring objectlevel understanding such as, trajectory prediction, action
recognition and reward prediction. As keypoints are a natural representation of dynamic objects, Minderer et al. [85]
reformulated the prediction task in the keypoint coordinate
space. They proposed an AE architecture with a keypointbased representational bottleneck, consisting of a VRNN
that predicts dynamics in the keypoint space. Although
this model qualitatively outperforms the Stochastic Video
Generation (SVG) [81], Stochastic Adversarial Video Prediction (SAVP) [108] and EPVA [52] models, the quantitative
evaluation reported similar results.
5.6
Incorporating Uncertainty
Although high-level representations significantly reduce the
prediction space, the underlying distribution still has multiple modes. In other words, different plausible outcomes
would be equally probable for the same input sequence.
Addressing multimodal distributions is not straightforward
for regression and classification approaches, as they regress
to the mean and aim to discretize a continuous highdimensional space, respectively. To deal with the inherent
unpredictability of natural videos, some works introduced
latent variables into existing deterministic models or directly relied on generative models such as GANs and VAEs.
Inspired by the DVF, Xue et al. [199] proposed a cVAEbased [219], [220] multi-scale model featuring a novel cross
convolutional layer trained to regress the difference image
or Eulerian motion [221]. Background on natural videos is
not uniform, however the model implicitly assumes that the
difference image would accurately capture the movement
in foreground objects. Introducing latent variables into a
convolutional AE, Goroshin et al. [208] proposed a probabilistic model for learning linearized feature representations
to linearly extrapolate the predicted frame in a feature space.
Uncertainty is introduced to the loss by using a cosine
distance as an explicit curvature penalty. Authors focused
on evaluating the linearization properties, yet the model
was not contrasted to previous works. Extending [141],
[199], Fragkiadaki et al. [96] proposed several architectural
changes and training schemes to handle marginalization
over stochastic variables, such as sampling from the prior
and variational inference. They proposed a stochastic ED
architecture that predicts future optical flow, i.e., dense pixel
motion field, used to spatially transform the current frame
into the next frame prediction. To introduce uncertainty
in predictions, the authors proposed the k-best-sample-loss
(MCbest) that draws K outcomes penalizing those similar
to the ground-truth.
16
TABLE 2: Summary of video prediction models (c: convolutional; r: recurrent; v: variational; ms: multi-scale; st: stacked; bi:
bidirectional; P: Percepts; M: Motion; PL: Perceptual Loss; AL: Adversarial Loss; S/R: using Synthetic/Real datasets; SS:
Semantic Segmentation; D: Depth; S: State; Po: Pose; O: Odometry; IS: Instance Segmentation; ms: multi-step prediction;
pred-fr: number of predicted frames, ⋆ 1-5 frames, ⋆ ⋆ 5-10 frames, ⋆ ⋆ ⋆ 10-100 frames, ⋆ ⋆ ⋆ ⋆ over 100 frames; ood:
indicates if model was tested on out-of-domain tasks).
details
method
year
based on
architecture
Ranzato et al. [73]
Srivastava et al. [74]
PGN [49]
Shi et al. [13]
BeyondMSE [43]
PredNet [75]
ContextVP [76]
fRNN [153]
E3d-LSTM [66]
Kwon et al. [101]
Znet [70]
VPGAN [57]
Jin et al. [150]
Shouno et al. [151]
2014
2015
2015
2015
2016
2017
2018
2018
2019
2019
2019
2019
2020
2020
[145], [185]
[186]
[74]
[60], [187]
[22], [188]
[88], [189]
[13]
[45], [192], [193]
[13]
[79], [193]
[75]
rCNN
LSTM-AE
LSTM-cED
cLSTM
msCNN
stLSTMs
MD-LSTM
cGRU-AE
r3D-CNN
cycleGAN
cLSTM
GAN
cED-GAN
GAN
PGP [158]
Patraucean et al. [77]
DFN [161]
Amersfoort et al. [71]
FSTN [78]
Vondrick et al. [102]
Chen et al. [50]
DVF [7]
SDC-Net [154]
TrIVD-GAN-FP [163]
2014
[156]
2015
[186]
2016
[89], [160]
2017
[77]
2017
[44], [77]
2017
[67], [89]
2017 [71], [159], [161]
2017
[159]
2018
[196], [197]
2020 [142], [162], [166]
st-rGAEs
LSTM-cAE
r-cED
CNN
LSTM-cED
cGAN
rCNN-ED
ms-cED
CNN
DVD-GAN
MCnet [65]
Dual-GAN [55]
DRNET [79]
DPG [166]
2017
2017
2017
2019
[13], [165], [199]
[100]
[65]
[89], [142]
LSTM-cED
VAE-GAN
LSTM-ED
cED
Oh et al. [80]
Finn et al. [89]
2015
2016
[13]
[13], [80]
rED
st-cLSTMs
Villegas et al. [53]
PEARL [51]
S2S [56]
Walker et al. [54]
Jin et al. [174]
EPVA (EPVA) [52]
Nabavi et al. [82]
F2F et al. [8]
Vora et al. [83]
Chiu et al. [173]
Bayes-WD-SL [9]
Sun et al. [84]
Terwilliger et al. [10]
Struct-VRNN [85]
2017
2017
2017
2017
2017
2018
2018
2018
2018
2019
2019
2019
2019
2019
[182], [203], [204]
[43]
[205]
[51], [56], [77]
[53]
[56], [174]
[56], [181]
[56], [174]
[8]
[65], [174]
[206], [207]
LSTM-cED
cED
msCNN
vED
cED
LSTM-ED
biLSTM-cED
st-msCNN
LSTM
3D-cED
bayesResNet
st-ms-cLSTM
M-cLSTM
cVRNN
Goroshin et al. [208]
Fragkiadaki et al. [96]
EEN [99]
SV2P [38]
SVG [81]
Castrejon et al. [97]
Hu et al. [15]
2015
2017
2017
2018
2018
2019
2020
[209]
[141], [199]
[22], [75], [212]
[89]
[38]
[81], [98]
[56], [162], [216]
cAE
vED
vED
CDNA
LSTM-cED
vRNN
cED
datasets (train, valid, test)
evaluation
input
output
MS
loss function
S/R
pred-fr
ood
code
RGB
RGB,P
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB,Z
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
✗
X
✗
✗
X
X
X
X
X
X
X
X
X
X
CE
CE, ℓ2
M SE, AL
CE
ℓ1 , GDL, AL
ℓ 1 ,ℓ 2
ℓ1 , GDL
ℓ1
ℓ1 , ℓ2 , CE
ℓ1 , LoG, AL
ℓ2 , BCE, AL
ℓ1 , Lcycle , AL
ℓ2 , GDL, AL
Lp , AL, P L
R
SR
S
S
R
SR
R
SR
SR
R
SR
R
R
R
⋆
⋆⋆⋆
⋆
⋆⋆⋆
⋆⋆
⋆⋆
⋆⋆
⋆⋆⋆
⋆⋆⋆
⋆⋆⋆
⋆⋆⋆
⋆⋆⋆
⋆⋆⋆
⋆⋆⋆
✗
X
✗
X
✗
X
✗
✗
X
✗
✗
✗
✗
✗
✗
X
✗
✗
X
X
✗
X
X
✗
✗
✗
✗
✗
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
X
✗
X
X
X
X
X
X
X
X
ℓ2
ℓ 2 , ℓδ
BCE
M SE
ℓ 2 , ℓδ , P L
CE, AL
CE, ℓ2 , GDL, AL
ℓ1 , T V
ℓ1 , P L
Lhinge [55]
SR
SR
SR
SR
SR
R
SR
R
SR
R
⋆
⋆
⋆⋆⋆
⋆⋆
⋆⋆⋆
⋆⋆⋆
⋆⋆
⋆
⋆⋆
⋆⋆⋆
✗
X
X
✗
✗
X
✗
X
X
✗
✗
X
X
✗
✗
✗
✗
X
✗
✗
RGB
RGB
RGB
RGB
X
X
X
X
ℓp , GDL, AL
ℓ1 , KL, AL
ℓ2 , CE, AL
ℓp , T V, P L, CE
R
R
SR
SR
⋆⋆⋆
⋆⋆
⋆⋆⋆⋆
⋆⋆
✗
✗
X
✗
X
✗
X
✗
RGB
RGB
X
X
ℓ2
ℓ2
S
R
⋆⋆⋆⋆
⋆⋆⋆
X
✗
X
X
RGB,Po
SS
SS
RGB
SS,M
RGB
SS
P,SS,IS
ego-M
SS
SS
P,IS
SS
RGB
X
✗
X
X
X
X
X
X
✗
✗
X
X
X
X
ℓ2 , P L, AL [44]
ℓ2 , AL
ℓ1 , GDL, AL
ℓ2 , CE, KL, AL
ℓ1 , GDL, CE
ℓ2 , AL
CE
ℓ2
ℓ1
CE, M SE
KL
ℓ2 , [181]
CE, ℓ1
ℓ2 , KL
R
R
R
R
R
SR
R
R
R
R
SR
R
R
SR
⋆⋆⋆⋆
⋆
⋆⋆⋆
⋆⋆⋆
⋆⋆⋆
⋆⋆⋆⋆
⋆⋆
⋆⋆⋆
⋆
⋆⋆
⋆⋆⋆
⋆⋆
⋆⋆⋆
⋆⋆
X
X
✗
X
X
X
✗
X
X
✗
X
✗
✗
X
✗
✗
X
✗
✗
X
✗
X
✗
✗
X
✗
X
X
RGB
RGB
RGB
RGB
RGB
RGB
SS,D,M
✗
✗
X
X
X
X
X
ℓ2 , penalty
KL, M Cbest
ℓ 1 , ℓ2
ℓp , KL
ℓ2 , KL
KL
CE, ℓδ , Ld , Lc , Lp
SR
R
SR
SR
SR
SR
R
⋆
⋆
⋆⋆
⋆⋆⋆
⋆⋆⋆⋆
⋆⋆⋆
⋆⋆⋆
✗
X
✗
✗
✗
✗
X
✗
✗
X
X
X
✗
✗
Direct Pixel Synthesis
[115], [127]
[74], [113], [115], [123]
[126]
[74]
[115], [123]
[117], [119], [120], [137]
[115], [117], [119], [120]
[74], [111], [115]
[74], [111], [190], [191]
[115], [119], [120], [194], [195]
[74], [111]
[111], [129]
[111], [119], [120], [129]
[119], [120]
Using Explicit Transformations
[126], [128]
[74], [113], [131], [132]
[74], [115]
[74], [115]
[74], [115], [123], [131], [132]
[125]
[74], [115]
[115], [118]
[119], [124]
[115], [129], [198]
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB
RGB,M
RGB
Explicit Motion from Content Separation
[111], [112], [115], [123]
[115], [118]–[120]
[74], [111], [138], [200]
[119], [201], [202]
RGB
RGB
RGB
RGB
Conditioned on Extra Variables
[133]
[89], [117]
RGB,A
RGB,A,S
In the High-level Feature Space
[116], [117]
[121], [136]
[121], [136]
[115], [116]
[121], [137]
[117]
[121]
[121]
[121]
[121], [122]
[121]
[121], [134]
[121]
[90], [117]
RGB,Po
RGB
P
RGB,Po
RGB,P
RGB
P
P
ego-M
RGB
SS,O
P
RGB,P
RGB
Incorporating Uncertainty
[138], [210]
[117], [211]
[213]–[215]
[89], [117], [129]
[74], [111], [129]
[74], [121], [129]
[121], [122], [217], [218]
RGB
RGB
RGB
RGB
RGB
RGB
RGB
17
Incorporating latent variables into the deterministic
CDNA architecture for the first time, Babaeizadeh et
al. proposed the Stochastic Variational Video Prediction
(SV2P) [38] model handling natural videos. Their timeinvariant posterior distribution is approximated from the
entire input video sequence. Authors demonstrated that, by
explicitly modeling uncertainty with latent variables, the
deterministic CDNA model is outperformed. By combining a standard deterministic architecture (LSTM-ED) with
stochastic latent variables, Denton et al. proposed the SVG
network [81]. Different from SV2P, the prior is sampled from
a time-varying posterior distribution, i.e. it is a learned-prior
instead of fixed-prior sampled from the same distribution.
Most of the VAEs use a fixed Gaussian as a prior, sampling
randomly at each time step. Exploiting the temporal dependencies, a learned-prior predicts high variance in uncertain
situations, and a low variance when a deterministic prediction suffices. The SVG model is easier to train and reported
sharper predictions in contrast to [38]. Built upon SVG,
Villegas et al. [222] implemented a baseline to perform an
in-depth empirical study on the importance of the inductive
bias, stochasticity, and model’s capacity in the video prediction task. Different from previous approaches, Henaff et al.
proposed the Error Encoding Network (EEN) [99] that incorporates uncertainty by feeding back the residual error —the
difference between the ground truth and the deterministic
prediction— encoded as a low-dimensional latent variable.
In this way, the model implicitly separates the input video
into deterministic and stochastic components.
On the one hand, latent variable-based approaches cover
the space of possible outcomes, yet predictions lack of
realism. On the other hand, GANs struggle with uncertainty,
but predictions are more realistic. Searching for a tradeoff between VAEs and GANs, Lee et al. [108] proposed
the SAVP model, being the first to combine latent variable
models with GANs to improve variability in video predictions, while maintaining realism. Under the assumption
that blurry predictions of VAEs are a sign of underfitting, Castrejon et al. extended the VRNNs to leverage a
hierarchy of latent variables and better approximate data
likelihood [97]. Although the backpropagation through a
hierarchy of conditioned latents is not straightforward,
several techniques alleviated this issue such as, KL beta
warm-up, dense connectivity pattern between inputs and
latents, Ladder Variational Autoencoders (LVAEs) [223]. As
most of the probabilistic approaches fail in approximating
the true distribution of future frames, Pottorff et al. [224]
reformulated the video prediction task without making any
assumption about the data distribution. They proposed the
Invertible Linear Embedding (ILE) enabling exact maximum
likelihood learning of video sequences, by combining an
invertible neural network [225], also known as reversible
flows, and a linear time-invariant dynamic system. The
ILE handles nonlinear motion in the pixel space and scales
better to longer-term predictions compared to adversarial
models [43].
While previous variational approaches [81], [108] focused on predicting a single frame of low resolution in
restricted, predictable or simulated datasets, Hu et al. [15]
jointly predict full-frame ego-motion, static scene, and object
dynamics on complex real-world urban driving. Featuring
a novel spatio-temporal module, their five-component architecture learns rich representations that incorporate both
local and global spatio-temporal context. Authors validated
the model on predicting semantic segmentation, depth and
optical flow, two seconds in the future outperforming existing spatio-temporal architectures. However, no performance
comparison with [81], [108] has been carried out.
6
P ERFORMANCE E VALUATION
This section presents the results of the previously analyzed
video prediction models on the most popular datasets on
the basis of the metrics described below.
6.1
Metrics and Evaluation Protocols
For a fair evaluation of video prediction systems, multiple
aspects in the prediction have to be addressed such as
whether the predicted sequences look realistic, are plausible and cover all possible outcomes. To the best of our
knowledge, there are no evaluation protocols and metrics
that evaluate the predictions by fulfilling simultaneously all
these aspects.
The most widely used evaluation protocols for video
prediction rely on image similarity-based metrics such
as, Mean-Squared Error (MSE), Structural Similarity Index
Measure (SSIM) [226], and Peak Signal to Noise Ratio
(PSNR). However, evaluating a prediction according to the
mismatch between its visual appearance and the ground
truth is not always reliable. In practice, these metrics penalize all predictions that deviate from the ground truth. In
other words, they prefer blurry predictions nearly accommodating the exact ground truth than sharper and plausible
but imperfect generations [97], [108], [227]. Pixel-wise metrics do not always reflect how accurate a model captured
video scene dynamics and their temporal variability. In
addition, the success of a metric is influenced by the loss
function used to train the model. For instance, the models
trained with MSE loss function would obviously perform
well on MSE metric, but also on PSNR metric as it is based
on MSE. Suffering from similar problems, SSIM measures
the similarity between two images, from −1 (very dissimilar) to +1 (the same image). As a difference, it measures
similarities on image patches instead of performing pixelwise comparison. These metrics are easily fooled by learning
to match the background in predictions. To address this
issue, Mathieu et al. [43] evaluated the predictions only on
the dynamic parts of the sequence, avoiding background
influence.
As the pixel space is multimodal and highlydimensional, it is challenging to evaluate how accurately
a prediction sequence covers the full distribution of possible outcomes. Addressing this issue, some probabilistic
approaches [81], [97], [108] adopted a different evaluation
protocol to assess prediction coverage. Basically, they sample multiple random predictions and then they search for
the best match with the ground truth sequence. Finally,
they report the best match using common metrics. This
represents the most common evaluation protocol for probabilistic video prediction. Other methods [97], [150], [151]
also reported results using: LPIPS [227] as a perceptual
18
TABLE 3: Results on M-MNIST (Moving MNIST). Predicting
the next y frames from x context frames (x → y ). † results
reported by Oliu et al. [153], ‡ results reported by Wang et al.
[66], ∗ results reported by Wang et al. [232], ⊳ results reported
by Wang et al. [233]. MSE represents per-pixel average MSE
(10−3 ). MSE⋄ represents per-frame error.
M-MNIST
(10 → 10)
method
BeyondMSE [43]
Srivastava et al. [74]
Shi et al. [13]
DFN [161]
CDNA [89]
VLN [234]
Patraucean et al. [77]
MCnet [65]†
RLN [235]†
PredNet [75]†
fRNN [153]
PredRNN [232]
VPN [95]
Znet [70]
PredRNN++ [233]
E3d-LSTM [66]
MSE
27.48†
17.37†
43.9
42.54
42.54
41.61
9.47
-
MSE⋄
SSIM
M-MNIST
(10 → 30)
PSNR
122.6∗ 0.713∗ 15.969†
118.3∗ 0.690∗ 18.183†
96.5‡ 0.713‡
89.0‡ 0.726‡
84.2‡ 0.728‡
13.857
13.857
13.968
68.4‡ 0.819‡ 21.386
56.8
0.867
64.1‡ 0.870‡
50.5
0.877
46.5
0.898
41.3 0.910
-
CE
MSE⋄
SSIM
341.2
367.2∗
285.2
346.6∗
187.7
179.8
97.0
87.6
-
180.1⊳
156.2⊳
149.5⊳
142.3⊳
0.583⊳
0.597⊳
0.601⊳
0.609⊳
129.6⊳
91.1
-
0.620⊳
0.733
-
metric comparing CNN features, or Frchet Video Distance
(FVD) [228] to measure sample realism by comparing underlying distributions of predictions and ground truth. Moreover, Lee et al. [108] used the VGG Cosine Similarity metric
that performs cosine similarity to the features extracted with
the VGGnet [146] from the predictions.
Some other alternative metrics include the inception
score [229] introduced to deal with GANs mode collapse
problem by measuring the diversity of generated samples;
perceptual similarity metrics, such as DeePSiM [44]; measuring sharpness based on difference of gradients [43];
Parzen window [230], yet deficient for high-dimensional
images; and the Laplacian of Gaussians (LoG) [60], [231]
used in [101]. In the semantic segmentation space, authors
used the popular Intersection over Union (IoU) metric.
Inception score was also widely used to report results on
different methods [54], [65], [67], [79]. Differently, on the
basis of the EPVA model [52] a quantitative evaluation was
performed, based on the confidence of an external method
trained to identify whether the generated video contains
a recognizable person. While some authors [10], [43], [56]
evaluated the performance only on the dynamic parts of
the image, other directly opted for visual human evaluation
through Amazon Mechanical Turk (AMT) workers, without
a direct quantitative evaluation.
6.2
TABLE 4: Results on KTH dataset. Predicting the next y
frames from x context frames (x → y ). † results reported
by Oliu et al. [153], ‡ results reported by Wang et al. [66], ∗
results reported by Zhang et al. [70], ⊳ results reported by
Jin et al. [150]. Per-pixel average MSE (10−3 ). Best results
are represented in bold.
Results
In this section we report the quantitative results of the
most relevant methods reviewed in the previous sections.
To achieve a wide comparison, we limited the quantitative
results to the most common metrics and datasets. We have
distributed the results in different tables, given the large
variation in the evaluation protocols of the video prediction
models.
Many authors evaluated their methods on the Moving
MNIST synthetic environment. Although it represents a
KTH
(10 → 10)
KTH
(10 → 20)
KTH
(10 → 40)
method
MSE
PSNR
SSIM
PSNR
SSIM
PSNR
Srivastava et al. [74]†
PredNet [75]†
BeyondMSE [43]†
fRNN [153]
MCnet [65]
RLN [235]†
Shi et al. [13]‡
SAVP [108]⊳
VPN [95]∗
DFN [161]‡
fRNN [153]‡
Znet [70]
SV2P invariant [38]⊳
SV2P variant [38]⊳
PredRNN [232]
VarNet [236]⊳
SAVP-VAE [108]⊳
PredRNN++ [233]
MSNET [237]
E3d-LSTM [66]
Jin et al. [150]
9.95
3.09
1.80
1.75
1.65†
1.39
-
21.22
28.42
29.34
29.299 0.771⊳
30.95† 0.804‡
31.27
0.712
0.746
0.746
0.794
0.771
0.817
0.826
0.838
0.839
0.843
0.852
0.865
0.876
0.879
0.893
26.12⊳
25.95‡
23.58
25.38
23.76
27.26
26.12
27.58
27.56
27.79
27.55
28.48
27.77
28.47
27.08
29.31
29.85
0.678⊳
0.73⊳
0.639
0.701
0.652
0.678
0.778
0.789
0.703‡
0.739
0.811
0.741‡
0.810
0.851
23.77⊳
23.89⊳
22.85
23.97
23.01
23.77
25.92
26.12
24.16‡
25.37
26.18
25.21‡
27.24
27.56
restricted and quasi-deterministic scenario, long-term predictions are still challenging. The black and homogeneous
background induce methods to accurately extrapolate black
frames and vanish the predicted digits in the long-term
horizon. Under this configuration, the E3d-LSTM network
demonstrated how their memory attention mechanism improved the performance over previous methods. Reported
errors remain stable in both short-term and longer-term predictions. Moreover, it also reported the second best results
on the KTH dataset, after [150] which achieved the best
overall performance and demonstrated quality predictions
on natural videos. E3d-LSTM was also tested on the TaxiBJ
dataset [190] comparing their method with [95], [153], [232],
[233].
Performing short-term predictions in the KTH dataset,
the Recurrent Ladder Network (RLN) outperformed MCnet
and fRNN by a slight margin. The RLN architecture draws
similarities with fRNN, except that the former uses bridge
connections and the latter, state sharing that improves
memory consumption. On the Moving MNIST and UCF101
datasets, fRNN outperformed RLN. Other interesting methods to highlight are PredRNN and PredRNN++, both providing close results to E3d-LSTM. State-of-the-art results
using different metrics were reported on Caltech Pedestrian
by Kwon et al. [101] and Jin et al. [150]. The former —as
its retrospective prediction scheme represented a leap over
the previous state-of-the-art— was also the overall winner
on the UCF-101 dataset meanwhile the latter outperformed
previous methods on the BAIR Push dataset.
On the one hand, some approaches have been evaluated
19
TABLE 5: Results on Caltech Pedestrian. Predicting the next
y frames from x context frames (x → y ). † reported by Kwon
et al. [101], ‡ reported by Reda et al. [154], ∗ reported by Gao
et al. [166], ⊳ reported by Jin et al. [150]. Per-pixel average
MSE (10−3 ). Best results are represented in bold.
SM-MNIST
(5 → 10)
Caltech Pedestrian
(10 → 1)
method
method
MSE
SSIM
PSNR
LPIPS
BeyondMSE [43]‡
MCnet [65]‡
DVF [7]∗
Dual-GAN [55]
CtrlGen [142]∗
PredNet [75]†
ContextVP [76]
GAN-VGG [151]
G-VGG [151]
SDC-Net [154]
Kwon et al. [101]
DPG [166]
G-MAE [151]
GAN-MAE [151]
Jin et al. [150]
3.42
2.50
2.41
2.42
1.94
1.62
1.61
−
-
0.847
0.879
0.897
0.899
0.900
0.905
0.921
0.916
0.917
0.918
0.919
0.923
0.923
0.923
0.927
26.2
26.5
27.6
28.7
29.2
28.2
29.1
5.57⊳
6.38⊳
7.47⊳
6.03⊳
3.61
3.52
5.04⊳
4.30
4.09
5.89
TABLE 6: Results on UCF-101 dataset. Predicting the next x
frames from y context frames (x → y ). † results reported by
Oliu et al. [153]. Per-pixel average MSE (10−3 ). Best results
are represented in bold.
UCF-101
(10 → 10)
method
Srivastava et al. [74]†
PredNet [75]†
BeyondMSE [43]†
MCnet [65]
RLN [235]†
fRNN [153]
BeyondMSE [43]
Dual-GAN [55]
DVF [7]
ContextVP [76]
Kwon et al. [101]
UCF-101
(4 → 1)
MSE
PSNR
MSE
SSIM
PSNR
148.66
15.50
9.26
9.40†
9.18
9.08
-
10.02
19.87
22.78
23.46†
23.56
23.87
-
1.37
0.91
0.92
0.94
0.94
0.92
0.94
31.0
32
30.5
33.4
34.9
35.0
on other datasets: SDC-Net [154] outperformed [43], [65] on
YouTube8M, TrIVD-GAN-FP outperformed [162], [238] on
Kinetics-600 test set [198]. On the other hand, some explored
out-of-domain tasks [13], [66], [102], [161] (see ood column
in Table 2).
6.2.1
TABLE 7: Results on SM-MNIST (Stochastic Moving
MNIST), BAIR Push and Cityscapes datasets. † results reported by Castrejon et al. [97]. ‡ results reported by Jin et al.
[150].
Results on Probabilistic Approaches
Video prediction probabilistic methods have been mainly
evaluated on the Stochastic Moving MNIST, Bair Push and
Cityscapes datasets. Different from the original Moving
MNIST dataset, the stochastic version includes uncertain
digit trajectories, i.e. the digits bounce off the border with
a random new direction. On this dataset, both versions of
Castrejon et al. models (1L, without a hierarchy of latents,
and 3L with a 3-level hierarchy of latents) outperform SVG
SVG [81]
SAVP [108]
SAVP-VAE [108]
SV2P inv. [38]‡
vRNN 1L [97]
vRNN 3L [97]
Jin et al. [150]
BAIR Push
(2 → 28)
Cityscapes
(2 → 28)
FVD
SSIM
FVD
SSIM
PSNR
FVD
SSIM
90.81†
63.81
57.17
-
0.688†
0.763
0.760
-
256.62†
143.43†
149.22
143.40
-
0.816†
0.795†
0.815‡
0.817
0.829
0.822
0.844
17.72‡
18.42‡
19.09‡
20.36
21.02
1300.26†
682.08
567.51
-
0.574†
0.609
0.628
-
TABLE 8: Results on Cityscapes dataset. Predicting the next
y time-steps of semantic segmented frames from 4 context
frames (4 → y ). ‡ IoU results on eight moving objects
classes. † results reported by Chiu et al. [173]
(4 → 1)
method
S2S [56]‡
S2S-maskRCNN [8]‡
S2S [56]
Nabavi et al. [82]
F2F [8]
Vora et al. [83]
S2S-Res101-FCN [174]
Terwilliger et al. [10]‡
Chiu et al. [173]
Jin et al. [174]
Bayes-WD-SL [9]
Terwilliger et al. [10]
Cityscapes
(4 → 3) (4 → 9)
(4 → 10)
IoU
IoU
IoU
IoU
62.60‡
71.37
72.43
75.3
73.2
55.3
55.4
59.4
60.06
61.2
61.47
62.6
65.1
65.53
66.1
66.7
67.1
40.8
42.4
47.8
41.2
45.4
46.3
50.52
52.5
51.5
50.8
-
53.9
52.5
by a large margin. On the Bair Push dataset, SAVP reported
sharper and more realistic-looking predictions than SVG
which suffer of blurriness. However, both models were
outperformed by [97] as well on the Cityscapes dataset.
The model based on a 3-level hierarchy of latents [97]
outperform previous works on all three datasets, showing
the advantages of the extra expressiveness of this model.
6.2.2 Results on the High-level Prediction Space
Most of the methods have chosen the semantic segmentation
space to make predictions. Although they relied on different datasets for training, performance results were mostly
reported on the Cityscapes dataset using the IoU metric.
Authors explored short-term (next-frame prediction), midterm (+3 time steps in the future) and long-term (up to
+10 time step in the future) predictions. On the semantic
segmentation prediction space, Bayes-WD-SL [9], the model
proposed by Terwilliger et al. [10], and Jin et al. [51] reported
the best results. Among these methods, it is noteworthy
that Bayes-WD-SL was the only one to explore prediction
diversity on the basis of a Bayesian formulation.
In the instance segmentation space, the F2F pioneering
method [8] was outperformed by Sun et al. [84] on short and
mid-term predictions using the AP50 and AP evaluation
metrics. On the other hand, in the keypoint coordinate
20
space, the seminal model of Minderer et al. [85] qualitatively
outperforms SVG [81], SAVP [108] and EPVA [52], yet pixelwise metrics reported similar results. In the human pose
space, Tang et al. [184] by regressing future frames from human pose predictions outperformed SAVP [108], MCnet [65]
and [53] on the basis of the PSNR and SSIM metrics on the
Penn Action and J-HMDB [114] datasets.
7
D ISCUSSION
The video prediction literature ranges from a direct synthesis of future pixel intensities, to complex probabilistic
models addressing prediction uncertainty. The range between these approaches consists of methods that try to
factorize or narrow the prediction space. Simplifying the
prediction task has been a natural evolution of video prediction models, influenced by several open research challenges
discussed below. Due to the curse of dimensionality and
the inherent pixel variability, developing a robust prediction
based on raw pixel intensities is overly-complicated. This
often leads to the regression-to-the-mean problem, visually
represented as blurriness. Making parametric models larger
would improve the quality of predictions, yet this is currently incompatible with high-resolution predictions due
to memory constraints. Transformation-based approaches
propagate pixels from previous frames based on estimated
flow maps. In this case, prediction quality is directly influenced by the accuracy of the estimated flow. Similarly,
the prediction in a high-level space is mostly conditioned
by the quality of some extra supervisory signals such as
semantic maps and human poses, to name a few. Erroneous
supervision signals would harm prediction quality.
Analyzing the impact of the inductive bias on the performance of a video prediction model, Villegas et al. [222]
demonstrated the maximization of the SVG model [81]
performance with minimal inductive bias (e.g. segmentation
or instance maps, optical flow, adversarial losses, etc.) by increasing progressively the scale of computation. A common
assumption when addressing the prediction task in a highlevel feature space, is the direct improvement of long-term
predictions as a result of simplifying the prediction space.
Even if the complexity of the prediction space is reduced,
it is still multimodal when dealing with natural videos.
For instance, when it comes to long-term predictions in the
semantic segmentation space, most of the models reported
predictions only up to ten time steps into the future. This
directly suggests that the choice of the prediction space is
still an unsolved problem. Finding a trade-off between the
complexity of the prediction space and the output quality is
challenging. An overly-simplified representation could limit
the prediction on complex data such as natural videos. Although abstract predictions suffice for many of the decisionmaking systems based on visual reasoning, prediction in
pixel space is still being addressed.
From the analysis performed in this review and in line
with the conclusions extracted from [222] we state that:
(1) including recurrent connections and stochasticity in a
video prediction model generally lead to improved performance; (2) increasing model capacity while maintaining a
low inductive bias also improves prediction performance;
(3) multi-step predictions conditioned by previously generated outputs are prone to accumulate errors, diverging
from the ground truth when addressing long-term horizons; (4) authors predicted further in the future without
relying on high-level feature spaces; (5) combining pixelwise losses with adversarial training somewhat mitigates
the regression-to-the-mean issue.
7.1
Research Challenges
Despite the wealth of currently existing video prediction
approaches and the significant progress made in this field,
there is still room to improve state-of-the-art algorithms. To
foster progress, open research challenges must be clearly
identified and disentangled. So far in this review, we have
already discussed about: (1) the importance of spatiotemporal correlations as a self-supervisory signal for predictive models; (2) how to deal with future uncertainty and
model the underlying multimodal distributions of natural
videos; (3) the over-complicated task of learning meaningful
representations and deal with the curse of dimensionality;
(4) pixel-wise loss functions and blurry results when dealing
with equally probable outcomes, i.e. probabilistic environments. These issues define the open research challenges in
video prediction.
Currently existing methods are limited to short-term
horizons. While frames in the immediate future are extrapolated with high accuracy, in the long term horizon
the prediction problem becomes multimodal by nature.
Initial solutions consisted on conditioning the prediction on
previously predicted frames. However, these autoregressive
models tend to accumulate prediction errors that progressively diverge the generated prediction from the expected
outcome. On the other hand, due to memory issues, there
is a lack of resolution in predictions. Authors tried to
address this issue by composing the full-resolution image
from small predicted patches. However, as the results are
not convincing because of the annoying tilling effect, most
of the available models are still limited to low-resolution
predictions. In addition to the lack of resolution and longterm predictions, models are still prone to the regress-to-themean problem that consists on averaging the output frame
to accommodate multiple equally probable outcomes. This
is directly related to the pixel-wise loss functions, that focus
the learning process on the visual appearance. The choice
of the loss function is an open research problem with a
direct influence on the prediction quality. Finally, the lack
of reliable and fair evaluation models makes the qualitative
evaluation of video prediction challenging and represents
another potential open problem.
7.2
Future Directions
Based on the reviewed research identifying the state-ofthe-art video prediction methods, we present some future
promising research directions.
Consider alternative loss functions: Pixel-wise loss functions are widely used in the video prediction task, causing
blurry predictions when dealing with uncontrolled environments or long-term horizon. In this regard, great efforts have
been made in the literature for identifying more suitable
21
loss functions for the prediction task. However, despite the
existing wide spectrum of loss functions, most models still
blindly rely on deterministic loss functions.
strides have been made, there is still room for improvement
in video prediction using deep learning techniques.
Alternatives to RNNs: Currently, RNNs are still widely
used in this field to model temporal dependencies, and
achieved state-of-the-art results on different benchmarks
[66], [153], [232], [233]. Nevertheless, some methods also
relied on 3D convolutions to further enhance video prediction [66], [173] representing a promising avenue.
ACKNOWLEDGMENTS
Use synthetically generated videos: Simplifying the prediction is a current trend in the video prediction literature. A
vast amount of video prediction models explored higherlevel features spaces to reformulate the prediction task into
a more tractable problem. However, this mostly conditions
the prediction to the accuracy of an external source of supervision such as optical flow, human pose, pre-activations
(percepts) extracted from supervised networks, and more.
However, this issue could be alleviated by taking advantage of existing fully-annotated and photorealistic synthetic
datasets or by using data generation tools. Video prediction
in photorealistic synthetic scenarios has not been explored
in the literature.
Evaluation metrics: Since the most widely used evaluation
protocols for video prediction rely on image similaritybased metrics, the need for fairer evaluation metrics is
imminent. A fair metric should not penalize predictions that
deviate from the ground truth at the pixel level, if their
content represents a plausible future prediction in a higher
level, i.e., the dynamics of the scene correspond to the reality
of the labels. In this regard, some methods evaluate the
similarity between distributions or at a higher-level. However, there is still room for improvement in the evaluation
protocols for video prediction and generation [239].
This work has been funded by the Spanish Government
PID2019-104818RB-I00 grant for the MoDeaAS project. This
work has also been supported by two Spanish national
grants for PhD studies, FPU17/00166, and ACIF/2018/197
respectively. Experiments were made possible by a generous
hardware donation from NVIDIA.
R EFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
8
C ONCLUSION
In this review, after reformulating the predictive learning
paradigm in the context of video prediction, we have closely
reviewed the fundamentals on which it is based: exploiting
the time dimension of videos, dealing with stochasticity, and
the importance of the loss functions in the learning process. Moreover, an analysis of the backbone deep learningbased architectures for this task was performed in order to
provide the reader the necessary background knowledge.
The core of this study encompasses the analysis and classification of more than 50 methods and the datasets they
have used. Methods were analyzed from three perspectives:
method description, contribution over the previous works
and performance results. They have also been classified
according to a proposed taxonomy based on their main
contribution. In addition, we have presented a comparative
summary of the datasets and methods in tabular form so
as the reader, at a glance, could identify low-level details.
In the end, we have discussed the performance results on
the most popular datasets and metrics to finally provide
useful insight in shape of future research directions and
open problems. In conclusion, video prediction is a promising avenue for the self-supervised learning of rich spatiotemporal correlations to provide prediction capabilities to
existing intelligent decision-making systems. While great
[9]
[10]
[11]
[12]
[13]
[14]
[15]
M. H. Nguyen and F. D. la Torre, “Max-margin early event
detectors,” in 2012 IEEE Conference on Computer Vision and Pattern
Recognition, Providence, RI, USA, June 16-21, 2012, 2012, pp. 2863–
2870.
K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, “Activity
forecasting,” in Computer Vision - ECCV 2012 - 12th European
Conference on Computer Vision, Florence, Italy, October 7-13, 2012,
Proceedings, Part IV, 2012, pp. 201–214.
C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual
representations from unlabeled video,” in 2016 IEEE Conference
on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas,
NV, USA, June 27-30, 2016, 2016, pp. 98–106.
K. Zeng, W. B. Shen, D. Huang, M. Sun, and J. C. Niebles, “Visual
forecasting by imitating dynamics in natural sequences,” in IEEE
International Conference on Computer Vision, ICCV 2017, Venice,
Italy, October 22-29, 2017, 2017, pp. 3018–3027.
S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, and A. Shashua,
“Long-term planning by short-term prediction,” CoRR, vol.
abs/1602.01580, 2016.
O. Makansi, E. Ilg, O. Cicek, and T. Brox, “Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction,” in The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June 2019.
Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, “Video frame
synthesis using deep voxel flow,” in IEEE International Conference
on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017,
2017, pp. 4473–4481.
P. Luc, C. Couprie, Y. LeCun, and J. Verbeek, “Predicting future
instance segmentation by forecasting convolutional features,” in
Computer Vision - ECCV 2018 - 15th European Conference, Munich,
Germany, September 8-14, 2018, Proceedings, Part IX, 2018, pp. 593–
608.
A. Bhattacharyya, M. Fritz, and B. Schiele, “Bayesian prediction
of future street scenes using synthetic likelihoods,” in ICLR
(Poster). OpenReview.net, 2019.
A. Terwilliger, G. Brazil, and X. Liu, “Recurrent flow-guided
semantic forecasting,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Jan 2019, pp. 1703–1712.
A. Bhattacharyya, M. Fritz, and B. Schiele, “Long-term on-board
prediction of people in traffic scenes under uncertainty,” in 2018
IEEE Conference on Computer Vision and Pattern Recognition, CVPR
2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 4194–
4202.
W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for
anomaly detection - A new baseline,” in CVPR. IEEE Computer
Society, 2018, pp. 6536–6545.
X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo,
“Convolutional LSTM network: A machine learning approach
for precipitation nowcasting,” in Advances in Neural Information
Processing Systems 28: Annual Conference on Neural Information
Processing Systems 2015, December 7-12, 2015, Montreal, Quebec,
Canada, 2015, pp. 802–810.
X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-k. Wong,
and W.-c. WOO, “Deep learning for precipitation nowcasting: A
benchmark and a new model,” in Advances in Neural Information
Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran
Associates, Inc., 2017, pp. 5617–5627.
A. Hu, F. Cotter, N. Mohan, C. Gurau, and A. Kendall, “Probabilistic future prediction for video scene understanding,” CoRR,
vol. abs/2003.06409, 2020.
22
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
A. Garcia-Garcia, P. Martinez-Gonzalez, S. Oprea, J. A. CastroVargas, S. Orts-Escolano, J. Garcia-Rodriguez, and A. JoverAlvarez, “The robotrix: An extremely photorealistic and verylarge-scale indoor dataset of sequences with robot trajectories
and interactions,” in 2018 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 6790–6797.
Y. Kong and Y. Fu, “Human action recognition and prediction: A
survey,” CoRR, vol. abs/1806.11230, 2018.
C. Sahin, G. Garcia-Hernando, J. Sock, and T. Kim, “A review on
object pose recovery: from 3d bounding box detectors to full 6d
pose estimators,” CoRR, vol. abs/2001.10609, 2020.
V. Villena-Martinez, S. Oprea, M. Saval-Calvo, J. A. López,
A. F. Guilló, and R. B. Fisher, “When deep learning meets
data alignment: A review on deep registration networks
(drns),” CoRR, vol. abs/2003.03167, 2020. [Online]. Available:
https://arxiv.org/abs/2003.03167
Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature,
vol. 521, no. 7553, pp. 436–444, 2015.
J. Hawkins and S. Blakeslee, On Intelligence. USA: Times Books,
2004.
R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: a functional interpretation of some extra-classical
receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, pp. 79–
87, 1999.
D. Mumford, “On the computational architecture of the neocortex,” Biological Cybernetics, vol. 66, no. 3, pp. 241–251, 1992.
A. Cleeremans and J. L. McClelland, “Learning the structure of
event sequences.” Journal of Experimental Psychology: General, vol.
120, no. 3, p. 235, 1991.
A. Cleeremans and J. Elman, Mechanisms of implicit learning:
Connectionist models of sequence processing. MIT press, 1993.
R. Baker, M. Dexter, T. E. Hardwicke, A. Goldstone, and
Z. Kourtzi, “Learning to predict: Exposure to temporal sequences
facilitates prediction of future events,” Vision research, vol. 99, pp.
124–133, 2014.
H. E. M. den Ouden, P. Kok, and F. P. de Lange, “How prediction
errors shape perception, attention, and motivation,” in Front.
Psychology, 2012.
W. R. Softky, “Unsupervised pixel-prediction,” in Advances in
Neural Information Processing Systems 8, NIPS, Denver, CO, USA,
November 27-30, 1995, 1995, pp. 809–815.
G. Deco and B. Schürmann, “Predictive coding in the visual
cortex by a recurrent network with gabor receptive fields,” Neural
Processing Letters, vol. 14, no. 2, pp. 107–114, 2001.
A. Hollingworth, “Constructing visual representations of natural
scenes: the roles of short- and long-term visual memory.” Journal
of experimental psychology. Human perception and performance, vol.
30 3, pp. 519–37, 2004.
Y. Bengio, A. C. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in 2015 IEEE International Conference on
Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015,
2015, pp. 2794–2802.
P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in 2015 IEEE International Conference on Computer Vision,
ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 37–45.
D.-A. Huang, V. Ramanathan, D. Mahajan, L. Torresani,
M. Paluri, L. Fei-Fei, and J. Carlos Niebles, “What makes a video
a video: Analyzing temporal information in video understanding
models and datasets,” in The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), June 2018.
L. C. Pickup, Z. Pan, D. Wei, Y. Shih, C. Zhang, A. Zisserman,
B. Schölkopf, and W. T. Freeman, “Seeing the arrow of time,” in
2014 IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2014, Columbus, OH, USA, June 23-28, 2014, 2014, pp. 2043–
2050.
D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning
and using the arrow of time,” in CVPR. IEEE Computer Society,
2018, pp. 8052–8060.
I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: Unsupervised learning using temporal order verification,” in Computer
Vision - ECCV 2016 - 14th European Conference, Amsterdam, The
Netherlands, October 11-14, 2016, Proceedings, Part I, 2016, pp. 527–
544.
[38]
[39]
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
[48]
[49]
[50]
[51]
[52]
[53]
[54]
[55]
[56]
[57]
M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and
S. Levine, “Stochastic variational video prediction,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver,
BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
OpenReview.net, 2018.
H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Trans. Computational
Imaging, vol. 3, no. 1, pp. 47–57, 2017.
K. Janocha and W. M. Czarnecki, “On loss functions for deep
neural networks in classification,” CoRR, vol. abs/1702.05659,
2017.
A. Kendall and R. Cipolla, “Geometric loss functions for camera
pose regression with deep learning,” in The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), July 2017.
J.-J. Hwang, T.-W. Ke, J. Shi, and S. X. Yu, “Adversarial structure
matching for structured prediction tasks,” in The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June 2019.
M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video
prediction beyond mean square error,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico,
May 2-4, 2016, Conference Track Proceedings, 2016.
A. Dosovitskiy and T. Brox, “Generating images with perceptual
similarity metrics based on deep networks,” in NIPS, 2016, pp.
658–666.
J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for realtime style transfer and super-resolution,” in ECCV (2), ser. Lecture Notes in Computer Science, vol. 9906. Springer, 2016, pp.
694–711.
C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham,
A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi,
“Photo-realistic single image super-resolution using a generative
adversarial network,” in CVPR. IEEE Computer Society, 2017,
pp. 105–114.
M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Single
image super-resolution through automated texture synthesis,” in
ICCV. IEEE Computer Society, 2017, pp. 4501–4510.
J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative
visual manipulation on the natural image manifold,” in ECCV
(5), ser. Lecture Notes in Computer Science, vol. 9909. Springer,
2016, pp. 597–613.
W. Lotter, G. Kreiman, and D. D. Cox, “Unsupervised learning
of visual structure using predictive generative networks,” CoRR,
vol. abs/1511.06380, 2015.
X. Chen, W. Wang, J. Wang, and W. Li, “Learning object-centric
transformation for video prediction,” in Proceedings of the 25th
ACM International Conference on Multimedia, ser. MM ’17. New
York, NY, USA: ACM, 2017, pp. 1503–1512.
X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong,
L. Liu, Z. Jie, J. Feng, and S. Yan, “Video scene parsing with
predictive feature learning,” in IEEE International Conference on
Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017,
pp. 5581–5589.
N. Wichers, R. Villegas, D. Erhan, and H. Lee, “Hierarchical
long-term video prediction without supervision,” in ICML, ser.
Proceedings of Machine Learning Research, vol. 80. PMLR, 2018,
pp. 6033–6041.
R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, “Learning
to generate long-term future via hierarchical prediction,” in Proceedings of the 34th International Conference on Machine Learning,
ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp.
3560–3569.
J. Walker, K. Marino, A. Gupta, and M. Hebert, “The pose
knows: Video forecasting by generating pose futures,” in IEEE
International Conference on Computer Vision, ICCV 2017, Venice,
Italy, October 22-29, 2017, 2017, pp. 3352–3361.
X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion GAN
for future-flow embedded video prediction,” in IEEE International
Conference on Computer Vision, ICCV 2017, Venice, Italy, October 2229, 2017. IEEE Computer Society, 2017, pp. 1762–1770.
P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun,
“Predicting deeper into the future of semantic segmentation,”
in IEEE International Conference on Computer Vision, ICCV 2017,
Venice, Italy, October 22-29, 2017, 2017, pp. 648–657.
Z. Hu and J. Wang, “A novel adversarial inference framework for
video prediction with action control,” in The IEEE International
Conference on Computer Vision (ICCV) Workshops, Oct 2019.
23
[58]
[59]
[60]
[61]
[62]
[63]
[64]
[65]
[66]
[67]
[68]
[69]
[70]
[71]
[72]
[73]
[74]
[75]
[76]
[77]
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proceedings of the
IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
V. Jain, J. F. Murray, F. Roth, S. C. Turaga, V. P. Zhigulin, K. L. Briggman, M. Helmstaedter, W. Denk, and H. S. Seung, “Supervised
learning of image restoration with convolutional networks,” in
ICCV. IEEE Computer Society, 2007, pp. 1–8.
E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a laplacian pyramid of adversarial
networks,” in Advances in Neural Information Processing Systems
28: Annual Conference on Neural Information Processing Systems
2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp.
1486–1494.
F. Yu, V. Koltun, and T. A. Funkhouser, “Dilated residual networks,” in CVPR. IEEE Computer Society, 2017, pp. 636–644.
L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
“Deeplab: Semantic image segmentation with deep convolutional
nets, atrous convolution, and fully connected crfs,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018.
W. Luo, Y. Li, R. Urtasun, and R. S. Zemel, “Understanding the
effective receptive field in deep convolutional neural networks,”
in Advances in Neural Information Processing Systems 29: Annual
Conference on Neural Information Processing Systems 2016, December
5-10, 2016, Barcelona, Spain, 2016, pp. 4898–4906.
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in 2016 IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 2730, 2016. IEEE Computer Society, 2016, pp. 770–778.
R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “Decomposing
motion and content for natural video sequence prediction,” in
5th International Conference on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
Y. Wang, L. Jiang, M.-H. Yang, L.-J. Li, M. Long, and L. Fei-Fei,
“Eidetic 3d LSTM: A model for video prediction and beyond,” in
International Conference on Learning Representations, 2019.
C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos
with scene dynamics,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing
Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 613–
621.
S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, “Mocogan: Decomposing motion and content for video generation,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR),
June 2018.
S. Aigner and M. Körner, “Futuregan: Anticipating the future
frames of video sequences using spatio-temporal 3d convolutions in progressively growing autoencoder gans,” CoRR, vol.
abs/1810.01325, 2018.
J. Zhang, Y. Wang, M. Long, W. Jianmin, and P. S. Yu, “Z-order
recurrent neural networks for video prediction,” in 2019 IEEE
International Conference on Multimedia and Expo (ICME), July 2019,
pp. 230–235.
J. R. van Amersfoort, A. Kannan, M. Ranzato, A. Szlam, D. Tran,
and S. Chintala, “Transformation-based models of video sequences,” CoRR, vol. abs/1701.08435, 2017.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning
representations by back-propagating errors,” Nature, vol. 323, no.
6088, pp. 533–536, 1986.
M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and
S. Chopra, “Video (language) modeling: a baseline for generative
models of natural videos,” CoRR, vol. abs/1412.6604, 2014.
N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using lstms,” in Proceedings of the 32nd International Conference on Machine Learning, ICML
2015, Lille, France, 6-11 July 2015, 2015, pp. 843–852.
W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding
networks for video prediction and unsupervised learning,” in
5th International Conference on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
W. Byeon, Q. Wang, R. K. Srivastava, and P. Koumoutsakos,
“Contextvp: Fully context-aware video prediction,” in 2018 IEEE
Conference on Computer Vision and Pattern Recognition Workshops,
CVPR Workshops 2018, Salt Lake City, UT, USA, June 18-22, 2018.
IEEE Computer Society, 2018, pp. 1122–1126.
V. Patraucean, A. Handa, and R. Cipolla, “Spatio-temporal
video autoencoder with differentiable memory,” CoRR, vol.
abs/1511.06309, 2015.
[78]
[79]
[80]
[81]
[82]
[83]
[84]
[85]
[86]
[87]
[88]
[89]
[90]
[91]
[92]
[93]
[94]
[95]
[96]
[97]
[98]
C. Lu, M. Hirsch, and B. Schölkopf, “Flexible spatio-temporal networks for video prediction,” in 2017 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July
21-26, 2017, 2017, pp. 2137–2145.
E. L. Denton and V. Birodkar, “Unsupervised learning of disentangled representations from video,” in Advances in Neural
Information Processing Systems 30: Annual Conference on Neural
Information Processing Systems 2017, 4-9 December 2017, Long Beach,
CA, USA, 2017, pp. 4417–4426.
J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. P. Singh, “Actionconditional video prediction using deep networks in atari
games,” in Advances in Neural Information Processing Systems 28:
Annual Conference on Neural Information Processing Systems 2015,
December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 2863–
2871.
E. Denton and R. Fergus, “Stochastic video generation with a
learned prior,” in Proceedings of the 35th International Conference
on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm,
Sweden, July 10-15, 2018, ser. Proceedings of Machine Learning
Research, J. G. Dy and A. Krause, Eds., vol. 80. PMLR, 2018, pp.
1182–1191.
S. shahabeddin Nabavi, M. Rochan, and Y. Wang, “Future semantic segmentation with convolutional LSTM,” in British Machine Vision Conference 2018, BMVC 2018, Northumbria University,
Newcastle, UK, September 3-6, 2018, 2018, p. 137.
S. Vora, R. Mahjourian, S. Pirk, and A. Angelova, “Future segmentation using 3d structure,” CoRR, vol. abs/1811.11358, 2018.
J. Sun, J. Xie, J. Hu, Z. Lin, J. Lai, W. Zeng, and W. Zheng, “Predicting future instance segmentation with contextual pyramid
convlstms,” in ACM Multimedia. ACM, 2019, pp. 2043–2051.
M. Minderer, C. Sun, R. Villegas, F. Cole, K. P. Murphy, and
H. Lee, “Unsupervised learning of object structure and dynamics
from videos,” in NeurIPS, 2019, pp. 92–102.
S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations
using RNN encoder-decoder for statistical machine translation,”
CoRR, vol. abs/1406.1078, 2014.
A. Graves, S. Fernández, and J. Schmidhuber, “Multidimensional recurrent neural networks,” in ICANN (1), ser. Lecture Notes in Computer Science, vol. 4668. Springer, 2007, pp.
549–558.
C. Finn, I. J. Goodfellow, and S. Levine, “Unsupervised learning
for physical interaction through video prediction,” in Advances
in Neural Information Processing Systems 29: Annual Conference on
Neural Information Processing Systems 2016, December 5-10, 2016,
Barcelona, Spain, 2016, pp. 64–72.
E. Zhan, S. Zheng, Y. Yue, L. Sha, and P. Lucey, “Generating
multi-agent trajectories using programmatic weak supervision,”
in ICLR (Poster). OpenReview.net, 2019.
A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel
recurrent neural networks,” in Proceedings of the 33nd International
Conference on Machine Learning, ICML 2016, New York City, NY,
USA, June 19-24, 2016, 2016, pp. 1747–1756.
R. M. Neal, “Connectionist learning of belief networks,” Artif.
Intell., vol. 56, no. 1, pp. 71–113, 1992.
Y. Bengio and S. Bengio, “Modeling high-dimensional discrete
data with multi-layer neural networks,” in Proceedings of the 12th
International Conference on Neural Information Processing Systems,
ser. NIPS99. Cambridge, MA, USA: MIT Press, 1999, p. 400406.
A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu,
O. Vinyals, and A. Graves, “Conditional image generation with
pixelcnn decoders,” in NIPS, 2016, pp. 4790–4798.
N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka,
O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” CoRR, vol. abs/1610.00527, 2016.
K. Fragkiadaki, J. Huang, A. Alemi, S. Vijayanarasimhan,
S. Ricco, and R. Sukthankar, “Motion prediction under multimodality with conditional stochastic networks,” CoRR, vol.
abs/1705.02082, 2017.
L. Castrejon, N. Ballas, and A. Courville, “Improved conditional
vrnns for video prediction,” in The IEEE International Conference
on Computer Vision (ICCV), October 2019.
J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and
Y. Bengio, “A recurrent latent variable model for sequential data,”
in NIPS, 2015, pp. 2980–2988.
24
[99]
[100]
[101]
[102]
[103]
[104]
[105]
[106]
[107]
[108]
[109]
[110]
[111]
[112]
[113]
[114]
[115]
[116]
[117]
[118]
[119]
[120]
M. Henaff, J. J. Zhao, and Y. LeCun, “Prediction under uncertainty with error-encoding networks,” CoRR, vol.
abs/1711.04994, 2017.
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative
adversarial networks,” CoRR, vol. abs/1406.2661, 2014.
Y.-H. Kwon and M.-G. Park, “Predicting future frames using
retrospective cycle gan,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2019.
C. Vondrick and A. Torralba, “Generating the future with adversarial transformers,” in 2017 IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26,
2017, 2017, pp. 2992–3000.
Y. Zhou and T. L. Berg, “Learning temporal transformations
from time-lapse videos,” in Computer Vision - ECCV 2016 - 14th
European Conference, Amsterdam, The Netherlands, October 11-14,
2016, Proceedings, Part VIII, 2016, pp. 262–277.
P. Bhattacharjee and S. Das, “Temporal coherency based criteria
for predicting video frames using deep multi-stage generative
adversarial networks,” in NIPS, 2017, pp. 4268–4277.
M. Saito, E. Matsumoto, and S. Saito, “Temporal generative
adversarial nets with singular value clipping,” in ICCV. IEEE
Computer Society, 2017, pp. 2849–2858.
B. Chen, W. Wang, and J. Wang, “Video imagination from a single image with transformation generation,” in ACM Multimedia
(Thematic Workshops). ACM, 2017, pp. 358–366.
M. Mirza and S. Osindero, “Conditional generative adversarial
nets,” CoRR, vol. abs/1411.1784, 2014.
A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and
S. Levine, “Stochastic adversarial video prediction,” CoRR, vol.
abs/1804.01523, 2018.
A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial
networks,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference
Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016.
M. Arjovsky and L. Bottou, “Towards principled methods for
training generative adversarial networks,” in ICLR. OpenReview.net, 2017.
C. Schüldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in ICPR (3). IEEE Computer
Society, 2004, pp. 32–36.
L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri,
“Actions as space-time shapes,” Transactions on Pattern Analysis
and Machine Intelligence, vol. 29, no. 12, pp. 2247–2253, December
2007.
H. Kuehne, H. Jhuang, E. Garrote, T. A. Poggio, and T. Serre,
“HMDB: A large video database for human motion recognition,”
in IEEE International Conference on Computer Vision, ICCV 2011,
Barcelona, Spain, November 6-13, 2011, 2011, pp. 2556–2563.
H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, “Towards
understanding action recognition,” in ICCV. IEEE Computer
Society, 2013, pp. 3192–3199.
K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of
101 human actions classes from videos in the wild,” CoRR, vol.
abs/1212.0402, 2012.
W. Zhang, M. Zhu, and K. G. Derpanis, “From actemes to
action: A strongly-supervised representation for detailed action
understanding,” in ICCV. IEEE Computer Society, 2013, pp.
2248–2255.
C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6m: Large scale datasets and predictive methods for 3d
human sensing in natural environments,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 36, no. 7, pp. 1325–1339, 2014.
H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah, “The THUMOS challenge on action recognition for videos ”in the wild”,” Computer Vision and Image
Understanding, vol. 155, pp. 1–23, 2017.
P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in 2009 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June
2009, Miami, Florida, USA. IEEE Computer Society, 2009, pp.
304–311.
A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets
robotics: The kitti dataset,” The International Journal of Robotics
Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[121] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,
R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes
dataset for semantic urban scene understanding,” in Proc. of
the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2016.
[122] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin,
and R. Yang, “The apolloscape dataset for autonomous driving,”
arXiv: 1803.06184, 2018.
[123] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
F. Li, “Large-scale video classification with convolutional neural
networks,” in 2014 IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014,
2014, pp. 1725–1732.
[124] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici,
B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8m: A largescale video classification benchmark,” CoRR, vol. abs/1609.08675,
2016.
[125] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni,
D. Poland, D. Borth, and L. Li, “YFCC100M: the new data in
multimedia research,” Commun. ACM, vol. 59, no. 2, pp. 64–73,
2016.
[126] I. Sutskever, G. E. Hinton, and G. W. Taylor, “The recurrent
temporal restricted boltzmann machine,” in NIPS.
Curran
Associates, Inc., 2008, pp. 1601–1608.
[127] C. F. Cadieu and B. A. Olshausen, “Learning intermediate-level
representations of form and motion from natural movies,” Neural
Computation, vol. 24, no. 4, pp. 827–866, 2012.
[128] R. Memisevic and G. Exarchakis, “Learning invariant features
by harnessing the aperture problem,” in Proceedings of the 30th
International Conference on Machine Learning, ICML 2013, Atlanta,
GA, USA, 16-21 June 2013, ser. JMLR Workshop and Conference
Proceedings, vol. 28. JMLR.org, 2013, pp. 100–108.
[129] F. Ebert, C. Finn, A. X. Lee, and S. Levine, “Self-supervised
visual planning with temporal skip connections,” in CoRL, ser.
Proceedings of Machine Learning Research, vol. 78. PMLR, 2017,
pp. 344–356.
[130] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper,
S. Singh, S. Levine, and C. Finn, “Robonet: Large-scale multirobot learning,” CoRR, vol. abs/1910.11215, 2019.
[131] R. Vezzani and R. Cucchiara, “Video surveillance online repository (visor): an integrated framework,” Multimedia Tools Appl.,
vol. 50, no. 2, pp. 359–380, 2010.
[132] J. Santner, C. Leistner, A. Saffari, T. Pock, and H. Bischof, “PROST:
parallel robust online simple tracking,” in The Twenty-Third IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2010,
San Francisco, CA, USA, 13-18 June 2010, 2010, pp. 723–730.
[133] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The
arcade learning environment: An evaluation platform for general
agents,” J. Artif. Intell. Res., vol. 47, pp. 253–279, 2013.
[134] G. Seguin, P. Bojanowski, R. Lajugie, and I. Laptev, “Instancelevel video segmentation from object tracks,” in CVPR. IEEE
Computer Society, 2016, pp. 3678–3687.
[135] Z. Bauer, F. Gomez-Donoso, E. Cruz, S. Orts-Escolano, and
M. Cazorla, “Uasol, a large-scale high-resolution outdoor stereo
dataset,” Scientific Data, vol. 6, no. 1, p. 162, 2019.
[136] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,”
in ECCV (1), ser. Lecture Notes in Computer Science, vol. 5302.
Springer, 2008, pp. 44–57.
[137] E. Santana and G. Hotz, “Learning a driving simulator,” CoRR,
vol. abs/1608.01230, 2016.
[138] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for
generic object recognition with invariance to pose and lighting,”
in 2004 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR 2004), with CD-ROM, 27 June - 2 July
2004, Washington, DC, USA. IEEE Computer Society, 2004, pp.
97–104.
[139] P. Martinez-Gonzalez, S. Oprea, A. Garcia-Garcia, A. JoverAlvarez, S. Orts-Escolano, and J. Garcia-Rodriguez, “UnrealROX: An extremely photorealistic virtual reality environment
for robotics simulations and synthetic data generation,” Virtual
Reality, 2019.
[140] D. Jayaraman and K. Grauman, “Look-ahead before you leap:
End-to-end active recognition by forecasting the effect of motion,” in ECCV (5), ser. Lecture Notes in Computer Science, vol.
9909. Springer, 2016, pp. 489–505.
25
[141] J. Walker, C. Doersch, A. Gupta, and M. Hebert, “An uncertain
future: Forecasting from static images using variational autoencoders,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings,
Part VII, 2016, pp. 835–851.
[142] Z. Hao, X. Huang, and S. J. Belongie, “Controllable video generation with sparse trajectories,” in CVPR. IEEE Computer Society,
2018, pp. 7854–7863.
[143] Y. Ye, M. Singh, A. Gupta, and S. Tulsiani, “Compositional video
prediction,” in The IEEE International Conference on Computer
Vision (ICCV), October 2019.
[144] S. Mozaffari, O. Y. Al-Jarrah, M. Dianati, P. A. Jennings, and
A. Mouzakitis, “Deep learning-based vehicle behaviour prediction for autonomous driving applications: A review,” CoRR, vol.
abs/1912.11676, 2019.
[145] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model,” in
INTERSPEECH 2010, 11th Annual Conference of the International
Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, 2010, pp. 1045–1048.
[146] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International
Conference on Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[147] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy
optical flow estimation based on a theory for warping,” in
Computer Vision - ECCV 2004, 8th European Conference on Computer
Vision, Prague, Czech Republic, May 11-14, 2004. Proceedings, Part
IV, ser. Lecture Notes in Computer Science, T. Pajdla and J. Matas,
Eds., vol. 3024. Springer, 2004, pp. 25–36.
[148] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing
of gans for improved quality, stability, and variation,” in ICLR.
OpenReview.net, 2018.
[149] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C.
Courville, “Improved training of wasserstein gans,” in NIPS,
2017, pp. 5767–5777.
[150] B. Jin, Y. Hu, Q. Tang, J. Niu, Z. Shi, Y. Han, and X. Li,
“Exploring spatial-temporal multi-frequency analysis for highfidelity and temporal-consistency video prediction,” CoRR, vol.
abs/2002.09905, 2020.
[151] O. Shouno, “Photo-realistic video prediction on natural videos of
largely changing frames,” CoRR, vol. abs/2003.08635, 2020.
[152] R. Hou, H. Chang, B. Ma, and X. Chen, “Video prediction with
bidirectional constraint network,” in 2019 14th IEEE International
Conference on Automatic Face Gesture Recognition (FG 2019), May
2019, pp. 1–8.
[153] M. Oliu, J. Selva, and S. Escalera, “Folded recurrent neural
networks for future video prediction,” in Computer Vision - ECCV
2018 - 15th European Conference, Munich, Germany, September 8-14,
2018, Proceedings, Part XIV, 2018, pp. 745–761.
[154] F. A. Reda, G. Liu, K. J. Shih, R. Kirby, J. Barker, D. Tarjan, A. Tao,
and B. Catanzaro, “Sdc-net: Video prediction using spatiallydisplaced convolution,” in The European Conference on Computer
Vision (ECCV), September 2018.
[155] R. Memisevic and G. E. Hinton, “Learning to represent spatial transformations with factored higher-order boltzmann machines,” Neural Computation, vol. 22, no. 6, pp. 1473–1492, 2010.
[156] R. Memisevic, “Gradient-based learning of higher-order image
features,” in Proceedings of the 2011 International Conference on
Computer Vision, ser. ICCV ’11. Washington, DC, USA: IEEE
Computer Society, 2011, pp. 1591–1598.
[157] ——, “Learning to relate images,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 35, no. 8, pp. 1829–1846, 2013.
[158] V. Michalski, R. Memisevic, and K. Konda, “Modeling deep
temporal dependencies with recurrent grammar cells,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani,
M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger,
Eds. Curran Associates, Inc., 2014, pp. 1925–1933.
[159] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu,
“Spatial transformer networks,” in Advances in Neural Information
Processing Systems 28: Annual Conference on Neural Information
Processing Systems 2015, December 7-12, 2015, Montreal, Quebec,
Canada, 2015, pp. 2017–2025.
[160] B. Klein, L. Wolf, and Y. Afek, “A dynamic convolutional layer
for short rangeweather prediction,” in 2015 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2015, pp.
4840–4848.
[161] B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool, “Dynamic
filter networks,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems
2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 667–675.
[162] A. Clark, J. Donahue, and K. Simonyan, “Adversarial video
generation on complex datasets,” 2019.
[163] P. Luc, A. Clark, S. Dieleman, D. de Las Casas, Y. Doron, A. Cassirer, and K. Simonyan, “Transformation-based adversarial video
prediction on large-scale data,” CoRR, vol. abs/2003.04035, 2020.
[164] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in
deep convolutional networks for visual recognition,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015.
[165] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural
Information Processing Systems 27: Annual Conference on Neural
Information Processing Systems 2014, December 8-13 2014, Montreal,
Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D.
Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 568–576.
[166] H. Gao, H. Xu, Q. Cai, R. Wang, F. Yu, and T. Darrell, “Disentangling propagation and generation for video prediction,” in ICCV.
IEEE, 2019, pp. 9005–9014.
[167] Y. Wu, R. Gao, J. Park, and Q. Chen, “Future video synthesis with
object motion prediction,” 2020.
[168] J. Hsieh, B. Liu, D. Huang, F. Li, and J. C. Niebles, “Learning
to decompose and disentangle representations for video prediction,” in Advances in Neural Information Processing Systems 31:
Annual Conference on Neural Information Processing Systems 2018,
NeurIPS 2018, 3-8 December 2018, Montréal, Canada, S. Bengio,
H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, Eds., 2018, pp. 515–524.
[169] S. Chiappa, S. Racanière, D. Wierstra, and S. Mohamed, “Recurrent environment simulators,” in ICLR (Poster). OpenReview.net,
2017.
[170] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik, “Learning
visual predictive models of physics for playing billiards,” in 4th
International Conference on Learning Representations, ICLR 2016, San
Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
[171] A. Dosovitskiy and V. Koltun, “Learning to act by predicting the
future,” in 5th International Conference on Learning Representations,
ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings, 2017.
[172] P. Luc, “Self-supervised learning of predictive segmentation
models from video,” Theses, Université Grenoble Alpes, Jun.
2019. [Online]. Available: https://tel.archives-ouvertes.fr/tel-0
2196890
[173] H.-k. Chiu, E. Adeli, and J. C. Niebles, “Segmenting the future,”
arXiv preprint arXiv:1904.10666, 2019.
[174] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and
S. Yan, “Predicting scene parsing and motion dynamics in the
future,” in Advances in Neural Information Processing Systems 30:
Annual Conference on Neural Information Processing Systems 2017,
4-9 December 2017, Long Beach, CA, USA, 2017, pp. 6918–6927.
[175] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in
NIPS, 2014, pp. 2654–2662.
[176] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge
in a neural network,” CoRR, vol. abs/1503.02531, 2015.
[177] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid,
“Epicflow: Edge-preserving interpolation of correspondences for
optical flow,” in IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015,
pp. 1164–1172.
[178] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
network,” in CVPR. IEEE Computer Society, 2017, pp. 6230–
6239.
[179] S. N. Wood, “Statistical inference for noisy nonlinear ecological
dynamic systems,” Nature, vol. 466, no. 7310, pp. 1102–1104, 2010.
[180] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed, “Variational approaches for auto-encoding generative
adversarial networks,” CoRR, vol. abs/1706.04987, 2017.
[181] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,”
in ICCV. IEEE Computer Society, 2017, pp. 2980–2988.
[182] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee, “Deep visual analogymaking,” in NIPS, 2015, pp. 1252–1260.
[183] N. Fushishita, A. Tejero-de-Pablos, Y. Mukuta, and T. Harada,
“Long-term video generation of multiple futures using human
poses,” CoRR, vol. abs/1904.07538, 2019.
26
[184] J. Tang, H. Hu, Q. Zhou, H. Shan, C. Tian, and T. Q. S. Quek,
“Pose guided global and local gan for appearance preserving
human video prediction,” in 2019 IEEE International Conference
on Image Processing (ICIP), Sep. 2019, pp. 614–618.
[185] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural
probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[186] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence
learning with neural networks,” in Advances in Neural Information
Processing Systems 27: Annual Conference on Neural Information
Processing Systems 2014, December 8-13 2014, Montreal, Quebec,
Canada, 2014, pp. 3104–3112.
[187] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June
7-12, 2015, 2015, pp. 5188–5196.
[188] R. Chalasani and J. C. Prı́ncipe, “Deep predictive coding networks,” in 1st International Conference on Learning Representations,
ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track
Proceedings, 2013.
[189] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber, “Parallel multi-dimensional lstm, with application to fast biomedical
volumetric image segmentation,” in Proceedings of the 28th International Conference on Neural Information Processing Systems Volume 2, ser. NIPS15. Cambridge, MA, USA: MIT Press, 2015,
p. 29983006.
[190] J. Zhang, Y. Zheng, and D. Qi, “Deep spatio-temporal residual
networks for citywide crowd flows prediction,” in AAAI. AAAI
Press, 2017, pp. 1655–1661.
[191] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal,
H. Kim, V. Haenel, I. Fründ, P. Yianilos, M. Mueller-Freitag,
F. Hoppe, C. Thurau, I. Bax, and R. Memisevic, “The ”something
something” video database for learning and evaluating visual
common sense,” in ICCV. IEEE Computer Society, 2017, pp.
5843–5851.
[192] Z. Yi, H. R. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised
dual learning for image-to-image translation,” in ICCV. IEEE
Computer Society, 2017, pp. 2868–2876.
[193] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-toimage translation using cycle-consistent adversarial networks,”
in ICCV. IEEE Computer Society, 2017, pp. 2242–2251.
[194] W. Luo, W. Liu, and S. Gao, “A revisit of sparse coding based
anomaly detection in stacked RNN framework,” in ICCV. IEEE
Computer Society, 2017, pp. 341–349.
[195] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. S.
Regazzoni, and N. Sebe, “Abnormal event detection in videos
using generative adversarial nets,” in ICIP. IEEE, 2017, pp. 1577–
1581.
[196] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via
adaptive separable convolution,” in ICCV.
IEEE Computer
Society, 2017, pp. 261–270.
[197] ——, “Video frame interpolation via adaptive convolution,” in
CVPR. IEEE Computer Society, 2017, pp. 2270–2279.
[198] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and
A. Zisserman, “A short note about kinetics-600,” CoRR, vol.
abs/1808.01340, 2018.
[199] T. Xue, J. Wu, K. L. Bouman, and B. Freeman, “Visual dynamics:
Probabilistic future frame synthesis via cross convolutional networks,” in Advances in Neural Information Processing Systems 29:
Annual Conference on Neural Information Processing Systems 2016,
December 5-10, 2016, Barcelona, Spain, 2016, pp. 91–99.
[200] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. A.
Funkhouser, “Semantic scene completion from a single depth
image,” in CVPR. IEEE Computer Society, 2017, pp. 190–198.
[201] M. Menze and A. Geiger, “Object scene flow for autonomous
vehicles,” in CVPR. IEEE Computer Society, 2015, pp. 3061–
3070.
[202] J. Janai, F. Güney, A. Ranjan, M. J. Black, and A. Geiger, “Unsupervised learning of multi-frame optical flow with occlusions,”
in ECCV (16), ser. Lecture Notes in Computer Science, vol. 11220.
Springer, 2018, pp. 713–731.
[203] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee,
“Learning what and where to draw,” in NIPS, 2016, pp. 217–225.
[204] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks
for human pose estimation,” in ECCV (8), ser. Lecture Notes in
Computer Science, vol. 9912. Springer, 2016, pp. 483–499.
[205] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent
network models for human dynamics,” in 2015 IEEE International
Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 4346–4354.
[206] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi, “Conditional image
generation for learning the structure of visual objects,” CoRR, vol.
abs/1806.07823, 2018.
[207] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee, “Unsupervised
discovery of object landmarks as structural representations,” in
CVPR. IEEE Computer Society, 2018, pp. 2694–2703.
[208] R. Goroshin, M. Mathieu, and Y. LeCun, “Learning to linearize
under uncertainty,” in Advances in Neural Information Processing
Systems 28: Annual Conference on Neural Information Processing
Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015,
pp. 1234–1242.
[209] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming autoencoders,” in ICANN (1), ser. Lecture Notes in Computer Science,
vol. 6791. Springer, 2011, pp. 44–51.
[210] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun,
“Unsupervised learning of spatiotemporally coherent metrics,”
in 2015 IEEE International Conference on Computer Vision, ICCV
2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 4086–4093.
[211] T. Brox and J. Malik, “Object segmentation by long term analysis
of point trajectories,” in ECCV (5), ser. Lecture Notes in Computer
Science, vol. 6315. Springer, 2010, pp. 282–295.
[212] J. Schmidhuber, “Learning complex, extended sequences using
the principle of history compression,” Neural Computation, vol. 4,
no. 2, pp. 234–242, 1992.
[213] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning
to poke by poking: Experiential learning of intuitive physics,”
CoRR, vol. abs/1606.07419, 2016.
[214] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for
deep reinforcement learning,” in ICML, ser. JMLR Workshop and
Conference Proceedings, vol. 48. JMLR.org, 2016, pp. 1928–1937.
[215] J. Zhang and K. Cho, “Query-efficient imitation learning for endto-end simulated driving,” in AAAI. AAAI Press, 2017, pp. 2891–
2897.
[216] S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam,
K. Maier-Hein, S. A. Eslami, D. J. Rezende, and O. Ronneberger,
“A probabilistic u-net for segmentation of ambiguous images,” in
Advances in Neural Information Processing Systems, 2018, pp. 6965–
6975.
[217] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and
T. Darrell, “BDD100K: A diverse driving video database with
scalable annotation tooling,” CoRR, vol. abs/1805.04687, 2018.
[218] G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, “The
mapillary vistas dataset for semantic understanding of street
scenes,” in ICCV. IEEE Computer Society, 2017, pp. 5000–5009.
[219] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”
in 2nd International Conference on Learning Representations, ICLR
2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2014.
[220] X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Conditional image generation from visual attributes,” in Computer
Vision - ECCV 2016 - 14th European Conference, Amsterdam, The
Netherlands, October 11-14, 2016, Proceedings, Part IV, ser. Lecture
Notes in Computer Science, B. Leibe, J. Matas, N. Sebe, and
M. Welling, Eds., vol. 9908. Springer, 2016, pp. 776–791.
[221] H. Wu, M. Rubinstein, E. Shih, J. V. Guttag, F. Durand, and
W. T. Freeman, “Eulerian video magnification for revealing subtle
changes in the world,” ACM Trans. Graph., vol. 31, no. 4, pp. 65:1–
65:8, 2012.
[222] R. Villegas, A. Pathak, H. Kannan, D. Erhan, Q. V. Le, and H. Lee,
“High fidelity video prediction with large stochastic recurrent
neural networks,” CoRR, vol. abs/1911.01655, 2019.
[223] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and
O. Winther, “Ladder variational autoencoders,” in NIPS, 2016,
pp. 3738–3746.
[224] R. Pottorff, J. Nielsen, and D. Wingate, “Video extrapolation
with an invertible linear embedding,” CoRR, vol. abs/1903.00133,
2019.
[225] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with
invertible 1x1 convolutions,” in NeurIPS, 2018, pp. 10 236–10 245.
[226] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image
quality assessment: from error visibility to structural similarity,”
IEEE Trans. Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
27
[227] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang,
“The unreasonable effectiveness of deep features as a perceptual
metric,” in CVPR. IEEE Computer Society, 2018, pp. 586–595.
[228] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier,
M. Michalski, and S. Gelly, “Towards accurate generative models
of video: A new metric & challenges,” CoRR, vol. abs/1812.01717,
2018.
[229] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford,
and X. Chen, “Improved techniques for training gans,” in NIPS,
2016, pp. 2226–2234.
[230] O. Breuleux, Y. Bengio, and P. Vincent, “Quickly generating
representative samples from an rbm-derived process,” Neural
Computation, vol. 23, no. 8, pp. 2058–2073, 2011.
[231] E. Hildreth, “Theory of edge detection,” Proceedings of Royal
Society of London, vol. 207, no. 187-217, p. 9, 1980.
[232] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu, “Predrnn:
Recurrent neural networks for predictive learning using spatiotemporal lstms,” in NIPS, 2017, pp. 879–888.
[233] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu, “Predrnn++:
Towards A resolution of the deep-in-time dilemma in spatiotemporal predictive learning,” in ICML, ser. Proceedings of Machine
Learning Research, vol. 80. PMLR, 2018, pp. 5110–5119.
[234] F. Cricri, X. Ni, M. Honkala, E. Aksu, and M. Gabbouj, “Video
ladder networks,” CoRR, vol. abs/1612.01756, 2016.
[235] I. Prémont-Schwarz, A. Ilin, T. Hao, A. Rasmus, R. Boney, and
H. Valpola, “Recurrent ladder networks,” in NIPS, 2017, pp.
6009–6019.
[236] B. Jin, Y. Hu, Y. Zeng, Q. Tang, S. Liu, and J. Ye, “Varnet: Exploring
variations for unsupervised video prediction,” in 2018 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS),
2018, pp. 5801–5806.
[237] J. Lee, J. Lee, S. Lee, and S. Yoon, “Mutual suppression network
for video prediction using disentangled features,” arXiv preprint
arXiv:1804.04810, 2018.
[238] D. Weissenborn, O. Tckstrm, and J. Uszkoreit, “Scaling
autoregressive video models,” in International Conference on
Learning Representations, 2020. [Online]. Available: https:
//openreview.net/forum?id=rJgsskrFwH
[239] L. Theis, A. van den Oord, and M. Bethge, “A note on the
evaluation of generative models,” in 4th International Conference
on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May
2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun,
Eds., 2016.
Sergiu Oprea is a PhD student at the Department of Computer Technology (DTIC), University of Alicante. He received his MSc (Automation and Robotics) and BSc (Computer Science)
from the same institution in 2017 and 2015 respectively. His main research interests include
video prediction with deep learning, virtual reality, 3D computer vision, and parallel computing
on GPUs.
Pablo Martinez Gonzalez is a PhD student
at the Department of Computer Technology
(DTIC), University of Alicante. He received his
MSc (Computer Graphics, Games and Virtual
Reality) and BSc (Computer Science) at the
Rey Juan Carlos University and University of Alicante, in 2017 and 2015, respectively. His main
research interests include deep learning, virtual
reality and parallel computing on GPUs.
Alberto Garcia Garcia is a Postdoctoral Researcher at the Institute of Space Sciences (ICECSIC, Barcelona) where he leads the efforts in
code optimization, machine learning, and parallel computing on the MAGNESIA ERC Consolidator project. He received his PhD (Machine
Learning and Computer Vision), MSc (Automation and Robotics) and BSc (Computer Science)
from the same institution in 2019, 2016 and
2015 respectively. Previously he was an intern
at NVIDIA Research/Engineering, Facebook Reality Labs, and Oculus Core Tech. His main research interests include
deep learning (specially convolutional neural networks), virtual reality,
3D computer vision, and parallel computing on GPUs.
John Alejandro Castro Vargas is a PhD student at the Department of Computer Technology
(DTIC), University of Alicante. He received his
MSc (Automation and Robotics) and BSc (Computer Science) from the same institution in 2017
and 2016 respectively. His main research interests include human behavior recognition with
deep learning, virtual reality and parallel computing on GPUs.
Sergio Orts-Escolano received a BSc, MSc
and PhD in Computer Science from the University of Alicante in 2008, 2010 and 2014 respectively. His research interests include computer vision, assistive robotics, 3D sensors, GPU
computing, virtual/augmented reality and deep
learning. He has authored +50 publications in
top journals and conferences like CVPR, SIGGRAPH, 3DV, BMVC, CVIU, IROS, UIST, RAS,
etcetera. He is also a member of European Networks like HiPEAC and Eucog. He has experience as a professor in academia and industry, working as a research
scientist for companies such as Google and Microsoft Research.
Jose Garcia-Rodriguez received his Ph.D. degree, with specialization in Computer Vision and
Neural Networks, from the University of Alicante
(Spain). He is currently Full Professor at the
Department of Computer Technology of the University of Alicante. His research areas of interest include: computer vision, computational intelligence, machine learning, pattern recognition,
robotics, man-machine interfaces, ambient intelligence, computational chemistry, and parallel
and multicore architectures.
Antonis Argyros is a professor of computer
science at the Computer Science Department,
University of Crete and a researcher at the Institute of Computer Science, FORTH, in Heraklion,
Crete, Greece. His research interests fall in the
areas of computer vision and pattern recognition, with emphasis on the analysis of humans
in images and videos, human pose analysis,
recognition of human activities and gestures, 3D
computer vision, as well as image motion and
tracking. He is also interested in applications of
computer vision in the fields of robotics and smart environments.