
A Review on Deep Learning Techniques for Video Prediction

2020, IEEE Transactions on Pattern Analysis and Machine Intelligence

The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction has emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it has demonstrated the potential to extract meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review of the deep learning methods for prediction in video sequences. We first define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied by experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper concludes by drawing general conclusions, identifying open research challenges and pointing out future research directions.

arXiv:2004.05214v1 [cs.CV] 10 Apr 2020

S. Oprea, P. Martinez-Gonzalez, A. Garcia-Garcia, J. A. Castro-Vargas, S. Orts-Escolano, J. Garcia-Rodriguez, and A. Argyros

• S. Oprea, P. Martinez-Gonzalez, A. Garcia-Garcia, J. A. Castro-Vargas, and J. Garcia-Rodriguez are with the 3D Perception Lab (3DPL), Department of Computer Technology, University of Alicante, Carrer de San Vicente del Raspeig s/n, E-03690 San Vicente del Raspeig, Spain. E-mail: {soprea, pmartinez, jacastro, jgarcia}@dtic.ua.es
• A. Garcia-Garcia is with the Institute of Space Sciences (ICE-CSIC), Campus UAB, Carrer de Can Magrans s/n, E-08193 Barcelona, Spain. E-mail: [email protected].
• S. Orts-Escolano is with the Department of Computer Science and Artificial Intelligence (DCCIA), University of Alicante, Carrer de San Vicente del Raspeig s/n, E-03690 San Vicente del Raspeig, Spain. E-mail: [email protected].
• A. Argyros is with the Institute of Computer Science, FORTH, Heraklion GR-700 13, Greece, and with the Computer Science Department, University of Crete, Heraklion, Rethimno 741 00, Greece. E-mail: [email protected].

Index Terms—Video prediction, future frame prediction, deep learning, representation learning, self-supervised learning

1 INTRODUCTION

Will the car hit the pedestrian? That might be one of the questions that comes to mind when we observe Figure 1. Answering this question might in principle be a hard task; however, if we take a careful look at the image sequence, we may notice subtle clues that can help us predict the future, e.g., the person's body posture indicates that he is running fast enough to escape the car's trajectory. This example is just one situation among many others in which predicting future frames in video is useful. In general terms, the prediction and anticipation of future events is a key component of intelligent decision-making systems. Despite the fact that we, humans, solve this problem quite easily and effortlessly, it is extremely challenging from a machine's point of view. Some of the factors that contribute to such complexity are occlusions, camera movement, lighting conditions, clutter, and object deformations.
Nevertheless, despite such challenging conditions, many predictive methods have been applied with a certain degree of success in a broad range of application domains such as autonomous driving, robot navigation and human-machine interaction. Some of the tasks in which future prediction has been applied successfully are: anticipating activities and events [1]–[4], long-term planning [5], future prediction of object locations [6], video interpolation [7], predicting instance/semantic segmentation maps [8]–[10], prediction of pedestrian trajectories in traffic [11], anomaly detection [12], precipitation nowcasting [13], [14], and autonomous driving [15].

Fig. 1: A pedestrian appeared from behind the white car with the intention of crossing the street. The autonomous car must make a call: trigger the emergency braking routine or not. This all comes down to predicting the next frames (Ŷt+1, . . . , Ŷt+m) given a sequence of context frames (Xt−n, . . . , Xt), where n and m denote the number of context and predicted frames, respectively. From these predictions at a representation level (RGB, high-level semantics, etc.) a decision-making system would make the car avoid the collision.

The great strides made by deep learning algorithms in a variety of research fields, such as semantic segmentation [16], human action recognition and prediction [17], object pose estimation [18] and registration [19], to name a few, motivated authors to explore deep representation-learning models for future video frame prediction. What made deep architectures take a leap over traditional approaches is their ability to learn adequate representations from high-dimensional data in an end-to-end fashion without hand-engineered features [20]. Deep learning-based models fit perfectly into the learning-by-prediction paradigm, enabling the extraction of meaningful spatio-temporal correlations from video data in a self-supervised fashion.

In this review, we put our focus on deep learning techniques and how they have been extended or applied to future video prediction. We limit this review to future video prediction given the context of a sequence of previous frames, leaving aside methods that predict the future from a static image. In this context, the terms video prediction, future frame prediction, next video frame prediction, future frame forecasting, and future frame generation are used interchangeably. To the best of our knowledge, this is the first review in the literature that focuses on video prediction using deep learning techniques.

This review is organized as follows. First, Sections 2 and 3 lay down the terminology and explain important background concepts that will be necessary throughout the rest of the paper. Next, Section 4 surveys the datasets used by the video prediction methods that are carefully reviewed in Section 5, providing a comprehensive description as well as an analysis of their strengths and weaknesses. Section 6 analyzes typical metrics and evaluation protocols for the aforementioned methods and provides quantitative results for them on the reviewed datasets. Section 7 presents a brief discussion of the presented proposals and enumerates potential future research directions. Finally, Section 8 summarizes the paper and draws conclusions about this work.
2 VIDEO PREDICTION

The ability to predict, anticipate and reason about future events is the essence of intelligence [21] and one of the main goals of decision-making systems. This idea has biological roots, and also draws inspiration from the predictive coding paradigm [22] borrowed from the cognitive neuroscience field [23]. From a neuroscience perspective, the human brain builds complex mental representations of the physical and causal rules that govern the world, primarily through observation and interaction [24]–[26]. The common sense we have about the world arises from conceptual acquisition and the accumulation of background knowledge from an early age, e.g. biological motion and intuitive physics, to name a few. But how can the brain check and refine its learned mental representations from raw sensory input? The brain is continuously learning through prediction, and refines the already understood world models from the mismatch between its predictions and what actually happened [27]. This is the essence of the predictive coding paradigm, which early works tried to implement as computational models [22], [28]–[30]. The video prediction task closely captures the fundamentals of the predictive coding paradigm, and it is considered the intermediate step between raw video data and decision making. Its potential to extract meaningful representations of the underlying patterns in video data makes video prediction a promising avenue for self-supervised representation learning.

2.1 Problem Definition

We formally define the task of predicting future frames in videos, i.e. video prediction, as follows. Let Xt ∈ R^(w×h×c) be the t-th frame in the video sequence X = (Xt−n, . . . , Xt−1, Xt) with n frames, where w, h, and c denote width, height, and number of channels, respectively. The target is to predict the next frames Y = (Ŷt+1, Ŷt+2, . . . , Ŷt+m) from the input X.

Under the assumption that good predictions can only be the result of accurate representations, learning by prediction is a feasible approach to verify how accurately the system has learned the underlying patterns in the input data. In other words, it represents a suitable framework for representation learning [31], [32]. The essence of the predictive learning paradigm is the prediction of plausible future outcomes from a set of historical inputs. On this basis, the task of video prediction is defined as: given a sequence of video frames as context, predict the subsequent frames, i.e. generate the continuation of a video given a sequence of previous frames. Different from video generation, which is mostly unconditioned, video prediction is conditioned on a previously learned representation of a sequence of input frames. At first glance, and in the context of learning paradigms, we could think of the future video frame prediction task as a supervised learning approach, because the target frame acts as a label. However, as this information is already available in the input video sequence, no extra labels or human supervision are needed. Therefore, learning by prediction is a self-supervised task, filling the gap between supervised and unsupervised learning.
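To make the above notation concrete, the following minimal NumPy sketch (an illustration written for this review, not code from any surveyed work) slices a raw clip into context/target pairs; the window lengths n_context and n_future are arbitrary choices standing in for n and m.

```python
import numpy as np

def make_prediction_pairs(video, n_context=4, n_future=2):
    """Slice a video of shape (T, H, W, C) into (X, Y) training pairs:
    X holds the most recent n_context frames and Y the next n_future frames."""
    xs, ys = [], []
    for t in range(n_context, len(video) - n_future + 1):
        xs.append(video[t - n_context:t])   # context window
        ys.append(video[t:t + n_future])    # future frames to predict
    return np.stack(xs), np.stack(ys)

video = np.random.rand(30, 64, 64, 3)       # a toy 30-frame RGB clip
X, Y = make_prediction_pairs(video)
print(X.shape, Y.shape)                     # (25, 4, 64, 64, 3) (25, 2, 64, 64, 3)
```

Because the targets are simply later frames of the same clip, no manual labeling is involved, which is exactly what makes the task self-supervised.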
2.2 Exploiting the Time Dimension of Videos

Unlike static images, videos provide complex transformations and motion patterns along the time dimension. At a fine granularity, if we focus on a small patch at the same spatial location across consecutive time steps, we could identify a wide range of local, visually similar deformations due to temporal coherence. In contrast, by looking at the big picture, consecutive frames would be visually different but semantically coherent. This variability in the visual appearance of a video at different scales is mainly due to occlusions, changes in the lighting conditions, and camera motion, among other factors. From this source of temporally ordered visual cues, predictive models are able to extract representative spatio-temporal correlations depicting the dynamics in a video sequence. For instance, Agrawal et al. [33] established a direct link between vision and motion, attempting to reduce supervision efforts when training deep predictive models. Recent works study how important the time dimension is for video understanding models [34]. The implicit temporal ordering in videos, also known as the arrow of time, indicates whether a video sequence is playing forward or backward. This temporal direction is also used in the literature as a supervisory signal [35]–[37]. This further encouraged predictive models to implicitly or explicitly model the spatio-temporal correlations of a video sequence to understand the dynamics of a scene. The time dimension of a video reduces the supervision effort and makes the prediction task self-supervised.

Fig. 2: At top, a deterministic environment where a geometric object, e.g. a black square, starts moving in a random direction. At bottom, the probabilistic outcome. Darker areas correspond to higher-probability outcomes. As uncertainty is introduced, probabilities get blurry and averaged. Figure inspired by [38].

2.3 Dealing with Stochasticity

Predicting how a square is moving could be extremely challenging even in a deterministic environment such as the one represented in Figure 2. The lack of contextual information and the multiple equally probable outcomes hinder the prediction task. But what if we use two consecutive frames as context? Under this configuration and assuming a physically perfect environment, the square will keep moving indefinitely in the same direction. This represents a deterministic outcome, an assumption that many authors made in order to deal with future uncertainty. Assuming a deterministic outcome would narrow the prediction space to a unique solution. However, this assumption is not suitable for natural videos. The future is by nature multimodal, since the probability distribution defining all the possible future outcomes in a context has multiple modes, i.e. there are multiple equally probable and valid outcomes. Furthermore, on the basis of a deterministic universe, we indirectly assume that all possible outcomes are reflected in the input data. These assumptions make prediction under uncertainty an extremely challenging task.

Most of the existing deep learning-based models in the literature are deterministic. Although the future is uncertain, a deterministic prediction suffices in some easily predictable situations. For instance, most of the movement of a car is largely deterministic, while only a small part is uncertain. However, when multiple predictions are equally probable, a deterministic model will learn to average between all the possible outcomes. This unpredictability is visually represented in the predictions as blurriness, especially on long time horizons.
As deterministic models are unable to handle real-world settings characterized by chaotic dynamics, authors considered that incorporating uncertainty into the model is a crucial aspect. Probabilistic approaches dealing with these issues are discussed in Section 5.6.

2.4 The Devil is in the Loss Function

The design and selection of the loss function for the video prediction task is of utmost importance. Pixel-wise losses, e.g. Cross Entropy (CE), ℓ2, ℓ1 and Mean Squared Error (MSE), are widely used in both unstructured and structured predictions. Although leading to plausible predictions in deterministic scenarios, such as synthetic datasets and video games, they struggle with the inherent uncertainty of natural videos. In a probabilistic environment, with different equally probable outcomes, pixel-wise losses accommodate uncertainty by blurring the prediction, as we can observe in Figure 2. In other words, deterministic loss functions average out multiple equally plausible outcomes into a single, blurred prediction. In the pixel space, these losses are unstable to slight deformations and fail to capture discriminative representations to efficiently regress the broad range of possible outcomes. This makes it difficult to draw predictions that remain consistent with our notion of visual similarity. Besides video prediction, several studies analyzed the impact of different loss functions in image restoration [39], classification [40], camera pose regression [41] and structured prediction [42], among others. This fosters reasoning about the importance of the loss function, particularly when making long-term predictions in high-dimensional and multimodal natural videos.

Most distance-based loss functions, such as those based on the ℓp norm, come from the assumption that the data is drawn from a Gaussian distribution. But how do these loss functions address multimodal distributions? Assuming that a pixel is drawn from a bimodal distribution with two equally likely modes Mo1 and Mo2, the mean value Mo = (Mo1 + Mo2)/2 would minimize the ℓp-based losses over the data, even if Mo has very low probability [43]. This suggests that the average of two equally probable outcomes minimizes distance-based losses such as the MSE loss. However, this applies to a lesser extent when using the ℓ1 norm, as the optimal pixel values would then be the median of the two equally likely modes of the distribution. In contrast to the ℓ2 norm, which emphasizes outliers with its squaring term, the ℓ1 norm promotes sparsity, thus making it more suitable for prediction in high-dimensional data [43]. Based on the ℓ2 norm, the MSE is also commonly used in the training of video prediction models. However, it produces low reconstruction errors by merely averaging all the possible outcomes into a blurry prediction as uncertainty is introduced. In other words, the mean image minimizes the MSE error, as it is the global optimum, thus washing out finer details such as facial features and subtle movements, which act as noise for the model. Most video prediction approaches rely on pixel-wise loss functions, obtaining roughly accurate predictions on easily predictable datasets.

One of the ultimate goals of many video prediction approaches is to palliate the blurry predictions caused by uncertainty. For this purpose, authors broadly focused on: directly improving the loss functions; exploring adversarial training; alleviating the training process by reformulating the problem in a higher-level space; or exploring probabilistic alternatives.
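To see the regress-to-the-mean effect numerically, the following self-contained NumPy sketch (a toy illustration, not taken from [43]) draws a pixel value from a bimodal distribution with two equally likely modes Mo1 and Mo2 and searches for the constant prediction that minimizes each loss: the ℓ2-optimal value is the mean (Mo1 + Mo2)/2, which itself has very low probability, while the ℓ1-optimal value sits at the empirical median, i.e. essentially at one of the modes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bimodal pixel distribution: two (almost exactly) equally likely modes.
mo1, mo2 = 0.1, 0.9
samples = rng.choice([mo1, mo2], size=10_001) + rng.normal(0, 0.01, 10_001)

# Evaluate constant predictions on a grid and pick the loss-minimizing one.
candidates = np.linspace(0.0, 1.0, 1001)
l2_loss = [np.mean((samples - c) ** 2) for c in candidates]
l1_loss = [np.mean(np.abs(samples - c)) for c in candidates]

print("l2-optimal prediction:", candidates[int(np.argmin(l2_loss))])  # ~0.5 = (Mo1 + Mo2)/2
print("l1-optimal prediction:", candidates[int(np.argmin(l1_loss))])  # ~a mode (the median)
```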
Some promising results were reported by combining the loss functions with sophisticated regularization terms, e.g. the Gradient Difference Loss (GDL) to enhance prediction sharpness [43] and Total Variation (TV) regularization to reduce visual artifacts and enforce coherence [7]. Perceptual losses were also used to further improve the visual quality of the predictions [44]–[48]. However, in light of the success of Generative Adversarial Networks (GANs), adversarial training emerged as a promising alternative to disambiguate between multiple equally probable modes. It was widely used in conjunction with different distance-based losses such as MSE [49], ℓ2 [50]–[52], or a combination of them [43], [53]–[57]. To alleviate the training process, many authors reformulated the optimization in a higher-level space (see Section 5.5). While great strides have been made to mitigate blurriness, most of the existing approaches still rely on distance-based loss functions. As a consequence, the regress-to-the-mean problem remains an open issue. This has further encouraged authors to reformulate existing deterministic models in a probabilistic fashion.

3 BACKBONE DEEP LEARNING ARCHITECTURES

In this section, we briefly review the most common deep networks that are used as building blocks for the video prediction models discussed in this review: convolutional neural networks, recurrent networks, and generative models.

3.1 Convolutional Models

Convolutional layers are the basic building blocks of deep learning architectures designed for visual reasoning, since Convolutional Neural Networks (CNNs) efficiently model the spatial structure of images [58]. As we focus on visual prediction, CNNs represent the foundation of the predictive learning literature. However, their performance is limited by the intra-frame and inter-frame dependencies. Convolutional operations account only for short-range intra-frame dependencies due to their limited receptive fields, determined by the kernel size. This is a well-addressed issue that many authors circumvented by (1) stacking more convolutional layers [59], (2) increasing the kernel size (although it becomes prohibitively expensive), (3) linearly combining multiple scales [43] as in the reconstruction process of a Laplacian pyramid [60], (4) using dilated convolutions to capture long-range spatial dependencies [61], (5) enlarging the receptive fields [62], [63], or (6) subsampling, i.e. using pooling operations in exchange for losing resolution. The latter could be mitigated by using residual connections [64], [65] to preserve resolution while increasing the number of stacked convolutions. But even addressing these limitations, would CNNs be able to predict over a longer time horizon?

Vanilla CNNs lack explicit inter-frame modeling capabilities. To properly model inter-frame variability in a video sequence, 3D convolutions come into play as a promising alternative to recurrent modeling. Several video prediction approaches leveraged 3D convolutions to capture temporal consistency [66]–[70]. Also modeling the time dimension, Amersfoort et al. [71] replicated a purely convolutional approach in time to address multi-scale predictions in the transformation space. In this case, the learned affine transforms at each time step play the role of a recurrent state.
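As a minimal illustration of how 3D convolutions can model the temporal dimension directly, the sketch below (a hypothetical toy model, not one of the architectures cited above) maps a short stack of context frames to a single predicted frame by convolving jointly over time and space; all layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class Tiny3DConvPredictor(nn.Module):
    """Toy 3D-CNN that maps T context frames to a single next-frame estimate."""
    def __init__(self, channels=3, hidden=32, context_len=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Collapse the temporal axis (length T) into a single step.
        self.temporal_pool = nn.Conv3d(hidden, hidden, kernel_size=(context_len, 1, 1))
        self.decoder = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, x):                   # x: (N, C, T, H, W)
        h = self.encoder(x)                 # (N, hidden, T, H, W)
        h = self.temporal_pool(h)           # (N, hidden, 1, H, W)
        return self.decoder(h.squeeze(2))   # (N, C, H, W): the predicted frame

frames = torch.randn(2, 3, 4, 64, 64)       # batch of 2, 4 context frames of 64x64 RGB
print(Tiny3DConvPredictor()(frames).shape)  # torch.Size([2, 3, 64, 64])
```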
3.2 Recurrent Models

Recurrent models were specifically designed to model a spatio-temporal representation of sequential data such as video sequences. Among other sequence learning tasks, such as machine translation, speech recognition and video captioning, to name a few, Recurrent Neural Networks (RNNs) [72] demonstrated great success in the video prediction scenario [10], [13], [49], [50], [52], [53], [70], [73]–[85]. Vanilla RNNs have some important limitations when dealing with long-term representations due to the vanishing and exploding gradient issues, which make backpropagation through time (BPTT) cumbersome. By extending classical RNNs to more sophisticated recurrent models, such as the Long Short-Term Memory (LSTM) [86] and the Gated Recurrent Unit (GRU) [87], these problems were mitigated. Shi et al. extended the use of LSTM-based models to the image space [13]. While some authors explored multidimensional LSTMs (MD-LSTM) [88], others stacked recurrent layers to capture abstract spatio-temporal correlations [49], [89]. Zhang et al. addressed the duplicated representations along the same recurrent paths [90].

3.3 Generative Models

Whilst discriminative models learn the decision boundaries between classes, generative models learn the underlying distribution of individual classes. More formally, discriminative models capture the conditional probability p(y|x), while generative models capture the joint probability p(x, y), or p(x) in the absence of labels y. The goal of generative models is the following: given some training data, generate new samples from the same distribution. Let the input data be drawn from pdata(x) and the generated samples from pmodel(x), where pdata and pmodel are the underlying data distribution and the model's distribution, respectively. The training process consists in learning a pmodel(x) similar to pdata(x). This is done by explicitly, e.g. in VAEs, or implicitly, e.g. in GANs, estimating a density function from the input data. In the context of video prediction, generative models are mainly used to cope with future uncertainty by generating a wide spectrum of feasible predictions rather than a single eventual outcome.

3.3.1 Explicit Density Modeling

These models explicitly define and solve for pmodel(x).

PixelRNNs and PixelCNNs [91]: These are a type of Fully Visible Belief Networks (FVBNs) [92], [93] that explicitly define a tractable density and estimate the joint distribution p(x) as a product of conditional distributions over the pixels. Informally, they turn pixel generation into a sequential modeling problem, where the next pixel values are determined by the previously generated ones. In PixelRNNs, this conditional dependency on previous pixels is modeled using two-dimensional (2D) LSTMs. In PixelCNNs, the dependencies are modeled using convolutional operations over a context region, thus making training faster. In a nutshell, these methods output a distribution over pixel values at each location in the image, aiming to maximize the likelihood of the training data being generated. Further improvements of the original architectures have been carried out to address different issues. The Gated PixelCNN [94] is computationally more efficient and improves the receptive fields of the original architecture. In the same work, the authors also explored conditional modeling of natural images, where the joint probability distribution is conditioned on a latent vector representing a high-level image description. This further enabled the extension to video prediction [95].
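The pixel-level autoregression described above hinges on masked convolutions that restrict each output location to previously generated pixels. The sketch below is an illustrative PyTorch masked convolution in that spirit (not the exact layer of [91] or [94]); mask type 'A' also hides the current pixel and is used in the first layer, while type 'B' is used in subsequent layers.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """PixelCNN-style masked convolution: each output pixel only sees pixels
    above it and to its left (mask type 'A' also hides the centre pixel)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0  # centre row: right of centre (and centre for 'A')
        mask[kH // 2 + 1:, :] = 0                          # rows below the centre
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding, self.dilation, self.groups)

x = torch.randn(1, 1, 28, 28)
layer = MaskedConv2d("A", 1, 16, kernel_size=7, padding=3)
print(layer(x).shape)  # torch.Size([1, 16, 28, 28])
```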
Variational Autoencoders (VAEs): These models are an extension of Autoencoders (AEs), which encode and reconstruct their own input data x in order to capture a low-dimensional representation z containing the most meaningful factors of variation in x. Extending this architecture to generation, VAEs aim to sample new images from a prior over the underlying latent representation z. VAEs represent a probabilistic spin on the deterministic latent space of AEs. Instead of directly optimizing the density function, which is intractable, they derive and optimize a lower bound on the likelihood. Data is generated from the learned distribution by perturbing the latent variables. In the video prediction context, VAEs are the foundation of many probabilistic models dealing with future uncertainty [9], [38], [55], [81], [85], [96], [97]. Although these variational approaches are able to generate various plausible outcomes, the predictions are blurrier and of lower quality compared to state-of-the-art GAN-based models. Many approaches were taken to leverage the advantages of variational inference: some combined adversarial training with VAEs [55], while others incorporated latent probabilistic variables into deterministic models, such as Variational Recurrent Neural Networks (VRNNs) [97], [98] and Variational Encoder-Decoders (VEDs) [99].

3.3.2 Implicit Density Modeling

These models learn to sample from pmodel without explicitly defining it.

GANs [100]: are the backbone of many video prediction approaches [43], [49]–[55], [57], [65], [67], [68], [78], [101]–[106]. Inspired by game theory, these networks consist of two models that are jointly trained as a minimax game to generate new fake samples that resemble the real data. On the one hand, the discriminator estimates the probability that a sample comes from the real data distribution. On the other hand, the generator tries to generate new samples that fool the discriminator. In their original formulation, GANs are unconditioned: the generator samples new data from random noise, e.g. Gaussian noise. Nevertheless, Mirza et al. [107] proposed the conditional Generative Adversarial Network (cGAN), a conditional version where the generator and discriminator are conditioned on some extra information, e.g. class labels, previous predictions, or multimodal data, among others. cGANs are suitable for video prediction, since the spatio-temporal coherence between the generated frames and the input sequence is guaranteed. The use of adversarial training for the video prediction task represented a leap over the previous state-of-the-art methods in terms of prediction quality and sharpness. However, adversarial training is unstable. Without an explicit latent variable interpretation, GANs are prone to mode collapse, i.e. the generator fails to cover the space of possible predictions by getting stuck in a single mode [99], [108]. Moreover, GANs often struggle to balance the adversarial and reconstruction losses, thus producing blurry predictions. Among the dense literature on adversarial networks, we find some other interesting works addressing GAN limitations [109], [110].
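To illustrate how adversarial and reconstruction objectives are typically combined in this setting, the sketch below shows one training step of a toy conditional GAN for next-frame prediction; the generator and discriminator architectures, the ℓ1 weight and the optimizer settings are arbitrary placeholders rather than the configuration of any reviewed method. Conditioning is done by concatenating the context frames along the channel dimension.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: any frame predictor G and (patch) discriminator D could be plugged in.
G = nn.Sequential(nn.Conv2d(3 * 4, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(3 * 5, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(64, 1, 3, stride=2, padding=1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

context = torch.randn(8, 3 * 4, 64, 64)   # 4 stacked RGB context frames
target = torch.randn(8, 3, 64, 64)        # ground-truth next frame

# Discriminator step: real (context, target) vs fake (context, prediction).
pred = G(context).detach()
d_real = D(torch.cat([context, target], dim=1))
d_fake = D(torch.cat([context, pred], dim=1))
loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator while staying close to the target.
pred = G(context)
d_fake = D(torch.cat([context, pred], dim=1))
loss_g = bce(d_fake, torch.ones_like(d_fake)) + 10.0 * l1(pred, target)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```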
4 DATASETS

As video prediction models are mostly self-supervised, they need video sequences as input data. However, some video prediction methods rely on extra supervisory signals, e.g. segmentation maps and human poses. This makes out-of-domain video datasets perfectly suitable for video prediction. This section describes the most relevant datasets, discussing their pros and cons. The datasets are organized according to their main purpose and summarized in Table 1.

4.1 Action and Human Pose Recognition Datasets

KTH [111]: is an action recognition dataset which includes 2391 video sequences of 4 seconds mean duration, each of them containing an actor performing an action in front of a static camera, over homogeneous backgrounds, at 25 frames per second (fps) and with a resolution downsampled to 160 × 120 pixels. Only 6 different actions are performed, but it was the biggest dataset of this kind at the time.

Weizmann [112]: is also an action recognition dataset, created for modelling actions as space-time shapes. For this reason, it was recorded at a higher frame rate (50 fps). It includes just 90 video sequences, but covers 10 different actions. It uses a static camera, homogeneous backgrounds and a low resolution (180 × 144 pixels). KTH and Weizmann are usually used together, due to their similarities, in order to augment the amount of available data.

HMDB-51 [113]: is a large-scale database for human motion recognition. It claims to represent the richness of human motion by taking advantage of the huge amount of video available online. It is composed of 6766 normalized videos (with a mean duration of 3.15 seconds) where humans appear performing one of the 51 considered action categories. Moreover, a stabilized dataset version is provided, in which camera movement is removed by detecting static backgrounds and displacing the action as a window. It also provides interesting metadata for each sequence, such as the visible body parts, the point of view with respect to the human, and whether there is camera movement or not. A joint-annotated version called J-HMDB [114] also exists, in which the key points of joints were manually added for 21 of the HMDB actions.

UCF101 [115]: is an action recognition dataset of realistic action videos collected from YouTube. It has 101 different action categories, and it is an extension of UCF50, which has 50 action categories. All videos have a frame rate of 25 fps and a resolution of 320 × 240 pixels. Despite being the most used dataset among predictive models, one of its problems is that only a few sequences really represent movement, i.e. they often show an action over a fixed background.

Penn Action Dataset [116]: is an action and human pose recognition dataset from the University of Pennsylvania. It contains 2326 video sequences of 15 different actions, and it also provides human joint and viewpoint (position of the camera with respect to the human) annotations for each sequence. Each action is balanced in terms of the different viewpoints represented.

Human3.6M [117]: is a human pose dataset in which 11 actors with marker-based suits were recorded performing 15 different types of actions.
It features RGB images, depth maps (time-of-flight range data), poses and scanned 3D surface meshes of all actors. Silhouette masks and 2D bounding boxes are also provided. Moreover, the dataset was extended by inserting high-quality 3D rigged human models (animated with the previously recorded actions) into real videos, to create realistic and complex backgrounds.

THUMOS-15 [118]: is an action recognition challenge that was held in 2015. It did not just focus on recognizing an action in a video, but also on determining the time span in which that action occurs. With that purpose, the challenge provided a dataset that extends UCF101 [115] (trimmed videos with one action) with 2100 untrimmed videos where one or more actions take place (with the corresponding temporal annotations) and almost 3000 relevant videos without any of the 101 proposed actions.

4.2 Driving and Urban Scene Understanding Datasets

CamVid [136]: the Cambridge-driving Labeled Video Database is a driving/urban scene understanding dataset which consists of 5 video sequences recorded with a 960 × 720 pixel resolution camera mounted on the dashboard of a car. Four of those sequences were sampled at 1 fps, and one at 15 fps, resulting in 701 frames which were manually per-pixel annotated for semantic segmentation (under 32 classes). It was the first video sequence dataset of this kind to incorporate semantic annotations.

CalTech Pedestrian Dataset [119]: is a driving dataset focused on detecting pedestrians, since its only annotations are pedestrian bounding boxes. It comprises approximately 10 hours of 640 × 480, 30 fps video taken from a vehicle driving through regular traffic in an urban environment, making a total of 250 000 annotated frames distributed in 137 approximately minute-long segments. In total, 350 000 pedestrian bounding boxes are provided, identifying 2300 unique pedestrians. Temporal correspondences between bounding boxes and detailed occlusion labels are also provided.

Kitti [120]: is one of the most popular datasets for mobile robotics and autonomous driving, as well as a benchmark for computer vision algorithms. It is composed of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, gray-scale stereo cameras, and a 3D laser scanner. Despite its popularity, the original dataset did not contain ground truth for semantic segmentation. However, after various researchers manually annotated parts of the dataset to fit their needs, in 2015 the Kitti dataset was updated with 200 frames annotated at the pixel level for both semantic and instance segmentation, following the format proposed by the Cityscapes [121] dataset.

Cityscapes [121]: is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories. The dataset consists of around 5000 finely annotated images (1 frame in 30) and 20 000 coarsely annotated ones (one frame every 20 seconds or 20 meters run by the car). Data was captured in 50 cities over several months, at different daytimes, and in good weather conditions. All frames are provided as stereo pairs, and the dataset also includes vehicle odometry obtained from in-vehicle sensors, outside temperature, and GPS tracks.

Comma.ai steering angle [137]: is a driving dataset composed of 7.25 hours of largely highway routes. It was recorded as 360 × 180 camera images at 20 fps (divided into 11 different clips), together with steering angles, among other vehicle data (speed, GPS, etc.).

Apolloscape [122]: is a driving/urban scene understanding dataset that focuses on 3D semantic reconstruction of the environment. It provides highly precise information about location and 6D camera pose, as well as a much larger amount of dense per-pixel annotations than other datasets. Along with that, depth information is retrieved from a LiDAR sensor, which allows the scene to be semantically reconstructed in 3D as a point cloud. It also provides RGB stereo pairs as video sequences, recorded under various weather conditions and at different daytimes. These video sequences and their per-pixel instance annotations make this dataset very interesting for a wide variety of predictive models.
4.3 Object and Video Classification Datasets

Sports1M [123]: is a video classification dataset that also consists of annotated YouTube videos. In this case, it is fully focused on sports: its 487 classes correspond to the sport label retrieved from the YouTube Topics API. Video resolution, duration and frame rate differ across the available videos, but they can be normalized when accessed from YouTube. It is much bigger than UCF101 (over 1 million videos), and movement is also much more frequent.

YouTube-8M [124]: the Sports1M [123] dataset has been, since 2016, part of a bigger one called YouTube-8M, which follows the same philosophy but with all kinds of videos, not just sports. Moreover, it has been updated in order to improve the quality and precision of its annotations. In 2019, YouTube-8M Segments was released, with segment-level human-verified labels on about 237 000 video segments of 1000 different classes, collected from the validation set of the YouTube-8M dataset. Since YouTube is the biggest video source on the planet, having annotations for some of its videos at the segment level is valuable for predictive models.

YFCC100M [125]: the Yahoo Flickr Creative Commons 100 Million Dataset is a collection of 100 million images and videos uploaded to Flickr between 2004 and 2014. All those media files were published on Flickr under a Creative Commons license, overcoming some of the biggest issues affecting existing multimedia datasets: licensing and volume. Although only 0.8% of the elements of the dataset are videos, it is still useful for predictive models due to the great variety of those videos, and therefore the challenge that they represent.

TABLE 1: Summary of the most widely used datasets for video prediction (S/R: Synthetic/Real, st: stereo, de: depth, ss: semantic segmentation, is: instance segmentation, sem: semantic, I/O: Indoor/Outdoor environment, bb: bounding box, Act: action label, ann: annotated, env: environment, ToF: Time of Flight, vp: camera viewpoint with respect to the human). For each dataset, grouped as in this section, the table reports the year, S/R, number of videos, frames and annotated frames, resolution, number of classes, the provided data and ground truth, and the recording environment. Some dataset names are abbreviated for readability; some values are estimated from the frame rate and the total number of frames or videos, as the original values are not provided by the authors; and for datasets generated from a game, algorithm or simulation, as many frames as needed can be generated.
4.4 Video Prediction Datasets

Standard bouncing balls dataset [126]: is a common test set for models that generate high-dimensional sequences. It consists of simulations of three balls bouncing in a box. Its clips can be generated randomly at a custom resolution, but the common setup is composed of 4000 training videos, 200 testing videos and 200 more for validation. This kind of dataset is purely focused on video prediction.

Van Hateren Dataset of natural videos (version [127]): is a very small dataset of 56 videos, each 64 frames long, that has been widely used in unsupervised learning. The original images were taken, and given for scientific use, by the photographer Hans van Hateren, and they feature moving animals in grasslands along rivers and streams. The frame size is 128 × 128 pixels. The version reviewed here is the one provided along with the work of Cadieu and Olshausen [127].

NORBvideos [128]: the NORB (NYU Object Recognition Benchmark) dataset [138] is a compilation of static stereo pairs of 50 homogeneously colored objects from various points of view and 6 lighting conditions. Those images were processed to obtain their object masks and even their cast shadows, allowing the dataset to be augmented with random backgrounds. Viewpoints are determined by rotating the camera through 9 elevations and 18 azimuths (every 20 degrees) around the object. The NORBvideos dataset was built by sequencing all these frames for each object.
Moving MNIST [74] (M-MNIST): is a video prediction dataset built by composing 20-frame video sequences in which two handwritten digits from the MNIST database are placed inside a 64 × 64 patch and moved with some velocity and direction across frames, potentially overlapping each other. This dataset is almost infinite (as new sequences can be generated on the fly), and it also exhibits interesting behaviours due to occlusions and the dynamics of digits bouncing off the walls of the patch. For these reasons, this dataset is widely used by many predictive models. A stochastic variant of this dataset is also available. In the original M-MNIST the digits move with constant velocity and bounce off the walls in a deterministic manner. In contrast, in SM-MNIST the digits move with a constant velocity along a trajectory until they hit a wall, at which point they bounce off with a random speed and direction. In this way, moments of uncertainty (each time a digit hits a wall) are interspersed with deterministic motion. A sketch of how such sequences can be generated is given at the end of this subsection.

Robotic Pushing Dataset [89]: is a dataset created for learning about physical object motion. It consists of 640 × 512 pixel image sequences of 10 different 7-degree-of-freedom robotic arms interacting with real-world physical objects. No additional labeling is given; the dataset was designed to model motion at the pixel level through deep learning algorithms based on convolutional LSTMs (ConvLSTM).

BAIR Robot Pushing Dataset (used in [129]): the BAIR (Berkeley Artificial Intelligence Research) group has been working on robots that can learn through unsupervised (in this case also known as self-supervised) training, that is, by learning the consequences that their actions (movement of the arm and gripper) have on the data they can measure (images from two cameras). In this way, the robot assimilates the physics of the objects and can predict the effects that its actions will generate on the environment, allowing it to plan strategies to achieve more general goals. This was improved by showing the robot how it can grab tools to interact with other objects. The dataset is composed of hours of this self-supervised learning with the Sawyer robotic arm.

RoboNet [130]: is a dataset composed of the aggregation of various self-supervised training sequences of seven robotic arms from four different research laboratories. The previously described BAIR group is one of them, along with the Stanford AI Laboratory, the GRASP Lab of the University of Pennsylvania and Google Brain Robotics. It was created with the goal of becoming a standard for robotic self-supervised learning, in the same way as ImageNet is for images. Several experiments have been performed studying how transfer among robotic arms can be achieved.
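As an example of how such synthetic prediction data can be produced, the following sketch generates a Moving-MNIST-style sequence with deterministic bouncing (the stochastic SM-MNIST variant would resample the velocity at each wall hit). Random arrays stand in for actual MNIST digits, and the canvas size, sequence length and velocity range mirror, but do not exactly reproduce, the original setup of [74].

```python
import numpy as np

def moving_mnist_sequence(digits, seq_len=20, size=64, rng=None):
    """Generate one Moving-MNIST-style sequence: each 28x28 digit moves with a
    constant velocity and bounces off the walls of a size x size canvas."""
    rng = rng or np.random.default_rng()
    seq = np.zeros((seq_len, size, size), dtype=np.float32)
    pos = rng.uniform(0, size - 28, (len(digits), 2))
    vel = rng.uniform(-3, 3, (len(digits), 2))
    for t in range(seq_len):
        for d, digit in enumerate(digits):
            x, y = pos[d].astype(int)
            seq[t, y:y + 28, x:x + 28] = np.maximum(seq[t, y:y + 28, x:x + 28], digit)
        pos += vel
        for axis in range(2):
            # bounce: reflect the velocity component when a digit hits a wall
            hit = (pos[:, axis] < 0) | (pos[:, axis] > size - 28)
            vel[hit, axis] *= -1
            pos[:, axis] = np.clip(pos[:, axis], 0, size - 28)
    return seq

fake_digits = [np.random.rand(28, 28) for _ in range(2)]  # stand-ins for MNIST digits
video = moving_mnist_sequence(fake_digits)
print(video.shape)  # (20, 64, 64)
```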
4.5 Other-purpose and Multi-purpose Datasets

ViSOR [131]: ViSOR (Video Surveillance Online Repository) is a repository designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems. Its raw data could be very useful for video prediction due to its implicitly static camera setup.

PROST [132]: is a method for online tracking that used ten manually annotated videos to test its performance. Four of them were created by the PROST authors, and they form the dataset of the same name. The remaining six sequences were borrowed from other authors, who released their annotated clips to test their tracking methods. We consider both the 4-sequence PROST dataset and the 10-sequence aggregated dataset when providing statistics. In each video, different challenges are presented for tracking methods: occlusions, 3D motion, varying illumination, heavy appearance/scale changes, moving camera, and motion blur, among others. The provided annotations include bounding boxes for the object/element being tracked.

Arcade Learning Environment [133]: is a platform that enables machine learning algorithms to interact with the Atari 2600 open-source emulator Stella to play over 500 Atari games. The interface provides a single 2D frame of 210 × 160 pixels resolution at 60 fps in real time, and up to 6000 fps when running at full speed. It also offers the possibility of saving and restoring the state of a game. Although its obvious main application is reinforcement learning, it could also be exploited as a source of almost infinite interactive video sequences from which prediction models can learn.

Inria 3DMovie Dataset v2 [134]: is a video dataset whose data was extracted from the StreetDance 3D stereo movies. The dataset includes stereo pairs and manually generated ground truth for human segmentation, poses and bounding boxes. The second version of this dataset, used in [134], is composed of 27 clips, representing 2476 frames, of which just a sparse subset of 235 were annotated.

RobotriX [16]: is a synthetic dataset designed for assistance robotics, consisting of sequences where a humanoid robot moves through various indoor scenes and interacts with objects, recorded from multiple points of view, including robot-mounted cameras. It provides a huge variety of ground-truth data generated synthetically from highly realistic environments deployed on the cutting-edge game engine UnrealEngine through the openly available tool UnrealROX [139]. RGB frames are provided at 1920 × 1080 pixels resolution and at 60 fps, along with pixel-precise instance masks, depth and normal maps, and 6D poses of objects, skeletons and cameras. Moreover, UnrealROX is an open-source tool for retrieving ground-truth data from any simulation running in UnrealEngine.

UASOL [135]: is a large-scale dataset consisting of high-resolution sequences of stereo pairs recorded outdoors from a pedestrian (egocentric) point of view. Along with them, precise depth maps are provided, computed offline from the stereo pairs captured by the same camera. This dataset is intended to be useful for depth estimation, both from single and stereo images, research fields where outdoor and pedestrian-point-of-view data are not abundant. Frames were taken at a resolution of 2280 × 1282 pixels at 15 fps.

5 VIDEO PREDICTION METHODS

In the video prediction literature we find a broad range of different methods and approaches. Early models focused on directly predicting raw pixel intensities, implicitly modeling scene dynamics and low-level details (Section 5.1). However, extracting a meaningful and robust representation from raw videos is challenging, since the pixel space is high-dimensional and extremely variable. From this point, reducing the supervision effort and the representation dimensionality emerged as a natural evolution. On the one hand, authors aimed to disentangle the factors of variation from the visual content, i.e. to factorize the prediction space.
For this purpose, they: (1) formulated the prediction problem in an intermediate transformation space by explicitly modeling the source of variability as transformations between frames (Section 5.2); or (2) separated motion from the visual content with a two-stream computation (Section 5.3). On the other hand, some models narrowed the output space by conditioning the predictions on extra variables (Section 5.4), or by reformulating the problem in a higher-level space (Section 5.5). High-level representation spaces are increasingly attractive, since intelligent systems rarely rely on raw pixel information for decision making. Besides simplifying the prediction task, some other works addressed the future uncertainty in predictions. As the vast majority of video prediction models are deterministic, they are unable to manage probabilistic environments. To address this issue, several authors proposed modeling future uncertainty with probabilistic models (Section 5.6).

Fig. 3: Classification of video prediction models: direct pixel synthesis through implicit modeling of scene dynamics; factorizing the prediction space, either using explicit transformations or with explicit motion-from-content separation; narrowing the prediction space, by conditioning on extra variables or by moving to a high-level feature space; and incorporating uncertainty using probabilistic approaches.

So far in the literature, there is no specific taxonomy that classifies video prediction models. In this review, we have classified the existing methods according to the video prediction problem they addressed, following the classification illustrated in Figure 3. For simplicity, each subsection directly extends the last level of the taxonomy. Moreover, some methods in this review could be classified in more than one category, since they addressed multiple problems. For instance, [9], [54], [85] are probabilistic models making predictions in a high-level space, as they addressed both the future uncertainty and the high dimensionality of videos. The category of these models was specified according to their main contribution. The most relevant methods, ordered chronologically, are summarized in Table 2, which contains low-level details.

Prediction is a widely discussed topic in different fields and at different levels of abstraction. For instance, future prediction from a static image [3], [106], [140]–[143], vehicle behavior prediction [144] and human action prediction [17] are different but inspiring research fields. Although related, these topics are outside the scope of this particular review, as it focuses purely on video prediction methods using a sequence of previous frames as context and is limited to 2D RGB data.

5.1 Direct Pixel Synthesis

Initial video prediction models attempted to directly predict future pixel intensities without any explicit modeling of the scene dynamics. Ranzato et al. [73] discretized video frames into patch clusters using k-means. They assumed that non-overlapping patches are equally different in a k-means discretized space, yet similarities can be found between patches. The method is a convolutional extension of an RNN-based model [145] making short-term predictions at the patch level. As the full-resolution frame is a composition of the predicted patches, some tiling effect can be noticed. Predictions of large and fast-moving objects are accurate; however, when it comes to small and slow-moving objects, there is still room for improvement.
These are common issues for most methods making predictions at the patch level. Addressing longer-term predictions, Srivastava et al. [74] proposed different AE-based approaches incorporating LSTM units to model the temporal coherence. Using convolutional [146] and flow [147] percepts alongside RGB image patches, the authors tested the models on multi-domain tasks and considered both unconditioned and conditioned decoder versions. The latter only marginally improved the prediction accuracy. Replacing the fully connected LSTMs with convolutional LSTMs, Shi et al. proposed an end-to-end model efficiently exploiting spatial correlations [13]. This enhanced prediction accuracy and reduced the number of parameters.

Inspired by adversarial training: Building on the recent success of the Laplacian Generative Adversarial Network (LAPGAN), Mathieu et al. proposed the first multi-scale architecture for video prediction that was trained in an adversarial fashion [43]. Their novel GDL regularization combined with ℓ1-based reconstruction and adversarial training represented a leap over the previous state-of-the-art models [73], [74] in terms of prediction sharpness. However, it was outperformed by the Predictive Coding Network (PredNet) [75], which stacks several ConvLSTMs vertically connected by a bottom-up propagation of the local ℓ1 error computed at each level. Prior to PredNet, the same authors proposed the Predictive Generative Network (PGN) [49], an end-to-end model trained with a weighted combination of adversarial loss and MSE on synthetic data. However, no tests on natural videos or comparisons with state-of-the-art predictive models were carried out. Using a similar training strategy to [43], Zhou et al. used a convolutional AE to learn long-term dependencies from time-lapse videos [103]. Building on Progressively Growing GANs (PGGANs) [148], Aigner et al. proposed FutureGAN [69], a three-dimensional (3D) convolutional encoder-decoder (ED)-based model. They used the Wasserstein GAN with gradient penalty (WGAN-GP) loss [149] and conducted experiments on increasingly complex datasets. Extending [13], Zhang et al. proposed a novel LSTM-based architecture where hidden states are updated along a z-order curve [70]. Dealing with distortion and temporal inconsistency in predictions and inspired by the Human Visual System (HVS), Jin et al. [150] first incorporated multi-frequency analysis into the video prediction task to decompose images into low- and high-frequency bands. This allowed high-fidelity predictions that are temporally consistent with the ground truth, as the model better leverages the spatial and temporal details. The proposed method outperformed the previous state of the art in all metrics except the Learned Perceptual Image Patch Similarity (LPIPS), where probabilistic models achieved a better performance, since their predictions are clearer and more realistic but less consistent with the ground truth. Distortion and blurriness are further accentuated when predicting under fast camera motion. To this end, Shouno [151] implemented a hierarchical residual network with top-down connections. Leveraging parallel prediction at multiple scales, the authors reported finer details and textures under fast and large camera motion.
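The GDL regularization used by Mathieu et al. [43], mentioned earlier in this subsection, penalizes mismatches between the spatial gradients of the prediction and the ground truth rather than the raw intensities. The sketch below is an illustrative formulation of that idea (not the authors' code), combined with an ℓ1 term as is common practice; the weighting and the exponent alpha are arbitrary.

```python
import torch

def gradient_difference_loss(pred, target, alpha=1.0):
    """GDL-style term: penalize the mismatch between the spatial gradients of
    the predicted and ground-truth frames, which sharpens edges that plain
    pixel-wise losses tend to blur. Inputs have shape (N, C, H, W)."""
    dx_pred = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dx_true = (target[..., :, 1:] - target[..., :, :-1]).abs()
    dy_pred = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    dy_true = (target[..., 1:, :] - target[..., :-1, :]).abs()
    return ((dx_pred - dx_true).abs() ** alpha).mean() + ((dy_pred - dy_true).abs() ** alpha).mean()

pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
loss = 0.5 * torch.nn.functional.l1_loss(pred, target) + 0.5 * gradient_difference_loss(pred, target)
print(float(loss))
```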
Bidirectional flow: Under the assumption that video sequences are symmetric in time, Kwon et al. [101] explored a retrospective prediction scheme, training a generator for both forward and backward prediction (reversing the input sequence to predict the past). Their cycle-GAN-based approach ensures the consistency of bidirectional prediction through retrospective cycle constraints. Similarly, Hu et al. [57] proposed a novel cycle-consistency loss used to train a GAN-based approach (VPGAN). Future frames are generated from a sequence of context frames and their variation in time, denoted as Z. Under the assumption that Z is symmetric in the encoding space, it is manipulated by the model to generate desirable moving directions. In the same spirit, other works focused on both forward and backward predictions [37], [152]. Enabling state sharing between the encoder and decoder, Oliu et al. proposed the folded Recurrent Neural Network (fRNN) [153], a recurrent AE architecture featuring GRUs that implement a bidirectional flow of the information. The model demonstrated a stratified representation, which makes the topology more explainable, as well as more efficient than regular AEs in terms of memory consumption and computational requirements.

Exploiting 3D convolutions: For modeling short-term features, Wang et al. [66] integrated 3D convolutions into a recurrent network, demonstrating state-of-the-art results in both video prediction and early activity recognition. While 3D convolutions efficiently preserve local dynamics, RNNs model the long-range context. The eidetic 3D LSTM (E3D-LSTM) network, represented in Figure 4, features a gate-controlled self-attention module, i.e. eidetic 3D memory, that effectively manages historical memory records across multiple time steps. This enables long-range video reasoning, outperforming previous approaches. Some other works used 3D convolutional operations to model the time dimension [69].

Analyzing the previous works, Byeon et al. [76] identified a lack of spatio-temporal context in the representations, leading to blurry results when it comes to future uncertainty. Although the authors addressed this contextual limitation with dilated convolutions and multi-scale architectures, the context representation progressively vanishes in long-term predictions. To address this issue, they proposed a context-aware model that efficiently aggregates per-pixel contextual information at each layer and in multiple directions. The core of their proposal is a context-aware layer consisting of two blocks, one aggregating the information from multiple directions and the other blending them into a unified context.

Fig. 4: Representation of the 3D encoder-decoder architecture of E3D-LSTM [66]. After reducing T consecutive input frames to high-dimensional feature maps, these are directly fed into a novel eidetic module for modeling long-term spatio-temporal dependencies. Finally, a stacked 3D CNN decoder outputs the predicted video frames. For classification tasks, the hidden states can be directly used as the learned video representation. Figure extracted from [66].

Fig. 5: Representation of transformation-based approaches. (a) Vector-based, using a bilinear interpolation: It+1(x, y) = f(It(x + u, y + v)). (b) Kernel-based, applying the transformation as a convolutional operation: It+1(x, y) = K(x, y) ∗ P(x, y). Figure inspired by [154].

Extracting a robust representation from raw pixel values is an overly complicated task due to the high dimensionality of the pixel space. The per-pixel variability between consecutive frames causes an exponential growth in the prediction error on the long-term horizon.
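Several of the direct-synthesis models above replace the fully connected LSTM with a convolutional one so that the recurrent state keeps its spatial layout [13]. The sketch below is a simplified ConvLSTM cell rolled over a short clip (an illustration, not the exact cell of [13]; peephole connections and other refinements are omitted, and all sizes are arbitrary).

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: the gates are computed with convolutions instead of
    fully connected products, so the hidden state has shape (N, hidden, H, W)."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

# Roll the cell over a short clip and decode the last hidden state into a frame.
cell = ConvLSTMCell(3, 32)
to_frame = nn.Conv2d(32, 3, 3, padding=1)
clip = torch.rand(2, 5, 3, 64, 64)            # (N, T, C, H, W)
h = c = torch.zeros(2, 32, 64, 64)
for t in range(clip.shape[1]):
    h, c = cell(clip[:, t], (h, c))
prediction = to_frame(h)                       # next-frame estimate
print(prediction.shape)                        # torch.Size([2, 3, 64, 64])
```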
5.2 Using Explicit Transformations Let X = (Xt−n, . . . , Xt−1, Xt) be a video sequence of n frames, where t denotes time. Instead of learning the visual appearance, transformation-based approaches assume that the visual information is already available in the input sequence. To deal with the strong similarity and pixel redundancy between successive frames, these methods explicitly model the transformations that take a frame at time t to the frame at t + 1. These models are formally defined as follows:

Yt+1 = T (G(Xt−n:t), Xt−n:t), (1)

where G is a learned function that outputs future transformation parameters which, applied to the last observed frame Xt using the function T, generate the future frame prediction Yt+1. According to the classification of Reda et al. [154], the function T can be defined as a vector-based resampling, such as bilinear sampling, or an adaptive kernel-based resampling, e.g. using convolutional operations. For instance, a bilinear sampling operation is defined as:

Yt+1(x, y) = f(Xt(x + u, y + v)), (2)

where f is a bilinear interpolator as in [7], [155], [156], (u, v) is a motion vector predicted by G, and Xt(x, y) is the pixel value at (x, y) in the last observed frame Xt. Approaches following this formulation are categorized as vector-based resampling operations and are depicted in Figure 5a. On the other hand, in kernel-based resampling the function G predicts a kernel K(x, y) which is applied as a convolution operation using T, as depicted in Figure 5b and mathematically represented as follows:

Yt+1(x, y) = K(x, y) ∗ Pt(x, y), (3)

where K(x, y) ∈ R^{N×N} is the 2D kernel predicted by the function G and Pt(x, y) is an N × N patch centered at (x, y). Combining kernel and vector-based resampling into a hybrid solution, Reda et al. [154] proposed the Spatially Displaced Convolution (SDC) module that synthesizes high-resolution images by applying a learned per-pixel motion vector and kernel at a displaced location in the source image. Their 3D CNN model, trained on synthetic data and featuring the SDC modules, reported promising high-fidelity predictions. 5.2.1 Vector-based Resampling Bilinear models use multiplicative interactions to extract transformations from pairs of observations in order to relate images, such as Gated Autoencoders (GAEs) [157]. Inspired by these models, Michalski et al. proposed the Predictive Gating Pyramid (PGP) [158], consisting of a recurrent pyramid of stacked GAEs. To the best of our knowledge, this was the first attempt to predict future frames in the affine transform space. Multiple GAEs are stacked to represent a hierarchy of transformations and capture higher-order dependencies. From the experiments on predicting frequency-modulated sine waves, the authors stated that standard RNNs were outperformed in terms of accuracy. However, no performance comparison was conducted on videos. Based on the Spatial Transformer (ST) module [159]: To provide spatial transformation capabilities to existing CNNs, Jaderberg et al. [159] proposed the ST module represented in Figure 6. It regresses different affine transformation parameters for each input, to be applied as a single transformation to the whole feature map(s) or image(s). Moreover, it can be incorporated into any part of a CNN and is fully differentiable. The ST module is the essence of vector-based resampling approaches for video prediction. As an extension, Patraucean et al.
[77] modified the grid generator to consider per-pixel transformations instead of a single dense transformation map for the entire image. They nested an LSTM-based temporal encoder into a spatial AE, proposing the AE-convLSTM-flow architecture. The prediction is generated by resampling the current frame with the flow-based predicted transformation.

Fig. 6: A representation of the spatial transformer module proposed by [159]. First, the localization network regresses the transformation parameters, denoted as θ, from the input feature map U. Then, the grid generator creates a sampling grid from the predicted transformation parameters. Finally, the sampler produces the output map by sampling the input at the points defined in the sampling grid. Figure extracted from [159].

Using the components of the AE-convLSTM-flow architecture, Lu et al. [78] assembled an extrapolation module which is unfolded in time for multi-step prediction. Their Flexible Spatio-temporal Network (FSTN) features a novel loss function using the DeePSiM perceptual loss [44] in order to mitigate blurriness. An exhaustive experimentation and ablation study was carried out, testing multiple combinations of loss functions. Also inspired by the ST module for the volume sampling layer, Liu et al. proposed the Deep Voxel Flow (DVF) architecture [7]. It consists of a multi-scale flow-based ED model originally designed for the video frame interpolation task, but also evaluated on a predictive basis, reporting sharp results. Liang et al. [55] use a flow-warping layer based on bilinear interpolation. Finn et al. proposed the Spatial Transformer Predictor (STP) motion-based model [89], producing 2D affine transformations for bilinear sampling. Pursuing efficiency, Amersfoort et al. [71] proposed a CNN designed to predict local affine transformations of overlapping image patches. Unlike the ST module, the authors estimated transformations of input frames off-line and at patch level. As the model is parameter-efficient, it was unfolded in time for multi-step prediction. This resembles RNNs, as the parameters are shared over time and the local affine transforms play the role of recurrent states. 5.2.2 Kernel-based Resampling As a promising alternative to vector-based resampling, recent approaches synthesize pixels by convolving input patches with a predicted kernel. However, convolutional operations are limited in learning spatially invariant representations of complex transformations. Moreover, due to their local receptive fields, global spatial information is not fully preserved. Using larger kernels would help to preserve global features, but at the cost of higher memory consumption. Pooling layers are another alternative, at the cost of losing spatial resolution. Preserving spatial resolution at a low computational cost is still an open challenge for the video frame prediction task. Transformation layers used in vector-based resampling [7], [77], [159] enabled CNNs to be spatially invariant and also inspired kernel-based architectures. Inspired by the Convolutional Dynamic Neural Advection (CDNA) module [89]: In addition to the STP vector-based model, Finn et al. [89] proposed two different kernel-based motion prediction modules outperforming previous approaches [43], [80]: (1) the Dynamic Neural Advection (DNA) module, which predicts a different distribution for each pixel, and (2) the CDNA module, which instead of predicting a different distribution for each pixel, predicts multiple discrete distributions that are applied convolutionally to the input.
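To make the two resampling schemes of Section 5.2 concrete, the sketch below warps the last observed frame with a dense motion field (vector-based, Eq. (2)) and applies softmax-normalized per-pixel kernels to local patches (kernel-based, Eq. (3)). The random motion field and kernels stand in for the output of a learned function G; this is an illustrative sketch under these assumptions, not the implementation of any particular model.

# Sketch of vector-based (Eq. 2) and kernel-based (Eq. 3) resampling of the last frame.
import torch
import torch.nn.functional as F

def vector_based_resampling(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) pixel offsets (u, v). Bilinear warping."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = (xs.float() + flow[:, 0]) / (W - 1) * 2 - 1      # normalize to [-1, 1]
    grid_y = (ys.float() + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)              # (B, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear", align_corners=True)

def kernel_based_resampling(frame, kernels):
    """frame: (B, C, H, W); kernels: (B, N*N, H, W), one adaptive kernel per pixel."""
    B, C, H, W = frame.shape
    N = int(kernels.shape[1] ** 0.5)
    kernels = torch.softmax(kernels, dim=1)                   # normalized per-pixel kernels
    patches = F.unfold(frame, N, padding=N // 2).view(B, C, N * N, H, W)
    return (patches * kernels.unsqueeze(1)).sum(dim=2)        # weighted sum over each patch

if __name__ == "__main__":
    frame = torch.rand(2, 3, 32, 32)
    flow = torch.randn(2, 2, 32, 32)       # stand-in for the motion field predicted by G
    kernels = torch.randn(2, 25, 32, 32)   # stand-in for 5x5 kernels predicted by G
    print(vector_based_resampling(frame, flow).shape, kernel_based_resampling(frame, kernels).shape)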
While CDNA and STP mask out objects that are moving in consistent directions, the DNA module produces per-pixel motion. These modules inspired several kernel-based approaches. Similar to the CDNA module, Klein et al. proposed the Dynamic Convolutional Layer (DCL) [160] for short-range weather prediction. Likewise, Brabandere et al. [161] proposed the Dynamic Filter Networks (DFN), generating sample-specific (for each image) and position-specific (for each pixel) kernels. This enabled sophisticated and local filtering operations in comparison with the ST module, which is limited to global spatial transformations. Different from the CDNA model, the DFN uses a softmax layer to filter values of greater magnitude, thus obtaining sharper predictions. Moreover, temporal correlations are exploited using a parameter-efficient recurrent layer, much simpler than [13], [74]. Exploiting adversarial training, Vondrick et al. proposed a cGAN-based model [102] consisting of a discriminator similar to [67] and a CNN generator featuring a transformer module inspired by the CDNA model. Different from the CDNA model, transformations are not applied recurrently on a per-frame basis. To deal with in-the-wild videos and make predictions invariant to camera motion, the authors stabilized the input videos. However, no performance comparison with previous works was conducted. Relying on kernel-based transformations and improving [162], Luc et al. [163] proposed the Transformation-based & TrIple Video Discriminator GAN (TrIVD-GAN-FP), featuring a novel recurrent unit that computes the parameters of a transformation used to warp previous hidden states without any supervision. These Transformation-based Spatial Recurrent Units (TSRUs) are generic modules and can replace any traditional recurrent unit in currently existing video prediction approaches. Object-centric representation: Instead of focusing on the whole input, Chen et al. [50] modeled the individual motion of local objects, i.e. object-centered representations. Based on the ST module and a pyramid-like sampling [164], the authors implemented an attention mechanism for object selection. Moreover, transformation kernels were generated dynamically as in the DFN, to then apply them to the last patch containing an object. Although object-centric prediction is novel, performance drops when dealing with multiple objects and occlusions, as the attention module fails to distinguish them correctly.

Fig. 7: MCnet with Multi-scale Motion-Content Residuals. While the motion encoder captures the temporal dynamics in a sequence of image differences, the content encoder extracts meaningful spatial features from the last observed RGB frame. After that, the network computes motion-content features that are fed into the decoder to predict the next frame. Figure extracted from [65].

5.3 Explicit Motion from Content Separation Drawing inspiration from two-stream architectures for action recognition [165], video generation from a static image [67], and unconditioned video generation [68], authors decided to factorize the video into content and motion, processing each on a separate pathway. By decomposing the high-dimensional video, the prediction is performed on lower-dimensional temporal dynamics separately from the spatial layout. Although this makes end-to-end training difficult, factorizing the prediction task into more tractable problems demonstrated good results.
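The sketch below illustrates this content-motion factorization: one encoder consumes a stack of frame differences (motion), another the last observed frame (content), and a decoder fuses both to synthesize the next frame. It is a simplified, assumption-laden illustration of the idea summarized in Figure 7, not the published MCnet [65] architecture.

# Minimal two-stream (motion/content) next-frame predictor sketch.
import torch
import torch.nn as nn

class TwoStreamPredictor(nn.Module):
    def __init__(self, channels=3, context=4, feat=32):
        super().__init__()
        # Motion pathway: temporal dynamics from (context - 1) frame differences.
        self.motion_enc = nn.Sequential(
            nn.Conv2d((context - 1) * channels, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # Content pathway: spatial layout from the last observed frame only.
        self.content_enc = nn.Sequential(
            nn.Conv2d(channels, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # Decoder fuses motion and content features into the next frame.
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, frames):                       # frames: (B, T, C, H, W)
        diffs = frames[:, 1:] - frames[:, :-1]       # image differences carry the motion
        motion = self.motion_enc(diffs.flatten(1, 2))
        content = self.content_enc(frames[:, -1])
        return self.decoder(torch.cat([motion, content], dim=1))

if __name__ == "__main__":
    model = TwoStreamPredictor()
    print(model(torch.rand(2, 4, 3, 64, 64)).shape)  # torch.Size([2, 3, 64, 64])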
The Motion-Content Network (MCnet) [65], represented in Figure 7, was the first end-to-end model that disentangled scene dynamics from the visual appearance. The authors performed an in-depth performance analysis ensuring the motion and content separation through generalization capabilities and stable long-term predictions compared to models that lack explicit motion-content factorization [43], [74]. In a similar fashion, yet working in a higher-level pose space, Denton et al. proposed the Disentangled-representation Net (DRNET) [79] using a novel adversarial loss (it isolates the scene dynamics from the visual content, considered as the discriminative component) to completely disentangle motion dynamics from content. Outperforming [43], [65], the DRNET demonstrated a clean motion-from-content separation by reporting plausible long-term predictions on both synthetic and natural videos. To improve prediction variability, Liang et al. [55] fused future-frame and future-flow prediction into a unified architecture with a shared probabilistic motion encoder. Aiming to mitigate the ghosting effect in disoccluded regions, Gao et al. [166] proposed a two-staged approach consisting of a separate computation of flow and pixel predictions. As they focused on inpainting occluded regions of the image using flow information, they improved results on disoccluded areas, avoiding undesirable artifacts and enhancing sharpness. Separating the moving objects and the static background, Wu et al. [167] proposed a two-staged architecture that first predicts the static background and then, using this information, predicts the moving objects in the foreground. Final results are generated through composition and by means of a video inpainting module. Reported predictions are quite accurate, yet performance was not contrasted with the latest video prediction models. Although previous approaches disentangled motion from content, they did not perform an explicit decomposition into low-dimensional components. Addressing this issue, Hsieh et al. proposed the Decompositional Disentangled Predictive Autoencoder (DDPAE) [168], which decomposes the high-dimensional video into components represented with low-dimensional temporal dynamics. On the Moving MNIST dataset, DDPAE first decomposes images into individual digits (components) and then factorizes each digit into its visual appearance and spatial location, the latter being easier to predict. Although experiments were performed on synthetic data, this approach represents a promising baseline to disentangle and decompose natural videos. Moreover, it is applicable to other existing models to improve their predictions. 5.4 Conditioned on Extra Variables Conditioning the prediction on extra variables such as vehicle odometry or robot state, among others, would narrow the prediction space. These variables have a direct influence on the dynamics of the scene, providing valuable information that facilitates the prediction task. For instance, the motion captured by a camera placed on the dashboard of an autonomous vehicle is directly influenced by the wheel-steering and acceleration. Without explicitly exploiting this information, we rely blindly on the model's capabilities to correlate the wheel-steering and acceleration with the perceived motion. However, the explicit use of these variables would guide the prediction. Following this paradigm, Oh et al. first made long-term video predictions conditioned by control inputs from Atari games [80].
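A minimal sketch of this kind of conditioning is given below: an extra action/state vector is broadcast spatially and fused with the frame encoding before decoding the next frame. Layer sizes and the fusion scheme are illustrative assumptions, not the architecture of [80] or [89].

# Sketch of conditioning next-frame prediction on an extra action/state vector.
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    def __init__(self, channels=3, action_dim=4, feat=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(channels, feat, 3, padding=1), nn.ReLU())
        self.action_proj = nn.Linear(action_dim, feat)     # embed the control variables
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, frame, action):                      # frame: (B, C, H, W); action: (B, A)
        feats = self.encoder(frame)
        a = self.action_proj(action)[:, :, None, None].expand_as(feats)  # broadcast spatially
        return self.decoder(torch.cat([feats, a], dim=1))

if __name__ == "__main__":
    model = ActionConditionedPredictor()
    frame, action = torch.rand(2, 3, 64, 64), torch.rand(2, 4)
    print(model(frame, action).shape)                      # torch.Size([2, 3, 64, 64])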
Although the ED-based models of Oh et al. reported very long-term predictions (over 100 frames), performance drops when dealing with small objects (e.g. bullets in Space Invaders) and when handling stochasticity, due to the squared error. However, simply minimizing the ℓ2 error can lead to accurate and long-term predictions for deterministic synthetic videos, such as those extracted from Atari video games. Building on [80], Chiappa et al. [169] proposed alternative architectures and training schemes alongside an in-depth performance analysis for both short and long-term prediction. Similar model-based control from visual inputs performed well in restricted scenarios [170], but was inadequate for unconstrained environments. These deterministic approaches are unable to deal with natural videos in the absence of control variables. To address this limitation, the models proposed by Finn et al. [89] successfully made predictions on natural images, conditioned on the robot state and robot-object interactions performed in a controlled scenario. These models predict per-pixel transformations conditioned by the previous frame, and finally combine them using a composition mask. They outperformed [43], [80] on both conditioned and unconditioned predictions; however, the quality of long-term predictions degrades over time because of the blurriness caused by the MSE loss function. Also using high-dimensional sensory input such as images, Dosovitskiy et al. [171] proposed a sensorimotor control model which enables interaction in complex and dynamic 3D environments. The approach is a reinforcement learning (RL)-based technique, with the difference that instead of building upon a monolithic state and a scalar reward, the authors consider high-dimensional input streams, such as raw visual input, alongside a stream of measurements or player statistics. Although the outputs are future measurements instead of visual predictions, it was proven that using multivariate data benefits decision-making over conventional scalar reward approaches. 5.5 In the High-level Feature Space Despite the vast work on video prediction models, there is still room for improvement in natural video prediction. To deal with the curse of dimensionality, authors reduced the prediction space to high-level representations, such as semantic and instance segmentation, and human pose. Since the pixels are categorical, the semantic space greatly simplifies the prediction task, yet unexpected deformations in semantic maps and disocclusions, i.e. initially occluded scene entities becoming visible, induce uncertainty. However, high-level prediction spaces are more tractable and constitute good intermediate representations. By bypassing the prediction in the pixel space, models become able to report longer-term and more accurate predictions. 5.5.1 Semantic Segmentation In recent years, semantic and instance representations have gained increasing attention, emerging as a promising avenue for complete scene understanding. By decomposing the visual scene into semantic entities, such as pedestrians, vehicles and obstacles, the output space is narrowed to high-level scene properties. This intermediate representation is a more tractable space, as pixel values of a semantic map are categorical. In other words, scene dynamics are modeled at the semantic entity level instead of at the pixel level.
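Prediction in this space is typically scored with the mean Intersection over Union (IoU) between predicted and ground-truth future segmentation maps (see Section 6). The sketch below shows that evaluation step, assuming integer class maps; the class count and ignore label are illustrative, Cityscapes-style conventions.

# Mean IoU between a predicted and a ground-truth future segmentation map.
import torch

def mean_iou(pred, target, num_classes, ignore_index=255):
    """pred, target: (B, H, W) integer class maps for the predicted future frame."""
    valid = target != ignore_index
    ious = []
    for c in range(num_classes):
        p, t = (pred == c) & valid, (target == c) & valid
        union = (p | t).sum().item()
        if union == 0:
            continue                                   # class absent from both maps: skip it
        ious.append((p & t).sum().item() / union)
    return sum(ious) / max(len(ious), 1)

if __name__ == "__main__":
    pred = torch.randint(0, 19, (2, 128, 256))         # e.g. 19 evaluation classes, as in Cityscapes
    target = torch.randint(0, 19, (2, 128, 256))
    print(mean_iou(pred, target, num_classes=19))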
Modeling dynamics at the semantic entity level has encouraged authors to (1) leverage future prediction to improve parsing results [51] and (2) directly predict segmentation maps into the future [8], [56], [172]. Exploring scene parsing in future frames, Jin et al. proposed the Parsing with prEdictive feAtuRe Learning (PEARL) framework [51], which was the first to explore the potential of a GAN-based frame prediction model to improve per-pixel segmentation. Specifically, this framework conducts two complementary predictive learning tasks. Firstly, it captures the temporal context from input data by using a single-frame prediction network. Then, these temporal features are embedded into a frame parsing network through a transform layer for generating per-pixel future segmentations. Although the predictive net was not compared with existing approaches, PEARL outperforms the traditional parsing methods by generating temporally consistent segmentations. In a similar fashion, Luc et al. [56] extended the msCNN model of [43] to the novel task of predicting semantic segmentations of future frames, using softmax pre-activations instead of raw pixels as input.

Fig. 8: Two-staged method proposed by Chiu et al. [173]. In the upper half, the student network consists of an ED-based architecture featuring a 3D convolutional forecasting module. It performs the forecasting task guided by an additional loss generated by the teacher network (represented in the lower half). Figure extracted from [173].

The use of intermediate features or higher-level data as input is a common practice in video prediction performed in the high-level feature space. Some authors refer to this type of input data as percepts. Luc et al. explored different combinations of loss functions, inputs (using RGB information alongside percepts), and outputs (autoregressive and batch models). Results on short, medium and long-term predictions are sound; however, the models are not end-to-end and they do not explicitly capture the temporal continuity across frames. To address this limitation and extending [51], Jin et al. first proposed a model for jointly predicting motion flow and scene parsing [174]. Flow-based representations implicitly draw temporal correlations from the input data, thus producing temporally coherent per-pixel segmentations. As in [56], the authors tested different network configurations, such as using Res101-FCN percepts for the prediction of semantic maps, and also performed multi-step prediction up to 10 time steps into the future. Per-pixel accuracy improved when segmenting small objects, e.g. pedestrians and traffic signs, which are more likely to vanish in long-term predictions. Similarly, except that the time dimension is modeled with LSTMs instead of motion flow estimation, Nabavi et al. proposed a simple bidirectional ED-LSTM [82] using segmentation masks as input. Although the literature on knowledge distillation [175], [176] stated that softmax pre-activations carry more information than class labels, this model outperforms [56], [174] on short-term predictions. Another relevant idea is to use motion flow estimation alongside LSTM-based temporal modeling. In this direction, Terwilliger et al. [10] proposed a novel method performing an LSTM-based feature-flow aggregation. The authors also tried to further simplify the semantic space by disentangling motion from semantic entities [65], achieving low overhead and efficiency.
Their prediction problem was decomposed into two subtasks, namely current-frame segmentation and future optical flow prediction, which are finally combined with a novel end-to-end warp layer. An improvement on short-term predictions was reported over previous works [56], [174], yet performance was worse on mid-term predictions. A different approach was proposed by Vora et al. [83], who first incorporated structure information to predict future 3D segmented point clouds. Their geometry-based model consists of several differentiable sub-modules: (1) the pixel-wise segmentation and depth estimation modules, which are jointly used to generate the 3D segmented point cloud of the current RGB frame; and (2) an LSTM-based module trained to predict future camera ego-motion trajectories. The future 3D segmented point clouds are obtained by transforming the previous point clouds with the predicted ego-motion. Their short-term predictions improved the results of [56]; however, the usefulness of structure information for longer-term predictions is not clear. The main disadvantage of two-staged, i.e. not end-to-end, approaches [10], [56], [82], [83], [174] is that their performance is constrained by external supervisory signals, e.g. optical flow [177], segmentation [178] and intermediate features or percepts [61]. Breaking this trend, Chiu et al. [173] were the first to jointly solve the semantic segmentation and forecasting problems in a single end-to-end trainable model using raw pixels as input. This ED architecture is based on two networks, with one performing the forecasting task (student) and the other (teacher) guiding the student by means of a novel knowledge distillation loss. An in-depth ablation study was performed, validating the performance of the ED architectures as well as the 3D convolution used for capturing the temporal scale instead of an LSTM or ConvLSTM, as in previous works. Departing from the flood of deterministic models, Bhattacharyya et al. proposed a Bayesian formulation of the ResNet model in a novel architecture to capture model and observation uncertainty [9]. As their main contribution, their dropout-based Bayesian approach leverages synthetic likelihoods [179], [180] to encourage prediction diversity and deal with multi-modal outcomes. Since Cityscapes sequences have been recorded in the frame of reference of a moving vehicle, the authors conditioned the predictions on vehicle odometry. 5.5.2 Instance Segmentation While great strides have been made in predicting future segmentation maps, some authors attempted to make predictions at a semantically richer level, i.e. the future prediction of semantic instances. Predicting future instance-level segmentations is a challenging and largely unexplored task. This is because instance labels are inconsistent and variable in number across the frames of a video sequence. Since the representation of semantic segmentation prediction models is of fixed size, they cannot directly address semantics at the instance level. To overcome this limitation, and introducing the novel task of predicting instance segmentations, Luc et al. [8] predict fixed-sized feature pyramids, i.e. features at multiple scales, used by the Mask R-CNN [181] network. The combination of dilated convolutions and multi-scale features efficiently preserves high-resolution details, improving the results over previous methods [56]. To further improve predictions, Sun et al.
[84] focused on modeling not only the spatio-temporal correlations between the pyramids, but also the intrinsic relations among the feature layers inside them. By enriching the contextual information using the proposed Context Pyramid ConvLSTMs (CP-ConvLSTM), an improvement in the prediction was observed. Although the authors have not shown any long-term predictions nor compared with semantic segmentation models, their approach is currently the state of the art in the task of predicting instance segmentations, outperforming [8]. 5.5.3 Other High-level Spaces Although the semantic and instance segmentation spaces were the most used in video prediction, other high-level spaces such as human pose and keypoints represent a promising avenue. Human Pose: As the human pose is a low-dimensional and interpretable structure, it represents a cheap supervisory signal for predictive models. This fostered pose-guided prediction methods, where future frame regression in the pixel space is conditioned by an intermediate prediction of human poses. However, these methods are limited to videos with human presence. As this review focuses on video prediction, we briefly review some of the most relevant methods predicting human poses as an intermediate representation. From a supervised prediction of human poses, Villegas et al. [53] regress future frames through analogy making [182]. Although the background is not considered in the prediction, the authors compared the model against [13], [43], reporting long-term results. To make the model unsupervised with respect to the human pose, Wichers et al. [52] adopted different training strategies: end-to-end prediction minimizing the ℓ2 loss, and analogy making, constraining the predicted features to be close to the outputs of the future encoder. Different from [53], in this work the predictions are made in the feature space. As a probabilistic alternative, Walker et al. [54] fused a conditional Variational Autoencoder (cVAE)-based probabilistic pose predictor with a GAN. While the probabilistic predictor enhances the diversity in the predicted poses, the adversarial network ensures prediction realism. As this model struggles with long-term predictions, Fushishita et al. [183] addressed long-term video prediction of multiple outcomes, avoiding error accumulation and vanishing gradients by using a one-dimensional CNN trained in an adversarial fashion. To enable multiple predictions, they used additional inputs ensuring trajectory and behavior variability at the human pose level. To better preserve the visual appearance in the predictions compared to [53], [65], [108], Tang et al. [184] first predict human poses using an LSTM-based model and then synthesize pose-conditioned future frames using a combination of different networks: a global GAN modeling the time-invariant background and a coarse human pose, a local GAN refining the coarse-predicted human pose, and a 3D-AE to ensure temporal consistency across frames. Keypoint-based representations: The keypoint coordinate space is a meaningful, tractable and structured representation for prediction, ensuring stable learning. It forces the model's internal representation to contain object-level information. This leads to better results on tasks requiring object-level understanding, such as trajectory prediction, action recognition and reward prediction. As keypoints are a natural representation of dynamic objects, Minderer et al. [85] reformulated the prediction task in the keypoint coordinate space.
They proposed an AE architecture with a keypointbased representational bottleneck, consisting of a VRNN that predicts dynamics in the keypoint space. Although this model qualitatively outperforms the Stochastic Video Generation (SVG) [81], Stochastic Adversarial Video Prediction (SAVP) [108] and EPVA [52] models, the quantitative evaluation reported similar results. 5.6 Incorporating Uncertainty Although high-level representations significantly reduce the prediction space, the underlying distribution still has multiple modes. In other words, different plausible outcomes would be equally probable for the same input sequence. Addressing multimodal distributions is not straightforward for regression and classification approaches, as they regress to the mean and aim to discretize a continuous highdimensional space, respectively. To deal with the inherent unpredictability of natural videos, some works introduced latent variables into existing deterministic models or directly relied on generative models such as GANs and VAEs. Inspired by the DVF, Xue et al. [199] proposed a cVAEbased [219], [220] multi-scale model featuring a novel cross convolutional layer trained to regress the difference image or Eulerian motion [221]. Background on natural videos is not uniform, however the model implicitly assumes that the difference image would accurately capture the movement in foreground objects. Introducing latent variables into a convolutional AE, Goroshin et al. [208] proposed a probabilistic model for learning linearized feature representations to linearly extrapolate the predicted frame in a feature space. Uncertainty is introduced to the loss by using a cosine distance as an explicit curvature penalty. Authors focused on evaluating the linearization properties, yet the model was not contrasted to previous works. Extending [141], [199], Fragkiadaki et al. [96] proposed several architectural changes and training schemes to handle marginalization over stochastic variables, such as sampling from the prior and variational inference. They proposed a stochastic ED architecture that predicts future optical flow, i.e., dense pixel motion field, used to spatially transform the current frame into the next frame prediction. To introduce uncertainty in predictions, the authors proposed the k-best-sample-loss (MCbest) that draws K outcomes penalizing those similar to the ground-truth. 16 TABLE 2: Summary of video prediction models (c: convolutional; r: recurrent; v: variational; ms: multi-scale; st: stacked; bi: bidirectional; P: Percepts; M: Motion; PL: Perceptual Loss; AL: Adversarial Loss; S/R: using Synthetic/Real datasets; SS: Semantic Segmentation; D: Depth; S: State; Po: Pose; O: Odometry; IS: Instance Segmentation; ms: multi-step prediction; pred-fr: number of predicted frames, ⋆ 1-5 frames, ⋆ ⋆ 5-10 frames, ⋆ ⋆ ⋆ 10-100 frames, ⋆ ⋆ ⋆ ⋆ over 100 frames; ood: indicates if model was tested on out-of-domain tasks). details method year based on architecture Ranzato et al. [73] Srivastava et al. [74] PGN [49] Shi et al. [13] BeyondMSE [43] PredNet [75] ContextVP [76] fRNN [153] E3d-LSTM [66] Kwon et al. [101] Znet [70] VPGAN [57] Jin et al. [150] Shouno et al. [151] 2014 2015 2015 2015 2016 2017 2018 2018 2019 2019 2019 2019 2020 2020 [145], [185] [186] [74] [60], [187] [22], [188] [88], [189] [13] [45], [192], [193] [13] [79], [193] [75] rCNN LSTM-AE LSTM-cED cLSTM msCNN stLSTMs MD-LSTM cGRU-AE r3D-CNN cycleGAN cLSTM GAN cED-GAN GAN PGP [158] Patraucean et al. [77] DFN [161] Amersfoort et al. 
[71] FSTN [78] Vondrick et al. [102] Chen et al. [50] DVF [7] SDC-Net [154] TrIVD-GAN-FP [163] 2014 [156] 2015 [186] 2016 [89], [160] 2017 [77] 2017 [44], [77] 2017 [67], [89] 2017 [71], [159], [161] 2017 [159] 2018 [196], [197] 2020 [142], [162], [166] st-rGAEs LSTM-cAE r-cED CNN LSTM-cED cGAN rCNN-ED ms-cED CNN DVD-GAN MCnet [65] Dual-GAN [55] DRNET [79] DPG [166] 2017 2017 2017 2019 [13], [165], [199] [100] [65] [89], [142] LSTM-cED VAE-GAN LSTM-ED cED Oh et al. [80] Finn et al. [89] 2015 2016 [13] [13], [80] rED st-cLSTMs Villegas et al. [53] PEARL [51] S2S [56] Walker et al. [54] Jin et al. [174] EPVA (EPVA) [52] Nabavi et al. [82] F2F et al. [8] Vora et al. [83] Chiu et al. [173] Bayes-WD-SL [9] Sun et al. [84] Terwilliger et al. [10] Struct-VRNN [85] 2017 2017 2017 2017 2017 2018 2018 2018 2018 2019 2019 2019 2019 2019 [182], [203], [204] [43] [205] [51], [56], [77] [53] [56], [174] [56], [181] [56], [174] [8] [65], [174] [206], [207] LSTM-cED cED msCNN vED cED LSTM-ED biLSTM-cED st-msCNN LSTM 3D-cED bayesResNet st-ms-cLSTM M-cLSTM cVRNN Goroshin et al. [208] Fragkiadaki et al. [96] EEN [99] SV2P [38] SVG [81] Castrejon et al. [97] Hu et al. [15] 2015 2017 2017 2018 2018 2019 2020 [209] [141], [199] [22], [75], [212] [89] [38] [81], [98] [56], [162], [216] cAE vED vED CDNA LSTM-cED vRNN cED datasets (train, valid, test) evaluation input output MS loss function S/R pred-fr ood code RGB RGB,P RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB,Z RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB ✗ X ✗ ✗ X X X X X X X X X X CE CE, ℓ2 M SE, AL CE ℓ1 , GDL, AL ℓ 1 ,ℓ 2 ℓ1 , GDL ℓ1 ℓ1 , ℓ2 , CE ℓ1 , LoG, AL ℓ2 , BCE, AL ℓ1 , Lcycle , AL ℓ2 , GDL, AL Lp , AL, P L R SR S S R SR R SR SR R SR R R R ⋆ ⋆⋆⋆ ⋆ ⋆⋆⋆ ⋆⋆ ⋆⋆ ⋆⋆ ⋆⋆⋆ ⋆⋆⋆ ⋆⋆⋆ ⋆⋆⋆ ⋆⋆⋆ ⋆⋆⋆ ⋆⋆⋆ ✗ X ✗ X ✗ X ✗ ✗ X ✗ ✗ ✗ ✗ ✗ ✗ X ✗ ✗ X X ✗ X X ✗ ✗ ✗ ✗ ✗ RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB X ✗ X X X X X X X X ℓ2 ℓ 2 , ℓδ BCE M SE ℓ 2 , ℓδ , P L CE, AL CE, ℓ2 , GDL, AL ℓ1 , T V ℓ1 , P L Lhinge [55] SR SR SR SR SR R SR R SR R ⋆ ⋆ ⋆⋆⋆ ⋆⋆ ⋆⋆⋆ ⋆⋆⋆ ⋆⋆ ⋆ ⋆⋆ ⋆⋆⋆ ✗ X X ✗ ✗ X ✗ X X ✗ ✗ X X ✗ ✗ ✗ ✗ X ✗ ✗ RGB RGB RGB RGB X X X X ℓp , GDL, AL ℓ1 , KL, AL ℓ2 , CE, AL ℓp , T V, P L, CE R R SR SR ⋆⋆⋆ ⋆⋆ ⋆⋆⋆⋆ ⋆⋆ ✗ ✗ X ✗ X ✗ X ✗ RGB RGB X X ℓ2 ℓ2 S R ⋆⋆⋆⋆ ⋆⋆⋆ X ✗ X X RGB,Po SS SS RGB SS,M RGB SS P,SS,IS ego-M SS SS P,IS SS RGB X ✗ X X X X X X ✗ ✗ X X X X ℓ2 , P L, AL [44] ℓ2 , AL ℓ1 , GDL, AL ℓ2 , CE, KL, AL ℓ1 , GDL, CE ℓ2 , AL CE ℓ2 ℓ1 CE, M SE KL ℓ2 , [181] CE, ℓ1 ℓ2 , KL R R R R R SR R R R R SR R R SR ⋆⋆⋆⋆ ⋆ ⋆⋆⋆ ⋆⋆⋆ ⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆ ⋆⋆⋆ ⋆ ⋆⋆ ⋆⋆⋆ ⋆⋆ ⋆⋆⋆ ⋆⋆ X X ✗ X X X ✗ X X ✗ X ✗ ✗ X ✗ ✗ X ✗ ✗ X ✗ X ✗ ✗ X ✗ X X RGB RGB RGB RGB RGB RGB SS,D,M ✗ ✗ X X X X X ℓ2 , penalty KL, M Cbest ℓ 1 , ℓ2 ℓp , KL ℓ2 , KL KL CE, ℓδ , Ld , Lc , Lp SR R SR SR SR SR R ⋆ ⋆ ⋆⋆ ⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆ ⋆⋆⋆ ✗ X ✗ ✗ ✗ ✗ X ✗ ✗ X X X ✗ ✗ Direct Pixel Synthesis [115], [127] [74], [113], [115], [123] [126] [74] [115], [123] [117], [119], [120], [137] [115], [117], [119], [120] [74], [111], [115] [74], [111], [190], [191] [115], [119], [120], [194], [195] [74], [111] [111], [129] [111], [119], [120], [129] [119], [120] Using Explicit Transformations [126], [128] [74], [113], [131], [132] [74], [115] [74], [115] [74], [115], [123], [131], [132] [125] [74], [115] [115], [118] [119], [124] [115], [129], [198] RGB RGB RGB RGB RGB RGB RGB RGB RGB,M RGB Explicit Motion from Content Separation [111], [112], [115], [123] [115], [118]–[120] [74], [111], [138], [200] [119], [201], [202] RGB RGB RGB RGB Conditioned on Extra Variables [133] [89], [117] RGB,A RGB,A,S In the High-level Feature Space [116], [117] 
[121], [136] [121], [136] [115], [116] [121], [137] [117] [121] [121] [121] [121], [122] [121] [121], [134] [121] [90], [117] RGB,Po RGB P RGB,Po RGB,P RGB P P ego-M RGB SS,O P RGB,P RGB Incorporating Uncertainty [138], [210] [117], [211] [213]–[215] [89], [117], [129] [74], [111], [129] [74], [121], [129] [121], [122], [217], [218] RGB RGB RGB RGB RGB RGB RGB 17 Incorporating latent variables into the deterministic CDNA architecture for the first time, Babaeizadeh et al. proposed the Stochastic Variational Video Prediction (SV2P) [38] model handling natural videos. Their timeinvariant posterior distribution is approximated from the entire input video sequence. Authors demonstrated that, by explicitly modeling uncertainty with latent variables, the deterministic CDNA model is outperformed. By combining a standard deterministic architecture (LSTM-ED) with stochastic latent variables, Denton et al. proposed the SVG network [81]. Different from SV2P, the prior is sampled from a time-varying posterior distribution, i.e. it is a learned-prior instead of fixed-prior sampled from the same distribution. Most of the VAEs use a fixed Gaussian as a prior, sampling randomly at each time step. Exploiting the temporal dependencies, a learned-prior predicts high variance in uncertain situations, and a low variance when a deterministic prediction suffices. The SVG model is easier to train and reported sharper predictions in contrast to [38]. Built upon SVG, Villegas et al. [222] implemented a baseline to perform an in-depth empirical study on the importance of the inductive bias, stochasticity, and model’s capacity in the video prediction task. Different from previous approaches, Henaff et al. proposed the Error Encoding Network (EEN) [99] that incorporates uncertainty by feeding back the residual error —the difference between the ground truth and the deterministic prediction— encoded as a low-dimensional latent variable. In this way, the model implicitly separates the input video into deterministic and stochastic components. On the one hand, latent variable-based approaches cover the space of possible outcomes, yet predictions lack of realism. On the other hand, GANs struggle with uncertainty, but predictions are more realistic. Searching for a tradeoff between VAEs and GANs, Lee et al. [108] proposed the SAVP model, being the first to combine latent variable models with GANs to improve variability in video predictions, while maintaining realism. Under the assumption that blurry predictions of VAEs are a sign of underfitting, Castrejon et al. extended the VRNNs to leverage a hierarchy of latent variables and better approximate data likelihood [97]. Although the backpropagation through a hierarchy of conditioned latents is not straightforward, several techniques alleviated this issue such as, KL beta warm-up, dense connectivity pattern between inputs and latents, Ladder Variational Autoencoders (LVAEs) [223]. As most of the probabilistic approaches fail in approximating the true distribution of future frames, Pottorff et al. [224] reformulated the video prediction task without making any assumption about the data distribution. They proposed the Invertible Linear Embedding (ILE) enabling exact maximum likelihood learning of video sequences, by combining an invertible neural network [225], also known as reversible flows, and a linear time-invariant dynamic system. The ILE handles nonlinear motion in the pixel space and scales better to longer-term predictions compared to adversarial models [43]. 
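Below is a hedged sketch of the learned-prior idea behind SVG-style models [81]: a prior network maps features of the observed frames to a Gaussian over a latent variable z that conditions the frame decoder, while at training time a posterior (conditioned on the future frame) is pulled towards this prior with a KL term. Shapes, layer sizes and names are illustrative assumptions, not the published implementation.

# Learned prior over a latent variable, with the KL term used to fit the posterior to it.
import torch
import torch.nn as nn

class LearnedPrior(nn.Module):
    def __init__(self, feat_dim=128, z_dim=16):
        super().__init__()
        self.net = nn.Linear(feat_dim, 2 * z_dim)       # outputs mean and log-variance

    def forward(self, context_feat):
        mu, logvar = self.net(context_feat).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample
        return z, mu, logvar

def kl_between(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, averaged over the batch."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum(-1).mean()

if __name__ == "__main__":
    prior = LearnedPrior()
    context_feat = torch.rand(8, 128)                   # e.g. a recurrent summary of past frames
    z, mu_p, logvar_p = prior(context_feat)
    # A posterior network (not shown) would produce mu_q/logvar_q from the target frame.
    mu_q, logvar_q = torch.zeros_like(mu_p), torch.zeros_like(logvar_p)
    print(z.shape, kl_between(mu_q, logvar_q, mu_p, logvar_p).item())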
While previous variational approaches [81], [108] focused on predicting a single low-resolution frame in restricted, predictable or simulated datasets, Hu et al. [15] jointly predict full-frame ego-motion, static scene, and object dynamics on complex real-world urban driving. Featuring a novel spatio-temporal module, their five-component architecture learns rich representations that incorporate both local and global spatio-temporal context. The authors validated the model on predicting semantic segmentation, depth and optical flow two seconds into the future, outperforming existing spatio-temporal architectures. However, no performance comparison with [81], [108] was carried out. 6 PERFORMANCE EVALUATION This section presents the results of the previously analyzed video prediction models on the most popular datasets, on the basis of the metrics described below. 6.1 Metrics and Evaluation Protocols For a fair evaluation of video prediction systems, multiple aspects of the prediction have to be addressed, such as whether the predicted sequences look realistic, are plausible and cover all possible outcomes. To the best of our knowledge, there are no evaluation protocols and metrics that evaluate predictions by fulfilling all these aspects simultaneously. The most widely used evaluation protocols for video prediction rely on image similarity-based metrics such as Mean Squared Error (MSE), the Structural Similarity Index Measure (SSIM) [226], and the Peak Signal to Noise Ratio (PSNR). However, evaluating a prediction according to the mismatch between its visual appearance and the ground truth is not always reliable. In practice, these metrics penalize all predictions that deviate from the ground truth. In other words, they prefer blurry predictions that nearly accommodate the exact ground truth over sharper and plausible, yet imperfect, generations [97], [108], [227]. Pixel-wise metrics do not always reflect how accurately a model captured video scene dynamics and their temporal variability. In addition, the success of a metric is influenced by the loss function used to train the model. For instance, models trained with an MSE loss function would obviously perform well on the MSE metric, but also on the PSNR metric, as the latter is based on MSE. Suffering from similar problems, SSIM measures the similarity between two images, from −1 (very dissimilar) to +1 (the same image). As a difference, it measures similarities over image patches instead of performing a pixel-wise comparison. These metrics are easily fooled by learning to match the background in predictions. To address this issue, Mathieu et al. [43] evaluated the predictions only on the dynamic parts of the sequence, avoiding background influence. As the pixel space is multimodal and highly dimensional, it is challenging to evaluate how accurately a prediction sequence covers the full distribution of possible outcomes. Addressing this issue, some probabilistic approaches [81], [97], [108] adopted a different evaluation protocol to assess prediction coverage. Basically, they sample multiple random predictions and then search for the best match with the ground truth sequence. Finally, they report the best match using common metrics. This represents the most common evaluation protocol for probabilistic video prediction. Other methods [97], [150], [151] also reported results using: LPIPS [227] as a perceptual
TABLE 3: Results on M-MNIST (Moving MNIST). Predicting the next y frames from x context frames (x → y ). † results reported by Oliu et al.
[153], ‡ results reported by Wang et al. [66], ∗ results reported by Wang et al. [232], ⊳ results reported by Wang et al. [233]. MSE represents per-pixel average MSE (10−3 ). MSE⋄ represents per-frame error. M-MNIST (10 → 10) method BeyondMSE [43] Srivastava et al. [74] Shi et al. [13] DFN [161] CDNA [89] VLN [234] Patraucean et al. [77] MCnet [65]† RLN [235]† PredNet [75]† fRNN [153] PredRNN [232] VPN [95] Znet [70] PredRNN++ [233] E3d-LSTM [66] MSE 27.48† 17.37† 43.9 42.54 42.54 41.61 9.47 - MSE⋄ SSIM M-MNIST (10 → 30) PSNR 122.6∗ 0.713∗ 15.969† 118.3∗ 0.690∗ 18.183† 96.5‡ 0.713‡ 89.0‡ 0.726‡ 84.2‡ 0.728‡ 13.857 13.857 13.968 68.4‡ 0.819‡ 21.386 56.8 0.867 64.1‡ 0.870‡ 50.5 0.877 46.5 0.898 41.3 0.910 - CE MSE⋄ SSIM 341.2 367.2∗ 285.2 346.6∗ 187.7 179.8 97.0 87.6 - 180.1⊳ 156.2⊳ 149.5⊳ 142.3⊳ 0.583⊳ 0.597⊳ 0.601⊳ 0.609⊳ 129.6⊳ 91.1 - 0.620⊳ 0.733 - metric comparing CNN features, or Frchet Video Distance (FVD) [228] to measure sample realism by comparing underlying distributions of predictions and ground truth. Moreover, Lee et al. [108] used the VGG Cosine Similarity metric that performs cosine similarity to the features extracted with the VGGnet [146] from the predictions. Some other alternative metrics include the inception score [229] introduced to deal with GANs mode collapse problem by measuring the diversity of generated samples; perceptual similarity metrics, such as DeePSiM [44]; measuring sharpness based on difference of gradients [43]; Parzen window [230], yet deficient for high-dimensional images; and the Laplacian of Gaussians (LoG) [60], [231] used in [101]. In the semantic segmentation space, authors used the popular Intersection over Union (IoU) metric. Inception score was also widely used to report results on different methods [54], [65], [67], [79]. Differently, on the basis of the EPVA model [52] a quantitative evaluation was performed, based on the confidence of an external method trained to identify whether the generated video contains a recognizable person. While some authors [10], [43], [56] evaluated the performance only on the dynamic parts of the image, other directly opted for visual human evaluation through Amazon Mechanical Turk (AMT) workers, without a direct quantitative evaluation. 6.2 TABLE 4: Results on KTH dataset. Predicting the next y frames from x context frames (x → y ). † results reported by Oliu et al. [153], ‡ results reported by Wang et al. [66], ∗ results reported by Zhang et al. [70], ⊳ results reported by Jin et al. [150]. Per-pixel average MSE (10−3 ). Best results are represented in bold. Results In this section we report the quantitative results of the most relevant methods reviewed in the previous sections. To achieve a wide comparison, we limited the quantitative results to the most common metrics and datasets. We have distributed the results in different tables, given the large variation in the evaluation protocols of the video prediction models. Many authors evaluated their methods on the Moving MNIST synthetic environment. Although it represents a KTH (10 → 10) KTH (10 → 20) KTH (10 → 40) method MSE PSNR SSIM PSNR SSIM PSNR Srivastava et al. [74]† PredNet [75]† BeyondMSE [43]† fRNN [153] MCnet [65] RLN [235]† Shi et al. [13]‡ SAVP [108]⊳ VPN [95]∗ DFN [161]‡ fRNN [153]‡ Znet [70] SV2P invariant [38]⊳ SV2P variant [38]⊳ PredRNN [232] VarNet [236]⊳ SAVP-VAE [108]⊳ PredRNN++ [233] MSNET [237] E3d-LSTM [66] Jin et al. 
[150] 9.95 3.09 1.80 1.75 1.65† 1.39 - 21.22 28.42 29.34 29.299 0.771⊳ 30.95† 0.804‡ 31.27 0.712 0.746 0.746 0.794 0.771 0.817 0.826 0.838 0.839 0.843 0.852 0.865 0.876 0.879 0.893 26.12⊳ 25.95‡ 23.58 25.38 23.76 27.26 26.12 27.58 27.56 27.79 27.55 28.48 27.77 28.47 27.08 29.31 29.85 0.678⊳ 0.73⊳ 0.639 0.701 0.652 0.678 0.778 0.789 0.703‡ 0.739 0.811 0.741‡ 0.810 0.851 23.77⊳ 23.89⊳ 22.85 23.97 23.01 23.77 25.92 26.12 24.16‡ 25.37 26.18 25.21‡ 27.24 27.56 restricted and quasi-deterministic scenario, long-term predictions are still challenging. The black and homogeneous background induce methods to accurately extrapolate black frames and vanish the predicted digits in the long-term horizon. Under this configuration, the E3d-LSTM network demonstrated how their memory attention mechanism improved the performance over previous methods. Reported errors remain stable in both short-term and longer-term predictions. Moreover, it also reported the second best results on the KTH dataset, after [150] which achieved the best overall performance and demonstrated quality predictions on natural videos. E3d-LSTM was also tested on the TaxiBJ dataset [190] comparing their method with [95], [153], [232], [233]. Performing short-term predictions in the KTH dataset, the Recurrent Ladder Network (RLN) outperformed MCnet and fRNN by a slight margin. The RLN architecture draws similarities with fRNN, except that the former uses bridge connections and the latter, state sharing that improves memory consumption. On the Moving MNIST and UCF101 datasets, fRNN outperformed RLN. Other interesting methods to highlight are PredRNN and PredRNN++, both providing close results to E3d-LSTM. State-of-the-art results using different metrics were reported on Caltech Pedestrian by Kwon et al. [101] and Jin et al. [150]. The former —as its retrospective prediction scheme represented a leap over the previous state-of-the-art— was also the overall winner on the UCF-101 dataset meanwhile the latter outperformed previous methods on the BAIR Push dataset. On the one hand, some approaches have been evaluated 19 TABLE 5: Results on Caltech Pedestrian. Predicting the next y frames from x context frames (x → y ). † reported by Kwon et al. [101], ‡ reported by Reda et al. [154], ∗ reported by Gao et al. [166], ⊳ reported by Jin et al. [150]. Per-pixel average MSE (10−3 ). Best results are represented in bold. SM-MNIST (5 → 10) Caltech Pedestrian (10 → 1) method method MSE SSIM PSNR LPIPS BeyondMSE [43]‡ MCnet [65]‡ DVF [7]∗ Dual-GAN [55] CtrlGen [142]∗ PredNet [75]† ContextVP [76] GAN-VGG [151] G-VGG [151] SDC-Net [154] Kwon et al. [101] DPG [166] G-MAE [151] GAN-MAE [151] Jin et al. [150] 3.42 2.50 2.41 2.42 1.94 1.62 1.61 − - 0.847 0.879 0.897 0.899 0.900 0.905 0.921 0.916 0.917 0.918 0.919 0.923 0.923 0.923 0.927 26.2 26.5 27.6 28.7 29.2 28.2 29.1 5.57⊳ 6.38⊳ 7.47⊳ 6.03⊳ 3.61 3.52 5.04⊳ 4.30 4.09 5.89 TABLE 6: Results on UCF-101 dataset. Predicting the next x frames from y context frames (x → y ). † results reported by Oliu et al. [153]. Per-pixel average MSE (10−3 ). Best results are represented in bold. UCF-101 (10 → 10) method Srivastava et al. [74]† PredNet [75]† BeyondMSE [43]† MCnet [65] RLN [235]† fRNN [153] BeyondMSE [43] Dual-GAN [55] DVF [7] ContextVP [76] Kwon et al. 
[101] UCF-101 (4 → 1) MSE PSNR MSE SSIM PSNR 148.66 15.50 9.26 9.40† 9.18 9.08 - 10.02 19.87 22.78 23.46† 23.56 23.87 - 1.37 0.91 0.92 0.94 0.94 0.92 0.94 31.0 32 30.5 33.4 34.9 35.0 on other datasets: SDC-Net [154] outperformed [43], [65] on YouTube8M, TrIVD-GAN-FP outperformed [162], [238] on Kinetics-600 test set [198]. On the other hand, some explored out-of-domain tasks [13], [66], [102], [161] (see ood column in Table 2). 6.2.1 TABLE 7: Results on SM-MNIST (Stochastic Moving MNIST), BAIR Push and Cityscapes datasets. † results reported by Castrejon et al. [97]. ‡ results reported by Jin et al. [150]. Results on Probabilistic Approaches Video prediction probabilistic methods have been mainly evaluated on the Stochastic Moving MNIST, Bair Push and Cityscapes datasets. Different from the original Moving MNIST dataset, the stochastic version includes uncertain digit trajectories, i.e. the digits bounce off the border with a random new direction. On this dataset, both versions of Castrejon et al. models (1L, without a hierarchy of latents, and 3L with a 3-level hierarchy of latents) outperform SVG SVG [81] SAVP [108] SAVP-VAE [108] SV2P inv. [38]‡ vRNN 1L [97] vRNN 3L [97] Jin et al. [150] BAIR Push (2 → 28) Cityscapes (2 → 28) FVD SSIM FVD SSIM PSNR FVD SSIM 90.81† 63.81 57.17 - 0.688† 0.763 0.760 - 256.62† 143.43† 149.22 143.40 - 0.816† 0.795† 0.815‡ 0.817 0.829 0.822 0.844 17.72‡ 18.42‡ 19.09‡ 20.36 21.02 1300.26† 682.08 567.51 - 0.574† 0.609 0.628 - TABLE 8: Results on Cityscapes dataset. Predicting the next y time-steps of semantic segmented frames from 4 context frames (4 → y ). ‡ IoU results on eight moving objects classes. † results reported by Chiu et al. [173] (4 → 1) method S2S [56]‡ S2S-maskRCNN [8]‡ S2S [56] Nabavi et al. [82] F2F [8] Vora et al. [83] S2S-Res101-FCN [174] Terwilliger et al. [10]‡ Chiu et al. [173] Jin et al. [174] Bayes-WD-SL [9] Terwilliger et al. [10] Cityscapes (4 → 3) (4 → 9) (4 → 10) IoU IoU IoU IoU 62.60‡ 71.37 72.43 75.3 73.2 55.3 55.4 59.4 60.06 61.2 61.47 62.6 65.1 65.53 66.1 66.7 67.1 40.8 42.4 47.8 41.2 45.4 46.3 50.52 52.5 51.5 50.8 - 53.9 52.5 by a large margin. On the Bair Push dataset, SAVP reported sharper and more realistic-looking predictions than SVG which suffer of blurriness. However, both models were outperformed by [97] as well on the Cityscapes dataset. The model based on a 3-level hierarchy of latents [97] outperform previous works on all three datasets, showing the advantages of the extra expressiveness of this model. 6.2.2 Results on the High-level Prediction Space Most of the methods have chosen the semantic segmentation space to make predictions. Although they relied on different datasets for training, performance results were mostly reported on the Cityscapes dataset using the IoU metric. Authors explored short-term (next-frame prediction), midterm (+3 time steps in the future) and long-term (up to +10 time step in the future) predictions. On the semantic segmentation prediction space, Bayes-WD-SL [9], the model proposed by Terwilliger et al. [10], and Jin et al. [51] reported the best results. Among these methods, it is noteworthy that Bayes-WD-SL was the only one to explore prediction diversity on the basis of a Bayesian formulation. In the instance segmentation space, the F2F pioneering method [8] was outperformed by Sun et al. [84] on short and mid-term predictions using the AP50 and AP evaluation metrics. On the other hand, in the keypoint coordinate 20 space, the seminal model of Minderer et al. 
[85] qualitatively outperforms SVG [81], SAVP [108] and EPVA [52], yet pixel-wise metrics reported similar results. In the human pose space, Tang et al. [184], by regressing future frames from human pose predictions, outperformed SAVP [108], MCnet [65] and [53] on the basis of the PSNR and SSIM metrics on the Penn Action and J-HMDB [114] datasets. 7 DISCUSSION The video prediction literature ranges from a direct synthesis of future pixel intensities to complex probabilistic models addressing prediction uncertainty. The range between these approaches consists of methods that try to factorize or narrow the prediction space. Simplifying the prediction task has been a natural evolution of video prediction models, influenced by several open research challenges discussed below. Due to the curse of dimensionality and the inherent pixel variability, developing a robust prediction based on raw pixel intensities is overly complicated. This often leads to the regression-to-the-mean problem, visually represented as blurriness. Making parametric models larger would improve the quality of predictions, yet this is currently incompatible with high-resolution predictions due to memory constraints. Transformation-based approaches propagate pixels from previous frames based on estimated flow maps. In this case, prediction quality is directly influenced by the accuracy of the estimated flow. Similarly, prediction in a high-level space is mostly conditioned by the quality of extra supervisory signals such as semantic maps and human poses, to name a few. Erroneous supervision signals would harm prediction quality. Analyzing the impact of the inductive bias on the performance of a video prediction model, Villegas et al. [222] demonstrated that the performance of the SVG model [81] can be maximized with minimal inductive bias (e.g. segmentation or instance maps, optical flow, adversarial losses, etc.) by progressively increasing the scale of computation. A common assumption when addressing the prediction task in a high-level feature space is that long-term predictions directly improve as a result of simplifying the prediction space. Even if the complexity of the prediction space is reduced, it is still multimodal when dealing with natural videos. For instance, when it comes to long-term predictions in the semantic segmentation space, most of the models reported predictions only up to ten time steps into the future. This directly suggests that the choice of the prediction space is still an unsolved problem. Finding a trade-off between the complexity of the prediction space and the output quality is challenging. An overly simplified representation could limit the prediction on complex data such as natural videos. Although abstract predictions suffice for many decision-making systems based on visual reasoning, prediction in the pixel space is still being addressed.
From the analysis performed in this review, and in line with the conclusions extracted from [222], we state that: (1) including recurrent connections and stochasticity in a video prediction model generally leads to improved performance; (2) increasing model capacity while maintaining a low inductive bias also improves prediction performance; (3) multi-step predictions conditioned by previously generated outputs are prone to accumulate errors, diverging from the ground truth when addressing long-term horizons; (4) authors predicted further into the future without relying on high-level feature spaces; (5) combining pixel-wise losses with adversarial training somewhat mitigates the regression-to-the-mean issue. 7.1 Research Challenges Despite the wealth of currently existing video prediction approaches and the significant progress made in this field, there is still room to improve state-of-the-art algorithms. To foster progress, open research challenges must be clearly identified and disentangled. So far in this review, we have already discussed: (1) the importance of spatio-temporal correlations as a self-supervisory signal for predictive models; (2) how to deal with future uncertainty and model the underlying multimodal distributions of natural videos; (3) the over-complicated task of learning meaningful representations and dealing with the curse of dimensionality; (4) pixel-wise loss functions and blurry results when dealing with equally probable outcomes, i.e. probabilistic environments. These issues define the open research challenges in video prediction. Currently existing methods are limited to short-term horizons. While frames in the immediate future are extrapolated with high accuracy, in the long-term horizon the prediction problem becomes multimodal by nature. Initial solutions consisted of conditioning the prediction on previously predicted frames. However, these autoregressive models tend to accumulate prediction errors that progressively make the generated prediction diverge from the expected outcome. On the other hand, due to memory issues, there is a lack of resolution in predictions. Authors tried to address this issue by composing the full-resolution image from small predicted patches. However, as the results are not convincing because of the annoying tiling effect, most of the available models are still limited to low-resolution predictions. In addition to the lack of resolution and long-term predictions, models are still prone to the regression-to-the-mean problem, which consists of averaging the output frame to accommodate multiple equally probable outcomes. This is directly related to the pixel-wise loss functions, which focus the learning process on the visual appearance. The choice of the loss function is an open research problem with a direct influence on the prediction quality. Finally, the lack of reliable and fair evaluation models makes the qualitative evaluation of video prediction challenging and represents another open problem. 7.2 Future Directions Based on the reviewed research identifying the state-of-the-art video prediction methods, we present some promising future research directions. Consider alternative loss functions: Pixel-wise loss functions are widely used in the video prediction task, causing blurry predictions when dealing with uncontrolled environments or long-term horizons. In this regard, great efforts have been made in the literature to identify more suitable loss functions for the prediction task. However, despite the existing wide spectrum of loss functions, most models still blindly rely on deterministic loss functions. Alternatives to RNNs: Currently, RNNs are still widely used in this field to model temporal dependencies, achieving state-of-the-art results on different benchmarks [66], [153], [232], [233]. Nevertheless, some methods also relied on 3D convolutions to further enhance video prediction [66], [173], representing a promising avenue. Use synthetically generated videos: Simplifying the prediction is a current trend in the video prediction literature. A vast amount of video prediction models explored higher-level feature spaces to reformulate the prediction task into a more tractable problem. However, this mostly conditions the prediction to the accuracy of an external source of supervision such as optical flow, human pose, or pre-activations (percepts) extracted from supervised networks. This issue could be alleviated by taking advantage of existing fully-annotated and photorealistic synthetic datasets or by using data generation tools. Video prediction in photorealistic synthetic scenarios has not been explored in the literature. Evaluation metrics: Since the most widely used evaluation protocols for video prediction rely on image similarity-based metrics, the need for fairer evaluation metrics is imminent. A fair metric should not penalize predictions that deviate from the ground truth at the pixel level if their content represents a plausible future prediction at a higher level, i.e., if the dynamics of the scene correspond to the reality of the labels. In this regard, some methods evaluate the similarity between distributions or at a higher level. However, there is still room for improvement in the evaluation protocols for video prediction and generation [239].
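As a concrete illustration of the best-match protocol used to evaluate probabilistic models (Section 6.1), the sketch below samples several candidate futures, scores each against the ground truth with a common metric (PSNR here), and reports the best one. The sampler is a stand-in for a trained stochastic model; the number of samples and the metric are illustrative choices.

# Best-of-K evaluation: report the best match among K sampled futures.
import torch

def psnr(pred, target, max_val=1.0):
    mse = ((pred - target) ** 2).mean()
    return 10 * torch.log10(max_val ** 2 / mse)

def best_of_k(sample_fn, target, k=100):
    """sample_fn() returns one candidate future sequence with the same shape as target."""
    return max(psnr(sample_fn(), target) for _ in range(k))

if __name__ == "__main__":
    target = torch.rand(10, 3, 64, 64)              # ground-truth future frames
    sample_fn = lambda: torch.rand_like(target)     # placeholder stochastic predictor
    print(best_of_k(sample_fn, target, k=20).item())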
7.2 Future Directions

Based on the reviewed research identifying the state-of-the-art video prediction methods, we present some promising future research directions.

Consider alternative loss functions: Pixel-wise loss functions are widely used in the video prediction task, causing blurry predictions when dealing with uncontrolled environments or long-term horizons. In this regard, great efforts have been made in the literature to identify more suitable loss functions for the prediction task. However, despite the wide spectrum of existing loss functions, most models still blindly rely on deterministic loss functions.

Alternatives to RNNs: Currently, RNNs are still widely used in this field to model temporal dependencies, and they achieved state-of-the-art results on different benchmarks [66], [153], [232], [233]. Nevertheless, some methods also relied on 3D convolutions to further enhance video prediction [66], [173], representing a promising avenue.

Use synthetically generated videos: Simplifying the prediction is a current trend in the video prediction literature. A vast number of video prediction models explored higher-level feature spaces to reformulate the prediction task into a more tractable problem. However, this mostly conditions the prediction on the accuracy of an external source of supervision such as optical flow, human pose, or pre-activations (percepts) extracted from supervised networks. This issue could be alleviated by taking advantage of existing fully-annotated and photorealistic synthetic datasets or by using data generation tools. Video prediction in photorealistic synthetic scenarios has not been explored in the literature.

Evaluation metrics: Since the most widely used evaluation protocols for video prediction rely on image-similarity-based metrics, the need for fairer evaluation metrics is evident. A fair metric should not penalize predictions that deviate from the ground truth at the pixel level if their content represents a plausible future prediction at a higher level, i.e., if the dynamics of the scene correspond to the reality of the labels. In this regard, some methods evaluate the similarity between distributions or at a higher level. However, there is still room for improvement in the evaluation protocols for video prediction and generation [239].
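As a point of reference for the image-similarity protocols criticized above, the sketch below shows, under the assumption of frames normalized to [0, 1], how the frame-level PSNR and a simplified global SSIM can be computed with NumPy; the standard SSIM [226] uses local sliding windows, so this is an illustrative approximation rather than the exact evaluation code used by the reviewed works.

import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two frames in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(pred, target, max_val=1.0):
    """Simplified SSIM computed over the whole frame (no sliding window)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    x, y = pred.astype(np.float64), target.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

Scores such as these are typically averaged over the prediction horizon, which is precisely why plausible futures that are not pixel-aligned with the ground truth are penalized.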
8 CONCLUSION

In this review, after reformulating the predictive learning paradigm in the context of video prediction, we have closely reviewed the fundamentals on which it is based: exploiting the time dimension of videos, dealing with stochasticity, and the importance of the loss functions in the learning process. Moreover, an analysis of the backbone deep-learning-based architectures for this task was performed in order to provide the reader with the necessary background knowledge. The core of this study encompasses the analysis and classification of more than 50 methods and the datasets they have used. Methods were analyzed from three perspectives: method description, contribution over previous works, and performance results. They have also been classified according to a proposed taxonomy based on their main contribution. In addition, we have presented a comparative summary of the datasets and methods in tabular form so that the reader can identify low-level details at a glance. Finally, we have discussed the performance results on the most popular datasets and metrics to provide useful insight in the shape of future research directions and open problems.

In conclusion, video prediction is a promising avenue for the self-supervised learning of rich spatiotemporal correlations to provide prediction capabilities to existing intelligent decision-making systems. While great strides have been made, there is still room for improvement in video prediction using deep learning techniques.

ACKNOWLEDGMENTS

This work has been funded by the Spanish Government PID2019-104818RB-I00 grant for the MoDeaAS project. This work has also been supported by two Spanish national grants for PhD studies, FPU17/00166 and ACIF/2018/197, respectively. Experiments were made possible by a generous hardware donation from NVIDIA.

REFERENCES

[1] M. H. Nguyen and F. D. la Torre, “Max-margin early event detectors,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, 2012, pp. 2863–2870.
[2] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, “Activity forecasting,” in Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part IV, 2012, pp. 201–214.
[3] C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual representations from unlabeled video,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 98–106.
[4] K. Zeng, W. B. Shen, D. Huang, M. Sun, and J. C. Niebles, “Visual forecasting by imitating dynamics in natural sequences,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 3018–3027.
[5] S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, and A. Shashua, “Long-term planning by short-term prediction,” CoRR, vol. abs/1602.01580, 2016.
[6] O. Makansi, E. Ilg, O. Cicek, and T. Brox, “Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[7] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, “Video frame synthesis using deep voxel flow,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 4473–4481.
[8] P. Luc, C. Couprie, Y. LeCun, and J. Verbeek, “Predicting future instance segmentation by forecasting convolutional features,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IX, 2018, pp. 593–608.
[9] A. Bhattacharyya, M. Fritz, and B. Schiele, “Bayesian prediction of future street scenes using synthetic likelihoods,” in ICLR (Poster). OpenReview.net, 2019.
[10] A. Terwilliger, G. Brazil, and X. Liu, “Recurrent flow-guided semantic forecasting,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Jan 2019, pp. 1703–1712.
[11] A. Bhattacharyya, M. Fritz, and B. Schiele, “Long-term on-board prediction of people in traffic scenes under uncertainty,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 4194–4202.
[12] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection - A new baseline,” in CVPR. IEEE Computer Society, 2018, pp. 6536–6545.
[13] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 802–810.
[14] X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. WOO, “Deep learning for precipitation nowcasting: A benchmark and a new model,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5617–5627.
[15] A. Hu, F. Cotter, N. Mohan, C. Gurau, and A.
Kendall, “Probabilistic future prediction for video scene understanding,” CoRR, vol. abs/2003.06409, 2020. 22 [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] A. Garcia-Garcia, P. Martinez-Gonzalez, S. Oprea, J. A. CastroVargas, S. Orts-Escolano, J. Garcia-Rodriguez, and A. JoverAlvarez, “The robotrix: An extremely photorealistic and verylarge-scale indoor dataset of sequences with robot trajectories and interactions,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 6790–6797. Y. Kong and Y. Fu, “Human action recognition and prediction: A survey,” CoRR, vol. abs/1806.11230, 2018. C. Sahin, G. Garcia-Hernando, J. Sock, and T. Kim, “A review on object pose recovery: from 3d bounding box detectors to full 6d pose estimators,” CoRR, vol. abs/2001.10609, 2020. V. Villena-Martinez, S. Oprea, M. Saval-Calvo, J. A. López, A. F. Guilló, and R. B. Fisher, “When deep learning meets data alignment: A review on deep registration networks (drns),” CoRR, vol. abs/2003.03167, 2020. [Online]. Available: https://arxiv.org/abs/2003.03167 Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. J. Hawkins and S. Blakeslee, On Intelligence. USA: Times Books, 2004. R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, pp. 79– 87, 1999. D. Mumford, “On the computational architecture of the neocortex,” Biological Cybernetics, vol. 66, no. 3, pp. 241–251, 1992. A. Cleeremans and J. L. McClelland, “Learning the structure of event sequences.” Journal of Experimental Psychology: General, vol. 120, no. 3, p. 235, 1991. A. Cleeremans and J. Elman, Mechanisms of implicit learning: Connectionist models of sequence processing. MIT press, 1993. R. Baker, M. Dexter, T. E. Hardwicke, A. Goldstone, and Z. Kourtzi, “Learning to predict: Exposure to temporal sequences facilitates prediction of future events,” Vision research, vol. 99, pp. 124–133, 2014. H. E. M. den Ouden, P. Kok, and F. P. de Lange, “How prediction errors shape perception, attention, and motivation,” in Front. Psychology, 2012. W. R. Softky, “Unsupervised pixel-prediction,” in Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, USA, November 27-30, 1995, 1995, pp. 809–815. G. Deco and B. Schürmann, “Predictive coding in the visual cortex by a recurrent network with gabor receptive fields,” Neural Processing Letters, vol. 14, no. 2, pp. 107–114, 2001. A. Hollingworth, “Constructing visual representations of natural scenes: the roles of short- and long-term visual memory.” Journal of experimental psychology. Human perception and performance, vol. 30 3, pp. 519–37, 2004. Y. Bengio, A. C. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013. X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 2794–2802. P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 37–45. D.-A. Huang, V. Ramanathan, D. Mahajan, L. Torresani, M. Paluri, L. Fei-Fei, and J. 
Carlos Niebles, “What makes a video a video: Analyzing temporal information in video understanding models and datasets,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. L. C. Pickup, Z. Pan, D. Wei, Y. Shih, C. Zhang, A. Zisserman, B. Schölkopf, and W. T. Freeman, “Seeing the arrow of time,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, 2014, pp. 2043– 2050. D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning and using the arrow of time,” in CVPR. IEEE Computer Society, 2018, pp. 8052–8060. I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: Unsupervised learning using temporal order verification,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, 2016, pp. 527– 544. [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, “Stochastic variational video prediction,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Trans. Computational Imaging, vol. 3, no. 1, pp. 47–57, 2017. K. Janocha and W. M. Czarnecki, “On loss functions for deep neural networks in classification,” CoRR, vol. abs/1702.05659, 2017. A. Kendall and R. Cipolla, “Geometric loss functions for camera pose regression with deep learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. J.-J. Hwang, T.-W. Ke, J. Shi, and S. X. Yu, “Adversarial structure matching for structured prediction tasks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in NIPS, 2016, pp. 658–666. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for realtime style transfer and super-resolution,” in ECCV (2), ser. Lecture Notes in Computer Science, vol. 9906. Springer, 2016, pp. 694–711. C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR. IEEE Computer Society, 2017, pp. 105–114. M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Single image super-resolution through automated texture synthesis,” in ICCV. IEEE Computer Society, 2017, pp. 4501–4510. J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative visual manipulation on the natural image manifold,” in ECCV (5), ser. Lecture Notes in Computer Science, vol. 9909. Springer, 2016, pp. 597–613. W. Lotter, G. Kreiman, and D. D. Cox, “Unsupervised learning of visual structure using predictive generative networks,” CoRR, vol. abs/1511.06380, 2015. X. Chen, W. Wang, J. Wang, and W. Li, “Learning object-centric transformation for video prediction,” in Proceedings of the 25th ACM International Conference on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 
1503–1512. X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong, L. Liu, Z. Jie, J. Feng, and S. Yan, “Video scene parsing with predictive feature learning,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 5581–5589. N. Wichers, R. Villegas, D. Erhan, and H. Lee, “Hierarchical long-term video prediction without supervision,” in ICML, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 2018, pp. 6033–6041. R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, “Learning to generate long-term future via hierarchical prediction,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 3560–3569. J. Walker, K. Marino, A. Gupta, and M. Hebert, “The pose knows: Video forecasting by generating pose futures,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 3352–3361. X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion GAN for future-flow embedded video prediction,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 2229, 2017. IEEE Computer Society, 2017, pp. 1762–1770. P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun, “Predicting deeper into the future of semantic segmentation,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 648–657. Z. Hu and J. Wang, “A novel adversarial inference framework for video prediction with action control,” in The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019. 23 [58] [59] [60] [61] [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998. V. Jain, J. F. Murray, F. Roth, S. C. Turaga, V. P. Zhigulin, K. L. Briggman, M. Helmstaedter, W. Denk, and H. S. Seung, “Supervised learning of image restoration with convolutional networks,” in ICCV. IEEE Computer Society, 2007, pp. 1–8. E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a laplacian pyramid of adversarial networks,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 1486–1494. F. Yu, V. Koltun, and T. A. Funkhouser, “Dilated residual networks,” in CVPR. IEEE Computer Society, 2017, pp. 636–644. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018. W. Luo, Y. Li, R. Urtasun, and R. S. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 4898–4906. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 2730, 2016. IEEE Computer Society, 2016, pp. 770–778. R. Villegas, J. Yang, S. Hong, X. Lin, and H. 
Lee, “Decomposing motion and content for natural video sequence prediction,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. Y. Wang, L. Jiang, M.-H. Yang, L.-J. Li, M. Long, and L. Fei-Fei, “Eidetic 3d LSTM: A model for video prediction and beyond,” in International Conference on Learning Representations, 2019. C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 613– 621. S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, “Mocogan: Decomposing motion and content for video generation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. S. Aigner and M. Körner, “Futuregan: Anticipating the future frames of video sequences using spatio-temporal 3d convolutions in progressively growing autoencoder gans,” CoRR, vol. abs/1810.01325, 2018. J. Zhang, Y. Wang, M. Long, W. Jianmin, and P. S. Yu, “Z-order recurrent neural networks for video prediction,” in 2019 IEEE International Conference on Multimedia and Expo (ICME), July 2019, pp. 230–235. J. R. van Amersfoort, A. Kannan, M. Ranzato, A. Szlam, D. Tran, and S. Chintala, “Transformation-based models of video sequences,” CoRR, vol. abs/1701.08435, 2017. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986. M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra, “Video (language) modeling: a baseline for generative models of natural videos,” CoRR, vol. abs/1412.6604, 2014. N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using lstms,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 843–852. W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. W. Byeon, Q. Wang, R. K. Srivastava, and P. Koumoutsakos, “Contextvp: Fully context-aware video prediction,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 2018, pp. 1122–1126. V. Patraucean, A. Handa, and R. Cipolla, “Spatio-temporal video autoencoder with differentiable memory,” CoRR, vol. abs/1511.06309, 2015. [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] [95] [96] [97] [98] C. Lu, M. Hirsch, and B. Schölkopf, “Flexible spatio-temporal networks for video prediction,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 2137–2145. E. L. Denton and V. Birodkar, “Unsupervised learning of disentangled representations from video,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 4417–4426. J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. P. 
Singh, “Actionconditional video prediction using deep networks in atari games,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 2863– 2871. E. Denton and R. Fergus, “Stochastic video generation with a learned prior,” in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, ser. Proceedings of Machine Learning Research, J. G. Dy and A. Krause, Eds., vol. 80. PMLR, 2018, pp. 1182–1191. S. shahabeddin Nabavi, M. Rochan, and Y. Wang, “Future semantic segmentation with convolutional LSTM,” in British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, 2018, p. 137. S. Vora, R. Mahjourian, S. Pirk, and A. Angelova, “Future segmentation using 3d structure,” CoRR, vol. abs/1811.11358, 2018. J. Sun, J. Xie, J. Hu, Z. Lin, J. Lai, W. Zeng, and W. Zheng, “Predicting future instance segmentation with contextual pyramid convlstms,” in ACM Multimedia. ACM, 2019, pp. 2043–2051. M. Minderer, C. Sun, R. Villegas, F. Cole, K. P. Murphy, and H. Lee, “Unsupervised learning of object structure and dynamics from videos,” in NeurIPS, 2019, pp. 92–102. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” CoRR, vol. abs/1406.1078, 2014. A. Graves, S. Fernández, and J. Schmidhuber, “Multidimensional recurrent neural networks,” in ICANN (1), ser. Lecture Notes in Computer Science, vol. 4668. Springer, 2007, pp. 549–558. C. Finn, I. J. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 64–72. E. Zhan, S. Zheng, Y. Yue, L. Sha, and P. Lucey, “Generating multi-agent trajectories using programmatic weak supervision,” in ICLR (Poster). OpenReview.net, 2019. A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, 2016, pp. 1747–1756. R. M. Neal, “Connectionist learning of belief networks,” Artif. Intell., vol. 56, no. 1, pp. 71–113, 1992. Y. Bengio and S. Bengio, “Modeling high-dimensional discrete data with multi-layer neural networks,” in Proceedings of the 12th International Conference on Neural Information Processing Systems, ser. NIPS99. Cambridge, MA, USA: MIT Press, 1999, p. 400406. A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves, “Conditional image generation with pixelcnn decoders,” in NIPS, 2016, pp. 4790–4798. N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” CoRR, vol. abs/1610.00527, 2016. K. Fragkiadaki, J. Huang, A. Alemi, S. Vijayanarasimhan, S. Ricco, and R. Sukthankar, “Motion prediction under multimodality with conditional stochastic networks,” CoRR, vol. abs/1705.02082, 2017. L. Castrejon, N. Ballas, and A. 
Courville, “Improved conditional vrnns for video prediction,” in The IEEE International Conference on Computer Vision (ICCV), October 2019. J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in NIPS, 2015, pp. 2980–2988. 24 [99] [100] [101] [102] [103] [104] [105] [106] [107] [108] [109] [110] [111] [112] [113] [114] [115] [116] [117] [118] [119] [120] M. Henaff, J. J. Zhao, and Y. LeCun, “Prediction under uncertainty with error-encoding networks,” CoRR, vol. abs/1711.04994, 2017. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial networks,” CoRR, vol. abs/1406.2661, 2014. Y.-H. Kwon and M.-G. Park, “Predicting future frames using retrospective cycle gan,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. C. Vondrick and A. Torralba, “Generating the future with adversarial transformers,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 2992–3000. Y. Zhou and T. L. Berg, “Learning temporal transformations from time-lapse videos,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, 2016, pp. 262–277. P. Bhattacharjee and S. Das, “Temporal coherency based criteria for predicting video frames using deep multi-stage generative adversarial networks,” in NIPS, 2017, pp. 4268–4277. M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial nets with singular value clipping,” in ICCV. IEEE Computer Society, 2017, pp. 2849–2858. B. Chen, W. Wang, and J. Wang, “Video imagination from a single image with transformation generation,” in ACM Multimedia (Thematic Workshops). ACM, 2017, pp. 358–366. M. Mirza and S. Osindero, “Conditional generative adversarial nets,” CoRR, vol. abs/1411.1784, 2014. A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine, “Stochastic adversarial video prediction,” CoRR, vol. abs/1804.01523, 2018. A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” in ICLR. OpenReview.net, 2017. C. Schüldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in ICPR (3). IEEE Computer Society, 2004, pp. 32–36. L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247–2253, December 2007. H. Kuehne, H. Jhuang, E. Garrote, T. A. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, 2011, pp. 2556–2563. H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, “Towards understanding action recognition,” in ICCV. IEEE Computer Society, 2013, pp. 3192–3199. K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” CoRR, vol. abs/1212.0402, 2012. W. Zhang, M. Zhu, and K. G. 
Derpanis, “From actemes to action: A strongly-supervised representation for detailed action understanding,” in ICCV. IEEE Computer Society, 2013, pp. 2248–2255. C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1325–1339, 2014. H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah, “The THUMOS challenge on action recognition for videos ”in the wild”,” Computer Vision and Image Understanding, vol. 155, pp. 1–23, 2017. P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. IEEE Computer Society, 2009, pp. 304–311. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013. [121] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [122] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, “The apolloscape dataset for autonomous driving,” arXiv: 1803.06184, 2018. [123] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li, “Large-scale video classification with convolutional neural networks,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, 2014, pp. 1725–1732. [124] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8m: A largescale video classification benchmark,” CoRR, vol. abs/1609.08675, 2016. [125] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li, “YFCC100M: the new data in multimedia research,” Commun. ACM, vol. 59, no. 2, pp. 64–73, 2016. [126] I. Sutskever, G. E. Hinton, and G. W. Taylor, “The recurrent temporal restricted boltzmann machine,” in NIPS. Curran Associates, Inc., 2008, pp. 1601–1608. [127] C. F. Cadieu and B. A. Olshausen, “Learning intermediate-level representations of form and motion from natural movies,” Neural Computation, vol. 24, no. 4, pp. 827–866, 2012. [128] R. Memisevic and G. Exarchakis, “Learning invariant features by harnessing the aperture problem,” in Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, ser. JMLR Workshop and Conference Proceedings, vol. 28. JMLR.org, 2013, pp. 100–108. [129] F. Ebert, C. Finn, A. X. Lee, and S. Levine, “Self-supervised visual planning with temporal skip connections,” in CoRL, ser. Proceedings of Machine Learning Research, vol. 78. PMLR, 2017, pp. 344–356. [130] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, “Robonet: Large-scale multirobot learning,” CoRR, vol. abs/1910.11215, 2019. [131] R. Vezzani and R. Cucchiara, “Video surveillance online repository (visor): an integrated framework,” Multimedia Tools Appl., vol. 50, no. 2, pp. 359–380, 2010. [132] J. Santner, C. Leistner, A. Saffari, T. Pock, and H. 
Bischof, “PROST: parallel robust online simple tracking,” in The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, 2010, pp. 723–730. [133] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” J. Artif. Intell. Res., vol. 47, pp. 253–279, 2013. [134] G. Seguin, P. Bojanowski, R. Lajugie, and I. Laptev, “Instancelevel video segmentation from object tracks,” in CVPR. IEEE Computer Society, 2016, pp. 3678–3687. [135] Z. Bauer, F. Gomez-Donoso, E. Cruz, S. Orts-Escolano, and M. Cazorla, “Uasol, a large-scale high-resolution outdoor stereo dataset,” Scientific Data, vol. 6, no. 1, p. 162, 2019. [136] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in ECCV (1), ser. Lecture Notes in Computer Science, vol. 5302. Springer, 2008, pp. 44–57. [137] E. Santana and G. Hotz, “Learning a driving simulator,” CoRR, vol. abs/1608.01230, 2016. [138] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), with CD-ROM, 27 June - 2 July 2004, Washington, DC, USA. IEEE Computer Society, 2004, pp. 97–104. [139] P. Martinez-Gonzalez, S. Oprea, A. Garcia-Garcia, A. JoverAlvarez, S. Orts-Escolano, and J. Garcia-Rodriguez, “UnrealROX: An extremely photorealistic virtual reality environment for robotics simulations and synthetic data generation,” Virtual Reality, 2019. [140] D. Jayaraman and K. Grauman, “Look-ahead before you leap: End-to-end active recognition by forecasting the effect of motion,” in ECCV (5), ser. Lecture Notes in Computer Science, vol. 9909. Springer, 2016, pp. 489–505. 25 [141] J. Walker, C. Doersch, A. Gupta, and M. Hebert, “An uncertain future: Forecasting from static images using variational autoencoders,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, 2016, pp. 835–851. [142] Z. Hao, X. Huang, and S. J. Belongie, “Controllable video generation with sparse trajectories,” in CVPR. IEEE Computer Society, 2018, pp. 7854–7863. [143] Y. Ye, M. Singh, A. Gupta, and S. Tulsiani, “Compositional video prediction,” in The IEEE International Conference on Computer Vision (ICCV), October 2019. [144] S. Mozaffari, O. Y. Al-Jarrah, M. Dianati, P. A. Jennings, and A. Mouzakitis, “Deep learning-based vehicle behaviour prediction for autonomous driving applications: A review,” CoRR, vol. abs/1912.11676, 2019. [145] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, 2010, pp. 1045–1048. [146] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [147] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in Computer Vision - ECCV 2004, 8th European Conference on Computer Vision, Prague, Czech Republic, May 11-14, 2004. Proceedings, Part IV, ser. 
Lecture Notes in Computer Science, T. Pajdla and J. Matas, Eds., vol. 3024. Springer, 2004, pp. 25–36. [148] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” in ICLR. OpenReview.net, 2018. [149] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in NIPS, 2017, pp. 5767–5777. [150] B. Jin, Y. Hu, Q. Tang, J. Niu, Z. Shi, Y. Han, and X. Li, “Exploring spatial-temporal multi-frequency analysis for highfidelity and temporal-consistency video prediction,” CoRR, vol. abs/2002.09905, 2020. [151] O. Shouno, “Photo-realistic video prediction on natural videos of largely changing frames,” CoRR, vol. abs/2003.08635, 2020. [152] R. Hou, H. Chang, B. Ma, and X. Chen, “Video prediction with bidirectional constraint network,” in 2019 14th IEEE International Conference on Automatic Face Gesture Recognition (FG 2019), May 2019, pp. 1–8. [153] M. Oliu, J. Selva, and S. Escalera, “Folded recurrent neural networks for future video prediction,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, 2018, pp. 745–761. [154] F. A. Reda, G. Liu, K. J. Shih, R. Kirby, J. Barker, D. Tarjan, A. Tao, and B. Catanzaro, “Sdc-net: Video prediction using spatiallydisplaced convolution,” in The European Conference on Computer Vision (ECCV), September 2018. [155] R. Memisevic and G. E. Hinton, “Learning to represent spatial transformations with factored higher-order boltzmann machines,” Neural Computation, vol. 22, no. 6, pp. 1473–1492, 2010. [156] R. Memisevic, “Gradient-based learning of higher-order image features,” in Proceedings of the 2011 International Conference on Computer Vision, ser. ICCV ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 1591–1598. [157] ——, “Learning to relate images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1829–1846, 2013. [158] V. Michalski, R. Memisevic, and K. Konda, “Modeling deep temporal dependencies with recurrent grammar cells,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 1925–1933. [159] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 2017–2025. [160] B. Klein, L. Wolf, and Y. Afek, “A dynamic convolutional layer for short rangeweather prediction,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 4840–4848. [161] B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool, “Dynamic filter networks,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 667–675. [162] A. Clark, J. Donahue, and K. Simonyan, “Adversarial video generation on complex datasets,” 2019. [163] P. Luc, A. Clark, S. Dieleman, D. de Las Casas, Y. Doron, A. Cassirer, and K. Simonyan, “Transformation-based adversarial video prediction on large-scale data,” CoRR, vol. abs/2003.04035, 2020. [164] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015. 
[165] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 568–576. [166] H. Gao, H. Xu, Q. Cai, R. Wang, F. Yu, and T. Darrell, “Disentangling propagation and generation for video prediction,” in ICCV. IEEE, 2019, pp. 9005–9014. [167] Y. Wu, R. Gao, J. Park, and Q. Chen, “Future video synthesis with object motion prediction,” 2020. [168] J. Hsieh, B. Liu, D. Huang, F. Li, and J. C. Niebles, “Learning to decompose and disentangle representations for video prediction,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, pp. 515–524. [169] S. Chiappa, S. Racanière, D. Wierstra, and S. Mohamed, “Recurrent environment simulators,” in ICLR (Poster). OpenReview.net, 2017. [170] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik, “Learning visual predictive models of physics for playing billiards,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. [171] A. Dosovitskiy and V. Koltun, “Learning to act by predicting the future,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. [172] P. Luc, “Self-supervised learning of predictive segmentation models from video,” Theses, Université Grenoble Alpes, Jun. 2019. [Online]. Available: https://tel.archives-ouvertes.fr/tel-0 2196890 [173] H.-k. Chiu, E. Adeli, and J. C. Niebles, “Segmenting the future,” arXiv preprint arXiv:1904.10666, 2019. [174] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and S. Yan, “Predicting scene parsing and motion dynamics in the future,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 6918–6927. [175] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in NIPS, 2014, pp. 2654–2662. [176] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015. [177] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “Epicflow: Edge-preserving interpolation of correspondences for optical flow,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 1164–1172. [178] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR. IEEE Computer Society, 2017, pp. 6230– 6239. [179] S. N. Wood, “Statistical inference for noisy nonlinear ecological dynamic systems,” Nature, vol. 466, no. 7310, pp. 1102–1104, 2010. [180] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed, “Variational approaches for auto-encoding generative adversarial networks,” CoRR, vol. abs/1706.04987, 2017. [181] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” in ICCV. IEEE Computer Society, 2017, pp. 2980–2988. [182] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee, “Deep visual analogymaking,” in NIPS, 2015, pp. 1252–1260. [183] N. Fushishita, A. 
Tejero-de-Pablos, Y. Mukuta, and T. Harada, “Long-term video generation of multiple futures using human poses,” CoRR, vol. abs/1904.07538, 2019. 26 [184] J. Tang, H. Hu, Q. Zhou, H. Shan, C. Tian, and T. Q. S. Quek, “Pose guided global and local gan for appearance preserving human video prediction,” in 2019 IEEE International Conference on Image Processing (ICIP), Sep. 2019, pp. 614–618. [185] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003. [186] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 3104–3112. [187] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 5188–5196. [188] R. Chalasani and J. C. Prı́ncipe, “Deep predictive coding networks,” in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. [189] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber, “Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation,” in Proceedings of the 28th International Conference on Neural Information Processing Systems Volume 2, ser. NIPS15. Cambridge, MA, USA: MIT Press, 2015, p. 29983006. [190] J. Zhang, Y. Zheng, and D. Qi, “Deep spatio-temporal residual networks for citywide crowd flows prediction,” in AAAI. AAAI Press, 2017, pp. 1655–1661. [191] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fründ, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic, “The ”something something” video database for learning and evaluating visual common sense,” in ICCV. IEEE Computer Society, 2017, pp. 5843–5851. [192] Z. Yi, H. R. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual learning for image-to-image translation,” in ICCV. IEEE Computer Society, 2017, pp. 2868–2876. [193] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-toimage translation using cycle-consistent adversarial networks,” in ICCV. IEEE Computer Society, 2017, pp. 2242–2251. [194] W. Luo, W. Liu, and S. Gao, “A revisit of sparse coding based anomaly detection in stacked RNN framework,” in ICCV. IEEE Computer Society, 2017, pp. 341–349. [195] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. S. Regazzoni, and N. Sebe, “Abnormal event detection in videos using generative adversarial nets,” in ICIP. IEEE, 2017, pp. 1577– 1581. [196] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via adaptive separable convolution,” in ICCV. IEEE Computer Society, 2017, pp. 261–270. [197] ——, “Video frame interpolation via adaptive convolution,” in CVPR. IEEE Computer Society, 2017, pp. 2270–2279. [198] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman, “A short note about kinetics-600,” CoRR, vol. abs/1808.01340, 2018. [199] T. Xue, J. Wu, K. L. Bouman, and B. Freeman, “Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 
91–99. [200] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. A. Funkhouser, “Semantic scene completion from a single depth image,” in CVPR. IEEE Computer Society, 2017, pp. 190–198. [201] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in CVPR. IEEE Computer Society, 2015, pp. 3061– 3070. [202] J. Janai, F. Güney, A. Ranjan, M. J. Black, and A. Geiger, “Unsupervised learning of multi-frame optical flow with occlusions,” in ECCV (16), ser. Lecture Notes in Computer Science, vol. 11220. Springer, 2018, pp. 713–731. [203] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” in NIPS, 2016, pp. 217–225. [204] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in ECCV (8), ser. Lecture Notes in Computer Science, vol. 9912. Springer, 2016, pp. 483–499. [205] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent network models for human dynamics,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 4346–4354. [206] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi, “Conditional image generation for learning the structure of visual objects,” CoRR, vol. abs/1806.07823, 2018. [207] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee, “Unsupervised discovery of object landmarks as structural representations,” in CVPR. IEEE Computer Society, 2018, pp. 2694–2703. [208] R. Goroshin, M. Mathieu, and Y. LeCun, “Learning to linearize under uncertainty,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 1234–1242. [209] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming autoencoders,” in ICANN (1), ser. Lecture Notes in Computer Science, vol. 6791. Springer, 2011, pp. 44–51. [210] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun, “Unsupervised learning of spatiotemporally coherent metrics,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 4086–4093. [211] T. Brox and J. Malik, “Object segmentation by long term analysis of point trajectories,” in ECCV (5), ser. Lecture Notes in Computer Science, vol. 6315. Springer, 2010, pp. 282–295. [212] J. Schmidhuber, “Learning complex, extended sequences using the principle of history compression,” Neural Computation, vol. 4, no. 2, pp. 234–242, 1992. [213] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics,” CoRR, vol. abs/1606.07419, 2016. [214] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 48. JMLR.org, 2016, pp. 1928–1937. [215] J. Zhang and K. Cho, “Query-efficient imitation learning for endto-end simulated driving,” in AAAI. AAAI Press, 2017, pp. 2891– 2897. [216] S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. Maier-Hein, S. A. Eslami, D. J. Rezende, and O. Ronneberger, “A probabilistic u-net for segmentation of ambiguous images,” in Advances in Neural Information Processing Systems, 2018, pp. 6965– 6975. [217] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, “BDD100K: A diverse driving video database with scalable annotation tooling,” CoRR, vol. abs/1805.04687, 2018. [218] G. 
Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, “The mapillary vistas dataset for semantic understanding of street scenes,” in ICCV. IEEE Computer Society, 2017, pp. 5000–5009. [219] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2014. [220] X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Conditional image generation from visual attributes,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, ser. Lecture Notes in Computer Science, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9908. Springer, 2016, pp. 776–791. [221] H. Wu, M. Rubinstein, E. Shih, J. V. Guttag, F. Durand, and W. T. Freeman, “Eulerian video magnification for revealing subtle changes in the world,” ACM Trans. Graph., vol. 31, no. 4, pp. 65:1– 65:8, 2012. [222] R. Villegas, A. Pathak, H. Kannan, D. Erhan, Q. V. Le, and H. Lee, “High fidelity video prediction with large stochastic recurrent neural networks,” CoRR, vol. abs/1911.01655, 2019. [223] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther, “Ladder variational autoencoders,” in NIPS, 2016, pp. 3738–3746. [224] R. Pottorff, J. Nielsen, and D. Wingate, “Video extrapolation with an invertible linear embedding,” CoRR, vol. abs/1903.00133, 2019. [225] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in NeurIPS, 2018, pp. 10 236–10 245. [226] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Processing, vol. 13, no. 4, pp. 600–612, 2004. 27 [227] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR. IEEE Computer Society, 2018, pp. 586–595. [228] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards accurate generative models of video: A new metric & challenges,” CoRR, vol. abs/1812.01717, 2018. [229] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in NIPS, 2016, pp. 2226–2234. [230] O. Breuleux, Y. Bengio, and P. Vincent, “Quickly generating representative samples from an rbm-derived process,” Neural Computation, vol. 23, no. 8, pp. 2058–2073, 2011. [231] E. Hildreth, “Theory of edge detection,” Proceedings of Royal Society of London, vol. 207, no. 187-217, p. 9, 1980. [232] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu, “Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms,” in NIPS, 2017, pp. 879–888. [233] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu, “Predrnn++: Towards A resolution of the deep-in-time dilemma in spatiotemporal predictive learning,” in ICML, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 2018, pp. 5110–5119. [234] F. Cricri, X. Ni, M. Honkala, E. Aksu, and M. Gabbouj, “Video ladder networks,” CoRR, vol. abs/1612.01756, 2016. [235] I. Prémont-Schwarz, A. Ilin, T. Hao, A. Rasmus, R. Boney, and H. Valpola, “Recurrent ladder networks,” in NIPS, 2017, pp. 6009–6019. [236] B. Jin, Y. Hu, Y. Zeng, Q. Tang, S. Liu, and J. 
Ye, “Varnet: Exploring variations for unsupervised video prediction,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 5801–5806. [237] J. Lee, J. Lee, S. Lee, and S. Yoon, “Mutual suppression network for video prediction using disentangled features,” arXiv preprint arXiv:1804.04810, 2018. [238] D. Weissenborn, O. Tckstrm, and J. Uszkoreit, “Scaling autoregressive video models,” in International Conference on Learning Representations, 2020. [Online]. Available: https: //openreview.net/forum?id=rJgsskrFwH [239] L. Theis, A. van den Oord, and M. Bethge, “A note on the evaluation of generative models,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. Sergiu Oprea is a PhD student at the Department of Computer Technology (DTIC), University of Alicante. He received his MSc (Automation and Robotics) and BSc (Computer Science) from the same institution in 2017 and 2015 respectively. His main research interests include video prediction with deep learning, virtual reality, 3D computer vision, and parallel computing on GPUs. Pablo Martinez Gonzalez is a PhD student at the Department of Computer Technology (DTIC), University of Alicante. He received his MSc (Computer Graphics, Games and Virtual Reality) and BSc (Computer Science) at the Rey Juan Carlos University and University of Alicante, in 2017 and 2015, respectively. His main research interests include deep learning, virtual reality and parallel computing on GPUs. Alberto Garcia Garcia is a Postdoctoral Researcher at the Institute of Space Sciences (ICECSIC, Barcelona) where he leads the efforts in code optimization, machine learning, and parallel computing on the MAGNESIA ERC Consolidator project. He received his PhD (Machine Learning and Computer Vision), MSc (Automation and Robotics) and BSc (Computer Science) from the same institution in 2019, 2016 and 2015 respectively. Previously he was an intern at NVIDIA Research/Engineering, Facebook Reality Labs, and Oculus Core Tech. His main research interests include deep learning (specially convolutional neural networks), virtual reality, 3D computer vision, and parallel computing on GPUs. John Alejandro Castro Vargas is a PhD student at the Department of Computer Technology (DTIC), University of Alicante. He received his MSc (Automation and Robotics) and BSc (Computer Science) from the same institution in 2017 and 2016 respectively. His main research interests include human behavior recognition with deep learning, virtual reality and parallel computing on GPUs. Sergio Orts-Escolano received a BSc, MSc and PhD in Computer Science from the University of Alicante in 2008, 2010 and 2014 respectively. His research interests include computer vision, assistive robotics, 3D sensors, GPU computing, virtual/augmented reality and deep learning. He has authored +50 publications in top journals and conferences like CVPR, SIGGRAPH, 3DV, BMVC, CVIU, IROS, UIST, RAS, etcetera. He is also a member of European Networks like HiPEAC and Eucog. He has experience as a professor in academia and industry, working as a research scientist for companies such as Google and Microsoft Research. Jose Garcia-Rodriguez received his Ph.D. degree, with specialization in Computer Vision and Neural Networks, from the University of Alicante (Spain). He is currently Full Professor at the Department of Computer Technology of the University of Alicante. 
His research areas of interest include: computer vision, computational intelligence, machine learning, pattern recognition, robotics, man-machine interfaces, ambient intelligence, computational chemistry, and parallel and multicore architectures. Antonis Argyros is a professor of computer science at the Computer Science Department, University of Crete and a researcher at the Institute of Computer Science, FORTH, in Heraklion, Crete, Greece. His research interests fall in the areas of computer vision and pattern recognition, with emphasis on the analysis of humans in images and videos, human pose analysis, recognition of human activities and gestures, 3D computer vision, as well as image motion and tracking. He is also interested in applications of computer vision in the fields of robotics and smart environments.