
Clockwork Variational Autoencoders



Vaibhav Saxena, University of Toronto
Jimmy Ba, University of Toronto
Danijar Hafner, University of Toronto; Google Research, Brain Team

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Deep learning has enabled algorithms to generate realistic images. However, accurately predicting long video sequences requires understanding long-term dependencies and remains an open challenge. While existing video prediction models succeed at generating sharp images, they tend to fail at accurately predicting far into the future. We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences, where higher levels tick at slower intervals. We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets with sequences of up to 1000 frames, where CW-VAE outperforms top video prediction models. Additionally, we propose a Minecraft benchmark for long-term video prediction. We conduct several experiments to gain insights into CW-VAE and confirm that slower levels learn to represent objects that change more slowly in the video, and faster levels learn to represent faster objects. All code is publicly available at https://github.com/vaibhavsaxena11/cwvae.

1 Introduction

Video prediction involves generating high-dimensional sequences for many steps into the future. A simple approach is to predict one image after another, based on the previously observed or generated images (Oh et al., 2015; Kalchbrenner et al., 2016; Babaeizadeh et al., 2017; Denton and Fergus, 2018; Weissenborn et al., 2020). However, such temporally autoregressive models can be computationally expensive because they operate in the high-dimensional image space and at the frame rate of the dataset. In contrast, humans have the ability to reason using abstract concepts and predict their changes far into the future, without having to visualize the details in the input space or at the input frequency.

Latent dynamics models predict a sequence of learned latent states forward that is then decoded into the video, without feeding generated frames back into the model (Kalman, 1960; Krishnan et al., 2015; Karl et al., 2016). These models are typically trained using variational inference, similar to VRNN (Chung et al., 2015), except that the generated images are not fed back into the model. Latent dynamics models have recently achieved success in reinforcement learning for learning world models from pixels (Ha and Schmidhuber, 2018; Zhang et al., 2019; Buesing et al., 2018; Mirchev et al., 2018; Hafner et al., 2019a;b; 2020).

Figure 1: Video prediction quality as a function of the distance predicted (Structural Similarity, SSIM, over up to 200 frames of open-loop prediction on Moving MNIST; curves for CW-VAE with 3 levels at factors 2, 4, 6, and 8, RSSM, SVG-LP, and a random baseline). We show 4 versions of Clockwork VAE with temporal abstraction factors 2, 4, 6, and 8. Larger temporal abstraction directly results in predictions that remain accurate for longer horizons. Clockwork VAE further outperforms the top video models RSSM and SVG.
Figure 2: Long-horizon video predictions on the MineRL Navigate dataset (open-loop predictions without further inputs; rows show ground truth, CW-VAE, VTA, RSSM, and SVG from t = 1 to 500). In the dataset, the camera moves straight ahead most of the time. CW-VAE accurately predicts the camera movement from the ocean to the forest until the end of the sequence. In contrast, VTA outputs artifacts in the sky after 240 steps and shows less diversity in the frames. RSSM fails to predict the movement toward the island after 290 frames. SVG simply copies the initial frame and does not predict any new events in the future.

To better represent complex data, hierarchical latent variable models learn multiple levels of features. Ladder VAE (Sønderby et al., 2016), VLAE (Zhao et al., 2017), NVAE (Vahdat and Kautz, 2020), and very deep VAEs (Child, 2020) have demonstrated the success of this approach for generating static images. Hierarchical latents have also been incorporated into deep video prediction models (Serban et al., 2016; Kumar et al., 2019; Castrejón et al., 2019). These models can learn to separate high-level details, such as textures, from low-level details, such as object positions. However, they operate at the frequency of the input sequence, making it challenging to predict far into the future.

Temporally abstract latent dynamics models predict learned features at a slower frequency than the input sequence. This encourages learning long-term dependencies that can result in more accurate predictions and enable computationally efficient long-horizon planning. Temporal abstraction has been studied for low-dimensional sequences (Koutník et al., 2014; Chung et al., 2016; Mujika et al., 2017). VTA (Kim et al., 2019) models videos using two levels of latent variables, where the fast states decide when the slow states should tick. TD-VAE (Gregor and Besse, 2018) models high-dimensional input sequences using jumpy predictions, without a hierarchy. Refer to Appendix B for further related work. Despite this progress, scaling temporally abstract latent dynamics to complex datasets and understanding how these models organize the information about their inputs remain open challenges.

In this paper, we introduce the Clockwork Variational Autoencoder (CW-VAE), a simple hierarchical latent dynamics model where all levels tick at different fixed clock speeds. We conduct an extensive empirical evaluation and find that CW-VAE outperforms existing video prediction models. Moreover, we conduct several experiments to gain insights into the inner workings of the learned hierarchy. Our key contributions are summarized as follows:

• Clockwork Variational Autoencoder (CW-VAE) We introduce a simple hierarchical video prediction model that leverages different clock speeds per level to learn long-term dependencies in video. A comprehensive empirical evaluation shows that on average, CW-VAE outperforms strong baselines, such as SVG-LP, VTA, and RSSM, across several metrics.

• Long-term video benchmark In the past, the video prediction literature has mainly focused on short-term video prediction of under 100 frames. To evaluate the ability to capture long-term dependencies, we propose using the Minecraft Navigate dataset as a challenging benchmark for video prediction of 500 frames.

• Accurate long-term predictions Despite the simplicity of fixed clock speeds, CW-VAE extends the horizon of accurate video prediction beyond that of prior work. On the Minecraft Navigate dataset, CW-VAE accurately predicts for over 400 frames, whereas prior work fails before or around 150 frames.
• Adaptation to sequence speed We demonstrate that CW-VAE automatically adapts to the frame rate of the dataset. Varying the frame rate of a synthetic dataset confirms that the slower latents are used more when objects in the video are slower, and faster latents are used more when the objects are faster.

• Separation of information We visualize the content represented at higher levels of the temporal hierarchy. This experiment shows that the slower levels represent content that changes more slowly in the input, such as the wall colors of a maze, while faster levels represent faster changing concepts, such as the camera position.

Figure 3: Clockwork Variational Autoencoder (CW-VAE). We show 2 levels with temporal abstraction factor 2 (but we use up to 4 levels and an abstraction factor of 8 in our experiments). Left is the structure of the model, where each latent state $s^2_t$ in the second level conditions two latent states in the first level; the solid arrows represent the generative model, while both solid and dashed arrows comprise the inference model (solid connections are used for both prior and posterior, dashed bottom-up connections from the video data only for the posterior). On the right, we illustrate the internal components of the state, containing a deterministic variable $h_t$ and a stochastic variable $z_t$. The deterministic variable aggregates the temporal and top-down context and passes it to the stochastic variable to be used for inference and stochastic prediction.

2 Clockwork Variational Autoencoder

Long video sequences contain both information that is local to a few frames and global information that is shared among many frames. Traditional video prediction models that predict ahead at the frame rate of the video can struggle to retain information long enough to learn such long-term dependencies. We introduce the Clockwork Variational Autoencoder (CW-VAE) to learn long-term correlations of videos. Our model predicts ahead on multiple time scales, as visualized in Figure 3. We build our work upon the Recurrent State-Space Model (RSSM; Hafner et al., 2019b), the details of which can be found in Appendix A.

CW-VAE consists of a hierarchy of recurrent latent variables, where each level transitions at a different clock speed. We slow down the transitions exponentially as we go up in the hierarchy, i.e. each level is slower than the level below by a factor $k$. We denote the latent state at timestep $t$ and level $l$ by $s^l_t$ and the video frames by $x_t$. We define the set of active timesteps $\mathcal{T}_l$ for each level $l \in [1, L]$ as those instances in time where the state transition generates a new latent state,

Active steps:  $\mathcal{T}_l = \{\, t \in [1, T] \mid t \bmod k^{l-1} = 1 \,\}$.  (1)

At each level, we condition $k$ consecutive latent states on a single latent variable in the level above. For example, in the model shown in Figure 3 with $k = 2$, $\mathcal{T}_1 = \{1, 2, 3, \ldots\}$, $\mathcal{T}_2 = \{1, 3, 5, \ldots\}$, and both $s^1_1$ and $s^1_2$ are conditioned on the same $s^2_1$ from the second level. The latent chains can also be thought of as a hierarchy of latent variables where each level has a separate state variable per timestep, but only updates the previous state every $k^{l-1}$ timesteps and otherwise copies the previous state, so that for all $t \notin \mathcal{T}_l$,

Copied states:  $s^l_t = s^l_{\max\{\tau \in \mathcal{T}_l \,\mid\, \tau \le t\}}$.  (2)
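To make the indexing in Equations 1 and 2 concrete, the short Python sketch below enumerates the active timesteps of each level and the copied-state index for inactive steps. It is a minimal illustration with our own function names, not part of the released implementation, and it treats every step of the first level (where $k^0 = 1$) as active.

def active_steps(T, k, level):
    # Timesteps (1-indexed) at which a level with abstraction factor k transitions.
    # Equivalent to the condition t mod k^(l-1) = 1 of Equation 1, with level 1 always active.
    return [t for t in range(1, T + 1) if (t - 1) % k ** (level - 1) == 0]

def copied_state_step(t, T, k, level):
    # For an inactive timestep t, the most recent active step whose state is copied (Equation 2).
    return max(tau for tau in active_steps(T, k, level) if tau <= t)

# With T = 8 and k = 2 as in Figure 3: level 1 ticks at every step,
# level 2 at t = 1, 3, 5, 7, and level 3 at t = 1 and 5.
for level in (1, 2, 3):
    print(level, active_steps(8, 2, level))
print(copied_state_step(4, 8, 2, 2))  # -> 3, i.e. s^2_4 is a copy of s^2_3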
Joint distribution  We can factorize the joint distribution of a sequence of images and active latents at every level into two terms: (1) the reconstruction terms of the images given their lowest-level latents, and (2) the state transitions at all levels conditioned on the previous latent and the latent above,

$p(x_{1:T}, s^{1:L}_{1:T}) = \Big( \prod_{t=1}^{T} p(x_t \mid s^1_t) \Big) \Big( \prod_{l=1}^{L} \prod_{t \in \mathcal{T}_l} p(s^l_t \mid s^l_{t-1}, s^{l+1}_t) \Big).$  (3)

To implement this distribution and its inference model, CW-VAE utilizes the following components, for all $l \in [1, L]$ and $t \in \mathcal{T}_l$,

Encoder:  $e^l_t = e(x_{t:t+k^{l-1}-1})$
Posterior transition:  $q^l_t := q(s^l_t \mid s^l_{t-1}, s^{l+1}_t, e^l_t)$
Prior transition:  $p^l_t := p(s^l_t \mid s^l_{t-1}, s^{l+1}_t)$
Decoder:  $p(x_t \mid s^1_t)$.  (4)

Inference  CW-VAE embeds the observed frames using a CNN. Each active latent state at a level $l$ receives the image embeddings of its corresponding $k^{l-1}$ observation frames (dashed lines in Figure 3). The diagonal Gaussian belief $q^l_t$ is then computed as a function of the input features, the posterior sample at the previous step, and the posterior sample above (solid lines in Figure 3). We reuse all weights of the generative model for inference, except for the output layer that predicts the mean and variance.

Generation  The diagonal Gaussian prior $p^l_t$ is computed by applying the transition function to the latent state at the previous timestep in the current level and the state belief at the level above (solid lines in Figure 3). Finally, the posterior samples at the lowest level are decoded into images using a transposed CNN.

Training objective  Because we cannot compute the likelihood of the training data under the model in closed form, we use the ELBO as our training objective. This training objective optimizes a reconstruction loss at the lowest level and a KL regularizer at every level in the hierarchy, summed across active timesteps,

$\max_{e,q,p} \; \sum_{t=1}^{T} \mathbb{E}_{q^1_t}\big[\ln p(x_t \mid s^1_t)\big] \;-\; \sum_{l=1}^{L} \sum_{t \in \mathcal{T}_l} \mathbb{E}_{q^l_{t-1}\, q^{l+1}_t}\Big[\mathrm{KL}\big[\, q^l_t \,\|\, p^l_t \,\big]\Big].$  (5)

The KL regularizers limit the amount of information about the images that enters via the encoder. This encourages the model to utilize the "free" information from the previous and above latents and only attend to the input image to the extent necessary. Since the number of active timesteps decreases as we go higher in the hierarchy, the number of KL terms per level decreases as well. Hence it is easier for the model to store slowly changing information high up in the hierarchy than to pay a KL penalty for repeatedly extracting the information from the images at the lower level, or to remember it by passing it along for many steps at the lower level without accidental forgetting.

Stochastic and Deterministic Path  As shown in Figure 3 (right), we split the state $s^l_t$ into stochastic ($z^l_t$) and deterministic ($h^l_t$) parts (Hafner et al., 2019b). The deterministic state is computed using the top-down and temporal context, which then conditions the stochastic state at that level. The stochastic variables follow diagonal Gaussians with predicted means and variances. We use one GRU (Cho et al., 2014) per level to update the deterministic variable at every active step. All components of Equation 4 jointly optimize Equation 5 by stochastic backprop with reparameterized sampling (Kingma and Welling, 2013; Rezende et al., 2014). Refer to Appendix C for architecture details.
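The bookkeeping behind Equation 5 can be made concrete with a small numerical sketch: the snippet below counts how many reconstruction and KL terms enter the objective for a given hierarchy, and shows the closed-form KL between diagonal Gaussians that each regularizer evaluates. This is an illustration under our own naming, not the released training code.

import numpy as np

def kl_diag_gauss(mu_q, std_q, mu_p, std_p):
    # Closed-form KL[q || p] between diagonal Gaussians, summed over dimensions.
    return float(np.sum(np.log(std_p / std_q)
                        + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2) - 0.5))

def objective_terms(T, L, k):
    # Number of reconstruction terms and per-level KL terms in Equation 5.
    recon = T  # one log-likelihood per frame, decoded from the lowest level
    kls = {l: sum(1 for t in range(1, T + 1) if (t - 1) % k ** (l - 1) == 0)
           for l in range(1, L + 1)}
    return recon, kls

# For a 100-frame training sequence with 3 levels and abstraction factor 6, the lowest
# level pays 100 KL terms, the middle level 17, and the top level only 3, which is why
# slowly changing information is cheaper to store high up in the hierarchy.
print(objective_terms(100, 3, 6))
print(kl_diag_gauss(np.zeros(4), np.ones(4), np.zeros(4), np.ones(4)))  # 0.0 for identical Gaussians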
3 Experiments

We compare the Clockwork Variational Autoencoder (CW-VAE) on 4 diverse video prediction datasets to state-of-the-art video prediction models in Sections 3.2 to 3.5. We then conduct an extensive analysis to gain insights into the behavior of CW-VAE. We study the content stored at each level of the hierarchy in Section 3.6, the effect of different clock speeds in Section 3.7, the amount of information at each level in Section 3.8, and the change in information as a function of the dataset frame rate in Section 3.9. The source code and video predictions are available on the project website (https://danijar.com/cwvae).

Datasets  We choose 4 diverse video datasets for the benchmark. The MineRL Navigate dataset (available under the CC Attribution-NonCommercial-ShareAlike 4.0 license) was crowdsourced by Guss et al. (2019) for reinforcement learning applications. We process this data to create a long-horizon video prediction dataset that contains ∼750k frames. The sequences show players traveling to goal locations in procedurally generated 3D worlds of the video game Minecraft (traversing forests, mountains, villages, and oceans). The KTH Action video prediction dataset (Schuldt et al., 2004) (available under the CC Attribution-NonCommercial license) contains 290k frames. The 600 videos show humans walking, jogging, running, boxing, hand-waving, and clapping. GQN Mazes (Eslami et al., 2018) (available under the Apache License 2.0) contains 9M frames of videos of a scripted policy that traverses procedurally generated mazes with randomized wall and floor textures. For Moving MNIST (Srivastava et al., 2015) (available under the CC Attribution-ShareAlike 3.0 license) we generate 2M frames where two digits move with velocities sampled uniformly in the range of 2 to 6 pixels per frame and bounce within the edges.

Baselines  We compare CW-VAE to 3 well-established video prediction models, and to an ablation of our method where all levels tick at the fastest scale, which we call NoTmpAbs. VTA (Kim et al., 2019) is the state of the art for video prediction using temporal abstraction. It consists of two levels, with the lower level predicting when the higher level should step. The lower level is reset when the higher level steps. RSSM (Hafner et al., 2019b) is commonly used as a world model in reinforcement learning. It predicts forward using a sequence of compact latents without temporal abstraction. SVG-LP (Denton and Fergus, 2018), or SVG for short, has been shown to generate sharp predictions on visually complex datasets. It autoregressively feeds generated images back into the model while also using a latent variable at every time step. The parameter counts are shown in Table D.2. NoTmpAbs simply sets the temporal abstraction factor to 1 and thus uses the same number of parameters as its temporally abstract counterparts.

Training details  We train all models on all datasets for 300 epochs on training sequences of 100 frames of size 64 × 64 pixels. For the baselines, we tune the learning rate in the range $[10^{-4}, 10^{-3}]$ and the decoder stddev in the range [0.1, 1]. We use a temporal abstraction factor of 6 per level for CW-VAE, unless stated otherwise. Refer to Appendix D for hyperparameters and training durations. A 3-level CW-VAE with abstraction factor 6 takes 2.5 days to train on one Nvidia Titan Xp GPU. Higher temporal abstraction factors train faster because fewer state transitions need to be computed.
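For reference, the settings stated above can be collected into a single configuration. The snippet below is a hypothetical config of our own making; the field names are not those of the released code.

# Hypothetical experiment configuration collecting the values reported above.
cw_vae_config = {
    "levels": 3,                        # latent levels in the hierarchy
    "tmp_abs_factor": 6,                # each level ticks 6x slower than the one below
    "seq_len": 100,                     # training sequences of 100 frames
    "image_size": (64, 64),             # frame resolution in pixels
    "epochs": 300,
    "baseline_lr_range": (1e-4, 1e-3),  # learning-rate range tuned for the baselines
    "baseline_decoder_stddev_range": (0.1, 1.0),
}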
Evaluation  We evaluate the open-loop video predictions under 4 metrics: Structural Similarity index (SSIM, higher is better), Peak Signal-to-Noise Ratio (PSNR, higher is better), Learned Perceptual Image Patch Similarity (LPIPS, lower is better; Zhang et al., 2018), and Frechet Video Distance (FVD, lower is better; Unterthiner et al., 2018). All video predictions are open-loop, meaning that the models only receive the first 36 frames as context input and then predict forward without access to intermediate frames. The number of input frames equals one step of the slowest level of CW-VAE for simplicity; see Appendix D.
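As a concrete example of how prediction quality is reported as a function of the prediction horizon (Figure 1 does this with SSIM), the sketch below computes a PSNR curve over an open-loop rollout. It is our own minimal illustration; SSIM, LPIPS, and FVD are assumed to come from their standard reference implementations, which are not reproduced here.

import numpy as np

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio of a single frame with pixel values in [0, max_val].
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def psnr_per_step(pred_video, gt_video):
    # Metric at every step of an open-loop rollout; inputs have shape (T, H, W, C).
    return np.array([psnr(p, g) for p, g in zip(pred_video, gt_video)])

# Toy usage with random frames standing in for 100 predicted and ground-truth images.
pred = np.random.rand(100, 64, 64, 3)
gt = np.random.rand(100, 64, 64, 3)
print(psnr_per_step(pred, gt)[:5])  # plot against frame index for a Figure-1-style curve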
3.1 Benchmark Scores

We evaluate the 5 models on 4 datasets and 4 metrics. To aggregate the scores, we compute how each model ranks compared to the other models, and average its ranks across datasets and metrics, as shown in Figure 5c. Individual scores are shown in Table E.1. Averaged over datasets and metrics, CW-VAE substantially outperforms the existing methods, which we attribute to its hierarchical latents and temporal abstraction. The second best model is NoTmpAbs, which uses hierarchical latents but no temporal abstraction. It is followed by RSSM, which has only one level. The performance improvement due to hierarchical latents is smaller than the additional improvement due to temporal abstraction. Specifically, NoTmpAbs is sometimes outperformed by RSSM, but CW-VAE matches or outperforms RSSM on all datasets and metrics. While SVG achieves high scores on GQN Mazes due to its sharp predictions, it performs poorly compared to CW-VAE and RSSM on the other 3 datasets. This result indicates the benefits of predicting forward purely in latent space instead of feeding generated images back into the model. While VTA makes reasonable predictions, it lags behind the other methods in our experiments. One reason could be that its low-level latents do not carry over the recurrent state when the high level ticks, which we found to cause sudden jumps in the predicted video sequence.

Figure 5: Evaluating video prediction models. (a) Accuracy of the sequence of rooms in GQN Mazes. CW-VAE predicts video with a significantly higher accuracy of high-level details than the second best video prediction model for this evaluation (VTA). (b) Accuracy of digit identities in two-digit Moving MNIST over prediction horizons of up to 1000 frames. CW-VAE generates video with a significantly higher accuracy of digit identities until 1000 timesteps. (c) Aggregated performance ranks of all methods across 4 datasets and 4 metrics. The best possible rank is 1 and the worst is 5. Our CW-VAE uses a temporally abstract hierarchy and substantially outperforms a hierarchy without temporal abstraction (NoTmpAbs), as well as the top single-level model RSSM, the image-space model SVG, and the temporally abstract model VTA.

3.2 MineRL Navigate

Video predictions for sequences of 500 frames are shown in Figure 2. Given only an island on the horizon as context, the temporally abstract models CW-VAE and VTA correctly predict that the player will enter the island (as the player typically navigates straight ahead in the dataset). However, the predictions of VTA lack diversity and contain artifacts. In contrast, CW-VAE predicts diverse variations in the terrain, such as grass, rocks, trees, and a pond. RSSM generates plausible images but fails to capture long-term dependencies, such as the consistent movement toward the island. SVG predicts only small to no changes after the first few frames. Additional samples in Figure F.1 further confirm these findings. The predictions of all models are a bit blurry, which we attribute to the model capacity, which was restricted due to our limited computational resources.

3.3 KTH Action

Figure 4 shows video prediction samples for the walking task of KTH Action. CW-VAE accurately predicts the motion of the person as they walk across the frame until walking out of the frame. VTA predicts that the person suddenly disappears, which we attribute to it resetting the low level every time the high level steps. Both RSSM and SVG tend to forget the task demonstrated in the context frames, with RSSM predicting a person standing still and SVG predicting the person starting to move in the opposite direction, after open-loop generation of about 20 frames. We also point out that SVG required the slower VGG architecture for KTH, whereas CW-VAE works well even with the smaller DCGAN architecture.

Figure 4: Open-loop video predictions for KTH Action (rows show ground truth, CW-VAE, VTA, RSSM, and SVG from t = 1 to 85). CW-VAE predicts accurately for 50 time steps. VTA fails to predict coherent transitions from one frame to the next around time step 55, which we attribute to the fact that its lower level is reset when the higher level steps. SVG and RSSM accurately predict for 20 frames and their subsequent predictions appear plausible, but fail to capture the long-term dependencies in the walking movement that is present in the dataset.

3.4 GQN Mazes

Figure G.1 shows open-loop video prediction samples of GQN Mazes, where the camera traverses two rooms connected by a hallway with distinct randomized textures in a 3D maze. CW-VAE maintains and predicts wall and floor patterns of both rooms for 200 frames, whereas RSSM and VTA fail to predict the transition to the textures of the second room. SVG is the model with the sharpest predictions on this dataset, which is reflected by the metrics. However, it confuses textures over a long horizon, generating mixed features on the walls and floor of the maze, as visible in the video predictions in Figure G.1. While the open-loop predictions of CW-VAE differ from the ground truth in their camera viewpoints, the model remembers the wall and floor patterns for all 200 timesteps, highlighting its ability to maintain global information for a longer duration.
To better evaluate CW-VAE, we compute the prediction accuracy for the following three high-level categories of room sequences in the ground-truth video: 1) the agent moves across the same room throughout the video, 2) the agent traverses into the hallway but does not transition into another room, 3) the agent goes into the hallway and traverses back into a room. Because the rooms and hallways use different textures, we can easily identify these classes via the color histogram (a short sketch of such a histogram check is given after Section 3.5). Figure 5a shows that CW-VAE predicts video with a significantly higher accuracy than the second best model, VTA, which is then followed by the other baselines with a similar accuracy. This shows that CW-VAE immensely benefits from its temporal hierarchy for accurately predicting high-level details such as the sequence of rooms in a video.

Figure 6: Long-horizon open-loop video predictions on Moving MNIST (rows show ground truth, CW-VAE, VTA, RSSM, and SVG from t = 1 to 1000). We compare samples of CW-VAE, VTA, RSSM, and SVG. Red crosses indicate predicted frames that show incorrect digit identities. We observe that CW-VAE remembers the digit identities across all 1000 frames of the prediction horizon, whereas all other models forget the digit identities before or around 300 steps. The image-space model SVG is the first to forget the digit identities, supporting the hypothesis that prediction errors accumulate more quickly when predicted images are fed back into the model compared to predicting forward purely in latent space.

3.5 Moving MNIST

Figure 6 shows samples of open-loop video prediction of 1000 frames. We observe that CW-VAE remembers the digit identity for all 1000 timesteps. RSSM clearly outperforms SVG, which typically forgets digit identity within 50 timesteps, but starts to forget object identities much sooner than CW-VAE. Figure 5b shows the two-digit classification accuracy of video prediction as it varies over time, averaged over the test set. We observe that while RSSM initially has the highest accuracy, it falls below CW-VAE after 200 frames. CW-VAE maintains a stable accuracy over 1000 timesteps, ending with significantly more accurate predictions at the horizon than the second best model, VTA. With regards to digit positions, as shown in Figure 6, CW-VAE predicts accurate positions until 100 steps, and predicts a plausible sequence thereafter. RSSM also predicts the correct location of digits for at least as long as CW-VAE, whereas SVG starts to lose track of positions much sooner. We note that the generations by models that predict purely in latent space are slightly blurry compared to those generated by SVG.
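The room-sequence accuracy in Figure 5a relies on identifying rooms and hallways from their textures via color histograms. The sketch below is our own minimal illustration of such a check, with hypothetical helper names and a simple reading of the three categories; it is not the evaluation code used for the paper.

import numpy as np

def color_histogram(frame, bins=8):
    # Joint RGB histogram of a frame with values in [0, 1], normalized to sum to 1.
    hist, _ = np.histogramdd(frame.reshape(-1, 3), bins=(bins,) * 3, range=[(0, 1)] * 3)
    return hist.ravel() / hist.sum()

def label_frame(frame, references):
    # Assign the frame to the reference texture (room or hallway) with the closest histogram.
    h = color_histogram(frame)
    return min(references, key=lambda name: np.abs(h - references[name]).sum())

def room_sequence_category(labels):
    # Map per-frame labels to the three categories of Figure 5a (our reading of them).
    if "hallway" not in labels:
        return "stays in the same room"
    after = labels[labels.index("hallway"):]
    if all(l == "hallway" for l in after):
        return "enters the hallway only"
    return "enters the hallway and then a room"

# Toy usage: reference histograms from one frame of each texture, then classify a rollout.
refs = {"room_a": color_histogram(np.random.rand(64, 64, 3)),
        "room_b": color_histogram(np.random.rand(64, 64, 3)),
        "hallway": color_histogram(np.random.rand(64, 64, 3))}
rollout = np.random.rand(20, 64, 64, 3)
labels = [label_frame(f, refs) for f in rollout]
print(room_sequence_category(labels))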
3.6 Content Stored at Different Levels

We visualize the information stored at different levels of the hierarchy. We generate video predictions where only the lower and middle levels have access to the context images, but the high level follows its prior. This way, the prediction will only be consistent with the context inputs for information that was extracted by the lower and middle levels. Information that is held by the top level is not informed by the context frames and thus will follow the training distribution. We use 8 instead of 36 input images for this experiment because we do not need to compute the top-level posterior here. Figure 7 shows these samples for one input sequence and more examples are included in Figure G.1. We find that the video predictions correctly continue the positions of the camera and nearby walls visible in the input frames, but that the textures appear randomized. This confirms that the top level stores the global information about the textures, whereas more ephemeral information is stored in the faster ticking lower and middle levels. We also experimented with resetting the lower or middle level but found that they store similar information, suggesting that a 2-level model may be sufficient for this dataset.

Figure 7: Visualization of the content stored at the top level of a Clockwork VAE trained on GQN Mazes. The first row shows ground truth and the second row shows a normal video prediction. The remaining rows show video predictions where the top-level stochastic variables are drawn from the prior rather than the posterior. In other words, only the lower and middle levels have access to the context images but the top level is blind. We find that the positions of the camera and nearby walls remain unchanged, so this information must have been represented at the lower and middle levels. In contrast, the model predicts textures that differ from the context, meaning that this global information must have been stored at the top level.

3.7 Temporal Abstraction Factor

Figure 1 compares the quality of open-loop video predictions on Moving MNIST for CW-VAE with temporal abstraction factors 2, 4, 6, and 8, all with an equal number of model parameters. Increasing the temporal abstraction factor directly increases the duration for which the predicted frames remain accurate. Compared to RSSM and SVG, CW-VAEs model long-term dependencies for 6× as many frames as the baselines before losing temporal context. The point at which the models lose temporal context is when they approach the "random" line, which shows the quality of using randomly sampled training images as a naive baseline for video prediction that has no temporal dependencies.

3.8 Information Amount per Level

Table 1 shows the KL regularizers at each level for CW-VAEs with varying numbers of levels. The KL regularizers are summed across evaluation sequences on the Moving MNIST dataset. They provide an upper bound on the amount of information incorporated into each level of the hierarchy, which we use as an indicator. The 2-level and 3-level CW-VAEs were trained with a temporal abstraction factor of 6, and the 4-level model with a factor of 4 to fit into GPU memory. We observe that the amount of information stored at the lowest level decreases as we use a deeper hierarchy, and further decreases for larger temporal abstraction factors. Using a larger number of levels means that the lowest level does not need to capture as much information. Moreover, increasing the amount of temporal abstraction makes the higher levels more useful to the model, again reducing the amount of information that needs to be incorporated at the lowest level.

Model            Level 1   Level 2   Level 3   Level 4
4-level CW-VAE   397.10    41.20     2.66      0.0001
3-level CW-VAE   366.70    45.28     6.29      –
2-level CW-VAE   389.40    39.18     –         –
1-level CW-VAE   440.50    –         –         –

Table 1: KL loss at each level of the hierarchy for CW-VAEs with different numbers of levels, summed over the time steps of a training sequence. Deeper hierarchies incorporate less information about the inputs into the lowest level and instead distribute the information content across levels. Higher levels tend to store less information than lower levels, suggesting that the dataset contains more short-term dependencies than long-term dependencies.
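The numbers in Table 1 are sums of per-step KL values (in nats) over the active steps of each level. The sketch below shows how such a diagnostic could be aggregated once the per-step KLs have been computed; the values used here are made up for illustration and are not the paper's measurements.

import numpy as np

def information_per_level(step_kls):
    # Sum per-step KL values (in nats) over the active steps of each level, as in Table 1.
    return {level: float(np.sum(values)) for level, values in step_kls.items()}

# Made-up per-step KLs for a 3-level model on a 100-frame sequence with abstraction factor 6:
# 100 active steps at level 1, 17 at level 2, and 3 at level 3 (Equation 1).
rng = np.random.default_rng(0)
fake_kls = {1: rng.uniform(2.0, 5.0, size=100),
            2: rng.uniform(1.0, 4.0, size=17),
            3: rng.uniform(0.0, 2.0, size=3)}
print(information_per_level(fake_kls))
# For scale: identifying an ordered pair of MNIST digits takes about ln(100) = 4.6 nats,
# consistent with the small top-level KLs discussed in Section 3.9.
print(np.log(100))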
3.9 Adapting to Sequence Speed

To understand how our model adapts to changing temporal correlations in the dataset, we train CW-VAE on slower and faster versions of Moving MNIST. Figure 8 shows the KL divergence summed across the active timesteps of each level in the hierarchy. The KL regularizer at each level correlates with the frame rate of the dataset. The faster the digits move, the more the fast ticking lowest level of the hierarchy is used. The slower the digits move, the more the middle and top levels are used. Even though the KL divergence at the top level is small, it follows a consistent trend. We conjecture that the top level stores relatively little information because the only global information is the two digit identities, which take about 5 nats to store. The experiment shows that the amount of information stored at any temporally abstract level adapts to the speed of the sequence, with high-frequency details pushed into fast latents.

Figure 8: KL loss (in nats) at the bottom, middle, and top levels of CW-VAE when trained on slower or faster variants of Moving MNIST (digit speeds from 3% to 300% of the original). The KL value at a level indicates the amount of information stored at that level. Faster moving digits result in higher KLs at the lower level and lower KLs at the higher levels. Slower moving digits result in more information at the higher levels.

4 Discussion

This paper introduces the Clockwork Variational Autoencoder (CW-VAE), which leverages a temporally abstract hierarchy of latent variables for long-term video prediction. We demonstrate its empirical performance on 4 diverse video prediction datasets with up to 1000 frames and show that CW-VAE outperforms top video prediction models from the literature. Moreover, we confirm experimentally that the slower ticking higher levels of the hierarchy learn to represent content that changes more slowly in the video, whereas lower levels hold faster changing information. We point out the following limitations of our work as promising directions for future work:

• The typical video prediction metrics are not ideal at capturing the quality of video predictions, especially for long horizons. To this end, we have experimented with evaluating the best out of 100 samples for each evaluation video but have not found significant differences in the results, while evaluating even more samples is computationally infeasible for us. Using datasets where underlying attributes of the scene are available would allow evaluating the multi-step predictions by how well those attributes can be extracted from the representations using a separately trained readout network.

• In our experiments, we train the Clockwork VAE end-to-end on training sequences of length 100. With 3 levels and a temporal abstraction factor of 8, the top level can only step once within each training sequence. Our experiments show clear benefits of the temporally abstract model, but we conjecture that its performance could be further improved by training the top level on more consecutive transitions.

• We used relatively small convolutional neural networks for encoding and decoding the images. This results in a relatively lightweight model that can easily be trained on a single GPU in a few days. However, we conjecture that using a larger architecture could increase the quality of the generated images, resulting in a model that excels at predicting both high-frequency details and long-term dependencies in the data.

Temporally abstract latent hierarchies are an intuitive approach for processing high-dimensional input sequences. Besides video prediction, we hope that our findings can help advance other domains that deal with high-dimensional sequences, such as representation learning, video understanding, and reinforcement learning.

Acknowledgments

We thank Ruben Villegas and our anonymous reviewers for their valuable input.

References
M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. CoRR, abs/1710.11252, 2017. URL http://arxiv.org/abs/1710.11252.
S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171-1179, 2015.
Y. Bengio. Markovian models for sequential data, 1996.
H. Bourlard and N. Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, 1994. ISBN 978-1-4613-6409-2. doi: 10.1007/978-1-4615-3210-1.
L. Buesing, T. Weber, S. Racaniere, S. Eslami, D. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.
L. Castrejón, N. Ballas, and A. C. Courville. Improved conditional VRNNs for video prediction. CoRR, abs/1904.12165, 2019. URL http://arxiv.org/abs/1904.12165.
R. Child. Very deep VAEs generalize autoregressive models and can outperform them on images, 2020.
K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.
J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980-2988, 2015.
J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
A. Clark, J. Donahue, and K. Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
E. Denton and R. Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204-1210, 2018.
J.-Y. Franceschi, E. Delasalles, M. Chen, S. Lamprier, and P. Gallinari. Stochastic latent residual video prediction. arXiv preprint arXiv:2002.09219, 2020.
M. Gemici, C.-C. Hung, A. Santoro, G. Wayne, S. Mohamed, D. J. Rezende, D. Amos, and T. Lillicrap. Generative temporal models with memory. arXiv preprint arXiv:1702.04649, 2017.
A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014. URL http://arxiv.org/abs/1410.5401.
A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. Gómez, E. Grefenstette, T. Ramalho, J. Agapiou, A. Badia, K. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538, 2016. doi: 10.1038/nature20101.
K. Gregor and F. Besse. Temporal difference variational auto-encoder. arXiv preprint arXiv:1806.03107, 2018.
W. H. Guss, B. Houghton, N. Topin, P. Wang, C. Codel, M. Veloso, and R. Salakhutdinov. MineRL: A large-scale dataset of Minecraft demonstrations. Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019. URL http://minerl.io.
D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a.
D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555-2565, 2019b.
D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5-13, 1993.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
D. Jayaraman, F. Ebert, A. A. Efros, and S. Levine. Time-agnostic prediction: Predicting predictable video frames. arXiv preprint arXiv:1808.07784, 2018.
N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35-45, 1960.
M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.
T. Kenter, V. Wan, C.-A. Chan, R. Clark, and J. Vit. CHiVE: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network. In International Conference on Machine Learning, pages 3331-3340. PMLR, 2019.
T. Kim, S. Ahn, and Y. Bengio. Variational temporal abstraction. In Advances in Neural Information Processing Systems, pages 11566-11575, 2019.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
J. Koutník, K. Greff, F. J. Gomez, and J. Schmidhuber. A clockwork RNN. CoRR, abs/1402.3511, 2014. URL http://arxiv.org/abs/1402.3511.
R. G. Krishnan, U. Shalit, and D. Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.
M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma. VideoFlow: A conditional flow-based model for stochastic video generation. arXiv preprint arXiv:1903.01434, 2019.
A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. CoRR, abs/1804.01523, 2018. URL http://arxiv.org/abs/1804.01523.
W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
A. Mirchev, B. Kayalibay, M. Soelch, P. van der Smagt, and J. Bayer. Approximate Bayesian inference in spatial environments. arXiv preprint arXiv:1805.07206, 2018.
A. Mujika, F. Meier, and A. Steger. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems, pages 5915-5924, 2017.
J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863-2871, 2015.
K. Pertsch, O. Rybkin, F. Ebert, C. Finn, D. Jayaraman, and S. Levine. Long-horizon visual planning with goal-conditioned hierarchical predictors. arXiv preprint arXiv:2006.13205, 2020.
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Y. Bengio and Y. LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06434.
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 3, pages 32-36, 2004.
I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069, 2016.
E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell. Clockwork convnets for video semantic segmentation. CoRR, abs/1608.03609, 2016. URL http://arxiv.org/abs/1608.03609.
C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3745-3753. Curran Associates Inc., 2016. ISBN 978-1-5108-3881-9. URL http://dl.acm.org/citation.cfm?id=3157382.3157516.
N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015. URL http://arxiv.org/abs/1502.04681.
T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges. CoRR, abs/1812.01717, 2018. URL http://arxiv.org/abs/1812.01717.
A. Vahdat and J. Kautz. NVAE: A deep hierarchical variational autoencoder, 2020.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, 2016.
D. Weissenborn, J. Uszkoreit, and O. Täckström. Scaling autoregressive video models. In ICLR, 2020.
N. Wichers, R. Villegas, D. Erhan, and H. Lee. Hierarchical long-term video prediction without supervision. CoRR, abs/1806.04768, 2018. URL http://arxiv.org/abs/1806.04768.
M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. Johnson, and S. Levine. SOLAR: Deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning, 2019.
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586-595, 2018.
S. Zhao, J. Song, and S. Ermon. Learning hierarchical features from generative models. CoRR, abs/1702.08396, 2017. URL http://arxiv.org/abs/1702.08396.