Learning Joint Spatial-Temporal Transformations For Video Inpainting
Microsoft Research Asia
[email protected], [email protected], [email protected]
1 Introduction
Video inpainting is a task that aims at filling missing regions in video frames with
plausible contents [2]. An effective video inpainting algorithm has a wide range of
practical applications, such as corrupted video restoration [10], unwanted object
removal [22,26], video retargeting [16] and under/over-exposed image restoration
[18]. Despite the huge benefits of this technology, high-quality video inpainting
still faces grand challenges, such as the lack of high-level understanding of videos
[15,29] and high computational complexity [5,33].
Significant progress has been made by using 3D convolutions and recurrent
networks for video inpainting [5,16,29]. These approaches usually fill missing
regions by aggregating information from nearby frames. However, they suffer
from temporal artifacts due to limited temporal receptive fields. To solve the
above challenge, state-of-the-art methods apply attention modules to capture
∗ This work was done when Y. Zeng was an intern at Microsoft Research Asia.
† J. Fu and H. Chao are the corresponding authors.
mization [5,6]. Such a loss design can optimize STTN to learn both perceptually
pleasing and coherent visual contents for video inpainting.
In summary, our main contribution is to learn joint spatial and temporal
transformations for video inpainting, by a deep generative model with adver-
sarial training along spatial-temporal dimensions. Furthermore, the proposed
multi-scale patch-based video frame representations can enable fast training and
inference, which is important to video understanding tasks. We conduct both
quantitative and qualitative evaluations using both stationary masks and mov-
ing object masks for simulating real-world applications (e.g., watermark removal
and object removal). Experiments show that our model outperforms the state
of the art by a significant margin in terms of PSNR and VFID, with relative
improvements of 2.4% and 19.7%, respectively. We also show extensive ablation
studies to verify the effectiveness of the proposed spatial-temporal transformer.
2 Related Work
(Figure 2 diagram, labels only: frames 1 … t and the target frame; patch scales 1 … n; 1×1 convolutions producing Q, K, and V; softmax attention; 3×3 convolution.)
where $X_{t-n}^{t+n}$ denotes a short clip of neighboring frames with a center moment $t$
and a temporal radius $n$. $X_{1,s}^{T}$ denotes distant frames that are uniformly sampled
from the video $X_{1}^{T}$ at a sampling rate of $s$. Since $X_{1,s}^{T}$ can usually cover most
key frames of the video, it is able to describe “the whole story” of the video.
Under this formulation, video inpainting models are required to not only preserve
temporal consistency in neighboring frames, but also make the completed frames
coherent with “the whole story” of the video.
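To make this formulation concrete, the following sketch gathers the indices of the conditioning frames for a target moment; the function name and the default values of the temporal radius n and the sampling rate s are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch (not the released implementation): collect indices of the
# neighboring frames X_{t-n}^{t+n} and the distant frames X_{1,s}^{T}.
def build_condition_indices(t, T, n=2, s=10):
    # Neighboring frames: a short clip centered at moment t with temporal radius n.
    neighbors = list(range(max(0, t - n), min(T, t + n + 1)))
    # Distant frames: uniformly sampled over the whole video at sampling rate s.
    distant = [i for i in range(0, T, s) if i not in neighbors]
    return neighbors, distant

# Example: for a 60-frame video and target moment t = 15, the model conditions on
# frames 13..17 plus every 10th frame of the video.
print(build_condition_indices(t=15, T=60))
```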
Network design: The overview of the proposed Spatial-Temporal Transformer
Networks (STTN) is shown in Figure 2. As indicated in Eq. (1), STTN
takes both neighboring frames $X_{t-n}^{t+n}$ and distant frames $X_{1,s}^{T}$ as conditions, and
completes all the input frames simultaneously. Specifically, STTN consists of three
components, including a frame-level encoder, multi-layer multi-head spatial-
temporal transformers, and a frame-level decoder. The frame-level encoder is
built by stacking several 2D convolution layers with strides, which aims at encod-
ing deep features from low-level pixels for each frame. Similarly, the frame-level
decoder is designed to decode features back to frames. Spatial-temporal trans-
formers are the core component, which aims at learning joint spatial-temporal
transformations for all missing regions in the deep encoding space.
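The sketch below illustrates this three-part layout; the channel widths, strides, number of transformer blocks, and the placeholder `TransformerBlock` are illustrative assumptions and not the exact released architecture.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Stand-in for the multi-head spatial-temporal attention block detailed below."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Conv2d(channels, channels, 3, padding=1)  # placeholder only

    def forward(self, feats):
        return feats + self.body(feats)

class STTNSketch(nn.Module):
    """Schematic frame-level encoder / transformer stack / frame-level decoder."""
    def __init__(self, channels=256, num_layers=8):
        super().__init__()
        # Frame-level encoder: strided 2D convolutions shared across all frames.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, channels, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Core: stacked spatial-temporal transformer blocks.
        self.transformers = nn.ModuleList(
            TransformerBlock(channels) for _ in range(num_layers)
        )
        # Frame-level decoder: transposed convolutions map features back to frames.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, frames):                  # frames: (T, 3, H, W), masked inputs
        feats = self.encoder(frames)            # (T, C, H/4, W/4)
        for block in self.transformers:
            feats = block(feats)                # joint spatial-temporal transformations
        return self.decoder(feats)              # completed frames, (T, 3, H, W)

# Example: STTNSketch()(torch.randn(5, 3, 240, 432)).shape -> torch.Size([5, 3, 240, 432])
```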
frame are mapped into query and memory (i.e., key-value pair) for further re-
trieval. In the Matching step, region affinities are calculated by matching queries
and keys among spatial patches that are extracted from all the frames. Finally,
relevant regions are detected and transformed for missing regions in each frame
in the Attending step. We introduce more details of each step below.
Embedding: We use $f_1^T = \{f_1, f_2, ..., f_T\}$, where $f_i \in \mathbb{R}^{h \times w \times c}$, to denote the
features encoded by the frame-level encoder or former transformers, which are
the input of transformers in Fig. 2. Similar to many sequence modeling models,
mapping features into key and memory embeddings is an important step in
transformers [9,28]. Such a step enables modeling deep correspondences for each
region in different semantic spaces:
$$q_i, (k_i, v_i) = M_q(f_i), (M_k(f_i), M_v(f_i)), \quad (2)$$
where $1 \le i \le T$; $M_q(\cdot)$, $M_k(\cdot)$ and $M_v(\cdot)$ denote the $1 \times 1$ 2D convolutions that
embed input features into query and memory (i.e., key-value pair) feature spaces
while maintaining the spatial size of features.
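A minimal sketch of Eq. (2), assuming a feature width of 256 channels; the module and method names are our own.

```python
import torch
import torch.nn as nn

class QKVEmbedding(nn.Module):
    """Map frame features f_i into query / key / value spaces via 1x1 convolutions,
    keeping the spatial size unchanged (a sketch of Eq. (2))."""
    def __init__(self, channels=256):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats):           # feats: (T, C, h, w) encoded frame features
        return self.to_q(feats), self.to_k(feats), self.to_v(feats)

# Example: q, k, v = QKVEmbedding()(torch.randn(5, 256, 60, 108))
```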
Matching: We conduct patch-based matching in each head. In practice, we
first extract spatial patches of shape $r_1 \times r_2 \times c$ from the query feature of each
frame, and we obtain $N = T \times h/r_1 \times w/r_2$ patches. Similar operations are con-
ducted to extract patches in the memory (i.e., key-value pair in the transformer).
Such an effective multi-scale patch-based video frame representation can avoid
redundant patch matching and enable fast training and inference. Specifically, we
reshape the query patches and key patches into 1-dimension vectors separately,
so that patch-wise similarities can be calculated by matrix multiplication. The
similarity between the $i$-th patch and the $j$-th patch is denoted as:
$$s_{i,j} = \frac{p_i^q \cdot (p_j^k)^{T}}{\sqrt{r_1 \times r_2 \times c}}, \quad (3)$$
where $1 \le i, j \le N$, $p_i^q$ denotes the $i$-th query patch, $p_j^k$ denotes the $j$-th key
patch. The similarity value is normalized by the square root of the patch vector dimension to avoid
small gradients caused by the subsequent softmax function [28]. Corresponding
attention weights for all patches are calculated by a softmax function:
$$\alpha_{i,j} = \begin{cases} \exp(s_{i,j}) \big/ \sum_{n=1}^{N} \exp(s_{i,n}), & p_j \in \Omega, \\ 0, & p_j \in \bar{\Omega}, \end{cases} \quad (4)$$
where $\Omega$ denotes visible regions outside masks, and $\bar{\Omega}$ denotes missing regions.
Naturally, we only borrow features from visible regions for filling holes.
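The sketch below re-implements Eqs. (3) and (4) for a single head, assuming non-overlapping r1 x r2 patches and a float mask with value 1 inside missing regions; function names and defaults are illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

def match_patches(q, k, mask, r1=7, r2=7):
    """q, k: (T, C, h, w) query/key features; mask: (T, 1, h, w) float, 1 inside holes.
    Returns attention weights of shape (N, N) with N = T * h/r1 * w/r2 (Eqs. (3)-(4))."""
    T, C, h, w = q.shape
    # Extract non-overlapping r1 x r2 patches and flatten each into a vector.
    pq = F.unfold(q, kernel_size=(r1, r2), stride=(r1, r2))    # (T, C*r1*r2, h/r1*w/r2)
    pk = F.unfold(k, kernel_size=(r1, r2), stride=(r1, r2))
    pq = pq.permute(0, 2, 1).reshape(-1, C * r1 * r2)          # (N, C*r1*r2)
    pk = pk.permute(0, 2, 1).reshape(-1, C * r1 * r2)
    # Eq. (3): scaled dot-product similarity between every pair of patches.
    s = pq @ pk.t() / (r1 * r2 * C) ** 0.5                     # (N, N)
    # A key patch is treated as visible only if it contains no missing pixels.
    pm = F.unfold(mask, kernel_size=(r1, r2), stride=(r1, r2))
    visible = pm.permute(0, 2, 1).reshape(-1, r1 * r2).sum(-1) == 0   # (N,) bool
    # Eq. (4): softmax over visible key patches; masked patches get zero weight.
    s = s.masked_fill(~visible.unsqueeze(0), float("-inf"))
    return torch.softmax(s, dim=-1)
```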
Attending: After modeling the deep correspondences for all spatial patches,
the output for the query of each patch can be obtained by weighted summation
of values from relevant patches:
$$o_i = \sum_{j=1}^{N} \alpha_{i,j}\, p_j^v, \quad (5)$$
where $p_j^v$ denotes the $j$-th value patch. After receiving the output for all patches,
we piece all patches together and reshape them into T frames with original spatial
size h × w × c. The resultant features from different heads are concatenated and
further passed through a subsequent 2D residual block [12]. This subsequent
processing is used to enhance the attention results by looking at the context
within the frame itself.
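Continuing the sketch above, the attending step of Eq. (5) and the reassembly into T frames could look as follows; the fold parameters mirror the unfold call in the matching sketch, and the multi-head concatenation and residual block are only noted in a comment.

```python
import torch.nn.functional as F

def attend_patches(alpha, v, r1=7, r2=7):
    """alpha: (N, N) attention weights from the matching step; v: (T, C, h, w) values.
    Returns transformed features of shape (T, C, h, w) (a sketch of Eq. (5))."""
    T, C, h, w = v.shape
    pv = F.unfold(v, kernel_size=(r1, r2), stride=(r1, r2))    # (T, C*r1*r2, L)
    L = pv.shape[-1]
    pv = pv.permute(0, 2, 1).reshape(-1, C * r1 * r2)          # (N, C*r1*r2)
    out = alpha @ pv                                           # Eq. (5): weighted sum of values
    # Piece the patches back together into T frames of the original spatial size.
    out = out.reshape(T, L, C * r1 * r2).permute(0, 2, 1)      # (T, C*r1*r2, L)
    return F.fold(out, output_size=(h, w), kernel_size=(r1, r2), stride=(r1, r2))

# In the full model, the outputs of all heads (one per patch scale) are concatenated
# along the channel dimension and passed through a 2D residual block, as described above.
```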
The power of the proposed transformer can be fully exploited by stacking
multiple layers, so that attention results for missing regions can be improved
based on updated region features in a single feed-forward process. Such a multi-
layer design promotes learning coherent spatial-temporal transformations for
filling in missing regions. As shown in Fig. 3, we highlight the attention maps
learned by STTN in the last layer in bright yellow. For the dog partially occluded
by a random mask in a target frame, spatial-temporal transformers are able to
“track” the moving dog over the video in both spatial and temporal dimensions
and fill missing regions in the dog with coherent contents.
We empirically set the weights for the different losses as: $\lambda_{hole} = 1$, $\lambda_{valid} = 1$,
$\lambda_{adv} = 0.01$. Since our model simultaneously completes all the input frames in a
single feed-forward process, it runs at 24.3 fps on a single NVIDIA V100 GPU.
More details are provided in Section D of our supplementary material.
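For reference, a hedged sketch of how the weighted objective could be assembled from the reported weights; the individual terms stand for the hole, valid-region, and adversarial losses and are assumed to be computed elsewhere.

```python
def total_generator_loss(l_hole, l_valid, l_adv,
                         lambda_hole=1.0, lambda_valid=1.0, lambda_adv=0.01):
    """Combine the hole, valid-region, and adversarial terms with the weights
    reported above; the individual loss values are assumed to be computed elsewhere."""
    return lambda_hole * l_hole + lambda_valid * l_valid + lambda_adv * l_adv

# Example: total_generator_loss(0.3, 0.1, 0.8) == 0.3 + 0.1 + 0.008 == 0.408
```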
4 Experiments
4.1 Dataset
To evaluate the proposed model and make fair comparisons with SOTA ap-
proaches, we adopt the two most commonly-used datasets in video inpainting,
including Youtube-VOS [32] and DAVIS [3]. In particular, YouTube-VOS con-
tains 4,453 videos with various scenes, including bedrooms, streets, and so on.
The average video length in Youtube-VOS is about 150 frames. We follow the
Qualitative Evaluation: For each video from test sets, we take all frames
for testing. To compare visual results from different models, we follow the setting
used by most video inpainting works and randomly sample three frames from
the video for case study [18,25,29]. We select the three most competitive models,
DFVI [33], LGTSM [6] and CAP [18] for comparing results for stationary masks
in Fig. 4. We also show a case for filling in moving masks in Fig. 5. To conduct
pair-wise comparisons and analysis in Fig. 5, we select the most competitive
model, CAP [18], according to the quantitative comparison results. We can find
from the visual results that our model is able to generate perceptually pleasing
and coherent contents. More video cases are available online§.
In addition to visual comparisons, we visualize the attention maps learned
by STTN in Fig. 6. Specifically, we highlight the top three relevant regions
captured by the last transformer in STTN in bright yellow. The relevant regions
§ video demo: https://github.com/researchmm/STTN
Fig. 4. Visual results for stationary masks. The first column shows input frames
from DAVIS [3] (top-3) and YouTube-VOS [32] (bottom-3), followed by results from
DFVI [33], LGTSM [6], CAP [18], and our model. Compared with the SOTAs, our
model generates more coherent structures and details of the legs and boats.
are selected according to the attention weights calculated by Eq. (4). We can find
in Fig. 6 that STTN is able to precisely attend to the objects for filling partially
occluded objects in the first and the third cases. For filling the backgrounds in
the second and the fourth cases, STTN can correctly attend to the backgrounds.
User Study: We conduct a user study for a more comprehensive comparison.
We choose LGTSM [6] and CAP [18] as two strong baselines, since we have
observed their significantly better performance than other baselines from both
quantitative and qualitative results. We randomly sample 10 videos (5 from
DAVIS and 5 from YouTube-VOS) for stationary mask filling, and 10 videos
from DAVIS for moving mask filling. In practice, 28 volunteers are invited to
the user study. In each trial, inpainting results from different models are shown
to the volunteers, and the volunteers are required to rank the inpainting results.
To ensure a reliable subjective evaluation, videos can be replayed multiple times
by volunteers. Each participant is required to finish 20 groups of trials without
a time limit. Most participants can finish the task within 30 minutes. The results
of the user study are summarized in Fig. 7. We can find that our model performs
better in most cases for these two types of masks.
Fig. 5. Visual comparisons for filling moving masks. Compared with CAP [18],
one of the most competitive models for filling moving masks, our model is able to
generate visually pleasing results even under complex scenes (e.g., clear faces for the
first and the third frames, and better results than CAP for the second frame).
(Fig. 6: input frame, output frame, and attention map panels for sampled frames; frame-index labels omitted.)
(Fig. 7 chart: percentage of rank 1/2/3 votes for Ours, CAP, and LGTSM.)
Fig. 7. User study. “Rank x” means the percentage of results from each model being
chosen as the x-th best. Our model is ranked in first place in most cases.
Table 2. Ablation study by using different patch scales in attention layers. Ours combines the above four scales. ⋆ Higher is better. † Lower is better.
Table 3. Ablation study by using different stacking numbers of the proposed spatial-temporal transformers. ⋆ Higher is better. † Lower is better.
Fig. 8. A failure case. The bottom row shows our results with enlarged patches in the
bottom right corner. For reconstructing the dancing woman occluded by a large mask,
STTN fails to generate continuous motions and it generates blurs inside the mask.
5 Conclusions
In this paper, we propose a novel joint spatial-temporal transformation learn-
ing for video inpainting. Extensive experiments have shown the effectiveness
of multi-scale patch-based video frame representation in deep video inpainting
models. Coupled with a spatial-temporal adversarial loss, our model can be op-
timized to simultaneously complete all the input frames in an efficient way. The
results on YouTube-VOS [32] and DAVIS [3] with challenging free-form masks
show the state-of-the-art performance of our model.
We note that STTN may generate blurs in large missing masks if continuous
quick motions occur. As shown in Fig. 8, STTN fails to generate continuous
dancing motions and it generates blurs when reconstructing the dancing woman
in the first frame. We infer that STTN only calculates attention among spatial
patches, and the short-term temporal continuity of complex motions is hard to
capture without 3D representations. In the future, we plan to extend the pro-
posed transformer by using attention on 3D spatial-temporal patches to improve
the short-term coherence. We also plan to investigate other types of temporal
losses [17,30] for joint optimization in the future.
Acknowledgments
This project was supported by the NSF of China under Grants 61672548 and U1611461.
References
1. Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.B.: Patchmatch: A ran-
domized correspondence algorithm for structural image editing. TOG 28(3), 24:1–
24:11 (2009)
2. Bertalmio, M., Bertozzi, A.L., Sapiro, G.: Navier-stokes, fluid dynamics, and image
and video inpainting. In: CVPR. pp. 355–362 (2001)
3. Caelles, S., Montes, A., Maninis, K.K., Chen, Y., Van Gool, L., Perazzi, F., Pont-
Tuset, J.: The 2018 davis challenge on video object segmentation. arXiv (2018)
4. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the
kinetics dataset. In: CVPR. pp. 6299–6308 (2017)
5. Chang, Y.L., Liu, Z.Y., Lee, K.Y., Hsu, W.: Free-form video inpainting with 3d
gated convolution and temporal patchgan. In: ICCV. pp. 9066–9075 (2019)
6. Chang, Y.L., Liu, Z.Y., Lee, K.Y., Hsu, W.: Learnable gated temporal shift module
for deep video inpainting. In: BMVC (2019)
7. Criminisi, A., Pérez, P., Toyama, K.: Region filling and object removal by exemplar-
based image inpainting. TIP 13(9), 1200–1212 (2004)
8. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional
neural networks. In: CVPR. pp. 2414–2423 (2016)
9. Girdhar, R., Carreira, J., Doersch, C., Zisserman, A.: Video action transformer
network. In: CVPR. pp. 244–253 (2019)
10. Granados, M., Tompkin, J., Kim, K., Grau, O., Kautz, J., Theobalt, C.: How not to
be seen: object removal from videos of crowded scenes. Computer Graphics Forum
31(21), 219–228 (2012)
11. Hausman, D.M., Woodward, J.: Independence, invariance and the causal markov
condition. The British journal for the philosophy of science 50(4), 521–583 (1999)
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR. pp. 770–778 (2016)
13. Huang, J.B., Kang, S.B., Ahuja, N., Kopf, J.: Temporally coherent completion of
dynamic video. TOG 35(6), 1–11 (2016)
14. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and
super-resolution. In: ECCV. pp. 694–711 (2016)
15. Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Deep blind video decaptioning by tem-
poral aggregation and recurrence. In: CVPR. pp. 4263–4272 (2019)
16. Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Deep video inpainting. In: CVPR. pp.
5792–5801 (2019)
17. Lai, W.S., Huang, J.B., Wang, O., Shechtman, E., Yumer, E., Yang, M.H.: Learning
blind video temporal consistency. In: ECCV. pp. 170–185 (2018)
18. Lee, S., Oh, S.W., Won, D., Kim, S.J.: Copy-and-paste networks for deep video
inpainting. In: ICCV. pp. 4413–4421 (2019)
19. Lin, J., Gan, C., Han, S.: Tsm: Temporal shift module for efficient video under-
standing. In: ICCV. pp. 7083–7093 (2019)
20. Liu, G., Reda, F.A., Shih, K.J., Wang, T.C., Tao, A., Catanzaro, B.: Image inpaint-
ing for irregular holes using partial convolutions. In: ECCV. pp. 85–100 (2018)
21. Ma, S., Fu, J., Wen Chen, C., Mei, T.: Da-gan: Instance-level image translation by
deep attention generative adversarial networks. In: CVPR. pp. 5657–5666 (2018)
22. Matsushita, Y., Ofek, E., Ge, W., Tang, X., Shum, H.Y.: Full-frame video stabi-
lization with motion inpainting. TPAMI 28(7), 1150–1163 (2006)
23. Nazeri, K., Ng, E., Joseph, T., Qureshi, F., Ebrahimi, M.: Edgeconnect: Generative
image inpainting with adversarial edge learning. In: ICCVW (2019)
24. Newson, A., Almansa, A., Fradet, M., Gousseau, Y., Pérez, P.: Video inpainting
of complex scenes. SIAM Journal on Imaging Sciences 7(4), 1993–2019 (2014)
25. Oh, S.W., Lee, S., Lee, J.Y., Kim, S.J.: Onion-peel networks for deep video com-
pletion. In: ICCV. pp. 4403–4412 (2019)
26. Patwardhan, K.A., Sapiro, G., Bertalmio, M.: Video inpainting of occluding and
occluded objects. In: ICIP. pp. II–69 (2005)
27. Patwardhan, K.A., Sapiro, G., Bertalmío, M.: Video inpainting under constrained
camera motion. TIP 16(2), 545–553 (2007)
28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: NeurIPS. pp. 5998–6008 (2017)
29. Wang, C., Huang, H., Han, X., Wang, J.: Video inpainting by jointly learning
temporal structure and spatial details. In: AAAI. pp. 5232–5239 (2019)
30. Wang, T.C., Liu, M.Y., Zhu, J.Y., Liu, G., Tao, A., Kautz, J., Catanzaro, B.:
Video-to-video synthesis. In: NeurIPS. pp. 1152–1164 (2018)
31. Wexler, Y., Shechtman, E., Irani, M.: Space-time completion of video. TPAMI
29(3), 463–476 (2007)
32. Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., Huang, T.: Youtube-vos:
A large-scale video object segmentation benchmark. arXiv (2018)
33. Xu, R., Li, X., Zhou, B., Loy, C.C.: Deep flow-guided video inpainting. In: CVPR.
pp. 3723–3732 (2019)
34. Yang, F., Yang, H., Fu, J., Lu, H., Guo, B.: Learning texture transformer network
for image super-resolution. In: CVPR. pp. 5791–5800 (2020)
35. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Free-form image inpainting
with gated convolution. In: ICCV. pp. 4471–4480 (2019)
36. Zeng, Y., Fu, J., Chao, H., Guo, B.: Learning pyramid-context encoder network
for high-quality image inpainting. In: CVPR. pp. 1486–1494 (2019)
37. Zhang, H., Mai, L., Xu, N., Wang, Z., Collomosse, J., Jin, H.: An internal learning
approach to video inpainting. In: CVPR. pp. 2720–2729 (2019)
38. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable
effectiveness of deep features as a perceptual metric. In: CVPR. pp. 586–595 (2018)
Supplementary Material
To compare visual results from different inpainting models in our main paper,
we follow the setting used in most video inpainting works [13,16,33]. Specifically,
we sample several frames from video results and show them in Figure 4 and
Figure 5 in the main paper. However, sampled frames cannot fully reflect video
results. Sometimes sampled static frames look less blurry, but artifacts can be
stronger in a dynamic video. Therefore, we provide 20 video cases for a more
comprehensive comparison¶ .
In practice, we test all the videos in the test sets of DAVIS dataset [3] (90
cases) and Youtube-VOS dataset [32] (508 cases), and we randomly show 20 cases
for visual comparisons. Specifically, five cases from DAVIS and five cases from
Youtube-VOS are used to test filling stationary masks. Since Youtube-VOS has
no dense object annotations, we sample 10 videos with dense object annotations
from DAVIS to test filling moving masks following the setting used in previous
works [16,18,33]. To conduct side-by-side comparisons and analysis, we select
the two most competitive video inpainting models, LGTSM [6] and CAP [18], in
the videos. LGTSM and CAP are fine-tuned multiple times to achieve optimal
video results using the code and models publicly provided on their official GitHub
homepages‖. We can find from the video results that our model outperforms the
state-of-the-art models in most cases.
Inspired by Xu et al. [33], we use stationary masks and moving masks as testing
masks to simulate real-world applications (e.g., watermark removal and object
removal) in the main paper. As introduced in Section 4.1 in the main paper, on
one hand, we use frame-wise foreground object annotations from DAVIS datasets
[3] as moving masks to simulate applications like object removal. On the other
hand, we generate random shapes as stationary masks to simulate applications
like watermark removal. Specifically, for the task of removing watermarks, a user
often draws a mask along the outline of a watermark. Inspired by previous mask
¶ video demo: https://github.com/researchmm/STTN
‖ LGTSM: https://github.com/amjltc295/Free-Form-Video-Inpainting; CAP: https://github.com/shleecs/Copy-and-Paste-Networks-for-Deep-Video-Inpainting
D Implementation details
Hyper-parameters: To maintain the aspect ratio of videos and take into ac-
count the memory limitations of modern GPUs, we resize all video frames to
432 × 240 for both training and testing [13,16,18,33]. During training, we set the
batch size to 8, and the learning rate starts at 1e-4 and decays by a factor of
0.1 every 150k iterations. Specifically, for each iteration, we sample five frames
from a video in a consecutive or discontinuous manner with equal probability
for training, following Lee et al. [18,25].
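A sketch of the described sampling strategy, with the function name and the use of Python's random module as our own assumptions.

```python
import random

def sample_training_frames(num_frames, clip_len=5):
    """Sample `clip_len` frame indices from a `num_frames`-long video, either as a
    consecutive clip or as random distinct frames, with equal probability."""
    if random.random() < 0.5:
        start = random.randint(0, num_frames - clip_len)
        return list(range(start, start + clip_len))             # consecutive clip
    return sorted(random.sample(range(num_frames), clip_len))   # discontinuous frames

# Example: sample_training_frames(150) might return [37, 38, 39, 40, 41]
# or [3, 58, 77, 102, 149].
```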
Computation complexity: Our full model has a total of 12.6M train-
able parameters. On average, STTN costs about 3.9 GB of GPU memory to complete
a video from the DAVIS dataset [3]. The proposed multi-scale patch-based
video frame representations enable fast training and inference. Specifically,
our model runs at about 24.3 fps on an NVIDIA V100 GPU and at about
10.43 fps on an NVIDIA P100 GPU on average. Its total training time
was about three days on the YouTube-VOS dataset [32] plus one day of fine-tuning on
the DAVIS dataset [3] with 8 Tesla V100 GPUs. The computation complexity of the
proposed spatial-temporal transformers is denoted as:
$$O\left(\sum_{l=1}^{D} 2 \cdot \left(n \cdot \frac{HW}{p_w p_h}\right)^{2} \cdot (p_w p_h C_l) + n k_l^{2} HW C_{l-1} C_l\right) \approx O(n^{2}), \quad (11)$$
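As a rough numerical illustration of the attention term in Eq. (11), the snippet below uses hypothetical feature sizes (a 60 × 108 feature map, 7 × 7 patches, and 256 channels); these values are assumptions for illustration only, not the released configuration.

```python
# Back-of-the-envelope count for the attention term in Eq. (11), using hypothetical
# feature sizes: an H x W = 60 x 108 feature map, 7 x 7 patches, and C = 256 channels.
def attention_cost(n, H=60, W=108, pw=7, ph=7, C=256):
    num_patches = n * (H * W) // (pw * ph)      # N = n * HW / (pw * ph)
    return num_patches ** 2 * (pw * ph * C)     # N^2 pairwise products of patch vectors

# The cost grows quadratically with the number of input frames n:
print(attention_cost(5), attention_cost(10))    # the second is roughly 4x the first
```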
Table 6. Ablation study by utilizing distant frames with different sampling rates. Our full model sets s = 10. ⋆ Higher is better. † Lower is better.
loss can generate more coherent results than the one optimized by a perceptual
loss and a style loss. The superior results show the effectiveness of the joint
spatial-temporal adversarial learning in STTN.
Table 8. Ablation study for different losses. ⋆ Higher is better. † Lower is better.
Fig. 9. Visual comparisons between an STTN optimized by a perceptual loss [14] and
a style loss [8] and an STTN optimized by a T-PatchGAN loss [5]. These two models
perform similarly in small missing regions, while in large missing regions, the model
optimized by perceptual and style losses tends to generate artifacts in the missing
regions. [Best viewed with zoom-in]
Specifically, the perceptual loss and the style loss have shown great impact in many
image generation tasks since they were proposed [8,14,20]. A perceptual loss
computes the L1 distance between the activation maps of real frames and generated
frames. A style loss is similar to the perceptual loss but aims at minimizing the
L1 distance between the Gram matrices of the activation maps of real frames and
generated frames. In practice, the activation maps are extracted from layers (e.g.,
pool1, pool2 and pool3) of a pre-trained classification network (more details see
[18,20,25]). With the help of extracted low-level features, the perceptual loss and
the style loss are helpful in generating high-frequency details.
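For concreteness, a per-frame sketch of the two losses as they are commonly implemented with a frozen VGG-16 backbone; the chosen layers (pool1, pool2, pool3, following the text above), the torchvision API usage, and the equal layer weights are assumptions and may differ from the cited implementations.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualStyleLoss(nn.Module):
    """L1 distance on VGG activation maps (perceptual) and on their Gram matrices
    (style), computed per frame; this is why such losses ignore temporal context."""
    def __init__(self, layer_ids=(4, 9, 16)):   # pool1, pool2, pool3 of vgg16.features
        super().__init__()
        vgg = vgg16(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    @staticmethod
    def _gram(f):                                # (B, C, H, W) -> (B, C, C)
        b, c, h, w = f.shape
        f = f.reshape(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)

    def forward(self, pred, target):             # frames in (B, 3, H, W)
        fp, ft = self._features(pred), self._features(target)
        perceptual = sum(torch.abs(a - b).mean() for a, b in zip(fp, ft))
        style = sum(torch.abs(self._gram(a) - self._gram(b)).mean() for a, b in zip(fp, ft))
        return perceptual, style
```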
Unfortunately, perceptual and style losses are calculated on the features of
a single frame and they are unable to leverage temporal contexts. When filling
in a large missing region in videos, it is hard for the perceptual and style losses to
enforce the generator to synthesize rational contents due to the limited contexts.
As a result, the generator has to produce meaningless high-frequency textures to match
ground truths’ low-level features. For example, for filling the large missing regions
in the second and the third frames in Fig. 9, the STTN optimized by perceptual
and style losses tends to generate high-frequency artifacts in the large missing
regions. Similar artifacts can be found in the failure cases of previous works [5,20].
Since the T-PatchGAN is able to leverage temporal contexts to optimize the
generator, there are fewer artifacts in the results by using the T-PatchGAN. For
the above considerations, we use the T-PatchGAN loss instead of the perceptual
and style losses in our final optimization objectives. In the future, we plan to
design video-based perceptual and style losses which are computed on spatial-
temporal features to leverage temporal contexts for optimization.