Efficient Video Deblurring Guided by Motion Magnitude
1 The University of Tokyo
{wang,yamashita}@robot.t.u-tokyo.ac.jp, [email protected], [email protected]
2 AI Thrust, Information Hub, HKUST Guangzhou
3 Department of Computer Science and Engineering, HKUST
{yunfanlu,linwang}@ust.hk
4 Tokyo Research Center, Huawei
[email protected]
1 Introduction
Video deblurring is a classical yet challenging problem that aims to restore consecutive frames from spatially and temporally varying blur. The problem is highly ill-posed because of camera shake, object motion, and depth variations during the exposure interval. Recent methods using deep convolutional neural networks (CNNs) have shown significant improvements in video deblurring.
Fig. 1. The blurry frame and the estimated MMP with results compared to the SoTA
method [19].
2 Related Works
2.1 Prior for Deblurring
As mentioned above, image segmentation, inter-frame optical flow, and intra-frame optical flow have been frequently used as priors in deblurring tasks.
region segmentation, deblurring, and blur magnification. Images are usually separated into blurred and sharp regions; however, it is difficult to partition an image in such a binary way for dynamic scenes. Instead of classifying pixels into blurry and sharp regions, we represent blur maps in a continuous way by estimating the motion magnitude of each pixel.
Fig. 2. Motion magnitude prior. High-frequency sharp frames are used to generate synthetic blurry frames. Meanwhile, we estimate the bi-directional optical flows for each frame and calculate their average magnitude. Then, we take the average of the motion magnitude maps of all the latent frames. Finally, we use a modified UNet-like structure to learn the motion magnitude prior by regression.
\[
B = c\left(\frac{1}{N}\sum_{i=1}^{N} S_i\right), \tag{1}
\]
where $c(\cdot)$ refers to the camera response function (CRF) [32]. During the exposure time, we sample $N$ sharp images to generate a blurry image. To measure the movement of pixels, we calculate the optical flow between the sharp frames. We use bi-directional flows to represent the pixel movement of one sharp frame. For instance, as shown in Fig. 2, for frame 2 we calculate the optical flows $FL_{2,1}$ and $FL_{2,3}$ to represent its pixel movement. For each pixel $(m, n)$, the optical flow between frame $i$ and frame $j$ in the $x$ and $y$ directions is represented as $u_{i,j}(m, n)$ and $v_{i,j}(m, n)$, respectively. We denote the motion magnitude $M$ for frame $i$ as follows.
\[
M_i(m, n) = \frac{\sqrt{u_{i,i-1}^2(m, n) + v_{i,i-1}^2(m, n)} + \sqrt{u_{i,i+1}^2(m, n) + v_{i,i+1}^2(m, n)}}{2}. \tag{2}
\]
For frame 1 and frame $N$, we only use $FL_{1,2}$ and $FL_{N,N-1}$, respectively. Then, to acquire the pixel-wise motion magnitude for the synthetic blurry frame, we calculate the average movement during the exposure time:
\[
\bar{M} = \frac{1}{KN}\sum_{i=1}^{N} M_i, \tag{3}
\]
where $\bar{M}$ refers to the motion magnitude (blur level) map of the blurry frame. $K$ normalizes the value at each position of the MMP to the range 0∼1; it is determined by the maximum value of the MMP before normalization over the training dataset, which is 15.
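To make Eqs. (1)-(3) concrete, the following is a minimal NumPy sketch of this dataset-generation step, under stated assumptions: the CRF $c(\cdot)$ is taken as the identity, the arrays flows_fwd/flows_bwd are hypothetical names for precomputed per-frame flows (e.g., from RAFT) to the next/previous frame, and $K = 15$ as above.

```python
import numpy as np

def synthesize_blur_and_mmp(sharp_frames, flows_fwd, flows_bwd, K=15.0):
    """Sketch of Eqs. (1)-(3): average sharp frames into a blurry frame
    and average per-frame motion magnitudes into an MMP.

    sharp_frames: (N, H, W, 3) float array of latent sharp frames.
    flows_fwd[i]: (H, W, 2) flow from frame i to frame i+1 (None for i = N-1).
    flows_bwd[i]: (H, W, 2) flow from frame i to frame i-1 (None for i = 0).
    """
    N = sharp_frames.shape[0]

    # Eq. (1), with the CRF c(.) assumed to be the identity here.
    blurry = sharp_frames.mean(axis=0)

    # Eq. (2): per-frame motion magnitude from bi-directional flows.
    mags = []
    for i in range(N):
        parts = []
        for flow in (flows_bwd[i], flows_fwd[i]):
            if flow is not None:
                parts.append(np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2))
        # Boundary frames only have one flow; interior frames average two.
        mags.append(sum(parts) / len(parts))

    # Eq. (3): average over latent frames, normalized to [0, 1] by K.
    mmp = np.clip(np.stack(mags).mean(axis=0) / K, 0.0, 1.0)
    return blurry, mmp
```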
The Learning of Motion Magnitude Prior In this paper, we use the raw GOPRO dataset [17] to generate blurry image and MMP pairs. The GOPRO benchmark dataset uses 7∼13 successive sharp frames from the raw dataset to generate one blurry frame; we likewise use 7∼13 frames to generate the blurry image and MMP pairs. We use the SoTA optical flow method RAFT [34] to estimate the optical flow. We propose a compact network to learn the blur level map by regression. We apply a modified UNet [23], as shown in Fig. 2. At the beginning, we use a 9 × 9 convolution layer to enlarge the receptive field. The features are downsampled and reconstructed with a UNet-like structure. Then, a residual dense block (RDB) is used to refine the result. To process a 720P (1280 × 720) frame, the computational cost is only 38.81 GMACs with a model size of 0.85M parameters. Finally, we can estimate the MMP for each frame using this compact network. In this paper, we apply the MMP as guidance to the video deblurring network, which we describe in the following section.
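As a rough illustration of this architecture, below is a PyTorch sketch of a UNet-like MMP regressor with a 9 × 9 head convolution and an RDB refinement stage [41]. The channel widths, depth, and activation choices are illustrative assumptions rather than the exact released configuration, and the input is assumed to have spatial dimensions divisible by 4.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Minimal residual dense block: densely connected 3x3 convs + 1x1 fusion."""
    def __init__(self, ch, growth=16, layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(ch + i * growth, growth, 3, padding=1) for i in range(layers)])
        self.fuse = nn.Conv2d(ch + layers * growth, ch, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))

class MMPNetSketch(nn.Module):
    """Illustrative UNet-like MMP regressor (channel widths are assumptions)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(3, 16, 9, padding=4)       # 9x9 conv: large receptive field
        self.down1 = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)
        self.refine = RDB(16)                            # RDB refinement stage
        self.out = nn.Conv2d(16, 1, 3, padding=1)        # single-channel MMP

    def forward(self, x):
        e0 = torch.relu(self.head(x))
        e1 = torch.relu(self.down1(e0))
        e2 = torch.relu(self.down2(e1))
        d1 = torch.relu(self.up2(e2)) + e1               # UNet-style skip connections
        d0 = torch.relu(self.up1(d1)) + e0
        return torch.sigmoid(self.out(self.refine(d0)))  # MMP in [0, 1]
```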
Fig. 3. The structure of MMP-RNN. Both the center frame $I_t$ and the corresponding blur prior $B_t$ estimated by MMP-Net are passed into the RNN cell. We transmit both the non-deblurred features $l$ and the deblurred features $h$ from the previous frame to the next frame. Deblurred features $f$ from the current frame and the adjacent frames are fused globally in the decoder to generate the final output image.
The non-deblurred features and deblurred features are then concatenated and passed through RDB-Net-A with $n_a$ RDBs. Finally, the deblurred features $h_t$ are passed to the next FEM and $f_t$ is used for image reconstruction.
The RM is designed to globally aggregate high-level features for image reconstruction. We concatenate the features $\{f_{t-2}, \ldots, f_{t+2}\}$ and squeeze them along the channel dimension using a 1 × 1 convolution. We use $n_b$ RDBs in RDB-Net-B to implicitly align the features from different frames and then apply a transposed convolution to reconstruct the image. A global skip connection is added to directly pass $I_t$ to the output after a 9 × 9 convolution, which better preserves the information of the center frame.
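A minimal PyTorch sketch of such a reconstruction module is given below, reusing the RDB class from the MMP-Net sketch above. It assumes five per-frame feature maps with $n_c$ channels at half the image resolution and a single 2x transposed convolution; these choices are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn
# RDB is the residual dense block defined in the MMP-Net sketch above.

class ReconstructionModuleSketch(nn.Module):
    """Illustrative RM: concatenate features of 5 frames, 1x1 channel squeeze,
    n_b RDBs, transposed-conv upsampling, plus a global skip from the center frame."""
    def __init__(self, nc=16, nb=8):
        super().__init__()
        self.squeeze = nn.Conv2d(5 * nc, nc, 1)            # squeeze along channels
        self.rdbs = nn.Sequential(*[RDB(nc) for _ in range(nb)])
        self.up = nn.ConvTranspose2d(nc, nc, 4, stride=2, padding=1)
        self.to_rgb = nn.Conv2d(nc, 3, 3, padding=1)
        self.skip = nn.Conv2d(3, 3, 9, padding=4)          # 9x9 conv on center frame

    def forward(self, feats, center_frame):
        # feats: list of five (B, nc, H/2, W/2) tensors {f_{t-2}, ..., f_{t+2}}.
        x = self.squeeze(torch.cat(feats, dim=1))
        x = self.up(self.rdbs(x))
        return self.to_rgb(x) + self.skip(center_frame)    # global skip connection
```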
Motion Magnitude Attentive Module Both blurry and sharp pixels are important to our task: blurry pixels are the regions to concentrate on for deblurring, while sharp pixels can be utilized for deblurring adjacent frames. For sharp pixels in particular, the MMP value is close to 0, so directly multiplying the MMP with the image features may lose the information carried by sharp features. To better utilize the MMP, we use the MMAM to integrate it with the image features. The structure of the MMAM layer is shown in Fig. 3. The MMP is passed through two 1 × 1 convolution layers that refine it and adjust its dimension, transforming it into a tensor $\gamma$. The operation that integrates $\gamma$ with the feature $x$ extracted from $B$ can be represented as $x_{out} = \gamma \otimes x_{in}$, where $\otimes$ denotes element-wise multiplication. This operation better integrates the MMP information with the blurry image.
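A possible PyTorch realization of this layer is sketched below. The hidden widths ($n_c/2$ then $n_c$) and the LeakyReLU activation are consistent with Fig. 3; the slope value and exact wiring are assumptions.

```python
import torch
import torch.nn as nn

class MMAMSketch(nn.Module):
    """Illustrative MMAM layer: two 1x1 convs turn the single-channel MMP into
    a modulation tensor gamma, which scales the image features element-wise."""
    def __init__(self, nc=16):
        super().__init__()
        self.to_gamma = nn.Sequential(
            nn.Conv2d(1, nc // 2, 1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(nc // 2, nc, 1),
        )

    def forward(self, feat, mmp):
        gamma = self.to_gamma(mmp)   # (B, nc, H, W), matched to the feature channels
        return gamma * feat          # element-wise multiplication: x_out = gamma (x) x_in
```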
where $\epsilon$ is a small value set to 0.001, $m$ and $n$ are the width and height of the image, $I$ refers to the ground truth, and $O$ is the estimated image.
\[
L_{grad} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(G_{i,j} - \hat{G}_{i,j}\right)^2, \tag{7}
\]
where $G$ and $\hat{G}$ refer to the image gradients of $I$ and $O$, respectively. The total loss is then written as
\[
L = L_{char} + \lambda_1 L_{grad} + \lambda_2 L_{MM}, \tag{8}
\]
where the weights $\lambda_1$ and $\lambda_2$ are set to 0.5 and 1.0 in our experiments.
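For illustration, the PyTorch sketch below implements these terms: a standard Charbonnier penalty [3], one reasonable reading of Eq. (7) with horizontal and vertical differences penalized separately, and the combination of Eq. (8). The motion magnitude loss $L_{MM}$ is assumed to be computed elsewhere and passed in as a scalar.

```python
import torch
import torch.nn.functional as F

def charbonnier_loss(output, target, eps=1e-3):
    """Charbonnier (smoothed L1) penalty averaged over all pixels."""
    return torch.sqrt((output - target) ** 2 + eps ** 2).mean()

def gradient_loss(output, target):
    """One reading of Eq. (7): MSE between image gradients of estimate and GT."""
    def grad(x):
        dx = x[..., :, 1:] - x[..., :, :-1]   # horizontal differences
        dy = x[..., 1:, :] - x[..., :-1, :]   # vertical differences
        return dx, dy
    (ox, oy), (tx, ty) = grad(output), grad(target)
    return F.mse_loss(ox, tx) + F.mse_loss(oy, ty)

def total_loss(output, target, l_mm, lambda1=0.5, lambda2=1.0):
    """Eq. (8): L = L_char + lambda1 * L_grad + lambda2 * L_MM."""
    return (charbonnier_loss(output, target)
            + lambda1 * gradient_loss(output, target)
            + lambda2 * l_mm)
```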
4 Experiment
4.1 MMP-Net
Dataset Generation To learn the motion magnitude prior, we utilize the GOPRO dataset [17]. The raw GOPRO dataset consists of 33 high-frequency video sequences with 34,874 images in total.
Fig. 4. Examples of motion magnitude estimation on the GOPRO test dataset. Each column refers to the input blurry frames, the estimated results, and the GT, respectively.
In the GOPRO benchmark dataset, the sharp images are used to generate a synthetic dataset with 22 sequences for training and 11 sequences for evaluation, containing 2,103 training samples and 1,111 evaluation samples. We use the same data split as the GOPRO benchmark dataset to build the MMP dataset, to avoid information leakage during video deblurring. We use 7∼11 consecutive sharp frames to generate one blurry frame and the corresponding MMP. We generate 22,499 training samples (22 sequences) and use the original GOPRO test dataset. We trim the training dataset by 50% during training.
Implementation details We train the model for 400 epochs with a mini-batch size of 8 using the ADAM optimizer [14] with an initial learning rate of 0.0003. The learning rate decays by half after 200 epochs. The patch size is set to 512 × 512 for training and validation. The loss $L_1 = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} |M_{i,j} - \hat{M}_{i,j}|$ is used for MMP training and as a metric for testing, where $M_{i,j}$ refers to the value of the estimated MMP at position $(i, j)$ and $\hat{M}$ refers to the GT.
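A one-line PyTorch sketch of this loss/metric (tensor shapes and names are illustrative):

```python
import torch

def mmp_l1(pred_mmp: torch.Tensor, gt_mmp: torch.Tensor) -> torch.Tensor:
    """Mean absolute error over all pixels; used both as the MMP training
    loss and as the evaluation metric."""
    return (pred_mmp - gt_mmp).abs().mean()
```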
Results The training results are listed in Table 1. We visualize our results on the GOPRO test dataset in Fig. 4. To demonstrate the generality of the proposed method, we also apply the GOPRO-trained model to the HIDE dataset [25], which contains blurry images of humans. As shown in Fig. 5, the proposed method can also detect the blur caused by human motion on other datasets.
Datasets We test the proposed video deblurring method on two public datasets, the GOPRO benchmark dataset [17] and the beam-splitter dataset (BSD) [43].
Fig. 5. Examples of test results on the HIDE dataset [25]. The model was trained on the GOPRO dataset and tested on the HIDE dataset. Each column refers to the input blurry images, the estimated results, and the estimated results overlaid on the input images. This indicates that the proposed method can successfully detect salient blurry regions on other datasets.
BSD is a dataset of real-world images captured by controlling the exposure time and exposure intensity during video shooting with a beam-splitter system [11]; it can better evaluate deblurring performance in real scenarios. We use the 2ms-16ms BSD, in which the exposure times for the sharp and blurry frames are 2 ms and 16 ms, respectively. The training and validation sets have 60 and 20 video sequences of 100 frames each, and the test set has 20 video sequences of 150 frames. The frame size is 640 × 480.
Implementation details We train the model using the ADAM optimizer with a learning rate of 0.0005 and adopt a cosine annealing schedule [16] to adjust the learning rate during training. We sample 10-frame 256 × 256 RGB patch sequences from the dataset to construct mini-batches of size 8, with random vertical and horizontal flips as well as 90° rotations for data augmentation. We train for 1,000 epochs on GOPRO and 500 epochs on BSD. It is worth mentioning that we retrain the other methods using their publicly available code and keep the same hyper-parameters where possible. For CDVD-TSP, we use the available test images for GOPRO and keep the training strategy used in [19] for BSD.
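A minimal sketch of this optimization setup in PyTorch is shown below; model, train_loader, and loss_fn are placeholders for the deblurring network, the augmented patch-sequence loader, and the loss of Eq. (8), and are not defined here.

```python
import torch

def train_deblur(model, train_loader, loss_fn, epochs=1000, lr=5e-4):
    """Sketch of the setup described above: ADAM + cosine annealing,
    mini-batches of 10-frame 256x256 patch sequences from train_loader."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for frames, targets in train_loader:      # augmented patch sequences
            optimizer.zero_grad()
            loss = loss_fn(model(frames), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()                          # anneal the learning rate per epoch
```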
Benchmark Results The results of our method and SoTA lightweight image [33] and video deblurring methods on GOPRO are shown in Table 2. We use A#B#C#F# to denote $n_a$, $n_b$, $n_c$, and the length of the input image sequence in our model. We evaluate image quality in terms of PSNR [10] and SSIM. We also measure the computational cost of each model when processing one 720P frame in GMACs, and list the running time (s) for one 720P frame. For IFIRNN, c2 refers to the dual-cell model and h3 refers to three iterations of the hidden state. Our A7B8C16F8 model outperforms the other methods with only 204.19 GMACs. A visual comparison of the results is shown in Fig. 6.
The results of the proposed method and other SoTA methods on BSD are shown in Table 3. Here, we use the A8B9C18F8 model. Visualization results of video deblurring on BSD are shown in Fig. 7. Our method outperforms SoTA methods with lower computational cost.
Fig. 6. Visualization results on GOPRO. (a) Blurry frame. (b) The estimated MMP. (c) Overlay of (b) on (a). (d) Result from MMP-RNN. From (e) to (l) are the blurry input, DBN, IFIRNN, ESTRNN, the proposed method w/o MMP, MMP-RNN, and the sharp frame, respectively.
The MMP from MMP-Net increases the PSNR by 0.31 dB. On the other hand, using the flow magnitude of only the center frame decreases the PSNR by 0.07 dB.
Fig. 7. Visualization results on BSD. (a) Blurry frame. (b) The estimated MMP. (c) Result from MMP-RNN. From (d) to (k) are the blurry input, DBN, IFIRNN, ESTRNN, the proposed method w/o MMP, MMP-RNN, and the sharp frame, respectively.
Since the contour of the center-frame magnitude corresponds only to the center frame, the attentive field cannot cover the whole blurry object. Normalizing each MMP by its own maximum value may also lose temporal information; in particular, for nearly sharp images the whole MMP may be close to 1, which can hurt performance. We also compare the proposed method to spatial self-attention [36]. Our method uses a supervised approach to tell the network where to concentrate, and the results outperform vanilla spatial self-attention.
5 Conclusion
In this paper, we proposed a motion magnitude prior for deblurring tasks. We built a dataset of blurry image and motion magnitude prior pairs and used a compact network to learn the prior by regression. We applied the prior to the video deblurring task by combining it with an efficient RNN, fusing it through a motion magnitude attentive module and a motion magnitude loss. We also transmit features before deblurring together with features after deblurring between RNN cells to improve efficiency. We evaluated the proposed method on GOPRO and BSD, where it achieves better video deblurring performance than SoTA methods in terms of both image quality and computational cost.
Acknowledgements This work was supported by JSPS KAKENHI Grant Numbers 22H00529 and 20H05951.
References
1. Argaw, D.M., Kim, J., Rameau, F., Cho, J.W., Kweon, I.S.: Optical flow estimation from a single motion-blurred image. In: AAAI (2021)
2. Bar, L., Berkels, B., Rumpf, M., Sapiro, G.: A variational framework for simultaneous motion estimation and restoration of motion-blurred video. In: ICCV. pp. 1–8 (2007)
3. Charbonnier, P., Blanc-Feraud, L., Aubert, G., Barlaud, M.: Two deterministic half-quadratic regularization algorithms for computed imaging. In: ICIP. pp. 168–172 (1994)
4. Cho, S., Matsushita, Y., Lee, S.: Removing non-uniform motion blur from images. In: ICCV. pp. 1–8 (2007)
5. Couzinié-Devy, F., Sun, J., Alahari, K., Ponce, J.: Learning to estimate and remove non-uniform image blur. In: CVPR. pp. 1075–1082 (2013)
6. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV. pp. 764–773 (2017)
7. Gao, H., Tao, X., Shen, X., Jia, J.: Dynamic scene deblurring with parameter selective sharing and nested skip connections. In: CVPR. pp. 3848–3856 (2019)
8. Gong, D., Yang, J., Liu, L., Zhang, Y., Reid, I., Shen, C., Hengel, A.V.D., Shi, Q.: From motion blur to motion flow: A deep learning solution for removing heterogeneous motion blur. In: CVPR. pp. 3806–3815 (2017)
9. Hirsch, M., Schuler, C.J., Harmeling, S., Schölkopf, B.: Fast removal of non-uniform camera shake. In: ICCV. pp. 463–470 (2011)
10. Horé, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: ICPR. pp. 2366–2369 (2010)
11. Jiang, H., Zheng, Y.: Learning to see moving objects in the dark. In: ICCV. pp. 7324–7333 (2019)
12. Kim, T.H., Lee, K.M.: Generalized video deblurring for dynamic scenes. In: CVPR. pp. 5426–5434 (2015)
13. Kim, T.H., Lee, K.M., Scholkopf, B., Hirsch, M.: Online video deblurring via dynamic temporal blending network. In: ICCV. pp. 4038–4047 (2017)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
15. Li, D., Xu, C., Zhang, K., Yu, X., Zhong, Y., Ren, W., Suominen, H., Li, H.: ARVo: Learning all-range volumetric correspondence for video deblurring. In: CVPR. pp. 7721–7731 (2021)
16. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: ICLR (2017)
17. Nah, S., Kim, T.H., Lee, K.M.: Deep multi-scale convolutional neural network for dynamic scene deblurring. In: CVPR. pp. 3883–3891 (2017)
18. Nah, S., Son, S., Lee, K.M.: Recurrent neural networks with intra-frame iterations for video deblurring. In: CVPR. pp. 8102–8111 (2019)
19. Pan, J., Bai, H., Tang, J.: Cascaded deep video deblurring using temporal sharpness prior. In: CVPR. pp. 3043–3051 (2020)
20. Pan, J., Sun, D., Pfister, H., Yang, M.: Blind image deblurring using dark channel prior. In: CVPR. pp. 1628–1636 (2016)
21. Portz, T., Zhang, L., Jiang, H.: Optical flow in the presence of spatially-varying motion blur. In: CVPR. pp. 1752–1759 (2012)
22. Ren, W., Pan, J., Cao, X., Yang, M.H.: Video deblurring via semantic segmentation and pixel-wise non-linear kernel. In: ICCV. pp. 1077–1085 (2017)
23. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241 (2015)
24. Shan, Q., Jia, J., Agarwala, A.: High-quality motion deblurring from a single image. ACM Trans. Graph. 27(3), 1–10 (2008)
25. Shen, Z., Wang, W., Shen, J., Ling, H., Xu, T., Shao, L.: Human-aware motion deblurring. In: ICCV (2019)
26. Shi, J., Xu, L., Jia, J.: Discriminative blur detection features. In: CVPR. pp. 2965–2972 (2014)
27. Son, H., Lee, J., Lee, J., Cho, S., Lee, S.: Recurrent video deblurring with blur-invariant motion estimation and pixel volumes. ACM Trans. Graph. (2021)
28. Su, S., Delbracio, M., Wang, J., Sapiro, G., Heidrich, W., Wang, O.: Deep video deblurring for hand-held cameras. In: CVPR. pp. 237–246 (2017)
29. Suin, M., Purohit, K., Rajagopalan, A.N.: Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In: CVPR. pp. 3606–3615 (2020)
30. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: CVPR (2018)
31. Sun, J., Cao, W., Xu, Z., Ponce, J.: Learning a convolutional neural network for non-uniform motion blur removal. In: CVPR. pp. 769–777 (2015)
32. Tai, Y.W., Chen, X., Kim, S., Kim, S.J., Li, F., Yang, J., Yu, J., Matsushita, Y., Brown, M.S.: Nonlinear camera response functions and image deblurring: Theoretical analysis and practice. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(10), 2498–2512 (2013)
33. Tao, X., Gao, H., Shen, X., Wang, J., Jia, J.: Scale-recurrent network for deep image deblurring. In: CVPR. pp. 8174–8182 (2018)
34. Teed, Z., Deng, J.: RAFT: Recurrent all-pairs field transforms for optical flow. In: ECCV. pp. 402–419 (2020)
35. Wang, X., Chan, K.C., Yu, K., Dong, C., Loy, C.C.: EDVR: Video restoration with enhanced deformable convolutional networks. In: CVPRW (2019)
36. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: Convolutional block attention module. In: ECCV (2018)
37. Wulff, J., Black, M.J.: Modeling blurred video with layers. In: ECCV. pp. 236–252 (2014)
38. Xu, L., Zheng, S., Jia, J.: Unnatural L0 sparse representation for natural image deblurring. In: CVPR. pp. 1107–1114 (2013)
39. Yan, Y., Ren, W., Guo, Y., Wang, R., Cao, X.: Image deblurring via extreme channels prior. In: CVPR. pp. 6978–6986 (2017)
40. Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Multi-stage progressive image restoration. In: CVPR (2021)
41. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: CVPR. pp. 2472–2481 (2018)
42. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020)
43. Zhong, Z., Gao, Y., Zheng, Y., Zheng, B.: Efficient spatio-temporal recurrent neural network for video deblurring. In: ECCV. pp. 191–207 (2020)
44. Zhou, S., Zhang, J., Pan, J., Xie, H., Zuo, W., Ren, J.: Spatio-temporal filter adaptive network for video deblurring. In: ICCV. pp. 2482–2491 (2019)