Efficient Video Deblurring Guided by Motion Magnitude


Yusheng Wang1, Yunfan Lu2, Ye Gao4, Lin Wang2,3, Zhihang Zhong1,
Yinqiang Zheng1, and Atsushi Yamashita1

arXiv:2207.13374v1 [cs.CV] 27 Jul 2022

1 The University of Tokyo
  {wang,yamashita}@robot.t.u-tokyo.ac.jp, [email protected], [email protected]
2 AI Thrust, Information Hub, HKUST Guangzhou
3 Department of Computer Science and Engineering, HKUST
  {yunfanlu,linwang}@ust.hk
4 Tokyo Research Center, Huawei
  [email protected]

Abstract. Video deblurring is a highly under-constrained problem due


to the spatially and temporally varying blur. An intuitive approach for
video deblurring includes two steps: a) detecting the blurry region in the
current frame; b) utilizing the information from clear regions in adjacent
frames for current frame deblurring. To realize this process, our idea is to
detect the pixel-wise blur level of each frame and combine it with video
deblurring. To this end, we propose a novel framework that utilizes the
motion magnitude prior (MMP) as guidance for efficient deep video de-
blurring. Specifically, as the pixel movement along its trajectory during
the exposure time is positively correlated to the level of motion blur, we
first use the average magnitude of optical flow from the high-frequency
sharp frames to generate the synthetic blurry frames and their corre-
sponding pixel-wise motion magnitude maps. We then build a dataset of
blurry frame and MMP pairs, and the MMP is learned by a compact CNN
via regression. The MMP encodes both spatial and temporal blur-level
information, which can be further integrated into an efficient recurrent
neural network (RNN) for video deblurring. We conduct extensive
experiments to validate the effectiveness of the proposed methods on
public datasets. Our code is available at
https://github.com/sollynoay/MMP-RNN.

Keywords: Blur estimation, motion magnitude, video deblurring

1 Introduction
Video deblurring is a classical yet challenging problem that aims to restore con-
secutive frames from the spatially and temporally varying blur. The problem
is highly ill-posed because of the presence of camera shakes, object motions,
and depth variations during the exposure interval. Recent methods using deep
convolutional networks (CNNs) have shown a significant improvement in video
deblurring performance. Among them, alignment-based methods usually align


the adjacent frames explicitly to the current frame and reconstruct a sharp
frame simultaneously or by multiple stages [28,35,44,19,15]. Such methods show
stable performance for video deblurring; however, the alignment or warping
processes require accurate flow estimation, which is difficult for blurry frames,
and usually the computational cost is relatively high. Recurrent neural network
(RNN)-based methods achieve deblurring by passing information between adja-
cent frames [13,18,43]; such methods usually have lower computational costs but
also lower restoration quality than the alignment-based methods.
So far, many deblurring methods have utilized priors to improve the restora-
tion quality. The priors usually come from optical flows or image segmentation.
Inter-frame optical flow-based methods [19,15] directly borrow pixel information
from the adjacent frames to improve the quality of the current frame. By con-
trast, intra-frame optical flow estimates the pixel movement during the exposure
time of a blurry frame. Intra-frame optical flow-based methods [8,1] estimate the
flow during the exposure time, and usually restore the sharp frame by energy
minimization optimization. Image segmentation-based methods [4,2,37,22,25]
utilize motion, semantic or background and foreground information to estimate
the prior for deblurring.
However, previous priors for video deblurring suffer from one or more of
the following problems. First, image segmentation and inter-frame optical flow
estimation on blurry images are error-prone, so it is often necessary to estimate
the prior and perform video deblurring jointly via complex energy
functions [21,12,22]. Second, inter-frame optical flow estimation requires heavy
computation. For instance, the state-of-the-art (SoTA) methods PWC-Net [30]
and the RAFT small model (20 iterations) [34] require 181.68 and 727.99 GMACs,
respectively, on a 720P (1280 × 720) frame, whereas a typical efficient
video deblurring method is expected to cost 50∼300 GMACs. Third, motion
blur is directly correlated to the intra-frame optical flow. Although it is possible
to directly estimate the intra-frame optical flow from a single image [8,1], the
restored results suffer from artifacts.
An intuitive way for video deblurring is to first detect the blurry regions in the
image and then utilize information of clear pixels from adjacent frames. Detecting
the blurry regions can be considered a blur estimation problem, which either
separates the image into binarized sharp and blurry regions or directly
predicts the blur level of each pixel. Previous blur estimation methods, e.g., [5,26],
usually binarize the image into sharp and blurry regions and manually label the
regions, as it is difficult to determine the blur level of each pixel automatically.
However, the labelling process may be inaccurate and usually requires much
human effort.
In this work, we propose a motion magnitude prior (MMP) to represent
the blur level of a pixel, which can determine the pixel-wise blur level without
manual labelling. Recent works use high-frequency sharp frames to generate
a synthetic blurry frame [17]. Inspired by this, the pixel movement during the
exposure time of a blurry frame, i.e., the MMP, can be estimated from the

Fig. 1. A blurry frame and its estimated MMP, with deblurring results of the SoTA
method CDVD-TSP [19], our method, and the ground truth (GT).

average magnitude of the bi-directional optical flows between the high-frequency
sharp frames. The value of the MMP is positively correlated with the level of
motion blur. We propose a compact CNN to learn the MMP.
The proposed MMP can directly indicate the spatial distribution of blur level
in one image. Besides, if the overall value of MMP is low, the image is sharp,
and vice versa. Temporal information is also included in the prior. The CNN can
be further merged into a spatio-temporal recurrent neural network (RNN) for video
deblurring. For convenience, we use an efficient RNN with residual dense blocks (RDBs)
[41,42] as our backbone video deblurring network (Sec. 3.2). To utilize the MMP,
we design three components: a) for intra-frame utilization, we propose a motion
magnitude attentive module (MMAM) to inject the MMP into the network; b) for
inter-frame utilization, different from other RNN-based methods [13,18,43] which
only pass the deblurred features to the next frame, we also pass the features before
deblurring to the next frame. The blurry frame, whose pixels have different blur
levels, is weighted by the MMAM; as such, pixels with low blur level can be directly
utilized by the next frame; c) for loss-level utilization, the motion magnitude of the
network output reflects the deblurring performance, i.e., the sharper the image, the
lower the average motion magnitude. Therefore, we also use the prior as a loss term
to further improve the optimization. Figure 1 shows an example of the MMP learned
by the network and the corresponding deblurring result. High-quality results can be
generated by the proposed framework.
In summary, our contributions are threefold. (I) We propose a novel motion
magnitude prior for blurry images and a lightweight method to generate it. To
the best of our knowledge, this is the first work to apply a motion magnitude
prior to the video deblurring task. (II) We propose a motion magnitude prior-
guided network, which utilizes the prior at the intra-frame level, inter-frame level,
and loss level. (III) Our proposed method achieves competitive results with SoTA
methods while requiring relatively lower computational cost.

2 Related Works
2.1 Prior for Deblurring
As mentioned above, image segmentation, inter-frame optical flow and intra-
frame optical flow have been frequently used as priors in the deblurring tasks.

Image Segmentation In early studies, segmentation priors have been proposed


for dynamic scene deblurring. Cho et al. [4] segmented the images into multiple re-
gions of homogeneous motions while Bar et al . [2] segmented the images into
foreground and background layers. Based on [2] , Wulff et al . [37] focused on
estimating the parameters for both foreground and background motions. Us-
ing segmentation priors can handle the dynamic scenes; however, with simple
models, it is difficult to segment the non-parametrically varying complex mo-
tions. Ren et al . [22] exploited semantic segmentation which significantly im-
proves the optical flow estimation for the deblurring tasks. Shen et al . [25] pro-
posed a human-aware deblurring method by combining human segmentation
and learning-based deblurring. Although these methods based on physical
models show promising results, the deblurring performance is highly related to
the blur kernel estimation results. That is, inaccurate estimation of blur kernels
results in severe artifacts.
Inter-frame optical flow The inter-frame optical flows are usually directly
estimated on blurry frames, and are used to warp the adjacent frames to the
center frame before being input to the network [28,19,15]. The flow estimated from
a blurry frame is inaccurate. To solve this problem, Pan et al. [19] proposed a method
to estimate optical flow from the intermediate latent frames using a deep CNN
and restore the latent frames based on the optical flow estimations. A temporal
sharpness prior is applied to constrain the CNN model to help the latent frame
restoration. However, calculating inter-frame optical flow requires heavy computation,
and the temporal sharpness prior cannot handle frames that contain no sharp pixels.
Intra-frame optical flow With deep learning, it is even possible to directly es-
timate intra-frame optical flows from a blurry frame [31,8,1], followed by energy
function optimizations to estimate the blur kernels. However, the optimizations
are usually difficult to solve and require a huge computational cost. The re-
stored images also suffer from artifacts. Moreover, the definition of intra-frame
optical flow is ambiguous because, during the exposure time, the movement of one
pixel may be non-linear and thus cannot be represented by a single 2D vector.
Others Statistical priors and extreme channel priors have shown their feasibility
in single image deblurring [24,38,20,39]. Such methods are valid under certain
circumstances, but are sensitive to noise and still require accurate estimation of
the blur kernels. In contrast, we propose a motion magnitude prior learned by a
compact network. The prior gives the pixel-wise blur level of the blurry image and
can be easily applied to the video deblurring framework. It can detect which
part of the image is blurry and to what extent it is blurred.

2.2 Blur Estimation


Blur estimation has been studied for non-uniform blur. It is similar to the image
segmentation prior in that it segments the image into blurred and non-blurred
regions. In [5], horizontal motion or defocus blur was estimated and the image was
deconvolved for deblurring. Shi et al. [26] studied effective local blur features
to differentiate between blurred and sharp regions, which can be applied to blur
region segmentation, deblurring, and blur magnification. The images are usually
separated into blurred and sharp regions; however, it is difficult to separate
an image in such a binarized way for dynamic scenes. Instead of classifying the
pixels into blurry and sharp regions, we represent the blur map in a continuous
way by estimating the motion magnitude of each pixel.

2.3 DNN-based Deblurring


For single image deblurring, SoTA methods apply self-recurrent modules in
multi-scale, multi-patch, or multi-stage schemes to solve the problem [17,33,7,29,40].
Despite the high performance, they usually require a large computational cost. For
video deblurring, the problem is less ill-posed and can be solved with less compu-
tational cost. The learning-based video deblurring methods can be grouped into
alignment-based methods [28,35,44,19,15] and RNN-based methods [13,18,43].
The former usually explicitly aligns the adjacent frames to the center frame for
deblurring. This can be achieved by directly warping the adjacent frames to the
center frame [28], warping the intermediate latent frames, or both [19,15]. Wang
et al . [35] implemented the deformable CNN [6] to realize the alignment process.
Zhou et al . [44] proposed a spatio-temporal filter adaptive network (STFAN) for
alignment and deblurring in a single network. Son et al. [27] learned blur-invariant
motion estimation to avoid the inaccuracy caused by optical flow estimation on
blurry frames. Although the alignment-based methods can achieve relatively
high performance, the alignment process is usually computationally inefficient.
In addition, the alignment has to be accurate, otherwise it may degrade the
performance of deblurring. To ensure high quality results, multi-stage strate-
gies were applied by iteratively implementing the alignment-deblurring process,
which may lead to huge computational cost [19,15].
On the other hand, RNN-based methods transfer information between the
RNN cells and usually have lower computational cost. Kim et al. [13] proposed
an RNN for video deblurring that dynamically blends features from previous
frames. Nah et al. [18] iteratively updated the hidden state with one RNN cell
before the final output. Zhong et al. [43] proposed an RNN using an RDB
backbone with global fusion of high-level features from adjacent frames
and applied an attention mechanism to efficiently utilize inter-frame information.
In this paper, we propose an RNN with RDBs guided by the MMP. We use a
compact CNN to estimate the MMP so as to limit the computational cost. The MMP
carries both spatial and temporal information, which improves the effectiveness
of information delivery between RNN cells. Our proposed method achieves
SoTA performance.

3 The Proposed Approach


3.1 Motion Magnitude Prior
Motion Magnitude Prior from Optical Flow In this section, we introduce
how to prepare the ground truth of the motion magnitude prior (MMP). For
learning-based deblurring methods, the datasets for training and validation are usually

Fig. 2. Motion magnitude prior. High frequency sharp frames are used to generate
synthetic blurry frames. Meanwhile, we estimate the bi-directional optical flows for
each frame and calculate the average magnitude of bi-directional optical flows. Then,
we take the average of the motion magnitude maps of all the latent frames. Finally, we
use a modified UNet-like structure to learn the motion magnitude prior by regression.

synthesized by high-frequency sharp frames [28,17,35]. It is based on the fact


that images tend to be blurry after accumulating multiple sharp images with
slight misalignment caused by motion [9]. The blur level of each pixel in the
blurry frame is positively correlated to the motion magnitude of the pixel during
exposure time. Although directly measuring the motion magnitude of one pixel
is difficult, it can be calculated from the latent sharp frames during the exposure
time. Inspired by this, we generate blurry frame and blur level map pairs based on
high-frequency sharp image sequences, as shown in Fig. 2. Denoting the blurry
image as $B$ and the sampled sharp images as $\{S_1, S_2, \ldots, S_N\}$, the blurry
image simulation process can be represented as

B = c\left( \frac{1}{N} \sum_{i=1}^{N} S_i \right), \qquad (1)

where $c(\cdot)$ refers to the camera response function (CRF) [32]. During the exposure
time, we sample $N$ sharp images to generate one blurry image. To measure the
movement of pixels, we calculate the optical flow between the sharp frames, using
bi-directional flows to represent the pixel movement of one sharp frame. For
instance, as shown in Fig. 2, for frame 2 we calculate the optical flows $FL_{21}$
and $FL_{23}$ to represent its pixel movement. For each pixel $(m, n)$, the
optical flow between frame $i$ and frame $j$ in the $x$ and $y$ directions is denoted as
$u_{i,j}(m, n)$ and $v_{i,j}(m, n)$, respectively. We define the motion magnitude $M_i$
of frame $i$ as

M_i(m, n) = \frac{\sqrt{u_{i,i-1}^2(m, n) + v_{i,i-1}^2(m, n)} + \sqrt{u_{i,i+1}^2(m, n) + v_{i,i+1}^2(m, n)}}{2}. \qquad (2)

For frame 1 and frame $N$, we only use $FL_{12}$ and $FL_{N,N-1}$, respectively.
Then, to acquire the pixel-wise motion magnitude of the synthetic blurry frame,
we average the movement over the exposure time:
\bar{M} = \frac{1}{KN} \sum_{i=1}^{N} M_i, \qquad (3)

where $\bar{M}$ refers to the motion magnitude or blur level map of the blurry frame.
$K$ normalizes the value at each position of the MMP to 0∼1; it is determined by the
maximum value of the un-normalized MMP over the training dataset and is set to 15.
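
For concreteness, a minimal NumPy sketch of this ground-truth generation, assuming the bi-directional optical flows have already been estimated (e.g., by an off-the-shelf method such as RAFT). The function names, the clipping safeguard, and the identity CRF default are illustrative choices, not the authors' exact pipeline.

import numpy as np

def frame_motion_magnitude(flow_prev, flow_next):
    """Per-pixel motion magnitude of one latent sharp frame, Eq. (2).

    flow_prev, flow_next: (H, W, 2) arrays holding (u, v) toward the previous /
    next sharp frame; either may be None at the two ends of the sequence.
    """
    flows = [f for f in (flow_prev, flow_next) if f is not None]
    mags = [np.sqrt(f[..., 0] ** 2 + f[..., 1] ** 2) for f in flows]
    return sum(mags) / len(mags)

def mmp_ground_truth(flows_prev, flows_next, K=15.0):
    """Average the per-frame magnitudes over the N latent frames, Eq. (3)."""
    N = len(flows_prev)
    M = [frame_motion_magnitude(flows_prev[i], flows_next[i]) for i in range(N)]
    # divide by K to normalize to 0~1; the clip is only a safeguard
    return np.clip(np.mean(M, axis=0) / K, 0.0, 1.0)

def synth_blur(sharp_frames, crf=lambda x: x):
    """Synthetic blurry frame, Eq. (1): CRF applied to the mean of sharp frames."""
    return crf(np.mean(np.stack(sharp_frames, axis=0), axis=0))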
The Learning of Motion Magnitude Prior In this paper, we use the raw GOPRO
dataset [17] to generate blurry image and MMP pairs. The GOPRO benchmark
dataset uses 7∼13 successive sharp frames from the raw dataset to generate one
blurry frame; we likewise use 7∼13 frames to generate the blurry image and MMP
pairs. We use the SoTA optical flow method RAFT [34] to estimate the optical flow.
We propose a compact network to learn the blur level map by regression, applying
a modified UNet [23], as shown in Fig. 2. At the beginning, a 9 × 9 convolution layer
enlarges the receptive field. The features are downsampled and reconstructed with
a UNet-like structure, and a residual dense block (RDB) then refines the result.
To process a 720P (1280 × 720) frame, the computational cost is only 38.81 GMACs
with a model size of 0.85M parameters. Finally, we can estimate the MMP for each
frame using this compact network. In this paper, we apply the MMP as guidance to
the video deblurring network, as described in the following section.
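
The paper specifies only the coarse layout of MMP-Net (a 9 × 9 input convolution, a UNet-like encoder-decoder, and one RDB for refinement). The PyTorch skeleton below is a hedged sketch under those assumptions; the channel widths, the number of scales, and the RDB configuration are illustrative and not taken from the released code.

import torch
import torch.nn as nn

class RDB(nn.Module):
    """Residual dense block: densely connected 3x3 convs plus a 1x1 fusion."""
    def __init__(self, ch, growth=16, layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(ch + i * growth, growth, 3, padding=1) for i in range(layers)])
        self.fuse = nn.Conv2d(ch + layers * growth, ch, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))

class MMPNet(nn.Module):
    """UNet-like regressor from a blurry RGB frame to a one-channel MMP."""
    def __init__(self, base=16):
        super().__init__()
        self.head = nn.Conv2d(3, base, 9, padding=4)          # large receptive field
        self.down1 = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1)
        self.up1 = nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)
        self.refine = RDB(base)                                # refinement RDB
        self.tail = nn.Conv2d(base, 1, 3, padding=1)

    def forward(self, x):
        e0 = torch.relu(self.head(x))
        e1 = torch.relu(self.down1(e0))
        e2 = torch.relu(self.down2(e1))
        d1 = torch.relu(self.up1(e2)) + e1                     # skip connection
        d0 = torch.relu(self.up2(d1)) + e0                     # skip connection
        return torch.sigmoid(self.tail(self.refine(d0)))       # MMP in [0, 1]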

3.2 Motion Magnitude Prior-Guided Deblurring Network


In this section, we describe how the MMP guides the video deblurring network.
Network Structure The structure of MMP-RNN is shown in Fig. 3, where $n_c$
refers to the channel dimension. The center frame $B_t$ is first input into MMP-Net
to estimate the MMP $M_t$. Then, $B_t$ and $M_t$ are passed to the RNN cell to extract
features. Non-deblurred features $l_{t-1}$ and deblurred features $h_{t-1}$ from the
previous frame are delivered to the feature extraction module (FEM). In this paper,
we also use the deblurred features from adjacent frames globally to reconstruct
output images. Features $f_{t-2}$, $f_{t-1}$ from the past frames and $f_{t+1}$, $f_{t+2}$
from the future frames, together with the current frame features $f_t$, are input to
the reconstruction module (RM) for image reconstruction. We denote the output
as $O_t$. RDBs perform well in low-level vision tasks, efficiently preserving features
while saving computational cost [41,42,43]. In this work, we use RDBs as the
backbone for downsampling, feature extraction, and implicit alignment.
The FEM receives the center frame $B_t$ and the corresponding MMP $M_t$, together
with the features $l_{t-1}$ and $h_{t-1}$ from the previous frame, to extract features
of the current frame. The structure of the FEM is shown in Fig. 3. The motion
magnitude attentive module (MMAM) is used to fuse the information of $B_t$ and
$M_t$. We pass the non-deblurred features $l_t$ to the next FEM and receive $l_{t-1}$
from the previous FEM.

Fig. 3. The structure of MMP-RNN. Both the center frame $B_t$ and the corresponding
MMP $M_t$ estimated by MMP-Net are passed into the RNN cell. We transmit both the
non-deblurred features $l$ and the deblurred features $h$ from the previous frame to the
next. Deblurred features $f$ from the current frame and the adjacent frames are fused
globally in the decoder to generate the final output image.

The non-deblurred and deblurred features are then concatenated and passed through
RDB-Net-A with $n_a$ RDBs. Finally, the deblurred features $h_t$ are passed to the
next FEM and $f_t$ is used for image reconstruction.
The RM is designed to globally aggregate high-level features for image reconstruction.
We concatenate the features $\{f_{t-2}, \ldots, f_{t+2}\}$ and squeeze them in the channel
direction using a 1 × 1 convolution. We use $n_b$ RDBs in RDB-Net-B to implicitly
align the features from different frames and then apply transposed convolutions to
reconstruct the image. A global skip connection directly passes the center frame $B_t$,
after a 9 × 9 convolution, to the output, which can
better preserve the information of the center frame.
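
A hedged sketch of how an RM of this kind could be assembled in PyTorch. The class name, the 4× upsampling back to the input resolution (implied by downsampling in the FEM), and the plain residual convolution stack standing in for RDB-Net-B are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class ReconstructionModule(nn.Module):
    """Fuse features of 5 neighboring frames and reconstruct the center frame."""
    def __init__(self, nc=16, nb=8):
        super().__init__()
        self.squeeze = nn.Conv2d(5 * nc, nc, 1)                   # squeeze concatenated features
        self.body = nn.Sequential(*[nn.Sequential(                # stand-in for RDB-Net-B
            nn.Conv2d(nc, nc, 3, padding=1), nn.ReLU(),
            nn.Conv2d(nc, nc, 3, padding=1)) for _ in range(nb)])
        self.up = nn.Sequential(                                   # transposed convs back to full size
            nn.ConvTranspose2d(nc, nc, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(nc, 3, 4, stride=2, padding=1))
        self.skip = nn.Conv2d(3, 3, 9, padding=4)                  # global skip on the center frame

    def forward(self, feats, center_frame):
        # feats: list of 5 tensors (N, nc, H/4, W/4); center_frame: (N, 3, H, W)
        x = self.squeeze(torch.cat(feats, dim=1))
        x = x + self.body(x)
        x = self.up(x)
        return x + self.skip(center_frame)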
Motion Magnitude Attentive Module Both blurry and sharp pixels are important
to our task: the blurry pixels are the regions to concentrate on for deblurring, while
the sharp pixels can be utilized for deblurring adjacent frames. For sharp pixels in
particular, the MMP value is close to 0, so directly multiplying the MMP with the
image features may lose the information of sharp features. To better utilize the MMP,
we use the MMAM to integrate it with the image features. The structure of the MMAM
layer is shown in Fig. 3. The MMP is passed through two 1 × 1 convolution layers,
which refine it and adjust its dimension, transforming it into a tensor $\gamma$. The
integration of $\gamma$ with the image features $x$ extracted from $B$ can be represented
as $x_{out} = \gamma \otimes x_{in}$, where $\otimes$ refers to element-wise multiplication.
This operation better integrates the MMP information with the blurry image features.
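
A hedged PyTorch sketch of the MMAM as described: two 1 × 1 convolutions (with a LeakyReLU in between, as indicated in Fig. 3) turn the single-channel MMP into the gating tensor γ, which is multiplied element-wise with the image features. The channel width and the exact position of the module inside the cell are illustrative assumptions.

import torch
import torch.nn as nn

class MMAM(nn.Module):
    """Motion Magnitude Attentive Module: gate image features with the MMP."""
    def __init__(self, feat_ch=16):
        super().__init__()
        self.to_gamma = nn.Sequential(
            nn.Conv2d(1, feat_ch // 2, 1),         # refine the single-channel MMP
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(feat_ch // 2, feat_ch, 1))   # adjust to the feature dimension

    def forward(self, x, mmp):
        gamma = self.to_gamma(mmp)                 # gating tensor gamma
        return gamma * x                           # element-wise modulation, x_out = gamma ⊗ x_in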

Feature Transmission We transmit two kinds of features, $l$ and $h$, between the
FEMs of adjacent frames. The non-deblurred features $l_t$ contain information only
from the center frame before integration. By contrast, $h_t$ also carries features
from the previous frames. The features are integrated in the network as

a_t = \mathrm{CAT}(l_t, l_{t-1}, h_{t-1}), \qquad (4)

where CAT refers to the concatenation operation. RNN-based methods implicitly
fuse information from previous frames, which may sometimes cause lower
performance compared to alignment-based methods. Passing non-deblurred
features, which consist only of the current frame with its blur-level information, can
improve the overall performance of the network.
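
The recurrent data flow can be summarized by the sketch below; the module interfaces (`mmp_net`, `fem`, `rm`) and the handling of the first and last frames are assumed for illustration, not the exact released implementation.

def mmp_rnn_forward(frames, mmp_net, fem, rm):
    """Hedged sketch of the recurrent pass, highlighting what is transmitted.

    frames: list of (1, 3, H, W) blurry frames. fem(B, M, l_prev, h_prev) is
    assumed to return (f, l, h): high-level features f_t for reconstruction,
    the non-deblurred features l_t, and the deblurred features h_t.
    rm fuses the features of 5 neighboring frames into one restored frame.
    """
    l_prev = h_prev = None        # assumed to be zero-initialized inside fem for the first frame
    feats, outputs = [], []
    for t, B in enumerate(frames):
        M = mmp_net(B)                                   # pixel-wise blur level of frame t
        f, l_prev, h_prev = fem(B, M, l_prev, h_prev)    # a_t = CAT(l_t, l_{t-1}, h_{t-1}) inside fem
        feats.append(f)
        if t >= 4:                                       # f_{t-2} ... f_{t+2} available for frame t-2
            outputs.append(rm(feats[t - 4:t + 1], frames[t - 2]))
    # boundary frames would need padding or replication in practice (not shown)
    return outputs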
Motion Magnitude Loss The proposed MMP-Net estimates the pixel-wise motion
magnitude of an image. If $O_t$ is an ideal sharp image, inputting $O_t$ into MMP-Net
should yield a prior that is all zeros. Based on this idea, we propose the following
loss function:

L_{\mathrm{MM}} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} M_{i,j}(O_t), \qquad (5)

where $M(O_t)$ refers to the MMP of the output image $O_t$. In theory, minimizing the
average motion magnitude of $O_t$ drives the network toward an ideal sharp image.
We also consider two kinds of content loss functions for training: a modified
Charbonnier loss ($L_{char}$) [3] and a gradient loss ($L_{grad}$).

L_{char} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \sqrt{\sum_{ch}^{r,g,b} (I_{i,j,ch} - O_{i,j,ch})^2 + \epsilon^2}, \qquad (6)

where $\epsilon$ is a small value set to 0.001, $m$ and $n$ are the width and height of
the image, $I$ refers to the ground truth, and $O$ is the estimated image.

L_{grad} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} (G_{i,j} - \hat{G}_{i,j})^2, \qquad (7)

where $G$ and $\hat{G}$ refer to the image gradients of $I$ and $O$, respectively. The
total loss is then

L = L_{char} + \lambda_1 L_{grad} + \lambda_2 L_{\mathrm{MM}}, \qquad (8)

where the weights $\lambda_1$ and $\lambda_2$ are set to 0.5 and 1.0 in our experiments.
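
Under the definitions above, the training objective could be sketched as follows. The finite-difference gradient operator and the NCHW tensor layout are assumptions, and MMP-Net is treated here as a pretrained blur-level predictor.

import torch

def charbonnier_loss(pred, gt, eps=1e-3):
    """Eq. (6): per-pixel Charbonnier distance over the summed RGB differences."""
    return torch.sqrt(((gt - pred) ** 2).sum(dim=1) + eps ** 2).mean()

def gradient_loss(pred, gt):
    """Eq. (7): L2 distance between image gradients (finite differences assumed)."""
    def grads(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    gx_p, gy_p = grads(pred)
    gx_g, gy_g = grads(gt)
    return ((gx_p - gx_g) ** 2).mean() + ((gy_p - gy_g) ** 2).mean()

def total_loss(pred, gt, mmp_net, lambda1=0.5, lambda2=1.0):
    """Eq. (8): content losses plus the motion magnitude loss of Eq. (5)."""
    l_mm = mmp_net(pred).mean()               # average blur level of the output
    return charbonnier_loss(pred, gt) + lambda1 * gradient_loss(pred, gt) + lambda2 * l_mm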

4 Experiment
4.1 MMP-Net
Dataset Generation To learn the proposed prior, we utilize the GOPRO dataset
[17]. The raw GOPRO dataset consists of 33 high-frequency video sequences with


Fig. 4. Examples of motion magnitude estimation on the GOPRO test dataset. The
columns show the input blurry frames, the estimated results, and the GT, respectively.

Table 1. The training result of MMP-Net.

        Train     Test (512 patch)    Test (full image)
L1      0.0137    0.0169              0.0192

34,874 images in total. In the GOPRO benchmark dataset, the sharp images are used
to generate a synthetic dataset with 22 sequences for training and 11 sequences for
evaluation, containing 2,103 training samples and 1,111 evaluation samples. We use
the same data split as the GOPRO benchmark dataset to build the MMP dataset,
to avoid information leakage during video deblurring. We use 7∼11 consecutive
sharp frames to generate one blurry frame and its corresponding MMP. We generate
22,499 training samples (22 sequences) and use the original GOPRO test dataset.
We trim the training dataset by 50% during training.
Implementation details We train the model for 400 epochs with a mini-batch size
of 8, using the ADAM optimizer [14] with an initial learning rate of 0.0003. The
learning rate decays by half after 200 epochs. The patch size is set to 512 × 512 for
training and validation. The loss $L_1 = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} |M_{i,j} - \hat{M}_{i,j}|$
is used for MMP training and as a test metric, where $M_{i,j}$ refers to the value of the
estimated MMP at position $(i, j)$ and $\hat{M}$ refers to the GT.
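
A minimal sketch of this training schedule, assuming a data loader that yields blurry 512 × 512 patches paired with their ground-truth MMPs; the scheduler choice (MultiStepLR with a single milestone) simply mirrors the halving at epoch 200 described above.

import torch
from torch import nn, optim

def train_mmp_net(model, loader, epochs=400, lr=3e-4):
    """Hedged sketch of MMP-Net training with the settings stated in the text."""
    optimizer = optim.Adam(model.parameters(), lr=lr)
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200], gamma=0.5)
    l1 = nn.L1Loss()                               # training loss and test metric
    for epoch in range(epochs):
        for blurry_patch, mmp_gt in loader:        # 512x512 patches, mini-batch of 8
            optimizer.zero_grad()
            loss = l1(model(blurry_patch), mmp_gt)
            loss.backward()
            optimizer.step()
        scheduler.step()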
Results The training results are listed in Table 1. We visualize our results on the
GOPRO test dataset in Fig. 4. To examine the generality of the proposed method,
we also test the GOPRO-trained model on the HIDE dataset [25], which contains
blurry images of humans. As shown in Fig. 5, our proposed method can also detect
the blur caused by human motion on another dataset.

4.2 Video Deblurring

Datasets We test the proposed video deblurring method on two public datasets,
the GOPRO benchmark dataset [17] and the beam-splitter dataset (BSD) [43]. BSD

Fig. 5. Examples of test results on the HIDE dataset [25]. The model was trained on
the GOPRO dataset and tested on the HIDE dataset. The columns show the input
blurry images, the estimated results, and the estimated results overlaid on the input
images. This indicates that the proposed method can successfully detect salient blurry
regions on other datasets.

is a real-world dataset captured by controlling the exposure time and exposure
intensity during video shooting with a beam-splitter system [11]. It can better
evaluate deblurring performance in real scenarios. We use the 2ms-16ms BSD, in
which the exposure times for sharp and blurry frames are 2 ms and 16 ms,
respectively. The training and validation sets have 60 and 20 video sequences of
100 frames each, and the test set has 20 video sequences of 150 frames. The frame
size is 640 × 480.
Implementation details We train the model using the ADAM optimizer with a
learning rate of 0.0005 and adopt a cosine annealing schedule [16] to adjust the
learning rate during training. We sample 10-frame 256 × 256 RGB patch sequences
from the dataset to construct mini-batches of size 8, with random vertical and
horizontal flips as well as 90° rotations as data augmentation. We train for 1,000
epochs on GOPRO and 500 epochs on BSD, respectively. It is worth mentioning
that we train the other methods ourselves using publicly available code and keep
the same hyper-parameters where possible. For CDVD-TSP, we use the available
test images for GOPRO and keep the training strategy of [19] for BSD.
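
A hedged sketch of the optimizer schedule and data augmentation described above; the augmentation probabilities and the exact cosine-annealing parameters are assumptions.

import random
import torch
from torch import optim

def augment(blur_seq, sharp_seq):
    """Apply the same random flips / 90-degree rotation to a blurry patch
    sequence and its sharp ground truth, both shaped (T, C, H, W)."""
    if random.random() < 0.5:                      # horizontal flip
        blur_seq, sharp_seq = torch.flip(blur_seq, dims=[-1]), torch.flip(sharp_seq, dims=[-1])
    if random.random() < 0.5:                      # vertical flip
        blur_seq, sharp_seq = torch.flip(blur_seq, dims=[-2]), torch.flip(sharp_seq, dims=[-2])
    if random.random() < 0.5:                      # 90-degree rotation
        blur_seq = torch.rot90(blur_seq, 1, dims=[-2, -1])
        sharp_seq = torch.rot90(sharp_seq, 1, dims=[-2, -1])
    return blur_seq, sharp_seq

def make_optimizer(model, epochs=1000, lr=5e-4):
    """ADAM with a cosine annealing schedule over the training epochs."""
    optimizer = optim.Adam(model.parameters(), lr=lr)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler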
Benchmark Results The results of our method and SoTA lightweight image [33]
and video deblurring methods on GOPRO are shown in Table 2. We use A#B#C#F#
to denote $n_a$, $n_b$, $n_c$, and the length of the input image sequence of our model.
We evaluate image quality in terms of PSNR [10] and SSIM, and we measure the
computational cost of each model when processing one 720P frame in terms of
GMACs. The running time (s) for one 720P frame is also listed. For IFIRNN, c2
refers to the dual-cell model and h3 refers to three hidden-state iterations. Our
A7B8C16F8 model outperforms the other methods with only 204.19 GMACs. The
visual comparison of the results is shown in Fig. 6.

Table 2. Quantitative results on GOPRO.

Model                     PSNR    SSIM     GMACs     Param (M)   Time (s)
SRN [33]                  29.94   0.8953   1527.01   10.25       0.173
DBN [28]                  28.55   0.8595   784.75    15.31       0.128
IFIRNN (c2h3) [18]        29.69   0.8867   217.89    1.64        0.034
ESTRNN (C70B7) [43]       29.93   0.8903   115.19    1.17        0.021
ESTRNN (C90B10) [42]      31.02   0.9109   215.26    2.38        0.035
CDVD-TSP [19]             31.67   0.9279   5122.25   16.19       0.729
MMP-RNN (A3B4C16F5)       30.48   0.9021   136.42    1.97        0.032
MMP-RNN (A7B8C16F8)       31.71   0.9225   204.19    3.05        0.045
MMP-RNN (A9B10C18F8)      32.64   0.9359   264.52    4.05        0.059

Table 3. Quantitative results on BSD 2ms-16ms.

Model                     PSNR    SSIM     GMACs
DBN [28]                  31.33   0.9132   784.75
IFIRNN (c2h3) [18]        31.59   0.9209   217.89
ESTRNN (C90B10) [43]      31.80   0.9245   215.26
CDVD-TSP [19]             32.06   0.9268   5122.25
MMP-RNN (A8B9C18F8)       32.79   0.9365   247.41
MMP-RNN (A9B10C18F8)      32.81   0.9369   264.52

The results of the proposed method and other SoTA methods on BSD are shown
in Table 3. Here, we use the A8B9C18F8 model. The visual results of video
deblurring on BSD are shown in Fig. 7. Our method outperforms SoTA methods
with less computational cost.

4.3 Ablation Study


Network Structure We conduct ablation tests on the proposed method with the
A9B10C18 model on GOPRO and the A8B9C18 model on BSD. We focus on three
parts: the MMAM, the motion magnitude loss, and the transmission of non-deblurred
features, as shown in Table 4. On GOPRO, the MMAM with the MMP improves
PSNR by 0.31 dB; together with the motion magnitude loss, the prior improves
the score by 0.39 dB. If we remove all the components, the PSNR drops by
0.58 dB, which indicates the effectiveness of the proposed method. As for BSD,
the PSNR increases significantly, by 0.63 dB, after fusing the prior with the MMAM.
The motion magnitude loss improves PSNR by 0.09 dB.
Influence of Prior We also perform ablation tests with different types of priors
using the A3B4C80F5 model on GOPRO. As shown in Table 5, we tried the ground
truth MMP, the ground truth flow magnitude of the center frame (from $FL_{c,1}$
and $FL_{c,N}$, where $c$ refers to the center frame), the normalized ground truth map
($M/\max(M)$), no MMP, and the MMP estimated by MMP-Net. The ground truth,
as an upper bound, increases the score by 0.37 dB. The


Fig. 6. Visualization results on GOPRO. (a) Blurry frame. (b) The estimated MMP. (c)
Overlay of (b) on (a). (d) Result from MMP-RNN. (e) to (l) show the blurry input,
DBN, IFIRNN, ESTRNN, the proposed method w/o MMP, MMP-RNN, and the sharp
frame, respectively.

Table 4. Ablation tests. NDF refers to the transmission of non-deblurred features.

                       GOPRO               BSD
MMAM   L_MM   NDF      PSNR     GMACs      PSNR     GMACs
 ✓      ✓      ✓       32.64    264.52     32.79    247.41
 ✗      ✓      ✓       32.33    225.34     32.16    208.23
 ✗      ✗      ✓       32.25    225.34     32.07    208.23
 ✗      ✗      ✗       32.06    227.21     32.03    210.09

Table 5. Influence of different types of priors. I. Ground truth; II. Ground truth of the
flow magnitude of the center frame; III. Each ground truth map normalized to 0∼1;
IV. MMP and MMAM replaced by spatial self-attention [36]; V. None; VI. Estimated
by MMP-Net.

Prior type   I       II      III     IV      V       VI
PSNR         30.54   30.47   30.41   30.37   30.17   30.48

MMP from MMP-Net can increase the PSNR by 0.31 dB. On the other hand,
using the flow magnitude of the center frame will decrease the PSNR by 0.07 dB.


Fig. 7. Visualization results on BSD. (a) Blurry frame. (b) The estimated MMP. (c)
Result from MMP-RNN. (d) to (k) show the blurry input, DBN, IFIRNN, ESTRNN,
the proposed method w/o MMP, MMP-RNN, and the sharp frame, respectively.

The contour of the center-frame flow magnitude corresponds only to the center frame,
so the attentive field cannot cover the whole blurry object. Normalizing each MMP by
its own maximum value may lose temporal information; in particular, for some sharp
images the whole MMP may be close to 1, which can harm performance. We also
compare the proposed method to spatial self-attention [36]. Our proposed method
uses a supervised approach to tell the network where to concentrate, and the results
outperform vanilla spatial self-attention.

5 Conclusion
In this paper, we proposed a motion magnitude prior for deblurring tasks. We built
a dataset of blurry image and motion magnitude prior pairs and used a compact
network to learn the prior by regression. We applied the prior to the video deblurring
task by combining it with an efficient RNN: the prior is fused via the motion
magnitude attentive module and the motion magnitude loss. We also transmitted the
features before deblurring, together with the features after deblurring, between RNN
cells to improve efficiency. We tested the proposed method on GOPRO and BSD,
where it achieves better image quality than SoTA video deblurring methods at lower
computational cost.
Acknowledgements This paper is supported by JSPS KAKENHI Grant Numbers
22H00529 and 20H05951.

References
1. Argaw, D.M., Kim, J., Rameau, F., Cho, J.W., Kweon, I.S.: Optical flow estimation
from a single motion-blurred image. In: AAAI (2021) 2, 4
2. Bar, L., Berkels, B., Rumpf, M., Sapiro, G.: A variational framework for simultane-
ous motion estimation and restoration of motion-blurred video. In: ICCV. pp. 1–8
(2007) 2, 4
3. Charbonnier, P., Blanc-Feraud, L., Aubert, G., Barlaud, M.: Two deterministic
half-quadratic regularization algorithms for computed imaging. In: ICIP. pp. 168–
172 (1994) 9
4. Cho, S., Matsushita, Y., Lee, S.: Removing non-uniform motion blur from images.
In: ICCV. pp. 1–8 (2007) 2, 4
5. Couzinié-Devy, F., Sun, J., Alahari, K., Ponce, J.: Learning to estimate and remove
non-uniform image blur. In: CVPR. pp. 1075–1082 (2013) 2, 4
6. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convo-
lutional networks. In: ICCV. pp. 764–773 (2017) 5
7. Gao, H., Tao, X., Shen, X., Jia, J.: Dynamic scene deblurring with parameter
selective sharing and nested skip connections. In: CVPR. pp. 3848–3856 (2019) 5
8. Gong, D., Yang, J., Liu, L., Zhang, Y., Reid, I., Shen, C., Hengel, A.V.D., Shi, Q.:
From motion blur to motion flow: A deep learning solution for removing heteroge-
neous motion blur. In: CVPR. pp. 3806–3815 (2017) 2, 4
9. Hirsch, M., Schuler, C.J., Harmeling, S., Schölkopf, B.: Fast removal of non-uniform
camera shake. In: ICCV. pp. 463–470 (2011) 6
10. Horé, A., Ziou, D.: Image quality metrics: Psnr vs. ssim. In: ICPR. pp. 2366–2369
(2010) 11
11. Jiang, H., Zheng, Y.: Learning to see moving objects in the dark. In: ICCV. pp.
7324–7333 (2019) 11
12. Kim, T.H., Lee, K.M.: Generalized video deblurring for dynamic scenes. In: CVPR.
pp. 5426–5434 (2015) 2
13. Kim, T.H., Lee, K.M., Scholkopf, B., Hirsch, M.: Online video deblurring via dy-
namic temporal blending network. In: ICCV. pp. 4038–4047 (2017) 2, 3, 5
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR
(2015) 10
15. Li, D., Xu, C., Zhang, K., Yu, X., Zhong, Y., Ren, W., Suominen, H., Li, H.: Arvo:
Learning all-range volumetric correspondence for video deblurring. In: CVPR. pp.
7721–7731 (June 2021) 2, 4, 5
16. Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts.
In: ICLR (2017) 11
17. Nah, S., Kim, T.H., Lee, K.M.: Deep multi-scale convolutional neural network for
dynamic scene deblurring. In: CVPR. pp. 3883–3891 (2017) 2, 5, 6, 7, 9, 10
18. Nah, S., Son, S., Lee, K.M.: Recurrent neural networks with intra-frame iterations
for video deblurring. In: CVPR. pp. 8102–8111 (2019) 2, 3, 5, 12
19. Pan, J., Bai, H., Tang, J.: Cascaded deep video deblurring using temporal sharpness
prior. In: CVPR. pp. 3043–3051 (2020) 2, 3, 4, 5, 11, 12
20. Pan, J., Sun, D., Pfister, H., Yang, M.: Blind image deblurring using dark channel
prior. In: CVPR. pp. 1628–1636 (2016) 4
21. Portz, T., Zhang, L., Jiang, H.: Optical flow in the presence of spatially-varying
motion blur. In: CVPR. pp. 1752–1759 (2012) 2
22. Ren, W., Pan, J., Cao, X., Yang, M.H.: Video deblurring via semantic segmentation
and pixel-wise non-linear kernel. In: ICCV. pp. 1077–1085 (2017) 2, 4

23. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: MICCAI. pp. 234–241 (2015) 7
24. Shan, Q., Jia, J., Agarwala, A.: High-quality motion deblurring from a single image.
ACM Trans. Graph. 27(3), 1–10 (2008) 4
25. Shen, Z., Wang, W., Shen, J., Ling, H., Xu, T., Shao, L.: Human-aware motion
deblurring. In: ICCV (2019) 2, 4, 10, 11
26. Shi, J., Xu, L., Jia, J.: Discriminative blur detection features. In: CVPR. pp. 2965–
2972 (2014) 2, 4
27. Son, H., Lee, J., Lee, J., Cho, S., Lee, S.: Recurrent video deblurring with blur-
invariant motion estimation and pixel volumes. ACM Trans. Graph. (2021) 5
28. Su, S., Delbracio, M., Wang, J., Sapiro, G., Heidrich, W., Wang, O.: Deep video
deblurring for hand-held cameras. In: CVPR. pp. 237–246 (2017) 2, 4, 5, 6, 12
29. Suin, M., Purohit, K., Rajagopalan, A.N.: Spatially-attentive patch-hierarchical
network for adaptive motion deblurring. In: CVPR. pp. 3606–3615 (2020) 5
30. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using
pyramid, warping, and cost volume. In: CVPR (2018) 2
31. Sun, J., Cao, W., Xu, Z., Ponce, J.: Learning a convolutional neural network for
non-uniform motion blur removal. In: CVPR. pp. 769–777 (2015) 4
32. Tai, Y.W., Chen, X., Kim, S., Kim, S.J., Li, F., Yang, J., Yu, J., Matsushita, Y.,
Brown, M.S.: Nonlinear camera response functions and image deblurring: Theo-
retical analysis and practice. IEEE Transactions on Pattern Analysis and Machine
Intelligence 35(10), 2498–2512 (2013) 6
33. Tao, X., Gao, H., Shen, X., Wang, J., Jia, J.: Scale-recurrent network for deep
image deblurring. In: CVPR. pp. 8174–8182 (2018) 5, 11, 12
34. Teed, Z., Deng, J.: Raft: Recurrent all pairs field transforms for optical flow. In:
ECCV. pp. 402–419 (2020) 2, 7
35. Wang, X., Chan, K.C., Yu, K., Dong, C., Loy, C.C.: Edvr: video restoration with
enhanced deformable convolutional networks. In: CVPRW (2019) 2, 5, 6
36. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: Convolutional block attention
module. In: ECCV (2018) 13, 14
37. Wulff, J., Black, M.J.: Modeling blurred video with layers. In: ECCV. pp. 236–252
(2014) 2, 4
38. Xu, L., Zheng, S., Jia, J.: Unnatural l0 sparse representation for natural image
deblurring. In: CVPR. pp. 1107–1114 (2013) 4
39. Yan, Y., Ren, W., Guo, Y., Wang, R., Cao, X.: Image deblurring via extreme
channels prior. In: CVPR. pp. 6978–6986 (2017) 4
40. Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.:
Multi-stage progressive image restoration. In: CVPR (2021) 5
41. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image
super-resolution. In: CVPR. pp. 2472–2481 (2018) 3, 7
42. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for im-
age restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence
(2020) 3, 7, 12
43. Zhong, Z., Gao, Y., Zheng, Y., Zheng, B.: Efficient spatio-temporal recurrent neural
network for video deblurring. In: ECCV. pp. 191–207 (2020) 2, 3, 5, 7, 10, 12
44. Zhou, S., Zhang, J., Pan, J., Xie, H., Zuo, W., Ren, J.: Spatio-temporal filter
adaptive network for video deblurring. In: ICCV. pp. 2482–2491 (2019) 2, 5
