Spatio-Temporal Deformable Convolution For Compressed Video Quality Enhancement
Abstract
Recent years have witnessed remarkable success of deep learning methods in quality enhancement for compressed video. To better explore temporal information, existing methods usually estimate optical flow for temporal motion compensation. However, since compressed video can be seriously distorted by various compression artifacts, the estimated optical flow tends to be inaccurate and unreliable, resulting in ineffective quality enhancement. In addition, optical flow estimation for consecutive frames is generally conducted in a pairwise manner, which is computationally expensive and inefficient. In this paper, we propose a fast yet effective method for compressed video quality enhancement by incorporating a novel Spatio-Temporal Deformable Fusion (STDF) scheme to aggregate temporal information. Specifically, the proposed STDF takes a target frame along with its neighboring reference frames as input to jointly predict an offset field that deforms the spatio-temporal sampling positions of convolution. As a result, complementary information from both target and reference frames can be fused within a single Spatio-Temporal Deformable Convolution (STDC) operation. Extensive experiments show that our method achieves state-of-the-art performance in compressed video quality enhancement in terms of both accuracy and efficiency.

Figure 1: Illustration of compression artifacts. Videos are compressed by the latest H.265/HEVC coding algorithm.

1 Introduction

Nowadays, video content makes up a major fraction of digital network traffic and is still growing (Wien 2015). To transmit video under limited bandwidth, video compression is indispensable for significantly reducing the bit-rate. However, compression algorithms, such as H.264/AVC (Wiegand et al. 2003) and H.265/HEVC (Sullivan, Ohm, and Wiegand 2013), often introduce various artifacts into the compressed video, especially at low bit-rates. As shown in Figure 1, such artifacts may considerably diminish video quality, resulting in degradation of Quality of Experience (QoE). The distorted content in low-quality compressed video may also reduce the performance of subsequent vision tasks (e.g., recognition, detection, tracking) in low-bandwidth applications (Galteri et al. 2017; Lu et al. 2019). Thus, it is crucial to study compressed video quality enhancement (VQE).

During the past decades, extensive work has been conducted on artifact removal and quality enhancement for single compressed images. Traditional methods (Foi, Katkovnik, and Egiazarian 2007; Zhang et al. 2013) reduce artifacts by optimizing the transform coefficients of a specific compression standard, and are therefore hard to extend to other compression schemes. With the recent advances in Convolutional Neural Networks (CNNs), CNN-based methods (Dong et al. 2015; Tai et al. 2017; Zhang et al. 2017; 2019) have also emerged for image quality enhancement. They usually learn a non-linear mapping that directly regresses the artifact-free image from a large amount of training data, leading to impressive results with high efficiency. However, these methods cannot be directly extended to compressed video, since they treat frames independently and thus fail to exploit temporal information.

On the other hand, there is only limited study on quality enhancement for compressed video. Yang et al. first proposed the Multi-Frame Quality Enhancement (MFQE 1.0) approach to leverage temporal information for VQE (Yang et al. 2018). Specifically, high quality frames in the compressed video are utilized as references to help enhance the quality of neighboring low quality target frames via a novel Multi-Frame CNN (MF-CNN). Recently, an upgraded version, MFQE 2.0 (Guan et al. 2019), was introduced to further improve the efficiency of MF-CNN and achieved state-of-the-art performance.
In order to aggregate information from the target frame and reference frames, both MFQE methods adopt a widely used temporal fusion scheme that incorporates dense optical flow for motion compensation (Kappeler et al. 2016; Caballero et al. 2017; Xue et al. 2017). However, this temporal fusion scheme may be suboptimal in the context of the VQE task. Since compression artifacts can seriously distort video content and break pixel-wise correspondences between frames, the estimated optical flow tends to be inaccurate and unreliable, thereby resulting in ineffective quality enhancement. In addition, optical flow estimation needs to be repeatedly performed for different reference-target frame pairs in a pairwise manner, which substantially increases the computational cost of exploring more reference frames.

To address the aforementioned issues, we introduce a Spatio-Temporal Deformable Fusion (STDF) scheme for the VQE task. Specifically, we propose to learn a novel Spatio-Temporal Deformable Convolution (STDC) to aggregate temporal information while avoiding explicit optical flow estimation. The main idea of STDC is to adaptively deform the spatio-temporal sampling positions of convolution so as to capture the most relevant context and exclude noisy content for quality enhancement of the target frame. To this end, we adopt a CNN-based predictor to jointly model the correspondence across target and reference frames, and accordingly regress those sampling positions within a single inference pass. The main contributions of this paper are summarized as follows:

• We propose an end-to-end CNN-based method for the VQE task, which incorporates a novel STDF scheme to aggregate temporal information.

• We analytically and experimentally compare the proposed STDF to prior fusion schemes, and demonstrate its higher flexibility and robustness.

• We quantitatively and qualitatively evaluate the proposed method on a VQE benchmark dataset and show that it achieves state-of-the-art performance in terms of accuracy and efficiency.

2 Related Work

Image and Video Quality Enhancement. Over the past decade, an increasing number of works have focused on quality enhancement for compressed images (Foi, Katkovnik, and Egiazarian 2007; Jancsary, Nowozin, and Rother 2012; Chang, Ng, and Zeng 2013; Zhang et al. 2013; Dong et al. 2015; Guo and Chao 2016; Zhang et al. 2017; 2019). Among them, CNN-based end-to-end methods achieve the current state-of-the-art performance. Specifically, Dong et al. first introduced the 4-layer AR-CNN (Dong et al. 2015) to remove various JPEG compression artifacts. Later, Zhang et al. managed to learn a very deep DnCNN (Zhang et al. 2017) with a residual learning scheme for several image restoration tasks. Most recently, Zhang et al. proposed an even deeper network, RNAN (Zhang et al. 2019), with a residual non-local attention mechanism to capture long-range dependencies between pixels, setting a new state of the art for image quality enhancement. These methods tend to apply large CNNs to capture discriminative features within an image, resulting in a large amount of computation and parameters. On the other hand, MFQE 1.0 (Yang et al. 2018) pioneered the use of a multi-frame CNN to take advantage of temporal information for compressed video quality enhancement, where high quality frames are utilized to help enhance the quality of adjacent low quality frames. To exploit long-range temporal information, Yang et al. later introduced a modified convolutional long short-term memory network (Yang et al. 2019) for video quality enhancement. Most recently, Guan et al. proposed MFQE 2.0 (Guan et al. 2019), which upgrades several key components of MFQE 1.0 and achieves state-of-the-art performance in terms of accuracy and speed.

Leveraging Temporal Information. It is crucial to leverage complementary information across multiple frames for video-related tasks. Karpathy et al. first introduced several convolution-based fusion schemes to combine spatio-temporal information for video classification (Karpathy et al. 2014). Kappeler et al. later investigated those fusion schemes for low-level vision tasks (Kappeler et al. 2016) and managed to improve accuracy by compensating motion across consecutive frames with a Total Variation (TV) based optical flow estimation algorithm. Caballero et al. further replaced the TV-based flow estimator with a CNN to enable end-to-end training (Caballero et al. 2017). Since then, temporal fusion with motion compensation has been widely adopted for various vision tasks (Xue et al. 2017; Yang et al. 2018; Kim et al. 2018; Guan et al. 2019). However, these methods rely heavily on accurate optical flow, which is hard to obtain due to general problems (e.g., occlusion, large motion) or task-specific problems (e.g., compression artifacts). To cope with this, several works bypass explicit optical flow estimation. Niklaus, Mai, and Liu proposed AdaConv (Niklaus, Mai, and Liu 2017) to adaptively generate convolution kernels by implicitly utilizing motion cues for video frame interpolation. Shi et al. introduced the ConvLSTM network to exploit contextual information from a long range of adjacent frames (Shi et al. 2015). In this work, we propose to combine motion cues with convolution to efficiently aggregate spatio-temporal information, which also omits explicit estimation of optical flow.

Deformable Convolution. Dai et al. first proposed to augment regular convolution with learnable sampling offsets to model complex geometric transformations for object detection (Dai et al. 2017; Zhu et al. 2019). Later, several works (Bertasius, Torresani, and Shi 2018; Tian et al. 2018; Wang et al. 2019) extended it along the temporal extent to implicitly capture motion cues for video-related applications, achieving better performance than traditional methods. However, these methods perform deformable convolution in a pairwise manner and thus fail to fully explore temporal correspondences across multiple frames. In this work, we propose STDC to jointly consider a video clip rather than splitting it into several reference-target frame pairs, leading to more effective use of contextual information.
Figure 2: Overview of the proposed framework for compressed video quality enhancement. Given a compressed video clip of 2R+1 concatenated frames, an offset prediction network is first adopted to generate the deformable offset field. With this offset field, spatio-temporal deformable convolution is then performed to fuse temporal information and produce fused feature maps. Finally, the QE network computes an enhancement residual map, and the final enhanced result is obtained by adding the residual map back to the compressed target frame. Herein, the temporal radius is R = 1 and the deformable kernel size is K = 3.
Figure 3: Visualization of EF and our STDC. Herein, red points represent the sampling positions of a 3 × 3 convolution window centered at the green points. STDC can adapt to both large temporal motion (ball) and small motion (head), and accordingly captures the relevant context for quality enhancement.

(a) Motion Compensation before Convolution
Figure 4: Comparison between motion compensation based convolution and spatio-temporal deformable convolution. Herein, a 3 × 3 convolution is used for demonstration.

In this way, the temporal dynamics within the video clip can be simultaneously modeled, as shown in Figure 3. Since the learnable offsets can be fractional, we follow Dai et al. (Dai et al. 2017) and apply differentiable bilinear interpolation to sample the sub-pixel values $I^{LQ}_t(p + p_k)$.

Unlike previous VQE methods (Yang et al. 2018; Guan et al. 2019), which perform explicit motion compensation before fusion to alleviate the effect of temporal motion, STDC implicitly combines motion cues with position-specific sampling during fusion. This leads to higher flexibility and robustness, because adjacent convolution windows can sample content independently, as shown in Figure 4.
Joint Deformable Offset Prediction. Different from optical flow estimation, which handles only one reference-target frame pair at a time, we propose to take the whole clip into consideration and jointly predict all deformable offsets at once. To this end, we apply an offset prediction network $F_{\theta_{op}}(\cdot)$ to predict an offset field $\Delta \in \mathbb{R}^{(2R+1) \times 2K^2 \times H \times W}$ for all spatio-temporal positions in the video clip as

$$\Delta = F_{\theta_{op}}\big(\big[I^{LQ}_{t_0-R}, \cdots, I^{LQ}_{t_0}, \cdots, I^{LQ}_{t_0+R}\big]\big) \tag{4}$$

where the frames are concatenated together as input. Since consecutive frames are highly correlated, offset prediction for one frame can benefit from the other frames, leading to more effective use of temporal information than the pairwise scheme. In addition, joint prediction is more computationally efficient, because all deformable offsets can be obtained in a single inference pass.

As shown in Figure 2-(a), we adopt a U-Net based network (Ronneberger, Fischer, and Brox 2015) for offset prediction to enlarge the receptive field and thereby capture large temporal dynamics. Convolutional and deconvolutional layers (Zeiler and Fergus 2014) with a stride of 2 are used for downsampling and upsampling, respectively. For convolutional layers with a stride of 1, zero padding is used to retain the feature size. For simplicity, we set the filter number of all (de)convolutional layers to $C_1$. The Rectified Linear Unit (ReLU) is adopted as the activation function for all layers except the last one, which is followed by a linear activation to regress the offset field $\Delta$. We do not use any normalization layer in the network.
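To make the fusion step concrete, the following is a minimal PyTorch sketch of the STDF idea, using torchvision's DeformConv2d as the deformable convolution primitive and assuming single-channel (Y) frames stacked along the channel dimension. The offset predictor here is a plain strided encoder-decoder stand-in for the U-Net described above; all class, argument, and variable names are illustrative rather than the authors' implementation.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d


class STDFModule(nn.Module):
    """Sketch of Spatio-Temporal Deformable Fusion over 2R+1 Y-channel frames."""

    def __init__(self, radius=1, kernel_size=3, c1=32, fused_channels=64):
        super().__init__()
        t = 2 * radius + 1                       # frames in the clip (2R+1)
        offset_ch = t * 2 * kernel_size ** 2     # (2R+1) x 2K^2 offset channels
        # Simplified stand-in for the U-Net offset predictor: a stride-2 conv
        # downsamples, a deconv upsamples, and the last layer is linear.
        self.offset_net = nn.Sequential(
            nn.Conv2d(t, c1, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c1, c1, 3, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c1, c1, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c1, offset_ch, 3, 1, 1),   # regresses the offset field Delta
        )
        # A single deformable convolution fuses all frames; with 2K^2 offset
        # channels per input frame, each frame gets its own deformed sampling grid.
        self.fusion = DeformConv2d(t, fused_channels, kernel_size,
                                   padding=kernel_size // 2)

    def forward(self, clip):
        # clip: (N, 2R+1, H, W), compressed frames concatenated channel-wise
        offsets = self.offset_net(clip)          # (N, (2R+1)*2K^2, H, W), Eq. (4)
        return self.fusion(clip, offsets)        # fused feature maps F
```

The offset tensor layout mirrors Eq. (4): one group of 2K² offset channels per input frame, so every frame is sampled on its own deformed grid before a single convolution fuses the whole clip.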
3.3 QE Module

The main idea of the QE module is to fully explore complementary information from the fused feature maps $F$ and accordingly generate the enhanced target frame $\hat{I}^{HQ}_{t_0}$. In order to take advantage of residual learning (Kim, Lee, and Lee 2016), we first learn a non-linear mapping $F_{\theta_{qe}}(\cdot)$ to predict the enhancement residual as

$$\hat{R}^{HQ}_{t_0} = F_{\theta_{qe}}(F) \tag{5}$$

The enhanced target frame can then be generated as

$$\hat{I}^{HQ}_{t_0} = \hat{R}^{HQ}_{t_0} + I^{LQ}_{t_0} \tag{6}$$

As illustrated in Figure 2-(c), we implement $F_{\theta_{qe}}(\cdot)$ through another CNN which consists of $L$ convolutional layers of stride 1. All layers except the last one have $C_2$ convolutional filters followed by ReLU activation. The last convolutional layer outputs the enhancement residual. Without bells and whistles, such a plain QE network is able to achieve satisfactory enhancement results.
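A minimal sketch of such a plain QE network is given below, taking the fused feature maps F and the compressed target frame as input and producing the enhanced frame via Eqs. (5) and (6). The channel widths and layer count stand in for C2 and L (their actual values are listed in Section 4.3), and the names are illustrative.

```python
import torch.nn as nn


class QENet(nn.Module):
    """Plain QE network: L stride-1 conv layers, C2 filters, ReLU, residual add."""

    def __init__(self, in_channels=64, c2=48, num_layers=8):
        super().__init__()
        layers = [nn.Conv2d(in_channels, c2, 3, 1, 1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(c2, c2, 3, 1, 1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(c2, 1, 3, 1, 1))    # last layer: enhancement residual
        self.body = nn.Sequential(*layers)

    def forward(self, fused_feats, compressed_target):
        residual = self.body(fused_feats)            # Eq. (5)
        return compressed_target + residual          # Eq. (6)
```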
3.4 Training Scheme

Since the STDF module and the QE module are fully convolutional and thus differentiable, we jointly optimize $\theta_{op}$ and $\theta_{qe}$ in an end-to-end fashion. The overall loss function $L$ is set to the Sum of Squared Error (SSE) between the enhanced target frame $\hat{I}^{HQ}_{t_0}$ and the raw one $I^{HQ}_{t_0}$ as

$$L = \big\| \hat{I}^{HQ}_{t_0} - I^{HQ}_{t_0} \big\|_2^2 \tag{7}$$

Note that, as there is no ground truth for the deformable offsets, learning of the offset prediction network $F_{\theta_{op}}(\cdot)$ is totally unsupervised and fully driven by the final loss $L$, which is different from previous works (Yang et al. 2018; Guan et al. 2019) that incorporate auxiliary losses to constrain optical flow estimation.
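As a sketch, the joint end-to-end optimization of the two modules then reduces to a standard training step driven by the SSE loss of Eq. (7). The STDFModule and QENet classes are the illustrative sketches above, train_loader is an assumed data loader yielding (compressed clip, raw target) pairs, and the Adam settings follow those reported later in Section 4.2.

```python
import torch

stdf, qe = STDFModule(radius=1), QENet()
params = list(stdf.parameters()) + list(qe.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

for clip_lq, target_hq in train_loader:           # clip_lq: (N, 2R+1, H, W)
    center = clip_lq.size(1) // 2
    target_lq = clip_lq[:, center:center + 1]     # compressed target frame
    enhanced = qe(stdf(clip_lq), target_lq)       # enhanced target frame
    loss = ((enhanced - target_hq) ** 2).sum()    # SSE loss, Eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```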
Image QE methods: AR-CNN (Dong et al. 2015), DnCNN (Zhang et al. 2017), RNAN (Zhang et al. 2019). Video QE methods: MFQE 1.0 (Yang et al. 2018), MFQE 2.0 (Guan et al. 2019), and our STDF-R1, STDF-R3, STDF-R3L.

QP | Class | Test Video† | AR-CNN | DnCNN | RNAN‡ | MFQE 1.0 | MFQE 2.0 | STDF-R1 | STDF-R3 | STDF-R3L
37 | A | Traffic | 0.27 / 0.50 | 0.35 / 0.64 | 0.40 / 0.86 | 0.50 / 0.90 | 0.59 / 1.02 | 0.56 / 0.92 | 0.65 / 1.04 | 0.73 / 1.15
37 | A | PeopleOnStreet | 0.37 / 0.76 | 0.54 / 0.94 | 0.74 / 1.30 | 0.80 / 1.37 | 0.92 / 1.57 | 1.05 / 1.66 | 1.18 / 1.82 | 1.25 / 1.96
37 | B | Kimono | 0.20 / 0.59 | 0.27 / 0.73 | 0.33 / 0.98 | 0.50 / 1.13 | 0.55 / 1.18 | 0.66 / 1.32 | 0.77 / 1.47 | 0.85 / 1.61
37 | B | ParkScene | 0.14 / 0.44 | 0.17 / 0.52 | 0.20 / 0.77 | 0.39 / 1.03 | 0.46 / 1.23 | 0.41 / 1.05 | 0.54 / 1.32 | 0.59 / 1.47
37 | B | Cactus | 0.20 / 0.41 | 0.28 / 0.53 | 0.35 / 0.76 | 0.44 / 0.88 | 0.50 / 1.00 | 0.59 / 1.06 | 0.70 / 1.23 | 0.77 / 1.38
37 | B | BQTerrace | 0.23 / 0.43 | 0.33 / 0.53 | 0.42 / 0.84 | 0.27 / 0.48 | 0.40 / 0.67 | 0.55 / 0.89 | 0.58 / 0.93 | 0.63 / 1.06
37 | B | BasketballDrive | 0.23 / 0.51 | 0.33 / 0.63 | 0.43 / 0.92 | 0.41 / 0.80 | 0.47 / 0.83 | 0.60 / 0.99 | 0.66 / 1.07 | 0.75 / 1.23
37 | C | RaceHorses | 0.23 / 0.49 | 0.31 / 0.70 | 0.39 / 0.99 | 0.34 / 0.55 | 0.39 / 0.80 | 0.41 / 0.98 | 0.48 / 1.09 | 0.55 / 1.35
37 | C | BQMall | 0.28 / 0.69 | 0.38 / 0.87 | 0.45 / 1.15 | 0.51 / 1.03 | 0.62 / 1.20 | 0.75 / 1.44 | 0.90 / 1.61 | 0.99 / 1.80
37 | C | PartyScene | 0.14 / 0.52 | 0.22 / 0.69 | 0.30 / 0.98 | 0.22 / 0.73 | 0.36 / 1.18 | 0.52 / 1.49 | 0.60 / 1.60 | 0.68 / 1.94
37 | C | BasketballDrill | 0.23 / 0.48 | 0.42 / 0.89 | 0.50 / 1.07 | 0.48 / 0.90 | 0.58 / 1.20 | 0.64 / 1.19 | 0.70 / 1.26 | 0.79 / 1.49
37 | D | RaceHorses | 0.26 / 0.59 | 0.34 / 0.80 | 0.42 / 1.02 | 0.51 / 1.13 | 0.59 / 1.43 | 0.63 / 1.51 | 0.73 / 1.75 | 0.83 / 2.08
37 | D | BQSquare | 0.21 / 0.30 | 0.30 / 0.46 | 0.32 / 0.63 | -0.01 / 0.15 | 0.34 / 0.65 | 0.75 / 1.03 | 0.91 / 1.13 | 0.94 / 1.25
37 | D | BlowingBubbles | 0.16 / 0.46 | 0.25 / 0.76 | 0.31 / 1.08 | 0.39 / 1.20 | 0.53 / 1.70 | 0.53 / 1.69 | 0.68 / 1.96 | 0.74 / 2.26
37 | D | BasketballPass | 0.26 / 0.63 | 0.38 / 0.83 | 0.46 / 1.08 | 0.63 / 1.38 | 0.73 / 1.55 | 0.80 / 1.54 | 0.95 / 1.82 | 1.08 / 2.12
37 | E | FourPeople | 0.40 / 0.56 | 0.54 / 0.73 | 0.70 / 0.97 | 0.66 / 0.85 | 0.73 / 0.95 | 0.83 / 1.01 | 0.92 / 1.07 | 0.94 / 1.17
37 | E | Johnny | 0.24 / 0.21 | 0.47 / 0.54 | 0.56 / 0.88 | 0.55 / 0.55 | 0.60 / 0.68 | 0.65 / 0.71 | 0.69 / 0.73 | 0.81 / 0.88
37 | E | KristenAndSara | 0.41 / 0.47 | 0.59 / 0.62 | 0.63 / 0.80 | 0.66 / 0.75 | 0.75 / 0.85 | 0.84 / 0.83 | 0.94 / 0.89 | 0.97 / 0.96
37 | - | Average | 0.25 / 0.50 | 0.36 / 0.69 | 0.44 / 0.95 | 0.46 / 0.88 | 0.56 / 1.09 | 0.65 / 1.18 | 0.75 / 1.32 | 0.83 / 1.51
32 | - | Average | 0.19 / 0.17 | 0.33 / 0.41 | 0.41 / 0.62 | 0.43 / 0.58 | 0.52 / 0.68 | 0.64 / 0.77 | 0.73 / 0.87 | 0.86 / 1.04
27 | - | Average | 0.16 / 0.09 | 0.33 / 0.26 | - / - | 0.40 / 0.34 | 0.49 / 0.42 | 0.59 / 0.47 | 0.67 / 0.53 | 0.72 / 0.57
22 | - | Average | 0.13 / 0.04 | 0.27 / 0.14 | - / - | 0.31 / 0.19 | 0.46 / 0.27 | 0.51 / 0.27 | 0.57 / 0.30 | 0.63 / 0.34

† Video resolutions: Class A (2560×1600), Class B (1920×1080), Class C (832×480), Class D (480×240), Class E (1280×720).
‡ Patch-wise enhancement is performed for the RNAN method due to memory restriction.

Table 1: Quantitative results of ΔPSNR (dB) / ΔSSIM (×10⁻²) on the test videos at 4 different QPs.
Method | 120p | 240p | 480p | 720p | 1080p | #Param (K)
DnCNN | 191.8 | 54.7 | 14.1 | 6.1 | 2.6 | 556
RNAN | 5.6 | - | - | - | - | 8957
MFQE 1.0 | - | 12.6 | 3.8 | 1.6 | 0.7 | 1788
MFQE 2.0 | - | 25.3 | 8.4 | 3.7 | 1.6 | 255
STDF-R1 | 141.9 | 38.9 | 9.9 | 4.2 | 1.8 | 330
STDF-R3 | 132.7 | 36.4 | 9.1 | 3.8 | 1.6 | 365
STDF-R3L | 96.6 | 23.8 | 5.9 | 2.5 | 1.0 | 1275

Table 2: Quantitative results of processing speed (FPS) at different resolutions and number of parameters. Speed is measured on an Nvidia GeForce GTX 1080 Ti GPU.
4 Experiments

4.1 Datasets

Following MFQE 2.0 (Guan et al. 2019), we collect a total of 130 uncompressed videos with various resolutions and contents from two databases, i.e., Xiph (Xiph.org) and VQEG (VQEG), where 106 of them are selected for training and the rest for validation. For testing, we adopt the dataset from the Joint Collaborative Team on Video Coding (Ohm et al. 2012) with 18 uncompressed videos. These test videos are widely used for video quality assessment and contain around 450 frames per video. We compress all the above videos with the latest H.265/HEVC reference software HM16.5 (https://hevc.hhi.fraunhofer.de) under the Low Delay P (LDP) configuration, as in previous work (Guan et al. 2019). The compression is conducted at 4 different Quantization Parameters (QPs), i.e., 22, 27, 32 and 37, in order to evaluate performance under different compression levels.
4.2 Implementation Details

The proposed method is implemented in the PyTorch framework, with reference to the MMDetection toolbox (Chen et al. 2019) for deformable convolution. For training, we randomly crop 64 × 64 clips from the raw and the corresponding compressed videos as training samples. Data augmentation (i.e., rotation or flip) is further used to better exploit those training samples. We train all models using the Adam optimizer (Kingma and Ba 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. The learning rate is initially set to $10^{-4}$ and kept fixed throughout training. We train 4 models from scratch for the 4 QPs, respectively. For evaluation, as in previous works, we only apply quality enhancement on the Y channel (i.e., the luminance component) in YUV/YCbCr space. We adopt incremental Peak Signal-to-Noise Ratio (ΔPSNR) and Structural Similarity (ΔSSIM) (Wang et al. 2004) to evaluate quality enhancement performance, which measure the improvement of the enhanced video over the compressed one. We also evaluate the complexity of a quality enhancement approach in terms of parameters and computational cost.
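For reference, ΔPSNR on the Y channel can be computed as sketched below, where the inputs are luminance frames in the [0, 255] range; this is an illustrative helper rather than the authors' evaluation code, and ΔSSIM is obtained analogously from per-frame SSIM scores.

```python
import numpy as np


def psnr(a, b, peak=255.0):
    """PSNR (dB) between two Y-channel frames given as uint8/float arrays."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)


def delta_psnr(raw_y, compressed_y, enhanced_y):
    """Improvement (dB) of the enhanced Y frame over the compressed one."""
    return psnr(enhanced_y, raw_y) - psnr(compressed_y, raw_y)
```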
4.3 Comparison to State-of-the-arts

We compare the proposed method with state-of-the-art image/video quality enhancement methods: AR-CNN (Dong et al. 2015), DnCNN (Zhang et al. 2017), RNAN (Zhang et al. 2019), MFQE 1.0 (Yang et al. 2018) and MFQE 2.0 (Guan et al. 2019).
Figure 5: Qualitative results at QP 37 (columns, left to right: compressed frame, compressed patch, AR-CNN, DnCNN, RNAN, MFQE 2.0, STDF-R1, STDF-R3, raw). Note that enhancement is only conducted on the luminance component for all methods. Video index (from top to bottom): BasketballPass, Kimono, FourPeople, Cactus.
For fair comparison, all image quality enhancement methods are retrained on our training set. Results of the video quality enhancement methods are cited from (Guan et al. 2019). Three variants of our method with different configurations (refer to the previous section for details) are evaluated: 1) STDF-R1 with R=1, $C_1$=32, $C_2$=48, L=8; 2) STDF-R3 with R=3, $C_1$=32, $C_2$=48, L=8; 3) STDF-R3L with R=3, $C_1$=64, $C_2$=64, L=16.

Quantitative Results. Table 1 and Table 2 present the quantitative results on accuracy and model complexity, respectively. As can be observed, our method consistently outperforms all compared methods in terms of average ΔPSNR and ΔSSIM on the 18 test videos. More specifically, at QP 37, our STDF-R1 outperforms MFQE 2.0 on most of the videos, with faster processing speed and a comparable number of parameters. We note that our STDF-R1 simply takes the preceding and succeeding frames as references, unlike MFQE 2.0, which utilizes high quality neighboring frames, thereby saving the computational cost of searching for those high quality frames in advance. As the temporal radius R increases to 3,

Qualitative Results. As shown in Figure 5, although image quality enhancement methods can decently reduce compression artifacts, the resulting frames usually become over-blurred and lack details. On the other hand, video quality enhancement methods achieve better enhancement results with the help of reference frames. Compared to MFQE 2.0, our STDF models are more robust to compression artifacts and can better explore spatio-temporal information, thereby leading to better restoration of structural details.

Quality Fluctuation. It has been observed that dramatic quality fluctuation exists in compressed video (Guan et al. 2019), which may severely break temporal consistency and degrade QoE. To investigate how our method helps with this, we plot the PSNR curves of 2 sequences in Figure 6. As can be seen, our STDF-R1 model can effectively enhance most of the low quality frames and alleviate quality fluctuation. By enlarging the temporal radius R to 3, our STDF-R3 model manages to take advantage of adjacent high quality frames, leading to better performance than the other compared methods.