Spatio-Temporal Deformable Convolution For Compressed Video Quality Enhancement
Abstract
Recent years have witnessed remarkable success of deep learning methods in quality enhancement for compressed video. To better explore temporal information, existing methods usually estimate optical flow for temporal motion compensation. However, since compressed video can be seriously distorted by various compression artifacts, the estimated optical flow tends to be inaccurate and unreliable, resulting in ineffective quality enhancement. In addition, optical flow estimation for consecutive frames is generally conducted in a pairwise manner, which is computationally expensive and inefficient. In this paper, we propose a fast yet effective method for compressed video quality enhancement by incorporating a novel Spatio-Temporal Deformable Fusion (STDF) scheme to aggregate temporal information. Specifically, the proposed STDF takes a target frame along with its neighboring reference frames as input to jointly predict an offset field that deforms the spatio-temporal sampling positions of convolution. As a result, complementary information from both target and reference frames can be fused within a single Spatio-Temporal Deformable Convolution (STDC) operation. Extensive experiments show that our method achieves state-of-the-art performance in compressed video quality enhancement in terms of both accuracy and efficiency.

Figure 1: Illustration of compression artifacts. Videos are compressed by the latest H.265/HEVC coding algorithm.

1 Introduction

Nowadays, video content makes up a major fraction of digital network traffic and is still growing (Wien 2015). To transmit video under limited bandwidth, video compression is indispensable for significantly reducing the bit-rate. However, compression algorithms, such as H.264/AVC (Wiegand et al. 2003) and H.265/HEVC (Sullivan, Ohm, and Wiegand 2013), often introduce various artifacts into the compressed video, especially at low bit-rates. As shown in Figure 1, such artifacts may considerably diminish video quality, resulting in degradation of Quality of Experience (QoE). The distorted content in low-quality compressed video may also reduce the performance of subsequent vision tasks (e.g., recognition, detection, tracking) in low-bandwidth applications (Galteri et al. 2017; Lu et al. 2019). Thus, it is crucial to study compressed video quality enhancement (VQE).

During the past decades, extensive work has been conducted on artifact removal and quality enhancement for single compressed images. Traditional methods (Foi, Katkovnik, and Egiazarian 2007; Zhang et al. 2013) reduce artifacts by optimizing the transform coefficients of a specific compression standard, and are therefore hard to extend to other compression schemes. With the recent advances in Convolutional Neural Networks (CNNs), CNN-based methods (Dong et al. 2015; Tai et al. 2017; Zhang et al. 2017; 2019) have also emerged for image quality enhancement. They usually learn a non-linear mapping that directly regresses the artifact-free image from a large amount of training data, leading to impressive results with high efficiency. However, these methods cannot be directly extended to compressed video, since they treat frames independently and thus fail to exploit temporal information.

On the other hand, there is only limited study on quality enhancement for compressed video. Yang et al. first proposed the Multi-Frame Quality Enhancement (MFQE 1.0) approach to leverage temporal information for VQE (Yang et al. 2018). Specifically, high quality frames in the compressed video are utilized as references to help enhance the quality of neighboring low quality target frames via a novel Multi-Frame CNN (MF-CNN). Recently, an upgraded version, MFQE 2.0 (Guan et al. 2019), was introduced to further improve the efficiency of MF-CNN and achieved state-of-the-art performance.
In order to aggregate information from the target frame and reference frames, both MFQE methods adopt a widely used temporal fusion scheme that incorporates dense optical flow for motion compensation (Kappeler et al. 2016; Caballero et al. 2017; Xue et al. 2017). However, this temporal fusion scheme may be suboptimal in the context of the VQE task. Since compression artifacts can seriously distort video content and break pixel-wise correspondences between frames, the estimated optical flow tends to be inaccurate and unreliable, thereby resulting in ineffective quality enhancement. In addition, optical flow estimation needs to be repeatedly performed for different reference-target frame pairs in a pairwise manner, which substantially increases the computational cost of exploring more reference frames.

To address the aforementioned issues, we introduce a Spatio-Temporal Deformable Fusion (STDF) scheme for the VQE task. Specifically, we propose to learn a novel Spatio-Temporal Deformable Convolution (STDC) to aggregate temporal information while avoiding explicit optical flow estimation. The main idea of STDC is to adaptively deform the spatio-temporal sampling positions of convolution so as to capture the most relevant context and exclude noisy content for quality enhancement of the target frame. To this end, we adopt a CNN-based predictor to jointly model the correspondence across target and reference frames, and accordingly regress those sampling positions within a single inference pass. The main contributions of this paper are summarized as follows:

• We propose an end-to-end CNN-based method for the VQE task, which incorporates a novel STDF scheme to aggregate temporal information.

• We analytically and experimentally compare the proposed STDF to prior fusion schemes, and demonstrate its higher flexibility and robustness.

• We quantitatively and qualitatively evaluate the proposed method on a VQE benchmark dataset and show that it achieves state-of-the-art performance in terms of accuracy and efficiency.

2 Related Work

Image and Video Quality Enhancement. Over the past decade, an increasing number of works have focused on quality enhancement for compressed images (Foi, Katkovnik, and Egiazarian 2007; Jancsary, Nowozin, and Rother 2012; Chang, Ng, and Zeng 2013; Zhang et al. 2013; Dong et al. 2015; Guo and Chao 2016; Zhang et al. 2017; 2019). Among them, CNN-based end-to-end methods achieve the current state-of-the-art performance. Specifically, Dong et al. first introduced the 4-layer AR-CNN (Dong et al. 2015) to remove various JPEG compression artifacts. Later, Zhang et al. managed to learn a very deep DnCNN (Zhang et al. 2017) with a residual learning scheme for several image restoration tasks. Most recently, Zhang et al. proposed an even deeper network, RNAN (Zhang et al. 2019), with a residual non-local attention mechanism to capture long-range dependencies between pixels, setting a new state of the art for image quality enhancement. These methods tend to apply large CNNs to capture discriminative features within an image, resulting in a large amount of computation and parameters. On the other hand, MFQE 1.0 (Yang et al. 2018) pioneered the use of a multi-frame CNN to take advantage of temporal information for compressed video quality enhancement, where high quality frames are utilized to help enhance the quality of adjacent low quality frames. To exploit long-range temporal information, Yang et al. later introduced a modified convolutional long short-term memory network (Yang et al. 2019) for video quality enhancement. Most recently, Guan et al. proposed MFQE 2.0 (Guan et al. 2019), which upgrades several key components of MFQE 1.0 and achieves state-of-the-art performance in terms of accuracy and speed.

Leveraging Temporal Information. It is crucial to leverage complementary information across multiple frames for video-related tasks. Karpathy et al. first introduced several convolution-based fusion schemes to combine spatio-temporal information for video classification (Karpathy et al. 2014). Kappeler et al. later investigated those fusion schemes for low-level vision tasks (Kappeler et al. 2016) and managed to improve accuracy by compensating motion across consecutive frames with a Total Variation (TV) based optical flow estimation algorithm. Caballero et al. further replaced the TV-based flow estimator with a CNN to enable end-to-end training (Caballero et al. 2017). Since then, temporal fusion with motion compensation has been widely adopted for various vision tasks (Xue et al. 2017; Yang et al. 2018; Kim et al. 2018; Guan et al. 2019). However, these methods rely heavily on accurate optical flow, which is hard to obtain due to general problems (e.g., occlusion, large motion) or task-specific problems (e.g., compression artifacts). To cope with this, several works bypass explicit optical flow estimation. Niklaus, Mai, and Liu proposed AdaConv (Niklaus, Mai, and Liu 2017) to adaptively generate convolution kernels by implicitly utilizing motion cues for video frame interpolation. Shi et al. introduced the ConvLSTM network to exploit contextual information from a long range of adjacent frames (Shi et al. 2015). In this work, we propose to combine motion cues with convolution to efficiently aggregate spatio-temporal information, which also omits explicit estimation of optical flow.

Deformable Convolution. Dai et al. first proposed to augment regular convolution with learnable sampling offsets to model complex geometric transformations for object detection (Dai et al. 2017; Zhu et al. 2019). Later, several works (Bertasius, Torresani, and Shi 2018; Tian et al. 2018; Wang et al. 2019) extended it along the temporal extent to implicitly capture motion cues for video-related applications, achieving better performance than traditional methods. However, these methods perform deformable convolution in a pairwise manner and thus fail to fully explore temporal correspondences across multiple frames. In this work, we propose STDC to jointly consider a video clip rather than splitting it into several reference-target frame pairs, leading to more effective use of contextual information.
Figure 2: Overview of the proposed framework for compressed video quality enhancement. Given a compressed video clip of 2R+1 concatenated frames, an offset prediction network is first adopted to generate the deformable offset field. With this offset field, spatio-temporal deformable convolution is then performed to fuse temporal information and produce fused feature maps. Finally, the QE network computes an enhancement residual map, and the final enhanced result is obtained by adding the residual map back to the compressed target frame. Herein, the temporal radius is R = 1 and the deformable kernel size is K = 3.
Figure 3: Visualization of EF and our STDC. Herein, red points represent the sampling positions of a 3 × 3 convolution window centered at the green points. STDC can adapt to both large temporal motion (ball) and small motion (head), and accordingly captures the relevant context for quality enhancement.

(a) Motion Compensation before Convolution
Figure 4: Comparison between motion compensation based convolution and spatio-temporal deformable convolution. Herein, a 3 × 3 convolution is used for demonstration.

In this way, the temporal dynamics within the video clip can be simultaneously modeled, as shown in Figure 3. Since the learnable offsets can be fractional, we follow Dai et al. (Dai et al. 2017) and apply differentiable bilinear interpolation to sample the sub-pixel values $I^{LQ}_t(p + p_k)$.

Unlike previous VQE methods (Yang et al. 2018; Guan et al. 2019), which perform explicit motion compensation before fusion to alleviate the effect of temporal motion, STDC implicitly combines motion cues with position-specific sampling during fusion. This leads to higher flexibility and robustness, because adjacent convolution windows can sample content independently, as shown in Figure 4.
Joint Deformable Offset Prediction. Different from optical flow estimation, which handles only one reference-target frame pair at a time, we propose to take the whole clip into consideration and jointly predict all deformable offsets at once. To this end, we apply an offset prediction network $F_{\theta_{op}}(\cdot)$ to predict an offset field $\Delta \in \mathbb{R}^{(2R+1) \times 2K^2 \times H \times W}$ for all spatio-temporal positions in the video clip as

$$\Delta = F_{\theta_{op}}\big(\big[I^{LQ}_{t_0-R}, \cdots, I^{LQ}_{t_0}, \cdots, I^{LQ}_{t_0+R}\big]\big) \tag{4}$$

where the frames are concatenated together as input. Since consecutive frames are highly correlated, offset prediction for one frame can benefit from the other frames, leading to more effective use of temporal information than the pairwise scheme. In addition, joint prediction is more computationally efficient, because all deformable offsets can be obtained in a single inference pass.

As shown in Figure 2-(a), we adopt a U-Net based network (Ronneberger, Fischer, and Brox 2015) for offset prediction to enlarge the receptive field and thereby capture large temporal dynamics. Convolutional and deconvolutional layers (Zeiler and Fergus 2014) with a stride of 2 are used for downsampling and upsampling, respectively. For convolutional layers with a stride of 1, zero padding is used to retain the feature size. For simplicity, we set the filter number of all (de)convolutional layers to $C_1$. The Rectified Linear Unit (ReLU) is adopted as the activation function for all layers except the last one, which is followed by a linear activation to regress the offset field $\Delta$. We do not use any normalization layer in the network.
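To make the fusion step concrete, the following is a minimal PyTorch sketch of the STDF idea, using torchvision's DeformConv2d as the deformable convolution primitive and assuming single-channel (Y) frames stacked along the channel dimension. The offset predictor here is a plain strided encoder-decoder stand-in for the U-Net described above; all class, argument, and variable names are illustrative rather than the authors' implementation.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d


class STDFModule(nn.Module):
    """Sketch of Spatio-Temporal Deformable Fusion over 2R+1 Y-channel frames."""

    def __init__(self, radius=1, kernel_size=3, c1=32, fused_channels=64):
        super().__init__()
        t = 2 * radius + 1                       # frames in the clip (2R+1)
        offset_ch = t * 2 * kernel_size ** 2     # (2R+1) x 2K^2 offset channels
        # Simplified stand-in for the U-Net offset predictor: a stride-2 conv
        # downsamples, a deconv upsamples, and the last layer is linear.
        self.offset_net = nn.Sequential(
            nn.Conv2d(t, c1, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c1, c1, 3, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c1, c1, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c1, offset_ch, 3, 1, 1),   # regresses the offset field Delta
        )
        # A single deformable convolution fuses all frames; with 2K^2 offset
        # channels per input frame, each frame gets its own deformed sampling grid.
        self.fusion = DeformConv2d(t, fused_channels, kernel_size,
                                   padding=kernel_size // 2)

    def forward(self, clip):
        # clip: (N, 2R+1, H, W), compressed frames concatenated channel-wise
        offsets = self.offset_net(clip)          # (N, (2R+1)*2K^2, H, W), Eq. (4)
        return self.fusion(clip, offsets)        # fused feature maps F
```

The offset tensor layout mirrors Eq. (4): one group of 2K² offset channels per input frame, so every frame is sampled on its own deformed grid before a single convolution fuses the whole clip.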
3.3 QE Module

The main idea of the QE module is to fully explore complementary information from the fused feature maps $F$ and accordingly generate the enhanced target frame $\hat{I}^{HQ}_{t_0}$. In order to take advantage of residual learning (Kim, Lee, and Lee 2016), we first learn a non-linear mapping $F_{\theta_{qe}}(\cdot)$ to predict the enhancement residual as

$$\hat{R}^{HQ}_{t_0} = F_{\theta_{qe}}(F) \tag{5}$$

The enhanced target frame can then be generated as

$$\hat{I}^{HQ}_{t_0} = \hat{R}^{HQ}_{t_0} + I^{LQ}_{t_0} \tag{6}$$

As illustrated in Figure 2-(c), we implement $F_{\theta_{qe}}(\cdot)$ through another CNN which consists of $L$ convolutional layers of stride 1. All layers except the last one have $C_2$ convolutional filters followed by ReLU activation. The last convolutional layer outputs the enhancement residual. Without bells and whistles, such a plain QE network is able to achieve satisfactory enhancement results.
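A minimal sketch of such a plain QE network is given below, taking the fused feature maps F and the compressed target frame as input and producing the enhanced frame via Eqs. (5) and (6). The channel widths and layer count stand in for C2 and L (their actual values are listed in Section 4.3), and the names are illustrative.

```python
import torch.nn as nn


class QENet(nn.Module):
    """Plain QE network: L stride-1 conv layers, C2 filters, ReLU, residual add."""

    def __init__(self, in_channels=64, c2=48, num_layers=8):
        super().__init__()
        layers = [nn.Conv2d(in_channels, c2, 3, 1, 1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(c2, c2, 3, 1, 1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(c2, 1, 3, 1, 1))    # last layer: enhancement residual
        self.body = nn.Sequential(*layers)

    def forward(self, fused_feats, compressed_target):
        residual = self.body(fused_feats)            # Eq. (5)
        return compressed_target + residual          # Eq. (6)
```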
3.4 Training Scheme

Since the STDF module and the QE module are fully convolutional and thus differentiable, we jointly optimize $\theta_{op}$ and $\theta_{qe}$ in an end-to-end fashion. The overall loss function $L$ is set to the Sum of Squared Error (SSE) between the enhanced target frame $\hat{I}^{HQ}_{t_0}$ and the raw one $I^{HQ}_{t_0}$ as

$$L = \big\| \hat{I}^{HQ}_{t_0} - I^{HQ}_{t_0} \big\|_2^2 \tag{7}$$

Note that, as there is no ground truth for the deformable offsets, learning of the offset prediction network $F_{\theta_{op}}(\cdot)$ is totally unsupervised and fully driven by the final loss $L$, which is different from previous works (Yang et al. 2018; Guan et al. 2019) that incorporate auxiliary losses to constrain optical flow estimation.
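As a sketch, the joint end-to-end optimization of the two modules then reduces to a standard training step driven by the SSE loss of Eq. (7). The STDFModule and QENet classes are the illustrative sketches above, train_loader is an assumed data loader yielding (compressed clip, raw target) pairs, and the Adam settings follow those reported later in Section 4.2.

```python
import torch

stdf, qe = STDFModule(radius=1), QENet()
params = list(stdf.parameters()) + list(qe.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

for clip_lq, target_hq in train_loader:           # clip_lq: (N, 2R+1, H, W)
    center = clip_lq.size(1) // 2
    target_lq = clip_lq[:, center:center + 1]     # compressed target frame
    enhanced = qe(stdf(clip_lq), target_lq)       # enhanced target frame
    loss = ((enhanced - target_hq) ** 2).sum()    # SSE loss, Eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```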
Image QE methods: AR-CNN (Dong et al. 2015), DnCNN (Zhang et al. 2017), RNAN (Zhang et al. 2019). Video QE methods: MFQE 1.0 (Yang et al. 2018), MFQE 2.0 (Guan et al. 2019), and our STDF-R1, STDF-R3, STDF-R3L.

QP | Class | Test Video† | AR-CNN | DnCNN | RNAN‡ | MFQE 1.0 | MFQE 2.0 | STDF-R1 | STDF-R3 | STDF-R3L
37 | A | Traffic | 0.27 / 0.50 | 0.35 / 0.64 | 0.40 / 0.86 | 0.50 / 0.90 | 0.59 / 1.02 | 0.56 / 0.92 | 0.65 / 1.04 | 0.73 / 1.15
37 | A | PeopleOnStreet | 0.37 / 0.76 | 0.54 / 0.94 | 0.74 / 1.30 | 0.80 / 1.37 | 0.92 / 1.57 | 1.05 / 1.66 | 1.18 / 1.82 | 1.25 / 1.96
37 | B | Kimono | 0.20 / 0.59 | 0.27 / 0.73 | 0.33 / 0.98 | 0.50 / 1.13 | 0.55 / 1.18 | 0.66 / 1.32 | 0.77 / 1.47 | 0.85 / 1.61
37 | B | ParkScene | 0.14 / 0.44 | 0.17 / 0.52 | 0.20 / 0.77 | 0.39 / 1.03 | 0.46 / 1.23 | 0.41 / 1.05 | 0.54 / 1.32 | 0.59 / 1.47
37 | B | Cactus | 0.20 / 0.41 | 0.28 / 0.53 | 0.35 / 0.76 | 0.44 / 0.88 | 0.50 / 1.00 | 0.59 / 1.06 | 0.70 / 1.23 | 0.77 / 1.38
37 | B | BQTerrace | 0.23 / 0.43 | 0.33 / 0.53 | 0.42 / 0.84 | 0.27 / 0.48 | 0.40 / 0.67 | 0.55 / 0.89 | 0.58 / 0.93 | 0.63 / 1.06
37 | B | BasketballDrive | 0.23 / 0.51 | 0.33 / 0.63 | 0.43 / 0.92 | 0.41 / 0.80 | 0.47 / 0.83 | 0.60 / 0.99 | 0.66 / 1.07 | 0.75 / 1.23
37 | C | RaceHorses | 0.23 / 0.49 | 0.31 / 0.70 | 0.39 / 0.99 | 0.34 / 0.55 | 0.39 / 0.80 | 0.41 / 0.98 | 0.48 / 1.09 | 0.55 / 1.35
37 | C | BQMall | 0.28 / 0.69 | 0.38 / 0.87 | 0.45 / 1.15 | 0.51 / 1.03 | 0.62 / 1.20 | 0.75 / 1.44 | 0.90 / 1.61 | 0.99 / 1.80
37 | C | PartyScene | 0.14 / 0.52 | 0.22 / 0.69 | 0.30 / 0.98 | 0.22 / 0.73 | 0.36 / 1.18 | 0.52 / 1.49 | 0.60 / 1.60 | 0.68 / 1.94
37 | C | BasketballDrill | 0.23 / 0.48 | 0.42 / 0.89 | 0.50 / 1.07 | 0.48 / 0.90 | 0.58 / 1.20 | 0.64 / 1.19 | 0.70 / 1.26 | 0.79 / 1.49
37 | D | RaceHorses | 0.26 / 0.59 | 0.34 / 0.80 | 0.42 / 1.02 | 0.51 / 1.13 | 0.59 / 1.43 | 0.63 / 1.51 | 0.73 / 1.75 | 0.83 / 2.08
37 | D | BQSquare | 0.21 / 0.30 | 0.30 / 0.46 | 0.32 / 0.63 | -0.01 / 0.15 | 0.34 / 0.65 | 0.75 / 1.03 | 0.91 / 1.13 | 0.94 / 1.25
37 | D | BlowingBubbles | 0.16 / 0.46 | 0.25 / 0.76 | 0.31 / 1.08 | 0.39 / 1.20 | 0.53 / 1.70 | 0.53 / 1.69 | 0.68 / 1.96 | 0.74 / 2.26
37 | D | BasketballPass | 0.26 / 0.63 | 0.38 / 0.83 | 0.46 / 1.08 | 0.63 / 1.38 | 0.73 / 1.55 | 0.80 / 1.54 | 0.95 / 1.82 | 1.08 / 2.12
37 | E | FourPeople | 0.40 / 0.56 | 0.54 / 0.73 | 0.70 / 0.97 | 0.66 / 0.85 | 0.73 / 0.95 | 0.83 / 1.01 | 0.92 / 1.07 | 0.94 / 1.17
37 | E | Johnny | 0.24 / 0.21 | 0.47 / 0.54 | 0.56 / 0.88 | 0.55 / 0.55 | 0.60 / 0.68 | 0.65 / 0.71 | 0.69 / 0.73 | 0.81 / 0.88
37 | E | KristenAndSara | 0.41 / 0.47 | 0.59 / 0.62 | 0.63 / 0.80 | 0.66 / 0.75 | 0.75 / 0.85 | 0.84 / 0.83 | 0.94 / 0.89 | 0.97 / 0.96
37 | - | Average | 0.25 / 0.50 | 0.36 / 0.69 | 0.44 / 0.95 | 0.46 / 0.88 | 0.56 / 1.09 | 0.65 / 1.18 | 0.75 / 1.32 | 0.83 / 1.51
32 | - | Average | 0.19 / 0.17 | 0.33 / 0.41 | 0.41 / 0.62 | 0.43 / 0.58 | 0.52 / 0.68 | 0.64 / 0.77 | 0.73 / 0.87 | 0.86 / 1.04
27 | - | Average | 0.16 / 0.09 | 0.33 / 0.26 | - / - | 0.40 / 0.34 | 0.49 / 0.42 | 0.59 / 0.47 | 0.67 / 0.53 | 0.72 / 0.57
22 | - | Average | 0.13 / 0.04 | 0.27 / 0.14 | - / - | 0.31 / 0.19 | 0.46 / 0.27 | 0.51 / 0.27 | 0.57 / 0.30 | 0.63 / 0.34

† Video resolutions: Class A (2560×1600), Class B (1920×1080), Class C (832×480), Class D (480×240), Class E (1280×720).
‡ Patch-wise enhancement is performed for the RNAN method due to memory restriction.

Table 1: Quantitative results of ΔPSNR (dB) / ΔSSIM (×10⁻²) on the test videos at 4 different QPs.
Method | 120p | 240p | 480p | 720p | 1080p | #Param (K)
DnCNN | 191.8 | 54.7 | 14.1 | 6.1 | 2.6 | 556
RNAN | 5.6 | - | - | - | - | 8957
MFQE 1.0 | - | 12.6 | 3.8 | 1.6 | 0.7 | 1788
MFQE 2.0 | - | 25.3 | 8.4 | 3.7 | 1.6 | 255
STDF-R1 | 141.9 | 38.9 | 9.9 | 4.2 | 1.8 | 330
STDF-R3 | 132.7 | 36.4 | 9.1 | 3.8 | 1.6 | 365
STDF-R3L | 96.6 | 23.8 | 5.9 | 2.5 | 1.0 | 1275

Table 2: Quantitative results of processing speed (FPS) at different resolutions and number of parameters. Speed is measured on an Nvidia GeForce GTX 1080 Ti GPU.
4 Experiments

4.1 Datasets

Following MFQE 2.0 (Guan et al. 2019), we collect a total of 130 uncompressed videos with various resolutions and contents from two databases, i.e., Xiph (Xiph.org) and VQEG (VQEG), where 106 of them are selected for training and the rest for validation. For testing, we adopt the dataset from the Joint Collaborative Team on Video Coding (Ohm et al. 2012) with 18 uncompressed videos. These test videos are widely used for video quality assessment and contain around 450 frames per video. We compress all the above videos with the latest H.265/HEVC reference software HM16.5 (https://hevc.hhi.fraunhofer.de) under the Low Delay P (LDP) configuration, as in previous work (Guan et al. 2019). The compression is conducted at 4 different Quantization Parameters (QPs), i.e., 22, 27, 32 and 37, in order to evaluate performance under different compression levels.
4.2 Implementation Details

The proposed method is implemented in the PyTorch framework, with reference to the MMDetection toolbox (Chen et al. 2019) for deformable convolution. For training, we randomly crop 64 × 64 clips from the raw and the corresponding compressed videos as training samples. Data augmentation (i.e., rotation or flip) is further used to better exploit those training samples. We train all models using the Adam optimizer (Kingma and Ba 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. The learning rate is initially set to $10^{-4}$ and kept fixed throughout training. We train 4 models from scratch for the 4 QPs, respectively. For evaluation, as in previous works, we only apply quality enhancement on the Y channel (i.e., the luminance component) in YUV/YCbCr space. We adopt incremental Peak Signal-to-Noise Ratio (ΔPSNR) and Structural Similarity (ΔSSIM) (Wang et al. 2004) to evaluate quality enhancement performance, which measure the improvement of the enhanced video over the compressed one. We also evaluate the complexity of a quality enhancement approach in terms of parameters and computational cost.
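For reference, ΔPSNR on the Y channel can be computed as sketched below, where the inputs are luminance frames in the [0, 255] range; this is an illustrative helper rather than the authors' evaluation code, and ΔSSIM is obtained analogously from per-frame SSIM scores.

```python
import numpy as np


def psnr(a, b, peak=255.0):
    """PSNR (dB) between two Y-channel frames given as uint8/float arrays."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)


def delta_psnr(raw_y, compressed_y, enhanced_y):
    """Improvement (dB) of the enhanced Y frame over the compressed one."""
    return psnr(enhanced_y, raw_y) - psnr(compressed_y, raw_y)
```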
4.3 Comparison to State-of-the-arts

We compare the proposed method with state-of-the-art image/video quality enhancement methods: AR-CNN (Dong et al. 2015), DnCNN (Zhang et al. 2017), RNAN (Zhang et al. 2019), MFQE 1.0 (Yang et al. 2018) and MFQE 2.0 (Guan et al. 2019).
Figure 5: Qualitative results at QP 37 (columns, left to right: compressed frame, compressed patch, AR-CNN, DnCNN, RNAN, MFQE 2.0, STDF-R1, STDF-R3, raw). Note that enhancement is only conducted on the luminance component for all methods. Video index (from top to bottom): BasketballPass, Kimono, FourPeople, Cactus.
For fair comparison, all image quality enhancement methods are retrained on our training set. Results of the video quality enhancement methods are cited from (Guan et al. 2019). Three variants of our method with different configurations (refer to the previous section for details) are evaluated: 1) STDF-R1 with R=1, $C_1$=32, $C_2$=48, L=8; 2) STDF-R3 with R=3, $C_1$=32, $C_2$=48, L=8; 3) STDF-R3L with R=3, $C_1$=64, $C_2$=64, L=16.

Quantitative Results. Table 1 and Table 2 present the quantitative results on accuracy and model complexity, respectively. As can be observed, our method consistently outperforms all compared methods in terms of average ΔPSNR and ΔSSIM on the 18 test videos. More specifically, at QP 37, our STDF-R1 outperforms MFQE 2.0 on most of the videos, with faster processing speed and a comparable number of parameters. We note that our STDF-R1 simply takes the preceding and succeeding frames as references, unlike MFQE 2.0, which utilizes high quality neighboring frames, thereby saving the computational cost of searching for those high quality frames in advance. As the temporal radius R increases to 3,

Qualitative Results. As shown in Figure 5, although image quality enhancement methods can decently reduce compression artifacts, the resulting frames usually become over-blurred and lack details. On the other hand, video quality enhancement methods achieve better enhancement results with the help of reference frames. Compared to MFQE 2.0, our STDF models are more robust to compression artifacts and can better explore spatio-temporal information, thereby leading to better restoration of structural details.

Quality Fluctuation. It has been observed that dramatic quality fluctuation exists in compressed video (Guan et al. 2019), which may severely break temporal consistency and degrade QoE. To investigate how our method helps with this, we plot the PSNR curves of 2 sequences in Figure 6. As can be seen, our STDF-R1 model can effectively enhance most of the low quality frames and alleviate quality fluctuation. By enlarging the temporal radius R to 3, our STDF-R3 model manages to take advantage of adjacent high quality frames, leading to better performance than the other compared methods.