Unfolding a blurred image

A. N. Rajagopalan (1), Kuldeep Purohit (1), Anshul Shah (2)*
(1) Indian Institute of Technology Madras, India   (2) University of Maryland, College Park
arXiv:2201.12010v1 [cs.CV] 28 Jan 2022
[email protected], [email protected], [email protected]
(* Work done while at Indian Institute of Technology Madras, India.)

Abstract

We present a solution for the goal of extracting a video from a single motion blurred image to sequentially reconstruct the clear views of a scene as beheld by the camera during the time of exposure. We first learn motion representation from sharp videos in an unsupervised manner through training of a convolutional recurrent video autoencoder network that performs a surrogate task of video reconstruction. Once trained, it is employed for guided training of a motion encoder for blurred images. This network extracts embedded motion information from the blurred image to generate a sharp video in conjunction with the trained recurrent video decoder. As an intermediate step, we also design an efficient architecture that enables real-time single image deblurring and outperforms competing methods across all factors: accuracy, speed, and compactness. Experiments on real scenes and standard datasets demonstrate the superiority of our framework over the state-of-the-art and its ability to generate a plausible sequence of temporally consistent sharp frames.

1. Introduction

Recent works on future frame prediction reveal that direct intensity estimation leads to blurred predictions. Instead, if a frame is reconstructed from the original image and corresponding transformations, both scene dynamics and invariant appearance can be preserved well. Based on this premise, [5, 51] and [16] model the task as a flow of image pixels. The methods of [46, 48] generate a video from a single sharp image, but have a severe limitation in that they work only on the specific scene for which they are trained. All of these approaches operate only on sharp images and videos. However, motion during exposure is known to cause severe degradation in the captured image quality due to the blur it induces. This is usually the case in low-light situations, where the exposure time of each frame is high, and in scenes where significant motion happens within the exposure time. In [42], it has been shown that standard network models used for vision tasks and trained only on high-quality images suffer a significant degradation in performance when applied to images degraded by blur.

Motion deblurring is a challenging problem in computer vision due to its ill-posed nature. Recent years have witnessed significant advances in deblurring [44, 25, 23, 2, 27, 28, 45, 26, 33, 32, 21, 22, 20, 43, 31, 30, 17, 29, 18]. Several methods [39, 24, 4, 37, 3, 9, 12, 13] have been proposed to address this problem using hand-designed priors as well as Convolutional Neural Networks (CNNs) [1, 35, 36] for recovering the latent image. A few methods [40, 6] have been proposed to remove heterogeneous blur, but they are limited in their ability to handle general dynamic scenes. Most of these methods rely strongly on the accuracy of the assumed image degradation model and involve intensive, sometimes heuristic, parameter tuning and expensive computations, factors which severely restrict their accuracy and applicability in real-world scenarios. The recent works of [19, 21, 14, 41] overcome these limitations to some extent by learning to directly generate the latent sharp image, without the need for blur kernel estimation.

We present a two-stage deep convolutional architecture to carve out a video from a motion blurred image that is applicable to non-uniform motion caused by individual or combined effects of camera motion, object motion and arbitrary depth variations in the scene. We avoid overly simplified models to represent motion and hence refrain from creating synthetic datasets for supervised training. The first stage consists of training a video autoencoder, wherein the encoder accepts a sequence of video frames to extract a latent motion representation, while the decoder estimates the same video by applying the estimated motion trajectories to a single sharp frame in a recurrent fashion. We then use this trained video decoder to guide the training of a CNN, which we refer to as the Blurred Image Encoder (BIE), to extract the same motion information from a blurred image as the video encoder would from the image sequence corresponding to that blurred image.
For testing, we propose an efficient deblurring network to first estimate a sharp frame from the given blurred image. The BIE is responsible for extracting motion features from the blurred image. The video decoder uses the outputs of the BIE and the deblurred sharp frame to generate the video underlying the motion blurred image. As the only other work of this kind, [8] very recently proposed a method to estimate a video from a single blurred image by training multiple neural networks to estimate the underlying frames. In contrast, our architecture utilizes a single recurrent neural network to generate the entire sequence. Our recurrent design implicitly addresses temporal ambiguity to a large extent, since the generation of any frame in the sequence is naturally conditioned on all the previous frames. The approach of [8] is limited to small motion, owing to its architecture and training procedure. We estimate pixel-level motion instead of intensities, which proves to be an advantage for the task at hand, especially in cases with large blur (which is an issue with [8]). Our deblurring architecture not only outperforms all existing deblurring methods but is also smaller and significantly faster. In fact, separating the processes of content and motion estimation allows our architecture to be used with any off-the-shelf deblurring approach.

2. The Proposed Architecture

In our proposed video autoencoder, the encoder utilizes all the video frames to extract a latent representation, which is then fed to the decoder that estimates the frame sequence in a recurrent fashion. The Recurrent Video Encoder (RVE) reads N sharp frames x_{1..N}, one at each time-step. It returns a tensor at the last time-step, which is utilized as the motion representation of the image sequence. This tensor is used to initialize the first hidden state of another ConvLSTM-based network called the Recurrent Video Decoder (RVD), whose task is to recurrently estimate N optical flows. Since the RVE-RVD pair is trained using a reconstruction loss between the estimated frames x̂_{1..N} and the ground-truth frames x_{1..N}, the RVD must return the predicted video. To enable this, the (known) central frame of the video is acted upon by the flows predicted by the RVD. Specifically, the estimated flows are individually fed to a differentiable transformation layer to transform the central frame x_{⌊N/2⌋} and obtain the frames x̂_{1..N}. Once trained, we have an RVD which can estimate sequential motion flows, given a particular motion representation.

In addition, we introduce another network called the Blurred Image Encoder (BIE), whose task is to accept a blurred image x_B, corresponding to the spatio-temporal average of the input frames x_{1..N}, and return a motion encoding which too can be used to generate a sharp video. To achieve this, we employ the already trained RVD to guide the training of the BIE so as to extract the same motion information from the blurred image as the RVE would from the corresponding image sequence. In other words, the weights are to be learnt such that BIE(x_B) ≈ RVE(x_{1..N}). We refrain from using the encoding returned by the RVE as a training target due to the lack of ground truth for the encoded representation. Instead, the BIE is trained such that the predicted video at the output of the RVD for the given x_B matches the ground-truth frames x_{1..N} as closely as possible. This ensures that the BIE learns to capture ordered motion information for the RVD to return a realistic video. Directly training the BIE-RVD pair from scratch poses a challenge, since it requires learning to perform two tasks jointly: "video generation from motion representation" and "ambiguity-invariant motion extraction from a blurred image". Such training delivers below-par performance (see supplementary material).

The overall architecture of the proposed methodology is given in Fig. 1. It is fully convolutional, end-to-end differentiable, and can be trained using unlabeled high frame-rate videos, without the need for optical flow supervision, which is challenging to produce at large scale. During testing, the central sharp frame is not available and is estimated using an independently trained deblurring module (DM). We now describe the design aspects of the different modules.

Figure 1. An overview of our video generation architecture during training. The first step involves training the RVE-RVD pair for the task of video reconstruction. This is followed by guided training of the BIE through the trained RVD. (RVE: Recurrent Video Encoder; RVD: Recurrent Video Decoder; BIE: Blurred Image Encoder.)
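To make the two-stage procedure concrete, the following PyTorch-style sketch shows one way the training could be wired up. It is a minimal illustration under stated assumptions, not the authors' released code: the module objects `rve`, `rvd`, `bie`, the data `loader`, the optimizer `opt`, the BIE call signature (blurred image plus sharp central frame), and the `oi_loss` placeholder for the ordering-invariant loss of Sec. 2.3 are all hypothetical.

```python
import torch

def train_stage1(rve, rvd, loader, opt):
    """Stage 1: train the RVE-RVD pair as a video autoencoder (frame reconstruction)."""
    for frames in loader:                          # frames: (B, N, C, H, W) sharp frames
        center = frames[:, frames.shape[1] // 2]   # known central frame x_{floor(N/2)}
        motion = rve(frames)                       # latent motion representation
        recon = rvd(motion, center)                # recurrently predicted frames (B, N, C, H, W)
        loss = torch.nn.functional.l1_loss(recon, frames)
        opt.zero_grad(); loss.backward(); opt.step()

def train_stage2(bie, rvd, loader, opt, oi_loss):
    """Stage 2: guided training of the BIE through the trained RVD (RVD is fine-tuned)."""
    for frames in loader:
        blurred = frames.mean(dim=1)               # spatio-temporal average approximates x_B
        center = frames[:, frames.shape[1] // 2]   # assumed extra input to the BIE (Sec. 2.3)
        motion = bie(blurred, center)
        recon = rvd(motion, center)
        loss = oi_loss(recon, frames)              # ordering-invariant reconstruction loss
        # (a smoothness term on the predicted flows, mentioned in Sec. 2.3, is omitted here)
        opt.zero_grad(); loss.backward(); opt.step()
```

At test time, the central frame fed to the BIE and RVD would come from the deblurring module of Sec. 2.4 rather than from ground truth.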
2.1. Recurrent Video Encoder (RVE)

At each time-step, a frame is fed to a convolutional encoder, which generates a feature map that is passed as input to a ConvLSTM cell. Interpreting the ConvLSTM's hidden states as a representation of motion, the kernel size of the ConvLSTM correlates with the speed of the motion it can capture. Since we need to extract motion taking place within a single exposure at fine resolution, we choose a kernel size of 3 × 3. As can be seen in Fig. 2(a), the encoder is made of 4 convolutional blocks with 3 × 3 filters. The first block is a conv layer with a stride of 1, and each of the remaining blocks contains a conv layer with a stride of 2 followed by a ResBlock. The numbers of feature maps at the outputs of these blocks are 16, 32, 64 and 128, respectively. A ConvLSTM cell operates on the features returned by the last block and augments them with memory from previous time-steps. Overall, each time-step can be represented as h_n^enc = enc(h_{n-1}^enc, x_n), where h_n^enc is the encoder ConvLSTM state at time-step n and x_n is the n-th sharp frame of the video.

Figure 2. Architectures of the RVE (a) and the BIE (b). The RVE blocks are A1: 3×3×16, stride 1; A2: 3×3×32, stride 2; A3: 3×3×64, stride 2; A4: 3×3×128, stride 2, followed by a ConvLSTM cell. The RVE is trained to extract a motion representation from a sequence of frames, while the BIE is trained to extract a motion representation from a blurred image and a sharp image.
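For concreteness, a minimal PyTorch sketch of an RVE-style encoder is given below. The ResBlock design and the exact ConvLSTM formulation are not fully specified above, so the `ConvLSTMCell` here uses a standard gate layout and should be read as an assumption rather than the authors' exact module.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell with a 3x3 kernel (assumed formulation)."""
    def __init__(self, in_ch, hid_ch, ksize=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, ksize, padding=ksize // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class RVE(nn.Module):
    """Recurrent Video Encoder: 4 conv blocks (16/32/64/128 channels) + ConvLSTM."""
    def __init__(self):
        super().__init__()
        self.b1 = nn.Conv2d(3, 16, 3, stride=1, padding=1)
        self.b2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), ResBlock(32))
        self.b3 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), ResBlock(64))
        self.b4 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), ResBlock(128))
        self.cell = ConvLSTMCell(128, 128)

    def forward(self, frames):                      # frames: (B, N, 3, H, W), H and W divisible by 8
        B, N, _, H, W = frames.shape
        h = frames.new_zeros(B, 128, H // 8, W // 8)
        c = torch.zeros_like(h)
        for n in range(N):                          # h_n^enc = enc(h_{n-1}^enc, x_n)
            feat = self.b4(self.b3(self.b2(self.b1(frames[:, n]))))
            h, c = self.cell(feat, (h, c))
        return h                                    # motion representation: last hidden state
```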
2.2. Recurrent Video Decoder (RVD)

The task of the RVD is to construct a sequence of frames using the motion representation provided by the RVE and the (known) central frame x_{⌊N/2⌋} of the sequence. The RVD contains a flow encoder whose structure is similar to that of the RVE; instead of accepting images, it accepts optical flows. The flow encoding is fed to a ConvLSTM cell whose first hidden state is initialized with the last hidden state h_N^enc of the RVE. To estimate the optical flows for a time-step, the output of the ConvLSTM cell is passed to a Flow Decoder network (F_D). The flow estimated by F_D at each time-step is fed to a transformer module (T), the differentiable warping layer introduced above, which applies the flow to the central frame and returns the estimated frame x̂_n. The design of F_D is described below.

Flow Decoder (F_D): Since the flow at the current step is related to the previous one, we perform recurrence on the optical flows of consecutive frames. The design of F_D is illustrated in Fig. 3. F_D accepts the output of the ConvLSTM unit at any time-step and generates a flow map. For robust estimation, we further estimate flow at multiple scales using deconvolution (deconv) layers, which "unpool" the feature maps and increase the spatial dimensions by a factor of 2. Inspired by [34], we make use of skip connections between the layers of the flow encoder and F_D. All deconv operations use 4 × 4 filters and the convolutional operations use 3 × 3 filters. The output of the ConvLSTM cell is passed through a convolutional layer to estimate the flow f_{n,1}. The cell output is also passed through a deconv layer before being concatenated with the upsampled f_{n,1} and the corresponding feature map coming from the encoder, to obtain a hybrid feature map at that scale. As shown in Fig. 3, this process is repeated 3 more times to obtain the flow maps at subsequently higher scales (f_{n,2..4}).

Figure 3. Our Recurrent Video Decoder (RVD). This module recurrently generates optical flows which are used to warp the sharp frame. Flows are estimated at 4 different scales.
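The transformer module T is a differentiable warping layer. Below is a minimal sketch of such a layer together with the recurrent decoding loop it sits in, using the standard grid-sampling formulation; the `flow_encoder`, `flow_decoder` and `rvd_cell` objects are treated as black boxes, so their names and signatures are illustrative assumptions rather than the paper's exact modules.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Differentiable transformation layer T: warp img (B,C,H,W) with flow (B,2,H,W) in pixels."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=img.device),
                            torch.arange(W, device=img.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]                 # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]                 # displaced y coordinates
    grid = torch.stack((2.0 * grid_x / (W - 1) - 1.0,      # normalize to [-1, 1] for grid_sample
                        2.0 * grid_y / (H - 1) - 1.0), dim=-1)
    return F.grid_sample(img, grid, align_corners=True)

def decode_video(rvd_cell, flow_encoder, flow_decoder, motion, center, n_frames):
    """Sketch of the RVD recurrence: the hidden state is initialized with the RVE embedding."""
    h, c = motion, torch.zeros_like(motion)
    flow = torch.zeros(center.shape[0], 2, center.shape[2], center.shape[3],
                       device=center.device)
    frames = []
    for _ in range(n_frames):
        # flow_encoder downsamples the previous flow to the hidden-state resolution (as in the RVE)
        h, c = rvd_cell(flow_encoder(flow), (h, c))       # recurrence on consecutive flows
        flow = flow_decoder(h)                            # finest-scale flow map f_{n,4}
        frames.append(warp(center, flow))                 # transform the central frame
    return torch.stack(frames, dim=1)                     # (B, N, C, H, W)
```

In the paper's design the flow decoder additionally emits coarser-scale flows (f_{n,1..3}) used for supervision; only the finest scale is shown here for brevity.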
2.3. Blurred Image Encoder (BIE)

We make use of the trained encoder-decoder couplet to solve the task of extracting a video from a blurred image. We advocate a novel strategy of utilizing spatio-temporal embeddings to guide the training of a CNN. The trained decoder has learnt to generate optical flow for all time-steps from the encoder's hidden state, and we use this proxy network to solve the task of blurred-image-to-video generation. The use of optical flow recurrence enables our network to prefer temporally consistent sequences, which prevents it from returning arbitrarily ordered frames. However, directional ambiguity remains. For a scene with multiple objects, the ambiguity becomes more pronounced, as each object can have its own independent motion. The BIE is connected with the pre-trained RVD and the pair is trained (the RVD is fine-tuned) using a combination of an ordering-invariant frame reconstruction loss and a spatial motion smoothness loss over the RVD outputs (described later). No such ambiguity exists in the video autoencoder, since the RVD has to exactly reproduce the video which is fed to the RVE.

The BIE is implemented as a CNN which specializes in extracting motion features from a blurred image (we experimentally found that feeding the central sharp frame along with the blurred image improves its performance). The BIE is tasked with extracting the sequential motion in the image by capturing local motion cues, e.g. at the smeared edges in the image. Moreover, the generated encoding should be such that the RVD can reconstruct motion trajectories from it. The BIE has 7 convolutional layers with kernel sizes as shown in Fig. 2(b). Each layer (except the last) is followed by batch normalization and a leaky ReLU non-linearity.

2.4. Deblurring Module (DM)

We propose an independent network for deblurring the motion blurred observation. The estimated sharp frame is fed to both the BIE and the RVD during testing.

Recent works on image restoration have proposed end-to-end trainable networks which require labeled pairs of degraded and sharp images. Among them, [19, 41] have achieved promising results using multi-scale CNNs composed of residual connections. We explore a more effective network architecture inspired by prior methods that use multi-level and multi-scale features. Our high-level design is similar to that of U-Net [34], which has been used extensively for preserving global context information in various image-to-image tasks [7]. Based on the observation that an increase in the number of layers and the connections across them boosts feature extraction capability, the encoder of our network utilizes a cascade of Residual Dense Blocks (RDB) [50] instead of plain convolutional layers. An RDB is a cascade of convolutional layers connected through a rich set of residual and concatenation connections, which greatly improves feature extraction by reusing features across multiple layers. Inclusion of such connections maximizes information flow along the intermediate layers and results in better convergence. These units efficiently learn deeper and more complex features than a network with only residual connections (which have been used extensively in recent deblurring methods [19, 14, 41, 8]), while requiring fewer parameters.

Our proposed deblurring architecture is depicted in Fig. 4. The decoder part of our network contains 3 up-sampling blocks, each consisting of a bottleneck layer [10] followed by a deconvolution layer, to gradually enlarge the spatial resolution of the feature maps. Each convolution layer (except the last) is followed by a non-linearity. Similar to U-Net, features of the same dimension in the encoder and decoder are merged with the help of projection layers. The output of the final up-sampling block is passed through two additional convolutional layers to reconstruct the output sharp image. Our network uses an asymmetric encoder-decoder architecture, where the encoder has higher capacity, benefiting from the dense connections.

Figure 4. An overview of our dense deblurring architecture, which we utilize to estimate the central sharp frame. It follows an encoder-decoder design with residual dense blocks, bottleneck blocks, and skip connections present at 3 different sub-scales.

Table 1. Performance comparison of our deblurring network with existing methods on the benchmark dataset of [19].

Method           PSNR (dB)   SSIM    Time (s)   Size (MB)
Input image      21          0.740   -          -
Xu et al. [49]   24.6        0.845   3800       -
Whyte et al. [47]  -         -       700        -
Sun et al. [40]  24.5        0.851   1500       54.1
Gong et al. [6]  26.4        0.863   1200       41.2
Nah et al. [19]  28.9        0.911   6          300
DeblurGAN [14]   27.2        0.905   0.8        45.6
SRN [41]         30.10       0.933   0.4        27.5
Ours             30.58       0.941   0.02       17.9
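As an illustration of the residual dense blocks that form the encoder of the deblurring module described above, here is a compact PyTorch sketch. The growth rate, the number of layers per block, and the overall block count are not specified in the text, so the values used here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Residual Dense Block [50]: densely connected convs + 1x1 local fusion + residual."""
    def __init__(self, ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch + i * growth, growth, 3, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(n_layers))
        self.fuse = nn.Conv2d(ch + n_layers * growth, ch, 1)   # local feature fusion

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))       # reuse all previous features
        return x + self.fuse(torch.cat(feats, dim=1))          # local residual learning

class UpBlock(nn.Module):
    """Up-sampling block: 1x1 bottleneck followed by a stride-2 deconvolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bottleneck = nn.Conv2d(in_ch, out_ch, 1)
        self.deconv = nn.ConvTranspose2d(out_ch, out_ch, 4, stride=2, padding=1)

    def forward(self, x):
        return torch.relu(self.deconv(torch.relu(self.bottleneck(x))))
```

A full DM in the spirit of Fig. 4 would stack such RDBs in a U-Net-style encoder with stride-2 convolutions between scales, mirror them with three UpBlocks, and merge same-resolution encoder and decoder features through projection layers.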
3. Experiments

In this section, we carry out quantitative and qualitative comparisons of our approach with state-of-the-art methods for deblurring as well as for the video extraction task.

Figure 5. Visual comparisons of deblurring results on the test dataset of [19] (best viewed at high resolution): blurred image, blurred patch, Whyte et al. [47], Nah et al. [19], DeblurGAN [14], SRN [41], and ours.

3.1. Results and Comparisons for Video Extraction

In Fig. 6, we give results on standard test blurred images from the dataset of [19]. Note that some of them suffer from significant blur. Fig. 6(a) shows an image of a planar scene which is blurred due to dominant camera motion. Fig. 6(b) shows a 3D scene blurred due to camera motion. Figs. 6(c-f) show results on blurred images with dynamic object motion. Observe that the videos generated by our approach are realistic and qualitatively consistent with the blur and depth of the scene, even when the foreground incurs large motion. Our network is able to reconstruct videos from blurred images with diverse motion and scene content. In comparison, the results of [8] suffer from local errors in deblurring, inconsistent motion estimation, as well as color distortions.

Figure 6. Comparisons of our video extraction results with [8] on motion blurred images obtained from the test dataset of [19]. The first row shows the blurred images, while the second and third rows show videos generated by our method and [8], respectively. (Videos can be viewed by clicking on the images when the document is opened in Adobe Reader.)

We have observed that, in general, the method of [8] fails in cases involving heavy blur, as direct image regression becomes difficult for large motion. In contrast, we divide the overall problem into the two sub-tasks of deblurring and motion extraction. This simplifies learning and yields improvements in deblurring quality as well as motion estimation. The color issue in [8] can be attributed to the design of their networks, wherein the feature extraction and reconstruction branches are different for different color channels; our method applies the same motion to each color channel. By having a single recurrent network generate the video, our network can be directly trained to extract even higher numbers of frames (> 9) without any design change or additional parameters. In contrast, [8] requires training an additional network for each new pair of frames. Our overall architecture is more compact (34 MB vs. 70 MB) and much faster (0.02 s vs. 0.45 s for deblurring and 0.39 s vs. 1.10 s for video generation) as compared to [8].

To perform quantitative comparisons with [8], we also trained another version of our network on the restricted case of blurred images produced by averaging 7 successive sharp frames. For testing, 250 blurred images of resolution 1280 × 704 were created using the 11 test videos from the dataset of [19]. We compared the videos estimated by the two methods using the ambiguity-invariant loss function. The average error was found to be 49.06 for [8] and 44.12 for our method. Thus, even for the restricted case of small blur, our method performs favorably. Repeating the same experiment for 9 frames (i.e., for larger blur from the same test videos) led to an error of 48.24 for our method, which is still lower than the 7-frame error of [8]. We could not compute the 9-frame error for [8], as their network is rigidly designed for 7 frames only.
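The ordering-invariant (ambiguity-invariant) reconstruction loss used for training the BIE and for the comparison above is not written out in this version of the text. A reasonable reading is that a predicted sequence is scored against the better of the forward and time-reversed ground-truth orderings, so that a video played in the opposite direction, which explains the same blurred image, is not penalized. The sketch below implements that interpretation and should be treated as an assumption.

```python
import torch

def ordering_invariant_loss(pred, target):
    """Ambiguity-invariant reconstruction loss (assumed form).

    pred, target: (B, N, C, H, W) frame sequences. The direction of motion cannot be
    recovered from a single blurred image, so for each sample we take the minimum of
    the errors against the target sequence and its temporal reverse.
    """
    err_fwd = (pred - target).abs().mean(dim=(1, 2, 3, 4))                  # forward ordering
    err_bwd = (pred - target.flip(dims=[1])).abs().mean(dim=(1, 2, 3, 4))   # reversed ordering
    return torch.minimum(err_fwd, err_bwd).mean()

# Example (shapes only):
# pred = torch.rand(2, 9, 3, 64, 64); target = torch.rand(2, 9, 3, 64, 64)
# loss = ordering_invariant_loss(pred, target)
```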
4. Conclusions

We introduced a new methodology for video generation from a single blurred image. We proposed a spatio-temporal video autoencoder based on an end-to-end differentiable architecture that learns motion representation from sharp videos in a self-supervised manner. The network predicts a sequence of optical flows and employs them to transform a sharp central frame and return a smooth video. Using the trained video decoder, we trained a blurred image encoder to extract, from a single blurred image, a representation that mimics the one returned by the video encoder. This representation, when fed to the decoder, returns a plausible sharp video representing the action within the blurred image. We also proposed an efficient deblurring architecture composed of densely connected layers that yields state-of-the-art results. Our work can be extended in a variety of directions, including blur-based segmentation, video deblurring, video interpolation, and action recognition. A refined and complete version of this work appeared in CVPR 2019.

Figure 7. Video generation from images blurred with global camera motion, from the datasets of [6], [11] and [15]. The first row shows the blurred images; the generated videos using our method are shown in the second row.

Figure 8. Video generation results on real motion blurred images from the dataset of [38]. The first row shows the blurred images; the second row contains the videos extracted with our method.

References

[1] A. Chakrabarti. A neural approach to blind motion deblurring, 2016.
[2] P. Chandramouli and A. Rajagopalan. Inferring image transformation and structure from motion-blurred images. In BMVC, pages 73–1, 2010.
[3] S. Cho and S. Lee. Fast motion deblurring. ACM Trans. Graph., 28(5):1–8, Dec. 2009.
[4] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. ACM Trans. Graph., 25(3):787–794, Jul. 2006.
[5] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deep stereo: Learning to predict new views from the world's imagery. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5515–5524, 2016.
[6] D. Gong, J. Yang, L. Liu, Y. Zhang, I. Reid, C. Shen, A. Van Den Hengel, and Q. Shi. From motion blur to motion flow: A deep learning solution for removing heterogeneous motion blur. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3806–3815, 2017.
[7] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks, 2018.
[8] M. Jin, G. Meishvili, and P. Favaro. Learning to extract a video sequence from a single motion-blurred image. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6334–6342, 2018.
[9] N. Joshi, R. Szeliski, and D. J. Kriegman. PSF estimation using sharp edge prediction. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
[10] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1175–1183, 2017.
[11] R. Köhler, M. Hirsch, B. Mohler, B. Schölkopf, and S. Harmeling. Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database. In Computer Vision – ECCV 2012, pages 27–40, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[12] D. Krishnan and R. Fergus. Fast image deconvolution using hyper-Laplacian priors. In NIPS'09, pages 1033–1041, Red Hook, NY, USA, 2009. Curran Associates Inc.
[13] D. Krishnan, T. Tay, and R. Fergus. Blind deconvolution using a normalized sparsity measure. In CVPR 2011, pages 233–240, 2011.
[14] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas. DeblurGAN: Blind motion deblurring using conditional adversarial networks, 2018.
[15] W.-S. Lai, J.-B. Huang, Z. Hu, N. Ahuja, and M.-H. Yang. A comparative study for single image blind deblurring. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1701–1709, 2016.
[16] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow, 2017.
[17] M. Mohan, S. Girish, and A. Rajagopalan. Unconstrained motion deblurring for dual-lens cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7870–7879, 2019.
[18] M. M. Mohan, G. Nithin, and A. Rajagopalan. Deep dynamic scene deblurring for unconstrained dual-lens cameras. IEEE Transactions on Image Processing, 30:4479–4491, 2021.
[19] S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 257–265, 2017.
[20] T. Nimisha, A. Rajagopalan, and R. Aravind. Generating high quality pan-shots from motion blurred videos. Computer Vision and Image Understanding, 171:20–33, 2018.
[21] T. M. Nimisha, A. K. Singh, and A. N. Rajagopalan. Blur-invariant deep learning for blind-deblurring. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4762–4770, 2017.
[22] T. M. Nimisha, K. Sunil, and A. Rajagopalan. Unsupervised class-specific deblurring. In Proceedings of the European Conference on Computer Vision (ECCV), pages 353–369, 2018.
[23] J. Pan, Z. Hu, Z. Su, and M.-H. Yang. Deblurring text images via L0-regularized intensity and gradient prior. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2908, 2014.
[24] J. Pan, Z. Lin, Z. Su, and M.-H. Yang. Robust kernel estimation with outliers handling for image deblurring. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2800–2808, 2016.
[25] J. Pan, D. Sun, H. Pfister, and M.-H. Yang. Deblurring images via dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2315–2328, 2018.
[26] C. Paramanand and A. Rajagopalan. Shape from sharp and motion-blurred image pair. International Journal of Computer Vision, 107(3):272–292, 2014.
[27] C. Paramanand and A. N. Rajagopalan. Depth from motion and optical blur with an unscented Kalman filter. IEEE Transactions on Image Processing, 21(5):2798–2811, 2011.
[28] C. Paramanand and A. N. Rajagopalan. Non-uniform motion deblurring for bilayer scenes. In CVPR, pages 1115–1122, 2013.
[29] K. Purohit and A. Rajagopalan. Region-adaptive dense network for efficient motion deblurring. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11882–11889, 2020.
[30] K. Purohit, A. Shah, and A. Rajagopalan. Bringing alive blurred moments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2019.
[31] K. Purohit, A. B. Shah, and A. Rajagopalan. Learning based single image blur detection and segmentation. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 2202–2206. IEEE, 2018.
[32] M. P. Rao, A. Rajagopalan, and G. Seetharaman. Harnessing motion blur to unveil splicing. IEEE Transactions on Information Forensics and Security, 9(4):583–595, 2014.
[33] M. P. Rao, A. Rajagopalan, and G. Seetharaman. Inferring plane orientation from a single motion blurred image. In ICPR, pages 2089–2094. IEEE, 2014.
[34] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation, 2015.
[35] C. J. Schuler, H. C. Burger, S. Harmeling, and B. Schölkopf. A machine learning approach for non-blind image deconvolution. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1067–1074, 2013.
[36] C. J. Schuler, M. Hirsch, S. Harmeling, and B. Schölkopf. Learning to deblur. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1439–1451, 2016.
[37] Q. Shan, J. Jia, and A. Agarwala. High-quality motion deblurring from a single image. ACM Trans. Graph., 27(3):1–10, Aug. 2008.
[38] J. Shi, L. Xu, and J. Jia. Discriminative blur detection features. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2965–2972, 2014.
[39] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 802–810, Cambridge, MA, USA, 2015. MIT Press.
[40] J. Sun, W. Cao, Z. Xu, and J. Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 769–777, 2015.
[41] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia. Scale-recurrent network for deep image deblurring. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
[42] I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich. Examining the impact of blur on recognition by convolutional networks, 2017.
[43] S. Vasu, V. R. Maligireddy, and A. Rajagopalan. Non-blind deblurring: Handling kernel uncertainty with CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3272–3281, 2018.
[44] S. Vasu and A. N. Rajagopalan. From local to global: Edge profiles to camera motion in blurred images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 558–567, 2017.
[45] C. S. Vijay, C. Paramanand, A. N. Rajagopalan, and R. Chellappa. Non-uniform deblurring in HDR image reconstruction. IEEE Transactions on Image Processing, 22(10):3739–3750, 2013.
[46] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics, 2016.
[47] O. Whyte, J. Sivic, A. Zisserman, and J. Ponce. Non-uniform deblurring for shaken images. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 491–498, 2010.
[48] L. Xu, S. Zheng, and J. Jia. Unnatural L0 sparse representation for natural image deblurring. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1107–1114, 2013.
[49] L. Xu, S. Zheng, and J. Jia. Unnatural L0 sparse representation for natural image deblurring. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1107–1114, 2013.
[50] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(7):2480–2495, 2021.
[51] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow, 2017.