360-GAN: Quality-Driven Cycle-Consistent GAN for Extrapolating 360-Degree Field-of-View

Abstract—Generative adversarial networks (GANs) are nowadays powerful enough to generate realistic high-resolution images, and they have already been applied successfully to out-painting tasks and to the generation of masked regions in images. Thus, GANs have the potential to remove the need for complex camera systems to shoot 360 environments: from only a small RGB crop, the full environment could be generated. However, GANs can fail to blend the input crop with the generated extrapolated region, introducing sharp vertical edges that disrupt the overall visual coherence and degrade the Quality of Experience (QoE). In this work, we present 360-GAN, a quality-driven cycle-consistent GAN model consisting of two GANs that generates 360-degree omnidirectional images from small RGB crops. To maintain the spherical consistency of the generated 360 panoramic images, our method uses the quality metric Structural Similarity Index (SSIM) as a loss function. We evaluate our approach through quantitative and qualitative measurements, benchmarking it against other state-of-the-art approaches. Our method generates realistic results that maintain the spherical consistency of the omnidirectional images, with a Fréchet Inception Distance (FID) of 46.59, nearly 6 points better than the most recent state-of-the-art methods.

Index Terms—Deep Generative Networks, Cycle-GAN, 360-Degree image, SSIM, Quality of Experience

I. INTRODUCTION

Omnidirectional 360 images, also known as spherical or panoramic images, capture a complete 360-degree Field-of-View (FoV). One of the key benefits of 360-degree images is their ability to provide a sense of presence and immersion, allowing viewers to feel like they are physically present in the captured environment. Users can navigate through the image by panning, tilting, and zooming to explore the entire scene from different perspectives. This makes them ideal for creating virtual tours, interactive experiences, and virtual reality (VR) applications. Creating panoramic images, however, typically requires specialized cameras, multiple-image stitching, and post-processing techniques.

As a regular camera lens has a Field of View (FoV) of 72 degrees, the question is whether we, as humans, somehow imagine the remaining part of the scene by prediction; in effect, we imagine the remaining parts of the panoramic view. With the recent immense progress of computer vision and artificial intelligence (AI), mimicking such human imaginary predictions becomes more of a reality. Image out-painting, or extrapolating the content outside the regular FoV of an image, allows a small RGB crop (i.e., the FoV) to generate the full 360-degree view. In this paper, we introduce a method based on Deep Generative Networks for 360-degree FoV extrapolation.

Earlier, different texture synthesis methods [3] [4] were used to extend the FoV of an image with specific textures. Recently, Generative Adversarial Networks (GANs) have shown promising results [2] [16] [18] [19] [20] and have emerged as a method of choice. However, these methods consider the image scene to be planar, which is not the case for a realistic scene. We consider a scene to be spherical, which means the edges of the planar representation must merge perfectly when represented as a 360-degree view. GAN-based methods like CoModGAN [7] have been proposed for in-painting, which involves training a generator network to fill in the missing pixels inside an image, while the discriminator network evaluates the realism of the generated images and provides feedback to the generator to improve its output. For instance, CoModGAN [7] generated impressive results for image in-painting, as the missing regions are constrained within the boundaries of the input image, providing a clear target for the generator to fill in. Out-painting, which involves generating new content beyond the boundaries of an input image, is more challenging, as the regions to be filled lie outside the image itself.

Several GAN methods have been proposed based on Pix2pix [12], which requires cropped images and original images paired for training. Im2Pano3D [13] predicts a comprehensive 360-degree segmentation map from a regular image, providing valuable clues about the surrounding content captured by the camera. Li et al. [1] utilize a VAE-GAN (Variational Autoencoder GAN) structure that generates edges and edge transformations for FoV extrapolation. Akimoto et al. [15] proposed a method based on a two-stage conditional GAN to generate 360° panoramic images. Recently, Akimoto et al. [14] used a transformer-based architecture to predict 360° FoV extrapolation for generating 3DCG backgrounds. Based on the CoModGAN [7] architecture, ImmerseGAN [8] generates plausible results for image out-painting, as the generator needs to generate new content that is visually coherent and semantically meaningful while maintaining consistency with the existing content in the input image. However, the generator has little to no contextual information beyond the input image boundaries to guide the generation process. This lack of context makes it difficult to generate realistic and visually coherent content that extends beyond the input image boundaries. Moreover, low-resolution images [12] with pixelation or blurriness can reduce the Quality of Experience (QoE), as the lack of visual clarity can negatively impact the sense of presence and engagement.

The purpose of this paper is to provide a model tackling both the blurriness and the spherical inconsistency.
Herein, we present 360-GAN, a cycle-GAN [6]-based model in which two GANs perform domain adaptation. As described in Figure 1, the first GAN generates 360-degree images from small RGB crops, and the second GAN generates small RGB crops from 360-degree images. Both models have individual adversarial losses combined with a cycle consistency loss. This procedure aims to reduce the blurriness of the output 360-degree image. Moreover, to maintain the spherical consistency, a quality metric (the Structural Similarity Index, SSIM) is used as an additional loss function. We compare our method with state-of-the-art algorithms, and 360-GAN outperforms them in all cases, both quantitatively and qualitatively, generating 360-degree images that result in a better QoE.
Fig. 1. Our presented method to extrapolate the 360-degree Field-of-View (FoV) using 360-GAN, based on the cycle-GAN architecture.

II. CYCLE-GAN FOR PANORAMIC IMAGES

Our work builds on the cycle-GAN [6] architecture. It typically consists of two generator networks (G_1, G_2) and two discriminator networks (D_1, D_2), one for each domain. It includes two mapping functions G_1: X → Y and G_2: Y → X. The generator networks learn to generate fake images, while the discriminator networks learn to distinguish between fake and real images. The generator and discriminator networks are trained in an adversarial manner: the generator tries to generate realistic images to fool the discriminator, while the discriminator tries to correctly classify fake and real images. Thus, the generator and the discriminator are trained iteratively in a two-player minimax game setup. The adversarial loss L(G_1, D_1) is defined as:

L(G_1, D_1) = \min_{\Phi_G} \max_{\Phi_D} \{ E_y[\log D_1(y)] + E_x[\log(1 - D_1(G_1(x)))] \}    (1)

where \Phi_G and \Phi_D are the respective parameters of G_1 and D_1, and x ∈ X and y ∈ Y denote the unpaired training data in the two domains. A similar adversarial loss L(G_2, D_2) is defined for the reverse mapping.

The training data in cycle-GAN is unpaired, where X represents the small RGB crops and Y represents the 360-degree images. The key idea behind cycle-GAN is the use of cycle consistency, which is achieved by training the generator networks not only to generate images from X to Y but also to reverse the translation and reconstruct the original image. This is done by introducing a cycle consistency loss, which penalizes the difference between the original image and the image reconstructed after going through both generator networks in a cycle. Thus, the training process of cycle-GAN involves a cycle-consistency loss term in addition to the standard adversarial loss. The cycle consistency loss encourages the generators to produce images that are consistent when translated back and forth between the two domains X and Y, ensuring that the generated images are plausible and maintain the original content.

L(G_1, G_2, D_1, D_2) = L(G_1, D_1) + L(G_2, D_2) + \gamma L_{cycle}(G_1, G_2)    (2)

where

L_{cycle}(G_1, G_2) = \| G_2(G_1(x)) - x \|_1 + \| G_1(G_2(y)) - y \|_1    (3)

is the cycle-consistency loss and γ is the cycle-loss parameter.
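To make the two loss terms concrete, a minimal PyTorch-style sketch is given below. The binary cross-entropy form of Eq. (1) and the module names G1, G2, D1 are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D1, G1, x, y):
    """Eq. (1) in its common binary cross-entropy form: D1 scores real
    360-degree images y against fakes G1(x) produced from small crops x."""
    fake_y = G1(x)
    real_logits = D1(y)
    fake_logits = D1(fake_y.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    # The generator is updated so that D1 classifies its output as real.
    gen_logits = D1(fake_y)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss

def cycle_consistency_loss(G1, G2, x, y):
    """Eq. (3): L1 error after the full X -> Y -> X and Y -> X -> Y cycles."""
    return (torch.mean(torch.abs(G2(G1(x)) - x))
            + torch.mean(torch.abs(G1(G2(y)) - y)))
```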
III. 360-GAN

Figure 1 presents our approach. We introduce an end-to-end trainable pipeline designed for the task of 360-degree FoV extrapolation, generating high-quality panoramas from a single RGB crop image with limited FoV. Based on the cycle-GAN architecture, we have built our 360-GAN with the addition of a Structural Similarity Index (SSIM) loss. It consists of two GANs: one learns the mapping from small RGB crops to 360-degree images (forward consistency), and the other learns the mapping from 360-degree images to small RGB crops (backward consistency), until they reach cycle consistency. The SSIM [5] is a widely used image quality assessment metric that measures the structural similarity between two images. Mathematically, the SSIM loss can be defined as:

Loss_{SSIM} = 1 - SSIM(I_1, I_2)    (4)

where SSIM is the Structural Similarity Index between the reference image I_1 and the distorted image I_2. To maintain the spherical consistency of the generated 360-degree images, we add the SSIM loss between the 10 edge pixels on the left and right sides. This added loss ensures the blending of the right and left edge pixels of the generated 360-degree image, hence removing discontinuities; it encourages the generator to produce 360-degree images with no discontinuities at the edges.

The total loss is then calculated as:

L_{TOTAL} = L(G_1, D_1) + L(G_2, D_2) + \gamma L_{cycle}(G_1, G_2) + \lambda Loss_{SSIM}    (5)

where λ is the SSIM loss parameter.
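As an illustration, the seam term and the total objective of Eq. (5) might be implemented as follows; the differentiable SSIM from the pytorch-msssim package and the weight values are assumptions of this sketch, not the authors' exact choices.

```python
import torch
from pytorch_msssim import ssim  # assumed differentiable SSIM implementation

def seam_ssim_loss(pano, strip=10):
    """Eq. (4) applied to the 10-pixel left and right edge strips of a generated
    equirectangular panorama of shape (N, C, H, W), so the two edges blend."""
    left = pano[:, :, :, :strip]
    right = pano[:, :, :, -strip:]
    # The SSIM window must fit inside the 10-pixel strip; win_size=7 is one choice.
    return 1.0 - ssim(left, right, data_range=1.0, win_size=7)

def total_loss(adv_1, adv_2, cycle, seam, gamma=10.0, lam=1.0):
    """Eq. (5); the weights gamma and lam (lambda) are illustrative values."""
    return adv_1 + adv_2 + gamma * cycle + lam * seam
```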
As shown in Figure 1, the Generator networks (G_1, G_2) are based on the U-Net [9] architecture. To mitigate anisotropic upsampling artifacts while mapping the output to the equirectangular representation, based on [8], we made adjustments to the U-Net architecture to ensure a 2:1 aspect ratio. This modification helps maintain consistent proportions and prevents distortions that may occur during upsampling, resulting in improved visual fidelity in the final equirectangular representation. The Discriminator networks (D_1, D_2) are based on the PatchGAN architecture [21], which outputs a grid of scalar values, where each scalar value corresponds to a small patch of the input image. The discriminator's job is to distinguish between the real output image and the generated output image.
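For reference, a PatchGAN-style discriminator [21] can be sketched as below; the depth and channel widths are illustrative assumptions rather than the exact configuration used in 360-GAN, but the final 1-channel convolution yields the grid of per-patch scores described above.

```python
import torch.nn as nn

def patchgan_discriminator(in_channels=3, base=64):
    """Minimal PatchGAN: strided convolutions ending in a 1-channel map,
    i.e. a grid of real/fake scores, one per receptive-field patch."""
    def block(cin, cout, stride):
        return [nn.Conv2d(cin, cout, 4, stride, 1),
                nn.InstanceNorm2d(cout),
                nn.LeakyReLU(0.2, inplace=True)]
    layers = [nn.Conv2d(in_channels, base, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True)]
    layers += block(base, base * 2, 2)
    layers += block(base * 2, base * 4, 2)
    layers += block(base * 4, base * 8, 1)
    layers.append(nn.Conv2d(base * 8, 1, 4, 1, 1))  # per-patch score grid
    return nn.Sequential(*layers)
```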
IV. EXPERIMENTS

This section provides a description of the dataset used for the evaluation of 360-GAN and an analysis of the results.

A. Datasets and Training

Our 360-GAN model was trained on a dataset composed of the 360-Indoor [10] and Matterport3D [11] datasets. While the 360-Indoor dataset consists of a total of 3,335 panoramic RGB images, the Matterport3D dataset consists of 10,800 indoor panoramic images. The total dataset is split into 80% train, 10% validation, and 10% test subsets. While training, random crops are computed with FoVs between 70° and 80° to ensure as diverse a set as possible.
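To illustrate how such crops can be taken, the sketch below maps a random horizontal FoV between 70° and 80° to a pixel window of an equirectangular panorama (full width = 360°, full height = 180°); the simple axis-aligned slice is an assumption of this sketch, since the paper does not detail the cropping projection.

```python
import random
import numpy as np

def random_fov_crop(equi, fov_min=70.0, fov_max=80.0):
    """Cut a crop with a random horizontal FoV between fov_min and fov_max degrees
    from an equirectangular panorama of shape (H, W, C), where the full width
    spans 360 degrees and the full height spans 180 degrees."""
    h, w, _ = equi.shape
    fov = random.uniform(fov_min, fov_max)
    crop_w = int(round(w * fov / 360.0))
    crop_h = int(round(h * fov / 180.0))
    x0 = random.randint(0, w - crop_w)
    y0 = random.randint(0, h - crop_h)
    return equi[y0:y0 + crop_h, x0:x0 + crop_w]

# Example: a crop from a 512x1024 panorama is roughly 200-230 pixels on a side.
crop = random_fov_crop(np.zeros((512, 1024, 3), dtype=np.uint8))
```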
To enhance the diversity of our dataset, we implemented data augmentation techniques including random scaling and translations, which allow for variations of up to 15% in size compared to the original image. Additionally, we applied random adjustments to the exposure and saturation of the image, with a maximum factor of 1.2. These techniques introduce variability and augment the training data, enhancing the model's ability to generalize and learn robust features from different image variations. We trained our 360-GAN model for 200 epochs with a decaying learning rate starting at 0.0002.
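A possible torchvision-based version of these augmentations is sketched below, assuming PIL images or image tensors; brightness is used as a stand-in for exposure, and the transform order and interpolation settings are assumptions of this sketch.

```python
from torchvision import transforms

# Random scaling/translation of up to 15% and exposure/saturation changes of up
# to a factor of 1.2, mirroring the augmentations described in the text
# (brightness approximates exposure here).
augment = transforms.Compose([
    transforms.RandomAffine(degrees=0, translate=(0.15, 0.15), scale=(0.85, 1.15)),
    transforms.ColorJitter(brightness=(1 / 1.2, 1.2), saturation=(1 / 1.2, 1.2)),
])
```

The decaying learning rate could, for instance, be realized with an Adam optimizer at 2e-4 combined with a linear decay schedule, although the paper does not specify the exact scheduler.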
B. Results

To provide realistic benchmarks, we selected the state-of-the-art methods pix2pixHD [16] and ImmerseGAN [8]. We trained both pix2pixHD [16] and ImmerseGAN (unguided) [8] on our dataset and tested them under the same conditions as 360-GAN. The FoV of the RGB crops ranges between 70° and 80°, which is also the FoV of a normal camera lens. Qualitative results are presented in Figure 2. They show that pix2pixHD [16] generates poor-quality results, as the GAN model falls into mode collapse. ImmerseGAN [8], which is based on CoModGAN [7], generates results with discontinuities, degrading the QoE. Our method, 360-GAN, generates realistic omnidirectional images with plausible environments while maintaining the spherical consistency. As shown in Figure 2, the 360° images generated using our method ensure color accuracy, sharpness, texture, and overall visual appearance, elevating the overall QoE.

Fig. 2. Qualitative Field-of-View (FoV) extrapolation results.

Moreover, to include a more quantitative analysis, we calculated the Fréchet Inception Distance (FID) [17] score for each method, as shown in Table I. FID is a measure of the similarity between the distribution of real images and the distribution of generated images, where lower values indicate better similarity. The FID score is computed by first passing a set of real images and a set of generated images through a pre-trained Inception-v3 neural network to obtain feature representations. Then, the mean and covariance of these feature representations are calculated for both sets of images, and the distance between these statistics is measured using the Fréchet distance. The FID score of pix2pixHD [16] is the worst among the three methods tested. We assume that the FID score of the state-of-the-art method ImmerseGAN [8] is not as good as in the original paper because we trained it on our dataset until it started to over-fit. Based on the quantitative analysis, our method 360-GAN outperforms the state-of-the-art method ImmerseGAN [8].
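The FID computation described above can be sketched as follows, given Inception-v3 feature matrices for the real and generated test images; in practice an off-the-shelf implementation would typically be used, so this is only illustrative.

```python
import numpy as np
from scipy import linalg

def fid_from_features(real_feats, fake_feats):
    """Frechet Inception Distance between two sets of Inception-v3 features,
    each an (N, 2048) array: ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop small imaginary parts from numerical error
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```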
TABLE I
FID COMPUTED ON THE TEST SET FOR DIFFERENT METHODS.

Method            FID
pix2pixHD [16]    143.27
ImmerseGAN [8]     52.93
360-GAN (ours)     46.59

V. DISCUSSION

Our contributions can be outlined as follows. To begin with, we introduce a novel method, 360-GAN, based on the cycle-GAN architecture, to extend the field-of-view (FoV) of a camera to a complete 360-degree panorama, with the quality metric SSIM as an added loss function. This method effectively controls the appearance of the extrapolated content, resulting in spherically consistent omnidirectional images that surpass the current state-of-the-art in both visual quality and the standard FID metric, thus enhancing the Quality of Experience (QoE). We are confident that the results of our method 360-GAN can be improved if it is trained on a much larger dataset, as the cycle-GAN model can then learn more domain knowledge. Moreover, our method can be used to generate realistic 360-VR scenarios, enhancing the quality and user experience of omnidirectional images and making them more immersive, visually appealing, and interactive. In the future, we would like to perform user-based studies by conducting surveys in which participants are asked to rate the visual quality and realism of the generated 360° images, which will provide valuable feedback on the overall QoE.
REFERENCES

[1] Li, X., Zhang, H., Feng, L., Hu, J., Zhang, R., Qiao, Q.: Edge-aware image outpainting with attentional generative adversarial networks. IET Image Process. 16, 1807-1821 (2022).
[2] Lin, C.H., Lee, H., Cheng, Y., Tulyakov, S., Yang, M. (2021). InfinityGAN: Towards Infinite-Pixel Image Synthesis. International Conference on Learning Representations.
[3] A. A. Efros and T. K. Leung, "Texture synthesis by non-parametric sampling," Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 1999, pp. 1033-1038 vol. 2.
[4] Alexei A. Efros and William T. Freeman. 2001. Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques (SIGGRAPH '01). Association for Computing Machinery, New York, NY, USA, 341-346.
[5] Zhou Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," in IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, April 2004.
[6] J.-Y. Zhu, T. Park, P. Isola and A. A. Efros, "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks," 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 2242-2251.
[7] Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks. In International Conference on Learning Representations, 2021.
[8] M. Dastjerdi, Y. Hold-Geoffroy, J. Eisenmann, S. Khodadadeh and J. Lalonde, "Guided Co-Modulated GAN for 360° Field of View Extrapolation," in 2022 International Conference on 3D Vision (3DV), Prague, Czech Republic, 2022, pp. 475-485.
[9] Ronneberger, O., Fischer, P., Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A. (eds) Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. MICCAI 2015. Lecture Notes in Computer Science, vol 9351. Springer, Cham.
[10] S.-H. Chou, C. Sun, W.-Y. Chang, W.-T. Hsu, M. Sun and J. Fu, "360-Indoor: Towards Learning Real-World Objects in 360° Indoor Equirectangular Images," 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 2020, pp. 834-842.
[11] A. Chang et al., "Matterport3D: Learning from RGB-D Data in Indoor Environments," 2017 International Conference on 3D Vision (3DV), Qingdao, China, 2017, pp. 667-676, doi: 10.1109/3DV.2017.00081.
[12] Naoki Kimura and Jun Rekimoto. 2018. ExtVision: Augmentation of Visual Experiences with Generation of Context Images for a Peripheral Vision Using Deep Neural Network. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). Association for Computing Machinery, New York, NY, USA, Paper 427, 1-10.
[13] S. Song, et al., "Im2Pano3D: Extrapolating 360° Structure and Semantics Beyond the Field of View," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018, pp. 3847-3856.
[14] N. Akimoto, Y. Matsuo and Y. Aoki, "Diverse Plausible 360-Degree Image Outpainting for Efficient 3DCG Background Creation," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 11431-11440.
[15] N. Akimoto, S. Kasai, M. Hayashi and Y. Aoki, "360-Degree Image Completion by Two-Stage Conditional GANs," 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 2019, pp. 4704-4708, doi: 10.1109/ICIP.2019.8803435.
[16] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz and B. Catanzaro, "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 8798-8807, doi: 10.1109/CVPR.2018.00917.
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17). Curran Associates Inc., Red Hook, NY, USA, 6629-6640.
[18] G. Somanath and D. Kurz, "HDR Environment Map Estimation for Real-Time Augmented Reality," 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 11293-11301, doi: 10.1109/CVPR46437.2021.01114.
[19] A. Nair, J. Deshmukh, A. Sonare, T. Mishra and R. Joseph, "Image Outpainting using Wasserstein Generative Adversarial Network with Gradient Penalty," 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 2022, pp. 1248-1255, doi: 10.1109/ICCMC53470.2022.9753713.
[20] Y.-C. Cheng, C. H. Lin, H.-Y. Lee, J. Ren, S. Tulyakov and M.-H. Yang, "InOut: Diverse Image Outpainting via GAN Inversion," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 11421-11430, doi: 10.1109/CVPR52688.2022.01114.
[21] P. Isola, J. Zhu, T. Zhou and A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 5967-5976, doi: 10.1109/CVPR.2017.632.
