(Visualization) Quality-Driven 360 Video Extrapolation
Abstract—Generative Adversarial Networks (GANs) are now powerful enough to generate realistic high-resolution images and have already been applied successfully to out-painting tasks and to the generation of masked regions in images. GANs therefore have the potential to remove the need for complex camera systems to capture 360 environments: from only a small RGB crop, the full environment could be generated. However, GANs can fail to blend the input crop with the generated extrapolated region, introducing sharp vertical edges that disrupt the overall visual coherence and degrade the Quality of Experience (QoE). In this work, we present 360-GAN, a quality-driven cycle-consistent GAN model, consisting of two GANs, that generates 360-degree omnidirectional images from small RGB crops. To maintain the spherical consistency of the generated 360 panoramic images, our method uses the Structural Similarity Index (SSIM) quality metric as a loss function. We evaluate our approach through quantitative and qualitative measurements, benchmarking it against other state-of-the-art approaches. Our method generates realistic results while maintaining the spherical consistency of the omnidirectional images, with a Fréchet Inception Distance (FID) of 46.59, nearly 6 points better than the most recent state-of-the-art methods.

Index Terms—Deep Generative Networks, Cycle-GAN, 360-Degree image, SSIM, Quality of Experience

I. INTRODUCTION

Omnidirectional 360 images, also known as spherical or panoramic images, capture a complete 360-degree Field-of-View (FoV). One of the key benefits of 360-degree images is their ability to provide a sense of presence and immersion, allowing viewers to feel as if they are physically present in the captured environment. Users can navigate through the image by panning, tilting, and zooming to explore the entire scene from different perspectives. This makes them ideal for creating virtual tours, interactive experiences, and virtual reality (VR) applications. Creating panoramic images, however, typically requires specialized cameras, stitching of multiple images, and post-processing techniques.

As a regular camera lens has a Field of View (FoV) of 72 degrees, the question is whether we, as humans, somehow imagine the remaining part of the scene by prediction, in effect imagining the rest of the panoramic view. With the recent immense progress of computer vision and artificial intelligence (AI), mimicking such human imaginary predictions becomes more of a reality. Image out-painting, or extrapolating the content outside the regular FoV of an image, allows a small RGB crop (i.e., the FoV) to be expanded into the full 360-degree view. In this paper, we introduce a method based on Deep Generative Networks for 360-degree FoV extrapolation.

Earlier, different texture synthesis methods [3] [4] were used to extend the FoV of an image with specific textures. Recently, Generative Adversarial Networks (GANs) showed promising results [2] [16] [18] [19] [20] and have emerged as a method of choice. However, these methods consider the image scene to be planar, which is not the case for a realistic scene. We consider a scene to be spherical, which means the edges of the planar representation must merge perfectly when represented as a 360-degree view. GAN-based methods like CoModGAN [7] have been proposed for in-painting, which involves training a generator network to fill in the missing pixels inside an image, while the discriminator network evaluates the realism of the generated images and provides feedback to the generator to improve its output. For instance, CoModGAN [7] generated impressive results for image in-painting, as the missing regions are constrained within the boundaries of the input image, providing a clear target for the generator to fill in. Out-painting, which involves generating new content beyond the boundaries of an input image, is more challenging, as the regions to be filled lie outside the image rather than within it.

Several GAN methods have been proposed based on Pix2pix [12], which requires cropped images and original images paired for training. Im2Pano3D [13] predicts a comprehensive 360-degree segmentation map from a regular image, providing valuable clues about the surrounding content captured by the camera. Li et al. [1] utilize a VAE-GAN (Variational Autoencoder GAN) structure that generates edges and an edge transformation for FoV extrapolation. Akimoto et al. [15] proposed a method based on a two-stage conditional GAN to generate 360° panoramic images. Recently, Akimoto et al. [14] used a transformer-based architecture to predict 360° FoV extrapolation for generating 3DCG backgrounds. Based on the CoModGAN [7] architecture, ImmerseGAN [8] generates plausible results for image out-painting, where the generator must produce new content that is visually coherent and semantically meaningful while remaining consistent with the existing content of the input image. However, the generator has little to no contextual information beyond the input image boundaries to guide the generation process. This lack of context makes it difficult to generate realistic and visually coherent content that extends beyond the input image boundaries. Moreover, low-resolution images [12] with pixelation or blurriness can reduce the Quality of Experience (QoE), as the lack of visual clarity can negatively impact the sense of presence and engagement.

The purpose of this paper is to provide a model tackling both the blurriness and the spherical inconsistency. Herein, we present 360-GAN, a cycle-GAN [6]-based model in which two GANs perform domain adaptation. As described in Figure 1, the first GAN generates 360-degree images from small RGB crops and the second GAN generates small RGB crops from 360-degree images. Both models have individual adversarial losses together with a cycle consistency loss. This procedure aims to reduce the blurriness of the output 360-degree image. Moreover, to maintain the spherical consistency, a quality metric (Structural Similarity Index, SSIM) is used as an additional loss function. We compare our method with state-of-the-art algorithms, where 360-GAN outperforms them in all cases, both quantitatively and qualitatively, generating 360-degree images with better QoE.
Fig. 1. Our presented method to extrapolate the 360-degree Field-of-View (FoV) using 360-GAN, based on the cycle-GAN architecture.

II. CYCLE-GAN FOR PANORAMIC IMAGES
Our work builds on the cycle-GAN [6] architecture. It typically consists of two generator networks (G1, G2) and two discriminator networks (D1, D2), one for each domain. It includes two mapping functions G1: X → Y and G2: Y → X. The generator networks learn to generate fake images, while the discriminator networks learn to distinguish between fake and real images. The generator and discriminator networks are trained in an adversarial manner: the generator tries to generate realistic images to fool the discriminator, while the discriminator tries to correctly classify fake and real images. Thus, the generator and the discriminator are trained iteratively in a two-player minimax game. The adversarial loss L(G1, D1) is defined as:

L(G_1, D_1) = \min_{\Phi_G} \max_{\Phi_D} \{ \mathbb{E}_y[\log D_1(y)] + \mathbb{E}_x[\log(1 - D_1(G_1(x)))] \}   (1)

where \Phi_G and \Phi_D are the respective parameters of G1 and D1, and x ∈ X and y ∈ Y denote the unpaired training data in the two domains. A similar adversarial loss L(G2, D2) is defined for the reverse mapping.
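To make Eq. (1) concrete, below is a minimal PyTorch sketch of the adversarial terms for one generator-discriminator pair (G1, D1), written in the common non-saturating binary cross-entropy form; the function names and training details are our own placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(d1, real_pano, fake_pano):
    # D1 should score real 360-degree images as 1 and generated ones as 0.
    real_logits = d1(real_pano)
    fake_logits = d1(fake_pano.detach())  # do not backpropagate into G1 here
    return (bce(real_logits, torch.ones_like(real_logits)) +
            bce(fake_logits, torch.zeros_like(fake_logits)))

def generator_adversarial_loss(d1, fake_pano):
    # Non-saturating generator objective: G1 is rewarded when D1 scores its
    # output as real.
    fake_logits = d1(fake_pano)
    return bce(fake_logits, torch.ones_like(fake_logits))
```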
The training data in Cycle-GAN is unpaired, where X represents the small RGB crops and Y represents the 360-degree images. The key idea behind Cycle-GAN is the use of cycle consistency, which is achieved by training the generator networks not only to translate images from X to Y but also to reverse the translation and reconstruct the original image. This is done by introducing a cycle consistency loss, which penalizes the difference between the original image and the image reconstructed after passing through both generator networks in a cycle. Thus, the training process of Cycle-GAN involves a cycle-consistency loss term in addition to the standard adversarial loss. The cycle consistency loss encourages the generators to produce images that are consistent when translated back and forth between the two domains X and Y, ensuring that the generated images are plausible and preserve the original content. The full objective is:

L(G_1, G_2, D_1, D_2) = L(G_1, D_1) + L(G_2, D_2) + \gamma L_{cycle}(G_1, G_2)   (2)

where

L_{cycle}(G_1, G_2) = \| G_2(G_1(x)) - x \|_1 + \| G_1(G_2(y)) - y \|_1   (3)

is the cycle-consistency loss and \gamma is the cycle-loss parameter.
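A minimal PyTorch sketch of the cycle-consistency term in Eq. (3), assuming g1 maps crops to panoramas and g2 maps panoramas to crops; the mean-reduced L1 of torch.nn.functional.l1_loss stands in for the L1 norm and the names are illustrative.

```python
import torch.nn.functional as F

def cycle_consistency_loss(g1, g2, x_crop, y_pano):
    # Forward cycle: small crop -> generated panorama -> reconstructed crop.
    x_rec = g2(g1(x_crop))
    # Backward cycle: panorama -> generated crop -> reconstructed panorama.
    y_rec = g1(g2(y_pano))
    # Both L1 terms of Eq. (3): the round trip should reproduce the input.
    return F.l1_loss(x_rec, x_crop) + F.l1_loss(y_rec, y_pano)
```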
III. 360-GAN

Figure 1 presents our approach. We introduce an end-to-end trainable pipeline designed for the task of 360-degree FoV extrapolation, generating high-quality panoramas from a single RGB crop with limited FoV. Based on the cycle-GAN architecture, we build our 360-GAN with the addition of a Structural Similarity Index (SSIM) loss. It consists of two GANs: one learns the mapping from small RGB crops to 360-degree images (forward consistency), and the other learns the mapping from 360-degree images to small RGB crops (backward consistency), until cycle consistency is reached. The SSIM [5] is a widely used image quality assessment metric that measures the structural similarity between two images. Mathematically, the SSIM loss can be defined as:

Loss_{SSIM} = 1 - SSIM(I_1, I_2)   (4)

where SSIM is the Structural Similarity Index between the reference image (I1) and the distorted image (I2). To maintain the spherical consistency of the generated 360-degree images, we add the SSIM loss between the 10 edge pixels of the left and right sides. This added loss ensures the blending of the right and left edge pixels of the generated 360-degree image, hence removing discontinuities, and it encourages the generator model to produce 360-degree images with no discontinuities at the edges.
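One plausible implementation of this edge constraint, sketched in PyTorch: compute 1 - SSIM between the 10 left-most and 10 right-most pixel columns of the generated equirectangular image, which become adjacent once the image is wrapped onto the sphere. The third-party pytorch_msssim package and the reduced SSIM window size are our assumptions, not details given in the paper.

```python
import torch
from pytorch_msssim import ssim  # third-party SSIM implementation (assumed)

def edge_ssim_loss(pano, edge_width=10, win_size=7):
    # pano: (N, C, H, W) generated equirectangular image with values in [0, 1].
    left = pano[:, :, :, :edge_width]    # 10 left-most columns
    right = pano[:, :, :, -edge_width:]  # 10 right-most columns
    # A window smaller than the default 11x11 is needed because the strips
    # are only `edge_width` pixels wide.
    return 1.0 - ssim(left, right, data_range=1.0, win_size=win_size)
```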
The total loss is then calculated as:

L_{TOTAL} = L(G_1, D_1) + L(G_2, D_2) + \gamma L_{cycle}(G_1, G_2) + \lambda Loss_{SSIM}   (5)

where \lambda is the SSIM loss parameter.
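Putting the pieces together, a one-line sketch of Eq. (5); the weights gamma=10.0 and lam=1.0 are illustrative placeholders, as the excerpt does not state the values used for γ and λ.

```python
def total_loss(adv_g1, adv_g2, cycle_term, ssim_edge_term, gamma=10.0, lam=1.0):
    # Eq. (5): adversarial losses for both mapping directions, the
    # cycle-consistency term weighted by gamma, and the SSIM edge term
    # weighted by lambda.
    return adv_g1 + adv_g2 + gamma * cycle_term + lam * ssim_edge_term
```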
As shown in Figure 1, the generator networks (G1, G2) are based on the U-Net [9] architecture. To mitigate anisotropic upsampling artifacts while mapping the output to the equirectangular representation, we made adjustments to the U-Net architecture, based on [8], to ensure a 2:1 aspect ratio. This modification helps maintain consistent proportions and prevents distortions that may occur during upsampling, resulting in improved visual fidelity in the final equirectangular representation. The discriminator networks (D1, D2) are based on the PatchGAN architecture [21], which outputs a grid of scalar values, where each scalar value corresponds to a small patch of the input image. The discriminator's job is to distinguish between the real output image and the generated output image.
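The sketch below is a typical PatchGAN-style discriminator in PyTorch that outputs one logit per image patch, as described above; the layer count and channel widths follow the common 70x70 PatchGAN convention and are assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: returns a grid of real/fake logits."""

    def __init__(self, in_channels=3, base=64):
        super().__init__()

        def block(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1),
                nn.InstanceNorm2d(cout),
                nn.LeakyReLU(0.2, inplace=True),
            )

        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            block(base, base * 2, stride=2),
            block(base * 2, base * 4, stride=2),
            block(base * 4, base * 8, stride=1),
            # One logit per spatial location, i.e. per receptive-field patch.
            nn.Conv2d(base * 8, 1, kernel_size=4, stride=1, padding=1),
        )

    def forward(self, x):
        # Output shape (N, 1, H', W'): each value scores one patch of x.
        return self.net(x)
```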
IV. EXPERIMENTS

This section provides a description of the dataset used for the evaluation of 360-GAN and an analysis of the results.

Fig. 2. Qualitative Field-of-View (FoV) extrapolation results.

[...] as the GAN model falls into mode collapse. ImmerseGAN [8], which is based on CoModGAN [7], generates results with discontinuities, degrading the QoE. Our method, 360-GAN, generates realistic omnidirectional images with plausible environments while maintaining spherical consistency. As shown in Figure 2, the 360° images generated with 360-GAN show accurate colors, sharpness, and texture, and their overall visual appearance elevates the QoE.

Moreover, to include a more quantitative analysis, we calculated the Fréchet Inception Distance (FID) [17] score for each method, as shown in Table I. FID is a measure of the similarity between the distribution of real images and the distribution of generated images, where lower values indicate better similarity. The FID score is computed by first passing a set of real images and a set of generated images through a pre-trained Inception-v3 neural network to obtain feature representations. Then, the mean and covariance of these feature representations are calculated for both sets of images, and the distance between these statistics is measured using the Fréchet distance. The FID score of pix2pixHD [16] is the worst among the three methods tested. We assume that the FID score of the state-of-the-art method ImmerseGAN [8] is worse here than in the original paper because we trained it on our dataset until it started to over-fit. Based on the quantitative analysis, our method 360-GAN outperforms the state-of-the-art method ImmerseGAN [8].
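To make the FID computation described above concrete, here is a short NumPy/SciPy sketch of the Fréchet distance between the two feature distributions, assuming the Inception-v3 pool features have already been extracted into arrays of shape (N, 2048); the function name and this particular setup are our assumptions.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    # Mean and covariance of each set of Inception-v3 feature vectors.
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product; numerical error can leave
    # a small imaginary component, which is discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```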