TACKLING THE GENERATIVE LEARNING TRILEMMA WITH DENOISING DIFFUSION GANS
ABSTRACT
A wide variety of deep generative models has been developed in the past decade. Yet, these models often struggle to simultaneously address three key requirements: high sample quality, mode coverage, and fast sampling. We call the challenge imposed by these requirements the generative learning trilemma, as existing models often trade some of them for others. In particular, denoising diffusion models have shown impressive sample quality and diversity, but their expensive sampling does not yet allow them to be applied in many real-world applications. In this paper, we argue that slow sampling in these models is fundamentally attributed to the Gaussian assumption in the denoising step, which is justified only for small step sizes. To enable denoising with large steps, and hence to reduce the total number of denoising steps, we propose to model the denoising distribution using a complex multimodal distribution. We introduce denoising diffusion generative adversarial networks (denoising diffusion GANs) that model each denoising step using a multimodal conditional GAN. Through extensive evaluations, we show that denoising diffusion GANs obtain sample quality and diversity competitive with original diffusion models while being 2000× faster on the CIFAR-10 dataset. Compared to traditional GANs, our model exhibits better mode coverage and sample diversity. To the best of our knowledge, denoising diffusion GAN is the first model that reduces sampling cost in diffusion models to an extent that allows them to be applied to real-world applications inexpensively. Project page and code: https://nvlabs.github.io/denoising-diffusion-gan.
1 INTRODUCTION
In the past decade, a plethora of deep generative models has been developed for various domains such as images (Karras et al., 2019; Razavi et al., 2019), audio (Oord et al., 2016a; Kong et al., 2021), point clouds (Yang et al., 2019) and graphs (De Cao & Kipf, 2018). However, current generative learning frameworks cannot yet simultaneously satisfy three key requirements, often needed for their wide adoption in real-world problems. These requirements include (i) high-quality sampling, (ii) mode coverage and sample diversity, and (iii) fast and computationally inexpensive sampling. For example, most current works in image synthesis focus on high-quality generation. However, mode coverage and data diversity are important for better representing minorities and for reducing the negative social impacts of generative models. Additionally, applications such as interactive image editing or real-time speech synthesis require fast sampling. Here, we identify the challenge posed by these requirements as the generative learning trilemma, since existing models usually compromise between them.

[Figure 1: Generative learning trilemma. Generative adversarial networks achieve high-quality samples and fast sampling; denoising diffusion models achieve high-quality samples and mode coverage/diversity; variational autoencoders and normalizing flows achieve fast sampling and mode coverage/diversity.]
Fig. 1 summarizes how mainstream generative frameworks tackle the trilemma. Generative adversarial networks (GANs) (Goodfellow et al., 2014; Brock et al., 2018) generate high-quality samples rapidly, but they have poor mode coverage (Salimans et al., 2016; Zhao et al., 2018). Conversely,
* Work done during an internship at NVIDIA.
variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) and normalizing flows (Dinh et al., 2016; Kingma & Dhariwal, 2018) cover data modes faithfully, but they often suffer from low sample quality. Recently, diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021c) have emerged as powerful generative models. They demonstrate surprisingly good sample quality, beating GANs in image generation (Dhariwal & Nichol, 2021; Ho et al., 2021). They also obtain good mode coverage, indicated by high likelihood (Song et al., 2021b; Kingma et al., 2021; Huang et al., 2021). Although diffusion models have been applied to a variety of tasks (Dhariwal & Nichol, 2021; Austin et al., 2021; Mittal et al., 2021; Luo & Hu, 2021), sampling from them often requires thousands of network evaluations, making their application expensive in practice.
In this paper, we tackle the generative learning trilemma by reformulating denoising diffusion models specifically for fast sampling while maintaining strong mode coverage and sample quality. We investigate the slow sampling issue of diffusion models and observe that they commonly assume that the denoising distribution can be approximated by a Gaussian. However, it is known that the Gaussian assumption holds only in the infinitesimal limit of small denoising steps (Sohl-Dickstein et al., 2015; Feller, 1949), which leads to the requirement of a large number of steps in the reverse process. When the reverse process uses larger step sizes (i.e., fewer denoising steps), we need a non-Gaussian multimodal distribution for modeling the denoising distribution. Intuitively, in image synthesis, the multimodality arises from the fact that multiple plausible clean images may correspond to the same noisy image.
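This intuition can be checked numerically. The following minimal 1D sketch (toy parameters chosen purely for illustration, not from our experiments) applies Bayes' rule, q(x0 | xt) ∝ q(xt | x0) q(x0), to a bimodal data distribution: when little noise has been added, the denoising distribution is close to a Gaussian, but for a large effective step it inherits the bimodality of the data.

```python
# Toy 1D illustration (assumed parameters, for intuition only): the true
# denoising distribution q(x0 | x_t) ∝ q(x_t | x0) q(x0) evaluated on a grid,
# with q(x_t | x0) = N(sqrt(a_bar) x0, 1 - a_bar) the forward-process marginal.
import torch

grid = torch.linspace(-3, 3, 601)                        # candidate x0 values
q_x0 = torch.exp(-0.5 * (grid - 1) ** 2 / 0.1) \
     + torch.exp(-0.5 * (grid + 1) ** 2 / 0.1)           # bimodal data density

def denoising_dist(x_t: float, a_bar: float) -> torch.Tensor:
    """Posterior over x0 given a noisy x_t, normalized on the grid."""
    lik = torch.exp(-0.5 * (x_t - a_bar ** 0.5 * grid) ** 2 / (1 - a_bar))
    post = lik * q_x0
    return post / post.sum()

small_step = denoising_dist(0.0, a_bar=0.99)   # nearly Gaussian around x_t
large_step = denoising_dist(0.0, a_bar=0.30)   # clearly bimodal
```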
Inspired by this observation, we propose to parametrize the denoising distribution with an expressive multimodal distribution to enable denoising with large steps. In particular, we introduce a novel generative model, termed denoising diffusion GAN, in which the denoising distributions are modeled with conditional GANs. In image generation, we observe that our model obtains sample quality and mode coverage competitive with diffusion models while taking as few as two denoising steps, achieving about a 2000× speed-up in sampling compared to the predictor-corrector sampling of Song et al. (2021c) on CIFAR-10. Compared to traditional GANs, we show that our model significantly outperforms state-of-the-art GANs in sample diversity, while being competitive in sample fidelity.
In summary, we make the following contributions: (i) we attribute the slow sampling of diffusion models to the Gaussian assumption in the denoising distribution and propose to employ complex, multimodal denoising distributions; (ii) we propose denoising diffusion GANs, a diffusion model whose reverse process is parametrized by conditional GANs; (iii) through careful evaluations, we demonstrate that denoising diffusion GANs achieve several orders of magnitude speed-up compared to current diffusion models for both image generation and editing. We show that our model overcomes the deep generative learning trilemma to a large extent, making diffusion models, for the first time, applicable to interactive, real-world applications at a low computational cost.
2 BACKGROUND
In diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), there is a forward process that gradually adds noise to the data x0 ∼ q(x0) in T steps with a pre-defined variance schedule βt:

$$q(x_{1:T} \mid x_0) = \prod_{t \ge 1} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\big), \tag{1}$$
where q(x0) is the data-generating distribution. The reverse denoising process is defined by:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t \ge 1} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I\big), \tag{2}$$
where μθ(xt, t) and σt² are the mean and variance of the denoising model, and θ denotes its parameters. The goal of training is to maximize the likelihood pθ(x0) = ∫ pθ(x0:T) dx1:T, by maximizing the evidence lower bound (ELBO, L ≤ log pθ(x0)). The ELBO can be written as matching the true denoising distribution q(xt−1|xt) with the parameterized denoising model pθ(xt−1|xt) using:

$$\mathcal{L} = -\sum_{t \ge 1} \mathbb{E}_{q(x_t)}\big[ D_{\mathrm{KL}}\big( q(x_{t-1} \mid x_t) \,\|\, p_\theta(x_{t-1} \mid x_t) \big) \big] + C, \tag{3}$$

where C contains constant terms that are independent of θ and DKL denotes the Kullback–Leibler (KL) divergence. The objective above is intractable due to the unavailability of q(xt−1|xt). Instead,
[Figure 2: Top: the evolution of a 1D data distribution q(x0) through the diffusion process. Bottom: visualization of the true denoising distribution for varying step sizes, conditioned on a fixed x5. The true denoising distribution for a small step size (i.e., q(x4 | x5 = X)) is close to a Gaussian distribution. However, it becomes more complex and multimodal as the step size increases.]
Sohl-Dickstein et al. (2015) show that L can be written in an alternative form with tractable distributions (see Appendix A of Ho et al. (2020) for details). Ho et al. (2020) show the equivalence of this form with score-based models trained with denoising score matching (Song & Ermon, 2019; 2020).
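Concretely, this tractability relies on the fact that the forward process in Eq. 1 admits a closed-form Gaussian marginal, q(xt | x0) = N(xt; √ᾱt x0, (1 − ᾱt)I) with ᾱt = ∏s≤t (1 − βs). A minimal sketch of it follows (the linear βt schedule is the standard choice of Ho et al. (2020), assumed here for illustration):

```python
# Sketch of the forward diffusion in Eq. 1 via its closed-form marginal:
# x_t can be drawn from q(x_t | x_0) in one shot instead of t sequential steps.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # pre-defined variance schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative products of (1 - beta_t)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I)."""
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)
```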
Two key assumptions are commonly made in diffusion models: first, the denoising distribution pθ(xt−1|xt) is modeled with a Gaussian distribution; second, the number of denoising steps T is often assumed to be on the order of hundreds to thousands of steps. In this paper, we focus on discrete-time diffusion models. In continuous-time diffusion models (Song et al., 2021c), similar assumptions are made at sampling time, when time is discretized into small timesteps.
3 DENOISING DIFFUSION GANS

We first discuss why reducing the number of denoising steps requires learning a multimodal denoising distribution in Sec. 3.1. Then, we present our multimodal denoising model in Sec. 3.2.
Our goal is to reduce the number of denoising diffusion steps T required in the reverse process of diffusion models. Inspired by the observation above, we propose to model the denoising distribution
with an expressive multimodal distribution. Since conditional GANs have been shown to model complex conditional distributions in the image domain (Mirza & Osindero, 2014; Ledig et al., 2017; Isola et al., 2017), we adopt them to approximate the true denoising distribution q(xt−1|xt).¹
Specifically, our forward diffusion is set up similarly to the diffusion process in Eq. 1, with the main assumption that T is small (T ≤ 8) and each diffusion step has a larger βt. Our training is formulated by matching the conditional GAN generator pθ(xt−1|xt) and q(xt−1|xt) using an adversarial loss that minimizes a divergence Dadv per denoising step:

$$\min_{\theta} \sum_{t \ge 1} \mathbb{E}_{q(x_t)}\big[ D_{\mathrm{adv}}\big( q(x_{t-1} \mid x_t) \,\|\, p_\theta(x_{t-1} \mid x_t) \big) \big], \tag{4}$$
where fake samples from pθ(xt−1|xt) are contrasted against real samples from q(xt−1|xt). Denoting the time-dependent discriminator by Dφ(xt−1, xt, t), it is trained with the standard GAN objective per denoising step:

$$\min_{\phi} \sum_{t \ge 1} \mathbb{E}_{q(x_t)}\Big[ \mathbb{E}_{q(x_{t-1} \mid x_t)}\big[-\log D_\phi(x_{t-1}, x_t, t)\big] + \mathbb{E}_{p_\theta(x_{t-1} \mid x_t)}\big[-\log\big(1 - D_\phi(x_{t-1}, x_t, t)\big)\big] \Big]. \tag{5}$$

The first expectation requires sampling from q(xt−1|xt), which is unknown. However, we use the identity q(xt, xt−1) = ∫dx0 q(x0) q(xt, xt−1|x0) = ∫dx0 q(x0) q(xt−1|x0) q(xt|xt−1) to rewrite the first expectation in Eq. 5 as:

$$\mathbb{E}_{q(x_t) q(x_{t-1} \mid x_t)}\big[-\log D_\phi(x_{t-1}, x_t, t)\big] = \mathbb{E}_{q(x_0) q(x_{t-1} \mid x_0) q(x_t \mid x_{t-1})}\big[-\log D_\phi(x_{t-1}, x_t, t)\big].$$
Given the discriminator, we train the generator by $\max_{\theta} \sum_{t \ge 1} \mathbb{E}_{q(x_t)} \mathbb{E}_{p_\theta(x_{t-1} \mid x_t)}\big[\log D_\phi(x_{t-1}, x_t, t)\big]$, which updates the generator with the non-saturating GAN objective (Goodfellow et al., 2014).
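For concreteness, the sketch below shows how one training iteration might look under the objectives above. This is an illustrative sketch with assumed helper and variable names, not the released implementation: q_posterior_sample, which draws from q(xt−1 | xt, x0), is sketched after Eq. 6 below, the discriminator is assumed to output logits (so binary cross-entropy with logits realizes the −log D and −log(1 − D) terms), and the indexing of the noise schedule arrays is simplified for exposition.

```python
# One adversarial training iteration for Eqs. 4-5 (illustrative sketch).
# G(x_t, z, t) is the generator, D(x_prev, x_t, t) the time-conditional
# discriminator returning logits.
import torch
import torch.nn.functional as F

def training_step(G, D, x0, betas, alphas_bar, z_dim=100):
    t = int(torch.randint(1, len(betas), ()))           # random denoising step
    # Real pair via the identity above: x_{t-1} ~ q(x_{t-1}|x0), x_t ~ q(x_t|x_{t-1}).
    a_bar_prev = alphas_bar[t - 1]
    x_prev = a_bar_prev.sqrt() * x0 + (1 - a_bar_prev).sqrt() * torch.randn_like(x0)
    x_t = (1 - betas[t]).sqrt() * x_prev + betas[t].sqrt() * torch.randn_like(x0)
    # Fake pair: predict x0 with the generator, then posterior-sample x'_{t-1}.
    z = torch.randn(x0.size(0), z_dim, device=x0.device)
    x_prev_fake = q_posterior_sample(G(x_t, z, t), x_t, t, betas, alphas_bar)
    # Discriminator loss (Eq. 5): -log D on real pairs, -log(1 - D) on fakes.
    real_logits = D(x_prev, x_t, t)
    fake_logits = D(x_prev_fake.detach(), x_t, t)
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    # Non-saturating generator loss: maximize log D(x'_{t-1}, x_t, t).
    gen_logits = D(x_prev_fake, x_t, t)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```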
Parametrizing the implicit denoising model: Instead of directly predicting xt−1 in the denoising step, diffusion models (Ho et al., 2020) can be interpreted as parameterizing the denoising model by pθ(xt−1|xt) := q(xt−1|xt, x0 = fθ(xt, t)), in which x0 is first predicted using the denoising model fθ(xt, t), and then xt−1 is sampled from the posterior distribution q(xt−1|xt, x0) given xt and the predicted x0 (see Appendix B for details). The distribution q(xt−1|x0, xt) is intuitively the distribution over xt−1 when denoising from xt towards x0, and it always has a Gaussian form for the diffusion process in Eq. 1, independent of the step size and the complexity of the data distribution (see Appendix A for the expression of q(xt−1|x0, xt)). Similarly, we define pθ(xt−1|xt) by:
$$p_\theta(x_{t-1} \mid x_t) := \int p_\theta(x_0 \mid x_t)\, q(x_{t-1} \mid x_t, x_0)\, dx_0 = \int p(z)\, q\big(x_{t-1} \mid x_t, x_0 = G_\theta(x_t, z, t)\big)\, dz, \tag{6}$$

where pθ(x0|xt) is the implicit distribution imposed by the GAN generator $G_\theta(x_t, z, t): \mathbb{R}^N \times \mathbb{R}^L \times \mathbb{R} \to \mathbb{R}^N$ that outputs x0 given xt and an L-dimensional latent variable z ∼ p(z) := N(z; 0, I).
Our parameterization has several advantages. First, our pθ(xt−1|xt) is formulated similarly to DDPM (Ho et al., 2020); thus, we can borrow inductive biases such as the network structure design from DDPM. The main difference is that, in DDPM, x0 is predicted as a deterministic mapping of xt, while in our case x0 is produced by the generator with a random latent variable z. This is the key difference that allows our denoising distribution pθ(xt−1|xt) to become multimodal and complex, in contrast to the unimodal denoising model in DDPM. Second, note that for different t's, xt has different levels of perturbation, and hence using a single network to predict xt−1 directly at different t may be difficult. In our case, however, the generator only needs to predict the unperturbed x0 and then add back perturbation using q(xt−1|xt, x0). Fig. 3 visualizes our training pipeline.
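A minimal sketch of this parameterization is given below (assumed names, not the released code): q_posterior_sample implements the closed-form Gaussian posterior q(xt−1 | xt, x0) of Appendix A, and sampling simply starts from xT ∼ N(0, I) and iterates the denoising step for the few (e.g., T = 4) steps our model uses.

```python
# Denoising parameterization of Eq. 6 and the resulting sampling loop
# (illustrative sketch; betas/alphas_bar as in Eq. 1).
import torch

def q_posterior_sample(x0, x_t, t, betas, alphas_bar):
    """Sample q(x_{t-1} | x_t, x0): Gaussian for any step size."""
    a_bar_t = alphas_bar[t]
    a_bar_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
    beta_t = betas[t]
    mean = (a_bar_prev.sqrt() * beta_t * x0
            + (1 - beta_t).sqrt() * (1 - a_bar_prev) * x_t) / (1 - a_bar_t)
    var = (1 - a_bar_prev) / (1 - a_bar_t) * beta_t
    return mean + var.sqrt() * torch.randn_like(x_t)

@torch.no_grad()
def sample(G, shape, betas, alphas_bar, z_dim=100, device="cuda"):
    """Start from x_T ~ N(0, I) and denoise in a few (e.g., 4) steps."""
    x_t = torch.randn(shape, device=device)
    for t in reversed(range(len(betas))):
        z = torch.randn(shape[0], z_dim, device=device)
        x0_pred = G(x_t, z, t)          # multimodal: depends on the latent z
        x_t = q_posterior_sample(x0_pred, x_t, t, betas, alphas_bar) if t > 0 else x0_pred
    return x_t
```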
¹At early stages, we examined training a conditional VAE as pθ(xt−1|xt) by minimizing $\sum_{t=1}^{T} \mathbb{E}_{q(x_t)}\big[ D_{\mathrm{KL}}\big( q(x_{t-1} \mid x_t) \,\|\, p_\theta(x_{t-1} \mid x_t) \big) \big]$ for small T. However, conditional VAEs consistently resulted in poor generative performance in our early experiments. In this paper, we focus on conditional GANs and leave the exploration of other expressive conditional generators for pθ(xt−1|xt) to future work.
One may ask why not just train a GAN that can generate samples in one shot using a traditional setup, in contrast to our model that generates samples by denoising iteratively. Our model has several advantages over traditional GANs. GANs are known to suffer from training instability and mode collapse (Kodali et al., 2017; Salimans et al., 2016), and some possible reasons include the difficulty of directly generating samples from a complex distribution in one shot, and the overfitting issue when the discriminator only looks at clean samples. In contrast, our model breaks the generation process into several conditional denoising diffusion steps in which each step is relatively simple to model, due to the strong conditioning on xt. Moreover, the diffusion process smoothens the data distribution (Lyu, 2012), making the discriminator less likely to overfit. Thus, we expect our model to exhibit better training stability and mode coverage. We empirically verify these advantages over traditional GANs in Sec. 5.

[Figure 3: The training process of denoising diffusion GAN. Given x0, real pairs are produced via q(xt−1 | x0) and q(xt | xt−1); the generator G(xt, z, t) predicts x'0 from xt and a latent z, and x't−1 is obtained by posterior sampling from q(xt−1 | xt, x'0). The discriminator D(xt−1, xt, t) classifies pairs as real or fake.]
4 RELATED WORK
Diffusion-based models (Sohl-Dickstein et al., 2015; Ho et al., 2020) learn the finite-time reversal of a diffusion process, sharing the idea of learning transition operators of Markov chains with Goyal et al. (2017); Alain et al. (2016); Bordes et al. (2017). Since then, there have been a number of improvements and alternatives to diffusion models. Song et al. (2021c) generalize diffusion processes to continuous time, and provide a unified view of diffusion models and denoising score matching (Vincent, 2011; Song & Ermon, 2019). Jolicoeur-Martineau et al. (2021b) add an auxiliary adversarial loss to the main objective. This is fundamentally different from ours, as their auxiliary adversarial loss only acts as an image enhancer, and they do not use latent variables; therefore, their denoising distribution is still a unimodal Gaussian. Other explorations include introducing alternative noise distributions in the forward process (Nachmani et al., 2021), jointly optimizing the model and noise schedule (Kingma et al., 2021), and applying the model in latent spaces (Vahdat et al., 2021).
One major drawback of diffusion or score-based models is their slow sampling speed due to the large number of iterative sampling steps. To alleviate this issue, multiple methods have been proposed, including knowledge distillation (Luhman & Luhman, 2021), learning an adaptive noise schedule (San-Roman et al., 2021), introducing non-Markovian diffusion processes (Song et al., 2021a; Kong & Ping, 2021), and using better SDE solvers for continuous-time models (Jolicoeur-Martineau et al., 2021a). In particular, Song et al. (2021a) use x0 sampling as a crucial ingredient of their method, but their denoising distribution is still a Gaussian. These methods either suffer from significant degradation in sample quality, or still require many sampling steps, as we demonstrate in Sec. 5.
Among variants of diffusion models, Gao et al. (2021) have the closest connection with our method. They propose to model the single-step denoising distribution with a conditional energy-based model (EBM), sharing our high-level idea of using expressive denoising distributions. However, they motivate their method from the perspective of facilitating the training of EBMs. More importantly, although only a few denoising steps are needed, expensive MCMC has to be used to sample from each denoising step, making the sampling process slow with ∼180 network evaluations. ImageBART (Esser et al., 2021a) explores modeling the denoising distribution of a diffusion process on a discrete latent space with an auto-regressive model per step, using a few denoising steps. However, the auto-regressive structure of their denoising distribution still makes sampling slow.
Since our model is trained with an adversarial loss, our work is related to recent advances in improving the sample quality and diversity of GANs, including data augmentation (Zhao et al., 2020; Karras et al., 2020a), consistency regularization (Zhang et al., 2020; Zhao et al., 2021) and entropy regularization (Dieng et al., 2019). In addition, the idea of training generative models with smoothed distributions is also discussed in Meng et al. (2021a) for auto-regressive models.
5 EXPERIMENTS
In this section, we evaluate our proposed denoising diffusion GAN on the image synthesis problem. We begin by briefly introducing the network architecture design; additional implementation details are presented in Appendix C. For our GAN generator, we adopt the NCSN++ architecture from Song et al. (2021c), which has a U-Net structure (Ronneberger et al., 2015). The conditioning xt is the input of the network, and time embedding is used to ensure conditioning on t. We let the latent variable z control the normalization layers. In particular, we replace all group normalization layers (Wu & He, 2018) in NCSN++ with adaptive group normalization layers in the generator, similar to Karras et al. (2019); Huang & Belongie (2017), where the shift and scale parameters in group normalization are predicted from z using a simple multi-layer fully-connected network.
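For illustration, a minimal sketch of such an adaptive group normalization layer follows. This is an assumed implementation, not the released code: the MLP width and the (1 + scale) residual form are illustrative choices.

```python
# Sketch of adaptive group normalization: GroupNorm without affine parameters,
# followed by a per-channel scale and shift predicted from the latent z by a
# small fully-connected network.
import torch
import torch.nn as nn

class AdaptiveGroupNorm(nn.Module):
    def __init__(self, num_groups: int, num_channels: int, z_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels, affine=False)
        # Simple multi-layer fully-connected network mapping z to (scale, shift).
        self.mlp = nn.Sequential(
            nn.Linear(z_dim, 256), nn.SiLU(),
            nn.Linear(256, 2 * num_channels),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        scale, shift = self.mlp(z).chunk(2, dim=1)      # (B, C) each
        scale = scale[:, :, None, None]                  # broadcast over H, W
        shift = shift[:, :, None, None]
        return self.norm(x) * (1 + scale) + shift
```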
One major highlight of our model is that it excels at all three criteria of the generative learning trilemma. Here, we carefully evaluate our model's performance on sample fidelity, sample diversity, and sampling time, and benchmark it against a comprehensive list of models on the CIFAR-10 dataset.
Evaluation criteria: We adopt the commonly used Fréchet inception distance (FID) (Heusel et al., 2017) and Inception Score (IS) (Salimans et al., 2016) for evaluating sample fidelity. We use the training set as a reference to compute the FID, following common practice in the literature (see, e.g., Ho et al. (2020); Karras et al. (2019)). For sample diversity, we use the improved recall score from Kynkäänniemi et al. (2019), an improved version of the original precision and recall metric proposed by Sajjadi et al. (2018); the improved recall score reflects how well the variation in the generated samples matches that in the training set (Kynkäänniemi et al., 2019). For sampling time, we use the number of function evaluations (NFE) and the wall-clock time when generating a batch of 100 images on a V100 GPU.
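As an aside, wall-clock sampling time can be measured as in the simple sketch below (our assumed setup, not the official benchmarking script): GPU work is explicitly synchronized before and after generating one batch of 100 images, and the NFE of our model equals its number of denoising steps, since each step is a single generator call.

```python
# Illustrative timing of batched sampling on a GPU (assumed helper names).
import time
import torch

def time_sampling(sample_fn, batch=100, shape=(3, 32, 32)):
    torch.cuda.synchronize()                  # flush pending kernels first
    start = time.time()
    imgs = sample_fn((batch, *shape))         # e.g., the `sample` sketch above
    torch.cuda.synchronize()                  # wait for generation to finish
    return imgs, time.time() - start
```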
Results: We present our quantitative results in Table 1. We observe that our sample quality is competitive with the best diffusion models and GANs. Although some variants of diffusion models obtain better IS and FID, they require a large number of function evaluations to generate samples (while we use only 4 denoising steps). For example, our sampling time is about 2000× faster than the predictor-corrector sampling by Song et al. (2021c) and ∼20× faster than FastDDPM (Kong & Ping, 2021). Note that diffusion models can produce samples in fewer steps while trading off sample quality. To better benchmark our method against existing diffusion models, we plot the FID score versus sampling time of diffusion models by varying the number of denoising steps (or the error tolerance for continuous-time models) in Figure 4. The figure clearly shows the advantage of our model compared to previous diffusion models. When comparing our model to GANs, we observe that only StyleGAN2 with adaptive data augmentation has slightly better sample quality than ours. However, from Table 1, we see that GANs have limited sample diversity, as their recall scores are below 0.5. In contrast, our model obtains a significantly better recall score, higher even than several advanced likelihood-based models, and competitive among diffusion models. We show qualitative samples on CIFAR-10 in Figure 5. In summary, our model simultaneously excels at sample quality, sample diversity, and sampling speed, and tackles the generative learning trilemma to a large extent.