Mind Your Step: Continuous Conditional GANs with Generator Regularization
Abstract
1 Introduction
Conditional Generative Adversarial Networks (cGANs) [1] are a powerful class of generative
models where the goal is to learn a mapping from input to output distributions conditioned on
some auxiliary information, such as class labels [1, 2], images [3, 4], or text [5, 6]. While
cGANs have demonstrated outstanding capabilities in a wide range of conditional generation
tasks, they are also known to be difficult to train since the optimization objective must con-
sider various input conditions and is cast as a min-max game between the generator network
and the discriminator network. Much past work has been devoted to stabilizing the train-
ing of GANs. For example, [7] introduces the Wasserstein GAN (WGAN), which uses the Earth Mover's distance as a more explicit measure of distribution divergence in the loss function.
∗ The authors contributed equally.
† Currently works at Apple.
2 Method
Challenges of Continuous, Multi-Dimensional Conditions. cGANs commonly suffer from two problems, especially when the conditions are supported in a multi-dimensional, continuous space. (P1) Since the condition space X is continuous and multi-dimensional, the x_i's in the training set are very likely to be distinct from one another. To make matters worse, at each forward pass only one noise z_j is sampled from Z ∼ p_z(z), which is common practice when training cGANs. Therefore, for a given x_i, the discriminator only sees information about G(x_i, z_j) and may find it particularly challenging to generalize to the full distribution of G(x_i, Z). (P2) As we increase the number of dimensions of X, the observed conditions {x_i}_{i=1}^N become sparser and more gaps are created. For most conditions x ∈ X, few or even no samples are observed during training. In most of the cGAN literature, the generator is only given, and trained on, the conditions observed in the training set. As a result, the generator might generalize poorly when given a new condition from X that has never been observed in the training set.
GR-cGAN. To address the aforementioned issues, we propose a novel regularization of the generator
and name the resulting model Generator Regularized-cGAN (GR-cGAN). We first present the expression of the regularization term, and then discuss how it remedies these problems.
The generator regularization is based on a continuity assumption on the conditional distribution p_r(y|x), where y denotes the dependent variable. For a wide range of applications, though not all, it is natural to assume that a minor perturbation of the condition x only slightly disturbs the conditional distribution p_r(y|x). At a high level, we hope that the distribution of G(x, z), which is used to approximate p_r(y|x), shifts smoothly as we change x. Since directly regularizing the generator from a distribution perspective can be challenging, we instead regularize the gradient of G(x, z) with respect to x. Specifically, we add the following regularization term to the generator loss, encouraging the optimized generator G to keep it small:
$$\mathcal{L}_{GR}(G) = \mathbb{E}_{z\sim q(z),\, x\sim \tilde{p}(x)}\big[\, \|\nabla_x G(x, z)\| \,\big], \tag{1}$$
where ∇x G(x, z) is the Jacobian matrix. The empirical distribution p̃(x) indicates the locations
where we regularize the Jacobian matrix ∇x G(x, z), and can be implicitly defined by sampling
uniformly along straight lines between pairs of conditions sampled from the training set. This
sampling method allows us not only to take small steps on the conditions observed in the training
set, but also to take small steps on the conditions that have not been observed in the training set
but might occur while testing or applying the trained cGAN generator. If the conditions fall in a
high-dimensional space where linear interpolation is not feasible, we can first project the conditions
onto another vector space - for example, onto the latent space of variational autoencoders (VAEs) -
before interpolations [14, 15]. See Figure 4 for details of applying GR-cGAN to conditional time
series generation. Intuitively, when L_GR(G) takes a small value, for any fixed z = z_0, the output of the generator G(x, z_0) will only shift moderately and continuously as x changes.
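As an illustration, a minimal PyTorch sketch of this penalty is given below. It assumes a generator G(x, z) that takes a batch of conditions and a batch of noise, and it estimates the Jacobian norm with a single random vector-Jacobian product rather than the full Jacobian; this is a sketch, not the released implementation.

```python
import torch

def interp_condition_penalty(G, x1, x2, z):
    """Sample conditions uniformly along straight lines between pairs of
    training conditions and penalize the Jacobian of G w.r.t. the condition
    there. A random Gaussian vector v is used for a vector-Jacobian product,
    since E_v ||J^T v||^2 equals the squared Frobenius norm of the Jacobian J."""
    alpha = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)), device=x1.device)
    x_tilde = (alpha * x1 + (1.0 - alpha) * x2).detach().requires_grad_(True)
    y_fake = G(x_tilde, z)
    v = torch.randn_like(y_fake)
    (grad,) = torch.autograd.grad(y_fake, x_tilde, grad_outputs=v, create_graph=True)
    return grad.flatten(1).norm(dim=1).mean()
```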
Finally, the cGAN objective with generator regularization becomes
$$\min_G \max_D \; \mathcal{L}(D, G) := \mathbb{E}_{(x,y)\sim \hat{p}(x,y)}[\log D(x, y)] + \mathbb{E}_{z\sim p_z(z),\, x\sim \hat{p}(x)}[\log(1 - D(x, G(x, z)))] + \lambda \cdot \mathcal{L}_{GR}(G).$$
The term λ controls the degree of regularization: a larger λ discourages the model from reacting drastically to small perturbations of the input conditions.
How does generator regularization overcome (P1) and (P2)?
For (P1), during training, a batch of (x_i, y_i) pairs is sampled from the training set. For any x_i in this batch, when the generator regularization is applied, samples generated in the vicinity of x_i also contribute to training the generator and the discriminator. In the case where the generated distribution of G(x_i, z) with z ∼ p_z(z) is concentrated on a pathological, mode-collapsed distribution (in other words, the generator consistently underestimates the variance of the ground-truth conditional distribution p_r(y | x_i)), the discriminator can better detect local mode collapse and learn to classify such pathological distributions as fake, thus improving the generator in return.
For (P2), when given a new condition x_0 that does not exist in the training set, the conditional distribution given by the GR-cGAN generator at x_0 is similar to the conditional distributions given at nearby conditions from the training set. By penalizing large gradients, we effectively encourage the model to learn a smooth transition between each pair of samples from the training set and thus to generalize across these gaps.
We compare the proposed GR-cGAN with related works in Appendix B.
Following the definition of Lipschitz continuity for functions, we first give a formal definition of continuous conditional distributions, which we call Lipschitz continuous conditional distributions. We then present its connection to the proposed generator regularization.
Definition 3.1 (K-Lipschitz Continuous Conditional Distribution). Let X and Y be random variables with supports R_X and R_Y respectively. Denote the distribution induced by X | Y = y as F_y. We say X has a K-Lipschitz continuous conditional distribution with respect to Y if, for all y_1, y_2 ∈ R_Y, the Wasserstein distance between F_{y_1} and F_{y_2} satisfies
$$W(F_{y_1}, F_{y_2}) \le K \cdot \|y_1 - y_2\|,$$
where W(·, ·) denotes the Wasserstein distance between two distributions and ||·|| denotes a norm.
Note that when the Wasserstein distance is used to measure the distance between two probability distributions, cGANs can be extended to conditional Wasserstein GANs. Other measures of the distance between two conditional distributions can also be adopted.
Given two arbitrary conditions x_1 and x_2, suppose that the generator satisfies ||G(x_1, z_0) − G(x_2, z_0)|| ≤ K_0 · ||x_1 − x_2|| for any z_0. The conditional distributions given by the generator at x_1 and x_2 are those of G(x_1, z) and G(x_2, z) with z ∼ p_z(z), respectively. Notice that when the generator regularization is applied, the constant K_0 is pushed toward a smaller value. We prove in Appendix C that W(G(x_1, z), G(x_2, z)) ≤ K_0 · ||x_1 − x_2||, which indicates that the conditional distribution learned by the generator is a K_0-Lipschitz continuous conditional distribution with respect to x. With the generator regularization, the conditional distribution given by the generator is therefore encouraged to be more continuous in the sense of K-Lipschitz continuous conditional distributions.
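The link between this pointwise bound and Definition 3.1 can be sketched as follows; this is a condensed version of the argument deferred to Appendix C, where sharing the same noise z across the two conditions yields one particular coupling of the two generated distributions:

```latex
% Sketch: if ||G(x_1, z_0) - G(x_2, z_0)|| <= K_0 ||x_1 - x_2|| for every z_0,
% then the coupling induced by a shared z gives an upper bound on W:
\begin{align*}
W\big(G(x_1, z),\, G(x_2, z)\big)
  &\le \mathbb{E}_{z \sim p_z(z)}\big[\, \| G(x_1, z) - G(x_2, z) \| \,\big] \\
  &\le \mathbb{E}_{z \sim p_z(z)}\big[\, K_0 \, \| x_1 - x_2 \| \,\big]
   = K_0 \, \| x_1 - x_2 \|,
\end{align*}
% where the first inequality holds because the Wasserstein distance is an
% infimum over all couplings, and the shared-z coupling is one of them.
```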
4 Experiments
To demonstrate the effectiveness of the proposed generator regularization term, we begin our experiments with a simple 2-D synthetic dataset and then proceed to a real-world conditional time series generation task. Our code is available online.³
We generate a synthetic dataset from 2-D Gaussians with different means, following the experiments in CcGAN [13]. The condition x is one-dimensional and measures the polar angle of a given data point, while the dependency y is two-dimensional. Given x ∈ [0, 2π], we construct y so that it follows a 2-D Gaussian distribution; specifically,
$$y \sim \mathcal{N}(\mu_x, \Sigma), \quad \text{with } \mu_x = \begin{pmatrix} R\sin(x) \\ R\cos(x) \end{pmatrix} \text{ and } \Sigma = \tilde{\sigma}^2 I_{2\times 2} = \begin{pmatrix} \tilde{\sigma}^2 & 0 \\ 0 & \tilde{\sigma}^2 \end{pmatrix}.$$
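As a concrete illustration, the dataset can be generated as follows. This is a minimal NumPy sketch: R = 1 matches the setting used below, the 120 evenly spaced labels correspond to the full dataset (cf. Figure 6), and the value of σ̃ is an illustrative placeholder.

```python
import numpy as np

def sample_circular_gaussians(train_angles, samples_per_angle=10,
                              R=1.0, sigma_tilde=0.2, seed=0):
    """For each condition x (a polar angle), draw samples_per_angle points
    y ~ N((R sin x, R cos x), sigma_tilde^2 I). sigma_tilde is a placeholder."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for x in train_angles:
        mu = np.array([R * np.sin(x), R * np.cos(x)])
        y = rng.normal(loc=mu, scale=sigma_tilde, size=(samples_per_angle, 2))
        xs.append(np.full(samples_per_angle, x))
        ys.append(y)
    return np.concatenate(xs), np.concatenate(ys)

# Example: 120 evenly spaced train labels on [0, 2*pi)
x_train, y_train = sample_circular_gaussians(
    np.linspace(0.0, 2 * np.pi, 120, endpoint=False))
```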
For a thorough analysis, we study several different settings for x when generating the dataset. In
Section 4.1.1, we choose a subset of [0, 2π] for training and evaluate how well the models can
generalize to the gaps that are absent during training. In Appendix E.1.1, we further report the
experiment results when x is evenly distributed in the range of [0, 2π].
three test labels. The positions of the train labels and test labels are shown in Figure 2(a). To generate a training set, for each x in the train labels, 10 samples are generated; we denote this training set as the partial dataset. For R and σ̃², we set R = 1 [...] zero. For the GR-cGAN model, we use the loss term given in Equation 1 with λ = 0.02. We generate 100 fake samples for each test label and plot them in Figure 3. For each test label x, a circle that covers about 90% of the volume inside the pdf of N(µ_x, Σ) is also plotted. GR-cGAN achieves visually reasonable results. A desirable property is that the generator produces reasonable fake samples even when given a label that falls in a gap, so GR-cGAN can be used when there are missing labels in the training set. For example, in the task of generating photos of people from a given character description, if we only have samples of “young and happy” and “old and sad”, we can use GR-cGAN to generate “old and happy” images.
More quantitative evaluation metrics are given in Appendix E.1.2, which show that GR-cGAN performs better.
(a) CcGAN (HVDL) (b) CcGAN (SVDL) (c) Deg.GR-cGAN (d) GR-cGAN
Figure 3: Visual results of the Circular 2-D Gaussians experiments on partial dataset. For each
subfigure, we generate 100 fake samples using each GAN model at each of the 3 test labels.
The blue dots represent the fake samples. For each mean x in the test labels, a circle that can cover
about 90% of the volume inside the pdf of N (µx , Σ) is plotted.
In this section, we reformulate the setting defined in TimeGAN [16] to better study the effect of our generator penalty when the conditions are high-dimensional and more complex. Specifically, given multivariate sequential data of T time steps [y_i]_{i=1}^T and a distribution of latent noise [z_{T+i}]_{i=1}^τ, our goal is to generate its evolution over the next τ time steps, i.e., to model p([y_{T+i}]_{i=1}^τ | [y_i]_{i=1}^T). In our experiments, we always use the first one-third of the window as the condition, but this can easily be extended to conditions of arbitrary length.
Model Architecture. For a fair comparison, we avoid altering the original TimeGAN architecture
unless noted and only make minimal changes to let TimeGAN support conditional generation. Here,
we only highlight the differences we made. For more details on TimeGAN, we refer readers to [16].
As shown in Figure 8 in the Appendix, the generator G is changed to an encoder-decoder network,
where the conditions [y0:T ] are first fed into the encoder. The decoder takes in the hidden states
from the encoder as well as latent noise [z0:τ ]. Due to the nature of high-dimensional time series, a
simple linear interpolation of the conditions might not be feasible. Therefore, we first project the
conditions into a latent space where an interpolation makes sense. A simple and readily available
option is variational autoencoders (VAE). More details can be found in Appendix E.2.2. The VAE
conditioning module is illustrated in Figure 4.
Experiment Setup. We adapt the stock⁴ dataset from TimeGAN, which consists of daily historical Google stock data from 2004 to 2019. In addition, we also use the Electricity Transformer Temperature dataset [17] at the minute level (ETTm1⁵) due to its more intricate yet predictable dynamics. Data generated by different models are compared using two metrics: 1) predictive score (↓): normalized MAE, where an off-the-shelf transformer forecasting model is trained on the generated dataset and then used to predict on the real dataset; and 2) discriminative score (↓): a GRU discriminator is trained to classify whether a given sequence is generated or real, and the final score is the classification accuracy minus 0.5. More details on how we tailored the evaluation protocol for conditional generation compared to TimeGAN can be found in Appendix E.2.1.
⁴ https://github.com/jsyoon0823/TimeGAN/tree/master/data.
⁵ https://github.com/zhouhaoyi/ETDataset/blob/main/ETT-small/ETTm1.csv.
Figure 4: Interpolation between two conditions with a VAE module. The plotted windows are real model outputs on the “LULL” feature of ETTm1. Upon reconstructing the interpolated condition as well as its perturbed counterpart, we feed both into the generator and enforce the generated windows to be similar.
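For concreteness, a minimal sketch of how the two scores described above could be computed is given below; the normalization used for the predictive score and the helper names are illustrative assumptions rather than the exact evaluation code.

```python
import numpy as np

def predictive_score(y_real, y_forecast):
    """Normalized MAE of a forecaster trained on generated data and evaluated
    on real data (lower is better). Normalizing by the mean absolute value of
    the real targets is one common choice and is only an assumption here."""
    return np.mean(np.abs(y_real - y_forecast)) / (np.mean(np.abs(y_real)) + 1e-8)

def discriminative_score(is_real, pred_real_prob):
    """Accuracy of a real-vs-generated sequence classifier minus 0.5, so that
    an undetectable generator scores close to 0 (lower is better)."""
    acc = np.mean((pred_real_prob > 0.5) == is_real.astype(bool))
    return acc - 0.5
```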
Table 1: Evaluation of time series generation using the first 1/3 of the window as conditions. Disc
and Pred denote discriminative score and predictive score, respectively. Results for TimeGAN are
averaged across 2 runs, while the results for our model are averaged across 5-7 runs. For CcGAN, we
report the result from the best ϵ ∈ {0.1, 0.01, 0.001}.
We expect our performance to improve further if an appropriate model is used to generate more realistic interpolated conditions for the generator regularization.
5 Conclusion
In this work, we present a promising method to address the issues that arise when training conditional generative adversarial networks (cGANs) with continuous, high-dimensional conditions. We propose a simple generator regularization term on the GAN generator loss in the form of a Lipschitz penalty. More specifically, the regularization term encourages the generator to take smaller steps when a small perturbation is applied to the conditions. Meanwhile, we acknowledge that a more sophisticated method is desirable for interpolating a pair of conditions, in terms of both visual quality and ease of training.
References
[1] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint
arXiv:1411.1784, 2014.
[2] Takeru Miyato and Masanori Koyama. cgans with projection discriminator, 2018.
[3] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro.
High-resolution image synthesis and semantic manipulation with conditional gans, 2018.
[4] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and Locally Consistent Image
Completion. ACM Transactions on Graphics (Proc. of SIGGRAPH 2017), 36(4):107:1–107:14,
2017.
[5] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak
Lee. Generative adversarial text to image synthesis, 2016.
[6] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Mirrorgan: Learning text-to-image
generation by redescription, 2019.
[7] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.
[8] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville.
Improved training of wasserstein gans, 2017.
[9] Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee. Diversity-
sensitive conditional generative adversarial networks, 2019.
[10] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training gans, 2016.
[11] Han Zhang, Zizhao Zhang, Augustus Odena, and Honglak Lee. Consistency regularization for
generative adversarial networks, 2020.
[12] Vincent Dutordoir, Hugh Salimbeni, Marc Deisenroth, and James Hensman. Gaussian process
conditional density estimation, 2018.
[13] Xin Ding, Yongwei Wang, Zuheng Xu, William J Welch, and Z. Jane Wang. CcGAN: Continuous conditional generative adversarial networks for image generation. In International Conference on Learning Representations, 2021.
[14] Georgios Arvanitidis, Lars Kai Hansen, and Søren Hauberg. Latent space oddity: on the
curvature of deep generative models. In International Conference on Learning Representations,
2018.
[15] Nutan Chen, Alexej Klushyn, Richard Kurle, Xueyan Jiang, Justin Bayer, and Patrick van der
Smagt. Metrics for deep generative models, 2018.
[16] Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. Time-series generative adversarial
networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett,
editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates,
Inc., 2019.
[17] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai
Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In
The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference,
volume 35, pages 11106–11115. AAAI Press, 2021.
[18] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization
for generative adversarial networks, 2018.
[19] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-
image translation. In Proceedings of the European Conference on Computer Vision (ECCV),
September 2018.
[20] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang,
and Eli Shechtman. Toward multimodal image-to-image translation, 2018.
[21] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data
science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
[22] Yunkai Zhang, Qiao Jiang, Shurui Li, Xiaoyong Jin, Xueying Ma, and Xifeng Yan. You may
not need order in time series forecasting. CoRR, abs/1910.09620, 2019.
[23] Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Celikyilmaz, and Lawrence Carin. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. In NAACL, 2019.
A Problem Formulation
Let X ⊂ R^m, Y ⊂ R^n, Z ⊂ R^l be the condition space, the output space, and the latent space
respectively. Denote the underlying joint distribution for x ∈ X and y ∈ Y as pr (x, y). Thus, the
conditional distribution of y given x becomes pr (y|x). The training set consists of N observed
(x, y) pairs, denoted as {(x_i, y_i)}_{i=1}^N. Following the vanilla cGAN [1], we introduce a random noise
z ∈ Z and z ∼ pz (z), where pz (z) is a predetermined easy-to-sample distribution. The goal is to
train a conditional generator G : X × Z → Y, whose inputs are the condition x and latent noise z,
in order to imitate the conditional distribution pr (y|x). Our proposed regularization term is suitable
for most variants of cGAN losses, such as the vanilla cGAN loss [1], the Wasserstein loss [7], and
the hinge loss [18]. Without loss of generality, in Section 2 we illustrate our regularization term on the vanilla cGAN loss, under which the conditional generator G and discriminator D are learned by jointly optimizing the following objective:
$$\min_G \max_D \; \mathcal{L}_{cGAN}(D, G) = \mathbb{E}_{(x,y)\sim \hat{p}(x,y)}[\log D(x, y)] + \mathbb{E}_{z\sim p_z(z),\, x\sim \hat{p}(x)}[\log(1 - D(x, G(x, z)))], \tag{2}$$
where p̂(x, y) is the empirical distribution of {(x_i, y_i)}_{i=1}^N, and p̂(x) is the empirical distribution of {x_i}_{i=1}^N. A natural choice of the norm in Equation (1) is the Frobenius norm, computed as
$$\|\nabla_x G(x, z)\| = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} \left( \frac{\partial G_i(x, z)}{\partial x_j} \right)^2}.$$
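For illustration, this Frobenius norm can be computed exactly with automatic differentiation. The sketch below is our own helper, not the released code; it loops over the batch and builds the full Jacobian per sample, which is exact but expensive.

```python
import torch
from torch.autograd.functional import jacobian

def jacobian_frobenius_norm(G, x, z, create_graph=True):
    """Average Frobenius norm of the Jacobian of G(x, z) w.r.t. x over a batch.
    Exact but expensive: one full Jacobian is built per sample."""
    norms = []
    for xi, zi in zip(x, z):
        xi, zi = xi.unsqueeze(0), zi.unsqueeze(0)
        J = jacobian(lambda c: G(c, zi).flatten(), xi, create_graph=create_graph)
        norms.append(J.flatten().norm())
    return torch.stack(norms).mean()
```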
or the generator architecture, but the connection with small perturbations of the conditions is relatively less studied. Notably, CcGAN [13] attempts to address the continuous-condition issue by adding Gaussian noise to the input conditions. This implies that the model could lose more granular information about the conditions, resulting in outputs that might be less faithful to the input conditions. In particular, when there are large gaps in the dataset, CcGAN must choose a large standard deviation for the Gaussian noise in order to cover these gaps, which further exacerbates the issue. On the other hand, our proposed method relies on encouraging gradual changes of the output with respect to the input conditions, and it does not blur the input conditions themselves.
By the definition of the Wasserstein distance,
$$W(P_f(Z), P_g(Z)) = \inf_{\gamma \in \Pi(P_f, P_g)} \mathbb{E}_{(x, y)\sim \gamma}\big[\|x - y\|\big].$$
The proof of Theorem C.1 follows directly from Lemma C.1. In Theorem C.1, given a fixed x_1 (or x_2), the generator G can be viewed as a function G(x_1, ·) (or G(x_2, ·)) that maps a random noise z to G(x_1, z) (or G(x_2, z)). Take the random variable Z in Lemma C.1 to be z, and take G(x_1, ·) and G(x_2, ·) as the functions f and g in Lemma C.1. Theorem C.1 then follows.
where
$$f(x, \Delta x, z) = \frac{\|G(x + \Delta x, z) - G(x, z)\|}{\|\Delta x\|},$$
Δx ∼ p_{Δx}(Δx) is a small perturbation added to x, and p_{Δx}(Δx) is the distribution of Δx. The distribution p_{Δx}(Δx) is designed to be centered close to zero with a small variance, such as a normal distribution. τ_1 is a bound imposed to ensure numerical stability. We also impose a lower bound τ_2 on Δx for the same reason. If the generator regularization takes the approximate form in Equation 6, the training algorithm is given in Algorithm 2.
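A minimal PyTorch sketch of this finite-difference approximation is given below; the perturbation scale and the values of τ_1 and τ_2 are illustrative placeholders.

```python
import torch

def finite_difference_penalty(G, x, z, sigma=0.05, tau1=10.0, tau2=1e-3):
    """Approximate ||G(x + dx, z) - G(x, z)|| / ||dx|| with a random small
    perturbation dx. ||dx|| is bounded below by tau2 and the ratio is clamped
    at tau1 for numerical stability; sigma, tau1, tau2 are placeholders."""
    dx = sigma * torch.randn_like(x)
    dx_norm = dx.flatten(1).norm(dim=1).clamp_min(tau2)
    diff = (G(x + dx, z) - G(x, z)).flatten(1).norm(dim=1)
    return (diff / dx_norm).clamp_max(tau1).mean()
```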
Algorithm 1 An algorithm for training GR-cGAN with generator regularization as in Equation 1
Require: The generator regularization coefficient λ, the training set {x_i, y_i}_{i=1}^N, the batch size m, the number of discriminator iterations per generator iteration n, Adam hyper-parameters α, β_1 and β_2, and the number of iterations K.
Require: w_0, initial discriminator parameters; θ_0, initial generator parameters.
1: for k = 1 to K do
2:   for t = 1, . . . , n do
3:     Sample a batch of real samples from the training set, denoted {x_j, y_j}_{j=1}^m.
4:     Sample a batch of random noises independently, z_j ∼ p_z(z), for j = 1, 2, . . . , m.
5:     Discriminator loss ← (1/m) Σ_{j=1}^m [ log D(x_j, y_j) + log(1 − D(x_j, G(x_j, z_j))) ]
6:     Update D.
7:   end for
8:   Sample two batches of real samples from the training set independently, denoted {x_j, y_j}_{j=1}^m and {x'_j, y'_j}_{j=1}^m.
9:   Sample a batch of random noises independently, z_j ∼ p_z(z) for j = 1, 2, . . . , m.
10:  Sample random numbers ε_j ∼ U[0, 1] for j = 1, 2, . . . , m.
11:  x''_j ← ε_j x_j + (1 − ε_j) x'_j for j = 1, 2, . . . , m.
12:  L_GR(G) ← (1/m) Σ_{j=1}^m ||∇_{x''_j} G(x''_j, z_j)||
13:  Generator loss ← (1/m) Σ_{j=1}^m [ log(1 − D(x_j, G(x_j, z_j))) ] + λ · L_GR(G)
14:  Update G.
15: end for
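For illustration, the generator update (steps 8 to 14) could look as follows in PyTorch. This is a sketch under the assumptions that D(x, y) outputs probabilities in (0, 1), that G(x, z) is batched, and that jacobian_frobenius_norm is the helper sketched after the Frobenius-norm formula in Appendix A; it is not the released training code.

```python
import torch

def generator_step(G, D, opt_G, x, x_prime, z, lam):
    """One generator update following Algorithm 1, steps 8-14 (sketch)."""
    eps = torch.rand(x.size(0), *([1] * (x.dim() - 1)), device=x.device)
    x_interp = eps * x + (1.0 - eps) * x_prime              # step 11
    l_gr = jacobian_frobenius_norm(G, x_interp, z)          # step 12
    adv = torch.log(1.0 - D(x, G(x, z)) + 1e-8).mean()      # step 13
    gen_loss = adv + lam * l_gr
    opt_G.zero_grad()
    gen_loss.backward()
    opt_G.step()                                            # step 14
    return gen_loss.item()
```

In practice, a non-saturating generator loss is often substituted for the log(1 − D) term; the sketch simply follows the algorithm as written.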
Gaussian) is recovered if at least one high-quality sample is generated. For the conditional distribution given by the generator (i.e., the distribution of G(x, z) with z ∼ p_z(z)), we assume this distribution is Gaussian and estimate its mean and covariance using 100 fake samples, denoted by µ^G_x and Σ^G_x respectively. We compute the 2-Wasserstein distance (W2) [21] between the true conditional distribution and the distribution given by the generator, in other words, the 2-Wasserstein distance between
$$\mathcal{N}\!\left( \begin{pmatrix} R\sin(x) \\ R\cos(x) \end{pmatrix},\ \tilde{\sigma}^2 I_{2\times 2} \right) \quad \text{and} \quad \mathcal{N}(\mu^G_x, \Sigma^G_x).$$
Figure 6: (a) plots the locations of the means of the 120 Gaussians. (b) illustrates 1,200 randomly chosen samples from the training set.
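For reference, the 2-Wasserstein distance between two Gaussians has a closed form; a minimal NumPy/SciPy sketch (the helper name is ours) is:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussians(mu1, cov1, mu2, cov2):
    """W2^2 = ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov2^{1/2} cov1 cov2^{1/2})^{1/2})."""
    sqrt_cov2 = sqrtm(cov2)
    cross = np.real(sqrtm(sqrt_cov2 @ cov1 @ sqrt_cov2))
    w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * cross)
    return np.sqrt(max(float(w2_sq), 0.0))
```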
The whole experiment is repeated three times, and the metrics averaged over the three repetitions are reported in Table 2. We see that GR-cGAN demonstrates competitive performance against CcGAN, especially in terms of the 2-Wasserstein distance.
Visual results. We select 8 angles that do not exist in the training set. For each selected angle x, we use all the models to generate 100 fake samples. Furthermore, we plot the circle with (sin(x), cos(x)) as the center and 2.15σ̃ as the radius to indicate the true conditional distribution N(µ_x, Σ). The results are given in Figure 7. Fake samples from our method match the true samples better than those from the other methods.
(a) CcGAN (HVDL) (b) CcGAN (SVDL) (c) Deg.GR-cGAN (d) GR-cGAN
Figure 7: Visual results of the Circular 2-D Gaussians experiments on the full dataset. For each
subfigure, we generate 100 fake samples using each model at each of the 8 means that are absent from
the training set. The blue dots represent the fake samples. For each mean x given, the circle is centered at (sin(x), cos(x)) and has a radius of 2.15σ̃, which covers about 90% of the volume inside the
pdf of N (µx , Σ).
E.1.2 Partial Dataset
We provide the quantitative evaluation metrics for the experiment in Section 4.1, as shown in Table 3.
Training steps: The training procedure is the same as in [13]. All GANs are trained for 6000 iterations on the training set with the Adam optimizer (Kingma & Ba, 2015), with β_1 = 0.5 and β_2 = 0.999, a constant learning rate of 5 × 10^{-5}, and a batch size of 128. The hyperparameters of CcGAN take the same values as in Section S.VI.B of [13]. The λ of GR-cGAN is set to 0.02, with the generator regularization term computed by Equation 1.
Figure 8: The TimeGAN architecture modified to support conditional inputs, which we denote as cTimeGAN. Note that ĥ_{0:T} is not generated by G_enc, as we can directly leverage the Embedding network (E) trained via direct supervision in an autoencoder manner.
In contrast to TimeGAN, which uses a simple GRU with minimal hidden units, we argue that it is essential to use a more powerful model for the predictive score: otherwise, even if the generation model were able to generate complex real-world dynamics, a simple forecasting model could easily fail to capture them. We also report the predictive score under multi-step forecasting, where the forecast horizon is always set to the last one-third of the window. On the other hand, a simpler model is desired for the discriminative score, as the task of the discriminator is considered much simpler than generation. Lastly, to highlight the robustness of our model to unseen conditions, we also report the results when the test set is included in the training of the generator, under the dataset name ETTm1all.
As for the discriminative score model, we use a two-layer GRU with 5 hidden units and a dropout of 0.1. For the forecasting model, we use a two-layer encoder-decoder transformer with sinusoidal positional attention, d_model = 20, d_ff = 30, and num_heads = 2; dropout is also 0.1. We notice that the datasets could be small if we train the forecasting model only on generated data of the same size as the original data. Thus, we let the generated data be 10 times the size of the original data for ETTm1 and 100 times for stock. Even so, the amount of patterns in the two datasets is still limited. Therefore, we also randomly mask out [22] 10% of the data during training for ETTm1 and 20% of the data for stock in order to obtain more stable forecasting performance.
Note that the output of the G-encoder is not used directly to generate the hidden states [ĥ_{0:T}]; these are instead produced by the E module. This leads to better results in
our experiments, as we hypothesize that mapping y to ĥ is considerably different from extracting
meaningful information to assist the mapping from z to ĥ. The rest of the architecture remains the
same as TimeGAN’s.
Before training GR-cGAN, we first train a VAE to map a pair of conditions [y_{1,i}]_{i=0}^T, [y_{2,i}]_{i=0}^T into a latent space with a prior of N(0, I), obtaining (µ_1, µ_2). Then, we randomly sample α ∈ (0, 1) and feed µ̃ = αµ_1 + (1 − α)µ_2 back into the VAE decoder to get the reconstructed conditions [ỹ_{µ̃,i}]_{i=0}^T and [ỹ_{µ̃+ϵ,i}]_{i=0}^T. In the end, our gradient penalty becomes
$$\mathcal{L}_{GR}(G) = \big\| G([\tilde{y}_{\tilde{\mu}, i}]_{i=0}^{T}, [z_{T+i}]_{i=1}^{\tau}) - G([\tilde{y}_{\tilde{\mu}+\epsilon, i}]_{i=0}^{T}, [z_{T+i}]_{i=1}^{\tau}) \big\|.$$
Note that we omit σ_1, σ_2 from the VAE encoder for simplicity, but noticed little impact on performance.
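A minimal PyTorch-style sketch of this VAE-based penalty is shown below; vae_encoder, vae_decoder, the generator interface, and eps_scale are illustrative assumptions rather than the exact released modules.

```python
import torch

def vae_interp_penalty(vae_encoder, vae_decoder, G, y1, y2, z, eps_scale=0.05):
    """Interpolate two condition windows in the VAE latent space, decode the
    interpolated latent and a slightly perturbed copy, and penalize the
    distance between the generator outputs on the two reconstructions."""
    mu1 = vae_encoder(y1)   # posterior means; sigmas omitted for simplicity
    mu2 = vae_encoder(y2)
    alpha = torch.rand(mu1.size(0), 1, device=mu1.device)
    mu_tilde = alpha * mu1 + (1.0 - alpha) * mu2
    cond = vae_decoder(mu_tilde)                        # reconstructed condition
    cond_pert = vae_decoder(mu_tilde + eps_scale * torch.randn_like(mu_tilde))
    return (G(cond, z) - G(cond_pert, z)).flatten(1).norm(dim=1).mean()
```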
We use three-layer GRUs for all the submodules in cTimeGAN. For ETTm1, the hidden dimension
is set to 30, the z dimension is 12, and dropout is 0.1. For stock, the hidden dimension is set to 24, the z dimension is 24, and dropout is 0. The rest of the hyperparameters stay the same as TimeGAN's.
As for the VAE module, we adapted a simple implementation from Github6 and replaced its 2D-
convolutional layers with stacked 1D-convolutional layers. To improve its stability, we further use a
cyclical annealing schedule [23] as well as a discriminator to facilitate more realistic reconstructions.
KL-regularization weight is always set to 0.0001. All of our hyperparameters can be found in the
released code. All the hyperparameters except λ are chosen heuristically.
We first show in Figure 5 and Figure 9 the sensitivity of our model under different generator penalty weights λ. For CcGAN, when ϵ = 0.1, the discriminative score is 0.421 and the predictive score is 0.071; when ϵ = 0.01, they are 0.468 and 0.052; when ϵ = 0.001, they are 0.390 and 0.0599.
Figure 9: Sensitivity analysis of the generator regularizer weight λ on (a) ETTm1 and (b) stock. The horizontal dashed lines are the discriminative scores (blue) and predictive scores (orange) for the baseline cTimeGAN. The confidence intervals are constructed by evaluating the generated data 10 times. Lower scores indicate better results. For example, on the left-hand side, we see that the red box corresponding to λ = 30 falls entirely under the red dashed line, meaning that in all ten runs our model surpasses the averaged baseline result. Meanwhile, we also note that there are large variances in the discriminative score. We leave the redesign of the discriminative score evaluation framework to future studies.
6
https://github.com/AntixK/PyTorch-VAE.
Figure 10: Our model with λ = 100 on ETTm1. The vertical green dashed lines separate the
conditions from the generated results. The first two rows are the generation results on conditions seen
during training, where the first row shows the generated data and the corresponding figures in the
second row show the corresponding original data (e.g., the image in the first column and the first row
uses the same condition as the image in the first column and the second row, where the former is the
generated window and the latter is the ground truth). The last two rows are the generation results on
conditions not seen during training. Our model is still able to capture the general trend and output
realistic samples.
Figure 11: A forecasting model trained with the data generated by our model with λ = 100 on
ETTm1. The forecasting model is able to predict the overall trend, indicating that the generated data
has properties that are faithful to the original data.
Figure 12: The baseline model cTimeGAN on ETTm1. The vertical green dashed lines separate the
conditions from the generated results. The first two rows are the generation results on conditions seen
during training, where the first row shows the generated data and the corresponding figures in the
second row show the corresponding original data (e.g., the image in the first column and the first row
uses the same condition as the image in the first column and the second row, where the former is the
generated window and the latter is the ground truth). The last two rows are the generation results on
conditions not seen during training. The model behaves considerably worse on unseen conditions.
Figure 13: A forecasting model trained with the data generated by cTimeGAN on ETTm1. The
forecasting model tends to give roughly flat predictions, indicating that the generated data is less
consistent with the original version.