scVAE
Keywords: gene expression; scRNA-seq; count data; machine learning; deep learning; Bayesian
inference; generative modelling; variational auto-encoder.
developed new generative models based on the VAE framework for directly modelling raw counts from RNA-seq data; 2) we show that our Gaussian-mixture VAE (GMVAE) learns biologically plausible groupings of scRNA-seq data of higher quality than previous methods; and 3) we provide a publicly available framework for unsupervised modelling of count data from RNA-seq experiments.

Results

Both the standard VAE and the GMVAE have been used to model both single-cell and traditional RNA-seq data sets (Table 1). The performance of different likelihood functions has been investigated, and the latent spaces of the models have been examined.

scRNA-seq data of purified immune cells

The first data set we modelled was of single-cell gene expression levels of peripheral blood mononuclear cells (PBMC) [23]. We also modelled two subsets of the PBMC data set: one of lymphocytes (PBMC-L) and another of T cells (PBMC-T).

To assess the optimal network architectures for these data sets, we first trained VAE models using a negative binomial (NB) likelihood function (VAE-NB models) with different network architectures on the three data sets with all genes included. This was carried out as grid searches over the number of hidden units and latent dimensions (see Additional file 1: Figure S1). We found that using two smaller hidden layers (of 100 units each) in the generative and inference processes yielded better results than using only one hidden layer for all data sets. For the full PBMC data set, a high-dimensional latent space of 100 dimensions resulted in the highest test marginal log-likelihood lower bound, whereas for the two smaller subsets, a lower-dimensional latent space of 25 dimensions gave the best lower bound.

With these network architectures, different discrete likelihood functions were examined for each data set. The marginal log-likelihood lower bound as well as the adjusted Rand index for the test sets are listed in Table 2. From this table it is clear that using the negative binomial distribution yielded the highest lower bound for all data sets, followed closely by its zero-inflated version (ZINB). For the Rand indices, however, using the constrained Poisson (CP) distribution gave the highest values for the subsets, especially for the T cells, while a straight Poisson distribution yielded the highest index for the full data set, with the constrained version as a close second. It should be noted, however, that a high lower bound does not necessarily result in a high Rand index. For instance, the lower bounds and Rand indices for the full data set were not strongly correlated, with, e.g., a sample correlation coefficient of r = −0.0903 (see Additional file 1: Figure S2).

To test whether a more complex model is required or a simpler model is sufficient in modelling the data, the network architecture and likelihood function yielding the highest marginal log-likelihood lower bound for the VAE model for each data set were used to train GMVAE models as well as factor-analysis (FA) models. The resulting lower bounds and Rand indices are listed in Table 2. The FA models had worse lower bounds compared to their VAE equivalents, but their Rand indices were significantly higher (except for the PBMC-L data set), whereas the GMVAE models improved both the lower bounds and the Rand indices. This shows that non-linear models really make a difference for the marginal log-likelihood of the data. The latent space of the GMVAE-NB model for the test set of the full PBMC data set can be seen in Figure 1, which shows different clusters corresponding to distinct cell types, while more similar cell types are clustered together.

To see if we can achieve better results by limiting the genes we use in our models, we also trained and tested models using only the 100 most varying genes for all data sets, and the 800 most varying genes for the PBMC and PBMC-T data sets, using the same procedure as above. Different network architectures were examined (see Additional file 1: Figures S3–S4), and different likelihood functions and models were investigated using the optimal network architectures (see Additional file 1: Tables S1–S2). Using a constrained Poisson distribution as the likelihood function produced the highest lower bounds, whereas different likelihood functions produced the best Rand indices. The GMVAE models again have higher lower bounds than their corresponding VAE and FA models, but also higher Rand indices.
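To illustrate the gene selection used above, the following is a minimal sketch in NumPy, assuming a dense count matrix with cells as rows and genes as columns; the function and variable names are ours, not from the scVAE code base.

```python
import numpy as np

def select_most_varying_genes(counts, number_of_genes=100):
    """Keep the genes (columns) with the highest variance across cells."""
    gene_variances = counts.var(axis=0)
    top_genes = np.argsort(gene_variances)[::-1][:number_of_genes]
    return counts[:, top_genes], top_genes

# Example with synthetic counts: 1000 cells and 20 000 genes.
counts = np.random.poisson(1.0, size=(1000, 20000))
subset, gene_indices = select_most_varying_genes(counts, number_of_genes=100)
```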
Sun et al. [5] have reported adjusted Rand indices for both the PBMC-L and PBMC-T data sets using variations of their DIMM-SC model as well as the Seurat [3], cellTree [4], and k-means clustering models (see Additional file 1: Tables S1–S2 for the performance of these models). For the PBMC-L data set, the highest Rand index achieved by these models was 0.990, by a DIMM-SC model using the 100 most varying genes. Our GMVAE-CP model yielded a Rand index of 0.9990 using the same genes (see Additional file 1: Figure S5 for its latent space), and two models, the VAE-ZINB and the VAE-CP models, resulted in Rand indices of 1.000 using all genes.
Table 1: Overview of the RNA-seq data sets. Examples refer to cells or cell samples, features refer to gene names or gene IDs, and classes refer to either cell types or tissue sites. The T-cell subset (PBMC-T) contains CD8+/CD45RA+ naïve cytotoxic T cells, CD4+/CD25+ regulatory T cells, and CD4+/CD45RA+/CD25− naïve T cells, while the lymphocyte subset (PBMC-L) contains CD56+ natural killer cells, CD19+ B cells, and CD4+/CD25+ regulatory T cells.
Table 2: Test metrics for the purified immune cells data sets. The test marginal log-likelihood lower bounds (in nats), L^test, and the adjusted Rand indices, R_adj^test, of the PBMC data sets evaluated using different likelihood functions for the standard VAE and using the most promising likelihood function for the FA and GMVAE models. The highest lower bounds and Rand indices for each data set and model have been highlighted in bold.
Figure 1: Latent space of the GMVAE-NB trained and evaluated on the PBMC data set. The encodings of the cells in the latent space have been embedded in two dimensions using t-SNE and are colour-coded by their cell types. Clear separations can be seen corresponding to different cell types, but some similar cell types are also clustered close together or mixed together.
The highest Rand index that the other models achieved for the PBMC-T data set using the 100 most varying genes was 0.408, by a DIMM-SC model. Our GMVAE-CP model yielded a Rand index of 0.7076 using the same genes. Using the 800 most varying genes, the highest Rand index attained by the other models was 0.556 for the best DIMM-SC model, while it was 0.8452 for the GMVAE-CP model (see Additional file 1: Figure S6 for its latent space).

Table 3: Test metrics for the mouse brain cells data sets.

Data set   Model    Likelihood   L^test
MB-20k     VAE      P            −5553.01
                    NB           −5517.0
                    ZIP          −5538.3
                    ZINB         −5516.4
                    PCP          −5586.7
                    CP           −5595.3
           GMVAE    ZINB         −5508.6
MB-1M      VAE      ZINB         −5398.2
           GMVAE    ZINB         −5396.3

The test marginal log-likelihood lower bound (in nats), L^test, for the mouse brain (MB) data sets evaluated using different likelihood distributions for the standard VAE and using the most promising likelihood function for the GMVAE. The highest lower bounds for each data set and model have been highlighted in bold.

Table 4: Test metrics for the TCGA data set.

Model    Likelihood   L^test/10^5   R_adj^test
VAE      P            −337.84       0.6517
         NB           −1.6187       0.6176
         ZIP          −338.08       0.6470
         ZINB         −1.6220       0.6673
         PCP          −337.98       0.6421
         CP           −72.615       0.6995
GMVAE    NB           −1.6194       0.3611

The test marginal log-likelihood lower bound (in nats), L^test, as well as the adjusted Rand index, R_adj^test, for the feature-mapped TCGA data set evaluated using different likelihood distributions for the standard VAE and using the most promising likelihood function for the GMVAE. The highest lower bound and Rand index have been highlighted in bold.

scRNA-seq data of mouse brain cells

Next, we considered the second single-cell gene expression data set [24], which consists of gene expression levels for 1.3 million mouse brain cells (MB-1M). Since no cell labels are available for this data set, the adjusted Rand index cannot be evaluated.

We can scale the model to train on the full MB-1M data set without any modifications. However, because of the large number of cells, we used the smaller subset of 20 000 cells (MB-20k) to perform a network architecture grid search (see Additional file 1: Figure S7) as well as to test the different likelihood functions (see Table 3), following the same procedure as in the previous section. The highest test marginal log-likelihood lower bound was achieved using the zero-inflated negative binomial distribution, with the regular negative binomial as a close second. A GMVAE model with 10 components (clusters) was also trained on the subset using this configuration, and this model improved upon the lower bound (see Additional file 1: Figure S8 for its latent space).

For the full data set, a VAE and a GMVAE model were trained using the optimal configuration found for the subset, and the lower bounds using these models are listed in Table 3. The GMVAE model achieved the highest lower bound of the two, and its latent space has been plotted in Figure 2, which shows somewhat clear regions for each cluster and even some separation for some clusters. Training each model for one epoch on the full data set took approximately 8 minutes for the VAE model and approximately 26 minutes for the GMVAE model on a GeForce GTX 1080 Ti graphics card.

Traditional RNA-seq data of human cancer cells

Lastly, we modelled the traditional RNA-seq data set of human cancer cells from TCGA [25]. As with the previous data sets, different network architectures (see Additional file 1: Figure S9) and likelihood functions (see Table 4) were investigated. There was a clear difference in marginal log-likelihood lower bound between using either of the negative binomial distributions and any of the Poisson distributions. The constrained Poisson distribution, however, achieved a significantly higher lower bound than the other Poisson distributions. The adjusted Rand indices were much closer together in value for the different likelihood functions, with the constrained Poisson distribution having the highest index. A GMVAE model with a negative binomial distribution was also trained, but it did not produce better results than its VAE equivalent, especially not for the Rand index. The latent space of the GMVAE-NB model can be seen in Figure 3, where samples belonging to multiple tissue sites are clearly separated. The latent space for the VAE-NB model looks very similar despite the difference in Rand index (see Additional file 1: Figure S10).
Figure 2: Latent space of the GMVAE-ZINB trained and evaluated on the MB-1M data set. Ten clusters were used in the model, and the encodings of the cells in the latent space have been embedded in two dimensions using t-SNE and are colour-coded by the clusters to which they belong. Different regions can be seen, with even some separation for some clusters.

Figure 3: Latent space of the GMVAE-NB trained and evaluated on the feature-mapped TCGA data set. The encodings of the cells in the latent space have been embedded in two dimensions using t-SNE and are colour-coded by the tissue sites to which they belong. Many of the clusters are clearly separated, and they correspond to different tissue sites.
Discussion

We show that VAEs can be used to model scRNA-seq data. The GMVAE model achieves better marginal log-likelihood lower bounds as well as higher Rand indices compared to both a corresponding VAE model and previous results, and the latent spaces for the GMVAE also show clear separation according to cell types. The GMVAE model, however, struggles with traditional RNA-seq data, but this might be explained by the diversity within each data sample.

In modelling three different data sets and subsets thereof, we have found that the following guidelines help in achieving a higher marginal log-likelihood lower bound on a data set. The network architecture should adapt to the data set: the more cells in the data set, the larger the network should be to better capture more subtle co-variation between genes. The network should also be larger if several cell types are present, to capture this variability, or if preprocessing leaves the data set less sparse, which is the case when, e.g., selecting the most varying genes. The models for all data sets also benefited from deeper network architectures. For the likelihood function, the negative binomial distribution (or in one case, its zero-inflated version) yielded the highest values of the lower bound when using all genes, especially for the traditional RNA-seq data set, while the constrained Poisson distribution prevailed when limiting to the most varying genes. We have also shown that using a simpler linear-factor model for the generative process leads to significantly worse lower bounds, demonstrating that non-linear transformations can more easily express the subtler co-variation patterns in the data sets.

Its built-in clustering helps the GMVAE achieve a better log-likelihood lower bound than the standard VAE. Compared to the DIMM-SC model by Sun et al. [5], the GMVAE models and some VAE models achieve a higher adjusted Rand index. We found, however, the correlation between lower bounds and Rand indices to be weak. We also showed that both models are able to scale up to very large data sets like the data set of 1.3 million mouse brain cells from 10x Genomics.

During the submission process and independently of our work, two recent articles [26, 27] have proposed methodology quite similar to ours. Both focus on the zero-inflated negative binomial likelihood function. One [26] uses variational inference like we propose here, and the other [27] uses a bottleneck non-probabilistic auto-encoder.

Conclusions

We show that the two proposed variational auto-encoders are able to model gene expression counts using appropriate discrete probability distributions as likelihood (link) functions, and we provide a software implementation. These models are probabilistic and put a lower bound on the marginal likelihood of the data sets, enabling us to examine and compare different link functions. We have applied both models successfully to both single-cell and traditional RNA-seq data sets. By building clustering into the Gaussian-mixture variational auto-encoder, we have a model that can sort cells into cell populations with little human interaction compared to earlier, more involved models. Having both an inference process and a generative process makes it possible to project new data onto an existing latent space, or even to generate new data from samples in the latent space. This means that new cells can be introduced to an already trained model, and it could enable combining the latent representations of two cells to generate a cell and the transitional states in between.

In the future, we would like to make the models more flexible by adding more latent variables [10] and making the models adaptable to the number of clusters [28]. We could also use semi-supervised learning and active learning to better classify cells and identify cell populations. This would also help with transfer learning, enabling modelling of multiple data sets with the same model. Since these models are generative, it should also be possible to combine encodings of the cells in the latent space to produce in-between cells, like Campbell and Kautz [29], who used a Gaussian-process latent-variable model on fonts [30]. We note that it is possible to apply these models to data sets with multiple modalities such as RNA-seq and exome sequencing [31].
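To make the idea of in-between cells concrete, the following is a minimal sketch of interpolating between two cells in latent space and decoding the intermediate points. The tiny linear encoder and decoder are hypothetical stand-ins for trained scVAE networks, used only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a trained encoder mean function and decoder.
encoder_weights = rng.normal(size=(20, 2))   # 20 genes -> 2 latent dimensions
decoder_weights = rng.normal(size=(2, 20))

def encode_mean(x):
    return x @ encoder_weights

def decode_rates(z):
    return np.exp(z @ decoder_weights)       # positive expected counts

cell_a = rng.poisson(1.0, size=20).astype(float)
cell_b = rng.poisson(1.0, size=20).astype(float)
z_a, z_b = encode_mean(cell_a), encode_mean(cell_b)

# Decode points along the line between the two latent representations to
# obtain expression profiles of hypothetical transitional states.
for t in np.linspace(0.0, 1.0, 5):
    z = (1 - t) * z_a + t * z_b
    expected_counts = decode_rates(z)
```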
Methods

We have developed generative models for directly modelling the raw read counts from scRNA-seq data. In this section, we describe the models as well as the different data likelihood (link) functions for this task.

Latent-variable models

We take a generative approach to modelling the data-generating process of the count data vector x. We assume that x is drawn from the distribution pθ(x) = p(x | θ), parameterised by θ. In this case, x represents a single cell, and its components xn, which are also called features, correspond to the gene expression count for gene n. The number of features, N, is very high, but we would still want to be able to model co-variation between the features. We achieve this by introducing a stochastic latent variable z, with fewer dimensions than x, and conditioning the data-generating process on this. The joint probability distribution of x and z is then

pθ(x, z) = pθ(x | z) pθ(z),   (1)

where pθ(x | z) is the likelihood function and pθ(z) is the prior probability distribution of z.
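As a concrete illustration of equation (1), the following sketch draws a cell by ancestral sampling: first z from its prior, then x from a discrete likelihood conditioned on z. The standard-normal prior, the fixed random weights, and the exponential link are assumptions made purely for illustration; the actual generative networks are described in the following sections.

```python
import numpy as np

rng = np.random.default_rng(1)

latent_dimension, number_of_genes = 10, 100
# Assumed for illustration: the rates of a Poisson likelihood p(x | z) are
# a fixed non-linear function of z, and p(z) is a standard normal.
weights = rng.normal(scale=0.1, size=(latent_dimension, number_of_genes))

z = rng.standard_normal(latent_dimension)   # z ~ p(z)
rates = np.exp(z @ weights)                 # rate parameters of p(x | z)
x = rng.poisson(rates)                      # x ~ p(x | z): a vector of counts
```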
As before, the latent variables, y and z, can be marginalised out to give the marginal likelihood:

pθ(x) = Σ_y ∫ pθ(x, y, z) dz.   (6)

For z to follow a Gaussian-mixture model, the mean and variance should be dependent on y, and y should follow a categorical distribution. Using a Poisson likelihood again, the generative process becomes

pθ(x | z) = f(x; λθ(z)),   (7a)
pθ(z | y) = N(z; µθ(y), σ²θ(y) I),   (7b)
pθ(y) = Cat(y; π).   (7c)

Here, π is a K-dimensional probability vector, where K is the number of components in the Gaussian-mixture model. The component πk of π is the mixing coefficient of the kth Gaussian distribution, quantifying how much this distribution contributes to the overall probability distribution. Without any prior knowledge, the categorical distribution is fixed to be uniform. Fig. 4 shows the generative process for the Gaussian-mixture variational auto-encoder (GMVAE). The reconstruction x̃ is computed for each cluster as before and then averaged using the mixing coefficients: x̃ = E_y[E_{x|z}[x]]. The variance of the reconstruction can again be computed in a similar fashion.

Modelling gene expression count data

Instead of normalising and transforming the gene expression data, the original transcript counts are modelled directly to take into account the total amount of gene expression in each cell, also called the cell depth. To model count data, the likelihood function pθ(x | z) needs to be discrete and have only non-negative support. We consider a number of such distributions in the following and investigate which ones are best in terms of likelihood on held-out data. The Poisson (P) distribution is chosen as the first potential candidate. Gene expression data are also generally over-dispersed [33], so the negative binomial (NB) distribution is also tested.

To properly handle the sparsity of the scRNA-seq data [34], we test two approaches: a zero-inflated distribution and modelling of low counts using a discrete distribution. A zero-inflated distribution adds an additional parameter, which controls the amount of excessive zeros added to an existing probability distribution. For the Poisson distribution, f(x; λ), the zero-inflated version (ZIP) is defined as

f(x; ρ, λ) = ρ + (1 − ρ) f(x; λ)   if x = 0,
f(x; ρ, λ) = (1 − ρ) f(x; λ)       if x > 0,   (8)

where ρ is the probability of excessive zeros. The zero-inflated negative binomial (ZINB) distribution has an analogous expression. Both of these zero-inflated versions have also been examined.

The other method is to model low counts using a discrete distribution and to use a second probability distribution to model the higher counts. Using the Poisson distribution in the latter case leads to the following distribution:

f(x; τ, λ) = τ_x                  if 0 ≤ x < K,
f(x; τ, λ) = τ_K g(x − K; λ)      if x ≥ K,   (9)

where K is the cut-off between low and high counts, and τ is a (K + 1)-dimensional probability vector with components τk quantifying the probability of a count being equal to k for low counts or of being at least K for high counts. We call this distribution a piece-wise categorical Poisson (PCP) distribution, and it has been used to predict demand for large quantities [35].

Since the cell depth varies from cell to cell [34], a variation of the Poisson distribution called the constrained Poisson (CP) distribution is also examined. This distribution has been used to model word counts in documents of varying length [36], which corresponds to gene expression levels in cells. For the constrained Poisson distribution, the rate parameter λn for each count xn of gene n is reparameterised as

λn = D pn,   (10)

where D is the cell depth and pn is the proportional contribution of gene n to the cell depth. This ensures that the cell depth is reconstructed correctly and builds normalisation directly into the model. Since Σn pn = 1, this also couples the counts for different genes in each cell, whereas the other probability distributions model each count individually.

As described earlier, the parameters for these probability distributions are modelled using deep neural networks. Since we do not know the true conditioning structure of the genes, we make the simplifying assumption that they are independent for computational reasons and therefore use feedforward neural networks.
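The three count distributions above can be illustrated directly in code. The following minimal sketch, leaning on SciPy's Poisson distribution, implements the mass functions in equations (8) and (9) and the rate reparameterisation in equation (10); it is ours, not the scVAE implementation, and the softmax over logits in the last function is our assumption of how the proportions are produced (cf. "Experiment setup" below).

```python
import numpy as np
from scipy.stats import poisson

def zero_inflated_poisson_pmf(x, rho, lam):
    """ZIP probability mass, equation (8): extra probability rho at zero."""
    x = np.asarray(x)
    base = poisson.pmf(x, lam)
    return np.where(x == 0, rho + (1.0 - rho) * base, (1.0 - rho) * base)

def piecewise_categorical_poisson_pmf(x, tau, lam):
    """PCP probability mass, equation (9): categorical probabilities below
    the cut-off K, and a Poisson distribution shifted by K above it."""
    x = np.asarray(x)
    K = len(tau) - 1
    return np.where(x < K, tau[np.minimum(x, K)],
                    tau[K] * poisson.pmf(x - K, lam))

def constrained_poisson_rates(logits, cell_depth):
    """CP rates, equation (10): lambda_n = D * p_n, where the proportions
    p_n sum to one, so the expected total count equals the cell depth D."""
    shifted = logits - logits.max()
    proportions = np.exp(shifted) / np.exp(shifted).sum()
    return cell_depth * proportions

counts = np.arange(8)
zip_probabilities = zero_inflated_poisson_pmf(counts, rho=0.3, lam=2.0)
tau = np.array([0.5, 0.2, 0.1, 0.05, 0.15])   # cut-off K = 4, as used below
pcp_probabilities = piecewise_categorical_poisson_pmf(counts, tau, lam=3.0)
cp_rates = constrained_poisson_rates(np.zeros(100), cell_depth=1000.0)
```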
Variational auto-encoders

In order to train and evaluate deep generative models, we need approximate inference. Here we use variational auto-encoders, which amounts to replacing the intractable marginal likelihood with its variational lower bound and estimating the intractable integrals with low-variance Monte Carlo estimates. The lower bound is optimised for different models on the training set and, after training, evaluated on a test set in order to perform model comparison.
Since the likelihood function pθ(x | z) for the standard VAE is modelled using non-linear transformations, the posterior probability distribution pθ(z | x) = pθ(x | z) pθ(z)/pθ(x) becomes intractable. Variational auto-encoders use a variational Bayesian approach where pθ(z | x) is replaced by an approximate probability distribution qφ(z | x), modelled using non-linear transformations parameterised by φ. Thus, the marginal log-likelihood can be written as

log pθ(x) = KL(qφ(z | x) ∥ pθ(z | x)) + L(θ, φ; x).   (11)

Here, the first term is the Kullback–Leibler (KL) divergence between the true and the approximate posterior distributions. The KL divergence is a non-negative measure of the dissimilarity between two probability distributions, and it is zero only for identical probability distributions. However, since the true posterior distribution is intractable, this KL divergence cannot be evaluated, but because it is non-negative, it puts a lower bound on the marginal log-likelihood: log pθ(x) ≥ L(θ, φ; x), which is why L(θ, φ; x) is called the marginal log-likelihood lower bound. It can be decomposed into two terms as well:

L(θ, φ; x) = E_qφ(z|x)[log pθ(x | z)] − KL(qφ(z | x) ∥ pθ(z)),   (12)

where the first term is the expected negative reconstruction error, which measures how well the model can reconstruct x. The second term is the KL divergence between the approximate posterior distribution qφ(z | x) and the prior distribution pθ(z) of z. Since both terms are integrals over z, they are still analytically intractable, so they are evaluated using Monte Carlo sampling instead. Compared to estimating the marginal likelihood directly by, for example, importance sampling, these Monte Carlo estimators have much lower variance because they involve averages over logarithms of distributions and not the distributions themselves. The KL divergence can, however, be computed analytically for two Gaussian distributions. For the standard VAE, the approximate posterior distribution is chosen to be a multivariate Gaussian distribution with a diagonal covariance matrix:

qφ(z | x) = N(z; µφ(x), σ²φ(x) I),   (13)

where µφ(x) and σ²φ(x) are non-linear transformations of x parameterised by φ. Using a single-layer feedforward neural network for each, they are computed as

µφ(x) = u(Wµ^(φ) x + bµ^(φ)),   (14a)
σ²φ(x) = v(Wσ²^(φ) x + bσ²^(φ)).   (14b)

As in equation (4), W^(φ) and b^(φ) are the weights and biases of linear models, while u(·) and v(·) are appropriate non-linear transformations. Equation (13) is called the inference process, and it is illustrated in Fig. 5.

Figure 5: Model graphs of the inference processes of (left) the standard variational auto-encoder and (right) the Gaussian-mixture variational auto-encoder. See Fig. 4 for an explanation of the symbols. In addition, the rhombi symbolise deterministic variables.

Both processes are optimised simultaneously using a stochastic gradient descent algorithm, and a reparameterisation trick for sampling from qφ(z | x) is used that allows back-propagation of the gradients in the optimisation algorithm. The reported marginal log-likelihood lower bound is the mean over all examples.
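To make equations (12)–(14) and the reparameterisation trick concrete, the following sketch computes a single-sample Monte Carlo estimate of the lower bound for one cell, with the Gaussian KL term against the standard-normal prior in closed form. The random linear networks and the Poisson likelihood are stand-ins chosen for illustration, not the trained scVAE networks.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
number_of_genes, latent_dimension = 50, 5

# Stand-ins for trained single-layer inference and generative networks.
W_mu = rng.normal(scale=0.05, size=(number_of_genes, latent_dimension))
W_logvar = rng.normal(scale=0.05, size=(number_of_genes, latent_dimension))
W_dec = rng.normal(scale=0.05, size=(latent_dimension, number_of_genes))

x = rng.poisson(1.0, size=number_of_genes).astype(float)

mu = x @ W_mu                         # mu_phi(x), cf. equation (14a)
log_var = x @ W_logvar                # log sigma^2_phi(x), cf. equation (14b)

# Reparameterisation trick: z is a differentiable function of (mu, sigma),
# so gradients can flow through the sampling step.
epsilon = rng.standard_normal(latent_dimension)
z = mu + np.exp(0.5 * log_var) * epsilon

# One-sample Monte Carlo estimate of the expected log-likelihood, here
# with a Poisson likelihood whose rates are a function of z.
log_likelihood = poisson.logpmf(x, np.exp(z @ W_dec)).sum()

# Analytic KL divergence between N(mu, sigma^2 I) and the prior N(0, I).
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

lower_bound = log_likelihood - kl     # estimate of equation (12)
```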
Because of the additional latent variable, the marginal log-likelihood lower bound for the GMVAE has added complexity:

L(θ, φ; x) = E_qφ(y|x)[E_qφ(z|x,y)[log pθ(x | z)] − KL(qφ(z | x, y) ∥ pθ(z | y))] − KL(qφ(y | x) ∥ pθ(y)).   (15)

The first term is the standard VAE lower bound for each class y averaged over all classes, and the second term is the KL divergence for y. The two approximate posterior distributions in the inference process are

qφ(z | x, y) = N(z; µφ(x, y), σ²φ(x, y) I),   (16a)
qφ(y | x) = Cat(y; πφ(x)).   (16b)

Here, the kth component of πφ(x) is the responsibility of the kth Gaussian distribution for x, quantifying how much this distribution contributes to the overall probability of x. The inference process for the GMVAE is shown in Figure 5.
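For the categorical part of the GMVAE bound, here is a minimal sketch of computing the responsibilities πφ(x) from network logits and the KL divergence against the fixed uniform prior over y, the last term in equation (15); the logits are a synthetic stand-in for a network output.

```python
import numpy as np

rng = np.random.default_rng(6)

number_of_components = 10
# Stand-in for the inference network's unconstrained output for one cell.
logits = rng.normal(size=number_of_components)

# Responsibilities pi_phi(x) via softmax, cf. equation (16b).
responsibilities = np.exp(logits - logits.max())
responsibilities /= responsibilities.sum()

# KL(q_phi(y | x) || p_theta(y)) against the fixed uniform prior over y.
prior = np.full(number_of_components, 1.0 / number_of_components)
kl_y = np.sum(responsibilities * np.log(responsibilities / prior))
```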
Adjusted Rand index

The adjusted Rand index [37], R_adj, is used to measure the similarity between two different clusterings of the same data set. Two identical clusterings have R_adj = 1, while the expected value of the adjusted Rand index is 0 (it is not bounded below).

Because the VAE has no built-in clustering, k-means clustering is used to cluster the latent representations of the cells for this model. For the Gaussian-mixture VAE, each cell sample can be assigned directly to a cluster by picking the mixture component with the highest responsibility. We note that the adjusted Rand index is not optimised during training and is computed only at the end.

Visualising the latent space

To visualise the latent space in which the latent representations reside, the mean values of the approximate posterior distribution qφ(z | x) are plotted instead of samples. This is done to better represent the probability distribution of the latent representation of each cell. For the standard VAE, this is just z̄ = µφ(x), and for the GMVAE, z̄ = E_y[µφ(x, y)]. To visualise higher-dimensional latent spaces, we project to two dimensions using t-distributed stochastic neighbour embedding (t-SNE) [38].
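Both cluster-assignment strategies, the metric, and the embedding can be sketched with scikit-learn (our choice of implementation; the article does not state which implementations were used). The labels and responsibilities below are synthetic stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(7)

latent_means = rng.standard_normal((200, 25))   # means of q_phi(z | x)
cell_types = rng.integers(0, 3, size=200)       # known labels (synthetic)

# Standard VAE: cluster the latent representations with k-means.
kmeans_clusters = KMeans(n_clusters=3, n_init=10).fit_predict(latent_means)

# GMVAE: assign each cell to the component with the highest responsibility.
responsibilities = rng.dirichlet(np.ones(3), size=200)  # stands in for pi_phi(x)
gmvae_clusters = responsibilities.argmax(axis=1)

print(adjusted_rand_score(cell_types, kmeans_clusters))
print(adjusted_rand_score(cell_types, gmvae_clusters))

# Embed the latent means in two dimensions for plotting.
embedding = TSNE(n_components=2, perplexity=30.0).fit_transform(latent_means)
```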
RNA-seq data sets

We model three different RNA-seq data sets, as summarised in Table 1. The first data set is of single-cell gene expression levels of peripheral blood mononuclear cells (PBMC) [23]. It is published by 10x Genomics as ten data sets of different purified cell types [39–48], and we use the filtered gene–cell matrices for each data set. One of these data sets contains two different cell types, but since there is no separation of the two in the data set, we only use the ten cell types assigned by 10x Genomics. We combine these ten data sets into one single data set of purified immune cells (PBMC). Following Sun et al. [5], we also make two smaller subsets of the purified immune cell types: one subset of lymphocytes with three distinct cell types (PBMC-L) and one subset of T cells with three similar cell types (PBMC-T). The specific cell types for each subset are listed in Table 1. This table also shows the high sparsity of the data sets.

The second data set is also a single-cell gene expression data set made publicly available by 10x Genomics [24], and we again use the filtered gene–cell matrix. This data set contains gene expression levels for 1.3 million mouse brain cells (MB-1M), but in contrast to the PBMC data set, the cells were not purified, so this data set does not contain any information about cell types. A smaller subset of 20 000 uniformly randomly selected cells (MB-20k) is also provided by 10x Genomics [24].

The last data set is a traditional RNA-seq data set made publicly available by TCGA [25, 49]. This data set consists of RSEM expected gene expression counts for samples of human cancer cells from 29 tissue sites. Gene IDs are used in this data set, so the available mapping from TCGA [49] is used to map the IDs to gene names. The difference in the number of genes and in sparsity from the original data set is not large, as can be seen from Table 1.

Experiment setup

Each data set is divided into training, validation, and test sets using an 81 %–9 %–10 % split with uniformly random selection. The training sets are used to train the models, the validation sets are used to validate the models during training, and the test sets are used to evaluate the models after training.
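A minimal sketch of such a split, assuming the data set is held as a NumPy array with cells as rows; the function name is ours.

```python
import numpy as np

def split_data_set(counts, seed=42):
    """Split cells uniformly at random into 81 % training, 9 % validation
    and 10 % test sets."""
    rng = np.random.default_rng(seed)
    number_of_cells = counts.shape[0]
    order = rng.permutation(number_of_cells)
    training_end = int(0.81 * number_of_cells)
    validation_end = training_end + int(0.09 * number_of_cells)
    return (counts[order[:training_end]],
            counts[order[training_end:validation_end]],
            counts[order[validation_end:]])

counts = np.random.poisson(1.0, size=(1000, 500))
training_set, validation_set, test_set = split_data_set(counts)
```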
For the deep neural networks, we examine different network architectures to find the optimal one for each data set. We test deep neural networks of both one and two hidden layers with 100, 250, 500, or 1000 units each. We also experiment with latent spaces of 10, 25, 50, and 100 dimensions. A standard VAE with the negative binomial distribution as the likelihood function (a VAE-NB model) is used for these experiments. Using the optimal architecture, we test the probability functions introduced earlier as likelihood functions for the standard VAE.

The hidden units in the deep neural networks use the rectifier function as their non-linear transformation, while the latent units and the output units use a non-linear transformation appropriate for the parameters they model. For real, positive parameters (λ for the Poisson distribution, r for the negative binomial distribution, σ in the Gaussian distribution), we model the natural logarithm of the parameter. For the standard deviation σ in the Gaussian-mixture model, however, we use the softplus function, log(1 + e^x), to constrain the possible covariance matrices to be positive-definite. The units modelling the probability parameters in the negative binomial distribution, p, and in the zero-inflated distributions, πk, use the sigmoid function, while for the categorical distributions in the constrained Poisson distribution and the piece-wise categorical distributions, as well as for the Gaussian-mixture model, the probabilities are given as logits with linear functions, which can be evaluated as probabilities using softmax normalisation. Additionally, for the piece-wise categorical distributions, we choose the cut-off K to be 4, since this strikes a good balance between the number of low and high counts for all examined data sets (see Additional file 1: Figure S11 for an example of this). For the k-means clustering and the GMVAE model, the number of clusters is chosen to be equal to the number of classes, if cell types were provided.

The models are trained using one Monte Carlo sample for each example and the Adam optimisation algorithm [50] with a mini-batch size of 100 and a learning rate of 10^−4. Additionally, we use batch normalisation [51] to improve the convergence speed of the optimisation. We train all models for 500 epochs and use early stopping with the validation marginal likelihood lower bound to select the parameters θ and φ. As a baseline model to test whether the non-linearity in the neural networks contributes to a better performance, we try out models with the same link function, but without the hidden layers in the generative process. This corresponds to factor analysis (FA) with a generalised-linear-model link function (denoted FA models).

Software implementation

The models described in this section have been implemented in Python using Google's machine-learning software library TensorFlow [52] in an open-source and platform-independent software tool called scVAE (single-cell (or sparse count) variational auto-encoders), and the source code is freely available at https://github.com/chgroenbech/scVAE along with a user guide and examples. The results in this article were produced using version 1.0 [53].

The data sets presented in "Results", along with others, are also easily accessed using this tool. The tool also includes extensive support for sparse data sets and makes use of this sparseness to reduce its memory footprint. It can thus be used on the largest data sets currently available, as demonstrated on the MB-1M data set [24].

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

The PBMC [39–48] and MB-1M [24] data that support the findings of this study are available from 10x Genomics, https://support.10xgenomics.com/single-cell-gene-expression/datasets/ (see individual references for direct links to each data set). The TCGA data that support the findings of this study are available from UCSC Xena, https://xenabrowser.net/datapages/?dataset=tcga_gene_expected_count&host=https://toil.xenahubs.net [49].

The scVAE software tool is platform-independent, open source under the Apache 2.0 license (OSI-compliant) and available at https://github.com/chgroenbech/scVAE. The results in this article were produced using version 1.0 (https://doi.org/10.5281/zenodo.1229188) [53].

Competing interests

The authors declare that they have no competing interests.

Funding

OW gratefully acknowledges the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. THP acknowledges support from the Lundbeck Foundation and the Novo Nordisk Foundation.

Author's contributions

CHG, MFV, and OW designed the methods. CHG and MFV implemented them. CHG modelled and analysed the data sets with help from CKS for the TCGA data set. CHG and OW drafted the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

We would like to thank Lars Maaløe for discussions about multimodal VAEs and technical help implementing these. We also want to thank Associate Professor Aki Vehtari for discussions about discrete probability distributions and model comparison.

References

[1] Regev, A., Teichmann, S.A., Lander, E.S., Amit, I., Benoist, C., Birney, E., Bodenmiller, B., Campbell, P., Carninci, P., Clatworthy, M., Clevers, H., Deplancke, B., Dunham, I., Eberwine, J., Eils, R., Enard, W., Farmer, A., Fugger, L., Göttgens, B., Hacohen, N., Haniffa, M., Hemberg, M., Kim, S., Klenerman, P., Kriegstein, A., Lein, E., Linnarsson, S., Lundberg, E., Lundeberg, J., Majumder, P., Marioni, J.C., Merad, M., Mhlanga, M., Nawijn, M., Netea, M., Nolan, G., Pe'er, D., Phillipakis, A., Ponting, C.P., Quake, S., Reik, W., Rozenblatt-Rosen, O., Sanes, J., Satija, R., Schumacher, T.N., Shalek, A., Shapiro, E., Sharma, P., Shin, J.W., Stegle, O., Stratton, M., Stubbington, M.J.T., Theis, F.J., Uhlen, M., van Oudenaarden, A., Wagner, A., Watt, F., Weissman, J., Wold, B., Xavier, R., Yosef, N.: The human cell atlas. eLife 6 (2017). doi:10.7554/elife.27041

[2] Davis, S., contributors: List of software packages for single-cell data analysis, including RNA-seq, ATAC-seq, etc. Accessed 23 April 2018 (2018). doi:10.5281/zenodo.1169178. https://github.com/seandavi/awesome-single-cell
[3] Satija, R., Farrell, J.A., Gennert, D., Schier, A.F., Regev, A.: Spatial reconstruction of single-cell gene expression data. Nature Biotechnology 33(5), 495–502 (2015). doi:10.1038/nbt.3192

[4] duVerle, D.A., Yotsukura, S., Nomura, S., Aburatani, H., Tsuda, K.: cellTree: an R/Bioconductor package to infer the hierarchical structure of cell populations from single-cell RNA-seq data. BMC Bioinformatics 17(1) (2016). doi:10.1186/s12859-016-1175-6

[5] Sun, Z., Wang, T., Deng, K., Wang, X.-F., Lafyatis, R., Ding, Y., Hu, M., Chen, W.: DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data. Bioinformatics 34(1), 139–146 (2017). doi:10.1093/bioinformatics/btx490

[6] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv e-prints (2013). 1312.6114

[7] Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: Xing, E.P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 32, pp. 1278–1286. PMLR, Beijing, China (2014). http://proceedings.mlr.press/v32/rezende14.html

[8] Kingma, D.P., Rezende, D.J., Mohamed, S., Welling, M.: Semi-supervised learning with deep generative models. arXiv e-prints (2014). 1406.5298

[9] Dilokthanakul, N., Mediano, P.A.M., Garnelo, M., Lee, M.C.H., Salimbeni, H., Arulkumaran, K., Shanahan, M.: Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv e-prints (2016). 1611.02648

[10] Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. arXiv e-prints (2016). 1602.02282

[11] Maaløe, L., Fraccaro, M., Winther, O.: Semi-supervised generation with cluster-aware generative models. arXiv e-prints (2017). 1704.00637

[12] Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. arXiv e-prints (2016). 1511.06349

[13] Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. arXiv e-prints (2015). 1508.06576

[14] Roberts, A., Engel, J., Eck, D.: Hierarchical variational autoencoders for music (2017). https://nips2017creativity.github.io/doc/Hierarchical_Variational_Autoencoders_for_Music.pdf

[15] Way, G.P., Greene, C.S.: Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. bioRxiv e-prints (2017). doi:10.1101/174474

[16] Tan, J., Ung, M., Cheng, C., Greene, C.S.: Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders. In: Biocomputing 2015, pp. 132–143. World Scientific, Kohala Coast, Hawaii, USA (2014). doi:10.1142/9789814644730_0014

[17] Gupta, A., Wang, H., Ganapathiraju, M.: Learning structure in gene expression data using deep architectures, with an application to gene clustering. bioRxiv e-prints (2015). doi:10.1101/031906

[18] Tan, J., Hammond, J.H., Hogan, D.A., Greene, C.S.: ADAGE-based integration of publicly available Pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe–host interactions. mSystems 1(1), e00025-15 (2016). doi:10.1128/msystems.00025-15

[19] Chen, L., Cai, C., Chen, V., Lu, X.: Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model. BMC Bioinformatics 17(S1) (2016). doi:10.1186/s12859-015-0852-1

[20] Cui, H., Zhou, C., Dai, X., Liang, Y., Paffenroth, R., Korkin, D.: Boosting gene expression clustering with system-wide biological information: a robust autoencoder approach. bioRxiv e-prints (2017). doi:10.1101/214122

[21] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. arXiv e-prints (2014). 1406.2661

[22] Ghahramani, A., Watt, F.M., Luscombe, N.M.: Generative adversarial networks uncover epidermal regulators and predict single cell perturbations. bioRxiv e-prints (2018). doi:10.1101/262501

[23] Zheng, G.X.Y., Terry, J.M., Belgrader, P., Ryvkin, P., Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D., McDermott, G.P., Zhu, J., Gregory, M.T., Shuga, J., Montesclaros, L., Underwood, J.G., Masquelier, D.A., Nishimura, S.Y., Schnall-Levin, M., Wyatt, P.W., Hindson, C.M., Bharadwaj, R., Wong, A., Ness, K.D., Beppu, L.W., Deeg, H.J., McFarland, C., Loeb, K.R., Valente, W.J., Ericson, N.G., Stevens, E.A., Radich, J.P., Mikkelsen, T.S., Hindson, B.J., Bielas, J.H.: Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, 14049 (2017). doi:10.1038/ncomms14049

[24] 10x Genomics: 1.3 Million Brain Cells from E18 Mice – Single Cell Gene Expression Dataset by Cell Ranger 1.3.0 (2017). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons

[25] Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R.M., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C., Stuart, J.M.: The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics 45, 1113–1120 (2013). doi:10.1038/ng.2764
[26] Lopez, R., Regier, J., Cole, M.B., Jordan, M., Yosef, N.: Bayesian inference for a generative model of transcriptome profiles from single-cell RNA sequencing. bioRxiv e-prints (2018). doi:10.1101/292037

[27] Eraslan, G., Simon, L.M., Mircea, M., Mueller, N.S., Theis, F.J.: Single cell RNA-seq denoising using a deep count autoencoder. bioRxiv e-prints (2018). doi:10.1101/300681

[28] Rasmussen, C.E.: The infinite Gaussian mixture model. In: Solla, S.A., Leen, T.K., Müller, K.-R. (eds.) Advances in Neural Information Processing Systems 12, pp. 554–560. MIT Press, Denver, CO, USA (2000)

[29] Campbell, N.D.F., Kautz, J.: Learning a manifold of fonts. ACM Transactions on Graphics (SIGGRAPH) 33(4) (2014)

[30] Campbell, N.D.F., Kautz, J.: Interactive 2D Font Manifold Demonstration. Accessed 23 April 2018. http://vecg.cs.ucl.ac.uk/Projects/projects_fonts/projects_fonts.html#interactive_demo

[31] Brouwer, T., Lió, P.: Bayesian hybrid matrix factorisation for data integration. In: Singh, A., Zhu, J. (eds.) Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 54, pp. 557–566. PMLR, Fort Lauderdale, FL, USA (2017). http://proceedings.mlr.press/v54/brouwer17a.html

[32] Shu, R.: Gaussian Mixture VAE: Lessons in Variational Inference, Generative Models, and Deep Nets. Accessed 23 April 2018. http://ruishu.io/2016/12/25/gmvae/

[33] Oshlack, A., Robinson, M.D., Young, M.D.: From RNA-seq reads to differential expression results. Genome Biology 11(12), 220 (2010). doi:10.1186/gb-2010-11-12-220

[34] Vallejos, C.A., Risso, D., Scialdone, A., Dudoit, S., Marioni, J.C.: Normalizing single-cell RNA sequencing data: challenges and opportunities. Nature Methods 14(6), 565–571 (2017). doi:10.1038/nmeth.4292

[35] Seeger, M.W., Salinas, D., Flunkert, V.: Bayesian intermittent demand forecasting for large inventories. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 4646–4654. Curran Associates, Inc., Barcelona, Spain (2016)

[36] Salakhutdinov, R., Hinton, G.: Semantic hashing. International Journal of Approximate Reasoning 50(7), 969–978 (2009). doi:10.1016/j.ijar.2008.11.006

[37] Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2(1), 193–218 (1985). doi:10.1007/bf01908075

[38] van der Maaten, L.J.P., Hinton, G.E.: Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)

[39] 10x Genomics: CD14+ Monocytes – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd14_monocytes

[40] 10x Genomics: CD19+ B Cells – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/b_cells

[41] 10x Genomics: CD34+ Cells – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd34

[42] 10x Genomics: CD4+ Helper T Cells – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd4_t_helper

[43] 10x Genomics: CD4+/CD25+ Regulatory T Cells – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/regulatory_t

[44] 10x Genomics: CD4+/CD45RA+/CD25- Naive T Cells – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/naive_t

[45] 10x Genomics: CD4+/CD45RO+ Memory T Cells – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/memory_t

[46] 10x Genomics: CD56+ Natural Killer Cells – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd56_nk

[47] 10x Genomics: CD8+ Cytotoxic T Cells – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cytotoxic_t

[48] 10x Genomics: CD8+/CD45RA+ Naive Cytotoxic T Cells – Single Cell Gene Expression Dataset by Cell Ranger 1.1.0 (2016). https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/naive_cytotoxic
Additional Files
Additional file 1 — Tables S1–S2 with descriptions;
Figures S1–S11 with descriptions