Deep learning: new computational modelling techniques for genomics



Gökcen Eraslan1,2,5, Žiga Avsec3,5, Julien Gagneur3* and Fabian J. Theis1,2,4*
Abstract | As a data-driven science, genomics largely utilizes machine learning to capture
dependencies in data and derive novel biological hypotheses. However, the ability to extract new
insights from the exponentially increasing volume of genomics data requires more expressive
machine learning models. By effectively leveraging large data sets, deep learning has transformed
fields such as computer vision and natural language processing. Now, it is becoming the method
of choice for many genomics modelling tasks, including predicting the impact of genetic variation
on gene regulatory mechanisms such as DNA accessibility and splicing.

1Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany. 2School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany. 3Department of Informatics, Technical University of Munich, Garching, Germany. 4Department of Mathematics, Technical University of Munich, Garching, Germany. 5These authors contributed equally: Gökcen Eraslan, Žiga Avsec.
*e-mail: gagneur@in.tum.de; fabian.theis@helmholtz-muenchen.de
https://doi.org/10.1038/s41576-019-0122-6

Feature
An individual, measurable property or characteristic of a phenomenon being observed.

Handcrafted features
Features derived from raw data (or other features) using manually specified rules. Unlike learned features, they are specified upfront and do not change during model training. For example, the GC content is a handcrafted feature of a DNA sequence.

Genomics, in the broad sense, also referred to as functional genomics, aims to characterize the function of every genomic element of an organism by using genome-scale assays such as genome sequencing, transcriptome profiling and proteomics1. Genomics arose as a data-driven science — it operates by discovering novel properties from explorations of genome-scale data rather than by testing preconceived models and hypotheses2. Applications of genomics include finding associations between genotype and phenotype3, discovering biomarkers for patient stratification4, predicting the function of genes5 and charting biochemically active genomic regions such as transcriptional enhancers6.

Genomics data are too large and too complex to be mined solely by visual investigation of pairwise correlations. Instead, analytical tools are required to support the discovery of unanticipated relationships, to derive novel hypotheses and models and to make predictions. Unlike some algorithms, in which assumptions and domain expertise are hard coded, machine learning algorithms are designed to automatically detect patterns in data7,8. Hence, machine learning algorithms are suited to data-driven sciences and, in particular, to genomics9,10. However, the performance of machine learning algorithms can strongly depend on how the data are represented, that is, on how each variable (also called a feature) is computed. For instance, to classify a tumour as malign or benign from a fluorescent microscopy image, a preprocessing algorithm could detect cells, identify the cell type and generate a list of cell counts for each cell type. A machine learning model would then take these estimated cell counts, which are examples of handcrafted features, as input features to classify the tumour. A central issue is that classification performance depends heavily on the quality and the relevance of these features. For example, relevant visual features such as cell morphology, distances between cells or localization within an organ are not captured in cell counts, and this incomplete representation of the data may reduce classification accuracy. Deep learning, a subdiscipline of machine learning, addresses this issue by embedding the computation of features into the machine learning model itself to yield end-to-end models11. This outcome has been realized through the development of deep neural networks, machine learning models that consist of successive elementary operations, which compute increasingly more complex features by taking the results of preceding operations as input. Deep neural networks are able to improve prediction accuracy by discovering relevant features of high complexity, such as the cell morphology and spatial organization of cells in the above example.

The construction and training of deep neural networks have been enabled by the explosion of data, algorithmic advances and substantial increases in computational capacity, particularly through the use of graphical processing units (GPUs)12. Over the past 7 years, deep neural networks have led to multiple performance breakthroughs in computer vision13–15, speech recognition16 and machine translation17. Seminal studies in 2015 demonstrated the applicability of deep neural networks to DNA sequence data18,19 and, since then, the number of publications describing the application of deep neural networks to genomics has exploded. In parallel, the deep learning community has substantially improved method quality and expanded its repertoire of modelling techniques, some of which are already starting to impact genomics.

Here, we describe deep learning modelling techniques and their existing genomic applications. We start by presenting four major classes of neural networks (fully connected, convolutional, recurrent and graph convolutional) for supervised machine learning and explain how they can be used to abstract patterns common in genomics.


Next, we describe multitask learning and multimodal learning, two modelling techniques suited to integrating multiple data sets and data types. We then discuss transfer learning, a technique that enables rapid development of new models from existing ones, and techniques to interpret deep learning models, which are both crucial for genomics. We finish with a discussion of two unsupervised learning techniques, autoencoders and generative adversarial networks (GANs), which first found application in single-cell genomics. To facilitate the adoption of deep learning by the genomics community, we provide pointers to code that ease rapid prototyping. For further background on deep learning, we refer readers to the deep learning textbook11. As complementary reading, we recommend a hands-on primer20 and several reviews that provide a broader perspective on deep learning, target computational biologists and cover applications of deep learning beyond genomics21–25.

Supervised learning
The goal of supervised learning is to obtain a model that takes features as input and returns a prediction for a so-called target variable. An example of a supervised learning problem is one that predicts whether an intron is spliced out or not (the target) given features on the RNA such as the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint or intron length (Fig. 1). Training a machine learning model refers to learning its parameters, which typically involves minimizing a loss function on training data with the aim of making accurate predictions on unseen data (Box 1).

Complex dependencies can be modelled with deep neural networks. For many supervised learning problems in computational biology, the input data can be represented as a table with multiple columns, or features, each of which contains numerical or categorical data that are potentially useful for making predictions. Some input data are naturally represented as features in a table (such as temperature or time), whereas other input data need to be first transformed (such as DNA sequence into k-mer counts) using a process called feature extraction to fit a tabular representation. For the intron-splicing prediction problem, the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint and the intron length can be preprocessed features collected in a tabular format. Tabular data are standard for a wide range of supervised machine learning models, ranging from simple linear models, such as logistic regression8, to more flexible nonlinear models, such as neural networks and many others26–29. Logistic regression is a binary classifier, that is, a supervised learning model that predicts a binary target variable. Specifically, logistic regression predicts the probability of the positive class by computing a weighted sum of the input features mapped to the [0,1] interval using the sigmoid function, a type of activation function. The parameters of logistic regression, or other linear classifiers that use different activation functions, are the weights in the weighted sum. Linear classifiers fail when the classes, for instance, that of an intron spliced out or not, cannot be well discriminated with a weighted sum of input features (Fig. 1a).

To improve predictive performance, new input features can be manually added by transforming or combining existing features in new ways, for example, by taking powers or pairwise products.

End-to-end models
Machine learning models that embed the entire data-processing pipeline to transform raw input data into predictions without requiring a preprocessing step.

Deep neural networks
A wide class of machine learning models with a design that is loosely based on biological neural networks.

Fully connected
Referring to a layer that performs an affine transformation of a vector followed by application of an activation function to each value.

Convolutional
Referring to a neural network layer that processes data stored in n-dimensional arrays, such as images. The same fully connected layer is applied to multiple local patches of the input array. When applied to DNA sequences, a convolutional layer can be interpreted as a set of position weight matrices scanned across the sequence.

Recurrent
Referring to a neural network layer that processes sequential data. The same neural network is applied at each step of the sequence and updates a memory variable that is provided for the next step.

Graph convolutional
Referring to neural networks that process graph-structured data; they generalize convolution beyond regular structures, such as DNA sequences and images, to graphs with arbitrary structures. The same neural network is applied to each node and edge in the graph.

Autoencoders
Unsupervised neural networks trained to reconstruct the input. One or more bottleneck layers have lower dimensionality than the input, which leads to compression of data and forces the autoencoder to extract useful features and omit unimportant features in the reconstruction.

Generative adversarial networks
(GANs). Unsupervised learning models that aim to generate data points that are indistinguishable from the observed ones.

[Fig. 1 schematic: a | single-layer neural network (logistic regression); b | multilayer neural network. The inputs are two RNA features (intron length and branchpoint distance), which pass through fully connected layers and activations to the output (1 — spliced, 0 — unspliced).]
Fig. 1 | Neural networks with hidden layers used to model nonlinear dependencies. a | Shown is an example of splice site classification based on two RNA features. Depicted is a single-layer neural network with sigmoid activation function, which corresponds to logistic regression. It predicts the probability of the output being class 1 using a weighted sum (also called linear combination) of the input that is mapped to the [0,1] interval with a sigmoid function. In this example, the aim is to discriminate spliced-out from not-spliced-out introns as a function of the length of the intron and of the distance of the branchpoint to the acceptor site. If either the intron length or the branchpoint distance is too short or too long, splicing will not occur. Hence, linear combinations of these two features, as implemented in logistic regression, cannot separate the spliced (blue) from unspliced (orange) data points. b | Neural networks with intermediate layers, also called hidden layers, transform the inputs using intermediate nonlinear transformations into a space where the classes become linearly separable. The depicted layers are said to be fully connected because every neuron receives input from all neurons of the upstream layer. Deep neural networks are neural networks with many hidden layers.
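To make the correspondence between Fig. 1a,b and actual code concrete, here is a minimal, hedged sketch written in the same Keras style as Box 2; the two features and the toy data values are invented for illustration and are not from the original study.

import numpy as np
import keras.layers as kl
from keras.models import Sequential

# Two handcrafted RNA features per intron: intron length and
# branchpoint-to-acceptor distance (toy, standardized values).
x = np.array([[0.2, 0.1], [1.5, 0.9], [-1.0, -0.8], [0.1, 1.7]])
y = np.array([1, 0, 0, 0])  # 1 = spliced, 0 = unspliced

# A single sigmoid unit: a weighted sum of the inputs mapped to [0, 1],
# which is exactly logistic regression (Fig. 1a).
logreg = Sequential([kl.Dense(1, activation='sigmoid', input_shape=(2,))])
logreg.compile(optimizer='adam', loss='binary_crossentropy')

# Adding a hidden layer lets the model learn the nonlinear
# decision boundary illustrated in Fig. 1b.
mlp = Sequential([
    kl.Dense(3, activation='relu', input_shape=(2,)),
    kl.Dense(1, activation='sigmoid')])
mlp.compile(optimizer='adam', loss='binary_crossentropy')

logreg.fit(x, y, epochs=10, verbose=0)
mlp.fit(x, y, epochs=10, verbose=0)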


Box 1 | Training neural networks for supervised learning

Data partitioning and prediction goal
A supervised learning data set consists of input–target pairs split into three distinct sets (see the figure, part a): one for optimizing the parameters of the model (training set), one for evaluating the model performance (validation set) and one for the final assessment of the best developed model (test set). During the model development phase, one only has access to the training and validation set. The goal is to develop a model with the most accurate predictions on the test set. The accuracy of predictions is measured by different evaluation metrics such as the Pearson correlation coefficient or Spearman correlation coefficient for regression, area under the receiver operator curve for balanced binary classification or area under the precision-recall curve for imbalanced binary classification157. We note that the validation set and test set should be carefully chosen to represent truly unseen samples. For DNA-based models, this is typically implemented by leaving out complete chromosomes or all measurements in new cell types rather than randomly sampling the regions from the genome.

Fitting the parameters using the training set
Parameters of the neural network are first randomly initialized and then iteratively refined using a method called stochastic gradient descent or its variations158,159. Specifically, small random subsets, so-called batches, of input–target pairs of the training data set are iteratively used to make small updates on model parameters in trying to minimize the loss function between the predicted values and the observed targets (see the figure, part b). This minimization is performed by using the gradient of the loss function computed using the backpropagation algorithm160,161. There are two main benefits to taking only a small random subset of the training set at each optimization step rather than the full training set. First, the algorithm requires a constant amount of memory regardless of the data set size, which allows models to be trained on data sets much larger than the available memory. Second, the random fluctuations between batches were demonstrated to improve the model performance by regularization162,163. As operations in neural networks including backpropagation involve matrix operations, graphical processing units (GPUs) can massively parallelize those operations and hence speed up model training by up to two orders of magnitude compared with normal central processing units12. In practice, specifying and training neural networks are achieved through the use of deep learning frameworks (Box 2).

Choosing the hyperparameters using the validation set
The training process is monitored by regularly evaluating the loss or the evaluation metric on the validation data set (see the figure, part c). When the metric stops improving or even starts degrading, training is stopped as the model starts to overfit the data. To improve the model performance on the validation data set, the modeller can adjust different hyperparameters, such as the number of layers of the network or batch size, and train a new model. This loop of experimenting with different hyperparameters can be automated using a simple random search164 or other hyperparameter optimization techniques165–168. Finally, after the modeller is satisfied with the performance on the validation set, the generalization performance of the best model or an ensemble of best models is evaluated on a completely separate test set.

[Box 1 figure: a | the data set of inputs and targets is split into training, validation and test sets; b | batches of training data are used to compute the loss between observed and predicted values and to update the model parameters; c | the training and validation losses are monitored as a function of the number of parameter updates.]

We refer interested readers to the deep learning book for more details11.

Target
The desired output used to train a supervised model.

Loss function
A function that is optimized during training to fit machine learning model parameters. In the simplest case, it measures the discrepancy between predictions and observations. In the case of quantitative predictions such as regression, mean-squared error loss is frequently used, and for binary classification, the binary cross-entropy, also called logistic loss, is typically used.

k-mer
Character sequence of a certain length. For instance, a dinucleotide is a k-mer for which k = 2.

Logistic regression
A supervised learning algorithm that predicts the log-odds of a binary output to be of the positive class as a weighted sum of the input features. Transformation of the log-odds with the sigmoid activation function leads to predicted probabilities.

Sigmoid function
A function that maps real numbers to [0,1], defined as 1/(1 + e−x).

Activation function
A function applied to an intermediate value x within a neural network. Activation functions are usually nonlinear yet very simple, such as the rectified-linear unit or the sigmoid function.

Regularization
A strategy to prevent overfitting that is typically achieved by constraining the model parameters during training by modifying the loss function or the parameter optimization procedure. For example, the so-called L2 regularization adds the sum of the squares of the model parameters to the loss function to penalize large model parameters.

Hidden layers
A layer is a list of artificial neurons that collectively represents a function taking as input an array of real numbers and returning an array of real numbers corresponding to neuron activations. Hidden layers are those between the input and output layers.
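As a minimal, hedged illustration of the workflow in Box 1, the Keras snippet below randomly splits a toy data set into training, validation and test sets, fits with stochastic gradient descent and stops once the validation loss no longer improves. All array sizes and the tiny model are placeholders; for DNA-based models, the split should follow whole chromosomes or cell types rather than random sampling, as noted above.

import numpy as np
import keras.layers as kl
from keras.models import Sequential
from keras.callbacks import EarlyStopping

# Toy stand-in for a real data set: 1,000 input-target pairs with 20 features.
x = np.random.randn(1000, 20)
y = np.random.randint(0, 2, size=1000)

# Split indices into training (70%), validation (15%) and test (15%) sets.
idx = np.random.permutation(len(x))
train, val, test = np.split(idx, [700, 850])

model = Sequential([
    kl.Dense(16, activation='relu', input_shape=(20,)),
    kl.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Monitor the validation loss and stop once it stops improving (Box 1, part c).
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(x[train], y[train],
          validation_data=(x[val], y[val]),
          batch_size=64, epochs=100, callbacks=[early_stop])

# The test set is used only once, for the final assessment of the chosen model.
print(model.evaluate(x[test], y[test]))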
Neural networks use hidden layers to learn these nonlinear feature transformations automatically. Each hidden layer can be thought of as multiple linear models with their output transformed by a nonlinear activation function, such as the sigmoid function or the more popular rectified-linear unit (ReLU). Together, these layers compose the input features into relevant complex patterns, which facilitates the task of distinguishing two classes (Fig. 1b). Deep neural networks use many hidden layers, and a layer is said to be fully connected when each neuron receives inputs from all neurons of the preceding layer. Neural networks are typically trained using stochastic gradient descent, an algorithm suited to training models on very large data sets (Box 1). Implementation of neural networks using modern deep learning frameworks enables rapid prototyping with different architectures and data sets (Box 2).

Fully connected neural networks have been used for a number of genomics applications, which include predicting the percentage of exons spliced in for a given sequence from sequence features such as the presence of binding motifs of splice factors or sequence conservation30,31; prioritizing potential disease-causing genetic variants32; and predicting cis-regulatory elements in a given genomic region using features such as chromatin marks, gene expression and evolutionary conservation33,34. Many of these methods report improved predictive performance over methods such as linear regression, decision trees or random forests.


However, it is important to note that, in many problems with tabular data, other methods such as gradient-boosted decision trees often outperform fully connected neural networks, as can be seen from the results of Kaggle machine learning competitions. Nevertheless, fully connected layers constitute an essential building block in the deep learning toolbox and can be effectively combined with other neural network layers, such as convolutional layers.

Box 2 | Example code for training neural networks

Much of the success of deep learning can be attributed to deep learning frameworks such as Keras, TensorFlow102 or PyTorch101. Deep learning frameworks are software libraries that implement the operations required for building and training neural networks, including matrix multiplication, convolution and automatic differentiation. This enables users to specify the model architecture by composing multiple building blocks — layers — without having to manually derive the gradients required during training (Box 1). Below is an example that implements the architectures from Figures 1 and 2 using Keras.

import keras.layers as kl
from keras.models import Sequential

# Fully connected model architecture (Figure 1)
model = Sequential([
    kl.Dense(3, activation='relu', input_shape=(2,)),
    kl.Dense(2, activation='relu'),
    kl.Dense(1, activation='sigmoid')])

# Convolutional neural network architecture (Figure 2)
# Input: a 30 bp one-hot encoded sequence of shape (30, 4);
# the kernel size of 6 corresponds to the 6 bp window used as an
# example in the main text.
model = Sequential([
    kl.Conv1D(2, kernel_size=6, activation='relu', padding='same',
              input_shape=(30, 4)),
    kl.MaxPooling1D(6),
    kl.Conv1D(3, kernel_size=6, activation='relu', padding='same'),
    kl.GlobalMaxPooling1D(),
    kl.Dense(1, activation='sigmoid')])

# Specify optimizer, loss and evaluation metric
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Load the dataset
x, y = load_dataset(...)

# Train the model for 10 epochs
model.fit(x, y, epochs=10)

Thanks to these frameworks, users can focus on designing the model architecture without having to manually derive the optimization procedure, which makes prototyping new model architectures easy and decouples the model choice from the optimization algorithm. Furthermore, the frameworks enable training of models on GPUs without using extra code. Moreover, as the specification of the architecture is standardized, models and model components can be easily exchanged. We refer the reader to DragoNN for end-to-end examples of how to implement, train, evaluate and interpret convolutional neural network models based on DNA sequence using Keras.
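The convolutional model in Box 2 expects one-hot encoded DNA of shape (sequence length, 4) as its input (see Fig. 2a). The helper below is an illustrative sketch of such an encoding; it is not part of Keras or of the original publication.

import numpy as np

BASES = ['A', 'C', 'G', 'T']

def one_hot_encode(seq):
    # Encode a DNA sequence as a (length, 4) array with one channel per base.
    encoding = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for i, base in enumerate(seq):
        if base in BASES:  # unknown bases (for example, N) stay all-zero
            encoding[i, BASES.index(base)] = 1.0
    return encoding

# A batch of 30 bp sequences shaped (n_sequences, 30, 4), matching the
# (sequence length, channels) input expected by a Conv1D layer.
x = np.stack([one_hot_encode('ACGT' * 7 + 'AG'),
              one_hot_encode('TTGATAAGGC' * 3)])
print(x.shape)  # (2, 30, 4)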
(FIg. 2e,f). Finally, the output of the convolutional layers
Convolutions discover local patterns in sequential data. Local dependencies in spatial and longitudinal data must be taken into account for effective predictions. For example, shuffling a DNA sequence or the pixels of an image severely disrupts informative patterns. These local dependencies set spatial or longitudinal data apart from tabular data, for which the ordering of the features is arbitrary. Consider the problem of classifying genomic regions as bound versus unbound by a particular transcription factor, in which bound regions are defined as high-confidence binding events in chromatin immunoprecipitation followed by sequencing (ChIP–seq) data35–39. Transcription factors bind to DNA by recognizing sequence motifs. A fully connected layer based on sequence-derived features, such as the number of k-mer instances or the position weight matrix (PWM) matches in the sequence40,41, can be used for this task. As k-mer or PWM instance frequencies are robust to shifting motifs within the sequence, such models could generalize well to sequences with the same motifs located at different positions. However, they would fail to recognize patterns in which transcription factor binding depends on a combination of multiple motifs with well-defined spacing. Furthermore, the number of possible k-mers increases exponentially with k-mer length, which poses both storage and overfitting challenges.

A convolutional layer is a special form of fully connected layer in which the same fully connected layer is applied locally, for example, in a 6 bp window, to all sequence positions. This approach can also be viewed as scanning the sequence using multiple PWMs42–44, for example, for transcription factors GATA1 and TAL1 (Fig. 2a,b). By using the same model parameters across positions, the total number of parameters is drastically reduced, and the network is able to detect a motif at positions not seen during training. Each convolutional layer scans the sequence with several filters (Fig. 2b) by producing a scalar value at every position, which quantifies the match between the filter and the sequence. As in fully connected neural networks, a nonlinear activation function (typically ReLU) is applied at each layer (Fig. 2c). Next, a pooling operation is applied, which aggregates the activations in contiguous bins across the positional axis, typically taking the maximal or average activation for each channel (Fig. 2d). Pooling reduces the effective sequence length and coarsens the signal. The subsequent convolutional layer composes the output of the previous layer and is able to detect whether a GATA1 motif and TAL1 motif were present at some distance range (Fig. 2e,f). Finally, the output of the convolutional layers can be used as input to a fully connected neural network to perform the final prediction task (Fig. 2g,h). Hence, different types of neural network layers (for example, fully connected and convolutional) can be combined within a single neural network.

Three pivotal methods, DeepBind18, DeepSEA19 and Basset45, were the first convolutional neural networks (CNNs) applied to genomics data. In DeepBind, multiple single-task models (the median number of parameters was 1,586) were trained to predict binarized in vitro and in vivo binding affinities (that is, bound or not bound) of a transcription factor and the in vitro binding affinity of an RNA-binding protein (RBP). The method consistently performed better than existing non-deep learning approaches. The DeepSEA model (52,843,119 parameters) predicted the presence or absence of 919 chromatin features, including transcription factor binding, DNA accessibility and histone modification given a 1,000 bp sequence. Basset (4,135,064 parameters) predicted 164 binarized DNA accessibility features (for example, as accessible or inaccessible) given a 600 bp sequence. Both methods performed substantially better than the k-mer-based approach gkm-SVM41.

Rectified-linear unit
(ReLU). Widely used activation function defined as max(0, x).

Neuron
The elementary unit of a neural network. An artificial neuron aggregates the inputs from other neurons and emits an output called activation. Inputs and activations of artificial neurons are real numbers. The activation of an artificial neuron is computed by applying a nonlinear activation function to a weighted sum of its inputs.


Fig. 2 | Modelling transcription factor binding sites and spacing with convolutional neural networks. The depicted convolutional neural network predicts the binding affinity of the TAL1–GATA1 transcription factor complex. a | One-hot encoded representation of the DNA sequence. b | The first convolutional layer scans the input sequence using filters, which are exemplified by position weight matrices of the GATA1 and TAL1 transcription factors. c | Negative values are truncated to 0 using the rectified-linear unit (ReLU) activation function. d | In the max pooling operation, contiguous bins of the activation map are summarized by taking the maximum value for each channel in each bin. e | The second convolutional layer scans the sequence for pairs of motifs and for instances of individual motifs. f | Similarly to that of the first convolution, ReLU activation function is applied. g | The maximum value across all positions for each channel is selected. h | A fully connected layer is used to make the final prediction.

Since their initial applications, CNNs have been applied to predict various molecular phenotypes on the basis of DNA sequence alone and have become the new state-of-the-art models. Applications include classifying transcription factor binding sites46 and predicting molecular phenotypes such as chromatin features47, DNA contact maps48, DNA methylation49,50, gene expression51, translation efficiency52, RBP binding53–55 and microRNA (miRNA) targets56. In addition to predicting molecular phenotypes from the sequence, CNNs have been successfully applied to more technical tasks traditionally addressed by handcrafted bioinformatics pipelines. For example, they have been utilized to predict the specificity of guide RNA57, denoise ChIP–seq58, enhance Hi-C data resolution59, predict the laboratory of origin from DNA sequences60 and call genetic variants61,62.

Linear regression
A supervised learning algorithm that predicts the output as a weighted sum of the input features.

Decision trees
Supervised learning algorithms in which the prediction is made by making a series of decisions of type 'is feature i larger than x' (internal nodes of the tree) and then predicting a constant value for all points satisfying the same series of decisions (leaf nodes).

Random forests
Supervised learning algorithms that train and average the predictions of many decision trees.

CNNs have also been employed to model long-range dependencies in the genome47. Although interacting regulatory elements may be distantly located on the unfolded linear DNA sequence, these elements are often proximal in the actual 3D chromatin conformation. Hence, modelling molecular phenotypes from the linear DNA sequence, albeit a crude approximation of the chromatin, can be improved by allowing for long-range dependencies and allowing the model to implicitly learn aspects of the 3D organization, such as promoter–enhancer looping. In Basenji47, this is achieved by using dilated convolutions, which enable a receptive field of 32 kb. Dilated convolutions have also allowed splice sites to be predicted from sequence using a receptive field of 10 kb, thereby enabling the integration of genetic sequence across distances as long as typical human introns63.
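The exponentially growing receptive field of stacked dilated convolutions described above can be sketched as follows. The filter numbers, kernel sizes and the 10 kb input length are illustrative placeholders, not the settings used by Basenji or by the splice-site model.

import keras.layers as kl
from keras.models import Sequential

# Each layer doubles the dilation rate, so the receptive field grows
# exponentially with depth while the number of parameters grows only linearly.
model = Sequential()
model.add(kl.Conv1D(32, kernel_size=11, padding='same', activation='relu',
                    input_shape=(10000, 4)))
for dilation in [2, 4, 8, 16, 32, 64]:
    model.add(kl.Conv1D(32, kernel_size=3, padding='same',
                        dilation_rate=dilation, activation='relu'))
model.add(kl.GlobalMaxPooling1D())
model.add(kl.Dense(1, activation='sigmoid'))
model.summary()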


[Fig. 3 schematic: for each network type (fully connected, convolutional, recurrent and graph convolutional) the panel depicts the flow from input through parameters to output, the invariance imposed (none; translation; time; node index permutation) and example inputs (predefined features such as the number of k-mer matches or total conservation; DNA sequence, amino acid sequence or image; DNA sequence, amino acid sequence or time series measurements; protein–protein interaction network, citation network or protein structure).]
Fig. 3 | Neural network layers and their parameter-sharing schemes. Neural network architectures can be categorized into four groups based on their connectivity and parameter-sharing schemes. a | Fully connected layers assume that input features do not have any particular ordering and hence apply different parameters across different input features. b | Convolutional layers assume that local subsets of input features, such as consecutive bases in DNA, can represent patterns. Therefore, the connectivity and parameter-sharing pattern of convolutional layers reflect locality. c | Recurrent layers assume that the input features should be processed sequentially and that the sequence element depends on all the previous sequence elements. At each sequence element, the same operation is applied (blue and orange arrows), and the information from the next input sequence element is incorporated into the memory (orange arrows) and carried over. d | Graph convolutional networks assume that the structure of the input features follows the structure of a known graph. The same set of parameters is used to process all the nodes and thereby imposes invariance to node ordering. By exploiting the properties of the raw data, convolutional neural networks, recurrent neural networks and graph convolutional layers can have drastically reduced numbers of parameters compared with fully connected layers while still being able to represent flexible functions. The same colours indicate shared parameters, and arrows indicate the flow of information. Full lines indicate specific ordering or relationships between features represented as nodes (parts a–d).

Recurrent neural networks model long-range dependencies in sequences. Different types of neural network can be characterized by their parameter-sharing schemes. For example, fully connected layers have no parameter sharing (Fig. 3a), whereas convolutional layers impose translational invariance by applying the same filters at every position of their input (Fig. 3b). Recurrent neural networks (RNNs)64,65 are an alternative to CNNs for processing sequential data, such as DNA sequences or time series; they implement a different parameter-sharing scheme. RNNs apply the same operation to each sequence element (Fig. 3c). The operation takes as input the memory of the previous sequence element and the new input. It updates the memory and optionally emits an output, which is either passed on to subsequent layers or is directly used as model predictions. By applying the same model at each sequence element, RNNs are invariant to the position index in the processed sequence. For example, an RNN could detect an open reading frame in a DNA sequence regardless of the position in the sequence. This task requires the recognition of a certain series of inputs, such as the start codon followed by an in-frame stop codon. The main advantage of RNNs over CNNs is that they are, in theory, able to carry over information through infinitely long sequences via memory. Furthermore, RNNs can naturally process sequences of widely varying length, such as mRNA sequences. However, recent systematic comparisons show that CNNs combined with various tricks (such as dilated convolutions) are able to reach comparable or even better performances than RNNs on sequence-modelling tasks, such as audio synthesis and machine translation66. Moreover, because RNNs apply a sequential operation, they cannot be easily parallelized and are hence much slower to compute than CNNs.

In genomics, RNNs have been used to aggregate the outputs of CNNs for predicting single-cell DNA methylation states50, RBP binding67, transcription factor binding and DNA accessibility68,69. RNNs have also found applications in miRNA biology: deepTarget70 performed better than existing models at predicting miRNA binding targets from mRNA–miRNA sequence pairs, and deepMiRGene71 better predicted the occurrence of precursor miRNAs from the mRNA sequence and its predicted secondary structure than existing methods that use handcrafted features. Base calling from raw DNA-sequencing data is another prediction task for which RNNs have been applied. DeepNano72 accurately predicted base identity from changes in electric current measured by the Oxford Nanopore MinION sequencer73. Despite these numerous applications of RNNs, we note that there is a lack of systematic comparison of recurrent and convolutional architectures for the common sequence-modelling tasks in genomics.

Gradient-boosted decision trees
Supervised learning algorithms that train multiple decision trees in a sequential manner; at each time step, a new decision tree is trained on the residual or pseudo-residual of the previous decision tree.

Position weight matrix
(PWM). A commonly used representation of sequence motifs in biological sequences. It is based on nucleotide frequencies of aligned sequences at each position and can be used for identifying transcription factor binding sites from DNA sequence.

Overfitting
The scenario in which the model fits the training set very well but does not generalize well to unseen data. Very flexible models with many free parameters are prone to overfitting, whereas models with many fewer parameters than the training data do not overfit.

Graph-convolutional neural networks model dependencies in graph-structured data. Graph-structured data, including protein–protein interaction networks and gene regulatory networks, are ubiquitous in genomics74,75. Graph convolutional neural (GCN) networks76–79 (Fig. 3d) use the individual features of nodes in a graph and the node connectivity to solve machine learning tasks on graphs. GCNs sequentially apply multiple graph transformations (layers), whereby each graph transformation aggregates features from the neighbouring nodes or edges in a nonlinear manner and represents nodes or edges with a new set of features. Tasks that GCNs can be trained for include node classification80,81, unsupervised node embedding (which aims to find informative, low-dimensional representation of nodes)80, edge classification and graph classification79.


GCNs have been applied to a number of biological and chemical problems. For instance, one method derived new features of proteins from protein–protein interaction networks in an unsupervised manner, and these features were then used to predict protein function in different tissues82. GCNs have also been used for modelling polypharmacy side effects83. In chemistry, graph convolutions have been successfully used to predict various molecular properties including solubility, drug efficacy and photovoltaic efficiency84,85. Genomic applications of GCNs include predicting binarized gene expression given the expression of other genes86 or classification of cancer subtypes87. GCNs provide promising tools for exploiting structural patterns of graphs for supervised and unsupervised machine learning problems, and we expect to see more genomics applications in the future.
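The graph transformation described above can be sketched as a single simplified graph convolution: each node averages its features with those of its neighbours and the result is passed through a shared, learned linear map and a nonlinearity. The NumPy sketch below is a stripped-down illustration only; practical GCN implementations add normalization schemes, trainable stacking and framework-specific layers.

import numpy as np

def graph_conv(features, adjacency, weights):
    # features:  (n_nodes, n_in) node feature matrix
    # adjacency: (n_nodes, n_nodes) binary adjacency matrix
    # weights:   (n_in, n_out) parameters shared by all nodes
    # Add self-loops so that each node keeps its own features.
    a_hat = adjacency + np.eye(adjacency.shape[0])
    # Row-normalize so that each node averages over its neighbourhood.
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)
    # The same weights are applied to every node: invariance to node ordering.
    return np.maximum(0.0, a_hat @ features @ weights)  # ReLU

# Toy protein-protein interaction graph with 4 proteins and 3 features each.
adjacency = np.array([[0, 1, 1, 0],
                      [1, 0, 0, 0],
                      [1, 0, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
features = np.random.randn(4, 3)
weights = np.random.randn(3, 8)
node_embeddings = graph_conv(features, adjacency, weights)  # shape (4, 8)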
Sharing information across tasks and integrating data modalities. Genomic data often contain correlated measurements of related biological activities. Correlated measurements can occur within a single data type (such as the expression of co-regulated genes) or across different data types (such as ChIP–seq peaks and DNase I hypersensitive sites sequencing (DNase-seq) peaks) and give rise to related prediction tasks.

Consider an example in which we would like to predict transcription factor binding affinity for multiple transcription factors. Instead of building a single-task model for each prediction task (Fig. 4a), a multitask model can jointly predict binding of multiple transcription factors (Fig. 4b). In such models, the majority of layers are shared and branch out to task-specific layers at the end (Fig. 4b). Owing to co-binding and common protein domains of the modelled transcription factors, using the same layers to extract complex sequence features across multiple transcription factors might improve the predictive performance and require less data per transcription factor. Moreover, by sharing the computation between tasks, multitask models can make predictions faster than single-task models can.

In multitask models, the overall loss function is simply the sum of the losses for each task. When losses are very different across tasks, a weighted sum can be used to balance the losses88. Training multitask models can be challenging, as the network needs to simultaneously optimize multiple losses and hence make trade-offs. For example, if class imbalance varies greatly across tasks, the network might successfully learn only the well-balanced classes and completely ignore the difficult imbalanced classes by always predicting the majority class. Various strategies have been proposed to tackle this issue88–91. For example, GradNorm88 adopts task weights during training that ensure the backpropagated gradients corresponding to different tasks will be of equal magnitude. In genomics, multitask models have been successfully used to simultaneously predict multiple molecular phenotypes such as those for transcription factor binding, different histone marks, DNA accessibility and gene expression in different tissues19,45,47,51.
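A minimal Keras sketch of such a multitask architecture, with a shared convolutional trunk branching into one output head per transcription factor and a weighted sum of the per-task losses, could look as follows. The layer sizes, task names and loss weights are arbitrary placeholders.

import keras.layers as kl
from keras.models import Model

# Shared trunk applied to one-hot encoded 101 bp sequences.
inputs = kl.Input(shape=(101, 4))
shared = kl.Conv1D(32, kernel_size=10, activation='relu')(inputs)
shared = kl.GlobalMaxPooling1D()(shared)
shared = kl.Dense(16, activation='relu')(shared)

# Task-specific heads, one per transcription factor.
tf_a = kl.Dense(1, activation='sigmoid', name='tf_a')(shared)
tf_b = kl.Dense(1, activation='sigmoid', name='tf_b')(shared)

model = Model(inputs=inputs, outputs=[tf_a, tf_b])
model.compile(optimizer='adam',
              loss={'tf_a': 'binary_crossentropy', 'tf_b': 'binary_crossentropy'},
              # The overall loss is a weighted sum of the per-task losses.
              loss_weights={'tf_a': 1.0, 'tf_b': 0.5})
# model.fit(x, {'tf_a': y_a, 'tf_b': y_b}, ...)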
Analogously to multitask models, deep neural networks can be easily extended to take multiple data modalities as inputs in order to leverage complementary information between them. A simple way to integrate multiple data modalities is to concatenate features from each data set (often referred to as early integration). Such concatenation might not be possible with raw data when the data modalities are very different (such as a DNA sequence combined with an image or gene expression). Neural networks enable multiple data modalities to be integrated by first processing each data modality using dedicated layers, concatenating the outputs of dedicated layers and then using further layers to integrate the features extracted from each data modality (Fig. 4c). This approach, also known as intermediate integration, enables the most suitable dedicated layers to be used for each data modality and can hence extract more predictive features. Both early integration and intermediate integration approaches (individually or in combination) have been used by different neural network models in genomics. For example, DNA sequence, gene expression and chromatin accessibility have been integrated to predict transcription factor binding across cell types69. In addition, an RNA sequence has been integrated with RNA secondary structure55 or distances to key genomic landmarks such as splice sites54 to predict in vivo affinity of RBPs. Another example is the prediction of the pathogenicity of missense variants by integrating amino acid sequences with multiple conservation scores92. We refer the reader to Zitnik et al.93 for more information on data integration with machine learning models.
Filters
Parameters of a convolutional layer. In the first layer of a sequence-based convolutional network, they can be interpreted as position weight matrices.

Pooling operation
A function that replaces the output at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighbourhood.

Channel
An axis other than one of the positional axes. For images, the channel axis encodes different colours (such as red, green and blue), for one-hot-encoded sequences (A: [1, 0, 0, 0], C: [0, 1, 0, 0] and so on), it denotes the bases (A, C, G and T), and for the output of the convolutions, it corresponds to the outputs of different filters.

Dilated convolutions
Filters that skip some values in the input layers. Typically, each subsequent convolutional layer increases the dilation by a factor of two, thus achieving an exponentially increasing receptive field with each additional layer.

Receptive field
The region of the input that affects the output of a convolutional neuron.

Memory
An array that stores the information of the patterns observed in the sequence elements previously processed by a recurrent neural network.

Training models on small data sets with transfer learning. In the scenario in which data are scarce, training a model from scratch might not be feasible. Instead, the model can be initialized with the majority of parameters from another model trained on a similar task. This approach is called transfer learning94 and can be viewed as incorporating prior knowledge into the model (Fig. 4d). In the simplest case, in which the parameters of the source model are not modified during training, this approach can be seen as building a separate model on top of features extracted by the source model. Transferred models can learn new tasks more rapidly, require less data to train and generalize better to unseen data than models trained from scratch using randomly initialized parameters95. In biological image analysis, pretrained models from the ImageNet competition96 were successfully adopted to classify skin lesions97, perform morphological profiling98 and analyse in situ hybridization images99,100. In genomics, the utility of transfer learning has been demonstrated for sequence-based predictive models of chromatin accessibility45. In this study, researchers trained the multitask Basset model for predicting binary chromatin accessibility profiles of 149 cell types. They then trained single-task models of chromatin accessibility in 15 other cell types using parameters from the multitask model for initialization. The predictive performance was greater for models initialized with transferred parameters than for models initialized with random parameters45. We note that extensive evaluations of how many parameters to share and which models to use for different tasks are still lacking and will require further investigation.
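A hedged sketch of the parameter transfer described above, assuming a previously trained Keras source model is available on disk (the file name is a placeholder): its layers are reused, optionally frozen, and only a new task-specific head is trained from scratch.

import keras.layers as kl
from keras.models import Model, load_model

# Source model trained on a large, related task (file name is hypothetical).
source_model = load_model('multitask_source_model.h5')

# Reuse everything up to the penultimate layer as a feature extractor.
features = source_model.layers[-2].output
new_head = kl.Dense(1, activation='sigmoid', name='target_task')(features)
target_model = Model(inputs=source_model.input, outputs=new_head)

# Simplest form of transfer: keep the transferred parameters fixed and train
# only the new head; leaving them trainable instead corresponds to fine-tuning.
for layer in target_model.layers[:-1]:
    layer.trainable = False

target_model.compile(optimizer='adam', loss='binary_crossentropy')
# target_model.fit(x_small, y_small, ...)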



Fig. 4 | Multitask models, multimodal models and transfer learning. a | Shown is a single-task model predicting the binding of a single transcription factor (green oval). b | A multitask model is shown that simultaneously predicts binding for two transcription factors (green oval and red diamond). There are three submodels depicted: a common submodel and two task-specific submodels. c | A multimodal model is shown that takes as input DNA sequence and chromatin accessibility. Each data modality is first processed using a dedicated submodel, and the outputs are concatenated and processed using the shared submodel. Parameters of all submodels are trained jointly as shown in both parts b and c. d | Transfer learning is presented. Parameters of the original model trained on a large data set (top) are used for initialization for the second model trained on a related task (target task) but with much less data available (bottom). In this example, the first task of the source model is similar to the target task (both are ovals); hence, the transferred submodel may contain features useful for the target task prediction.

To realize the full potential of transfer and multitask learning, trained models must be easily shared. In the fields of computer vision and natural language processing, trained models are shared through repositories called model zoos and are available for popular machine learning frameworks, for example PyTorch model zoos101, Keras model zoos and Tensorflow model zoos102. We and others recently developed Kipoi, a model zoo for genomics103, to address the lack of a platform for exchanging models. Kipoi contains over 2,000 predictive models for genomics and allows the user to access, apply and interpret these predictive models through a unified interface as well as score the effect of single-nucleotide variants for a subset of sequence-based models. As the size and the number of data sets grow and predictive models become more accurate and essential, we expect to see a greater emphasis on model distribution, similar to the improvement in data and software sharing over the past decade.

Explaining predictions
Although deep neural networks are not designed to highlight interpretable relationships in data or to guide the formulation of mechanistic hypotheses, they can nevertheless be interrogated for these purposes a posteriori104. We refer to these interrogations of the models as model interpretation. In simple models such as linear models, the parameters of the model often measure the contribution of an input feature to prediction. Therefore, they can be directly interpreted in cases where the input features are relatively independent. By contrast, the parameters of a deep neural network are difficult to interpret because of their redundancy and nonlinear relationship with the output. For example, although the CNN presented in Fig. 2 may be interpreted as multiple PWMs scanning the sequence, the filters representing the PWM in the first layer typically only represent parts of the motifs. The reason for this phenomenon is that individual filters are never forced to learn complete motifs. Rather, the network as a whole can detect motifs by assembling multiple filters in the downstream layers.

Feature importance scores interrogate input–output relationships. In complex models, it is imperative to inspect parameters indirectly by probing the input–output relationships for each predicted example. Feature importance scores, also called attribution scores, relevance scores or contribution scores, can be used for this purpose. They highlight the parts of a given input that are most influential for the model prediction and thereby help to explain why such a prediction was made (Fig. 5a). In DNA sequence-based models, the importance scores highlight sequence motifs and are hence widely used in genomics18,45,47. They can also be used to probe more complex epistatic interactions105. We refer to feature importance scores as scores generated per example, and they should not be confused with the feature importance for supervised models based on tabular data like those of random forests, which are aggregated across the entire data set.

Feature importance scores can be divided into two main categories on the basis of whether they are computed using input perturbations or using backpropagation. Perturbation-based approaches systematically perturb18,19,45,106 the input features and observe the change in the output (Fig. 5b). For DNA sequence-based models, the induced perturbation can be, for example, a single-nucleotide substitution18,19 or an insertion of a regulatory motif45. The main drawback of perturbation-based importance scores is the high computational cost, which becomes notable when the importance scores for the whole data set need to be computed. For example, a sequence of 1,000 nt requires an additional 3,000 model predictions to assess the effect of every possible single-nucleotide variant. Backpropagation-based approaches107,108, to the contrary, are much more computationally efficient. In these approaches, importance scores for all the input features are computed using a single backpropagation pass through the network (Fig. 5c), and hence they require only twice the amount of computation as a single prediction. The simplest backpropagation-based importance scores are saliency maps107 and input-masked gradients108. As deep learning frameworks support automatic differentiation (Box 2), these scores can be efficiently implemented in a few lines of code.

Feature importance scores
The quantification values of the contributions of features to a current model prediction. The simplest way to obtain this score is to perturb the feature value and measure the change in the model prediction: the larger the change found, the more important the feature is.

Backpropagation
An algorithm for computing gradients of neural networks. Gradients with respect to the loss function are used to update the neural network parameters during training.

Saliency maps
Feature importance scores defined as the gradient absolute values of the model output with respect to the model input.

Input-masked gradients
Feature importance scores defined as the gradient of the model output with respect to the model input multiplied by the input values.

Automatic differentiation
A set of techniques, which consist of a sequence of elementary arithmetic operations, used to automatically differentiate a computer program.
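Both families of importance scores can indeed be written in a few lines. The sketch below computes a saliency map and input-masked gradients with the Keras backend (assuming a TensorFlow 1.x-style backend where K.gradients is available) and, for comparison, a perturbation-based score for a single-nucleotide substitution. Here, model stands for any trained sequence model and x for a one-hot encoded input of shape (1, length, 4); both are assumptions, not defined by the original publication.

import numpy as np
import keras.backend as K

# Backpropagation-based scores: gradient of the output with respect to the input.
grads = K.gradients(model.output[:, 0], model.input)[0]
grad_fn = K.function([model.input], [grads])
saliency = np.abs(grad_fn([x])[0])        # saliency map, shape (1, length, 4)
input_masked = grad_fn([x])[0] * x        # input-masked gradients

# Perturbation-based score: effect of a single-nucleotide substitution.
def substitution_effect(model, x, position, base_index):
    # Change of the predicted score when one base is substituted.
    reference = model.predict(x)[0, 0]
    x_mut = x.copy()
    x_mut[0, position, :] = 0.0
    x_mut[0, position, base_index] = 1.0
    return model.predict(x_mut)[0, 0] - reference

# Scanning every position requires three extra predictions per base
# (one per alternative nucleotide), which is why perturbation-based
# scores are the computationally expensive option.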


Reviews

a Interpretation method Feature importance scores


Input
CGTGAGTTTGCATAACAACATA
TGATCGAGGACGAGCTGCATCG
GTAGCTAGCTAGTCGATGTGCA

b Perturbation-based Importance scores

p(Bound)
Reference C GTGAGTT... 0.9
A
Δ C
Perturbation1 A GTGAGTT... 0.2 G
T
...

perturb – reference
Position
Perturbation impact
Negative Positive

c Backpropagation-based
CGTGAGTT... Predict
0.9
Backpropagate

Fig. 5 | Model interpretation via feature importance scores. a | Feature importance scores highlight parts of the
input most predictive for the output. For DNA sequence-based models, these can be visualized as a sequence logo of
the input sequence, with letter heights proportional to the feature importance score, which may also be negative
(as visualized by letters facing upside down). There are two classes of feature importance scores: perturbation-based
approaches (part b) and backpropagation-based approaches (part c). b | Perturbation-based approaches perturb each
input feature (left) and record the change in model prediction (centre) in the feature importance matrix (right). For DNA
sequences, the perturbations correspond to single base substitutions18. Alternatively , the perturbation matrix can be
visualized as a sequence logo with the letter heights corresponding to the average per-base perturbation impact. c | Backpropagation-
based approaches compute the feature importance scores using gradients107 or augmented gradients such as DeepLIFT108
for the input features with respect to model prediction.

…regulatory DNA sequences are analysed. Although feature importance scores are able to highlight the instances of different motifs18,45,47,110, they have so far been used only to manually inspect individual sequences and not to perform automated motif discovery. Simply averaging the importance scores across multiple examples will not yield the desired results, because the motif is not always located at the same position in the input sequence. Owing to this issue, many studies45,49,50,54 have derived motifs by aggregating the sequences in the training set that strongly activated filters of the first convolutional layer, or have interpreted filters directly as motifs52. Recently, a promising approach to aggregate the importance scores, called TF-MoDISco, was proposed111. TF-MoDISco extracts, aligns and clusters the regions of high importance into sequence motifs. Unlike classical motif discovery, which relies only on plain sequences, TF-MoDISco relies on the predictive model to highlight the important regions within the sequence via feature importance scores, which guides motif discovery.

Model architecture
The structure of a neural network independent of its parameter values. Important aspects of model architecture are the types of layers, their dimensions and how they are connected to each other.

k-means
An unsupervised method for partitioning the observations into clusters by alternating between refining cluster centroids and updating cluster assignments of observations.

Neural networks with interpretable parameters and activations. An approach termed ‘visible neural networks’ has recently been proposed with the DCell model112 to improve the interpretability of internal neural network activations. The model architecture of DCell corresponds to the hierarchical organization of known molecular subsystems within the cell. Nodes in the neural network correspond to molecular subsystems, such as signalling pathways or large protein complexes, and connections between two nodes (systems) are only permitted if the upstream system (for example, a small protein complex) is part of the downstream system (such as a large protein complex). The neurons in the neural network correspond to known concepts; hence, their activations and parameters can be interpreted. We note that this approach is only feasible for tasks in which the underlying entities and their hierarchical structure are sufficiently well known; it may not be directly applicable to tasks for which the entities or their hierarchical structure are generally unknown, as in the case of transcription factor binding. It will be interesting to see to what extent this approach can be applied in the future to other models and also how it can be combined with modular modelling approaches (such as ExPecto51) to tackle predicting and understanding more complex phenotypes such as disease.
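As a toy illustration of the visible neural network idea described above, the sketch below constrains a layer's connectivity with a fixed binary mask derived from an invented subsystem hierarchy, so that each neuron can be read as the state of a named molecular system. It is a minimal sketch of the general principle only, not the DCell implementation.

```python
# A toy sketch of the 'visible neural network' principle: connections between
# two layers are allowed only where a known hierarchy states that a small
# subsystem is part of a larger one. The hierarchy (and hence the mask) below
# is invented for illustration; this is not the DCell implementation.
import numpy as np
import tensorflow as tf

n_small, n_large = 6, 2                              # e.g. small complexes -> larger systems
membership = np.zeros((n_small, n_large), dtype="float32")
membership[:3, 0] = 1.0                              # subsystems 0-2 are parts of system 0
membership[3:, 1] = 1.0                              # subsystems 3-5 are parts of system 1

class MaskedDense(tf.keras.layers.Layer):
    """A dense layer whose weights are element-wise multiplied by a fixed binary
    mask, so that only biologically annotated connections can be learned."""
    def __init__(self, mask):
        super().__init__()
        self.mask = tf.constant(mask)
        self.w = self.add_weight(shape=mask.shape, initializer="glorot_uniform")
        self.b = self.add_weight(shape=(mask.shape[1],), initializer="zeros")

    def call(self, inputs):
        return tf.nn.tanh(tf.matmul(inputs, self.w * self.mask) + self.b)

layer = MaskedDense(membership)
# states = layer(small_subsystem_activations)        # each output neuron can be read as
                                                     # the state of one named molecular system
```

Because only the masked connections carry trainable weights, the activation of each output neuron depends solely on the subsystems annotated as its parts.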


[Fig. 6 schematic omitted: part a shows an encoder compressing a cells × genes expression matrix (progenitor and differentiated cells) into a bottleneck layer and a decoder reconstructing it, with a reconstruction loss between input and output; part b shows the two-dimensional bottleneck representation (Bottleneck 1 versus Bottleneck 2) separating progenitors from differentiated cells; part c shows a generator producing synthetic data from noise and a discriminator with a decision boundary classifying points as real or synthetic, driving parameter updates.]

Fig. 6 | Unsupervised learning. a | An autoencoder consists of two parts: an encoder and a decoder. The encoder compresses the input data (depicted as gene expression of differentiating single cells) into fewer dimensions (two shown here) in the so-called bottleneck layer. The decoder tries to reconstruct the original input from the compressed data in the bottleneck layer. Reconstruction accuracy is quantified by the loss function between the original data and the reconstructed data. Although pseudotime estimation is not a property of autoencoders, the denoising effect of reconstruction can make the underlying structure of the data (for example, the cellular differentiation process) clearer130. b | The bottleneck layer is a low-dimensional representation of the original input, revealing the cell differentiation process. c | Generative adversarial networks consist of generator and discriminator neural networks that are trained jointly. The discriminator classifies whether a given data point was drawn from the real data (circles) or was synthetically generated (triangles). The generator aims to generate realistic samples and thereby tries to deceive the discriminator into mistakenly classifying synthetic samples as real.
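The autoencoder of Fig. 6a,b can be written in a few lines with a framework such as tf.keras. The sketch below is a minimal example assuming a cells × genes expression matrix `X`; the layer sizes, the two-dimensional bottleneck and the mean-squared-error reconstruction loss are illustrative choices rather than values from any published model.

```python
# A minimal autoencoder matching Fig. 6a, written with tf.keras. `X` is assumed
# to be a cells x genes expression matrix (for example, log-transformed counts).
import tensorflow as tf

n_genes = 2000
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(n_genes,)),
    tf.keras.layers.Dense(2, name="bottleneck"),      # 2D bottleneck layer (Fig. 6b)
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(n_genes),                   # reconstruction of the input
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")     # loss between input and reconstruction

# autoencoder.fit(X, X, epochs=30, batch_size=128)    # learn to reconstruct the input
# embedding = encoder.predict(X)                      # low-dimensional representation (Fig. 6b)
```

In practice, the mean-squared error would be replaced by a loss function tailored to the count noise of scRNA-seq data, as discussed in the text below.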

Unsupervised learning
The goal of unsupervised learning is to characterize unlabelled data by learning the useful properties of the data set. Classic unsupervised machine learning methods include clustering algorithms such as k-means and dimensionality reduction methods such as principal component analysis, t-distributed stochastic neighbour embedding (t-SNE) or latent variable models. Neural networks are able to generalize some of these approaches. For example, autoencoders113–116 embed the data into a low-dimensional space with a hidden layer, called the bottleneck layer, and reconstruct the original input data (Fig. 6a). This approach forces the network to extract useful features of the data, as the bottleneck layer makes it infeasible to learn the perfect reconstruction. Reconstructing the data is often interpreted as denoising because the unimportant variations are automatically left out (Fig. 6b). Principal component analysis is equivalent to a linear autoencoder117–119, in which the principal components correspond to the representations in the bottleneck layer. Multiple nonlinear layers generalize linear autoencoders to a nonlinear dimensionality reduction method.

Principal component analysis
An unsupervised learning algorithm that linearly projects data from a high-dimensional space to a lower-dimensional space while retaining as much variance as possible.

t-Distributed stochastic neighbour embedding (t-SNE)
An unsupervised learning algorithm that projects data from a high-dimensional space to a lower-dimensional space (typically 2D or 3D) in a nonlinear fashion while trying to preserve the distances between points.

Autoencoders have been used to impute missing data120, extract gene expression signatures121–123 and detect expression outliers124 in microarray data and bulk RNA sequencing gene expression data. In the field of single-cell genomics, autoencoders have been used for imputation, dimensionality reduction and representation learning125–130. Furthermore, prior biological knowledge has been incorporated into the autoencoder architecture in order to infer a new representation that improves clustering and visualization of cells from single-cell RNA sequencing (scRNA-seq) data131. Specific noise characteristics of scRNA-seq data, such as sparse count data, are also addressed with tailored loss functions within the autoencoder framework130.

Neural networks have also greatly contributed to the toolbox of generative models. Unlike the approaches described earlier, generative models aim to learn the data-generating process. Variational autoencoders132 (VAEs) and GANs133 are two powerful generative approaches that have emerged in the deep learning field. VAEs are autoencoders with additional distribution assumptions that enable them to generate new random samples11, and they have been applied to single-cell and bulk RNA sequencing data to find meaningful probabilistic latent representations134–137. These methods demonstrate that either denoised reconstruction or low-dimensional representation of the single-cell data improves commonly performed unsupervised learning tasks such as visualization and clustering.
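A compact VAE sketch in the same setting is given below: the encoder outputs the mean and log-variance of a Gaussian over the latent space, a latent sample is drawn with the reparameterization trick, and the Kullback–Leibler divergence to the prior is added to the reconstruction loss. The layer sizes and the plain Gaussian reconstruction term are again illustrative assumptions, not the architectures of the published single-cell VAEs cited above.

```python
# A compact variational autoencoder sketch for a cells x genes matrix `X`.
import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    """Draw z ~ N(mean, exp(log_var)) via the reparameterization trick and
    register the KL divergence to the standard normal prior as a model loss."""
    def call(self, inputs):
        mean, log_var = inputs
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1.0 + log_var - tf.square(mean) - tf.exp(log_var), axis=1))
        self.add_loss(kl)
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * log_var) * eps

n_genes, latent_dim = 2000, 10
inputs = tf.keras.Input(shape=(n_genes,))
h = tf.keras.layers.Dense(128, activation="relu")(inputs)
z_mean = tf.keras.layers.Dense(latent_dim)(h)
z_log_var = tf.keras.layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])
h_dec = tf.keras.layers.Dense(128, activation="relu")(z)
outputs = tf.keras.layers.Dense(n_genes)(h_dec)

vae = tf.keras.Model(inputs, outputs)
vae.compile(optimizer="adam", loss="mse")          # reconstruction error; KL added by Sampling
# vae.fit(X, X, epochs=30, batch_size=128)         # decoding samples from the prior then
                                                   # generates new synthetic profiles
```

The relative weighting of the reconstruction and KL terms is one of the hyperparameters on which performance strongly depends, as noted below.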

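The adversarial training scheme of Fig. 6c, discussed in more detail below, can be sketched as an alternating update of two networks. The example assumes a cells × genes matrix `X_real` of expression profiles scaled to [0, 1]; the network sizes, noise dimension and vanilla binary cross-entropy losses are illustrative choices.

```python
# A minimal GAN training loop corresponding to Fig. 6c: the discriminator learns
# to separate real from synthetic expression profiles, while the generator
# learns to fool it.
import tensorflow as tf

n_genes, noise_dim = 2000, 64
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(noise_dim,)),
    tf.keras.layers.Dense(n_genes, activation="sigmoid"),  # synthetic expression profile
])
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(n_genes,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),        # p(real)
])
bce = tf.keras.losses.BinaryCrossentropy()
g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)

def train_step(real_batch):
    noise = tf.random.normal((tf.shape(real_batch)[0], noise_dim))
    with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
        fake_batch = generator(noise, training=True)
        real_pred = discriminator(real_batch, training=True)
        fake_pred = discriminator(fake_batch, training=True)
        # discriminator: call real samples 1 and synthetic samples 0
        d_loss = bce(tf.ones_like(real_pred), real_pred) + bce(tf.zeros_like(fake_pred), fake_pred)
        # generator: make the discriminator mistake synthetic samples for real ones
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))

# for real_batch in tf.data.Dataset.from_tensor_slices(X_real).batch(128):
#     train_step(real_batch)                       # alternate the two updates each batch
```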

Another approach, which uses vector arithmetic with VAE latent representations, was reported to predict cell type-specific and species-specific perturbation responses of single cells138. We note that the performances of VAEs and other models based on neural networks strongly depend on the choice of hyperparameters139.

GANs were proposed as a radically different approach to generative modelling that involves two neural networks, a discriminator and a generator network. They are trained jointly, whereby the generator aims to generate realistic data points, and the discriminator classifies whether a given sample is real or generated by the generator (Fig. 6c). As a relatively new method, application of GANs is currently rather limited in genomics. They have been used to generate protein-coding DNA sequences140 and to design DNA probes for protein binding microarrays. It has been reported that GANs are capable of generating sequences that are superior to those in the training data set, as measured by higher protein binding affinity141. In the field of single-cell genomics, GANs have been used to simulate scRNA-seq data and to perform dimensionality reduction142, and the authors of that study interpreted the internal representation of the GAN through perturbations. In MAGAN143, the authors addressed the challenging problem of aligning data sets from different domains, that is, CyTOF data and scRNA-seq data, using an architecture consisting of two GANs.

Latent variable models
Unsupervised models describing the observed distribution by imposing latent (unobserved) variables for each data point. The simplest example is a mixture of Gaussians.

Bottleneck layer
A neural network layer that contains fewer neurons than the previous and subsequent layers.

Generative models
Models able to generate points from the desired distribution. Deep generative models are often implemented by a neural network that transforms samples from a standard distribution (normal or uniform) into samples from a complex distribution (gene expression levels or sequences that encode a splice site).

Hyperparameters
Parameters specifying the model or the training procedure that are not optimized by the learning algorithm (for example, by the stochastic gradient descent algorithm). Examples of hyperparameters are the number of layers, regularization strength, batch size and the optimization step size.

Impact in genomics
Deep learning methods, both supervised and unsupervised, have found various applications in genomics. Here, we highlight three key areas in which we expect them to have the largest impact now and in the near future.

Predicting the effect of non-coding variants. Models that can predict molecular phenotypes directly from biological sequences can be used as in silico perturbation tools to probe the associations between genetic variation and phenotypic variation and have emerged as new methods for quantitative trait loci identification and variant prioritization. These approaches are of major importance given that the majority of variants identified by genome-wide association studies of complex phenotypes are non-coding144, which makes it challenging to estimate their effects and contribution to phenotypes. Moreover, linkage disequilibrium results in blocks of variants being co-inherited, which creates difficulties in pinpointing individual causal variants. Thus, sequence-based deep learning models that can be used as interrogation tools for assessing the impact of such variants offer a promising approach to find potential drivers of complex phenotypes. Examples include DeepSEA19, Basenji47 and ExPecto51, which predict the effect of non-coding single-nucleotide variants and short insertions or deletions (indels) indirectly from the difference between two variants in terms of transcription factor binding, chromatin accessibility or gene expression predictions. Furthermore, state-of-the-art models for predicting novel splice site creation63 from sequence or quantitative effects of genetic variants on splicing145 are deep learning models. Additionally, end-to-end approaches for variant effect prediction are beginning to appear and have been successfully applied to predict the pathogenicity of missense variants92 from protein sequence and sequence conservation data.

Deep learning as a fully data-driven refinement of bioinformatics tools. Thanks to their flexibility, deep neural networks can be trained to carry out tasks that have traditionally been addressed by specific bioinformatics algorithms. Training computer programs instead of manually programming them has been shown to yield a significant increase in accuracy in tasks including variant calling61,62, base calling for novel sequencing technologies72, denoising ChIP–seq data58 and enhancing Hi-C data resolution59. An additional advantage is that such programs are able to leverage GPUs without the need to write additional code.

Richer representations to reveal the structure of high-dimensional data. In addition to using deep learning as a powerful tool to make accurate predictions, its use in unsupervised settings has given rise to some important applications. Unlike other nonlinear dimensionality reduction methods, such as t-SNE, autoencoders are parametric and can therefore easily be applied to unseen data with similar distributions to the training set146. They are highly scalable because the training procedure only requires a small subset of the data at every training step (Box 1), which is particularly important for fields such as single-cell genomics, in which the number of training examples can now surpass hundreds of thousands147. In addition, unsupervised deep learning techniques can help to characterize data sets for which it is not trivial to obtain labels, for example, to enable a data-driven definition of cell identities and states from scRNA-seq data. Finally, unsupervised methods can also be used to integrate scRNA-seq data from different sources129,134,138,143,148, which is increasingly important not only because of growing data sets for similar tissues but also because of the generation of the first organ atlases, such as the Human Cell Atlas project149.

Conclusions and future perspectives
The uptake of deep learning in genomics has resulted in early applications with both scientific and economic relevance. Multiple companies and industry research groups are being founded, often under the broader label of artificial intelligence, based on the anticipated economic impact of genomic deep learning on diagnostics and drug development and on its easy integration with imaging data150. In particular, pharmacogenomics may profit from more efficient and automated identification of novel regulatory variants in the genome and from more accurate predictions of drug responses and targets using epigenomics data151.

Regardless of their quantitative advantages (or disadvantages) over alternative methods, some of the qualitative aspects of deep learning will remain relevant for genomics.


One of these qualitative advantages is end-to-end learning. Data preprocessing steps can be time-consuming and error prone, especially in the genomics field because of the variety of experimental data sources. Being able to integrate multiple preprocessing steps into a single model and to ‘let the data speak’ when defining features are important advantages that often increase predictive power. We expect end-to-end learning approaches to become more widely used across genomics, including protein structure prediction152. Another qualitative advantage that is particularly important for genomics is the ability of deep learning to deal with multimodal data effectively. Genomics offers extremely heterogeneous data including sequence, counts, mass spectrometry intensity and images. An important application of multimodal modelling will be the development of machine learning models for spatial transcriptomics153 that integrate scRNA-seq and imaging data, allowing gene expression to be jointly analysed with the morphology of the cell and its position in the tissue. Deep learning is the ideal approach for incorporating spatial patterns into analyses, as has been shown extensively for microscopy data21,154. Last but not least, an important advantage is the abstraction of the mathematical details that is offered by deep learning frameworks. Researchers in genomics often do not have the theoretical knowledge, nor do they have the time, to formulate statistical models and devise appropriate parameter fitting algorithms. Deep learning frameworks abstract much of the mathematical and technical details, such as the need for manually deriving gradients and optimization procedures, which lowers the entry barrier to the development of new models.

In the future, we expect deep learning to find new applications across multiple omics data types. We also expect to see an increasing uptake of new techniques from the deep learning research community. A particular challenge in human genomics is data privacy. One appealing direction is the development of federated learning, whereby machine learning model instances are deployed on distinct sites and trained on local data while sharing common parameters155. By avoiding data transfer, federated learning can reduce total training time and can facilitate the respect of genetic and medical data privacy. Another relevant technique for data privacy is generative models, which could be used to simulate human genomics data that can be analysed by others without privacy violation156. Another important area is the prediction of causal effects, which is highly relevant to medical and therapeutic applications. Substantial progress may occur on this front as, on the one hand, the field of machine learning is becoming increasingly interested in causal modelling and, on the other hand, the field of genomics is increasingly generating perturbation data using massively parallel reporter assays or systematic CRISPR screens at the bulk and single-cell level. Although the impact of these novel developments remains to be seen, the magnitude and complexity of genomic data will ensure that deep learning will become an everyday tool for its analysis.

Published online 10 April 2019

1. Hieter, P. & Boguski, M. Functional genomics: 15. Long, J., Shelhamer, E. & Darrell, T. in 2015 IEEE 27. Boser, B. E., Guyon, I. M. & Vapnik, V. N. A. in
it’s all how you read it. Science 278, 601–602 (1997). Conference on Computer Vision and Pattern Recognition Proceedings of the Fifth Annual Workshop on
2. Brown, P. O. & Botstein, D. Exploring the new world (CVPR) 3431–3440 (IEEE, 2015). Computational Learning Theory 144–152 (ACM,
of the genome with DNA microarrays. Nat. Genet. 21, 16. Hannun, A. et al. Deep speech: scaling up end-to-end 1992).
33–37 (1999). speech recognition. Preprint at arXiv https://arxiv.org/ 28. Breiman, L. Random forests. Mach. Learn. 45, 5–32
3. Ozaki, K. et al. Functional SNPs in the lymphotoxin-α abs/1412.5567 (2014). (2001).
gene that are associated with susceptibility to 17. Wu, Y. et al. Google’s neural machine translation 29. Friedman, J. H. Greedy function approximation: a
myocardial infarction. Nat. Genet. 32, 650–654 system: bridging the gap between human and machine gradient boosting machine. Ann. Stat. 29, 1189–1232
(2002). translation. Preprint at arXiv https://arxiv.org/abs/ (2001).
4. Golub, T. R. et al. Molecular classification of cancer: 1609.08144 (2016). 30. Xiong, H. Y. et al. RNA splicing. The human splicing
class discovery and class prediction by gene expression 18. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. code reveals new insights into the genetic determinants
monitoring. Science 286, 531–537 (1999). Predicting the sequence specificities of DNA- and of disease. Science 347, 1254806 (2015).
5. Oliver, S. Guilt-by-association goes global. Nature RNA-binding proteins by deep learning. Nat. 31. Jha, A., Gazzara, M. R. & Barash, Y. Integrative deep
403, 601–603 (2000). Biotechnol. 33, 831–838 (2015). models for alternative splicing. Bioinformatics 33,
6. The ENCODE Project Consortium. An integrated This paper describes a pioneering convolutional i274–i282 (2017).
encyclopedia of DNA elements in the human genome. neural network application in genomics. 32. Quang, D., Chen, Y. & Xie, X. DANN: a deep learning
Nature 489, 57–74 (2012). 19. Zhou, J. & Troyanskaya, O. G. Predicting effects approach for annotating the pathogenicity of genetic
7. Murphy, K. P. Machine Learning: A Probabilistic of noncoding variants with deep learning-based variants. Bioinformatics 31, 761–763 (2015).
Perspective (MIT Press, 2012). sequence model. Nat. Methods 12, 931–934 33. Liu, F., Li, H., Ren, C., Bo, X. & Shu, W. PEDLA:
8. Bishop, C. M. Pattern Recognition and Machine (2015). predicting enhancers with a deep learning-based
Learning (Springer, New York, 2016). This paper applies deep CNNs to predict chromatin algorithmic framework. Sci. Rep. 6, 28517 (2016).
9. Libbrecht, M. W. & Noble, W. S. Machine learning features and transcription factor binding from 34. Li, Y., Shi, W. & Wasserman, W. W. Genome-wide
applications in genetics and genomics. Nat. Rev. DNA sequence and demonstrates its utility in prediction of cis-regulatory regions using supervised
Genet. 16, 321–332 (2015). non-coding variant effect prediction. deep learning methods. BMC Bioinformatics 19, 202
10. Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. 20. Zou, J. et al. A primer on deep learning in genomics. (2018).
Biological Sequence Analysis: Probabilistic Models Nat. Genet. 51, 12–18 (2019). 35. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B.
of Proteins and Nucleic Acids (Cambridge Univ. Press, 21. Angermueller, C., Pärnamaa, T., Parts, L. & Stegle, O. Genome-wide mapping of in vivo protein-DNA
1998). Deep learning for computational biology. Mol. Syst. interactions. Science 316, 1497–1502 (2007).
11. Goodfellow, I., Bengio, Y. & Courville, A. Deep Biol. 12, 878 (2016). 36. Barski, A. et al. High-resolution profiling of histone
Learning (MIT Press, 2016). 22. Min, S., Lee, B. & Yoon, S. Deep learning in methylations in the human genome. Cell 129,
This textbook covers theoretical and practical bioinformatics. Brief. Bioinform. 18, 851–869 823–837 (2007).
aspects of deep learning with introductory sections (2017). 37. Robertson, G. et al. Genome-wide profiles of STAT1
on linear algebra and machine learning. 23. Jones, W., Alasoo, K., Fishman, D. & Parts, L. DNA association using chromatin immunoprecipitation
12. Shi, S., Wang, Q., Xu, P. & Chu, X. in 2016 7th Computational biology: deep learning. Emerg. Top. and massively parallel sequencing. Nat. Methods 4,
International Conference on Cloud Computing and Life Sci. 1, 257–274 (2017). 651–657 (2007).
Big Data (CCBD) 99–104 (IEEE, 2016). 24. Wainberg, M., Merico, D., Delong, A. & Frey, B. J. 38. Park, P. J. ChIP-seq: advantages and challenges of a
13. Krizhevsky, A., Sutskever, I. & Hinton, G. E. in Deep learning in biomedicine. Nat. Biotechnol. 36, maturing technology. Nat. Rev. Genet. 10, 669–680
Advances in Neural Information Processing Systems 829–838 (2018). (2009).
25 (NIPS 2012) (eds Pereira, F., Burges, C. J. C., 25. Ching, T. et al. Opportunities and obstacles for deep 39. Weirauch, M. T. et al. Evaluation of methods for
Bottou, L. & Weinberger, K. Q.) 1097–1105 learning in biology and medicine. J. R. Soc. Interface modeling transcription factor sequence specificity.
(Curran Associates, Inc., 2012). 15, 20170387 (2018). Nat. Biotechnol. 31, 126 (2013).
14. Girshick, R., Donahue, J., Darrell, T. & Malik, J. in 26. Morgan, J. N. & Sonquist, J. A. Problems in the 40. Lee, D., Karchin, R. & Beer, M. A. Discriminative
2014 IEEE Conference on Computer Vision and analysis of survey data, and a proposal. J. Am. Stat. prediction of mammalian enhancers from DNA
Pattern Recognition 580–587 (IEEE, 2014). Assoc. 58, 415–434 (1963). sequence. Genome Res. 21, 2167–2180 (2011).


41. Ghandi, M., Lee, D., Mohammad-Noori, M. & 62. Poplin, R. et al. A universal SNP and small-indel 87. Rhee, S., Seo, S. & Kim, S. in Proceedings of the
Beer, M. A. Enhanced regulatory sequence prediction variant caller using deep neural networks. Nat. Twenty-Seventh International Joint Conference on
using gapped k-mer features. PLOS Comput. Biol. 10, Biotechnol. 36, 983–987 (2018). Artificial Intelligence 3527–3534 (IJCAI, 2018).
e1003711 (2014). In this paper, a deep CNN is trained to call 88. Chen, Z., Badrinarayanan, V., Lee, C.-Y. & Rabinovich, A.
42. Stormo, G. D., Schneider, T. D., Gold, L. & genetic variants from different DNA-sequencing GradNorm: gradient normalization for adaptive loss
Ehrenfeucht, A. Use of the ‘Perceptron’ algorithm technologies. balancing in deep multitask networks. Preprint at arXiv
to distinguish translational initiation sites in E. coli. 63. Jaganathan, K. et al. Predicting splicing from primary https://arxiv.org/abs/1711.02257 (2017).
Nucleic Acids Res. 10, 2997–3011 (1982). sequence with deep learning. Cell 176, 535–548 89. Sung, K. & Poggio, T. Example-based learning for
43. Stormo, G. D. DNA binding sites: representation (2019). view-based human face detection. IEEE Trans. Pattern
and discovery. Bioinformatics 16, 16–23 (2000). 64. Elman, J. L. Finding structure in time. Cogn. Sci. 14, Anal. Mach. Intell. 20, 39–51 (1998).
44. D’haeseleer, P. What are DNA sequence motifs? 179–211 (1990). 90. Felzenszwalb, P. F., Girshick, R. B., McAllester, D. &
Nat. Biotechnol. 24, 423–425 (2006). 65. Hochreiter, S. & Schmidhuber, J. Long short-term Ramanan, D. Object detection with discriminatively
45. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning memory. Neural Comput. 9, 1735–1780 (1997). trained part-based models. IEEE Trans. Pattern Anal.
the regulatory code of the accessible genome with 66. Bai, S., Zico Kolter, J. & Koltun, V. An empirical Mach. Intell. 32, 1627–1645 (2010).
deep convolutional neural networks. Genome Res. 26, evaluation of generic convolutional and recurrent 91. Guo, M., Haque, A., Huang, D.-A., Yeung, S. & Fei-Fei, L.
990–999 (2016). networks for sequence modeling. Preprint at arXiv in Computer Vision – ECCV 2018 (eds Ferrari, V.,
This paper describes the application of a deep https://arxiv.org/abs/1803.01271 (2018). Hebert, M., Sminchisescu, C. & Weiss, Y.) Vol. 11220
CNN to predict chromatin accessibility in 164 cell 67. Pan, X., Rijnbeek, P., Yan, J. & Shen, H.-B. Prediction 282–299 (Springer International Publishing, 2018).
types from DNA sequence. of RNA-protein sequence and structure binding 92. Sundaram, L. et al. Predicting the clinical impact of
46. Wang, M., Tai, C., E, W. & Wei, L. DeFine: deep preferences using deep convolutional and recurrent human mutation with deep neural networks. Nat. Genet.
convolutional neural networks accurately quantify neural networks. BMC Genomics 19, 511 (2018). 50, 1161–1170 (2018).
intensities of transcription factor-DNA binding and 68. Quang, D. & Xie, X. DanQ: a hybrid convolutional 93. Zitnik, M. et al. Machine learning for integrating data
facilitate evaluation of functional non-coding variants. and recurrent deep neural network for quantifying the in biology and medicine: principles, practice, and
Nucleic Acids Res. 46, e69 (2018). function of DNA sequences. Nucleic Acids Res. 44, opportunities. Inf. Fusion 50, 71–91 (2018).
47. Kelley, D. R. et al. Sequential regulatory activity e107 (2016). 94. Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. in
prediction across chromosomes with convolutional 69. Quang, D. & Xie, X. FactorNet: a deep learning Advances in Neural Information Processing Systems
neural networks. Genome Res. 28, 739–750 framework for predicting cell type specific transcription 27 (NIPS 2014) (eds Ghahramani, Z., Welling, M.,
(2018). factor binding from nucleotide-resolution sequential Cortes, C., Lawrence, N. D. & Weinberger, K. Q.)
In this paper, a deep CNN was trained to data. Preprint at bioRxiv https://doi.org/10.1101/ 3320–3328 (Curran Associates Inc., 2014).
predict more than 4,000 genomic measurements 151274 (2017). 95. Kornblith, S., Shlens, J. & Le, Q. V. Do better ImageNet
including gene expression as measured by cap 70. Lee, B., Baek, J., Park, S. & Yoon, S. in Proceedings of models transfer better? Preprint at arXiv https://arxiv.
analysis of gene expression (CAGE) for every the 7th ACM International Conference on Bioinformatics, org/abs/1805.08974 (2018).
150 bp in the genome using a receptive field Computational Biology, and Health Informatics 96. Russakovsky, O. et al. ImageNet large scale visual
of 32 kb. 434–442 (ACM, 2016). recognition challenge. Preprint at arXiv https://arxiv.
48. Schreiber, J., Libbrecht, M., Bilmes, J. & Noble, W. 71. Park, S., Min, S., Choi, H. & Yoon, S. deepMiRGene: org/abs/1409.0575 (2014).
Nucleotide sequence and DNaseI sensitivity are deep neural network based precursor microRNA 97. Esteva, A. et al. Dermatologist-level classification of
predictive of 3D chromatin architecture. Preprint prediction. Preprint at arXiv https://arxiv.org/ skin cancer with deep neural networks. Nature 542,
at bioRxiv https://doi.org/10.1101/103614 abs/1605.00017 (2016). 115–118 (2017).
(2018). 72. Boža, V., Brejová, B. & Vinař;, T. DeepNano: 98. Pawlowski, N., Caicedo, J. C., Singh, S., Carpenter, A. E.
49. Zeng, H. & Gifford, D. K. Predicting the impact of deep recurrent neural networks for base calling in & Storkey, A. Automating morphological profiling
non-coding variants on DNA methylation. Nucleic MinION nanopore reads. PLOS ONE 12, e0178751 with generic deep convolutional networks. Preprint at
Acids Res. 45, e99 (2017). (2017). bioRxiv https://doi.org/10.1101/085118 (2016).
50. Angermueller, C., Lee, H. J., Reik, W. & Stegle, O. 73. Mikheyev, A. S. & Tin, M. M. Y. A first look at the 99. Zeng, T., Li, R., Mukkamala, R., Ye, J. & Ji, S. Deep
DeepCpG: accurate prediction of single-cell DNA Oxford Nanopore MinION sequencer. Mol. Ecol. convolutional neural networks for annotating gene
methylation states using deep learning. Genome Biol. Resour. 14, 1097–1102 (2014). expression patterns in the mouse brain. BMC
18, 67 (2017). 74. Barabási, A.-L., Gulbahce, N. & Loscalzo, J. Network Bioinformatics 16, 147 (2015).
51. Zhou, J. et al. Deep learning sequence-based ab initio medicine: a network-based approach to human 100. Zhang, W. et al. in IEEE Transactions on Big Data
prediction of variant effects on expression and disease disease. Nat. Rev. Genet. 12, 56–68 (2011). (IEEE, 2018).
risk. Nat. Genet. 50, 1171–1179 (2018). 75. Mitra, K., Carvunis, A.-R., Ramesh, S. K. & Ideker, T. 101. Adam, P. et al. Automatic differentiation in PyTorch.
In this paper, two models, a deep CNN and a linear Integrative approaches for finding modular structure Presented at 31st Conference on Neural Information
model, are stacked to predict tissue-specific gene in biological networks. Nat. Rev. Genet. 14, 719–732 Processing Systems (NIPS 2017).
expression from DNA sequence, which demonstrates (2013). 102. Abadi, M. et al. Tensorflow: large-scale machine
the utility of this approach in non-coding variant 76. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M. learning on heterogeneous distributed systems.
effect prediction. & Monfardini, G. The graph neural network model. Preprint at arXiv https://arxiv.org/abs/1603.04467
52. Cuperus, J. T. et al. Deep learning of the regulatory IEEE Trans. Neural Netw. 20, 61–80 (2009). (2016).
grammar of yeast 5’ untranslated regions from 77. Defferrard, M., Bresson, X. & Vandergheynst, P. in 103. Avsec, Z. et al. Kipoi: accelerating the community
500,000 random sequences. Genome Res. 27, Advances in Neural Information Processing Systems exchange and reuse of predictive models for genomics.
2015–2024 (2017). 29 (NIPS 2016) (eds Lee, D. D., Sugiyama, M., Preprint at bioRxiv https://doi.org/10.1101/375345
53. Pan, X. & Shen, H.-B. RNA-protein binding motifs Luxburg, U. V., Guyon, I. & Garnett, R.) 3844–3852 (2018).
mining with a new hybrid deep learning based (Curran Associates Inc., 2016). This paper describes a platform to exchange
cross-domain knowledge integration approach. 78. Kipf, T. N. & Welling, M. Semi-supervised classification trained predictive models in genomics including
BMC Bioinformatics 18, 136 (2017). with graph convolutional networks. Preprint at arXiv deep neural networks.
54. Avsec, Ž., Barekatain, M., Cheng, J. & Gagneur, J. https://arxiv.org/abs/1609.02907 (2016). 104. Breiman, L. Statistical modeling: the two cultures
Modeling positional effects of regulatory sequences 79. Battaglia, P. W. et al. Relational inductive biases, (with comments and a rejoinder by the author).
with spline transformations increases prediction deep learning, and graph networks. Preprint at arXiv Stat. Sci. 16, 199–231 (2001).
accuracy of deep neural networks. Bioinformatics 34, https://arxiv.org/abs/1806.01261 (2018). 105. Greenside, P., Shimko, T., Fordyce, P. & Kundaje, A.
1261–1269 (2018). 80. Hamilton, W. L., Ying, R. & Leskovec, J. Inductive Discovering epistatic feature interactions from neural
55. Budach, S. & Marsico, A. pysster: classification of representation learning on large graphs. Preprint network models of regulatory DNA sequences.
biological sequences by learning sequence and at arXiv https://arxiv.org/abs/1706.02216 (2017). Bioinformatics 34, i629–i637 (2018).
structure motifs with convolutional neural networks. 81. Chen, J., Ma, T. & Xiao, C. FastGCN: fast learning 106. Zeiler, M. D. & Fergus, R. in Computer Vision – ECCV
Bioinformatics 34, 3035–3037 (2018). with graph convolutional networks via importance 2014 (eds Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.)
56. Cheng, S. et al. MiRTDL: a deep learning approach sampling. Preprint at arXiv https://arxiv.org/abs/ Vol. 8689 818–833 (Springer International
for miRNA target prediction. IEEE/ACM Trans. Comput. 1801.10247 (2018). Publishing, 2014).
Biol. Bioinform. 13, 1161–1169 (2016). 82. Zitnik, M. & Leskovec, J. Predicting multicellular 107. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep
57. Kim, H. K. et al. Deep learning improves prediction of function through multi-layer tissue networks. inside convolutional networks: visualising image
CRISPR-Cpf1 guide RNA activity. Nat. Biotechnol. 36, Bioinformatics 33, i190–i198 (2017). classification models and saliency maps. Preprint
239–241 (2018). 83. Zitnik, M., Agrawal, M. & Leskovec, J. Modeling at arXiv https://arxiv.org/abs/1312.6034 (2013).
58. Koh, P. W., Pierson, E. & Kundaje, A. Denoising polypharmacy side effects with graph convolutional 108. Shrikumar, A., Greenside, P., Shcherbina, A. &
genome-wide histone ChIP-seq with convolutional networks. Bioinformatics 34, i457–i466 (2018). Kundaje, A. Not just a black box: learning important
neuralnetworks. Bioinformatics 33, i225–i233 84. Duvenaud, D. K. et al. in Advances in Neural features through propagating activation differences.
(2017). Information Processing Systems 28 (NIPS 2015) (eds Preprint at arXiv https://arxiv.org/abs/1605.01713
59. Zhang, Y. et al. Enhancing Hi-C data resolution with Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. (2016).
deep convolutional neural network HiCPlus. Nat. & Garnett, R.) 2224–2232 (Curran Associates Inc., This paper introduces DeepLIFT, a neural network
Commun. 9, 750 (2018). 2015). interpretation method that highlights inputs most
60. Nielsen, A. A. K. & Voigt, C. A. Deep learning to 85. Kearnes, S., McCloskey, K., Berndl, M., Pande, V. influential for the prediction.
predict the lab-of-origin of engineered DNA. Nat. & Riley, P. Molecular graph convolutions: moving 109. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic
Commun. 9, 3135 (2018). beyond fingerprints. J. Comput. Aided Mol. Des. 30, attribution for deep networks. Preprint at arXiv
61. Luo, R., Sedlazeck, F. J., Lam, T.-W. & Schatz, M. 595–608 (2016). https://arxiv.org/abs/1703.01365 (2017).
Clairvoyante: a multi-task convolutional deep neural 86. Dutil, F., Cohen, J. P., Weiss, M., Derevyanko, G. 110. Lanchantin, J., Singh, R., Wang, B. & Qi, Y. Deep motif
network for variant calling in single molecule & Bengio, Y. Towards gene expression convolutions dashboard: visualizing and understanding genomic
sequencing. Preprint at bioRxiv https://doi.org/ using gene interaction graphs. Preprint at arXiv sequences using deep neural networks. Pac. Symp.
10.1101/310458 (2018). https://arxiv.org/abs/1806.06975 (2018). Biocomput. 22, 254–265 (2017).


111. Shrikumar, A. et al. TF-MoDISco v0.4.4.2-alpha: (eds Ghahramani, Z., Welling, M., Cortes, C., machine learning for on-device intelligence. Preprint at
technical note. Preprint at arXiv https://arxiv.org/ Lawrence, N. D. & Weinberger, K. Q.) 2672–2680 arXiv https://arxiv.org/abs/1610.02527 (2016).
abs/1811.00416v2 (2018). (Curran Associates Inc., 2014). 156. Beaulieu-Jones, B. K. et al. Privacy-preserving
112. Ma, J. et al. Using deep learning to model the 134. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. generative deep neural networks support clinical data
hierarchical structure and function of a cell. & Yosef, N. Deep generative modeling for single-cell sharing. Preprint at bioRxiv https://doi.org/10.1101/
Nat. Methods 15, 290–298 (2018). transcriptomics. Nat. Methods 15, 1053–1058 159756 (2018).
113. Hinton, G. E. & Salakhutdinov, R. R. Reducing the (2018). 157. Lever, J., Krzywinski, M. & Altman, N. Classification
dimensionality of data with neural networks. Science 135. Way, G. P. & Greene, C. S. in Biocomputing 2018: evaluation. Nat. Methods 13, 603 (2016).
313, 504–507 (2006). Proceedings of the Pacific Symposium (eds Altman, R. B. 158. Tieleman, T. & Hinton, G. Lecture 6.5 - RMSProp,
114. Kramer, M. A. Nonlinear principal component analysis et al.) 80–91 (World Scientific, 2018). COURSERA: neural networks for machine learning
using autoassociative neural networks. AIChE J. 37, 136. Grønbech, C. H. et al. scVAE: variational auto-encoders (2012).
233–243 (1991). for single-cell gene expression data. Preprint at 159. Kingma, D. P. & Ba, J. Adam: a method for stochastic
115. Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. bioRxiv https://doi.org/10.1101/318295 (2018). optimization. Preprint at arXiv https://arxiv.org/abs/
in Proceedings of the 25th International Conference on 137. Wang, D. & Gu, J. VASC: dimension reduction and 1412.6980 (2014).
Machine Learning 1096–1103 (ACM, 2008). visualization of single-cell RNA-seq data by deep 160. Schmidhuber, J. Deep learning in neural networks:
116. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y. & variational autoencoder. Genomics Proteomics an overview. Neural Netw. 61, 85–117 (2015).
Manzagol, P.-A. Stacked denoising autoencoders: Bioinformatics 16, 320–331 (2018). 161. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning.
learning useful representations in a deep network 138. Lotfollahi, M., Alexander Wolf, F. & Theis, F. J. Nature 521, 436–444 (2015).
with a local denoising criterion. J. Mach. Learn. Res. Generative modeling and latent space arithmetics 162. Bottou, L. in Proceedings of Neuro-Nımes ‘91 12
11, 3371–3408 (2010). predict single-cell perturbation response across cell (EC2, 1991).
117. Jolliffe, I. in International Encyclopedia of Statistical types, studies and species. Preprint at bioRxiv 163. Bengio, Y. Practical recommendations for gradient-
Science (ed. Lovric, M.) 1094–1096 (Springer Berlin https://doi.org/10.1101/478503 (2018). based training of deep architectures. Preprint at arXiv
Heidelberg, 2011). 139. Hu, Q. & Greene, C. S. Parameter tuning is a key part https://arxiv.org/abs/1206.5533 (2012).
118. Plaut, E. From principal subspaces to principal of dimensionality reduction via deep variational 164. Bergstra, J. & Bengio, Y. Random search for hyper-
components with linear autoencoders. Preprint at autoencoders for single cell RNA transcriptomics. parameter optimization. J. Mach. Learn. Res. 13,
arXiv https://arxiv.org/abs/1804.10253 (2018). Preprint at bioRxiv https://doi.org/10.1101/385534 281–305 (2012).
119. Kunin, D., Bloom, J. M., Goeva, A. & Seed, C. Loss (2018). 165. Bergstra, J., Yamins, D. & Cox, D. in Proceedings of
landscapes of regularized linear autoencoders. Preprint 140. Gupta, A. & Zou, J. Feedback GAN (FBGAN) for DNA: the 30th International Conference on Machine
at arXiv https://arxiv.org/abs/1901.08168 (2019). a novel feedback-loop architecture for optimizing Learning Vol. 28 115–123 (JMLR W&CP, 2013).
120. Scholz, M., Kaplan, F., Guy, C. L., Kopka, J. protein functions. Preprint at arXiv https://arxiv.org/ 166. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P.
& Selbig, J. Non-linear PCA: a missing data approach. abs/1804.01694 (2018). & de Freitas, N. Taking the human out of the loop:
Bioinformatics 21, 3887–3895 (2005). 141. Killoran, N., Lee, L. J., Delong, A., Duvenaud, D. & a review of bayesian optimization. Proc. IEEE 104,
121. Tan, J., Hammond, J. H., Hogan, D. A. & Greene, C. S. Frey, B. J. Generating and designing DNA with deep 148–175 (2016).
ADAGE-based integration of publicly available generative models. Preprint at arXiv https://arxiv.org/ 167. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A.
Pseudomonas aeruginosa gene expression data with abs/1712.06148 (2017). & Talwalkar, A. Hyperband: a novel bandit-based
denoising autoencoders illuminates microbe-host 142. Ghahramani, A., Watt, F. M. & Luscombe, N. M. approach to hyperparameter optimization. J. Mach.
interactions. mSystems 1, e00025–15 (2016). Generative adversarial networks simulate gene Learn. Res. 18, 6765–6816 (2017).
122. Tan, J. et al. ADAGE signature analysis: differential expression and predict perturbations in single cells. 168. Elsken, T., Metzen, J. H. & Hutter, F. Neural architecture
expression analysis with data-defined gene sets. Preprint at bioRxiv https://doi.org/10.1101/262501 search: a survey. Preprint at arXiv https://arxiv.org/
BMC Bioinformatics 18, 512 (2017). (2018). abs/1808.05377 (2018).
123. Tan, J. et al. Unsupervised extraction of stable 143. Amodio, M. & Krishnaswamy, S. MAGAN: aligning
expression signatures from public compendia with biological manifolds. Preprint at arXiv https://arxiv. Acknowledgements
an ensemble of neural networks. Cell Syst. 5, 63–71 org/abs/1803.00385 (2018). Ž.A. was supported by the German Bundesministerium für
(2017). 144. Maurano, M. T. et al. Systematic localization of Bildung und Forschung (BMBF) through the project MechML
124. Brechtmann, F. et al. OUTRIDER: a statistical method common disease-associated variation in regulatory (01IS18053F). The authors acknowledge M. Heinig and
for detecting aberrantly expressed genes in RNA DNA. Science 337, 1190–1195 (2012). A. Raue for valuable feedback.
sequencing data. Am. J. Hum. Genet. 103, 907–917 145. Cheng, J. et al. MMSplice: modular modeling improves
(2018). the predictions of genetic variant effects on splicing. Author contributions
125. Ding, J., Condon, A. & Shah, S. P. Interpretable Genome Biol. 20, 48 (2019). The authors contributed equally to all aspects of the article.
dimensionality reduction of single cell transcriptome 146. van der Maaten, L. in Proceedings of the Twelfth
data with deep generative models. Nat. Commun. 9, International Conference on Artificial Intelligence Competing interests
2002 (2018). and Statistics (eds van Dyk, D. & Welling, M.) Vol. 5 The authors declare no competing interests.
126. Cho, H., Berger, B. & Peng, J. Generalizable and 384–391 (PMLR, 2009).
scalable visualization of single-cell data using neural 147. Angerer, P. et al. Single cells make big data: new Publisher’s note
networks. Cell Syst. 7, 185–191 (2018). challenges and opportunities in transcriptomics. Springer Nature remains neutral with regard to jurisdictional
127. Deng, Y., Bao, F., Dai, Q., Wu, L. & Altschuler, S. Curr. Opin. Syst. Biol. 4, 85–91 (2017). claims in published maps and institutional affiliations.
Massive single-cell RNA-seq analysis and imputation 148. Shaham, U. et al. Removal of batch effects using
via deep learning. Preprint at bioRxiv https://doi.org/ distribution-matching residual networks. Bioinformatics Reviewer information
10.1101/315556 (2018). 33, 2539–2546 (2017). Nature Reviews Genetics thanks C. Greene and the other
128. Talwar, D., Mongia, A., Sengupta, D. & Majumdar, A. 149. Regev, A. et al. The human cell atlas. eLife 6, e27041 anonymous reviewer(s) for their contribution to the peer
AutoImpute: autoencoder based imputation of (2017). review of this work.
single-cell RNA-seq data. Sci. Rep. 8, 16329 (2018). 150. Fleming, N. How artificial intelligence is changing drug
129. Amodio, M. et al. Exploring single-cell data with deep discovery. Nature 557, S55–S57 (2018).
multitasking neural networks. Preprint at bioRxiv 151. Kalinin, A. A. et al. Deep learning in pharmacogenomics: Related links
https://doi.org/10.1101/237065 (2019). from gene regulation to patient stratification. DragoNN: https://kundajelab.github.io/dragonn/tutorials.html
130. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Pharmacogenomics 19, 629–650 (2018). Kaggle machine learning competitions: https://www.
Theis, F. J. Single-cell RNA-seq denoising using a deep 152. AlQuraishi, M. End-to-end differentiable learning of kaggle.com/sudalairajkumar/winning-solutions-of-kaggle-
count autoencoder. Nat. Commun. 10, 390 (2019). protein structure. Preprint at bioRxiv https://doi.org/ competitions
131. Lin, C., Jain, S., Kim, H. & Bar-Joseph, Z. Using neural 10.1101/265231 (2018). Keras: https://keras.io/
networks for reducing the dimensions of single-cell 153. Nawy, T. Spatial transcriptomics. Nat. Methods 15, Keras model zoos: https://keras.io/applications/
RNA-Seq data. Nucleic Acids Res. 45, e156 (2017). 30 (2018). PyTorch: http://pytorch.org
132. Kingma, D. P. & Welling, M. Auto-encoding variational 154. Eulenberg, P. et al. Reconstructing cell cycle and PyTorch model zoos: https://pytorch.org/docs/stable/
bayes. Preprint at arXiv https://arxiv.org/abs/1312. disease progression using deep learning. Nat. Commun. torchvision/models.html
6114 (2013). 8, 463 (2017). Tensorflow model zoos: https://github.com/tensorflow/
133. Goodfellow, I. et al. in Advances in Neural 155. KoneČný, J., McMahan, H. B., Ramage, D. models; https://www.tensorflow.org/hub/
Information Processing Systems 27 (NIPS 2014) & Richtárik, P. Federated optimization: distributed
