Article
Article
Article
https://doi.org/10.1038/s42256-020-0217-y
The remarkable flexibility and adaptability of ensemble methods and deep learning models have led to the proliferation of their
application in bioinformatics research. Traditionally, these two machine learning techniques have largely been treated as inde-
pendent methodologies in bioinformatics applications. However, the recent emergence of ensemble deep learning—wherein
the two machine learning techniques are combined to achieve synergistic improvements in model accuracy, stability and repro-
ducibility—has prompted a new wave of research and application. Here, we share recent key developments in ensemble deep
learning and look at how their contribution has benefited a wide range of bioinformatics research from basic sequence analysis
to systems biology. While the application of ensemble deep learning in bioinformatics is diverse and multifaceted, we identify
and discuss the common challenges and opportunities in the context of bioinformatics research. We hope this Review Article
will bring together the broader community of machine learning researchers, bioinformaticians and biologists to foster future
research and development in ensemble deep learning, and inspire novel bioinformatics applications that are unattainable by
traditional methods.
B
ioinformatics, an interdisciplinary field of research, is at the With the aim of providing a reference point to foster research in the
centre of modern molecular biology, where computational increasingly popular field of ensemble deep learning and its applica-
methods are developed and utilized to transform biological tion to various challenges in bioinformatics, in this Review Article
data into knowledge and translate them for biomedical applications. we revisit the foundation of ensemble and deep learning, and sum-
Among the various computational methods utilized in bioinfor- marize and categorize the latest developments in ensemble deep
matics research, machine learning, a branch of artificial intelligence learning. This is followed by a survey of ensemble deep learning
characterized by data-driven model building, has been the key applications in bioinformatics. We then discuss the remaining chal-
enabling computational technology1. At the forefront of machine lenges and opportunities that we hope will inspire future research
learning, ensemble learning and deep learning have independently and development across multiple disciplines.
made a substantial impact on the field of bioinformatics through
their widespread applications, from basic nucleotide and protein Basics of ensemble and deep learning
sequence analysis to systems biology2,3. Ensemble learning refers to a class of strategies where instead of
Until recently, ensemble and deep learning models have largely building a single model, multiple ‘base’ models are combined to per-
been treated as independent methodologies in bioinformatics form tasks such as supervised and unsupervised learning7. Classic
applications. The fast-growing synergy between these two popular ensemble methods for supervised learning fall into three categories:
techniques, however, has attracted a new wave of development and bagging-, boosting- and stacking-based methods. In bagging8, indi-
application of next-generation machine learning methods referred vidual base models are trained on subsets of data sampled randomly
to as ensemble deep learning (Fig. 1a). The root of ensemble deep with replacement (Fig. 1b). In boosting9, models are trained sequen-
learning can be traced back two decades, when ensembles of neu- tially (Fig. 1c), where subsequent models focus on previous misclas-
ral networks were found to reduce generalization error4. However, sified samples. In stacking, a meta-learner is trained to optimally
the recent resurgence of ensemble deep learning models has combine the predictions made by base models10. Like supervised
brought about new ideas, algorithms, frameworks and architec- ensemble learning, conventional unsupervised ensemble learn-
tures that substantially enrich the old paradigm. Through its novel ing, such as ensemble clustering11, also relies on the generation and
application to a wide range of biological and biomedical research, integration of base models (Fig. 1d). While their variants, includ-
ensemble deep learning is unleashing its power in dealing with ing more advanced methods reviewed in the next section, have also
key challenges, including small sample size, high-dimensionality, been used in ensemble learning, a guiding principle in designing
imbalanced class distribution, and noisy and heterogeneous data ensemble methods has been ‘many heads are better than one’12.
generated from diverse cellular and biological systems using an Deep learning, a branch of machine learning, is rooted in artifi-
array of high-throughput omics technologies. These computa- cial neural networks13. The most fundamental architecture of deep
tional, methodological and technological undertakings and break- learning models is the densely connected neural network (DNN),
throughs together are leading a phenomenal transformation of consisting of a series of layers of neurons; each of these is connected
bioinformatics. to all neurons in the previous layer14. More sophisticated models
Both ensemble learning and deep learning methods have been expand on the basic architectures. In convolutional neural networks
extensively studied and reviewed in the context of bioinformatics (CNNs)15, each layer comprises a series of filters that ‘slide over’ the
applications5,6. However, the emergence of ensemble deep learn- output of the previous layer to extract local features across different
ing and its application in bioinformatics has yet to be documented. parts of the input. In recurrent neural networks (RNNs)16, circuits
1
School of Mathematics and Statistics, University of Sydney, Sydney, New South Wales, Australia. 2Charles Perkins Centre, University of Sydney,
Sydney, New South Wales, Australia. 3School of Environmental and Life Sciences, University of Sydney, Sydney, New South Wales, Australia.
4
Computational Systems Biology Group, Children’s Medical Research Institute, University of Sydney, Westmead, New South Wales, Australia.
✉e-mail: [email protected]
ni
el
ng
ar
En
Ensemble
li
ni
se
l
ia
ar
ne
deep learning in
behind promoting negative correlation among base models is to
ic
m
le
hi
ti f
bl
bioinformatics
p
ac
Ar
e
ee
le
M
ar
ni
better generalizability of the ensemble. An alternative approach to
ng
increase base model diversity is through multiple choice learning
b Data c d Data in which each network is ‘specialized’ on a particular subset of data
perturbation perturbation
during the training step26.
X X X An issue associated with training and storing multiple models is
the computational and storage demand involved. To address this,
...
... Base
... methods that perform knowledge distillation have become increas-
X1 X2 XT P1 P2 PT X1 X2 XT
...
classifiers
...
ingly popular27. One such implementation is based on the concept
Passing classification
P1 P2 PT
residues
C1 C2 CT of a teacher–student network framework, where the teacher net-
works are selected from a pool of pre-trained networks and the stu-
Base Base
classifiers clusters dent network distils knowledge of multiple teachers into a single
V WV V
and often simpler network28,29. The testing phase is storage and com-
Voting Consensus
function
Weighted
function
putationally efficient, as the samples only need to pass through a
voting
single student network.
Fig. 1 | The focus of this Review Article and classic ensemble methods. Ensemble within a single model. Ensemble strategies described in
a, Relationships of artificial intelligence, machine learning, deep learning, the previous section require training of multiple models. Deep
ensemble learning and bioinformatics. The red square denotes the focal learning models are often computationally costly to train and
point of this Review Article. b–d, Classic ensemble learning frameworks may take days or even weeks depending on the scale of the dataset
including bagging and its variants (b), boosting and its variants (c), and and model. Effort has been made to develop ‘implicit ensembles’
ensemble clustering based on data perturbation (d). X, input data. where a single neural network could achieve an effect similar to
integrating multiple network models. To this end, a group of tech-
niques focuses on random deactivation of neurons and layers dur-
are created to feed the output of a layer back into the same layer ing the training process of a single model. This leads to an implicit
along with new input, allowing the model to act on dependencies ensemble of networks with different architectures (Fig. 2b). For
between upstream and downstream values in a sequence. Variants example, the random deactivation of neurons, termed dropout,
of RNNs have been proposed to enable more effective learning in originally proposed as a regularization strategy30 for addressing
long-term dependency tasks, with the two most common ones model overfitting is now widely known as an implicit ensemble
being long short-term memory (LSTM)17 and gated recurrent unit strategy31,32. This has inspired follow-up works on random deacti-
(GRU)18. In residual neural networks (ResNet)19, shortcuts between vation of building blocks, termed ResBlocks, in ResNets33 and the
upstream and downstream layers are introduced to improve the combined random deactivation of neurons and layers34. Besides
effectiveness of backpropagation in networks with many hidden random deactivation-based methods, alternative strategies have
layers. In autoencoders20, networks are constructed with an encoder also been explored. One popular approach is the snapshot ensemble
and a decoder that together learn a more efficient latent space rep- technique, where the key idea is to save multiple versions of a sin-
resentation of the original higher-dimensional data. Although the gle model during the training process for forming an ensemble35.
difference between traditional neural networks and deep learning In a snapshot ensemble, a cyclic learning rate scheduler is utilized,
may seem elusive, the latter is increasingly defined by its unique where the learning rate is abruptly changed every few epochs to per-
architectures and ability to learn complex data representations that turb the network and thus may lead to diversity in the snapshots of
are beyond the capacity of classic models21. the model.
Ensemble deep learning Ensemble with model branching. Single-model ensemble approaches
Deep learning is well known for its power to approximate almost greatly reduce training cost compared to ensembles of multiple
any function and increasingly demonstrates predictive accuracy models. However, such a reduction in computational demand
that surpasses human experts. However, deep learning models are comes potentially at a cost in base model diversity. Since the infor-
not without shortcomings: they often exhibit high variance and may mation captured by the lower layers of neural networks is likely to
fall into local loss minima during training. Indeed, empirical results be similar across models, a group of techniques has emerged with a
of ensemble methods that combine the output of multiple deep focus on sharing lower layers followed by ‘branching’ of additional
learning models have been shown to achieve better generalizability layers36. These model branching approaches introduce diversity
than a single model22. In addition to simple ensemble approaches while also enjoying the reduction of time and computation of train-
such as averaging output from individual models, combining het- ing multiple models (Fig. 2c). Besides reducing computational cost,
erogeneous models enables multifaceted abstraction of data, and model branching has also been adapted to address other challenges
may lead to better learning outcomes23. In this section, we catego- in training an ensemble. For example, the gradient can be propa-
rize and summarize the most representative ensemble deep learning gated over a shorter path in a branching network, mitigating the
strategies for both supervised and unsupervised tasks. vanishing gradient problem37. In the knowledge distillation frame-
...
...
...
...
X ...
Voting X X
...
...
... function
T
X ...
d Data e f
perturbation
X1 Random deactivation
Model of neurons
perturbation
...
...
...
...
...
...
...
X X
XT X
C1 C1 C
Base Base
F F
...
...
clusters clusters
CT Consensus CT Consensus
function function
Fig. 2 | Typical ensemble deep learning frameworks in supervised and unsupervised learning. a, Ensemble across multiple models. Each neural network
is trained separately on the dataset, usually perturbed to allow the network to learn from diverse training samples. b, Ensemble within a single model.
Common strategies for creating intrinsic variants of the network include randomly deactivating and bypassing layers (indicated by the curved arrow)
and randomly deactivating neurons (indicated by the close-up). c, Ensemble by model branching. Common strategies include sharing lower layers and
branching out to learn different higher-level features with or without weight sharing. d, Unsupervised ensemble by data perturbation. Each autoencoder
is trained with a perturbed dataset such as bootstrapping. The latent representations are extracted for clustering and combined through a consensus
function. e, Model perturbation-based unsupervised ensemble. Multiple autoencoders each with a different model architecture can be used to learn
diverse representation of the original data. f, Unsupervised ensemble within a single model. Similar to the supervised case, random deactivation of neurons
can be used to create intrinsic variants of the network.
work, each branch acts as a student model, ensembled to form a multi-view clustering when such data are available. Representative
teacher model on the fly to reduce the computationally intensive examples include multi-view representation learning using deep
process of pre-training the teacher model38. The key commonality canonically correlated autoencoders41 and multi-view spectral clus-
between these model branching network ensembles is that by shar- tering where multiple embedding networks were used to represent
ing information, the base networks avoid parameter search from the original data from different feature sets42.
scratch and can converge faster.
Ensemble within a single model. The power of autoencoders in data
Unsupervised ensemble deep learning. In this section, we summa- dimension reduction has motivated research around creating bet-
rize the key ensemble deep learning frameworks for unsupervised ter data representations that are robust to noise in the input data.
tasks. For example, a denoising autoencoder architecture was introduced
in ref. 43, where values of a random subset of neurons are masked
Ensemble across multiple models. Most unsupervised ensemble (that is, changed to zero) during each training epoch, forcing the
deep learning methods employ autoencoders, a popular unsuper- network to overcome noise introduced to the data. The concept of
vised network architecture. Similar to the supervised approach, randomly masking neurons in denoising autoencoders is analogous
unsupervised ensemble methods can be categorized into those that to the dropout method used in the supervised approach, and hence
generate and combine multiple models through data and model can be considered as an implicit ensemble within a single model,
perturbation, and those that achieve implicit ensemble within a or ‘pseudo-ensemble’44, for unsupervised deep learning (Fig. 2f).
single model. In this line of research, a recent study exploits the flexibility of the
For methods based on data perturbation, strategies akin to bag- dropout algorithm and embeds it in a more advanced variational
ging in supervised learning are widely used (Fig. 2d). For example, autoencoder architecture45. The proposed algorithm employs a
Geddes et al. used random feature projection of the input data to novel strategy to learn the dropout parameter, thus alleviating the
train a set of autoencoders to create a cluster ensemble39. Training a need for manual tuning. Another extension in this direction is the
series of unsupervised networks with different hyper-parameters is ‘stacked’ denoising autoencoder that uses multiple layers of denois-
a common ensemble strategy for methods based on model pertur- ing autoencoders for improving data representation46. The data
bation (Fig. 2e). An example extending this approach uses differ- representation learned from such stacked denoising autoencoders
ent activation functions and a weighting scheme to improve model led to substantially improved classification accuracy compared with
accuracy40. An alternative to data and model perturbation is to use using raw input data.
Theoretical advances for ensemble deep learning. While early component for handling small sample size59. Beyond combining
works on the bias–variance trade-off framework have laid the different network architectures, studies have also integrated differ-
theoretical foundation for neural network ensembles47, recent ent genomic data modalities to capture distinct and complementary
research on ensemble deep learning mostly relies on empirical information. In one study, DNA sequences and their neighbouring
experiments due to the increasingly specialized ensemble method- cytosine–guanine dinucleotide (CpG) states were used as input into
ologies and complex neural network architectures. Nevertheless, two sub-networks of an ensemble to explore their relationship in
efforts have been made to advance the theoretical foundation of predicting DNA methylation states60. This has led to the identifica-
this fast-growing field48. Studies have shown the existence of mul- tion of sequence motifs related to DNA methylation and the effect of
tiple local minima in training neural networks, where some enjoy their mutation on CpG methylation. In another study, an ensemble
better generalizability than others49. This has inspired ensemble network that takes input data either from DNA sequences alone or
techniques such as snapshot methods that take advantage of the with the addition of epigenetic information extracted from chroma-
diversity of multiple local minima35. Theoretical justification for tin immunoprecipitation (ChIP) and deoxyribonuclease (DNase)
dropout as a form of averaging has been discussed in ref. 31, where sequencing were used to predict human immunodeficiency virus
the expectation of the gradient with dropout was shown to be the type 1 (HIV-1) integration sites61. The ensemble network, compris-
gradient of the regularized ensemble error. A recent mathematical ing CNNs with attention layers62, enabled the discovery of DNA
framework provided a new perspective of dropout by relating it to a sequence motifs that are important for HIV-1 integration.
form of data augmentation50.
Gene expression. Gene expression data including microarray,
Bioinformatics applications of ensemble deep learning RNA-sequencing (RNA-seq) and, recently, single-cell RNA-seq
This section categorizes representative works in different areas of (scRNA-seq)63–65, has been studied extensively to better under-
bioinformatics applications (Table 1) and identifies their benefits, stand complex diseases and to identify biomarkers that can guide
such as improving model accuracy, reproducibility, interpretability therapeutic decision making. A recent study on cancer type clas-
and model inference. sification demonstrated how ensemble deep learning can serve as
a potential strategy to address the key challenge of reproducibility
Sequence analysis. Biological sequence analysis represents one of in biomarker research66. The use of a DNN ensemble in this work
the fundamental applications of computational methods in molecu- allowed the derivation of important genes through consensus rank-
lar biology. RNN and its variants (for example, LSTM and GRU) ing across multiple models, resulting in a robust set of biomarkers.
are well-suited to sequential data. For example, an LSTM/CNN Due to the difficulty of obtaining patient samples, especially for rare
multi-model was trained to extract distinct features to predict patho- diseases and cancer types, another common challenge in analysing
genic potential of DNA sequences51. Compared to DNA sequences, gene expression data from cancers and diseases is the small sample
RNA sequences offer an additional layer of information where size. The use of ensemble learning to mitigate this issue is exempli-
instructions encoded in genes are transcribed. While traditional fied by ref. 67, where the authors applied a multi-model approach to
methods rely on various manually curated RNA sequence features, generate initial predictions from RNA-seq gene expression profiles
ensemble deep learning enables automatic learning from raw data. of cancer samples and integrated these predictions using a DNN to
One example is in predicting localization of long non-coding RNAs, produce the final ensemble prediction.
where multiple sub-networks were used to integrate distinct feature In addition to its role in medical research, ensemble deep learn-
sets to maximize model performance52. In another work, a CNN/ ing has been used in a wide range of applications to improve under-
RNN ensemble was used to integrate features and raw sequence data standing of basic biological mechanisms from gene expression data.
to predict different types of translation initiation sites53, overcoming An example is the use of a DNN ensemble to explore the embry-
the generalizability issue of traditional methods that can only pre- onic to fetal transition process, a defining stage where cells lose the
dict a specific type of translational initiation sites. potential for regeneration68. A benefit of training multiple networks
Following transcription, messenger RNAs (mRNAs) are further is that the prediction scores from each network can be further used
translated into proteins that carry out various functions. Similar to to generate an integrative score to determine the transition state of a
RNA sequence analysis, methods relying on ensembles of multiple sample between the embryonic and adult state, a strategy that is not
sub-networks were used to integrate information from multiple possible with a single model. The utility of unsupervised ensemble
features sets to predict DNA binding sites54 and post-translational deep learning has also been demonstrated on the extraction of bio-
modification (PTM) sites55 on protein sequences. The study on logical pathway signatures69. By integrating signatures across 100
PTM site prediction has further demonstrated that features learned autoencoders through consensus clustering, the ensemble model
by ensemble models are ‘transferable’ for predicting different types detected more biological pathways with higher significance than a
of PTMs, a key property for tackling the issue of small sample size single model. Unsupervised deep learning ensembles have also been
in training data. applied to cell type identification in single-cell research. In ref. 39,
an ensemble of autoencoders was used to generate a diverse set of
Genome analysis. While sequence analysis has led to many bio- latent representations of scRNA-seq data for subsequent analysis.
logical discoveries, alone it cannot capture the full repertoire of
information encoded in the genome. Additional layers of genetic Structural bioinformatics. Proteins are the key products of genes,
information including structural variants56 (for example, copy and their functions and mechanisms are largely governed by pro-
number variations (CNVs)) and epigenetic modifications57 of the tein structures encoded in amino acid sequences. Therefore, mod-
genome bring important insight to the understanding of biological elling and characterizing proteins from their primary amino acid
systems, populations and complex diseases. sequences to secondary and tertiary structures is essential for under-
A number of ensemble deep learning methods have been devel- standing and predicting their functions70. RNN and its architectural
oped on this front, such as classifying cancer types using CNV variants are specifically designed to capture long- and short-range
data and a snapshot ensemble model comprising CNNs, LSTMs interactions between sequences, and are hence well-suited to
and convolutional autoencoders58. The use of supervised CNN and decoding the relationship between amino acid sequences and the
LSTM models allows both global and local sequential features to protein structures they encode. Extending on the use of a single
be captured, and further integration with unsupervised convolu- RNN, the ensemble of variants of RNNs with CNNs is a common
tional autoencoders enables unsupervised pre-training, an effective hybrid architecture in recent applications that seeks to combine the
+ ResNet
Others Ref. 52 Ref. 58 Ref. 95
Within single CNN + RNN Ref. 58
model
Model branching CNN Refs. 96,97
CNN + Ref. 94
ResNet
Unsupervised Multiple models Autoencoder Refs. 39,69 Ref. 91,92
Others Ref. 89
Within single Autoencoder Ref. 58
model
power of RNN in analysing sequential data and CNN on extracting from data-independent acquisition (DIA) MS data80. While conven-
local features71,72. The replacement of CNN with ResNet73 as well as tional MS runs select only a few important peptides based on their
the addition of residual connections between GRU and CNN74 were signal levels (that is, data-dependent acquisition) for subsequent
also explored to facilitate feature propagation for improved mod- quantification, the DIA approach fragments every single peptide for
elling of long-range dependencies between amino acids. In these improved proteome coverage. However, the DIA approach may lead
works, ensemble deep learning not only improved generalizability to an increase in co-eluted peptides and therefore higher interfer-
on independent datasets but also led to the discovery of novel fea- ence in the data. The ensemble framework was able to quantify the
tures associated with protein structures. amount of interference between multiple peptides mapped to the
Besides predicting protein structures, many studies have focused same point, thereby removing interference and improving peptide
on directly predicting protein functions. An example of ensemble identification confidence and quantification accuracy.
deep learning application in this domain is illustrated by the work
of Zacharaki75, who used an ensemble of CNNs for protein enzy- Systems biology. Systems biology aims to map interactions of mole-
matic function prediction. Specifically, the ensemble is a fusion cule species, regulatory relationships and mechanisms to understand
of two CNNs trained separately on protein properties and amino complex biological systems as a whole81. One key aspect of systems
acid features for extracting complementary information. In another biology is the understanding of what and how biological molecules
example, Singh et al.76 built an ensemble deep learning model to interact. In recent times, ensemble deep learning has been applied
identify residue conformation crucial to protein folding and func- on this front to predict interactions among different biological mol-
tion. While the dataset used for model training has an extreme class ecules and entities. The application of an interpretable ensemble of
imbalance (1.4:1,000), the ensemble model, consisting of ResNet CNN models for predicting binding affinity between peptides and
and LSTM modules, yielded robust performance on independent major histocompatibility complex is an example of ensemble deep
test sets without manual generation of a balanced dataset. learning in this domain82 and has important implication in clinics.
The model demonstrated good generalizability across 30 indepen-
Proteomics. While protein structure and function prediction are dent datasets and uncovered binding motifs with literature support.
essential tasks for characterizing individual proteins, technological In predicting protein–protein interactions, an ensemble of DNNs
advances in quantitative mass spectrometry (MS) have now enabled trained on S. cerevisiae achieved more accurate results than other
global profiling of the entire proteome in cells, tissues and species77. machine learning methods83. Subsequently, the model was applied
Computational analysis of such large volume datasets is transform- to other datasets generated from different organisms and the rela-
ing our understanding of proteome dynamics in complex systems tive accuracy on each dataset was shown to be a good indicator of
and diseases78. the evolutionary relationships of those organisms.
Ensemble deep learning has been used as a key technique for Systems biology also extends to the interaction between biologi-
addressing various aspects of proteomics data analysis. The work of cal molecules and chemical compounds. In particular, the study of
Zohora et al.79 exemplifies the application of ensemble deep learn- protein and chemical compound interaction in drug development
ing to peptide identification from a liquid chromatography-MS has seen a growing number of ensemble deep learning applica-
(LC-MS) map, a critical step for identifying and quantifying protein tions. For example, Karimi et al. proposed an ensemble model that
abundance. Specifically, a hybrid network architecture comprising comprised various network modules for compound–protein affin-
both CNN and RNN modules was designed to detect sequential ity prediction84. To overcome the limited availability of labelled
features along the axes during the scan of an MS map. The final datasets, the model exploited abundant unlabelled compound and
model, an ensemble of multiple networks with different parameters, protein data through unsupervised pre-training. This was followed
was shown to achieve state-of-the-art results for protein quantifica- by interaction prediction on labelled data using CNN and RNN
tion. Another study proposed an ensemble of DNNs for learning modules in the ensemble. In another work on predicting drug and
target protein interactions, a CNN-based ensemble model was used in addressing various other challenges in bioimage analysis. For
to score the likelihood of interaction of randomly selected drug– example, an ensemble network with knowledge distillation and a
protein pairs85. The trained model revealed that drugs with similar branching strategy was used to reduce the number of parameters in
structures bind to similar target proteins, suggesting potential simi- the model and therefore lower the likelihood of overfitting on small
larity in the effects of these drugs. datasets97. To deal with the problem of class imbalance, Yuan et al.98
introduced an iterative regularization approach that, for a given
Multi-omics. Multi-omics analysis is a topic closely related to sys- iteration, penalizes misclassification of samples that were correctly
tems biology, where integrative methods are used to understand classified in previous iterations. This method alleviated the problem
biological regulation by combining an array of omics data. There of bias in favour of majority classes and preserved correctly classi-
is a growing interest in multi-omic studies as it is increasingly rec- fied minority examples.
ognized that a single type of omics data does not capture the entire
landscape of the complex biological networks86. Challenges and opportunities
Many conventional machine learning methods have been pro- The applications we have reviewed here reveal various challenges
posed to utilize the complementary information present across mul- and opportunities surrounding ensemble deep learning in bioinfor-
tiple modalities of omics data87,88. Most conventional approaches, matics research. In the following sections, we highlight several key
however, do not account for the relationships among different directions in which ensemble deep learning is likely to have increas-
omics layers. To this end, Liang et al. proposed to use an ensemble of ingly important impacts.
deep belief networks to encode gene expression, miRNA expression
and DNA methylation data into multiple layers of hidden variables Small sample size. Deep learning is known for its exceptional
for integrative clustering89, thereby actively exploring regulation performance on data with large sample size. While modern omics
across different omics layers. Ensembles of different deep learn- technologies have enabled the profiling of tens of thousands of
ing architectures have also been utilized to take advantage of the molecular species and biological events in a single experiment, the
unique characteristics of different data types. Using an ensemble of number of samples available is usually small owing to the cost in
CNNs and LSTMs, both genomic sequences and their secondary time and labour. Hence, bioinformatics applications are often con-
structures can now be integrated for alternative polyadenylation site fronted with the issue of limited sample size, causing unstable pre-
prediction on pre-mRNAs90. This addressed the gap where existing dictions and thus low reproducibility in results.
models overlooked RNA secondary structures, despite these being Fortunately, one essential property of ensemble methods is stabil-
important features to the polyadenylation process. Another applica- ity. Leveraging this key property, a number of ensemble deep learn-
tion in multi-omics was the use of a novel ensemble of autoencoders ing methods were proposed to specifically address small sample
wherein a coupling cost was used to encourage the base autoencod- size challenges, opening up the opportunity to utilize deep learn-
ers to learn from each other91. This unsupervised model allowed the ing in bioinformatics. While the most popular approach so far has
integration of two vastly different data types—single-cell transcrip- been using pre-trained models, more specialized methods have also
tomics and electrophysiological profiles—and to identify common been explored. Examples include extracting intermediate features
and unique cell types across datasets. learned by the network to generate additional output for integra-
High dimensionality and heterogeneity are both issues associ- tion and thus stabilizing the ensemble prediction99; and encourag-
ated with the large number of molecular features in multi-omics ing cooperation among individual models through a pairwise loss,
datasets. The application of autoencoders is popular in dealing with thereby reducing the variance caused by small sample size100. These
these challenges. In one instance, an ensemble of autoencoders was methods represent promising strategies that can be explored in
used to extract lower dimension and integrate over 450,000 features future lines of research.
in pan-cancer classification92. Stacking multiple deep learning mod-
els, each handling a different modality of omics data93, is another High-dimensionality and class imbalance. Omics data are
approach that avoids feature concatenation that might otherwise well-known for their high-dimensionality, as biological features (for
exacerbate the issue of high dimensionality in datasets potentially example, genes, proteins) frequently outnumber samples. This is
containing tens of thousands of features. further exacerbated by the issue of small sample size already men-
tioned. The problem, widely known as the ‘curse of dimensionality’,
Bioimage informatics. Traditionally, analysis of bioimages is often has been identified as one of the main causes of overfitting in deep
performed manually by field experts. With the growing number learning models due to the large number of parameters that needs
of computer vision applications demonstrating their superior per- to be fitted101. While deep learning models seem to be particularly
formance over human experts, automatic analysis has become an susceptible to the high-dimensionality of omics data, the com-
increasing focus in bioinformatics studies. A primary application bination of deep learning with ensemble methods such as model
of ensemble deep learning in bioimage informatics is the detec- averaging39 and the implicit ensemble through dropout30 has been
tion of diseases such as cancers in patient images. For instance, to demonstrated to be an effective approach for handling this issue.
improve classification of glioma from magnetic resonance images, Imbalanced class distribution is another common issue in many
Lu et al. embedded a branching module into ResNet for integrating bioinformatics applications102 where, for example, a biological
multi-scale information obtained from different receptive fields of event of interest is only present in a small proportion of the data.
the original ResNet94. Codella et al. proposed an ensemble model Ensemble deep learning is found to be an effective remedy for deal-
that combined network architectures, including ResNet, CNN and ing with this challenge. Bioinformatics applications include the use
U-Net, to segment and classify skin lesions from dermoscopic of bootstrap-sampling- and random-sampling-based ensemble
images95. It is noteworthy that the proposed model achieved a deep learning for dealing with class imbalance in DNA and protein
segmentation result with 95% accuracy, surpassing that of human sequence analyses53,54. Due to the increasing use of high-throughput
experts who exhibit an accuracy of around 91%. To segment cervi- technologies, ensemble deep learning strategies that are capable of
cal cell images, Song et al. performed multi-resolution extraction dealing with these challenges will remain an active research direc-
and colour space transformation of the images to generate diverse tion in bioinformatics.
feature sets, leading to enhanced segmentation accuracy96.
Besides improving classification and segmentation accu- Data noise and heterogeneity. Biological systems are inherently
racy, ensemble deep learning methods have also been explored heterogeneous and noisy. This is further confounded by technical