HMSC: a Hybrid Metagenomic Sequence Classification
Algorithm
Subrata Saha
Zigeng Wang
Sanguthevar Rajasekaran∗
Healthcare and Life Sciences Division
IBM Research
Yorktown Heights, New York, U.S.A
[email protected]
Dept. of Computer Science & Engr.
University of Connecticut
Storrs, Connecticut, U.S.A
[email protected]
Dept. of Computer Science & Engr.
University of Connecticut
Storrs, Connecticut, U.S.A
[email protected]
ABSTRACT
1
Widespread availability of next-generation sequencing (NGS) technologies has prompted a recent surge in interest in the microbiome.
As a consequence, metagenomics is a fast growing field in bioinformatics and computational biology. An important problem in analyzing metagenomic sequenced data is to identify the microbes present
in the sample and figure out their relative abundances. Genome
databases such as RefSeq and GenBank provide a growing resource
to characterize metagenomic sequenced datasets. However, both the
size of these databases and the high degree of sequence homology
that can exist between related genomes mean that accurate analysis
of metagenomic reads is computationally challenging. In this article
we propose a highly efficient algorithm dubbed as łHybrid Metagenomic Sequence Classifier" (HMSC) to accurately detect microbes
and their relative abundances in a metagenomic sample. The algorithmic approach is fundamentally different from other state-of-theart algorithms currently existing in this domain. HMSC judiciously
exploits both alignment-free and alignment-based approaches to
accurately characterize metagenomic sequenced data. Rigorous experimental evaluations on both real and synthetic datasets show
that HMSC is indeed an effective, scalable, and efficient algorithm
compared to the other state-of-the-art methods in terms of accuracy,
memory, and runtime.
Although we are normally unable to see microbes, they run the
world. Microbes are indispensable for every part of our human life
- truly speaking - all life on Earth! They influence our daily life in
a myriad ways. For an example, the microbes living in the human
gut and mouth enable us to extract energy from food. We could
not be able to digest food without them. Microbes also insulate us
against disease-causing agents. Metagenomics is a strong tool that
can be used to decipher microbial communities by directly sampling genetic materials from their natural habitats. It can be directly
applicable a wide variety of domains to solve practical challenges
such as biofuels, food safety, agriculture, medications, etc. Metagenomic sequencing shows promise as a sensitive and rapid method to
diagnose infectious diseases by comparing genetic material found
in a patient’s sample to a database of thousands of bacteria, viruses,
and other pathogens. There is also strong evidence that microbes
may contribute to many nonśinfectious chronic diseases such as
some forms of cancer, coronary heart disease, and diabetes.
Classifying metagenomic sequences in the metagenomic sample
is a very challenging task due to these facts: (1) researchers have
sequenced complete genomes of thousands of microbes. The total
size of the genomes is hundreds of gigabytes. Again, metagenomic
sample can contain millions to billions of biological sequencing
reads and their total size can range from gigabytes to terabytes. To
detect the microbes, we must align the metagenomic reads onto
the known reference genomes. Processing is difficult to accomplish
as we have time and memory bound; and (2) different microbes
can contain similar genomic regions in their genomes. Reads in the
given metagenomic sample may refer to different microbes that are
not present in the sample. Consequently, there might be a lot of false
identification if we just align the reads onto the references. There
are several subsequence-based approaches in this domain. These
methods suffer from low classification accuracy, high execution
time, and high memory usage. The users also need to post-process
the output. Thus, we need efficient and effective computational
technique to quickly and accurately identify microbes present in
metagenomic sample. To address the issues stated above we have
developed a highly efficient method to correctly identify and estimate the abundances of microbes in the metagenomic sample.
KEYWORDS
Hybrid Metagenomic Sequence Classifier (HMSC); Metagenomics;
CLARK; Kraken
ACM Reference Format:
Subrata Saha, Zigeng Wang, and Sanguthevar Rajasekaran. 2020. HMSC:
a Hybrid Metagenomic Sequence Classification Algorithm. In Proceedings
of the 11th ACM International Conference on Bioinformatics, Computational
Biology and Health Informatics (BCB ’20), September 21–24, 2020, Virtual
Event, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/
3388440.3412468
∗ Sanguthevar
Rajasekaran is the corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from
[email protected].
BCB’20, September 21–24, 2020, Virtual Event, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7964-9/20/09. . . $15.00
https://doi.org/10.1145/3388440.3412468
2
INTRODUCTION
RELATED WORKS
The traditional approach to solve the metagenomic sequence classification is to align each input read onto a large collection of
reference genomes using alignment software, such as BLAST [3]
or MegaBlast [15]. However, aligning the reads onto the reference
BCB’20, September 21–24, 2020, Virtual Event, USA
sequences requires a huge computing time. One way of salvaging
execution time could be to align the reads onto a select marker
genes in the reference genome instead of the whole genome. This
approach is followed in Metaphan [23], Metaphyler [11], Motu [21]
and Megan [8]. However, this approach also becomes infeasible
when there are more and more of reference genomes and the total
number of reads grows more and more.
Researchers have tried several alignment free methods. The most
popular among such methods is based on 𝑘-mer spectrum analysis
which uses a database of distinct subsequences of length 𝑘, or in
short 𝑘-mers, from the reference sequence to classify the reads. If a
read has distinct 𝑘-mers from multiple reference genomes then it is
assumed to be from their lowest common taxonomic ancestor (see,
e.g. [24]). The algorithms using this broad approach differ in the
way the database is built and queried. LMAT [1] is one of the first
such algorithms. Subsequent improvements are: Kraken [24] which
improves the speed and memory usage by employing a classification
tree. CLARK [18] improves memory usage further by storing only
a reduced set of target specific 𝑘-mers. CLARK-S [17] improves
the specificity of CLARK by sacrificing a little speed and memory.
Metacache [16] improves the memory usage further by a novel
application of minhashing technique on a subset of the context
aware 𝑘-mers to reduce memory usage. MetaOthello [12] uses a
probabilistic hashing classifier and improves the memory usage
and k-mer query efficiency with a novel I -Othello data structure.
LiveKraken [22] classifies reads in realtime from raw data streams
from Illumina sequencers. KrakenUniq [2] efficiently assesses the
coverage of unique k-mers and gives better recall and precision.
Kaiju [14] uses the same 𝑘-mer based approach, however, exploits
the fact that microbial and viral genomes are typically densely
packed with protein-coding genes which are more conserved and
more tolerant to sequencing errors because of the degeneracy of
the genetic code. Some applications require an estimate of the
relative abundance of constituent species. Using a probabilistic
method based on Byesian likelihood, Braken [13] augments the
output of Kraken with estimated abundance. There have been few
alignment free methods other than the 𝑘-mer spectrum analysis
as well. MetaKallisto [20] uses pseudo alignments. WGSQuikr [10]
and Metapallette [9] use a compressed sensing approach where the
abundance is estimated by solving a linear system of equations on
the 𝑘-mer spectrum profile of input reads and that of the reference
sequences. TaxMap [4] uses a compression algorithm to store the
LCA information and achieves better precision and sensitivity.
Recently, deep learning models [6], such as convolutional neural
network (CNN) and deep belief network (DBN), have been tested
on a small subset of bacteria taxonomic classification, but it is still
challenging for the classification for the entire bacteria domain.
3 METHODS
3.1 Our Algorithms
There are 3 major steps in our proposed algorithm HMSC. The first
step involves collecting a set of unique 𝑘-mers from each of the
genomic sequences of interest. This step is different from the 𝑘-mer
counting problem. In 𝑘-mer counting we compile all the distinct
𝑘-mers present in the input sequences together with the frequency
of each 𝑘-mer. A 𝑘-mer can be found in more than one genomic
Subrata Saha, Zigeng Wang, and Sanguthevar Rajasekaran
sequence. Conversely, each unique 𝑘-mer can be found in one and
only one genome sequence. The second step involves finding a
set of discriminating regions from each genome. We then build a
model sequence for each genome by adopting the discriminating
regions. Instead of using the full genome sequences, these pre-built
model sequences are then employed to accurately profile all the
8 taxonomic ranks. We also estimate approximate abundances of
all the 8 taxonomic ranks residing in a given metagenomic sample.
Experimental evaluations show that HMSC is highly efficient in
terms of accuracy, execution time, and memory footprint. Next, we
describe the algorithmic steps of HMSC in detail.
3.1.1 Unique k-mer mining. At the beginning, we identify a
set of unique 𝑘-mers from the given set of target genomes 𝐺 =
{𝑔1, 𝑔2, 𝑔3, . . . , 𝑔𝑛 }. A 𝑘-mer is said to be unique if it (and its reverse
complement) occurs in only one of the genomes 𝑔𝑖 ∈ 𝐺 where
1 ≤ 𝑖 ≤ 𝑛. However, if we want to search for unique 𝑘-mers from
the set of all such 𝑘-mers in one single pass, it will be a very memory intensive and time consuming procedure. For instance, In our
proposed algorithm we employ around 6𝑘 bacterial genomes and
each genome contains nearly 4.5𝑀 nucleotides on an average! To
reduce the memory footprint we partition 𝐺 into 𝑃 non-overlapping
parts. In each such part 𝑝 𝑗 (where 1 ≤ 𝑗 ≤ 𝑃) we search for 𝑘-mers
that are unique across all the genomes. To reduce the search space
we randomly select a subset of all the 𝑘-mers present in sequences.
For each partition 𝑝 𝑗 ∈ 𝑃 we maintain two data structures to
find the unique 𝑘-mers in each genome. Hash table 𝐻 is used to
record unique 𝑘-mers. Each key of the hash table 𝐻 represents a 𝑘mer and its corresponding value refers to the associated genome id
and start index. Hash set 𝑆 contains the non-unique 𝑘-mers found
in more than one genome. In both data structures the expected
time complexity to search, update, or delete operations is 𝑂 (1).
After collecting locally unique 𝑘-mers by utilizing 𝐻 and 𝑆 for each
genome in one part, we collect the unique set of 𝑘-mers from the
reduced set of locally unique 𝑘-mers found in the previous step
by checking if it occurs in the other genomes. We save the unique
𝑘-mers picked from each part with the associated genome ids and
start indices in the disk. Next we describe the method in detail.
For each genome 𝑔 in a part 𝑝 𝑗 , 1 ≤ 𝑗 ≤ 𝑃, we process it as
follows. We pick a random subset of 𝑘-mers in 𝑔. Let 𝑥 ′ and 𝑥 ′′ be
any such 𝑘-mer and its reverse complement, respectively. Suppose
𝑥 is the lexicographically smallest of 𝑥 ′ and 𝑥 ′′ . We first check
if 𝑥 is already declared as non-unique by searching for it in the
hash set 𝑆. If 𝑆 already contains 𝑥, it is non-unique. Otherwise, we
search for 𝑥 in the hash table 𝐻 . There are two possibilities: there
is an entry for 𝑥 in 𝐻 or there is no entry for 𝑥 in 𝐻 . If there is
an entry for 𝑥 in 𝐻 , there are two possibilities: (1) The 𝑘-mer in
the entry corresponds to another genome or (2) It corresponds to
the same genome (as that of 𝑥). In the former case we declare 𝑥 as
non-unique by recording 𝑥 in 𝑆 and delete the entry from 𝐻 . If the
later is the case, we do not do anything. If we do not find any entry
for 𝑥 in 𝐻 , we create an entry for 𝑥 in 𝐻 associated with its genome
id and position. At the end of processing all the randomly picked
𝑘-mers in all the genomes in 𝑝 𝑗 , in the above manner, 𝐻 has all the
locally unique 𝑘-mers in 𝑝 𝑗 . Please, note that these locally unique
𝑘-mers are only unique with respect to the randomly picked 𝑘-mers
from the genomes in 𝑝 𝑗 . From out of these locally unique 𝑘-mers,
HMSC: a Hybrid Metagenomic Sequence Classification Algorithm
BCB’20, September 21–24, 2020, Virtual Event, USA
we identify globally unique 𝑘-mers. To find the globally unique 𝑘mers, we iteratively retrieve each genome 𝑔 ′ ∈ 𝐺 from the disk. To
check the duplicity efficiently we build a hash set 𝑆 ′ containing the
lexicographically smallest 𝑘-mers of 𝑔 ′ . I.e., for each 𝑘-mer in the
retrieved genome 𝑔 ′ , we insert into 𝑆 ′ the lexicographically smallest
one between it and its reverse complement. Now the locally unique
set of 𝑘-mers are checked for duplicity against 𝑆 ′ . Note that the
genomes are retrieved from the disk into the main memory one at
a time. The locally unique 𝑘-mers will also be checked for duplicity
with respect to the genomes in 𝑝 𝑗 . Once we identify the globally
unique 𝑘-mers in 𝑝 𝑗 , we save them together with their positions
and genome ids in the disk.
As noted earlier, we partition 𝐺 into 𝑃 parts (for some suitable
value of 𝑃) to reduce the memory footprint. For each of the 𝑃 parts
we follow the same procedure as described above. We save the
globally unique 𝑘-mers along with their genome ids and positions
in the disk for each part. In the experiments we set 𝑘 = 31.
If a read 𝑟 is aligned onto multiple model sequences 𝑚𝑖 from
multiple genomes 𝑔𝑖 ∈ 𝐺, then the read 𝑟 is assigned to all of those
genomes 𝑔𝑖 . Since the taxonomic profiling of HMSC is based on
the model sequences of the genomes, not all the reads 𝑟 from the
metagenomic sample 𝑉 will be classified. This is due to the fact that
the model sequences may not contain all the stretches of the original
genomes. We estimate the abundance of a specific taxonomic rank
using the number of hits. Consider a specific genome 𝑔𝑖 . Let the
taxonomic rank of 𝑔𝑖 be 𝑡𝑖 . We estimate 𝑡𝑖 by taking the ratio of the
hits in 𝑚𝑖 with respect to that taxonomic rank to the total number
of hits (across all the model sequences).
The accuracy of our algorithm has been measured using the
ground truth. For instance, if we employ synthetic data, then we
will know what species are represented in the sample 𝑉 and also
the number of reads in 𝑉 corresponding to each of the species.
3.1.2 Model sequence formation. In this step we build a model
sequence for each genome 𝑔 ∈ 𝐺. Let the set of unique 𝑘-mers in 𝑔
(found in the previous step) be 𝑢. At first we sort 𝑢 in increasing
order with respect to the starting coordinates of the 𝑘-mers. Next
we cluster the sorted 𝑘-mers in such a way that in any cluster the
distance between any pair of consecutive 𝑘-mers is ≤ a threshold
𝑡 1 . Let a pair of consecutive 𝑘-mers be 𝑘 ′ and 𝑘 ′′ . Then the distance
between the end position of 𝑘 ′ and the start position of 𝑘 ′′ will
be no more than 𝑡 1 . We linearly search through the sorted list of
𝑘-mers and build a new cluster when the distance between any
pair of consecutive 𝑘-mers is > 𝑡 1 . Let the set of clusters be 𝐶.
Clearly, each cluster 𝑐 ∈ 𝐶 contains a set of consecutive 𝑘-mers
where the distance between any two consecutive 𝑘-mers is ≤ 𝑡 1 .
Consequently, each cluster 𝑐 ∈ 𝐶 contains a discriminating region
of the genome 𝑔. We extract a region from 𝑔 by using the start
index of the first 𝑘-mer and the end index of the last 𝑘-mer in 𝑐. We
call such a region discriminating since if any read 𝑟 having length
≥ 𝑡 1 + 2𝑘 is aligned onto such a region, then 𝑟 will fully contain at
least 1 unique 𝑘-mer. In our experiments we set 𝑡 1 = 40.
We sort all such regions from all the clusters 𝑐 based on decreasing order of their lengths. We discard some regions having length
≤ 𝑡 2 , a user defined threshold. We append a special string of length
4 containing a special character ł#ž at the end of each region. We
concatenate all such regions of a genome 𝑔 to build a model sequence. Because of this special string no aligned read will contain
the junction of any 2 regions given that the mismatch threshold
𝑑 < 4. Let these model sequences are 𝑚 1, 𝑚 2, . . . , 𝑚𝑛 where 𝑚𝑖 is
the model sequence of 𝑔𝑖 , 1 ≤ 𝑖 ≤ 𝑛. In our experiments we set
𝑡 2 = 300.
3.1.3 Taxonomic ranks identification. In this step we are inferring all the 8 taxonomic ranks (e.g., subspecies, species, genus,
family, order, class, phylum, and kingdom) and their corresponding
relative abundances from a given metagenomic sample 𝑉 . Metagenomic sequencing reads contained in 𝑉 are aligned onto each of the
model sequences 𝑚𝑖 built in the previous step. Suppose a read 𝑟 ∈ 𝑉
is aligned onto a model sequence 𝑚𝑖 within a certain mismatch
threshold 𝑑 (we set 𝑑 = 0 in our experiments). We can say that the
read 𝑟 belongs to the genome 𝑔𝑖 with a high level of confidence.
This is referred to as a hit.
4 EXPERIMENT RESULTS
4.1 Mock Microbial Metagenomes
A mock community of 36 bacterial species prevalent in the human
microbiome [19] was created as described by [5]. Sequencing libraries were created using the Illumina TruSeq Nano DNA HT kit
and sequenced on the Illumina HiSeq 2000 platform to generate
2 × 150 bp paired-end reads. Prior to analysis reads were processed
with Trimmomatic to remove sequencing adapters. Following the
same procedure we also generated another mock community having 11 bacterial species.
4.2
In Silico Microbial Metagenomes
We simulated 6 metagenomic paired-end in silico datasets by employing various Illumina platforms. Reads are produced by engaging
a shotgun sequence simulator named InSilicoSeq [7]. At first, we
randomly selected 200 genomes from around 6k łcompletž genomes
from GeneBank. Please, note that all the randomly selected genomes
also appeared in the databases of CLARK and Kraken. We generated
simulated paired-end reads D1-D6 containing randomly chosen 50
and 100 bacterial species from the set of 200 genomes as stated
above by employing 3 popular Illumina error models (e.g. HiSeq,
MiSeq, and NovaSeq) with realistic abundance distribution. The
details of the datasets can be found in Table 1.
Table 1: Dataset Information.
Data
Model
Genomes
Paired-end reads
Read length
D1
D2
D3
D4
D5
D6
D7
D8
HiSeq
HiSeq
MiSeq
MiSeq
NovaSeq
NovaSeq
HiSeq
HiSeq
50
100
50
100
50
100
11
36
3M
5M
3M
5M
3M
5M
4.4M
6.1M
125
125
300
300
150
150
150
150
BCB’20, September 21–24, 2020, Virtual Event, USA
4.3
Performance Metrics
To demonstrate the utility of unique 𝑘-mer based model sequences
for taxonomic profiling we compared the performance of HMSC
with two widely used, 𝑘-mer-based tools, CLARK and Kraken, selected for their high accuracy and low execution time. We computed
the following four performance metrics to demonstrate the efficacy
of our algorithm HMSC:
• Recall: In information retrieval, recall is the fraction of the taxa
level (e.g. genus, species, etc.) in the metagenomic sample that
𝑃 . Here
are successfully detected. It is defined as: 𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇 𝑃𝑇+𝐹
𝑁
TP stands for the number of true positives, i.e. the number of taxa
present in the sample and correctly identified by an algorithm.
FN stands for the number of false negatives, i.e. the number of
taxa present in the sample and not identified by an algorithm.
• Precision: It is the fraction of retrieved taxa levels that are resid𝑃 . Here FP
ing in the sample. The definition is: 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇 𝑃𝑇+𝐹
𝑃
is the number of false positives, i.e. the number of taxa identified
by an algorithm that are not in the sample.
• F measure: It is the harmonic mean of precision and recall, the tra𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛×𝑟𝑒𝑐𝑎𝑙𝑙
ditional F-measure or balanced F-score: 𝐹 = 2 × 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛+𝑟𝑒𝑐𝑎𝑙𝑙 .
• Pearson’s Moment Correlation Coefficient
(PMCC): We use PMCC to measure how much estimated relative
abundances of taxonomic ranks are correlated with respect to
ground truths. In statistics, the Pearson moment correlation coefficient (PCC), also referred to as Pearson’s 𝑟 , is a measure of the
linear correlation between two variables 𝑋 and 𝑌 . We can think
of 𝑋 and 𝑌 as 2 vectors each having
𝑁Í entries.
The mathematical
Í
Í
𝑁 𝑋𝑌 −( 𝑋 𝑌 )
formulation is: 𝑟 = √ Í 2 Í 2
Í 2 Í 2
[𝑁
𝑥 −( 𝑥) ] [𝑁
𝑦 −(
𝑦) ]
We know the unique taxonomic id of each microbe residing in
the simulated datasets a priori. From a taxonomic id we can retrieve
all the taxonomic ranks of a microbe by traversing the taxonomy
tree of life. From these ground truths (i.e, taxonomic ids) of all
the datasets we identify each taxonomic rank 𝑡 of all the microbes.
Suppose 𝐴 that belongs to a specific taxonomic rank 𝑡 (such as,
genus). For each algorithm we also identify the same taxonomic
rank 𝑡 predicted, 𝐵. We compute recall and precision as |𝐴 ∩ 𝐵|/|𝐴|
and |𝐴 ∩ 𝐵|/|𝐵|, respectively. For each algorithm we compute the
performance metrics for running each taxonomic profiler on the
mock and in silico communities.
Subrata Saha, Zigeng Wang, and Sanguthevar Rajasekaran
datasets i.e., it was able to identify all the microbes (and their associated taxonomic ranks) prevalent in the samples. Although HMSC
detects microbes that are not in the samples (i.e., false positives),
the numbers are far smaller than CLARK or Kraken. It is evident
from precision and F measure - HMSC’s precisions and F measures
are higher than CLARK and Kraken for all taxonomic ranks in every
datasets. Please, note that we only show 6 taxonomic ranks (e.g.,
subspecies, species, genus, family, order, and class) in Table 2 and
4 taxonomic ranks (e.g., subspecies, species, genus, and family) in
Figure 1 because of space constraints.
Now consider mock datasets (D7-D8). In D7 dataset HMSC’s
recall of species-level taxonomy is better than that of CLARK and
Kraken. In all other cases recall measures are identical for all 3
algorithms. On the contrary precision and F measures are higher
than that of CLARK and Kraken in both of D7 and D8 datasets. It
is evident from Table 3 that both of the algorithms erroneously
identify a lot of microbes that are not residing in the sample.
4.5
Our algorithm HMSC deals with the model sequences of the reference genomes. Each model sequence comprises a set of discriminating regions of a genome. Therefore, all the reads coming from
a specific genome will not be aligned onto the model sequence
designated for that genome. Only the discriminating reads will be
aligned onto a specific model sequence. As model sequences contains discriminating stretches of genome sequences, it estimates approximate relative abundances instead of true relative abundances.
Although in theory the approximation may be over-represented
or under-represented, HMSC mimics, in practice, the true relative
abundances in most of cases. Since we do not know the true relative
abundances of the 2 mock communities (e.g., D7-D8), we could not
be able to compute Pearson’s correlations. Please, see Figure 1[b]
for visual comparisons of different methods employed including
HMSC. It is to be noted that the abundance estimations of HMSC
in MiSeq datasets are poor with respect to CLARK and Kraken. It
might be due to the fact that we are aligning reads onto model
sequences without any mismatches for every datasets to preserve
uniformity. Since the length of MiSeq reads is 300bp long, many of
them might not be aligned onto the model sequences within the
mismatch threshold we used (i.e., 𝑑 = 0).
4.6
4.4
Precision, Recall, and F Measure
The performance of a classification algorithm depends on both precision and recall. An algorithm can have a high recall but a small
precision due to the fact that the algorithm with a low precision
suffers from high false positives. To logically fix the issue the classification performance of an algorithm is measured by taking the
harmonic mean of recall and precision (known as F measure). It is
observed that the existing algorithms for classifying metagenomic
sequences suffer from very high false positives, i.e. they inaccurately identifies a large number of microbes that does not belong
to the metagenomic sample.
At first, consider the in silico datasets. As noted earlier our in
silico datasets consist of 6 metagenomic samples (please, see D1-D6
in Table 1). HMSC possesses perfect recall of 1.0 for every in silico
Relative Abundances
Execution Time and Memory Consumption
HMSC has a very low memory footprint. On average, HMSC uses
3.70× and 1.43× less memory than that of CLARK and Kraken,
respectively. The execution times of HMSC are also comparable
with the state-of-the-art algorithms. In general, HMSC is faster
than CLARK on real datasets D7 and D8. Comparing to Kraken,
HMSC’s running time is in the same order of magnitude. Please,
see Figure 1[c] for visual comparisons.
5
CONCLUSION
In this article we propose HMSC that can accurately detect microbes
and their relative abundances in a metagenomic sample. The algorithm judiciously exploits both alignment-free and alignment-based
approaches and our rigorous experimental evaluations show that it
is indeed an effective, scalable, and efficient algorithm compared to
HMSC: a Hybrid Metagenomic Sequence Classification Algorithm
BCB’20, September 21–24, 2020, Virtual Event, USA
Table 2: Performance Evaluations on in silico Datasets.
HMSC
CLARK
Kraken
Recall
Precision
F-score
PMCC
Recall
Precision
F-score
PMCC
Recall
Precision
F-score
PMCC
D1
Subspecies
Species
Genus
Family
1.0000
1.0000
1.0000
1.0000
0.3636
0.2816
0.4737
0.7647
0.5333
0.4395
0.6429
0.8667
0.9924
0.9860
0.9839
0.9896
NA
0.9796
1.0000
1.0000
NA
0.1627
0.2500
0.3391
NA
0.2791
0.4000
0.5065
NA
0.9998
1.0000
1.0000
1.0000
0.9796
1.0000
1.0000
0.0935
0.0673
0.1402
0.2308
0.1709
0.1260
0.2459
0.3750
0.9884
0.9998
0.9999
1.0000
D2
Subspecies
Species
Genus
Family
1.0000
1.0000
1.0000
1.0000
0.4390
0.4091
0.6385
0.8571
0.6102
0.5806
0.7793
0.9231
0.9892
0.9897
0.9895
0.9903
NA
0.9697
0.9880
1.0000
NA
0.0695
0.1312
0.2661
NA
0.1297
0.2316
0.4204
NA
0.9835
0.9861
0.9979
0.9722
0.9798
0.9880
1.0000
0.0508
0.0470
0.1004
0.2185
0.0966
0.0897
0.1822
0.3587
0.9776
0.9846
0.9870
0.9985
D3
Subspecies
Species
Genus
Family
1.0000
1.0000
1.0000
1.0000
0.2899
0.2450
0.5056
0.8667
0.4494
0.3936
0.6716
0.9286
0.5898
0.6536
0.6697
0.5844
NA
0.9796
1.0000
1.0000
NA
0.0779
0.1393
0.2167
NA
0.1444
0.2446
0.3562
NA
96.11
0.9682
0.9995
1.0000
0.9796
1.0000
1.0000
0.1047
0.1064
0.2206
0.3362
0.1896
0.1920
0.3614
0.5032
0.8758
0.9670
0.9995
0.9996
D4
Subspecies
Species
Genus
Family
1.0000
1.0000
1.0000
1.0000
0.4615
0.3722
0.7757
0.9041
0.6316
0.5425
0.8737
0.9496
0.5287
0.5698
0.5672
0.5936
NA
0.9697
0.9880
1.0000
NA
0.0515
0.1026
0.2245
NA
0.0979
0.1859
0.3667
NA
0.9823
0.9918
0.9932
0.9722
0.9798
0.9880
1.0000
0.0461
0.0450
0.0976
0.2178
0.0879
0.0861
0.1777
0.3577
0.9902
0.9835
0.9926
0.9938
D5
Subspecies
Species
Genus
Family
1.0000
1.0000
1.0000
1.0000
0.3846
0.3684
0.5844
0.8298
0.5556
0.5385
0.7377
0.9070
0.9661
0.9761
0.9754
0.9728
NA
0.9796
1.0000
1.0000
NA
0.1330
0.2356
0.3023
NA
0.2341
0.3814
0.4643
NA
0.9813
0.9993
0.9994
1.0000
0.9796
1.0000
1.0000
0.0881
0.0777
0.1613
0.2583
0.1619
0.1439
0.2778
0.4105
0.9251
0.9810
0.9995
0.9996
D6
Subspecies
Species
Genus
Family
1.0000
1.0000
1.0000
1.0000
0.4500
0.3822
0.6975
0.9167
0.6207
0.5531
0.8218
0.9565
0.9948
0.9624
0.9551
0.9751
NA
0.9697
0.9880
1.0000
NA
0.0659
0.1252
0.2472
NA
0.1235
0.2222
0.3964
NA
0.9807
0.9967
0.9854
0.9722
0.9798
0.9880
1.0000
0.0507
0.0469
0.1006
0.2222
0.0963
0.0896
0.1826
0.3636
0.9966
0.9809
0.9968
0.9855
Table 3: Performance Evaluations on mock Datasets.
HMSC
CLARK
Kraken
Recall
Precision
F1 score
Recall
Precision
F1 sore
Recall
Precision
F1 score
D7
Species
Genus
Family
0.8182
1.0000
1.0000
0.2250
0.6250
0.7143
0.3529
0.7692
0.8333
0.7273
1.0000
1.0000
0.0069
0.0181
0.0394
0.0138
0.0357
0.0758
0.7273
1.0000
1.0000
0.0040
0.0129
0.0322
0.0080
0.0254
0.0623
D8
Species
Genus
Family
0.6667
0.9130
0.9091
0.1678
0.5250
0.8000
0.2682
0.6667
0.8511
0.6944
0.9565
1.0000
0.0126
0.0288
0.0806
0.0248
0.0558
0.1492
0.7222
0.9565
1.0000
0.0100
0.0238
0.0690
0.0198
0.0464
0.1290
the other state-of-the-art methods in terms of accuracy, memory,
and runtime.
ACKNOWLEDGMENTS
This work has been supported in part by the following NSF grants:
1447711, 1743418, and 1843025.
REFERENCES
[1] Sasha K Ames, David A Hysom, Shea N Gardner, G Scott Lloyd, Maya B Gokhale,
and Jonathan E Allen. 2013. Scalable metagenomic taxonomy classification using
a reference genome database. Bioinformatics 29, 18 (Sept. 2013), 2253ś2260.
[2] FP Breitwieser, DN Baker, and SL Salzberg. 2018. KrakenUniq: confident and fast
metagenomics classification using unique k-mer counts. Genome biology 19, 1
(2018), 198.
[3] Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L Madden. 2009. BLAST : architecture
and applications. BMC Bioinformatics 10, 1 (2009), 421.
[4] André Corvelo, Wayne E Clarke, Nicolas Robine, and Michael C Zody. 2018.
taxMaps: comprehensive and highly accurate taxonomic classification of shortread data in reasonable time. Genome research (2018), grś225276.
[5] Patricia I Diaz, AK Dupuy, L Abusleme, B Reese, C Obergfell, L Choquette, Anna
Dongari-Bagtzoglou, Douglas E Peterson, E Terzi, and LD Strausbaugh. 2012.
Using high throughput sequencing to explore the biodiversity in oral bacterial
communities. Molecular oral microbiology 27, 3 (2012), 182ś201.
[6] Antonino Fiannaca, Laura La Paglia, Massimo La Rosa, Giovanni Renda, Riccardo
Rizzo, Salvatore Gaglio, Alfonso Urso, et al. 2018. Deep learning models for
bacteria taxonomic classification of metagenomic data. BMC bioinformatics 19, 7
(2018), 198.
[7] Hadrien Gourlé, Oskar Karlsson-Lindsjö, Juliette Hayer, and Erik BongcamRudloff. 2018. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics 35, 3 (2018), 521ś522.
[8] D H Huson, A F Auch, J Qi, and S C Schuster. 2007. MEGAN analysis of metagenomic data. Genome Res. 17, 3 (2007), 377ś386.
[9] David Koslicki and Daniel Falush. 2016. MetaPalette: a k-mer Painting Approach
for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation. mSystems 1, 3 (May 2016).
[10] David Koslicki, Simon Foucart, and Gail Rosen. 2014. WGSQuikr: fast wholegenome shotgun metagenomic classification. PLoS One 9, 3 (March 2014), e91784.
[11] Bo Liu, Theodore Gibbons, Mohammad Ghodsi, and Mihai Pop. 2010. MetaPhyler:
Taxonomic profiling for metagenomic sequences. In 2010 IEEE International
Conference on Bioinformatics and Biomedicine (BIBM).
[12] Xinan Liu, Ye Yu, Jinpeng Liu, Corrine F Elliott, Chen Qian, and Jinze Liu. 2017. A
novel data structure to support ultra-fast taxonomic classification of metagenomic
sequences with k-mer signatures. Bioinformatics 34, 1 (2017), 171ś178.
[13] Jennifer Lu, Florian P Breitwieser, Peter Thielen, and Steven L Salzberg. 2016.
Bracken: Estimating species abundance in metagenomics data.
[14] Peter Menzel, Kim Lee Ng, and Anders Krogh. 2015. Kaiju: Fast and sensitive
taxonomic classification for metagenomics.
BCB’20, September 21–24, 2020, Virtual Event, USA
Subrata Saha, Zigeng Wang, and Sanguthevar Rajasekaran
Algorithms:
HMSC
Clark
F1 Score
0.6
Species
Subspecies
0.4
0.4
0.3
0.1
0.1
D1
D2
D3
D4
D5
D6
F1 Score
D7
0.0
D8
D1
D2
D3
D4
D5
D6
D7
D8
D1
D2
D3
D4
D5
D6
D7
D8
1.0
0.8
F1 Score
0.8
Family
0.6
Genus
0.3
0.2
0.2
0.4
0.2
0.0
F1 Score
0.5
0.5
0.0
Kraken
0.6
0.6
0.4
0.2
D1
D2
D3
D4
D5
D6
D7
0.0
D8
(a) F1 score of different algorithms.
Algorithms:
HMSC
Clark
Kraken
PMCC
1.0
1.0
0.9
0.9
0.8
0.8
Species
Subspecies
PMCC
0.7
0.6
0.6
D1
D2
D3
D4
D5
D6
PMCC
D7
0.5
D8
1.0
1.0
0.9
0.9
0.8
0.8
Family
Genus
0.5
0.7
0.7
0.6
0.5
D1
D2
D3
D4
D5
D6
D7
D8
D1
D2
D3
D4
D5
D6
D7
D8
D6
D7
D8
PMCC
0.7
0.6
D1
D2
D3
D4
D5
D6
D7
0.5
D8
(b) Pearson’s coefficient of different algorithms.
Running Time
500
Algorithms:
HMSC
Clark
Kraken
Memory Usage
108
Memory Usage (Kb)
Running Time (sec)
400
300
200
100
0
D1
D2
D3
D4
D5
D6
D7
D8
107
D1
D2
D3
D4
D5
(c) Execution time and memory consumption.
Figure 1: Performance Evaluations.
[15] Aleksandr Morgulis, George Coulouris, Yan Raytselis, Thomas L Madden, Richa
Agarwala, and Alejandro A Schäffer. 2008. Database indexing for production
MegaBLAST searches. Bioinformatics 24, 16 (Aug. 2008), 1757ś1764.
[16] André Müller, Christian Hundt, Andreas Hildebrandt, Thomas Hankeln, and
Bertil Schmidt. 2017. MetaCache: Context-aware classification of metagenomic
reads using minhashing. Bioinformatics (Aug. 2017).
[17] Rachid Ounit and Stefano Lonardi. 2016. Higher classification sensitivity of short
metagenomic reads with CLARK-S. Bioinformatics 32, 24 (Dec. 2016), 3823ś3825.
[18] Rachid Ounit, Steve Wanamaker, Timothy J Close, and Stefano Lonardi. 2015.
CLARK: fast and accurate classification of metagenomic and genomic sequences
using discriminative k-mers. BMC Genomics 16 (March 2015), 236.
[19] Jane Peterson, Susan Garges, Maria Giovanni, Pamela McInnes, Lu Wang, Jeffery A Schloss, Vivien Bonazzi, Jean E McEwen, Kris A Wetterstrand, Carolyn
Deal, et al. 2009. The NIH human microbiome project. Genome research 19, 12
(2009), 2317ś2323.
[20] L Schaeffer, H Pimentel, N Bray, P Melsted, and L Pachter. 2017. Pseudoalignment
for metagenomic read assignment. Bioinformatics 33, 14 (July 2017), 2082ś2088.
[21] Shinichi Sunagawa, Daniel R Mende, Georg Zeller, Fernando Izquierdo-Carrasco,
Simon A Berger, Jens Roat Kultima, Luis Pedro Coelho, Manimozhiyan Arumugam, Julien Tap, Henrik Bjùrn Nielsen, Simon Rasmussen, Sùren Brunak, Oluf
Pedersen, Francisco Guarner, Willem M de Vos, Jun Wang, Junhua Li, Joël Doré,
S Dusko Ehrlich, Alexandros Stamatakis, and Peer Bork. 2013. Metagenomic
species profiling using universal phylogenetic marker genes. Nat. Methods 10, 12
(Dec. 2013), 1196ś1199.
[22] Simon H Tausch, Benjamin Strauch, Andreas Andrusch, Tobias P Loka, Martin S
Lindner, Andreas Nitsche, and Bernhard Y Renard. 2018. LiveKrakenśReal-time
metagenomic classification of Illumina data. Bioinformatics 1 (2018), 3.
[23] Duy Tin Truong, Eric A Franzosa, Timothy L Tickle, Matthias Scholz, George
Weingart, Edoardo Pasolli, Adrian Tett, Curtis Huttenhower, and Nicola Segata.
2015. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods
12, 10 (Oct. 2015), 902ś903.
[24] Derrick E Wood and Steven L Salzberg. 2014. Kraken: ultrafast metagenomic
sequence classification using exact alignments. Genome Biol. 15, 3 (March 2014),
R46.