Frequency Analysis Techniques For Identification of Viral Genetic Data
Frequency Analysis Techniques For Identification of Viral Genetic Data
Frequency Analysis Techniques For Identification of Viral Genetic Data
ABSTRACT Environmental metagenomic samples and samples obtained as an attempt to identify a pathogen associated with the
emergence of a novel infectious disease are important sources of novel microorganisms. The low costs and high throughput of
sequencing technologies are expected to allow for the genetic material in those samples to be sequenced and the genomes of the
one hand, the reference database is biased towards known human have references with very low homology due to biased sampling
pathogens. One can argue that this bias is inherent to the way the and high mutation rates at our disposal. One cannot help but
samples are obtained— often as a result of an effort to fight the compare this task to deciphering ancient scripts, such as the
disease associated with the pathogen. Although metagenomic Mayan, Ugaritic, and Sumerian scripts, etc., without relying on a
samples are often obtained to study pathogen activity, the chal- close reference. The success of these archeological and linguistic
lenging cases occur when the pathogen cannot be identified and achievements can only be an inspiration for us. For the rest of this
isolated by traditional biochemical methods, which means that it paper, we present the techniques which were successfully applied
is unlikely that it will be included in the reference database. Fur- to the recent identification of a novel viral pathogen affecting
thermore, in some metagenomic studies, samples without any no- farmed salmon (7), although they were developed for dealing with
ticeable pathogen activity are obtained. In such cases, the chance this seemingly daunting task in a general setting.
of matching novel viral species present in the sample to known
references is diminished further. One can hope that as the cost of RESULTS
obtaining sequenced metagenomic data decreases, the influence Even though RNA viruses have relatively small genomes at about
concentrated on cases when k equals 1, 2, or 3. The main reason for mine the target sequences closest to the query for some fixed dis-
this is that viral genomes are very short, on the order of 10,000 tance measure on the signature space. The algorithm simply or-
nucleotides for RNA viruses. Furthermore, due to low concentra- ders the target sequences from the database according to their
tions of viral material and the resulting low coverage obtained distance to the query. The Jenson-Shannon divergence was ex-
from current sequencing technologies, candidate viral contigs are plored previously as a possible distance measure between genomic
very short, at lengths of approximately 500 to 1,000 nucleotides. In signatures (16). In this work, we concentrate on a symmetrized
principle, our technique is extensible to larger k if sufficiently long version of the Kullback-Leibler distance, defined below in Mate-
contigs are available. The guideline one should keep in mind is rials and Methods, which is motivated by likelihood consider-
that to have good estimates of the frequencies of k-mers from a ations. Unless specified otherwise, we refer to this notion of dis-
genome of length L, L must be at least several times larger than the tance as the frequency distance. The Kullback-Leibler divergence,
4k value. as a measure of similarity among genetic sequences, was also stud-
We investigated several possible distance measures to assess ied by Wu et al. (17).
their ability to categorize viral genetic data. First, maximum like- To analyze the performance of frequency analysis of sequence
lihood considerations led us to the entropy-based distances data (FASD), we took all the reference negative-sense single-
(Kullback-Leibler [KL], Jensen-Shannon, and divergences) of stranded RNA (ssRNA) viral segments available at NCBI (279 se-
the frequency estimates obtained from the signatures. Second, as quences) to form our target sequence database. For a fixed length,
Karlin and Burge (15) point out, since the relative dinucleotide from each sequence in the target database we extracted 100 equally
abundance is characteristic of the species producing the genetic spaced subsequences of such lengths to obtain a database of que-
material, we considered the rectilinear distance of the relative ries with 27,900 sequences. For every percentile, we obtained the
dinucleotide abundances, as defined by those authors (15). Third, fraction of query sequences that had their target sequences of or-
we considered the Euclidean distance and the correlation coeffi- igin ranked in that percentile in the order produced by FASD. We
cients of the frequency vectors because they form natural mea- performed this analysis for 1-, 2-, and 3-mers with queries having
sures of similarity with well-established general mathematical lengths of 500 nucleotides (Fig. 1, top left) and for 3-mers with
properties. Fourth, we considered the 2 test statistic as a measure queries having lengths of 300, 500, and 800 nucleotides (Fig. 1, top
of the likelihood of the distribution of k-mers in the two genomes right). This analysis was repeated with respect to the highest per-
happening by random chance only. centile of a viral segment from the same family as the sequence of
Given a query nucleotide sequence and a database of target origin of the query (Fig. 1, middle). In the case of 3-mers with
nucleotide sequences, the goal of frequency analysis is to deter- queries having lengths of 500 nucleotides, we found that for 88%
of the queries, the correct target is in the top 5%, and for 97% of gets that come from different segments (Fig. 2, bottom left). Sec-
the queries, the correct target family is in the top 5%. ond, we compared the distribution of distances for all pairs of
In the previous analysis, every query sequence is part of a target queries coming from the same segment to the distribution of dis-
sequence. To analyze the performance of FASD without the as- tances of queries from different segments (Fig. 2, bottom right).
sumption that the database contains a target to which the query The sensitivity and specificity of a simple test, which takes a
aligns, we formed a disjoint target database in which for every threshold frequency distance and decides whether two sequences
query subsequence, we included the remaining, complementary are similar if they are closer than this threshold, can be assessed
part of the sequence of origin of the query. The disjoint target with the relative operating characteristic (ROC) curve of the test.
database again contains a correct target sequence for every query Based on the distribution of distances given in Fig. 2, we computed
sequence, but now the query and its correct target are disjoint and the ROC curves of this test for the following two cases: (i) when the
so do not align. We performed the same percentile analysis as positive examples are pairs of negative-sense ssRNA viral se-
described in the previous paragraph, except that for every query, quences and the negative examples are pairs of a negative-sense
we considered only the list of targets to which it did not align well. ssRNA viral sequence and an rRNA sequence and (ii) when the
rates, lack of knowledge about related genomes, and the high rate An application of the method is to relate different genes or
of sequence errors of current sequencing technologies complicate segments from the same virus, even when they do not align at all.
the identification of the few reads/contigs from the pathogen ge- For instance, by applying FASD to the PB1 gene of the 2009 H1N1
nome that could be present in a clinical sample. pandemic, we can identify not only the close relatives of this gene
In this work, we studied a frequency analysis method for com- in several influenza A viruses but also other segments from the
parison of genomic data that is not based on sequence alignment same viruses as well (Table 1). It is equivalent to using BLAST to
but on the statistical properties of the genetic sequences repre- align hemagglutinin and also finding neuraminidase.
sented by their signatures. These properties take into account Another application of the method is to “assemble” reads/con-
codon biases, mutational biases, e.g., errors in polymerase activity, tigs from the same organism without necessarily requiring an
and selection biases, e.g., pressure from the immune system of the overlap between them. We refer to this form of alignment as “hor-
host (19, 20) and restriction sites in phages infecting a bacterial
izontal” to distinguish it from the “vertical” alignment required
host. In addition, the frequency analysis method is robust under
for the traditional assembly algorithms. As an example of horizon-
frame shift sequencing errors (e.g., the homopolymer errors in-
tal assembly, we took two viruses, parainfluenza and rabies, and
troduced by pyrosequencing) and genomic rearrangements.
split their genomes into 10 disconnected nonoverlapping pieces.
In the viral context, since viruses depend on their host for replica-
tion, pressures from the host are an important source of bias in viral The tree in Fig. 4 gives the result of a hierarchical clustering using
genetic signatures. These pressures are a result of a complex virus- the average method of combining clusters performed on the pair-
host interaction, which includes the particulars of the replicative and wise frequency distances of the pieces. It contains two distinct
infectious mechanism of the virus, e.g., the cellular context of the clusters corresponding to the two viruses of the example.
interaction, the structure of the virus, and its evolutionary history, Another example of the ability of FASD to relate sequences that
including previous and other coexisting hosts. A better understand- cannot be aligned is given by the plot in Fig. 5. Here we have
ing of the factors contributing to signature biases will provide a better computed the pairwise frequency distances and the logarithms of
understanding of the boundaries within which the frequency method the Karlin-Altshul E values of the Smith-Waterman alignment
succeeds. In this work, we assume the existence of such biases and scores between the coding sequences of the influenza A, B, and C
leverage them towards categorization of viral genetic data. viruses and the Ebola Zaire virus. As can be seen, the frequency
distance can often indicate evolutionary relatedness and/or simi- analysis performed well and was able to detect genetic relation-
lar sources of origin, although the sequences do not align. ships even in the cases where there was no alignment. Due to its
Frequency analysis techniques, based on the k-mer composi- nature, the frequency analysis approach is less specific than an
tion of genomic data, provide an alternative way of uncovering alignment-based one. In fact, one can imagine the two approaches
biological relationships, different from standard techniques based lying on the opposite sides of a spectrum, with frequency analysis
on sequence alignment. The technique is applicable to cases in relying on common statistical properties of short subsequences
which the genetic data allows for good estimates of the k-mer and alignment relying on the presence of a common long subse-
frequencies, e.g., when contigs and genomes are sufficiently long. quence. The lack of specificity of frequency analysis is both its
We found that in the simulated setting of our study, the frequency weakness, as it produces false positives, and its strength, as it can
TABLE 1 Top 13 negative-sense ssRNA viral sequences produced by FASD, given as a query of the PB1 gene of the 2009 H1N1 pandemic
Frequency Value for L ⫻
distance frequency distance P valuea Viral segment
0.0011 5.12 3.14E⫺01 A/New York/392/2004 (H3N2) segment 2
0.0028 12.86 3.05E⫺02 A/Korea/426/68 (H2N2) segment 2
0.0041 16.41 9.86E⫺03 A/goose/Guangdong/1/96 (H5N1) segment 4
0.0047 21.73 1.76E⫺03 A/Hong Kong/1073/99 (H9N2) segment 2
0.0051 20.53 2.61E⫺03 Influenza B virus RNA 5
0.0056 25.75 4.73E⫺04 A/goose/Guangdong/1/96 (H5N1) segment 2
0.0059 27.21 2.92E⫺04 A/Puerto Rico/8/34 (H1N1) segment 2
0.0061 24.59 6.92E⫺04 A/Puerto Rico/8/34 (H1N1) segment 4
0.0062 28.44 1.95E⫺04 A/Hong Kong/1073/99 (H9N2) segment 1
0.0071 32.32 5.38E⫺05 A/Korea/426/68 (H2N2) segment 1
0.0073 29.19 1.52E⫺04 A/Korea/426/68 (H2N2) segment 4
0.0076 33.69 3.41E⫺05 A/New York/392/2004 (H3N2) segment 3
0.0077 28.27 2.06E⫺04 Maize fine streak virus, complete genome
a The computation of the P value is explained in “Statistics” in Materials and Methods.
div(Sa, Sb) ⫽ 冘F
X a,X log2 Fa,X ⁄ Fb,X
FIG 5 The left plot contains the frequency distances between pairs of segments from four viruses (FLUA, influenza A; FLUB, influenza B; FLUC, influenza C;
EBOVZ, Ebola virus Zaire). The right plot is the logarithms of the Karlin-Altschul E values computed for the Smith-Waterman alignment scores between those
pairs of segments.
quency of the 2-mer XY compared to its frequency as predicted by We are especially grateful to the researchers at the Center for Infection
the frequencies of the 1-mers X and Y. Then, the Karlin distance and Immunity for many interesting discussions and insights. We also
for two sequences, Sa and Sb, with relative 2-mer frequencies of Ka want to thank Miguel Brown for helping us to create a web interface.
and Kb, respectively, is calculated as follows:
distK(Sa, Sb) ⫽ (1 ⁄ 16) 冘 XY |Ka,XY ⫺ Kb,XY|
REFERENCES
1. Mardi, E. 2008. The impact of next-generation sequencing technology on
Statistics. Frequency analysis relies on the Kullback-Leibler genetics. Trends Genet. 24:133–141.
(KL) divergence as a measure of distance between a query and a 2. Williamson, S., D. B. Rusch, S. Yooseph, A. L. Halpern, K. B. Heidel-
target genome. By thinking of the KL divergence as a statistic of a berg, J. I. Glass, C. Andrews-Pfannkoch, D. Fadrosh, C. S. Miller, G.
Sutton, M. Frazier, and J. C. Venter. 2008. The sorcerer II global ocean
random event, we need to assess its significance. We will show that sampling expedition: metagenomic characterization of viruses within
for sufficiently long random query genomes, the distribution of aquatic microbial samples. PLoS One 3:e1456.
this statistic is approximated by a gamma distribution. 3. Gill, S., M. Pop, R. T. Deboy, P. B. Eckburg, P. J. Turnbaugh, B. S.
Take a query genome of length L with a k-mer frequency vector Samuel, J. I. Gordon, D. A. Relman, C. M. Fraser-Liggett, and K. E.
(2)
1⫺M
2 冕 冘 e⫺ x2
i i dx
11. Krause, L., N. N. Diaz, A. Goesmann, S. Kelley, T. W. Nattkemper, F.
Rohwer, R. A. Edwards, and J. Stoye. 2008. Phylogenetic classification of
short environmental DNA fragments. Nucleic Acids Res. 36:2230 –2239.
12. McHardy, A., H. Martín, A. Tsirigos, P. Hugenholtz, and I. Rigoutsos.
BM⫺1[(L · K)1⁄2]
2007. Accurate phylogenetic classification of variable-length DNA frag-
where BM ⫺ 1[(L · K)1/2] is the M ⫺ 1 ball with a radius of (L
· K)1/2. ments. Nat. Methods 4:63–72.
After hyperspherical coordinate change, the integral becomes the 13. Blaisdell, B. 1986. A measure of the similarity of sets of sequences not
gamma distribution with (M ⫺ 1)/2 degrees of freedom requiring sequence alignment. Proc. Natl. Acad. Sci. U. S. A. 83:
L·K
5155–5159.
冊冕
14. Vinga, S., and J. Almeida. 2003. Alignment-free sequence comparison: a
冉M⫺1
冊 1 M⫺1
⫺s
review. Bioinformatics 19:513–523.
2 ,L·K ⫽
冉
p M⫺1 s 2 e ds 15. Karlin, S., and C. Burge. 1995. Dinucleotide relative abundance
⌫ 2
extremes: a genomic signature. Trends Genet. 11:283–290.
0
16. Sims, G., S.-R. Jun, G. Wu, and S.-H. Kim. 2009. Alignment-free
Thus, we let 1 ⫺ p[(M ⫺ 1)/2, L · K] be the P value for the com- genome comparison with feature frequency profiles (FFP) and optimal
parison of a query and a target genome under the hypothesis that they resolutions. Proc. Natl. Acad. Sci. U. S. A. 106:2677–2682.
17. Wu, T., Y. Hsieh, and L. Li. 2001. Statistical measures of DNA sequence
are related; i.e., a low P value indicates that the hypothesis that the dissimilarity under Markov chain models of base composition. Biometrics
query sequence originated from the target genome should be re- 57:441– 443.
jected. Notice that this P value is not the same as the Karlin-Altschul 18. Karlin, S., and S. Altschul. 1990. Methods for assessing the statistical
E value from sequence alignment, which measures the significance of significance of molecular sequence features by using general scoring
a score, obtained from a random distribution. schemes. Proc. Natl. Acad. Sci. U. S. A. 87:2264 –2268.
19. Greenbaum, B., A. Levine, G. Bhanot, and R. Rabadan. 2008. Patterns
ACKNOWLEDGMENTS of evolution and host gene mimicry in influenza and other RNA viruses.
PLoS Pathog. 4(6):e1000079.
This work has been supported by grants from by the Northeast Biodefense 20. Greenbaum, B., R. Rabadan, and A. Levine. 2009. Patterns of oligonu-
Center (grant U54-AI057158) and the National Library of Medicine cleotide sequences in viral and host cell RNA identify mediators of the host
(grant 1R01LM010140-01). innate immune system. PLoS One 4(6):e5969.