BMC Bioinformatics: Chaos Game Representation For Comparison of Whole Genomes
BMC Bioinformatics: Chaos Game Representation For Comparison of Whole Genomes
BMC Bioinformatics: Chaos Game Representation For Comparison of Whole Genomes
Address: Computational Modelling and Simulation, Regional Research Laboratory (CSIR), Thiruvananthapuram, 695019, India
Email: Jijoy Joseph - [email protected]; Roschen Sasikumar* - [email protected]
* Corresponding author
Abstract
Background: Chaos game representation of genome sequences has been used for visual
representation of genome sequence patterns as well as alignment-free comparisons of sequences
based on oligonucleotide frequencies. However the potential of this representation for making
alignment-based comparisons of whole genome sequences has not been exploited.
Results: We present here a fast algorithm for identifying all local alignments between two long
DNA sequences using the sequence information contained in CGR points. The local alignments can
be depicted graphically in a dot-matrix plot or in text form, and the significant similarities and
differences between the two sequences can be identified. We demonstrate the method through
comparison of whole genomes of several microbial species. Given two closely related genomes we
generate information on mismatches, insertions, deletions and shuffles that differentiate the two
genomes.
Conclusion: Addition of the possibility of large scale sequence alignment to the repertoire of
alignment-free sequence analysis applications of chaos game representation, positions CGR as a
powerful sequence analysis tool.
Page 1 of 10
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:243 http://www.biomedcentral.com/1471-2105/7/243
k = kmax
Sequence comparisons can be made using two different
aspects of the CGR:
Page 2 of 10
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:243 http://www.biomedcentral.com/1471-2105/7/243
Results
Figures 3, 4, 5, 6, 7 show dot matrix plots showing the
local alignments between Human Immunodeficiency
Virus [GenBank: K02013.1] and Chimpanzee Immunode-
ficiency Virus [EMBL:X52154], Pyrococcus abyssi GE5
[EMBL:AL096836] and Pyrococcus horikoshii OT3
[DDBJ:BA000001], E. coli OH157:H7 [DDBJ:BA000007]
and E. coli K12 [GenBank: U00096], Rickettsia p. madrid
E. [EMBL:AJ235269] and Rickettsia c. malish 7 [Gen-
Bank:AE006914], Mycobacterium leprae TN
[EMBL:AL450380] and Mycobacterium tuberculosis
H37Rv [EMBL:AL123456] respectively. It can be seen that
large segments have been inverted between Pyrococcus
abyssi and Pyrococcus horikoshii as well as between
Mycobacterium leprae TN and Mycobacterium tuberculo-
sis H37Rv. Text files giving positions of Insertions/Dele-
Figure
Local alignments
2
tions and mismatches inferred from the local alignments
Local alignments. Definition of local alignments
between the two strains of E. coli are given as Supplemen-
tary material.
Table 1: Computational time chart. Table 1 shows the computation time taken by the program running on a Pentium IV 2.5 GHz
machine, for comparing various genome sequences.
Page 3 of 10
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:243 http://www.biomedcentral.com/1471-2105/7/243
Figure 3
Immunodeficiency virus
Immunodeficiency virus. Human immunodeficiency virus and Chimpanzee immunodeficiency virus
facilitates other types of sequence comparisons ranging Consider the ith nucleotide of one sequence and the jth
from visual comparisons of patterns to oligonucleotide nucleotide of the other. Co-ordinates of the CGR points
frequency spectrums and genome signatures. corresponding to these positions on the two sequences are
given by:
Conclusion
A new algorithm that uses information from chaos game Xi = 0.5 (Xi-1+gix) Yi = 0.5 (Yi-1+giy)
representation of genome sequences for finding all local
alignments between the sequences has been developed. Xj = 0.5 (X j-1+gjx) Yj = 0.5 (Y j-1+gjy) (2)
Fast comparisons can be made between sequences of meg-
abase size using a Pentium IV machine. As far as the speed The distance between CGR points is defined as
of alignment is concerned, the program, in its present
state does not offer any major improvements over MUM- d(i, j) = max(abs(Xi- Xj), abs(Yi - Yj)) (3)
mer, but it is possible that the method can be imple-
mented more efficiently through better programming If the nucleotides at positions 'i' and 'j' of the first and the
inputs. Addition of the possibility of large scale sequence second sequence respectively are equal then gix = gjx and giy
alignment to the existing repertoire of alignment-free = gjy Then from equations (2) and (3) we get,
sequence analysis possibilities from chaos game represen-
tation, positions CGR as a powerful quantitative sequence d(i, j) = 0.5 d(i - 1, j - 1) (4)
analysis tool.
i.e. A pair of similar nucleotides makes the distance
Methods between the corresponding CGR points, half the distance
Using CGR points for finding identical segments in two between the previous pair of points. Extending this argu-
sequences ment, we can say that if k consecutive nucleotides previ-
In the following we show how the distance between CGR ous to positions i and j on the two sequences are identical,
points can be used to identify sequence identities without the distance between the CGR points corresponding to i
having to match the sequences nucleotide by nucleotide. and j is given by
Page 4 of 10
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:243 http://www.biomedcentral.com/1471-2105/7/243
Thermococcales
Figure 4
Thermococcales. Pyrococcus Abyssi and Pyrococcus Horikoshii
d(i, j) = (0.5)k d(i - k, j - k) (5) This can be seen to be the same as the length of similar
sequence proposed by Almeida et al. as a measure for
As k increases d (i, j) becomes smaller, i.e. as the length of assessing local similarity in two sequences.
identical sequence increases, the CGR points come closer
together. Equations 5 and 7 can be used to develop an algorithm for
detection of all identical segments in two sequences based
It must be noted that the closeness of two CGR points is on the distance between CGR points.
not a sufficient condition to conclude that there is a length
of similar sequence behind them. d (i, j) can become very Calculating kmax for a pair of positions (i, j) on the two
low even when the sequences are very different. Such cases sequences we can estimate that, at the most, the sequence
correspond to points on either side of, but close to, the segment from i to i- kmax in one sequence could be identi-
borders of the quadrants corresponding to the four nucle- cal to the segment from j to j- kmax in the other sequence.
otides. However if eqn. (5) is satisfied it can be inferred We then check whether eqn. (5) is satisfied for k = kmax to
that the sequence segment (i-k to i) in one sequence is see if these segments are truly identical. If not, we substi-
identical to the segment (j-k to j) in the other sequence. tute k-1 for k and again check again if eqn.(5) is satisfied
and if not, the procedure is repeated till the condition is
Taking log on both sides of eqn. (5), we get, satisfied. Thus starting from (i- kmax, j- kmax), the first posi-
tion (i-k, j-k) that satisfies eqn.(5) is determined. This
log(d(i, j)) − log(d(i − k, j − k)) gives the length k up to which segments prefixed to posi-
k= (6) tions i and j in the two sequences, are identical. The flow
log(0.5)
chart of this procedure is shown in Fig. 1
We can get an upper bound for k by putting d(i-k, j-k) = 1
in eqn.(6): This method thus identifies identical segments without
having to match the whole segment nucleotide by nucle-
kmax = -log2 (d(i, j)) (7) otide. Search can be completely avoided if kmax is found to
Page 5 of 10
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:243 http://www.biomedcentral.com/1471-2105/7/243
Enterobacteriaceae
Figure 5
Enterobacteriaceae. E.coli K12 and E.coli O157:H7
be less than a threshold and long homologous segments 3. Starting from the last nucleotide of sequence A, we
can be identified by checking only a few points from (i- identify the square in which the corresponding CGR point
kmax, j- kmax) instead of matching the whole length of the i falls.
segment.
3. The CGR points of B that fall in the same square, corre-
Speeding up the algorithm spond to the n-mers in B that match the n-mer which is
The disadvantage of the above method is that the compu- prefixed to the position i in A
tational cost is of the order of the product of the length of
the two sequences. 4. We calculate d (i, j) and kmax for those CGR points j of
the sequence B, which fall in the same square as the CGR
In order to speed up the program, we find a way to avoid point i of sequence A.
computing d(i, j) for all pairs of CGR points of the two
sequences. For this we use information from a resolved 5. Using d(i, j) and kmax, we determine the length of
CGR in which the CGR square is divided into grid of size matching segments, as described in the last section (Fig.1)
2n × 2n. All CGR points falling in a square denotes the
existence of a particular n-mer prefixed to that position. 6. The longest matching segment is taken as the best local
alignment at position i
The algorithm for comparing two sequences A and B is
described below: 7. The procedure is repeated next for the point i-k in A, k
being the length of the longest matching segment.
1. The CGR co-ordinates for both the sequences are calcu-
lated We can thus find all the non-overlapping local alignments
between the two sequences. Using this approach, the all-
2. The CGR is resolved using a 2n × 2n grid and the CGR to-all comparisons of the previous section is reduced to
points of sequence B that fall in each square are noted and some-to-some comparisons, which speeds up the algo-
stored rithm considerably. This technique is similar to the
Page 6 of 10
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:243 http://www.biomedcentral.com/1471-2105/7/243
Rickettsiales
Figure 6
Rickettsiales. Rickettsia Prowazekii Madrid E and Rickettsia Conorii Malish 7
anchored alignment method used in other alignment pro- sequence A. An example list of alignments is shown in
grams; the difference is that we use information from Table 2.
CGR, both for finding the anchors as well as for extending
them. Floating point error
For long identical sequence segments, the distance value
The program yields the list of all local alignments between may go below the minimum value possible for a floating
the two sequences in the order of their position in the point variable. The distance defined in double precision
Table 2: Sample alignment list.A sample list of all local alignments between two sequences in the order of their position in the
sequence A is shown in Table 2:
Page 7 of 10
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:243 http://www.biomedcentral.com/1471-2105/7/243
Actinobacteria
Figure 7
Actinobacteria. Mycobacterim Tuberculosis H37Rv and Mycobacterium Leprae TN
variable becomes zero when the length of identical seg- increasing order of L.END.B and any disruption of order
ment is greater than 64. in the list with respect to position in Sequence B is indic-
ative of shuffling. By examining the disruption of order in
Therefore in our implementation, when we encounter L.END.B we can estimate the number of shuffles that have
zero value for the distance we jump back by sixty posi- taken place in Sequence B with respect to Sequence A.
tions and check distance again; if the distance is again
zero, we jump back another sixty positions and so on until (b) Mismatches
the distance becomes non-zero. We add all the skipped Let, ∆A = abs (L2.START.A – L1.END.A) and
positions to the k that we finally calculate with the non-
zero distance value. ∆B = abs (L2.START.B – L1.END.B)
Analysing the local alignments for shuffles, mismatches where L1 and L2 are two consecutive alignments in the
and insertion/deletions ordered list.
A local alignment can be defined by the start and end
positions of identical segments in the two sequences. A Mismatch length between the alignments can be calcu-
pair of local alignments is given in figure 2. lated as:
Consider two local alignments L1 and L2 defined by Mismatch length = min (∆A, ∆B)
(L1.START.A, L1.END.A, L1.START.B, L1.END.B) and Mismatches between the forward strands of the genomes
of E.coli K12 and E.coli O157:H7 is given as additional
(L2.START.A, L2.END.A, L2.START.B, L2.END.B) file [see Additional file 1].
(a) Shuffles/Rearrangements
Consider the list of local alignments that are ordered in
increasing order of L.END.A. This list may not be in
Page 8 of 10
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:243 http://www.biomedcentral.com/1471-2105/7/243
Page 9 of 10
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:243 http://www.biomedcentral.com/1471-2105/7/243
Additional File 6
The Source code of the programme for chaining aligned segments, filtering
background noise and for identifying insertions deletions and mismatches.
Click here for file
[http://www.biomedcentral.com/content/supplementary/1471-
2105-7-243-S6.C]
Acknowledgements
We express our thanks to Prof. T. K. Chandrasekhar, Director, Regional
Research Laboratory (CSIR), Thiruvananthapuram for his support and
encouragement of this work. We gratefully acknowledge helpful discus-
sions with Prof. Alok Bhattacharya and Prof. Andrew Lynn. One of the
authors (J.J), acknowledges the CSIR for financial support.
References
1. Jeffrey HJ: Chaos game representation of gene structure.
Nucleic Acids Res 1990, 18(8):2163-2170.
2. Basu S, Pan A, Dutta C, Das J: Mathematical characterization of
chaos game representation. New algorithms for nucleotide
sequence analysis. J Mol Biol 1992, 228:715-719.
3. Hill KA, Schisler NJ, Singh SM: Chaos game representation of
coding regions of human globin genes and alcohol dehydro-
genase genes of phylogenetically divergent species. J Mol Evol
1992, 35:261-269.
4. Oliver JL, Bernaola-Galvan P, Guerrero G, Roman-Roldan R: Entro-
pic profiles of DNA sequences through chaos-game-derived
images. J Theor Biol 1993, 160(4):457-470.
5. Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B: Genomic signa-
ture: characterization and classification of species assessed
by chaos game representation of sequences. Mol Biol Evol 1999,
16(10):1391-1399.
6. Goldman N: Nucleotide, dinucleotide and trinucleotide fre-
quencies explain patterns observed in chaos game represen-
tations of DNA sequences. Nucleic Acids Res 1993, 21:2487-2491.
7. Almeida JS, Carrico JA, Maretzek A, Noble PA, Fletcher M: Analysis
of genomic sequences by chaos game representation. Bioin-
formatics 2001, 17(5):429-437.
8. Wang Y, Hill K, Singh S, Kari L: The spectrum of genomic signa-
tures: from di-nucleotides to chaos game representation.
Gene 2005, 346:173-185.
9. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu
C, Salzberg SL: Versatile and open software for comparing
large genomes. Genome Biology 2004, 5:R12.
10. Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for
large DNA databases. Genome Res 2001, 11:1725-1729.
11. Bray N, Dubchak I, Pachter L: AVID: A global alignment pro-
gram. Genome Res 2003, 13(1):7-102.
12. Schwartz S, Kent JW, Smit A, Zhang Z, Baertsch R, Hardison RC,
Haussler D, Miller W: Human-Mouse Alignments with
BLASTZ. Genome Res 2003, 13:103-107.
Page 10 of 10
(page number not for citation purposes)