BMC Bioinformatics: Chaos Game Representation For Comparison of Whole Genomes


BMC Bioinformatics BioMed Central

Methodology article Open Access


Chaos game representation for comparison of whole genomes
Jijoy Joseph and Roschen Sasikumar*

Address: Computational Modelling and Simulation, Regional Research Laboratory (CSIR), Thiruvananthapuram, 695019, India
Email: Jijoy Joseph - [email protected]; Roschen Sasikumar* - [email protected]
* Corresponding author

Received: 17 January 2006
Accepted: 05 May 2006
Published: 05 May 2006
BMC Bioinformatics 2006, 7:243 doi:10.1186/1471-2105-7-243
This article is available from: http://www.biomedcentral.com/1471-2105/7/243
© 2006 Joseph and Sasikumar; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Chaos game representation of genome sequences has been used for visual
representation of genome sequence patterns as well as alignment-free comparisons of sequences
based on oligonucleotide frequencies. However, the potential of this representation for making
alignment-based comparisons of whole genome sequences has not been exploited.
Results: We present here a fast algorithm for identifying all local alignments between two long
DNA sequences using the sequence information contained in CGR points. The local alignments can
be depicted graphically in a dot-matrix plot or in text form, and the significant similarities and
differences between the two sequences can be identified. We demonstrate the method through
comparison of whole genomes of several microbial species. Given two closely related genomes we
generate information on mismatches, insertions, deletions and shuffles that differentiate the two
genomes.
Conclusion: Adding large-scale sequence alignment to the repertoire of alignment-free
sequence analysis applications of chaos game representation positions CGR as a
powerful sequence analysis tool.

Background
Chaos game representation was proposed as a scale-independent representation for genomic sequences by H.J. Jeffrey [1]. A CGR of a DNA sequence is plotted in a unit square, the four vertices of which are labelled by the nucleotides A-(0,0), C-(0,1), G-(1,1), T-(1,0). The plotting procedure can be described by the following steps: the first nucleotide of the sequence is plotted halfway between the centre of the square and the vertex representing this nucleotide; successive nucleotides in the sequence are plotted halfway between the previous plotted point and the vertex representing the nucleotide being plotted.

Mathematically, the co-ordinates of the successive points in the chaos game representation of a DNA sequence are described by an iterated function system defined by the following equations:

$$X_i = 0.5\,(X_{i-1} + g_{ix}), \qquad Y_i = 0.5\,(Y_{i-1} + g_{iy}) \qquad (1)$$

where $g_{ix}$ and $g_{iy}$ are the X and Y co-ordinates, respectively, of the corner corresponding to the nucleotide at position i in the sequence. For example, if the ith nucleotide is C, $g_{ix} = 0$ and $g_{iy} = 1$.

CGRs of DNA sequences were shown to exhibit interesting patterns. These interesting features relevant to the DNA sequence organization attracted immediate further
research [2-4]. CGR has been used in various kinds of investigations of DNA sequences. The first potential of CGR to be recognized was its capability to depict genomic signatures. Hill et al. [3] examined the CGRs of coding sequences of 29 relatively conserved alcohol dehydrogenase genes from phylogenetically divergent species. They found that CGRs were similar for the genes of the same or closely related species but were different for the genes from distantly related species. Oliver et al. [4] used the density of CGR points to derive entropy profiles for DNA sequences that showed a different degree of variability within and between genomes. Using CGR for making oligomer frequency counts, Deschavanne et al. [5] observed that subsequences of a genome exhibit the main characteristics of the whole genome, attesting to the validity of a genomic signature concept.

Figure 1. Finding k. Flow chart of the procedure for identifying matching segments: set $k = k_{max} = -\log_2 d(i, j)$; test whether $d(i, j) = 2^{-k}\, d(i-k, j-k)$; if not, set k = k - 1 and test again; if so, (i-k to i) matches (j-k to j) and the length of the identical segment is k.

CGR research received a setback when Goldman [6] asserted that simple Markov chain models based solely on di-nucleotide and tri-nucleotide frequencies can completely account for the complex patterns exhibited in CGRs of DNA sequences. However, Almeida et al. [7] showed that Markov chain models are in fact particular cases of CGRs. They showed that the distribution of points in CGR is a generalization of Markov chain probability tables that accommodates non-integer orders. Wang et al. [8] proved that while nucleotide, di-nucleotide and tri-nucleotide frequencies are able to influence the patterns in CGRs, these frequencies cannot solely determine the patterns in CGRs. They showed that CGR is completely determined by frequencies of oligonucleotides of all lengths. The work of Almeida et al. positioned CGR as a powerful sequence modelling tool that has the advantages of computational efficiency and scale independence.

Sequence comparisons can be made using two different aspects of the CGR:

1. The frequency matrices of oligonucleotides of different lengths, which are derivable from CGR by resolving the CGR using grids of different sizes

2. The co-ordinates of the CGR points of the two sequences

In the CGR, a point corresponding to a sequence of length n is contained within a square with side of length $2^{-n}$ [1]. The frequency of appearance of any oligomer in a sequence can be found by partitioning the CGR space into squares of appropriate sizes. Thus counting the CGR points in the squares of a $2^n \times 2^n$ grid gives the number of occurrences of all possible n-mers in the sequence. This representation is called Frequency chaos game representation (FCGR), where the frequency of an oligomer is the number of points in the corresponding square. It is also possible to calculate oligonucleotide frequencies of non-integer lengths by resolving the CGR using grids of sizes other than powers of two [7].

Most applications of CGR have been based on point counts calculated at various grid resolutions (FCGR). Sequence comparisons based on CGR co-ordinates have been left relatively unexplored. Almeida et al. [7] pointed out that regions of local similarity between two sequences are reflected in the distance between CGR points. CGR points come closer together as sequence similarity increases. They defined a measure of local similarity, the length of similar sequence nH, calculated as a function of the maximum absolute difference between either CGR co-ordinate. However, no attempt was made to use the information for developing an algorithm for aligning and comparing whole genomes.

As more and more genomes are being sequenced it has become possible to study evolutionary events by comparing whole genomes of closely related species and identifying the differences. Efficient programs for detecting and aligning matching segments in pairs of mega-base scale sequences are important for comparing whole genomes and determining evolutionary relationships. Several programs for large-scale genome comparison have been developed in the last six years, for example, MUMmer [9], SSAHA [10], AVID [11] and BLASTZ [12]. All these programs follow an anchor-based approach in which all matching n-mers for a fixed n are first identified as potential anchors and the anchors are extended into longer alignments.
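The CGR co-ordinates and the grid-based frequency counts (FCGR) described earlier in this section can be illustrated with a minimal Python sketch. It applies the halving rule of equation (1), counts points on a $2^n \times 2^n$ grid, and measures the maximum absolute co-ordinate difference between two CGR points. The function names (`cgr_points`, `cell_of`, `fcgr`, `kmer_cell`, `cgr_distance`) and the toy sequences are assumptions made for this example; they are not taken from the authors' C program.

```python
from collections import Counter

# Corner assignment taken from the text: A-(0,0), C-(0,1), G-(1,1), T-(1,0).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR point for every position of seq, following eq. (1)."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq:
        gx, gy = CORNERS[base]
        x, y = 0.5 * (x + gx), 0.5 * (y + gy)
        points.append((x, y))
    return points


def cell_of(point, n):
    """(column, row) of the 2^n x 2^n grid square that contains point."""
    size = 1 << n
    # Clamp so that a co-ordinate rounded up to 1.0 stays inside the grid.
    col = min(int(point[0] * size), size - 1)
    row = min(int(point[1] * size), size - 1)
    return col, row


def fcgr(seq, n):
    """Count CGR points per grid square.

    The first n-1 points are skipped because they correspond to
    prefixes shorter than n nucleotides.
    """
    points = cgr_points(seq)
    return Counter(cell_of(p, n) for p in points[n - 1:])


def kmer_cell(kmer):
    """Grid square into which every occurrence of kmer falls."""
    return cell_of(cgr_points(kmer)[-1], len(kmer))


def cgr_distance(p, q):
    """Maximum absolute co-ordinate difference between two CGR points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


if __name__ == "__main__":
    counts = fcgr("ACGTACGT", 2)
    for kmer in ("AC", "CG", "GT", "TA"):
        print(kmer, counts[kmer_cell(kmer)])

    # Two toy sequences that share their last eight nucleotides.
    a = cgr_points("TTTTACGTACGT")
    b = cgr_points("GGCAACGTACGT")
    print(cgr_distance(a[-1], b[-1]))
```

In this sketch the grid square of a point depends only on the last n nucleotides, which is why counting points per square reproduces n-mer counts. The two toy sequences at the end share an eight-nucleotide suffix, so their final CGR points lie within $2^{-8}$ of each other.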


In this paper we develop a fast algorithm for comparison of pairs of long sequences using the information contained in CGR points. We first show how all similar segments of two sequences can be identified based on the distance between the CGR points of the two sequences. Since determining the distance between all pairs of CGR points is costly in time (complexity O(N × M), N and M being the lengths of the two sequences), we speed up the program by using an anchored alignment approach similar to that used in other programs. We use CGR resolved by a $2^n \times 2^n$ grid for fast location of the matching n-mers which form the anchors. The distance between CGR points corresponding to each pair of matching n-mers is then used to see if the matching n-mers can be extended into longer local alignments. We allow for mismatches by chaining together close local alignments. The program finds multiple local alignments between two sequences, allowing the detection of homologous segments, internal sequence duplications and shuffling of segments.

Figure 2. Local alignments. Definition of local alignments.

Results
Figures 3, 4, 5, 6 and 7 show dot matrix plots of the local alignments between Human Immunodeficiency Virus [GenBank: K02013.1] and Chimpanzee Immunodeficiency Virus [EMBL:X52154], Pyrococcus abyssi GE5 [EMBL:AL096836] and Pyrococcus horikoshii OT3 [DDBJ:BA000001], E. coli O157:H7 [DDBJ:BA000007] and E. coli K12 [GenBank: U00096], Rickettsia prowazekii Madrid E [EMBL:AJ235269] and Rickettsia conorii Malish 7 [GenBank:AE006914], and Mycobacterium leprae TN [EMBL:AL450380] and Mycobacterium tuberculosis H37Rv [EMBL:AL123456], respectively. It can be seen that large segments have been inverted between Pyrococcus abyssi and Pyrococcus horikoshii as well as between Mycobacterium leprae TN and Mycobacterium tuberculosis H37Rv. Text files giving positions of insertions/deletions and mismatches inferred from the local alignments between the two strains of E. coli are given as Supplementary material.

Table 1: Computational time chart. Table 1 shows the computation time taken by the program running on a Pentium IV 2.5 GHz machine, for comparing various genome sequences.

| Organisms | Length A (base pairs) | Length B (base pairs) | Time, forward strand (seconds) | Time, reverse complement (seconds) |
|---|---|---|---|---|
| HIV vs CIV | 9229 | 9811 | <1 | 1 |
| P. abyssi vs P. horikoshii | 1765118 | 1738505 | 24 | 27 |
| E. coli O157:H7 vs E. coli K12 | 5498450 | 4639675 | 68 | 156 |
| R. p. Madrid E vs R. c. Malish 7 | 1111523 | 1268755 | 18 | 24 |
| M. tuberculosis H37Rv vs M. leprae TN | 4411532 | 3268203 | 119 | 120 |
| M. bovis AF2122 vs M. tuberculosis H37Rv | 4345492 | 4411532 | 10 | 230 |
| M. tuberculosis H37Rv vs M. tuberculosis CDC1551 | 4411532 | 4403662 | 6 | 232 |

The time taken for finding all local alignments between pairs of sequences of different sizes is given in Table 1. It can be seen that the time of execution of the program depends not only on the length of the sequences, but also on the degree of similarity between them. For example, the time taken for comparing M. leprae and M. tuberculosis is much greater than the time taken for comparing M. bovis and M. tuberculosis even though the sizes of the genomes are similar. The time taken by this program for comparing the two E. coli genomes is 68 seconds, while MUMmer, an available large-scale sequence alignment tool, takes only 17 seconds. The emphasis of this paper is on the theoretical development of the method rather than on software development, and it is possible that with better programming inputs the implementation can be made more efficient and faster. The main advantage of this method comes from the fact that CGR simultaneously
facilitates other types of sequence comparisons ranging from visual comparisons of patterns to oligonucleotide frequency spectrums and genome signatures.

Figure 3. Immunodeficiency virus. Human immunodeficiency virus and Chimpanzee immunodeficiency virus.

Conclusion
A new algorithm that uses information from chaos game representation of genome sequences for finding all local alignments between the sequences has been developed. Fast comparisons can be made between sequences of megabase size using a Pentium IV machine. As far as the speed of alignment is concerned, the program in its present state does not offer any major improvement over MUMmer, but it is possible that the method can be implemented more efficiently through better programming inputs. Adding large-scale sequence alignment to the existing repertoire of alignment-free sequence analysis possibilities of chaos game representation positions CGR as a powerful quantitative sequence analysis tool.

Methods
Using CGR points for finding identical segments in two sequences
In the following we show how the distance between CGR points can be used to identify sequence identities without having to match the sequences nucleotide by nucleotide.

Consider the ith nucleotide of one sequence and the jth nucleotide of the other. Co-ordinates of the CGR points corresponding to these positions on the two sequences are given by:

$$X_i = 0.5\,(X_{i-1} + g_{ix}), \qquad Y_i = 0.5\,(Y_{i-1} + g_{iy})$$

$$X_j = 0.5\,(X_{j-1} + g_{jx}), \qquad Y_j = 0.5\,(Y_{j-1} + g_{jy}) \qquad (2)$$

The distance between CGR points is defined as

$$d(i, j) = \max(|X_i - X_j|,\ |Y_i - Y_j|) \qquad (3)$$

If the nucleotides at positions i and j of the first and the second sequence respectively are equal, then $g_{ix} = g_{jx}$ and $g_{iy} = g_{jy}$. The corner terms then cancel in the differences $X_i - X_j$ and $Y_i - Y_j$, and from equations (2) and (3) we get

$$d(i, j) = 0.5\, d(i-1, j-1) \qquad (4)$$

i.e. a pair of identical nucleotides makes the distance between the corresponding CGR points half the distance between the previous pair of points. Extending this argument, we can say that if k consecutive nucleotides previous to positions i and j on the two sequences are identical, the distance between the CGR points corresponding to i and j is given by
$$d(i, j) = (0.5)^k\, d(i-k, j-k) \qquad (5)$$

As k increases, d(i, j) becomes smaller, i.e. as the length of identical sequence increases, the CGR points come closer together.

Figure 4. Thermococcales. Pyrococcus abyssi and Pyrococcus horikoshii.

It must be noted that the closeness of two CGR points is not a sufficient condition to conclude that there is a length of similar sequence behind them. d(i, j) can become very low even when the sequences are very different. Such cases correspond to points on either side of, but close to, the borders of the quadrants corresponding to the four nucleotides. However, if eqn. (5) is satisfied it can be inferred that the sequence segment (i-k to i) in one sequence is identical to the segment (j-k to j) in the other sequence.

Taking log on both sides of eqn. (5), we get

$$k = \frac{\log(d(i, j)) - \log(d(i-k, j-k))}{\log(0.5)} \qquad (6)$$

Since both points lie in the unit square, d(i-k, j-k) cannot exceed 1, so we can get an upper bound for k by putting d(i-k, j-k) = 1 in eqn. (6):

$$k_{max} = -\log_2(d(i, j)) \qquad (7)$$

This can be seen to be the same as the length of similar sequence proposed by Almeida et al. as a measure for assessing local similarity in two sequences.

Equations 5 and 7 can be used to develop an algorithm for detection of all identical segments in two sequences based on the distance between CGR points.

Calculating $k_{max}$ for a pair of positions (i, j) on the two sequences, we can estimate that, at the most, the sequence segment from i to i-$k_{max}$ in one sequence could be identical to the segment from j to j-$k_{max}$ in the other sequence. We then check whether eqn. (5) is satisfied for k = $k_{max}$ to see if these segments are truly identical. If not, we substitute k-1 for k and check again whether eqn. (5) is satisfied, and the procedure is repeated till the condition is satisfied. Thus, starting from (i-$k_{max}$, j-$k_{max}$), the first position (i-k, j-k) that satisfies eqn. (5) is determined. This gives the length k up to which segments prefixed to positions i and j in the two sequences are identical. The flow chart of this procedure is shown in Fig. 1.

This method thus identifies identical segments without having to match the whole segment nucleotide by nucleotide. Search can be completely avoided if $k_{max}$ is found to
be less than a threshold, and long homologous segments can be identified by checking only a few points from (i-$k_{max}$, j-$k_{max}$) instead of matching the whole length of the segment.

Figure 5. Enterobacteriaceae. E. coli K12 and E. coli O157:H7.

Speeding up the algorithm
The disadvantage of the above method is that the computational cost is of the order of the product of the lengths of the two sequences.

In order to speed up the program, we find a way to avoid computing d(i, j) for all pairs of CGR points of the two sequences. For this we use information from a resolved CGR in which the CGR square is divided into a grid of size $2^n \times 2^n$. Every CGR point falling in a given square denotes the existence of a particular n-mer prefixed to that position.

The algorithm for comparing two sequences A and B is described below:

1. The CGR co-ordinates for both the sequences are calculated.

2. The CGR is resolved using a $2^n \times 2^n$ grid and the CGR points of sequence B that fall in each square are noted and stored.

3. Starting from the last nucleotide of sequence A, we identify the square in which the corresponding CGR point i falls.

4. The CGR points of B that fall in the same square correspond to the n-mers in B that match the n-mer which is prefixed to the position i in A.

5. We calculate d(i, j) and $k_{max}$ for those CGR points j of the sequence B which fall in the same square as the CGR point i of sequence A.

6. Using d(i, j) and $k_{max}$, we determine the length of matching segments, as described in the last section (Fig. 1).

7. The longest matching segment is taken as the best local alignment at position i.

8. The procedure is repeated next for the point i-k in A, k being the length of the longest matching segment.

We can thus find all the non-overlapping local alignments between the two sequences. Using this approach, the all-to-all comparisons of the previous section are reduced to some-to-some comparisons, which speeds up the algorithm considerably. This technique is similar to the
anchored alignment method used in other alignment programs; the difference is that we use information from CGR both for finding the anchors and for extending them.

Figure 6. Rickettsiales. Rickettsia prowazekii Madrid E and Rickettsia conorii Malish 7.

The program yields the list of all local alignments between the two sequences in the order of their position in the sequence A. An example list of alignments is shown in Table 2.

Table 2: Sample alignment list. A sample list of all local alignments between two sequences, in the order of their position in sequence A.

| Order in A | StartA | EndA | Order in B | StartB | EndB | Length |
|---|---|---|---|---|---|---|
| 0 | 8127 | 8158 | 0 | 9402 | 9433 | 31 |
| 1 | 10846 | 10920 | 1 | 12193 | 12267 | 74 |
| 2 | 11125 | 11158 | 2 | 12446 | 12479 | 33 |
| 3 | 18260 | 18296 | 3 | 20577 | 20613 | 36 |
| 4 | 20041 | 20109 | 4 | 22402 | 22470 | 68 |
| 5 | 20923 | 20975 | 5 | 23284 | 23336 | 52 |
| 6 | 21233 | 21284 | 6 | 23594 | 23645 | 51 |
| 7 | 23591 | 23622 | 7 | 25970 | 26001 | 31 |
| 8 | 26750 | 26835 | 8 | 28887 | 28972 | 85 |
| 9 | 53377 | 53437 | 9 | 37193 | 37253 | 60 |
| 10 | 67041 | 67072 | 84 | 493374 | 493405 | 31 |
| 11 | 67200 | 67231 | 85 | 493533 | 493564 | 31 |
| 12 | 143809 | 143840 | 262 | 2932826 | 2932857 | 31 |
| ... | ... | ... | ... | ... | ... | ... |

Floating point error
For long identical sequence segments, the distance value may go below the minimum value possible for a floating point variable. The distance defined in double precision
variable becomes zero when the length of identical segment is greater than 64.

Figure 7. Actinobacteria. Mycobacterium tuberculosis H37Rv and Mycobacterium leprae TN.

Therefore in our implementation, when we encounter a zero value for the distance we jump back by sixty positions and check the distance again; if the distance is again zero, we jump back another sixty positions and so on until the distance becomes non-zero. We add all the skipped positions to the k that we finally calculate with the non-zero distance value.

Analysing the local alignments for shuffles, mismatches and insertion/deletions
A local alignment can be defined by the start and end positions of identical segments in the two sequences. A pair of local alignments is given in Figure 2.

Consider two local alignments L1 and L2 defined by

(L1.START.A, L1.END.A, L1.START.B, L1.END.B) and

(L2.START.A, L2.END.A, L2.START.B, L2.END.B)

(a) Shuffles/Rearrangements
Consider the list of local alignments ordered in increasing order of L.END.A. This list may not be in increasing order of L.END.B, and any disruption of order in the list with respect to position in Sequence B is indicative of shuffling. By examining the disruption of order in L.END.B we can estimate the number of shuffles that have taken place in Sequence B with respect to Sequence A.

(b) Mismatches
Let ∆A = abs(L2.START.A - L1.END.A) and

∆B = abs(L2.START.B - L1.END.B)

where L1 and L2 are two consecutive alignments in the ordered list.

Mismatch length between the alignments can be calculated as:

Mismatch length = min(∆A, ∆B)

Mismatches between the forward strands of the genomes of E. coli K12 and E. coli O157:H7 are given as an additional file [see Additional file 1].
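The order check for shuffles and the mismatch length defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alignment` class and the function names are hypothetical, and counting each decrease of END.B as one disruption is an assumed rule, since the text does not spell out how disruptions are counted. The two alignments used for the mismatch example are rows 1 and 2 of Table 2; the three-element list is a made-up toy case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One local alignment: start and end positions in sequences A and B."""
    start_a: int
    end_a: int
    start_b: int
    end_b: int


def count_order_breaks(alignments):
    """Count places where END.B decreases in a list ordered by END.A.

    Assumption: every decrease is counted as one disruption of order.
    """
    ordered = sorted(alignments, key=lambda a: a.end_a)
    return sum(
        1 for prev, cur in zip(ordered, ordered[1:]) if cur.end_b < prev.end_b
    )


def gaps(l1, l2):
    """Return (delta_a, delta_b) between two consecutive alignments."""
    delta_a = abs(l2.start_a - l1.end_a)
    delta_b = abs(l2.start_b - l1.end_b)
    return delta_a, delta_b


def mismatch_length(l1, l2):
    """Mismatch length = min(delta_a, delta_b), as defined in the text."""
    return min(gaps(l1, l2))


if __name__ == "__main__":
    # Rows 1 and 2 of Table 2.
    l1 = Alignment(10846, 10920, 12193, 12267)
    l2 = Alignment(11125, 11158, 12446, 12479)
    print(gaps(l1, l2), mismatch_length(l1, l2))

    # Toy list (not from the paper) in which the second segment has moved
    # to an earlier position in B.
    toy = [
        Alignment(0, 10, 100, 110),
        Alignment(20, 30, 50, 60),
        Alignment(40, 50, 120, 130),
    ]
    print(count_order_breaks(toy))
```

Keeping `gaps` separate from `mismatch_length` makes ∆A and ∆B available for any further comparison of the two gaps.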


(c) Insertions/Deletions
A diagonal off-set between two consecutive local alignments that are also consecutive in Sequence B indicates deletions and insertions, and can be calculated as

IN/DEL length = max(∆A, ∆B) - min(∆A, ∆B)

Insertions/Deletions between the forward strands of the genomes of E. coli K12 and E. coli O157:H7 are given as an additional file [see Additional file 2].

(d) Duplications
Duplications in B can be identified wherever L1.START.A = L2.START.A and L1.END.A = L2.END.A.

(e) Inversions
Inversion of segments is detected by finding local alignments between Sequence A and the reverse complement of Sequence B.

Chaining local alignments and filtering background noise
Short spurious alignments or background noise can be removed by filtering out all alignments below a certain threshold length. However, this carries with it the danger of filtering out many "true" alignments that are separated by small mismatches. Therefore, before filtering it is better to chain together the perfect local alignments by allowing a certain amount of mismatches. We allow for short mismatches by chaining together local alignments that have no diagonal off-set and differ only by mismatches of a few nucleotides. We specify the maximum allowable mismatches per length of the chained alignment. If there is no diagonal off-set between them, i.e. ∆A = ∆B, and the mismatch falls below the threshold value, the two alignments are chained together into a single alignment.

Chained alignments having length below a threshold are discarded to filter out the background noise. A text file showing matching regions between the genomes of E. coli K12 and E. coli O157:H7, after filtering background noise, is given as an additional file [see Additional file 3]. Further, the forward strand of E. coli K12 is compared with the complementary strand of E. coli O157:H7. The resulting text file showing the matching regions is given as an additional file [see Additional file 4].

Availability and requirements
The source code for finding matching segments using CGR is given as an additional file [see Additional file 5].

The source code for chaining matching segments and filtering background noise and also showing insertions/deletions/mismatches is given as an additional file [see Additional file 6].

Operating system: Linux

Programming language: Standard C

License: GNU General Public License

Abbreviations
CGR – Chaos Game Representation

FCGR – Frequency Chaos Game Representation

Authors' contributions
RS developed the relationship between the length of identical segments and distance between CGR points. JJ developed the algorithm for speeding up the comparison. JJ did all the coding and generated the results. RS wrote the paper and JJ made editorial corrections.

Additional material

Additional File 1
Text file showing mismatches between the forward strands of the genomes of E. coli K12 and E. coli O157:H7.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-243-S1.txt]

Additional File 2
Text file showing insertions/deletions between the forward strands of the genomes of E. coli K12 and E. coli O157:H7.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-243-S2.txt]

Additional File 3
Text file showing matching segments in forward strands of both genomes of E. coli K12 and E. coli O157:H7.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-243-S3.txt]

Additional File 4
Text file showing matching segments in the forward strand of E. coli O157:H7 and the reverse strand of E. coli K12 (inversions).
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-243-S4.txt]

Additional File 5
Source code of the program for finding similar sequences in two sequences.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-243-S5.C]


Additional File 6
Source code of the program for chaining aligned segments, filtering background noise and identifying insertions, deletions and mismatches.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-243-S6.C]

Acknowledgements
We express our thanks to Prof. T. K. Chandrasekhar, Director, Regional Research Laboratory (CSIR), Thiruvananthapuram for his support and encouragement of this work. We gratefully acknowledge helpful discussions with Prof. Alok Bhattacharya and Prof. Andrew Lynn. One of the authors (J.J.) acknowledges the CSIR for financial support.

References
1. Jeffrey HJ: Chaos game representation of gene structure. Nucleic Acids Res 1990, 18(8):2163-2170.
2. Basu S, Pan A, Dutta C, Das J: Mathematical characterization of chaos game representation. New algorithms for nucleotide sequence analysis. J Mol Biol 1992, 228:715-719.
3. Hill KA, Schisler NJ, Singh SM: Chaos game representation of coding regions of human globin genes and alcohol dehydrogenase genes of phylogenetically divergent species. J Mol Evol 1992, 35:261-269.
4. Oliver JL, Bernaola-Galvan P, Guerrero G, Roman-Roldan R: Entropic profiles of DNA sequences through chaos-game-derived images. J Theor Biol 1993, 160(4):457-470.
5. Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol 1999, 16(10):1391-1399.
6. Goldman N: Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res 1993, 21:2487-2491.
7. Almeida JS, Carrico JA, Maretzek A, Noble PA, Fletcher M: Analysis of genomic sequences by chaos game representation. Bioinformatics 2001, 17(5):429-437.
8. Wang Y, Hill K, Singh S, Kari L: The spectrum of genomic signatures: from di-nucleotides to chaos game representation. Gene 2005, 346:173-185.
9. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biology 2004, 5:R12.
10. Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res 2001, 11:1725-1729.
11. Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res 2003, 13(1):97-102.
12. Schwartz S, Kent JW, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-Mouse Alignments with BLASTZ. Genome Res 2003, 13:103-107.
