Genome Informatics 21: 165-176 (2008)
A PHYLOGENOMIC APPROACH FOR STUDYING PLASTID
ENDOSYMBIOSIS
AHMED MOUSTAFA 1,*
CHEONG XIN CHAN 2,*
MEGAN DANFORTH 2
DAVID ZEAR 2
HIBA AHMED 2
NAGNATH JADHAV 2
TREVOR SAVAGE 2
DEBASHISH BHATTACHARYA 1,2
*These authors contributed equally to this work.
1 Interdisciplinary Genetics Program, University of Iowa, Iowa City, IA 52242, U.S.A.
2 Department of Biology and Roy J. Carter Center for Comparative Genomics, University of Iowa, Iowa City, IA 52242, U.S.A.
Gene transfer is a major contributing factor to functional innovation in genomes. Endosymbiotic gene transfer (EGT) is a specific instance of lateral gene transfer (LGT) in
which genetic materials are acquired by the host genome from an endosymbiont that has
been engulfed and retained in the cytoplasm. Here we present a comprehensive approach
for detecting gene transfer within a phylogenetic framework. We applied the approach
to examine EGT of red algal genes into Thalassiosira pseudonana, a free-living diatom
for which a complete genome sequence has recently been determined. Out of 11,390 predicted protein-coding sequences from the genome of T. pseudonana, 124 (1.1%, clustered
into 80 gene families) are inferred to be of red algal origin (bootstrap support ≥ 75%).
Of these 80 gene families, 22 (27.5%) encode novel, unknown functions. We found 21.3%
of the gene families to putatively encode non-plastid-targeted proteins. Our results suggest that EGT of red algal genes provides a relatively minor contribution to the nuclear
genome of the diatom, but the transferred genes have functions that extend beyond photosynthesis. This assertion awaits experimental validation. Whereas the current study is
focused within the context of secondary endosymbiosis, our approach can be applied to
large-scale detection of gene transfer in any system.
Keywords: phylogenomics; endosymbiotic gene transfer; lateral gene transfer; plastid;
chromalveolates.
1. Introduction
Lateral gene transfer (LGT) is a phenomenon in which genetic materials are transmitted between non-lineal individuals (e.g., between two different strains or species).
This phenomenon is one of the major mechanisms for functional innovation in the
genomes of prokaryotes [1, 2] and eukaryotes [3, 4], as well as for the acquisition
of new virulence genes in pathogens [5]. Therefore, the elucidation of gene transfer
events will enhance our understanding of how genomes evolve. Here we present a
systematic approach for detecting LGT within the context of plastid endosymbiosis.
A. Moustafa et al.
1.1. Plastid endosymbiosis and gene transfer
The origin and establishment of the photosynthetic organelle (plastid) in algae
and plants are important for understanding biotic evolution because these taxa
form the primary food source for all life on earth. The endosymbiosis hypothesis
postulates that the plastid originated from the ancient engulfment and retention
of a free-living cyanobacterium (the endosymbiont) by a heterotrophic, unicellular
protist. This ancestral photosynthetic eukaryote diversified into the red, green, and
glaucophyte algae [6, 7]. Subsequently, a secondary endosymbiosis occurred,
in which a red alga that had gained its photosynthetic capability from primary
endosymbiosis was itself engulfed by a non-photosynthetic protist, giving rise to
the progenitor of the eukaryote supergroup Chromalveolata [7, 8]. The process of
endosymbiosis and the origin of the plastid are detailed in [9–11] and Figure 1 in [6].
The phenomenon of endosymbiosis led to the transfer of genetic material from the
endosymbiont to the host nuclear genome via endosymbiotic gene transfer (EGT),
which is a specific case of LGT.
Chromalveolata is one of the six major “supergroups” of eukaryotes. This lineage
consists of a taxonomically diverse group of species that are of high ecological and
economic importance, including diatoms, seaweeds, dinoflagellates, and the malaria
parasite Plasmodium. Our group has previously demonstrated EGT (and LGT) in
chromalveolate genomes [3, 12–14], but the extent of EGT from red algae into chromalveolates, vis-à-vis secondary endosymbiosis, has not been studied in a rigorous
manner.
Among the chromalveolates, diatoms are unicellular eukaryotes and one of the
primary contributors to the marine food chain. The diatoms are estimated to generate ≈ 40% of the organic carbon produced annually in the sea [15]. These taxa
affect the flux of atmospheric carbon dioxide into the oceans, which in turn has
effects on global climate [16]. Recently, the genome of the free-living diatom Thalassiosira pseudonana was sequenced to completion [17]. Using the available genomic
sequences, here we present a rigorous, phylogenomic pipeline to examine the extent
of EGT of red algal genes in T. pseudonana, and investigate if these transferred
genes are restricted to photosynthesis-related functions.
2. A phylogenomic approach for inferring phylogenies
With the increasing amount of available genome data, phylogenomics, the intersection of evolutionary and genomic approaches [18], has become a key instrument in
studying genomes on a gene-by-gene basis. This is done primarily by the automated
generation and inspection of phylogenetic trees. In many recent studies, phylogenomics has been employed to answer various questions including, e.g., prediction of
biochemical gene functions [19], evolution of gene functions [20], detection of gene
transfer events [1, 3], and resolution of complex taxonomic relationships [13].
Our phylogenomic pipeline consists of four basic steps, as shown in Figure 1.
First, homologous genes for the target sequences are identified (step 1) using WU-BLAST (http://blast.wustl.edu/) searches against a database containing sequences
collected from public resources, e.g. NCBI (http://www.ncbi.nlm.nih.gov/) and
JGI (http://www.jgi.doe.gov/). We used WU-BLAST because this program shows
higher time-efficiency than the original BLAST algorithm [21]. Following this, multiple sequence alignment (step 2) is performed for each homologous gene family prior
to phylogeny inference (step 3). We used MUSCLE [22] to align the sequences, and
both neighbor-joining (NJ) [23] and maximum likelihood (ML) [24] to reconstruct
the phylogenies, because these yield high accuracy in a reasonably short period
of time [22, 24]. However, other approaches for sequence alignment and phylogeny
inference can easily be incorporated into our pipeline. Finally, once the phylogeny
for each gene family is obtained, these can be searched for topological patterns of
interest (step 4). In the current study, we used PhyloSort [25] to sort and examine
monophyletic relationships between chromalveolates and other taxa of interest.
Fig. 1. A schematic diagram of the phylogenomic pipeline: functional components and data flow. [Diagram components: step 1, identification of homologous genes (FASTA query; target database in MySQL exported to FASTA with PERL; WU-BLAST; XML parsing in Java and PERL); step 2, multiple sequence alignment (e.g. MUSCLE), with refinement and conversion (Java) to PHYLIP; step 3, phylogeny inference (e.g. RAxML); step 4, topological analysis of phylogeny (phylogeny sorting with PhyloSort for patterns of interest).]
2.1. Analysis of EGT in Thalassiosira pseudonana
We obtained all 11,390 predicted protein-coding sequences from the complete Thalassiosira pseudonana genome from JGI (http://www.jgi.doe.gov/). We performed a
preliminary screening using BLAST (at e-value ≤ 0.001) for sequences that are
highly similar to and thus possibly share a common ancestry (i.e., homologous) with
the genes in red algae. Using 5,014 protein sequences from the complete genome
of the red alga Cyanidioschyzon merolae [26], we found 4,894 (43.0% of 11,390)
protein-coding sequences in T. pseudonana to have homologs in C. merolae.
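The screening step can be illustrated with a minimal sketch. It assumes that BLAST hits are available as tab-separated lines of query, subject, and e-value, which is a simplified, hypothetical format, and it keeps every query that has at least one hit at or below the 0.001 cutoff. The function name and the toy identifiers are illustrative and are not part of the original pipeline.

```python
E_VALUE_CUTOFF = 1e-3  # threshold stated in the text


def queries_with_homologs(hit_lines, cutoff=E_VALUE_CUTOFF):
    """Return the set of query IDs with at least one hit at e-value <= cutoff.

    Each line is assumed to hold: query <TAB> subject <TAB> e-value.
    """
    kept = set()
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        query, _subject, evalue = line.split("\t")[:3]
        if float(evalue) <= cutoff:
            kept.add(query)
    return kept


# Hypothetical toy hits (identifiers are invented for illustration).
toy_hits = [
    "Tp_0001\tCm_0420\t1e-30",
    "Tp_0001\tCm_0977\t0.5",
    "Tp_0002\tCm_0113\t0.02",
    "Tp_0003\tCm_0008\t0.001",
]
homologs = queries_with_homologs(toy_hits)

# Share of T. pseudonana sequences with a C. merolae homolog, as reported.
share_percent = 100 * 4894 / 11390
```

Applied to the counts in the text, the same proportion calculation gives the reported 43.0% when rounded to one decimal place.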
These protein-coding sequences were used as input in our phylogenomic pipeline
that utilizes our local database, which consists of 2,555,575 sequences from 62 eukaryote genomes, inclusive of complete and partial expressed sequence tag (EST)
sequences spanning Plantae, chromalveolates, Rhizaria, excavates, animals, fungi,
and Amoebozoa, and 500 complete bacterial genomes. Initially, the phylogenetic
trees were constructed using NJ with a Poisson-distance correction and 100 replicates for the bootstrap analysis. By searching for the monophyly of cyanobacteria
and chromalveolates, with or without Plantae, we identified and removed 1,907
chromalveolate genes with a potential cyanobacterial origin. This step was designed
to exclude genes that were introduced via EGT into the red algal nucleus as a result of primary endosymbiosis. For the remaining 2,987 trees, we searched for the
monophyly of red algae and chromalveolates, with or without green and glaucophyte
algae (≥ 75% bootstrap support). We identified 288 protein-coding sequences in T.
pseudonana with potential red algal origin through EGT (as a result of secondary
endosymbiosis).
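The monophyly search described above can be sketched as follows. This is a simplified illustration and not PhyloSort itself: it assumes a rooted toy tree written as nested `(support, children)` tuples, a hypothetical `LINEAGE` lookup from taxon to lineage, and invented support values. A tree passes when some node with bootstrap support of at least 75 contains both required lineages and nothing outside the required and optional lineages.

```python
def leaf_names(node):
    """Collect taxon names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    _support, children = node
    names = []
    for child in children:
        names.extend(leaf_names(child))
    return names


def has_supported_clade(node, lineage_of, required,
                        optional=frozenset(), min_support=75):
    """True if an internal node at or above min_support contains every
    required lineage and no lineage outside required plus optional."""
    if isinstance(node, str):
        return False
    support, children = node
    lineages = {lineage_of[name] for name in leaf_names(node)}
    if (support is not None and support >= min_support
            and required <= lineages <= required | optional):
        return True
    return any(
        has_supported_clade(child, lineage_of, required, optional, min_support)
        for child in children
    )


# Hypothetical lineage labels and toy tree; support values are invented.
LINEAGE = {
    "Arabidopsis thaliana": "green",
    "Oryza sativa": "green",
    "Cyanidioschyzon merolae": "red",
    "Thalassiosira pseudonana": "chromalveolate",
    "Phaeodactylum tricornutum": "chromalveolate",
    "Synechococcus elongatus": "cyanobacteria",
}
toy_tree = (None, [
    "Synechococcus elongatus",
    (100, ["Arabidopsis thaliana", "Oryza sativa"]),
    (90, ["Cyanidioschyzon merolae",
          (98, ["Thalassiosira pseudonana", "Phaeodactylum tricornutum"])]),
])

red_origin = has_supported_clade(
    toy_tree, LINEAGE, {"red", "chromalveolate"},
    optional={"green", "glaucophyte"},
)
```

The counts in the text follow the same filtering logic: removing 1,907 trees from the 4,894 candidates leaves 4,894 − 1,907 = 2,987 trees for the red algal search.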
Following this, we inferred ML phylogenies for each of the 288 genes using
RAxML [24] (WAG model [27]; 100 bootstrap replicates). Using the same approach
for detecting secondary EGT (described above), we identified 124 genes in chromalveolates with a putative red algal origin, and clustered these into 80 distinct families. We manually annotated the functions of these gene families. Blast2GO [28] was
used to annotate each family based on significant matches (e-value ≤ 10⁻⁵) in the
Gene Ontology (GO) database (http://geneontology.org/), for the three GO classes:
molecular function, biological process, and cellular component. The GO protein
target prediction was complemented with PSORT [29] and Predotar [30]. Plastid-targeting localization was inferred when two out of the three prediction methods
yielded positive results.
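The two-out-of-three rule can be written as a small majority vote. The sketch below assumes that each predictor's output has already been reduced to a yes/no call for plastid targeting; the function and argument names are hypothetical.

```python
def plastid_targeted(go_call, psort_call, predotar_call):
    """Majority rule: positive when at least two of the three calls are positive."""
    votes = [bool(go_call), bool(psort_call), bool(predotar_call)]
    return sum(votes) >= 2


# Hypothetical calls for three families.
family_a = plastid_targeted(True, True, False)    # two positives
family_b = plastid_targeted(False, True, False)   # one positive
family_c = plastid_targeted(True, True, True)     # three positives
```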
To examine the significance of the observed monophyly between chromalveolates
and Plantae, we repeated the phylogenomic analysis using a dataset that excluded
Plantae genomes (glaucophytes, red, and green algae), and compared the observed
monophyly between chromalveolates and the other lineages with the existing results
(dataset inclusive of Plantae genomes). As shown in Figure 2, the distributions of
the observed monophyly between chromalveolates and non-Plantae are not significantly different between the two instances, i.e., when Plantae genomes are included
or not (Kolmogorov-Smirnov test [31], p-value > 0.05). This finding suggests that
the observed monophyletic relationship between chromalveolates and Plantae is
non-random, and not biased by a secondary or tertiary association between chromalveolates and the other lineages. The strong association between chromalveolates
and Bacteria (33.6%) in the dataset that excluded Plantae genomes can be explained
by the presence of cyanobacterial genes, which have originated via primary EGT
(most of which are of plastid function). The (cyano)bacterial association with diatom genes can therefore be explained by endosymbiosis and not by other scenarios
that involve LGT from prokaryotes.
Fig. 2. Distribution of monophyly between chromalveolates and different lineages, for Thalassiosira pseudonana genes that showed a potential algal ancestry. The Y-axis represents the percentage of monophyletic relationships recovered, the X-axis represents the different lineages of
prokaryotes, eukaryotes, and viruses. The blue and red bars represent the distributions across the
dataset inclusive and exclusive of Plantae genomes, respectively. [Bar chart; lineages shown: Bacteria (including cyanobacteria) and Archaea (prokaryotes); Plantae, Animalia, Excavata, Amoebozoa, Fungi, and Rhizaria (eukaryotes); viruses.]
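A two-sample Kolmogorov-Smirnov comparison of this kind can be sketched in plain Python. The percentages below are invented toy values, not the data behind Figure 2, and the p-value uses the standard asymptotic series with a small-sample correction, which is only an approximation for so few lineages. Nothing here is claimed to reproduce the exact procedure used in the study.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value (approximate for small samples)."""
    effective_n = n * m / (n + m)
    root = math.sqrt(effective_n)
    lam = (root + 0.12 + 0.11 / root) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the p-value is ~1
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical percentages of monophyly per non-Plantae lineage.
with_plantae = [31.0, 6.0, 5.0, 3.0, 2.0, 1.0, 0.5]
without_plantae = [33.0, 6.5, 5.5, 2.5, 2.0, 1.0, 0.5]

d_stat = ks_statistic(with_plantae, without_plantae)
p_value = ks_pvalue(d_stat, len(with_plantae), len(without_plantae))
not_significantly_different = p_value > 0.05
```

With these toy values the two distributions differ by at most one rank position at any point, so the test does not reject the hypothesis that they come from the same distribution.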
3. EGT of red algal genes in Thalassiosira pseudonana
We observe 124 (1.1% of the total 11,390) protein-coding sequences from the genome
of T. pseudonana to have a red algal origin. The phylogenetic trees built with each
of these genes and their respective homologs show monophyly of the red algae and
chromalveolates with bootstrap support ≥ 75%. The genes are clustered into 80 putative families (Table 1). Among these gene families, 40 (50.0%) are well-annotated
with gene ontologies (complete annotation for ≥ 90% of the sequences in each family), whereas 18 (22.5%) are partially annotated (complete annotation for < 90%
of the sequences in each family). The remaining 22 (27.5%) are either incompletely
annotated or have no significant match in the gene ontology database. We consider
these 22 gene families to encode novel, unknown functions in the diatom.
Most of these families are represented in T. pseudonana by a
single-copy gene (58, 72.5%), with some containing two (14,
17.5%) or three (6, 7.5%) gene copies. There are two families in which the gene
is highly duplicated within the genome of T. pseudonana. These are the ABC-1
domain protein (7 copies) and light-harvesting protein (13 copies). As shown in
the last column of Table 1, 23 (28.8%) of the 80 gene families putatively code for
proteins targeted to the plastid, 21 (26.3%) putatively code for proteins targeted
to multiple organelles with the majority going to the plastid, 19 (23.8%) of the
proteins are potentially targeted to multiple organelles with the minority being the
plastid, whereas the remainder (17, 21.3%) putatively code for proteins that are not
targeted to the plastid. In parallel with gene ontology analysis, we do not observe an
N-terminal extension in the bacterial homologs of these 17 eukaryotic gene families,
suggesting that these genes are not targeted to membrane-bounded organelles. The
families in which the gene copy is highly duplicated in T. pseudonana are found
to be targeted to multiple organelles in the cell (including the mitochondrion and
nucleus) and are not restricted to the plastid.
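The tallies in this section can be cross-checked with a few lines of arithmetic. The counts are taken from the text and Table 1; the helper name and the half-up rounding to one decimal place are assumptions chosen so that the output matches the way the percentages are reported.

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_FAMILIES = 80


def percent(count, total=TOTAL_FAMILIES):
    """Percentage rounded half-up to one decimal place."""
    value = Decimal(100 * count) / Decimal(total)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Families per targeting category (+++, ++, +, -), as reported in the text.
targeting = {"+++": 23, "++": 21, "+": 19, "-": 17}
# Families per gene copy number in T. pseudonana (copies -> families).
copies = {1: 58, 2: 14, 3: 6, 7: 1, 13: 1}

targeting_percent = {label: str(percent(n)) for label, n in targeting.items()}
total_families_by_targeting = sum(targeting.values())
total_families_by_copies = sum(copies.values())
total_genes = sum(copy_number * families
                  for copy_number, families in copies.items())
```

Both breakdowns sum to the 80 families, and the copy numbers add up to the 124 genes reported for T. pseudonana.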
Table 1: Gene families showing a red algal origin in T. pseudonana. The number
of genes from the species in each family is shown (Genes). Whether a family
encodes putative plastid-targeted proteins is indicated in the last column,
based on GO annotations of cellular components for each family: completely
plastid-targeted (+++), targeted to multiple membrane-bounded organelles
with the majority to the plastid (++), targeted to multiple membrane-bounded organelles with a minority to the plastid (+), and not targeted to the plastid at all
(-).
No.  ID   Description                                               Genes  Plastid-targeted
1    49   bile acid:sodium symporter                                3      +++
2    33   sodium hydrogen exchanger                                 3      +++
3    15   ATP-dependent CLP protease proteolytic subunit            2      +++
4    21   HAD-superfamily hydrolase subfamily variant 3             2      +++
5    12   protease Do                                               2      +++
6    24   unknown protein                                           2      +++
7    63   2-c-methyl-d-erythritol 4-phosphate cytidylyltransferase  1      +++
8    17   3-dehydroquinate synthase                                 1      +++
9    50   aspartate aminotransferase                                1      +++
10   34   aspartate kinase                                          1      +++
11   31   carboxyl-terminal protease                                1      +++
12   57   fkbp-type peptidyl-prolyl cis-trans isomerase             1      +++
13   67   glycosyl transferase group 1                              1      +++
14   54   GTP pyrophosphokinase                                     1      +++
15   39   monogalactosyldiacylglycerol synthase                     1      +++
16   52   serine acetyltransferase                                  1      +++
17   56   small drug exporter protein                               1      +++
18   45   sulfolipid (UDP-sulfoquinovose) biosynthesis protein      1      +++
19   41   tRNA pseudouridine synthase a                             1      +++
20   44   unknown protein                                           1      +++
21   53   unknown protein                                           1      +++
22   78   unknown protein                                           1      +++
23   81   unknown protein                                           1      +++
24   4    light-harvesting protein                                  13     ++
25   8    ABC-1 domain protein                                      7      ++
26   27   phosphoglycolate phosphatase precursor                    2      ++
27   5    trehalose-6-phosphate synthase                            2      ++
28   3    ABC family transporter                                    1      ++
29   7    ATP-dependent RNA helicase                                1      ++
30   32   cysteinyl-tRNA synthase                                   1      ++
31   61   cytochrome C peroxidase                                   1      ++
32   48   dihydrodipicolinate reductase                             1      ++
33   69   methionyl aminopeptidase                                  1      ++
34   64   peptidyl-prolyl cis-transcyclophilin type                 1      ++
35   66   RNA polymerase sigma factor                               1      ++
36   72   thioredoxin-1                                             1      ++
37   28   translation elongation factor g                           1      ++
38   14   unknown protein                                           1      ++
39   18   unknown protein                                           1      ++
40   22   unknown protein                                           1      ++
41   26   unknown protein                                           1      ++
42   42   unknown protein                                           1      ++
43   76   unknown protein                                           1      ++
44   75   valyl-tRNA synthetase                                     1      ++
45   55   peroxisomal membrane protein                              3      +
46   23   unknown protein                                           3      +
47   16   zinc finger protein                                       3      +
48   62   histone deacetylase family protein                        2      +
49   11   hypothetical protein                                      2      +
50   51   phosphate phosphoenolpyruvate translocator precursor      2      +
51   43   protein phosphatase 2c related protein                    2      +
52   9    ABC transporter related protein                           1      +
53   46   cell division protein                                     1      +
54   74   DNA topoisomerase VI subunit a                            1      +
55   73   elongation factor 1 alpha                                 1      +
56   60   GTP binding protein                                       1      +
57   30   HAD superfamily (subfamily ig) 5-nucleotidase             1      +
58   20   heat shock protein 90                                     1      +
59   37   homogentisate solanesyltransferase                        1      +
60   80   NADH dehydrogenase                                        1      +
61   68   ribosomal protein s7                                      1      +
62   19   unknown protein                                           1      +
63   79   unknown protein                                           1      +
64   35   p-ATPase family transporter: cation                       3      -
65   10   anion exchange family protein                             2      -
66   40   prolyl-tRNA synthase                                      2      -
67   2    unknown protein                                           2      -
68   6    unknown protein                                           2      -
69   38   amine oxidase                                             1      -
70   59   chromodomain helicase DNA binding protein                 1      -
71   71   DNA topoisomerase VI subunit b                            1      -
72   36   glucose-6-phosphate isomerase                             1      -
73   70   glycerol-3-phosphate dehydrogenase (NAD+)                 1      -
74   65   HSP associated protein like                               1      -
75   47   s-adenosyl-l-homocysteine hydrolase                       1      -
76   1    unknown protein                                           1      -
77   25   unknown protein                                           1      -
78   29   unknown protein                                           1      -
79   58   unknown protein                                           1      -
80   77   unknown protein                                           1      -
Figure 3 shows the gene ontology annotations for all homologous sequences from
the 80 gene families, for each class of (a) molecular function, (b) biological process,
and (c) cellular component. As shown in panels (a) through (c), the families
have diverse functions that are involved in a variety of biological processes, and the
encoded proteins are targeted to various compartments within the cell. The gene
functions range from biomolecule binding and transport to catalytic activities. Most
of these genes are annotated as engaging in metabolic processes, whereas some are
related to cellular, regulatory, and localization processes.
Fig. 3. Gene ontology (GO) annotations of all homologous sequences in the 80 gene families
that show support for red algal origin in T. pseudonana. Annotations are shown for the classes (a)
molecular function at GO level 3; (b) biological process at GO level 2; (c) cellular component at
GO level 3. The numbers shown are percentages. [Pie charts; the largest categories are hydrolase activity (17.7) and nucleotide binding (16.1) in panel (a), metabolic process (46.6) and cellular process (33.7) in panel (b), and intracellular (28.7) and intracellular part (27.4) in panel (c).]
3.1. Examples of EGT in chromalveolates
Figure 4 and Figure 5 show three examples of EGT of red algal genes into the
nucleus of chromalveolates.
Figure 4 is the phylogeny of a gene family that putatively encodes plastid-targeted small drug exporter proteins, showing strong bootstrap support (92%)
for monophyly of an RRC group: a red alga, Cyanidioschyzon merolae, a rhizarian,
Bigelowiella natans, and three species of chromalveolates, including T. pseudonana.
In the absence of genetic transfer, the red algae and Rhizaria would be sister taxa to
the green algae. This phylogeny implies EGT from the ancestral lineage of the
red algae to the ancestral lineage of chromalveolates. In addition, the RRC grouping
also forms a monophyletic relationship with all gene copies present in bacteria (bootstrap support 100%), suggesting that the transferred gene is of an ancient bacterial
origin. The observation supports the notion of plastid endosymbiosis that plastids
in chromalveolates originated from red algae, whose plastids are in turn of cyanobacterial
origin.
Fig. 4. A maximum likelihood phylogeny showing an example of EGT of an annotated plastid-targeted protein from red algae to T. pseudonana (monophyly support for chromalveolates and
red algae). Numbers shown are bootstrap support values for each node. The scale bar is in
units of substitutions per site.
In contrast, Figure 5 shows the phylogenies of (a) a plastid-targeted gene family
and (b) a non-plastid-targeted gene family, both of unknown (and likely novel) function.
In the gene phylogeny shown in Figure 5(a), three species of red algae form a sister group
to three species of chromalveolates rather than to the green algae. The
monophyly of red algae and chromalveolates is strongly supported (bootstrap support 100%). Although the gene function is unknown, this family putatively encodes
proteins targeted only to plastids and might therefore play a role in
photosynthesis. For the gene phylogeny shown in Figure 5(b), homologous sequences are
absent in a large number of lineages. A non-EGT explanation would involve many
gene loss events along a large number of lineages. The most parsimonious explanation for such a gene phylogeny is therefore an EGT event from the ancestral lineage of the
red alga Cyanidioschyzon merolae to the ancestral lineage of the chromalveolates.
4. Performance and limitations
We have demonstrated the use of a rigorous, computational phylogenomic approach
to infer events of gene transfer within the context of plastid endosymbiosis. Our
approach is based on the implicit assumption that genes are transferred as a whole.
The transfer of genes in smaller fragments, which introduces within-gene discrepancies of phylogenetic signal, might not be fully recovered using this approach. In
addition, the efficiency of detecting phylogenetic signal can also be compromised
by sequence divergence and by the presence or absence of informative and/or invariant sites.
Therefore, the extent of genetic transfer inferred in this study is a conservative
estimate.
Fig. 5. Two maximum likelihood phylogenies showing EGT of red algal genes in T. pseudonana
(monophyly support for chromalveolates and red algae). The genes are of unknown function for
(a) a plastid-targeted gene family (gene family ID 81) and (b) a non-plastid-targeted gene family (gene family ID 58). Numbers shown are
bootstrap support values for each node. The scale bars are in units of substitutions per site.
In the current study, our approach shows a low false-positive discovery rate
of 1.23% (e.g., trees that return the incorrect monophyly of chromalveolates and
animals). In a preliminary study, we generated simulated eight-taxon protein sets
(sample size = 100, sequence length = 1000 amino acids) that evolved homogeneously at various degrees of sequence conservation. Our phylogenomic approach
yielded a 0% false-positive rate in recovering the target monophyletic relationships (data
not shown), with a 0.17% false-negative rate in cases where sequences are highly
divergent (average substitutions per site = 2). Under a more realistic evolutionary
regime, e.g., heterogeneous evolution with varied substitution rates along the same
or different lineages, the false-positive and false-negative rates are expected to be higher.
Based on bioinformatic predictions and analysis at a high statistical (bootstrap)
confidence, our findings suggest that genes that show a history of EGT from red
algae into T. pseudonana extend beyond plastid-related (e.g., photosynthetic) functions, and thus these transferred genes might make a much greater impact in genome
innovation of T. pseudonana than previously thought. Nevertheless, the extent of
such an impact in plastid endosymbiosis remains to be verified by experimental
approaches. The current approach is suitable for high-throughput detection of
whole-gene transfer within broader biological contexts at a multi-genome scale.
5. Authors’ contributions
AM designed and implemented the phylogenomic pipeline, conducted the phylogenomic analysis and contributed to the preparation of the manuscript draft. CXC
conducted downstream functional analysis of the gene families, wrote and prepared
the table, figures, and the manuscript draft. Both AM and CXC contributed to
the analysis of the results. MD, DZ, HA, NJ and TS conducted gene-by-gene phylogenetic analysis to validate the results from the pipeline. DB conceived of and
supervised this study. AM, CXC and DB conceived, edited and approved the final
manuscript.
6. Acknowledgments
This work was supported by a grant from the National Institutes of Health
(R01ES013679) awarded to DB. We acknowledge the intellectual input of Adrián
Reyes-Prieto and Valérie Reeb (University of Iowa) in this project.
References
[1] R. G. Beiko, T. J. Harlow and M. A. Ragan, Proc. Natl. Acad. Sci. U.S.A. 102, 14332
(2005).
[2] E. Lerat, V. Daubin, H. Ochman and N. A. Moran, PLoS Biology 3, Art. e130 (2005).
[3] T. Nosenko and D. Bhattacharya, BMC Evol. Biol. 7, Art. 173 (2007).
[4] D. Bhattacharya and T. Nosenko, J. Phycol. 44, 7 (2008).
[5] V. M. D’Costa, K. M. McGrann, D. W. Hughes and G. D. Wright, Science 311, 374
(2006).
[6] D. Bhattacharya, H. S. Yoon and J. D. Hackett, Bioessays 26, 50 (2004).
[7] G. I. McFadden, J. Phycol. 37, 951 (2001).
[8] T. Cavalier-Smith, J. Eukaryot. Microbiol. 46, 347 (1999).
[9] A. Reyes-Prieto, A. P. M. Weber and D. Bhattacharya, Ann. Rev. Genet. 41, 147
(2007).
[10] D. Bhattacharya, J. M. Archibald, A. P. M. Weber and A. Reyes-Prieto, Bioessays
29, 1239 (2007).
[11] S. B. Gould, R. F. Waller and G. I. McFadden, Annu. Rev. Plant Biol. 59, 491 (2008).
[12] J. D. Hackett, H. S. Yoon, M. B. Soares, M. F. Bonaldo, T. L. Casavant, T. E. Scheetz,
T. Nosenko and D. Bhattacharya, Curr. Biol. 14, 213 (2004).
[13] J. D. Hackett, H. S. Yoon, S. Li, A. Reyes-Prieto, S. E. Rummele and D. Bhattacharya,
Mol. Biol. Evol. 24, 1702 (2007).
[14] A. Reyes-Prieto, A. Moustafa and D. Bhattacharya, Curr. Biol. 18, 956 (2008).
[15] D. M. Nelson, P. Tréguer, M. A. Brzezinski, A. Leynaert and B. Quéguiner, Global
Biogeochem. Cycl. 9, 359 (1995).
[16] M. A. Brzezinski, C. J. Pride, V. M. Franck, D. M. Sigman, J. L. Sarmiento, K. Matsumoto, N. Gruber, G. H. Rau and K. H. Coale, Geophys. Res. Lett. 29, 1564 (2002).
[17] E. V. Armbrust, J. A. Berges, C. Bowler, B. R. Green, D. Martinez, N. H. Putnam,
S. G. Zhou, A. E. Allen, K. E. Apt, M. Bechner, M. A. Brzezinski, B. K. Chaal,
A. Chiovitti, A. K. Davis, M. S. Demarest, J. C. Detter, T. Glavina, D. Goodstein,
M. Z. Hadi, U. Hellsten, M. Hildebrand, B. D. Jenkins, J. Jurka, V. V. Kapitonov,
N. Kröger, W. W. Y. Lau, T. W. Lane, F. W. Larimer, J. C. Lippmeier, S. Lucas, M. Medina, A. Montsant, M. Obornik, M. S. Parker, B. Palenik, G. J. Pazour,
P. M. Richardson, T. A. Rynearson, M. A. Saito, D. C. Schwartz, K. Thamatrakoln,
K. Valentin, A. Vardi, F. P. Wilkerson and D. S. Rokhsar, Science 306, 79 (2004).
[18] J. A. Eisen and C. M. Fraser, Science 300, 1706 (2003).
[19] J. Huang, G. S. V. Aller, A. N. Taylor, J. J. Kerrigan, W. S. Liu, J. M. Trulli, Z. Lai,
D. Holmes, K. M. Aubart, J. R. Brown and M. Zalacain, J. Bacteriol. 188, 5249
(2006).
[20] U. John, B. Beszteri, E. Derelle, Y. V. de Peer, B. Read, H. Moreau and A. Cembella,
Protist 159, 21 (2008).
[21] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, J. Mol. Biol. 215,
403 (1990).
[22] R. C. Edgar, Nucl. Acids Res. 32, 1792 (2004).
[23] N. Saitou and M. Nei, Mol. Biol. Evol. 4, 406 (1987).
[24] A. Stamatakis, Bioinformatics 22, 2688 (2006).
[25] A. Moustafa and D. Bhattacharya, BMC Evol. Biol. 8, Art. 6 (2008).
[26] M. Matsuzaki, O. Misumi, I. T. Shin, S. Maruyama, M. Takahara, S. Y. Miyagishima, T. Mori, K. Nishida, F. Yagisawa, Y. Yoshida, Y. Nishimura, S. Nakao,
T. Kobayashi, Y. Momoyama, T. Higashiyama, A. Minoda, M. Sano, H. Nomoto,
K. Oishi, H. Hayashi, F. Ohta, S. Nishizaka, S. Haga, S. Miura, T. Morishita,
Y. Kabeya, K. Terasawa, Y. Suzuki, Y. Ishii, S. Asakawa, H. Takano, N. Ohta,
H. Kuroiwa, K. Tanaka, N. Shimizu, S. Sugano, N. Sato, H. Nozaki, N. Ogasawara,
Y. Kohara and T. Kuroiwa, Nature 428, 653 (2004).
[27] S. Whelan and N. Goldman, Mol. Biol. Evol. 18, 691 (2001).
[28] A. Conesa, S. Götz, J. M. García-Gómez, J. Terol, M. Talón and M. Robles, Bioinformatics 21, 3674 (2005).
[29] P. Horton, K. J. Park, T. Obayashi, N. Fujita, H. Harada, C. Adams-Collier and
K. Nakai, Nucl. Acids Res. 35, W585 (2007).
[30] I. Small, N. Peeters, F. Legeai and C. Lurin, Proteomics 4, 1581 (2004).
[31] F. J. Massey, J. Am. Stat. Assoc. 46, 68 (1951).