A Phylogenomic Approach for Studying Plastid Endosymbiosis

2008, Genome Informatics 2008

Gene transfer is a major contributing factor to functional innovation in genomes. Endosymbiotic gene transfer (EGT) is a specific instance of lateral gene transfer (LGT) in which genetic materials are acquired by the host genome from an endosymbiont that has been engulfed and retained in the cytoplasm. Here we present a comprehensive approach for detecting gene transfer within a phylogenetic framework. We applied the approach to examine EGT of red algal genes into Thalassiosira pseudonana, a free-living diatom for which a complete genome sequence has recently been determined. Out of 11,390 predicted protein-coding sequences from the genome of T. pseudonana, 124 (1.1%, clustered into 80 gene families) are inferred to be of red algal origin (bootstrap support ≥ 75%). Of these 80 gene families, 22 (27.5%) encode novel, unknown functions. We found 21.3% of the gene families to putatively encode non-plastid-targeted proteins. Our results suggest that EGT of red algal genes provides a relatively minor contribution to the nuclear genome of the diatom, but the transferred genes have functions that extend beyond photosynthesis. This assertion awaits experimental validation. Whereas the current study is focused within the context of secondary endosymbiosis, our approach can be applied to large-scale detection of gene transfer in any system.

Genome Informatics 21: 165-176 (2008)

A PHYLOGENOMIC APPROACH FOR STUDYING PLASTID ENDOSYMBIOSIS

AHMED MOUSTAFA1 *, CHEONG XIN CHAN2 *, MEGAN DANFORTH2, DAVID ZEAR2, HIBA AHMED2, NAGNATH JADHAV2, TREVOR SAVAGE2, DEBASHISH BHATTACHARYA1,2

*These authors contributed equally to this work.
1 Interdisciplinary Genetics Program, University of Iowa, Iowa City, IA 52242, U.S.A.
2 Department of Biology and Roy J. Carter Center for Comparative Genomics, University of Iowa, Iowa City, IA 52242, U.S.A.
Keywords: phylogenomics; endosymbiotic gene transfer; lateral gene transfer; plastid; chromalveolates.

1. Introduction

Lateral gene transfer (LGT) is a phenomenon in which genetic materials are transmitted between non-lineal individuals (e.g., between two different strains or species). This phenomenon is one of the major mechanisms for functional innovation in the genomes of prokaryotes [1, 2] and eukaryotes [3, 4], as well as for the acquisition of new virulence genes in pathogens [5]. Therefore, the elucidation of gene transfer events will enhance our understanding of how genomes evolve. Here we present a systematic approach for detecting LGT within the context of plastid endosymbiosis.

1.1. Plastid endosymbiosis and gene transfer

The origin and establishment of the photosynthetic organelle (plastid) in algae and plants are important for understanding biotic evolution because these taxa form the primary food source for all life on earth. The endosymbiosis hypothesis postulates that the plastid originated from the ancient engulfment and retention of a free-living cyanobacterium (the endosymbiont) by a heterotrophic, unicellular protist. This ancestral photosynthetic eukaryote diversified into the red, green, and glaucophyte algae [6, 7]. Subsequently, a secondary endosymbiosis occurred, in which a red alga that had gained its photosynthetic capability from primary endosymbiosis was itself engulfed by a non-photosynthetic protist, giving rise to the progenitor of the eukaryote supergroup Chromalveolata [7, 8]. The process of endosymbiosis and the origin of the plastid are detailed in [9–11] and Figure 1 in [6]. Endosymbiosis led to the transfer of genetic material from the endosymbiont to the host nuclear genome via endosymbiotic gene transfer (EGT), which is a specific case of LGT. Chromalveolata is one of the six major “supergroups” of eukaryotes.
This lineage consists of a taxonomically diverse group of species that are of high ecological and economic importance, including diatoms, seaweeds, dinoflagellates, and the malaria parasite Plasmodium. Our group has previously demonstrated EGT (and LGT) in chromalveolate genomes [3, 12–14], but the extent of EGT from red algae into chromalveolates, vis-à-vis secondary endosymbiosis, has not been studied in a rigorous manner. Among the chromalveolates, diatoms are unicellular eukaryotes and one of the primary contributors to the marine food chain. The diatoms are estimated to generate ≈ 40% of the organic carbon produced annually in the sea [15]. These taxa affect the flux of atmospheric carbon dioxide into the oceans, which in turn has effects on global climate [16]. Recently, the genome of the free-living diatom Thalassiosira pseudonana was sequenced to completion [17]. Using the available genomic sequences, here we present a rigorous, phylogenomic pipeline to examine the extent of EGT of red algal genes in T. pseudonana, and investigate if these transferred genes are restricted to photosynthesis-related functions.

2. A phylogenomic approach for inferring phylogenies

With the increasing amount of available genome data, phylogenomics, the intersection of evolutionary and genomic approaches [18], has become a key instrument in studying genomes on a gene-by-gene basis. This is done primarily by the automated generation and inspection of phylogenetic trees. In many recent studies, phylogenomics has been employed to answer various questions including, e.g., prediction of biochemical gene functions [19], evolution of gene functions [20], detection of gene transfer events [1, 3], and resolution of complex taxonomic relationships [13]. Our phylogenomic pipeline consists of four basic steps as shown in Figure 1.
First, homologous genes for the target sequences are identified (step 1) using WU-BLAST (http://blast.wustl.edu/) searches against a database containing sequences collected from public resources, e.g. NCBI (http://www.ncbi.nlm.nih.gov/) and JGI (http://www.jgi.doe.gov/). We used WU-BLAST because this program shows higher time-efficiency than the original BLAST algorithm [21]. Following this, multiple sequence alignment (step 2) is performed for each homologous gene family prior to phylogeny inference (step 3). We used MUSCLE [22] to align the sequences, and both neighbor-joining (NJ) [23] and maximum likelihood (ML) [24] to reconstruct the phylogenies, because these yield high accuracy in a reasonably short period of time [22, 24]. However, other approaches for sequence alignment and phylogeny inference can easily be incorporated into our pipeline. Finally, once the phylogeny for each gene family is obtained, these can be searched for topological patterns of interest (step 4). In the current study, we used PhyloSort [25] to sort and examine monophyletic relationships between chromalveolates and other taxa of interest.

Fig. 1. A schematic diagram of the phylogenomic pipeline: functional components and data flow. [Figure not reproduced. Components shown: (1) identification of homologous genes (FASTA query and target sequences, MySQL database, PERL export, WU-BLAST, XML parsing in Java and PERL); (2) multiple sequence alignment (e.g. MUSCLE); (3) phylogeny inference (e.g. RAxML), with refinement and conversion in Java and PHYLIP-format files; (4) topological analysis of phylogeny by phylogeny sorting (PhyloSort) for patterns of interest.]

2.1. Analysis of EGT in Thalassiosira pseudonana

We obtained all 11,390 predicted protein-coding sequences from the complete Thalassiosira pseudonana genome from JGI (http://www.jgi.gov/).
We performed a preliminary screening using BLAST (at e-value ≤ 0.001) for sequences that are highly similar to and thus possibly share a common ancestry (i.e., homologous) with the genes in red algae. Using 5,014 protein sequences from the complete genome of the red alga Cyanidioschyzon merolae [26], we found 4,894 (43.0% of 11,390) protein-coding sequences in T. pseudonana to have homologs in C. merolae. These protein-coding sequences were used as input in our phylogenomic pipeline that utilizes our local database, which consists of 2,555,575 sequences from 62 eukaryote genomes, inclusive of complete and partial expressed sequence tag (EST) sequences spanning Plantae, chromalveolates, Rhizaria, excavates, animals, fungi, and Amoebozoa, and 500 complete bacterial genomes. Initially, the phylogenetic trees were constructed using NJ with a Poisson-distance correction and 100 replicates for the bootstrap analysis. By searching for the monophyly of cyanobacteria and chromalveolates, with or without Plantae, we identified and removed 1,907 chromalveolate genes with a potential cyanobacterial origin. This step was designed to exclude genes that were introduced via EGT into the red algal nucleus as a result of primary endosymbiosis. For the remaining 2,987 trees, we searched for the monophyly of red algae and chromalveolates, with or without green and glaucophyte algae (≥ 75% bootstrap support). We identified 288 protein-coding sequences in T. pseudonana with potential red algal origin through EGT (as a result of secondary endosymbiosis). Following this, we inferred ML phylogenies for each of the 288 genes using RAxML [24] (WAG model [27]; 100 bootstrap replicates). Using the same approach for detecting secondary EGT (described above), we identified 124 genes in chromalveolates with a putative red algal origin, and clustered these into 80 distinct families. We manually annotated the functions of these gene families.
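The monophyly screen described above (red algae plus chromalveolates, with or without green and glaucophyte algae, at ≥ 75% bootstrap support) can be illustrated with a minimal sketch. This is not PhyloSort and not the authors' implementation: the nested-tuple tree encoding, the function names, the toy tree, and its support values are all assumptions made for illustration, and the tree is treated as rooted, which a real tool working with unrooted gene trees would have to handle more carefully.

```python
from typing import AbstractSet, Dict, List, Tuple, Union

# Hypothetical encoding: a leaf is a taxon name; an internal node is
# (bootstrap_support, [children]).
Tree = Union[str, Tuple[float, List["Tree"]]]


def leaves(node: Tree) -> List[str]:
    """Return all taxon names below a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [name for child in children for name in leaves(child)]


def has_supported_clade(
    tree: Tree,
    group_of: Dict[str, str],
    required: AbstractSet[str],
    optional: AbstractSet[str] = frozenset(),
    min_support: float = 75.0,
) -> bool:
    """True if some clade with support >= min_support contains every
    required lineage and nothing outside required + optional."""
    if isinstance(tree, str):
        return False
    support, children = tree
    groups = {group_of[name] for name in leaves(tree)}
    allowed = set(required) | set(optional)
    if support >= min_support and set(required) <= groups <= allowed:
        return True
    return any(
        has_supported_clade(child, group_of, required, optional, min_support)
        for child in children
    )


# Toy tree with made-up support values (illustration only).
group_of = {
    "Synechococcus elongatus": "cyanobacteria",
    "Arabidopsis thaliana": "green",
    "Oryza sativa": "green",
    "Cyanidioschyzon merolae": "red",
    "Thalassiosira pseudonana": "chromalveolate",
    "Phaeodactylum tricornutum": "chromalveolate",
}
toy = (0.0, [
    "Synechococcus elongatus",
    (100.0, [
        (100.0, ["Arabidopsis thaliana", "Oryza sativa"]),
        (92.0, [
            "Cyanidioschyzon merolae",
            (98.0, ["Thalassiosira pseudonana", "Phaeodactylum tricornutum"]),
        ]),
    ]),
])

print(has_supported_clade(toy, group_of, {"red", "chromalveolate"}))
```

Passing `optional={"green", "glaucophyte"}` corresponds to the "with or without green and glaucophyte algae" variant of the search; raising `min_support` makes the screen stricter.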
Blast2GO [28] was used to annotate each family based on significant matches (e-value ≤ 10^-5) in the Gene Ontology (GO) database (http://geneontology.org/), for the three GO classes: molecular function, biological processes, and cellular components. The GO protein target prediction was complemented with PSORT [29] and Predotar [30]. Plastid-targeting localization was inferred when two out of the three prediction methods yielded positive results.

Fig. 2. Distribution of monophyly between chromalveolates and different lineages, for Thalassiosira pseudonana genes that showed a potential algal ancestry. The Y-axis represents the percentage of monophyletic relationships recovered, the X-axis represents the different lineages of prokaryotes, eukaryotes, and viruses. The blue and red bars represent the distributions across the dataset inclusive and exclusive of Plantae genomes, respectively. [Figure not reproduced.]

To examine the significance of the observed monophyly between chromalveolates and Plantae, we repeated the phylogenomic analysis using a dataset that excluded Plantae genomes (glaucophytes, red, and green algae), and compared the observed monophyly between chromalveolates and the other lineages, with the existing results (dataset inclusive of Plantae genomes). As shown in Figure 2, the distributions of the observed monophyly between chromalveolates and non-Plantae are not significantly different between the two instances, i.e., when Plantae genomes are included or not (Kolmogorov-Smirnov test [31], p-value > 0.05). This finding suggests that the observed monophyletic relationship between chromalveolates and Plantae is non-random, and not biased by a secondary or tertiary association between chromalveolates and the other lineages.
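For readers unfamiliar with the Kolmogorov-Smirnov comparison mentioned above, the following is a minimal two-sample sketch. How the authors set up their test is not stated, so everything here is an assumption: the function names, the use of the large-sample 5% critical value 1.358·sqrt((n+m)/(n·m)), and above all the two input lists, which are made-up percentages and not the values plotted in Figure 2.

```python
from bisect import bisect_right
from math import sqrt
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(a), sorted(b)
    d = 0.0
    for v in set(xs) | set(ys):
        fa = bisect_right(xs, v) / len(xs)  # fraction of a that is <= v
        fb = bisect_right(ys, v) / len(ys)  # fraction of b that is <= v
        d = max(d, abs(fa - fb))
    return d


def differs_at_5_percent(a: Sequence[float], b: Sequence[float]) -> bool:
    """Large-sample approximation: reject equality if D exceeds the
    5% critical value 1.358 * sqrt((n + m) / (n * m))."""
    n, m = len(a), len(b)
    critical = 1.358 * sqrt((n + m) / (n * m))
    return ks_statistic(a, b) > critical


# Hypothetical per-lineage percentages (illustration only, not paper data).
with_plantae = [28.0, 12.0, 9.0, 7.0, 5.0, 3.0, 2.0, 1.0]
without_plantae = [30.0, 13.0, 10.0, 8.0, 5.5, 3.5, 2.5, 1.5]

print(ks_statistic(with_plantae, without_plantae))
print(differs_at_5_percent(with_plantae, without_plantae))
```

With samples this small the asymptotic critical value is only a rough guide; an exact small-sample table or a statistics library would be the safer choice in practice.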
The strong association between chromalveolates and Bacteria (33.6%) in the dataset that excluded Plantae genomes can be explained by the presence of cyanobacterial genes, which have originated via primary EGT (most of which are of plastid function). The (cyano)bacterial association with diatom genes can therefore be explained by endosymbiosis and not by other scenarios that involve LGT from prokaryotes.

3. EGT of red algal genes in Thalassiosira pseudonana

We observe 124 (1.1% of the total 11,390) protein-coding sequences from the genome of T. pseudonana to have a red algal origin. The phylogenetic trees built with each of these genes and their respective homologs show monophyly of the red algae and chromalveolates with bootstrap support ≥ 75%. The genes are clustered into 80 putative families (Table 1). Among these gene families, 40 (50.0%) are well-annotated with gene ontologies (complete annotation for ≥ 90% of the sequences in each family), whereas 18 (22.5%) are partially annotated (complete annotation for < 90% of the sequences in each family). The remaining 22 (27.5%) are either incompletely annotated or have no significant match in the gene ontology database. We consider these 22 gene families to encode novel, unknown functions in the diatom. Most of the families are represented in T. pseudonana by a single-copy sequence (58, 72.5%), with some containing two (14, 17.5%) or three (6, 7.5%) gene copies. There are two families in which the gene is highly duplicated within the genome of T. pseudonana. These are the ABC-1 domain protein (7 copies) and light-harvesting protein (13 copies).
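As a quick consistency check (not part of the original analysis), the copy-number breakdown above can be tallied against the totals reported in this section. The variable names are illustrative; the counts are those given in the text.

```python
# Copy-number classes from the text: copies per family -> number of families.
families_by_copy_number = {1: 58, 2: 14, 3: 6, 7: 1, 13: 1}

# Number of gene families and number of genes implied by the breakdown.
total_families = sum(families_by_copy_number.values())
total_genes = sum(copies * n for copies, n in families_by_copy_number.items())

# Share of families in each copy-number class, in percent.
share = {
    copies: round(100.0 * n / total_families, 1)
    for copies, n in families_by_copy_number.items()
}

# Fraction of the 11,390 predicted protein-coding sequences, in percent.
genome_fraction = round(100.0 * total_genes / 11390, 1)

print(total_families, total_genes)
print(share[1], share[2], share[3])
print(genome_fraction)
```

The tally reproduces the 80 families, the 124 genes, and the 1.1% genome fraction, so the per-class counts and the headline figures are mutually consistent.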
As shown in the last column of Table 1, 23 (28.8%) of the 80 gene families putatively code for proteins targeted to the plastid, 21 (26.3%) putatively code for proteins targeted to multiple organelles with the majority going to the plastid, 19 (23.8%) of the proteins are potentially targeted to multiple organelles with the minority being the plastid, whereas the remainder (17, 21.3%) putatively code for proteins that are not targeted to the plastid. In parallel with gene ontology analysis, we do not observe an N-terminal extension in the bacterial homologs of these 17 eukaryotic gene families, suggesting that these genes are not targeted to membrane-bounded organelles. The families in which the gene copy is highly duplicated in T. pseudonana are found to be targeted to multiple organelles in the cell (including the mitochondrion and nucleus) and are not restricted to the plastid.

Table 1: Gene families showing a red algal origin in T. pseudonana. The number of genes from the species in each family is shown. The last column indicates whether a family encodes putative plastid-targeted proteins, based on GO annotations of cellular components for each family: completely plastid-targeted (+++), targeted to multiple membrane-bounded organelles with the majority to the plastid (++), targeted to multiple membrane-bounded organelles with the minority to the plastid (+), and not targeted to the plastid at all (-).

No. | ID | Description | No. of genes in T. pseudonana | Plastid-targeted (+/-)
1 | 49 | bile acid:sodium symporter | 3 | +++
2 | 33 | sodium hydrogen exchanger | 3 | +++
3 | 15 | ATP-dependent CLP protease proteolytic subunit | 2 | +++
4 | 21 | HAD-superfamily hydrolase subfamily variant 3 | 2 | +++
5 | 12 | protease Do | 2 | +++
6 | 24 | unknown protein | 2 | +++
7 | 63 | 2-c-methyl-d-erythritol 4-phosphate cytidylyltransferase | 1 | +++
8 | 17 | 3-dehydroquinate synthase | 1 | +++
9 | 50 | aspartate aminotransferase | 1 | +++
10 | 34 | aspartate kinase | 1 | +++
11 | 31 | carboxyl-terminal protease | 1 | +++
12 | 57 | fkbp-type peptidyl-prolyl cis-trans isomerase | 1 | +++
13 | 67 | glycosyl transferase group 1 | 1 | +++
14 | 54 | GTP pyrophosphokinase | 1 | +++
15 | 39 | monogalactosyldiacylglycerol synthase | 1 | +++
16 | 52 | serine acetyltransferase | 1 | +++
17 | 56 | small drug exporter protein | 1 | +++
18 | 45 | sulfolipid (UDP-sulfoquinovose) biosynthesis protein | 1 | +++
19 | 41 | tRNA pseudouridine synthase a | 1 | +++
20 | 44 | unknown protein | 1 | +++
21 | 53 | unknown protein | 1 | +++
22 | 78 | unknown protein | 1 | +++
23 | 81 | unknown protein | 1 | +++
24 | 4 | light-harvesting protein | 13 | ++
25 | 8 | ABC-1 domain protein | 7 | ++
26 | 27 | phosphoglycolate phosphatase precursor | 2 | ++
27 | 5 | trehalose-6-phosphate synthase | 2 | ++
28 | 3 | ABC family transporter | 1 | ++
29 | 7 | ATP-dependent RNA helicase | 1 | ++
30 | 32 | cysteinyl-tRNA synthase | 1 | ++
31 | 61 | cytochrome C peroxidase | 1 | ++
32 | 48 | dihydrodipicolinate reductase | 1 | ++
33 | 69 | methionyl aminopeptidase | 1 | ++
34 | 64 | peptidyl-prolyl cis-trans cyclophilin type | 1 | ++
35 | 66 | RNA polymerase sigma factor | 1 | ++
36 | 72 | thioredoxin-1 | 1 | ++
37 | 28 | translation elongation factor g | 1 | ++
38 | 14 | unknown protein | 1 | ++
39 | 18 | unknown protein | 1 | ++
40 | 22 | unknown protein | 1 | ++
41 | 26 | unknown protein | 1 | ++
42 | 42 | unknown protein | 1 | ++
43 | 76 | unknown protein | 1 | ++
44 | 75 | valyl-tRNA synthetase | 1 | ++
45 | 55 | peroxisomal membrane protein | 3 | +
46 | 23 | unknown protein | 3 | +
47 | 16 | zinc finger protein | 3 | +
48 | 62 | histone deacetylase family protein | 2 | +
49 | 11 | hypothetical protein | 2 | +
50 | 51 | phosphate phosphoenolpyruvate translocator precursor | 2 | +
51 | 43 | protein phosphatase 2c related protein | 2 | +
52 | 9 | ABC transporter related protein | 1 | +
53 | 46 | cell division protein | 1 | +
54 | 74 | DNA topoisomerase VI subunit a | 1 | +
55 | 73 | elongation factor 1 alpha | 1 | +
56 | 60 | GTP binding protein | 1 | +
57 | 30 | HAD superfamily (subfamily ig) 5-nucleotidase | 1 | +
58 | 20 | heat shock protein 90 | 1 | +
59 | 37 | homogentisate solanesyltransferase | 1 | +
60 | 80 | NADH dehydrogenase | 1 | +
61 | 68 | ribosomal protein s7 | 1 | +
62 | 19 | unknown protein | 1 | +
63 | 79 | unknown protein | 1 | +
64 | 35 | p-ATPase family transporter: cation | 3 | -
65 | 10 | anion exchange family protein | 2 | -
66 | 40 | prolyl-tRNA synthase | 2 | -
67 | 2 | unknown protein | 2 | -
68 | 6 | unknown protein | 2 | -
69 | 38 | amine oxidase | 1 | -
70 | 59 | chromodomain helicase DNA binding protein | 1 | -
71 | 71 | DNA topoisomerase VI subunit b | 1 | -
72 | 36 | glucose-6-phosphate isomerase | 1 | -
73 | 70 | glycerol-3-phosphate dehydrogenase (NAD+) | 1 | -
74 | 65 | HSP associated protein like | 1 | -
75 | 47 | s-adenosyl-l-homocysteine hydrolase | 1 | -
76 | 1 | unknown protein | 1 | -
77 | 25 | unknown protein | 1 | -
78 | 29 | unknown protein | 1 | -
79 | 58 | unknown protein | 1 | -
80 | 77 | unknown protein | 1 | -

Figure 3 shows the gene ontology annotations for all homologous sequences from the 80 gene families, for each class of (a) molecular function, (b) biological process, and (c) cellular component. As shown in panels (a) through (c), the families have diverse functions that are involved in a variety of biological processes, and the encoded proteins are targeted to various compartments within the cell. The gene functions range from biomolecule binding and transport to catalytic activities.
Most of these genes are annotated to engage in metabolic processes, whereas some are related to cellular, regulatory, and localization processes.

Fig. 3. Gene ontology (GO) annotations of all homologous sequences in the 80 gene families that show support for red algal origin in T. pseudonana. Annotations are shown for the classes (a) molecular function at GO level 3; (b) biological process at GO level 2; (c) cellular component at GO level 3. The numbers shown are in percentage. [Figure not reproduced; the labelled values are listed below.]
(a) molecular function: hydrolase activity (17.7), nucleotide binding (16.1), transferase activity (10.6), nucleic acid binding (7.7), ion binding (6.1), protein binding (6.0), oxidoreductase activity (5.2), ligase activity (3.9), isomerase activity (3.8), cofactor binding (3.4), transmembrane transporter activity (3.1), substrate-specific transporter activity (2.6), transcription-related activity (2.4), translation factor activity, nucleic acid binding (2.3), helicase activity (1.9), amine binding (1.8), others (5.5).
(b) biological processes: metabolic process (46.6), cellular process (33.7), biological regulation (4.4), localization (3.8), establishment of localization (3.7), response to stimulus (3.2), developmental processes (1.4), others (3.2).
(c) cellular component: intracellular (28.7), intracellular part (27.4), intracellular organelle (9.9), membrane-bounded organelle (8.7), membrane (6.9), protein complex (4.6), membrane part (3.3), intracellular organelle part (3.0), non-membrane-bounded organelle (1.8), organelle lumen (1.1), organelle membrane (1.0), organelle envelope (0.5), others (2.9).

3.1. Examples of EGT in chromalveolates

Figure 4 and Figure 5 show three examples of EGT of red algal genes into the nucleus of chromalveolates.
Figure 4 is the phylogeny of a gene family that putatively encodes plastid-targeted small drug exporter proteins, showing strong bootstrap support (92%) for monophyly of an RRC group: a red alga, Cyanidioschyzon merolae, a rhizarian, Bigelowiella natans, and three species of chromalveolates, including T. pseudonana. In the absence of genetic transfer, the red algae and Rhizaria would be sister taxa to the green algae. This phylogeny implies EGT from the ancestral lineage of the red algae to the ancestral lineage of chromalveolates. In addition, the RRC grouping also forms a monophyletic relationship with all gene copies present in bacteria (bootstrap support 100%), suggesting that the transferred gene is of an ancient bacterial origin. The observation supports the notion of plastid endosymbiosis that plastids in chromalveolates originated from red algae, which in turn are of a cyanobacterial origin.

Fig. 4. A maximum likelihood phylogeny showing an example of EGT of an annotated plastid-targeted protein from red algae to T. pseudonana (monophyly support for chromalveolates and red algae). Numbers shown are bootstrap support values for each node. The scale bar is shown in unit of substitution per site. [Figure not reproduced.]

In contrast, Figure 5 shows the phylogenies of (a) a plastid-targeted gene family and (b) a non-plastid-targeted gene family of unknown (and likely novel) functions.
In the gene phylogeny shown in Figure 5(a), three species of red algae form the sister taxa with three species of chromalveolates rather than with the green algae. The monophyly of red algae and chromalveolates is strongly supported at bootstrap support 100%. Although the gene function is unknown, this family putatively encodes proteins targeted only to plastids and might therefore play roles in the process of photosynthesis. For the gene phylogeny shown in Figure 5(b), homologous sequences are absent in a large number of lineages. A non-EGT explanation would involve many gene loss events along a large number of lineages. The most parsimonious explanation for such a gene phylogeny is an EGT event from the ancestral lineage of the red alga Cyanidioschyzon merolae to the ancestral lineage of the chromalveolates.

Fig. 5. Two maximum likelihood phylogenies showing EGT of red algal genes in T. pseudonana (monophyly support for chromalveolates and red algae). The genes are of unknown function for (a) a plastid-targeted gene family (gene family ID 81) and (b) a non-plastid-targeted gene family (gene family ID 58). Numbers shown are bootstrap support values for each node. The scale bars are shown in unit of substitution per site. [Figure not reproduced.]

4. Performance and limitations

We have demonstrated the use of a rigorous, computational phylogenomic approach to infer the events of gene transfer within the context of plastid endosymbiosis. Our approach is based on the implicit assumption that genes are transferred as a whole. The transfer of genes in smaller fragments, which introduces within-gene discrepancies of phylogenetic signal, might not be fully recovered using this approach. In addition, the efficiency of detecting phylogenetic signal can also be compromised by sequence divergence and by the presence or absence of informative and/or invariant sites. Therefore, the extent of genetic transfer inferred in this study is a conservative estimate.

In the current study, our approach shows a low false positive discovery rate of 1.23% (e.g., trees that return the incorrect monophyly of chromalveolates and animals). In a preliminary study, we generated simulated eight-taxon protein sets (sample size = 100, sequence length = 1000 amino acids) that are evolved homogeneously at various degrees of sequence conservation. Our phylogenomic approach yielded a 0% false positive rate in recovering the target monophyletic relationships (data not shown), with a 0.17% false negative rate in cases where sequences are highly divergent (average substitution per site = 2). Under a more realistic evolutionary regime, e.g., heterogeneous evolution with varied substitution rates along the same or different lineages, the false positive and negative rates are expected to be higher.

Based on bioinformatic predictions and analysis at a high statistical (bootstrap) confidence, our findings suggest that genes that show a history of EGT from red algae into T. pseudonana extend beyond plastid-related (e.g., photosynthetic) functions, and thus these transferred genes might make a much greater impact in genome innovation of T. pseudonana than previously thought.
Nevertheless, the extent of such an impact in plastid endosymbiosis remains to be verified by experimental approaches. The current approach is suitable for high-throughput detection of whole-gene transfer within broader biological contexts at a multi-genome scale.

5. Authors’ contributions

AM designed and implemented the phylogenomic pipeline, conducted the phylogenomic analysis and contributed to the preparation of the manuscript draft. CXC conducted downstream functional analysis of the gene families, wrote and prepared the table, figures, and the manuscript draft. Both AM and CXC contributed to the analysis of the results. MD, DZ, HA, NJ and TS conducted gene-by-gene phylogenetic analysis to validate the results from the pipeline. DB conceived of and supervised this study. AM, CXC and DB conceived, edited and approved the final manuscript.

6. Acknowledgments

This work was supported by a grant from the National Institutes of Health (R01ES013679) awarded to DB. We acknowledge the intellectual input of Adrián Reyes-Prieto and Valérie Reeb (University of Iowa) in this project.

References

[1] R. G. Beiko, T. J. Harlow and M. A. Ragan, Proc. Natl. Acad. Sci. U.S.A. 102, 14332 (2005).
[2] E. Lerat, V. Daubin, H. Ochman and N. A. Moran, PLoS Biology 3, Art. e130 (2005).
[3] T. Nosenko and D. Bhattacharya, BMC Evol. Biol. 7, Art. 173 (2007).
[4] D. Bhattacharya and T. Nosenko, J. Phycol. 44, 7 (2008).
[5] V. M. D’Costa, K. M. McGrann, D. W. Hughes and G. D. Wright, Science 311, 374 (2006).
[6] D. Bhattacharya, H. S. Yoon and J. D. Hackett, Bioessays 26, 50 (2004).
[7] G. I. McFadden, J. Phycol. 37, 951 (2001).
[8] T. Cavalier-Smith, J. Eukaryot. Microbiol. 46, 347 (1999).
[9] A. Reyes-Prieto, A. P. M. Weber and D. Bhattacharya, Ann. Rev. Genet. 41, 147 (2007).
[10] D. Bhattacharya, J. M. Archibald, A. P. M. Weber and A. Reyes-Prieto, Bioessays 29, 1239 (2007).
[11] S. B. Gould, R. F. Waller and G. I. McFadden, Annu. Rev. Plant Biol. 59, 491 (2008).
[12] J. D. Hackett, H. S. Yoon, M. B. Soares, M. F. Bonaldo, T. L. Casavant, T. E. Scheetz, T. Nosenko and D. Bhattacharya, Curr. Biol. 14, 213 (2004).
[13] J. D. Hackett, H. S. Yoon, S. Li, A. Reyes-Prieto, S. E. Rummele and D. Bhattacharya, Mol. Biol. Evol. 24, 1702 (2007).
[14] A. Reyes-Prieto, A. Moustafa and D. Bhattacharya, Curr. Biol. 18, 956 (2008).
[15] D. M. Nelson, P. Tréguer, M. A. Brzezinski, A. Leynaert and B. Quéguiner, Global Biogeochem. Cycl. 9, 359 (1995).
[16] M. A. Brzezinski, C. J. Pride, V. M. Franck, D. M. Sigman, J. L. Sarmiento, K. Matsumoto, N. Gruber, G. H. Rau and K. H. Coale, Geophys. Res. Lett. 29, 1564 (2002).
[17] E. V. Armbrust, J. A. Berges, C. Bowler, B. R. Green, D. Martinez, N. H. Putnam, S. G. Zhou, A. E. Allen, K. E. Apt, M. Bechner, M. A. Brzezinski, B. K. Chaal, A. Chiovitti, A. K. Davis, M. S. Demarest, J. C. Detter, T. Glavina, D. Goodstein, M. Z. Hadi, U. Hellsten, M. Hildebrand, B. D. Jenkins, J. Jurka, V. V. Kapitonov, N. Kröger, W. W. Y. Lau, T. W. Lane, F. W. Larimer, J. C. Lippmeier, S. Lucas, M. Medina, A. Montsant, M. Obornik, M. S. Parker, B. Palenik, G. J. Pazour, P. M. Richardson, T. A. Rynearson, M. A. Saito, D. C. Schwartz, K. Thamatrakoln, K. Valentin, A. Vardi, F. P. Wilkerson and D. S. Rokhsar, Science 306, 79 (2004).
[18] J. A. Eisen and C. M. Fraser, Science 300, 1706 (2003).
[19] J. Huang, G. S. V. Aller, A. N. Taylor, J. J. Kerrigan, W. S. Liu, J. M. Trulli, Z. Lai, D. Holmes, K. M. Aubart, J. R. Brown and M. Zalacain, J. Bacteriol. 188, 5249 (2006).
[20] U. John, B. Beszteri, E. Derelle, Y. V. de Peer, B. Read, H. Moreau and A. Cembella, Protist 159, 21 (2008).
[21] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, J. Mol. Biol. 215, 403 (1990).
[22] R. C. Edgar, Nucl. Acids Res. 32, 1792 (2004).
[23] N. Saitou and M. Nei, Mol. Biol. Evol. 4, 406 (1987).
[24] A. Stamatakis, Bioinformatics 22, 2688 (2006).
[25] A. Moustafa and D. Bhattacharya, BMC Evol. Biol. 8, Art. 6 (2008).
[26] M. Matsuzaki, O. Misumi, I. T. Shin, S. Maruyama, M. Takahara, S. Y. Miyagishima, T. Mori, K. Nishida, F. Yagisawa, Y. Yoshida, Y. Nishimura, S. Nakao, T. Kobayashi, Y. Momoyama, T. Higashiyama, A. Minoda, M. Sano, H. Nomoto, K. Oishi, H. Hayashi, F. Ohta, S. Nishizaka, S. Haga, S. Miura, T. Morishita, Y. Kabeya, K. Terasawa, Y. Suzuki, Y. Ishii, S. Asakawa, H. Takano, N. Ohta, H. Kuroiwa, K. Tanaka, N. Shimizu, S. Sugano, N. Sato, H. Nozaki, N. Ogasawara, Y. Kohara and T. Kuroiwa, Nature 428, 653 (2004).
[27] S. Whelan and N. Goldman, Mol. Biol. Evol. 18, 691 (2001).
[28] A. Conesa, S. Götz, J. M. García-Gómez, J. Terol, M. Talón and M. Robles, Bioinformatics 21, 3674 (2005).
[29] P. Horton, K. J. Park, T. Obayashi, N. Fujita, H. Harada, C. Adams-Collier and K. Nakai, Nucl. Acids Res. 35, W585 (2007).
[30] I. Small, N. Peeters, F. Legeai and C. Lurin, Proteomics 4, 1581 (2004).
[31] F. J. Massey, J. Am. Stat. Assoc. 46, 68 (1951).