Academia.eduAcademia.edu

Rensing et al 2008 Supplementary Online Material

Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) A) Materials, Methods and Analysis DNA was isolated from cultures derived from a single spore of the Gransden wild)type strain in 2004 (Gransden 2004 strain). Nine)day old protonemal tissue was grown on BCD+ ammonium tartrate medium overlaid with cellophane . Tissue was frozen in liquid nitrogen and ground to a coarse powder in a mortar and pestle. Nuclei were isolated from the frozen powder using the methods of Luo and Wing . The nuclear pellet was suspended in the residual buffer (1 ml) and served as the starting material for DNA isolation. The DNA was extracted using the Nucleon Phytopure plant DNA extraction kit (RPN 8511) from Amersham Bioscience. The initial data set was derived from 11 whole)genome shotgun (WGS) libraries: two with an insert size of 2)3 Kbp, four with an insert size of 6)8 Kbp, and five with an insert size of 35)40 Kbp. The reads were screened for vector using cross match, then trimmed for vector and quality. Reads shorter than 100 bases after trimming were excluded. Data sets before and after trimming are described below: Library 2)3 Kbp 6)8 Kbp 35)40 Kbp Reads (raw) 2,968,735 (3,312,360) 3,351,584 (3,567,314) 411,741 (508,990) Sequence (raw), Mbp 2,133 (3,466) 2,539 (3,588) 245 (523) The data were assembled using release 2.9.3 of Jazz, a WGS assembler developed at the Joint Genome Institute ( ). A word size of 15 was used for seeding alignments between reads. The unhashability threshold was set to 40, preventing words present in the data set in more than 40 copies from being used to seed alignments. A mismatch penalty of )30.0 was used, which will tend to assemble sequences that are more than about 97% identical. The assembly is represented by 2,106 scaffolds, the N50 being 111 scaffolds, the L50 1.32 Mbp. The largest scaffold is 5.39 Mbp in size; the total scaffold length is 480 Mbp and contains 5.4% gaps. In addition to the nuclear genome, we built 215 chloroplast and 25 mitochondrion scaffolds in the released assembly. The sequence depth derived from the assembly is 8.63 ± 0.10. To estimate the completeness of the assembly, a set of 251,086 ESTs was aligned to both the unassembled trimmed data set, and the assembly itself. A total of 247,484 ESTs (98.6%) were covered to more than 80% of their length by the unassembled data, while 247,613 ESTs (98.6%) yielded hits to the assembly. Based on the presence of start and stop codons, 4,517 genes (29%) are putatively full)length. Several genome analyses, gene prediction, and annotation methods were integrated into the JGI annotation pipeline to annotate the genome of . First, predicted transposable elements were masked in the genome assembly using RepeatMasker ( ) and a repeat library composed from a non)redundant set of (i) overrepresented oligonucleotides identified during the assembly process, (ii) fragments of draft gene models homologous to known transposable elements, and (iii) manually curated repeats. Second, gene models were built using several approaches. Initially, 3,154 putative full length genes with ORFs of 150 bp or longer were derived from 31,951 clusters of ESTs and mapped to the genomic sequence. Next, protein sequences from Genbank and IPI ( ) were aligned against the scaffolds using BLASTX ( ) and post) processed to co)linearize high scoring hits and to select the best non)overlapping set of BLAST alignments. These alignments were used primarily as seeds for the gene prediction tools Genewise ( ) and Fgenesh+ ( ). All resulting Genewise models were then extended to include the nearest 5’ methionine and 3’ stop codons. Subsequently, gene models were predicted using Fgenesh ( ) with parameters derived from training using known genes. In addition, 220,055 ESTs and the consensus sequences of their clusters were aligned with the scaffolds using BLAT ( ) and used to extend and correct predicted gene models where exons in the ESTs/cDNAs overlap and extend the gene model into flanking UTR. Over 225,000 putative gene models 1 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) were generated using the above mentioned gene predictors. Their translated amino acid sequences were aligned against known proteins from the NCBI non)redundant set and other databases such as KEGG ( ). In addition, each predicted model was analyzed for domain content/structure using InterproScan ( ) with a suite of tools such as Blast/HMM/ScanRegEx against the domain libraries Prints, Prosite, PFAM, ProDom and SMART. Finally, to produce a non)redundant set of 35,938 gene models, for every locus with overlapping models, the “best” model was selected according to homology with known proteins and EST support. Annotations for this set of genes were summarized in terms of Gene Ontology ( ), eukaryotic clusters of orthologs, KOGs, ( ) and KEGG pathways ( ). Predicted gene models and their annotations were further manually curated and submitted to GenBank. The average/median protein lengths are 363 aa/300 aa. The average/median transcript lengths are 1,196 bp/1,215 bp. 30,170 (84%) of the predicted proteins appear complete, based on the presence of start and stop codons; 4,517 genes (29%) are putatively full length (contain both 5’ and 3’ UTR). The majority of predicted genes are supported by various types of evidence: 35% of genes are supported by 220,055 ESTs and full length cDNAs; 37% are homologous to Swissprot proteins (table S4). Additionally, 12,129 genes (34%) were annotated in terms of Gene Ontology (GO) ( ), 15,932 (44 %) were assigned to eukaryotic orthologous groups (KOGs) ( ), and 789 distinct EC numbers were assigned to 4,110 (11%) proteins mapped to KEGG pathways ( ). Sequences from other origin than the desired source are a common problem of large scale sequencing projects. An obvious strategy to isolate such contaminant sequences is the determination of identity or homology to sequences of already sequenced organisms. The success of this approach relies on the availability of genomic sequence data of the contaminant or close relatives. A distribution plot of scaffold G/C content colored with the taxonomic information gathered by MegaBLAST searches revealed a suspicious secondary peak which was used to exclude scaffolds of obvious prokaryotic origin. However, some candidate genes from the remaining main genome scaffolds could not be amplified from genomic DNA, indicating remaining contaminants. In order to identify the scaffolds representing the contamination, we collected multiple parameters describing the scaffolds (EST alignment evidence, taxonomic information, gene model statistics, scaffold length, G/C content). Analysis of the taxonomic information gathered previously indicated the genus . Thus, we used a model to predict open reading frames on all scaffolds and annotated the predicted peptides by homology. Manual inspection revealed operon)like structures for suspected contaminant scaffolds and nearly no or only fragmentary ORFs for true scaffolds. In total, 27 parameters were used in a multivariate analysis, combining principal component analysis (PCA) and k)means clustering. Using this method, we were able to define four different fractions in the scaffolds (fig. S7). The predictions from the analysis were tested in experimentally. A total of 24 primer pairs were designed to test the separation of the clusters and to probe for the source of contamination. Based on this data we were able to confirm that cluster 2 accurately represents a bacterial contamination derived from an unknown species. By using the primers on the original DNA that was used to create the sequencing libraries we confirmed that this DNA was contaminated. However, there was and wet)lab evidence for further contaminations within cluster 3a and 3b. Initial evidence suggested that these sequences may originate in some mislabeled or switched plates, i.e. that organisms sequenced at the same time than pollute the data to some extent. We therefore carried out megaBLAST searches with the ! scaffolds against the publicly available microbial genomes that have been sequenced by JGI. There is evidence for several bacterial species (" # /$ % & ' # , ( , )* # , # # , +# , & #* ,# ) contributing to scaffolds within cluster 3a/b. In order to finish the v1.0 genome release, all 407 scaffolds belonging to cluster 2 were removed. In addition, 23 further scaffolds identified as contaminated by megaBLAST/PCR were removed. Using this procedure the ! partition represented in JGI’s genome browser was voided of the detected contaminants. 2 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) Version 1.1 of the genome assembly and annotation can be accessed through the JGI Genome Portal at http://www.jgi.doe.gov/Physcomitrella, where manual curation of this genome continues. The data are stored in a MySQL database with an interactive genome portal interface that allows a distributed group of international collaborators to view the genome, predictions, supporting evidence and other underlying data and make decisions about a particular transcript in any given pathway, gene family or system. This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the project accession ABEU00000000. The version described in this paper is the first version, ABEU01000000. Protein encoding genes are identified by a unique, six digit number. An approach based on RECON ( ) was used to identify potential repetitive elements within the genome sequence by virtue of their abundance within the assembly. RECON identifies potential repeat elements and attempts to group identified elements into related families; RECON does not rely on, nor is influenced by, collections of known repeats or similarity searches to known sequences. An iterative approach was taken: abundant sequence elements were identified within a 35 Mbp portion of the genome, a second 35 Mbp portion was added to the first, and the combined collection of 70 Mbp was masked with the elements identified within the first 35 Mbp portion. New elements were identified within the unmasked regions of the 70 Mbp portion, and these were combined with the first set of repeat elements and used to mask the collection of sequences representing the previous 70 Mbp of genome plus an additional 35 Mbp portion. This process was continued until all portions of the genome assembly had been assessed. The entire collection of identified elements, their lengths, and their family groupings are represented in table S20. Distributions of family sizes (A) and identified element sizes (B) are plotted in fig. S8. The scatter plot of family element number vs. element length (fig. S8C) demonstrates that most families comprise few elements of modest size (~1kbp). While families with many members (>100) are present, larger families tend to have smaller element lengths. The number of repetitive nucleotides is 79,373,843 (16.3%). LTRs were detected by different methods (table S21). The Method A pipeline uses LTRseq ( ) to identify LTRs followed by a HMMer search of transposable element (TE))related domains. 4,795 full)length LTR retrotransposons, including several nested copies, which all have at least one TE)related domain where found by Method A. Those that have reverse)transcriptase domains followed by an integrase domain in their internal region were classified as “Gypsy”; those with the integrase domain followed by the reverse)transcriptase domain were classified as “Copia”; while the rest were classified as “Unknown”. Method B used the program LTR_STRUC ( ) with default parameters. Method C1 also relies on LTR_STRUC, but avoids the splitting of sequences after N>5 stretches, which occur often in unfinished genome sequences. Under these conditions LTR_STRUC yielded 1,204 full)length LTR sequences, which were classified by a HMMer (http://hmmer.janelia.org) search for typical retrotransposon protein domains (GAG, PR, INT, RT). 1,080 (90%) of them remained after overlap removal and a quality check by the following criteria: the existence of at least one retrotransposon protein domain, simple sequence percent <=20, inner N percent <=30, soloLTR percent <=2, left + right soloLTR length <=80 percent of sequence length. They cover 2% (9.7 Mb) of the genome. According to their protein signatures 43 % could be assigned to the gypsy and 4 % to the copia LTR type, the remaining are ambiguous (table S21, S22). Diverged LTR elements and their fragments where detected by RepeatMasker Open)3)1)7 ( ) using a non)redundant set of the novel method C1 LTR retrotransposons as repeat library (1,060 sequences, 9.5 Mb). The evolutionary distance between 5’ and 3’ soloLTR was calculated from a ClustalW alignment by the emboss distmat package using the Kimura two parameter method. For the conversion of distance to insertion age, a substitution rate of 1.3E)8 was used. Data integration, final annotation and data extraction were carried out with the ANGELA (Automated Nested Genetic Element Annotation) pipeline (manuscript in preparation) (fig. S2). 2,108 full length LTRs were detected by similarity to the LTR retrotransposons library in addition to the 1,080 from LTR_STRUC, thus adding up to 3,188 full length LTRs for which the insertion age could be calculated (average age 3.3 mio years, median 3.0). 12% of those full length LTRs are fragmented by the 3 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) insertion of another LTR element. They generally represent an older fraction of the full length LTRs with average and median ages of 4.6 and 4.3 mio years (tables S9, S22). About half of the genome consists of LTR retrotransposons (157,127 elements, 51.3 % of the sequence length). Only 5% (3,188) of them still exist as intact full) length elements; the remainder are diverged and partial remnants are often fragmented by mutual insertions. Nested regions are very common, with 14% of the LTR elements inserted into another LTR element (table S9). ! Helitron transposable elements were sought by structural criteria as follows: the program searches for Helitron 3' end structures, and then aligns any cases where the same structure is found more than once. If this alignment indicates additional Helitron properties (e.g. insertion within 5')AT)3', extension of homology into the 5' direction, etc.), then the element is judged to be a Helitron. PASA ( ) was used to identify all potential AS events based on the qualified EST/genome alignments generated by GMAP ( ) (Criteria: maximum intron length = 4kb, minimal percentage of cDNA aligned = 80%, minimal average percentage of alignment identity = 97%). To make our results comparable to Wang and Brendel ( ), only five splicing events used in their study (AltA, AltD, AltP, IntronR, and ExonS) were included for further analyses. In total, 27,055 potential gene models were detected by EST to genome alignments and subsequently analyzed. Based on PASA, 21.4% of the analyzed genes show alternative splicing (AS, table S6), a similar frequency to - , and . ' ( ). Most AS events in use an alternative acceptor, rather retaining an intron in the mRNA. Only 7.1% of genes have intron retention events in contrast to , (14.3%) and . ' (14.6%). Longer introns and/or shorter exons in may favor splicings primarily by exon definition (as in humans) rather than by intron definition, which is implied by the larger number of intron retention events seen in . ' and - , . Exon skipping events are the dominant alternative splicing isoform in humans (~50%), but are rare in plants, including ,- , , and . ' . We first identified all paralogs according to the criteria used in Li et al. ( ), and calculated the Ks values of each paralogous gene pair following the method described in Maere et al. ( ). Since i)ADHoRe runs on whole assembled chromosomes, we concatenated all the scaffolds into 25 ‘pseudo)linkage groups’, each separated by stretches of Ns. As TAGs consist of gene family members and thus are paralogs, we started to detect tandem arrayed genes by clustering the protein sequences of the gene models. In a first step, paralogous proteins were detected using the clustering software BLASTCLUST ( ) with stringent parameters (minimum 75% identity and 80% length coverage). The resulting gene models were filtered using homology support, and genes associated with transposable elements (TIGR Plant Repeat Database Project and Repbase) as well as genes with a high proportion of polyN)stretches and with internal stops were excluded. A maximum of ten spacer genes was allowed. Details about the TAG clusters are presented in fig. S4. As the fragmentation of the current genome release could impact the detection of TAGs, we calculated the average density of TAGs in the N50 scaffolds per Mbp. Based on this data the genome was predicted to contain ~190 TAGs, while 201 were observed, which is no significant deviation. Therefore, the fragmentary nature of the genome assembly seems to have no impact on the TAG detection process. KEGG annotation of the TAG genes revealed that 44% of all photosynthetic antenna proteins are encoded by TAGs. 4 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) " To determine the degree of lineage)specific gains among gene families and to address the question whether genes with certain domains tend to expand at higher rates than others in , we identified gene families based on similarities between protein sequences from and - , and defined orthologous groups (OGs) where each group represents an ancestral gene common to the and the - , lineages and contains genes derived from speciation and all subsequent duplication and retention events. Based on the E values in all)against)all BLAST ( ) searches of and - , protein sequences, we defined similarity clusters with Markov Clustering ( ) and found 5,456 clusters which were identified in both and - , . In each cluster (referred to as gene family), OGs were defined both based on phylogenetic tree topology ( ) referred to as tree)based) and based on an iterative search algorithm applied on a sequence similarity matrix ( ) referred to as similarity)based). No apparent bias was introduced by using the NJ method for tree inference, as only 0)10% differences in the number of gains and losses were found when comparing the results of Bayesian inference on several gene families. Each OG represents a single ancestral gene from the progenitor of and - , and all lineage)specific duplicates of this ancestral gene. To determine whether genes with certain protein domains tend to expand at higher rates than expected randomly, we identified domains with HMMER 2.3.2 ( ) based on the Release 20.0 of the Pfam database (Pfam_ls; www.sanger.ac.uk/Software/Pfam). Domains with significant lineage)specific expansion were identified by determining if the number of genes in expanded OGs is significantly higher than 2 unexpanded OGs in each domain family with a χ test ( ). The values were corrected for multiple testing with the /0value software based on false discovery rates ( ). To rule out the possibility that some of the two component genes may be bacterial or fungal contaminants, we eliminated genes annotated as two component regulators that are more similar to bacterial or fungal genes than they are to plant genes. Even after applying this conservative criterion, there is still significant over)representation of HisKA and response regulator domain containing genes in . #" The aldehyde dehydrogenase (ALDHs) superfamily is involved in osmotic protection, NADPH generation, aldehyde detoxification, and intermediary metabolism ( ). The ALDH superfamily comprises 14 genes in 9 protein families in - , , and 20 genes in 10 protein families in . At least two protein families are not found in other eukaryotic genomes. has members within 8 of the 9 protein families found in - , , and three of these protein families are expanded in . The expansion and variety of ALDH gene members suggest that their presence results in an active and robust γ)aminobutyric acid (GABA) shunt metabolic pathway and the GAPN glycolytic bypass ( ). The WRKY transcription factor family, regulating responses to stress and a number of developmental processes in angiosperms, is expanded in (40 members) as compared to unicellular algae (no more than three genes), while angiosperms typically contain 75)125 members (table S13). Many algae and bryophytes share the ancestral trait of having flagellated male gametes, although this trait has been lost in flowering plants ( ). Consequently, proteins for delta and epsilon tubulins, required for forming the basal bodies of flagella ( ), are found in (St 93, 94). Genes were also found for most proteins of the inner, but not the outer dynein arms (St 91, 92), which are the motors for the motility of flagella. This observation suggests a lack of outer arms in flagella, as has been shown to be the case for other land plants ( ). Cytoplasmic dynein genes and their regulatory dynactin complex genes are absent, suggesting that the dynein)mediated transport system was probably lost in or prior to the last common ancestor of and flowering plants. $% % In vascular plants, photomorphogenic signals are perceived by three sensory photoreceptor families: phytochrome, cryptochrome and phototropin. possesses four canonical phototropins, UV/A)blue light photoreceptors that help optimize photosynthesis in shade while avoiding damage in sunlight ( ). has seven phytochromes, more than any organism reported to date. Of the potential phytochrome partners, 5 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) neither FHY1, PIF3 nor the PKS family of phytochrome)interacting proteins are present in , whereas two copies of NDPK2, implicated in phytochrome signaling in vascular plants ( ), are represented. UV/A)blue light sensitive cryptochromes and the related photolyase DNA)repair family are represented in all known bacteria and eukaryotes. Accordingly, in addition to two HY4)like cryptochrome photomorphogenic photoreceptors ( ), has one UVR3)like 6)4 photolyase, one ssDNA CRY3)like and several dsDNA PHR)like cyclobutane pyrimidine dimer photolyases that restore nucleotide structure with the help of UV/A)blue light following UV/B) induced damage. Circadian oscillators are found in most organisms, and genes related to TOC1/PRR pseudo)response regulators (St 69) and LHY/CCA1 single)myb domain transcription factors (St 30) of flowering plant clocks are present in both and . # &. # . In terms of interpretation of seasonal cues, has sequences related to the key photoperiodic regulators CONSTANS (St 69) ( ), ( ), and FT (St 74), as well as the CONSTANS)regulating cycling DOF factors (St 19), but not their downstream targets. Thus, these signaling pathways appear to have an ancient origin, with the evolution of specific downstream targets occurring later, after the divergence from the last common ancestor of land plants. & % In order to accurately describe the evolutionary history of the gene families discussed, phylogenetic inference was performed. The overall pipeline approach to construct gene families starting from candidate queries was carried out as previously described ( ). The non)redundant search space used for the PSI)BLAST ( ) searches consisted of the predicted proteins of 45 completely sequenced genomes covering organisms from all super kingdoms, with special focus on plants and algae (table S23). Using maximally four PSI)BLAST iterations, the database was searched for candidate gene family members (E)value cutoff 1E)4; hit inclusion cutoff 1E)5), the resulting hits were filtered based on 35% identity and 80 amino acids hit length. Overlapping filtered result sets were merged to recover family relations by single linkage clustering using a stringent hit)coverage)based distance measure (>=80 aa overlap on the shared hit). Neighbor joining trees inferred from the automatically generated clusters were manually checked and curated if necessary by reduction to the subfamily of interest or subclustering by splitting the cluster into multiple subfamilies. In the latter case, the original cluster id was extended (e.g. 58_A and 58_B). Based on the manually curated gene families, multiple alignments were calculated using MAFFT L)INSI ( ). In the case of the WRKY and B3 families, which are defined by a short protein domain and thus are difficult to represent by phylogenies based on whole protein alignments, the corresponding PFAM ( ) domain (PF03106 and PF02362) HMMer (http://hmmer.janelia.org) fs profile was used to extract the conserved domain sequence from the gene family members using hmmerpfam with the trusted cutoff. The domain sequences were aligned using MAFFT L)INSI. Maximum likelihood tree topologies were created from the final gene families using the RAxML software ( ). For each multiple alignment, the optimal evolutionary model was selected using the ProtTest software ( ). The best)known likelihood (BKL) tree was selected from a PROTMIX tree search with 100 randomized maximum parsimony starting topologies, optimization of individual site substitution rates, classification of four discrete rate categories, and final evaluation using the previously selected model of rate heterogeneity with full parameter estimation. The BKL tree topology was annotated with confidence (bootstrap) values derived from a multiple non)parametric bootstrap approach using the PROTCAT procedure and the family)specific model. All generated trees were mid)point rooted at the longest internal branch, annotated with species information and stored in NHX format. The annotated tree topologies can be accessed and viewed using the ATV java applet via http://www.cosmoss.org/bm/supplementary_trees/Rensing_et_al_2007/ 6 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) B) Authorship The order of the 70 authors was divided into three tiers, the first tier (1)23) being those scientists who actually contributed directly to the production of the sequences, their assembly, annotation, analyses and in the writing of the paper. Their order is according to the extent of their contribution, the first author making the greatest contribution overall. The second tier (24)61) is composed of authors arranged alphabetically who analyzed characteristics of the assembled genome, specific genes and gene families described in the main text. The third tier (62)70) is composed of authors who assisted in and facilitated the writing of the paper, had administrative/contact responsibility at the Joint Genome Institute and at the laboratories of the members of the Moss Genome Consortium (www.mossgenome.org). The corresponding author had a major role in facilitating and organizing the final assembly of the authors, annotators and writers of this manuscript. 7 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) C) Figure Legends Figure S1: % LTR#retrotransposon length distribution (LTR_STRUC) of $% and rice Length distributions of the full length LTR retrotransposons for ,- , LTR_STRUCT software. The blue vertical line indicates the arithmetic mean. Figure S2: , and rice as predicted by the Nesting architecture and spatial distribution of selected repeat elements The Apollo Genome Viewer is used with customized color codes for the selective visualization of genetic elements. 1 : ANGELA repeat annotation with nesting display. 1 : transposon protein domains. 1 : full length LTR retrotransposons with age color code. 1 : solo LTRs. Figure S3: $% Ks distribution plot Age distribution of paralogous genes. The height of the bars reflects the amount of gene pairs in the respective bin relative to the total amount of Ks values in the distribution. Figure S4: Tandemly arrayed gene (TAG) properties Distribution of 10 tandemly arrayed gene properties. 1 %# % # , : cluster_size (number of paralogous genes; 75% identity and 80% coverage), original scaffold TAG size (number of genes in array on the same scaffold, allowing unlimited intervening genes), scaffold TAG size (number of genes in array on the same scaffold, allowing maximally 10 intervening genes; the following features refer to this stringent definition), delta number of exons (number of divergent exons between TAG pairs), delta gene length (differences in gene length between TAG pairs). 1 %# % # , : delta CDS length (differences in coding sequence lengths), orientation (strand orientation), number of genes in between (number of genes between TAG pairs), distance (TAG pair distance in bp), distance excluding intermediate genes (TAG pair distance in bp excluding the lengths of intervening genes). Figure S5: TAG functional annotation: Deviating KEGG pathways Bar chart comparing the significantly deviating KEGG pathway annotations between the TAGs (light blue) and the non)overlapping remainder of the genes (dark yellow). Differences were compared using Fisher tests corrected for multiple testing using the Benjamini and Hochberg (BH) method as implemented in R. Figure S6: G#proteins of $% compared with other eukaryotes A: For each of the green plant genomes, a box represents a gene present in the genome that encodes a small G)protein of the indicated phylogenetic group. The closest human homolog is shown at the bottom. Species abbreviations: -# , - , 2 ,* , ; ", # , ", *& # , #& ; . , . # #;. ,. # # ;3 ,3 . B: Each of the organisms is represented by a column of boxes where each box represents a gene present in the genome that encodes a SNARE (top) or SM)family protein (bottom), with the color of the box indicating the type of SNARE protein (orange, Qa; purple, Qb; green, Qb+Qc; red, Qc; blue, R) or SM [brown, Sly1 (ER); cyan, Vps45 (Golgi/endosomes); light green, Vps33 (vacuole/lysosome); violet, Sec1 (PM)]. Clusters are separated into the three main functional unit of the endomembrane system based upon homology with proteins of known function in yeast, mammals and plants. Species abbreviations: -# ,, -# & , ; #, # , # ; .#* , .#*4 ' ; ,* , ; ", # , ", *& # , #& ; 5 , 5 ' 6 # #, . , . # #; . , . # # ; "* , "* & ,*4 # ; +, , +, # & ; , #, , & * # # 2 ,* , ,* , , # 7 ; ,*# , ,* , , # 8 Supporting Online Material for Rensing et al. 2008, # # ; $ &, $ ;" ," Figure S7: * #, & & & ; ; $# , , $# , # , * # ' #; 3 ; , ,3 , 319, 64 (2008) ,4 , # * . Contaminant isolation using multivariate clustering analysis of 27 scaffold features Multivariate clustering analysis of 27 scaffold features, combining principal component analysis (PCA) and k) means clustering, allowed the isolation of prokaryotic contaminant sequences from the genome assembly. Cluster 1 (red): true genomic regions; cluster 2 (blue): Bacterial contaminant from a yet unsequenced species introduced with the genomic DNA (removed entirely from the released assembly); cluster 3: longer a) (green) / shorter b) (black) repetitive genomic regions (e.g. transposons) without protein coding genes or EST evidence mixed with some longer a) (green) / shorter b) (black) bacterial sequences possibly introduced by plate)switch or mis)labelling during sequencing (experimentally confirmed scaffolds were removed from the released assembly). Figure S8: RECON repeat family analysis A: Distribution plot of repeat family sizes as determined using the RECON repeat finder software. B: Distribution plot of the average length (bp) of repeat families as determined using the RECON repeat finder software. C: Two)dimensional comparison of the RECON repeat families using their sizes (number of elements) and average element length (bp). 9 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) D) Figures Figure S1: Figure S2: LTR#retrotransposon length distribution (LTR_STRUC) of $% and rice , Nesting architecture and spatial distribution of selected repeat elements 1 2 3 4 1.2 Mb of scaffold_4 tier 1: ANGELA repeat annotation tier 2: hmm domains ) 1: complete Angela annotation with 2: transposon hmm domains 3: full length LTRs (age color coded) 4: solo LTRs nesting tier 3: LTR age !"# !"$ %"# &"' ( &"' ) * 1 2 3 4 0.5 Mb of scaffold_2 1 2 3 4 0.5 Mb of scaffold_1 10 Supporting Online Material for Rensing et al. 2008, Figure S3: $% Figure S4: Tandemly arrayed gene properties 319, 64 (2008) Ks distribution plot 11 Supporting Online Material for Rensing et al. 2008, Figure S5: TAG functional annotation: Deviating KEGG pathways Figure S6: G#proteins of $% 319, 64 (2008) compared with other eukaryotes A: 12 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) B: Figure S7: Contaminant isolation using multivariate clustering analysis of 27 scaffold features 13 Supporting Online Material for Rensing et al. 2008, Figure S8: 319, 64 (2008) RECON repeat family analysis A: B: 14 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) C: 15 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) E) Tables Table S1: Transcript evidence resources used for genome annotation Genome size (Mb) 480 Known cDNA 3,154 ESTs from NR 120,702 ESTs from collaborators 96,133 EST clusters from JGI 31,951 Number of EST clusters aligned 31,146 97% The above transcript evidence resources where mapped to the genome using BLAT and were used for genome structure prediction. Table S2: $% v1.1 gene model support Model Types Number Percentage Known genes 210 1% Models based on homology)methods 13,150 37% - 22,578 63% genes 35,938 Total genes Composition of the final set of gene models forming the released v1.1 genome annotation. Table S3: $% Model Statistics v1.1 gene properties Average Gene length (bp) 2,389.42 Transcript length (bp) 1,195.77 Protein length (aa) 362.84 Exons per gene 4.87 Exon length (bp) 245.62 Intron length (bp) 310.57 Genes per Mbp 74.9 Some properties of the structure and organization of genes within the Table S4: genome v1.1. Functional annotation of the v1.1 gene models Model Support Number Percentage Distinct Categories Supported by multiple methods 3,754 10% Supported by homology 13,360 37% Models with EST support 12,593 35% Models with Swissprot alignments 13,340 37% Models with Pfam alignments 13,613 38% Models with EC assignments 4,110 11% 789 Models with KOG assignments 15,932 44% 3,603 Models with GO assignments 12,129 34% 3,092 Outcome of the functional annotation of the v1.1 gene models using various data sources and methods. 16 Supporting Online Material for Rensing et al. 2008, Table S5: 319, 64 (2008) V1.1 gene model quality Model Quality Number Percentage Multi)exon genes 30,928 86% Truncated (missing both 5'M 3'*) 2,206 6% Partial models (either 5'M or 3'*) 3,562 10% Complete models (5'M 3'*) 30,170 84% Models extend to either 5' or 3' UTR 8,418 23% Complete models extend both to 5' and 3' UTR 4,517 13% Six parameters assessing the v1.1 gene model quality. Completeness of gene models is measured by considering the existence of a translation initiating 5’ methionine (5’M) and a 3’ terminal stop codon (3’*). Table S6: Summary statistics of genome#wide alternative splicing in $' Type of alternative splicing genes Events Genes* AltA 3,272 (28.1%) 1,446 (5.3%) AltD 28,22 (24.3%) 1221 (4.5%) AltP 2,050 (17.6%) 761 (2.8%) IntronR 2,892 (24.9%) 1913 (7.1%) ExonS 598 (5.1%) 465 (1.7%) 11,634 5,806 (21.4%) Total Overview of the alternative splicing variants observed in using the PASA software. The number of genes described refers to gene loci in terms of PASA subclusters (*). Table S7: RECON repeat family sizes and element lengths Average Number of elements Element size Low High 10 1 857 1,292 300 43,280 Average and range of element numbers and sizes observed within the 1,381 repeat families identified. Only families with a minimum of 10 elements were retained for analysis, but all sequences less than 300bp were not used for masking or subsequent statistics, hence some families are ultimately represented by only one sequence. Table S8: Composition and contribution of the 15 RECON repeat families Repeat Family ID Bases represented [bp] Family sizes Mean element length [bp] Largest element length [bp] Smallest element length [bp] Family hits within the genome AT_rich#low_complexity 18,074,591 1)6 13,896,516 178 1,551.21 5,585 388 9,834 1)5 10,985,211 88 1,421.78 3,770 386 14,717 1)7 7,973,118 60 1,064.65 1,511 432 9,832 2)6 2,957,609 756 1,435.24 7,116 300 2,260 2)1 2,453,886 857 440.28 689 300 10,096 309,731 17 Supporting Online Material for Rensing et al. 2008, 1)17 1,910,382 (TA)n#Simple_repeat 1,630,550 1)12 1,603,066 66 948.45 1,696 327 319, 64 (2008) 2,893 45,976 47 2,376.57 43,280 331 2,355 2)15 1,268,580 483 756.14 1,371 300 2,427 2)550 1,247,494 310 1,184.24 5,091 303 853 1)47 990,337 11 1,700.73 7,032 333 1,412 2)33 863,697 77 1,809.74 6,855 320 401 1)16 580,258 9 1,472.44 2,930 853 680 2)3 529,037 41 652.54 1,236 310 757 Overview of the individual family composition and their contribution to the repetitive fraction of the P. patens genome. Table S9: Nesting level of transposable elements insert level # # [%] Nucleotides [bp] nucleotides [%] 0 135,376 86.16 195,529,390 84.02 1 20,286 12.91 34,303,645 14.74 2 1,408 0.9 2,769,226 1.19 3 56 0.04 113,885 0.05 4 1 0 1,328 0 1# 4 21,751 13.85 37,188,084 15.98 Sum 157,127 100 232,717,474 100 Level of nesting which was observed among transposable elements in the genome. Insert Level 0 means that the element is not inserted into another element. Level 1 elements are inserted into level 0 elements, level 2 elements into level 1 elements and so on. The insertion of a child element into a parent element fragments the parent into two parts. Table S10: intact truncated Helitrons id from to scaffold_366_P 158,402 164,572 scaffold_65_P 445,133 451,276 scaffold_277_N 512,482 518,535 scaffold_201_P 88,530 94,573 scaffold_18_N 2,033,993 2,040,103 scaffold_42_N 1,958,341 1,964,492 scaffold_2_N 3,298,214 3,304,190 scaffold_5_P 159,282 169,833 scaffold_11_N 857,172 868,013 scaffold_14_P 326,051 348,053 18 Supporting Online Material for Rensing et al. 2008, scaffold_33_P 932,035 938,246 scaffold_70_P 1,408,895 1,420,173 scaffold_158_P 988,368 994,295 scaffold_183_N 632,506 638,670 scaffold_188_N 487,167 492,531 scaffold_250_N 429,279 442,590 scaffold_269_P 470,661 483,899 scaffold_295_N 218,887 225,071 scaffold_319_P 6,339 9,759 Loci of the single family of Helitrons (rolling)circle DNA transposons) found in the represent positive or negative strand. 319, 64 (2008) genome. P and N 19 Supporting Online Material for Rensing et al. 2008, Table S11: 319, 64 (2008) Comparison of tandemly arrayed genes (TAGs) to non#TAG genes TAGs normality [p] Gene models normality [p] Wilcoxon rank sum test [p] TAGs max TAGs mean TAGs median TAGs min TAGs σ Gene models max Gene models mean Gene models median Gene models min Gene models σ Gene length [bp] 1.23E)55 0 3.07E)29 25,629.0 2,198.20 1,706.0 252.0 1,900.56 39,890.0 3,082.53 2,519.0 240.0 2,377.11 CDS length [bp] 5.36E)48 0 2.30E)11 4,002.0 1,065.92 891.0 252.0 676.54 14,577.0 1,306.53 1,080.0 180.0 980.84 0 0 8.49E)41 27.0 3.98 3.0 1 3.45 77.0 6.73 5.0 1.0 5.76 1.16E)82 0 0 1,965.0 420.44 312.7 73.2 361.32 4,176.0 308.08 189.5 50.2 329.32 Exons Average exon length [bp] Cluster size 0 0 0 25.0 5.11 4.0 2.0 4.26 25.0 1.99 1.0 1.0 1.97 6.99E)204 0 2.37E)37 24,774.0 851.55 440.0 0 1,561.11 24,774.0 1,475.82 1,067.0 0 1,569.09 Average intron length [bp] 0 0 1.42E)06 12,387.0 253.26 197.8 0 661.93 12,387.0 243.10 227.2 0 259.91 Introns 0 0 8.49E)41 26.0 2.98 2.0 0 3.45 76.0 5.73 4.0 0 5.76 Introns length [bp] GC exons [%] 0.00186 0 0 69.2 54.76 54.7 30.6 5.86 74.3 49.41 48.8 30.6 3.97 GC introns [%] 3.39E)184 0 0 71.8 36.71 42.8 0 20.72 71.8 35.43 38.8 0 13.76 GC gene [%] 0.07164 0 0 67.2 51.88 51.8 8.0 6.91 67.2 45.28 44.2 8.0 4.73 GC CDS total [%] 7.51E)03 0 0 67.2 55.46 55.6 30.5 5.68 67.2 49.69 49.0 30.5 3.89 0 0 1.68E)07 647.0 31.52 7.5 0 70.52 1,042.0 12.67 5.0 1 32.02 Gene model EST support [%] Gene model cDNA support [%] 0 0 6.47E)03 4.0 0.28 0 0 0.64 4.0 0.18 0 0 0.43 Gene model GenPept best HSP length [bp] 3.04E)43 0 6.51E)05 1,330.0 329.10 273.0 50.0 214.45 4,943.0 380.52 315.0 80.0 298.51 Gene model GenPept best HSP identity [%] 0 0 0 100.0% 72.1% 75.0% 32.3% 17.8% 100.0% 58.1% 56.1% 35.0% 14.9% TIGR and plantrep HSP length [bp] 0 0 0 69.0 0.71 0 0 6.21 79.0 0.04 0 0 1.75 TIGR and plantrep HSP identity [%] 0 0 0 100.0% 0.8% 0.0% 0.0% 7.2% 34.8% 0.0% 0.0% 0.0% 0.8% The above table compares 18 features of tandemly arrayed genes (TAGs) with those of non)TAG genes (gene models). First, normality was tested for the distribution of each feature using the Pearson chi)square test for normality. None of the features were distributed normally. Thus, biased features between the two populations were compared using the Wilcoxon rank sum test (less; more). In addition, an overview of the distributions is given showing minimal (min), maximal (max), median, average (mean) values and the standard deviation (σ) for both TAGs and non)TAG gene models. 20 Supporting Online Material for Rensing et al. 2008, Table S12: Subfamily & & Type I and type II MADS#box and MADS#like genes in $% (& . Genomic locus ( #box) Gene Name (& Scaffold Start End Strand $$ ) scaffold_118 1,026,583 1,026,404 + $$ * scaffold_55 1,832,462 1,832,283 + 348,851 349,030 ) (&& $ ) scaffold_267 (&& $ + scaffold_171 406,784 406,605 + (&& $$ &, scaffold_26 773,307 773,486 ) (&& $$ &- scaffold_209 758,925 758,746 + (&. $ * scaffold_118 802,139 802,318 ) (&. $ / scaffold_55 1,740,464 1,740,285 + (&. $$ / scaffold_34 1,943,470 1,943,291 + (&. $$ 0 scaffold_163 560,281 560,460 ) (&. $$ - scaffold_8 781,587 781,766 ) (&. $$ 1 scaffold_313 148,169 147,990 + (&. $$ , scaffold_34 1,967,363 1,967,179 + (&. $$ 2 scaffold_8 789,036 789,215 ) (&. $$ 3 scaffold_55 1,750,072 1,749,893 + (&. $$ )4 scaffold_90 799,382 799,561 ) (&. $$ )) scaffold_163 554,447 554,626 ) (&. $$ )* scaffold_273 362,369 362,548 ) Type I $$ ) scaffold_68 1,691,186 1,691,365 ) Type I $$ * scaffold_81 1,205,177 1,204,998 + Type I $$ / scaffold_88 1,179,645 1,179,824 ) Type I $$ 0 scaffold_198 705,696 705,517 + Type I $$ , scaffold_198 708,785 708,964 ) #like $$ ) scaffold_15 1,752,266 1,752,439 ) #like $$ * scaffold_37 2,364,940 2,365,119 ) #like $$ / scaffold_122 861,365 861,186 + Loci of )-$ )box domains in the 319, 64 (2008) genome v1.1 21 Supporting Online Material for Rensing et al. 2008, Table S13: 319, 64 (2008) WRKY transcription factor gene families A comparison of the WRKY transcription factor gene families from with those of ", *& # , #& . # # , . # # and -# & , . The total number of genes for each subfamily is shown. * indicates that the members of the subfamily form a distinct subgroup in a combined phylogenetic tree. Table S14: ABC subfamily A B Inventory of ABC transporter genes in $% Gene name ABC subfamily group1 Accession number EST support2 TAIR loci of closest % homologue PpABCA1 AOH Phypa_221752 yes AT2G41700 PpABCA2 ATH Phypa_190702 yes AT3G47730 PpABCA3 ATH Phypa_190218 yes AT3G47780 PpABCA4 ATH Phypa_180906 yes AT3G47790 PpABCA5 ATH Phypa_145836 yes AT3G47790 PpABCA6 ATH Phypa_147779 no AT3G47780 PpABCA7 AOH Phypa_234064 no AT2G41700 PpABCB1 LLP Phypa_115784 yes At5G03910 PpABCB3 TAP Phypa_129034 yes AT5G39040 PpABCB4 TAP Phypa_174637 yes AT5G39040 PpABCB5 TAP Phypa_224391 yes AT1G70610 PpABCB6 TAP Phypa_224785 yes AT1G70610 PpABCB7 TAP Phypa_63650 yes AT5G39040 PpABCB8 TAP Phypa_193090 yes AT4G25450 PpABCB9 ATM Phypa_108321 yes AT5G58270 PpABCB10 ATM Phypa_225750 yes AT5G58270 PpABCB11 MDR Phypa_199955 yes AT3G28345 PpABCB12 MDR Phypa_198750 yes AT3G28345 PpABCB13 MDR Phypa_227047 yes AT2G47000 PpABCB14 MDR Phypa_59717 yes AT1G02520 PpABCB15 MDR Phypa_110943 yes AT3G28860 PpABCB16 MDR Phypa_170613 yes AT3G28860 PpABCB18 MDR Phypa_56126 no AT3G28860 PpABCB20 MDR Phypa_119621 no AT2G39480 22 Supporting Online Material for Rensing et al. 2008, C D F G PpABCB22 LLP Phypa_8856 no AT3G28860 PpABCB23 ATM Phypa_91386 no AT5G58270 PpABCB24 MDR Phypa_140970 PpABCC1 MRP Phypa_135574 yes AT2G07680 PpABCC2 MRP Phypa_194836 yes AT2G34660 PpABCC3 MRP Phypa_199102 yes AT2G34660 PpABCC4 MRP Phypa_216010 yes AT3G62700 PpABCC5 MRP Phypa_187434 yes AT3G62700 PpABCC6 MRP Phypa_137284 yes AT3G62700 PpABCC7 MRP Phypa_224600 yes AT3G21250 PpABCC8 MRP Phypa_167276 yes AT1G04120 319, 64 (2008) AT3G28860 PpABCC9 MRP Phypa_221970 yes AT1G04120 PpABCC10 half MRP Phypa_153801 yes AT1G30410 PpABCC11 MRP Phypa_145373 no AT2G34660 PpABCC12 MRP Phypa_61991 no AT3G59140 PpABCC13 MRP Phypa_117638 no AT3G21250 PpABCC15 MRP Phypa_101994 no AT1G04120 PpABCD1 PMP Phypa_125471 yes AT4G39850 PpABCD2 PMP Phypa_134601 yes AT1G54350 PpABCD3 PMP Phypa_207071 yes AT1G54350 PpABCD4 PMP Phypa_130679 yes AT1G54350 PpABCD5 double PMP Phypa_218012 yes AT4G39850 PpABCD7 PMP Phypa_144681 no AT1G54350 PpABCF1 GCN Phypa_208576 yes AT1G64550 PpABCF2 GCN Phypa_223577 yes AT5G60790 PpABCF3 GCN Phypa_192602 yes AT5G60790 PpABCF4 GCN Phypa_161003 yes AT5G60790 PpABCF5 GCN Phypa_185776 yes AT3G54540 PpABCF6 GCN Phypa_231060 yes AT3G54540 PpABCF7 GCN Phypa_30640 yes AT5G64840 PpABCF8 GCN Phypa_201003 yes AT5G64840 PpABCF10 GCN Phypa_107004 yes AT5G64840 PpABCG1 WBC Phypa_112649 yes AT5G60740 PpABCG2 WBC Phypa_147149 yes AT2G01320 PpABCG3 WBC Phypa_196641 yes AT4G27420 PpABCG4 WBC Phypa_127566 yes AT5G06530 PpABCG5 WBC Phypa_197808 yes AT2G13610 PpABCG6 WBC Phypa_151127 yes AT1G17840 PpABCG7 WBC Phypa_59855 yes none PpABCG8 WBC Phypa_97018 yes AT1G17840 23 Supporting Online Material for Rensing et al. 2008, I3 PpABCG9 WBC Phypa_11555 yes AT5G13580 PpABCG10 WBC Phypa_128675 yes AT3G53510 PpABCG11 WBC Phypa_41420 yes AT3G53510 PpABCG13 WBC Phypa_153252 yes AT1G53270 PpABCG14 WBC Phypa_215170 yes AT1G53270 PpABCG15 PDR Phypa_175287 yes AT2G29940 PpABCG16 PDR Phypa_128826 yes AT1G59870 PpABCG17 PDR Phypa_176017 yes AT1G15210 PpABCG18 PDR Phypa_121512 yes AT1G15210 PpABCG19 PDR Phypa_140793 yes AT1G15210 PpABCG20 PDR Phypa_210034 yes AT1G59870 PpABCG21 PDR Phypa_192434 yes AT1G59870 PpABCG22 PDR Phypa_226738 yes AT1G66950 PpABCG23 PDR Phypa_171206 yes AT3G16340 PpABCG24 WBC Phypa_129635 no AT5G60740 PpABCG25 WBC Phypa_140499 no AT5G60740 PpABCG26 PDR Phypa_102109 no AT2G29940 PpABCG27 PDR Phypa_116286 no AT1G15210 PpABCG28 WBC Phypa_151478 no AT1G17840 PpABCG29 WBC Phypa_131586 no AT2G39350 PpABCG30 WBC Phypa_41350 no AT3G53510 PpABCG31 WBC Phypa_135027 no AT2G13610 PpABCG32 PDR Phypa_112247 no AT2G29940 PpABCG33 PDR Phypa_118223 no AT1G59870 PpABCG34 PDR Phypa_139762 no AT1G59870 PpABCG35 PDR Phypa_128793 no AT1G59870 PpABCG36 WBC Phypa_131592 no AT1G17840 PpABCG37 WBC Phypa_146773 no AT1G17840 PpABCG38 WBC Phypa_71431 no AT1G17840 PpABCG39 WBC Phypa_114177 no AT5G13580 PpABCG40 WBC Phypa_134830 no AT5G13580 PpABCG41 WBC Phypa_140592 no AT4G27420 AT5G46540 PpABCI1 NO Phypa_134304 yes PpABCI2 MKL Phypa_116997 yes PpABCI3 MKL Phypa_180730 yes 319, 64 (2008) AT1G65410 PpABCI4 ADT Phypa_179405 yes AT1G03905 PpABCI5 CCM Phypa_116239 yes AT1G63270 24 Supporting Online Material for Rensing et al. 2008, PpABCI6 O4 CBY Phypa_17451 yes PpABCI7 CBY Phypa_149024 yes PpABCI8 ABCX Phypa_106270 yes AT4G33460 AT3G10670 PpABCI9 ADT Phypa_218855 yes AT5G44110 PpABCI10 ABCX Phypa_3208 yes AT1G32500 PpABCI11 ABCX Phypa_121886 yes AT4G04770 PpABCI12 CBY Phypa_203642 yes AT3G21580 PpABCI13 CCM Phypa_146726 no AT2G07681 PpABCI14 MKL Phypa_127149 yes AT1G19800 PpABCI15 ABCX Phypa_111022 yes AT4G04770 PpABCI16 NO Phypa_157748 no AT1G67940 Phypa_158315 no AT1G02520 Phypa_235054 no AT5G61700 Phypa_158388 no AT1G28010 ATM)like fragment ATH)like fragment MDR)like fragment PpABCB17 PpABCA8 PpABCB25 319, 64 (2008) Inventory of ABC transporters in the v1.1 genome. Footnote annotations: 1 The ABC transporter subfamilies are defined in table S15. 2 On comparison with EST collection as of October 2006. 3 Components of ABC transporters with homology to prokaryotic ABC proteins. 4 Includes fragments of ABCs which align with main subfamilies. Table S15: Subfamily A B ABC subfamily group domain structure Group Domain structure AOH TMD)NBD)TMD)NBD ATH TMD)NBD MDR(PGP) TMD)NBD)TMD)NBD ATM(HMT) TMD)NBD TAP TMD)NBD LLP TMD)NBD C MRP TMD)NBD)TMD)NBD D PMP TMD)NBD)TMD)NBD F G GCN NBD)NBD WBC NBD)TMD PDR NBD)TMD)NBD)TMD Domain structure of the ABC subfamily groups (TMD = transmembrane domain; NBD = nucleotide binding domain). 25 Supporting Online Material for Rensing et al. 2008, Table S16: Abbreviation 319, 64 (2008) Full names of the chlorophyll and carotenoid biosynthetic enzymes shown in Figure 4 Full Name GTS glutamyl)tRNA synthetase GTR glutamyl)tRNA reductase GSA glutamate)1)semialdehyde aminotransferase ALAD 5)aminolevulinic acid dehydratase PBGD porphobilinogen deaminase UROS uroporphyrinogen III synthase UMT uroporphyrinogen III methyltransferase UROD uroporphyrinogen III decarboxylase CPX coproporphyrinogen III oxidase PPX protoporphyrinogen IX oxidase FC ferrochelatase CHLD protoporphyrin IX Mg)chelatase subunit D CHLI protoporphyrin IX Mg)chelatase subunit I CHLH protoporphyrin IX Mg)chelatase subunit H PPMT Mg)protoporphyrin IX methyltransferase CHL27 Mg)protoporphyrin IX monomethylester cyclase subunit 1 DCR divinylprotochlorophyllide reductase POR light)dependent NADPH:protochlorophyllide oxidoreductase CHS chlorophyll synthase CAO chlorophyllide GGR geranylgeranyl reductase oxygenase DXS 1)deoxy)D)xylulose)5)phosphate synthase DXR 1)deoxy)D)xylulose)5)phosphate reductoisomerase CMS 4)diphosphocytidyl)2)C)methyl)D)erythritol synthase CMK 4)diphosphocytidyl)2)C)methyl)D)erythritol kinase MCS 2)C)methyl)D)erythritol 2,4)cyclodiphosphate synthase HDS 1)hydroxy)2)methyl)2)(E))butenyl)4)diphosphate synthase IDS isopentenyl) / dimethylallyl)diphosphate synthase IDI isopentenyl diphosphate isomerase GGPS geranylgeranyl pyrophosphate synthase PSY phytoene synthase PDS phytoene desaturase ZDS )carotene desaturase CRTISO carotenoid isomerase LCYB lycopene )cyclase LCYE lycopene )cyclase CHYB carotene )hydoxylase (non)heme iron) CYP97A carotene )hydoxylase (cytochrome P450) CYP97C carotene )hydoxylase (cytochrome P450) ZEP zeaxanthin epoxidase VDE violaxanthin de)epoxidase 26 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) Table S17. Gene families involved in auxin homeostasis and signaling LCA land plants $% LCA flowering plants % $ 5 TIR1/AFB auxin receptors 1 4 4 6 8 7 Auxin response factors 3 14 ~12 24 27 28 Aux/IAA repressors 1 2 7)10 29 35 32 Auxin binding proteins 1 1 1 1 2 2 PIN auxin efflux carriers 1)2 3 6)9 8 16 13 AUX1/LAX auxin influx transporters 1)3 8 3 4 8 5 YUCCA/FLOOZY monoxygenases 1)2 6 5)7 11 12 14 Class II GH3 IAA amidosynthetases 0 0* 4)5 8 9 9 IRL1/ILL IAA amidohydrolases 0 0* 4)6 7 11 9 Small Auxin)Up RNA (SAUR) 2)3 18 ~20 76 102 56 55 174 230 175 Total protein coding loci 39,796 26,751 45,555 42,653 Proportion (Auxin signaling) 0.14% 0.65% 0.50% 0.41% Total auxin)related genes The numbers of genes in the ancestral land plant refer to the last common ancestor (LCA) of and flowering plants, the ancestral flowering plant LCA to those of monocots and eudicots. These numbers were estimated from the topologies of RAxML)inferred phylogenetic trees (St 25, 33_A/B, 41, 45, 71, 73, 77, 85, 88, and 89). *Similar proteins do not group within or directly sister to the flowering plants genes implicated in auxin homeostasis. 27 Supporting Online Material for Rensing et al. 2008, Total % $% & 5 5 &% 5 $ % $% A % Taxonomic profile of LHC protein families among 15 plastid#bearing organisms with sequenced nuclear genome Other Table S18: 319, 64 (2008) P#value (Fisher test) Tailed? Seed plant average $% adjusted using seed plant σ Tree 58_A 0 47 23 16 24 23 14 14 0 5 6 0 0 0 0 0 172 0.004980 greater 21 42.64110 LHCI 0 13 8 7 9 8 5 5 0 0 0 0 0 0 0 0 55 0.349788 greater 8 12 Lhca1 LHCI type 1 0 3 1 1 2 1 1 1 0 0 0 0 0 0 0 0 10 0.596273 greater 1.33333 2.42265 Lhca2 LHCI type 2 0 5 3 2 3 1 2 2 0 0 0 0 0 0 0 0 18 0.700974 greater 2.66667 4.42265 Lhca3 LHCI type 3 0 4 1 1 1 1 1 1 0 0 0 0 0 0 0 0 10 0.340580 greater 1 4 Lhca4 LHCI type 4 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 6 1 less 1 0 Lhca5 0 1 1 1 1 3 0 0 0 0 0 0 0 0 0 0 7 1 two.sided 1 1 Lhca6 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 4 1 less 1 0 LHCII major 0 19 9 5 8 0 0 0 0 0 0 0 0 0 0 0 41 0.044250 greater 7.33333 16.91833 Lhcb1 LHCII type 1 0 18 5 3 4 0 0 0 0 0 0 0 0 0 0 0 30 0.011537 greater 4 17 Lhcb2 LHCII type 2 0 0 3 1 2 0 0 0 0 0 0 0 0 0 0 0 6 0.472528 less 2 1 Lhcb3 LHCII type 3 0 1 1 1 2 0 0 0 0 0 0 0 0 0 0 0 5 1 two.sided 1.33333 1.57735 0 11 6 4 7 3 3 3 0 0 0 0 0 0 0 0 37 0.296902 greater 5.66667 9.47247 Lhcb4 CP29 LHCII type 4 0 4 3 1 3 1 1 1 0 0 0 0 0 0 0 0 14 0.660229 greater 2.33333 2.84530 Lhcb5 CP26 LHCII type 5 0 4 1 1 1 1 1 0 0 0 0 0 0 0 0 0 9 0.339356 greater 1 4 Lhcb6 CP29 LHCII type 6 0 2 1 1 2 0 0 1 0 0 0 0 0 0 0 0 7 1 greater 1.33333 1.42265 Lhcb7/Lhcq 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 7 1 two.sided 1 1 Other LHCII#like 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0.466667 greater 0 2 Algal LCHPs 0 2 0 0 0 12 6 6 0 5 6 0 0 0 0 0 37 0.493684 greater 0 2 LhcbM 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 9 1 two.sided 0 0 Lhcx/LI818 0 2 0 0 0 3 1 1 0 5 6 0 0 0 0 0 18 0.487909 greater 0 2 LHCII minor 28 Supporting Online Material for Rensing et al. 2008, 5 5 & $% % Photoprotective LHC#like 0 30 7 9 12 17 6 8 0 0 0 0 0 7 5 4 105 0.002593 greater 9.33333 27.48339 PsbS CP22 0 1 1 3 1 4 0 0 0 0 0 0 0 0 0 0 10 1 less 1.66667 2.15470 Lil1 ELIP 0 20 2 3 3 9 4 5 0 0 0 0 0 0 0 0 46 0.001728 greater 2.66667 19.42265 LIL2 OHP1 0 3 0 0 1 0 0 0 0 0 0 0 0 5 3 4 16 0.233613 greater 0.33333 2.42265 LIL3 LIL3 0 3 2 1 4 1 1 1 0 0 0 0 0 0 0 0 13 1 greater 2.33333 1.47247 LIL4 SEP1 ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) LIL5 SEP2 0 2 1 1 1 1 0 2 0 0 0 0 0 0 0 0 8 1 greater 1 2 LIL6 OHP2 0 1 1 1 2 2 1 0 0 0 0 0 0 2 2 0 12 1 two.sided 1.33333 1.57735 Total % % &% $% adjusted using seed plant σ $ Seed plant average 5 Tailed? $% P#value (Fisher test) Other B 319, 64 (2008) Tree 58_B The two phylogenetic trees (RAxML, based on a filtered L)INSI alignment) were manually annotated (the original accession numbers are preserved in {brackets}, see trees: 58A/B). The groups of sequences whose taxonomic profiles are shown above are based on these annotations and the clustering provided by the tree topology. "Other" refers to the 30 non)plastid bearing organisms, which were present in the PSI)BLAST search space used to build the initial clusters. P)values were calculated using Fisher tests ("tailed?" shows the alternate hypothesis used for the to the average gene family size in the three seed plants (-# ,, # and test; p<0.05) to compare the number of genes found in .#* ). Additionally, differences between and the seed plants are shown by comparing the "seed plant average" vs. the frequencies adjusted using the standard deviation σ of the three seed plant frequencies (phypa_adjusted>seed plant average and phypa_adjusted<seed plant average, last two columns). For species names see table S23. 29 Supporting Online Material for Rensing et al. 2008, Table S19: 319, 64 (2008) LHCP genes present in TAGs Left model Left name Right model Right name Genes inbetween TAG orientation Phypa_144392 LHCA3 Phypa_60069 LHCA3 0 divergent Phypa_228001 LHCB4 Phypa_228003 LHCB4 0 convergent Phypa_220036 LHCP Phypa_89671 LHCP 0 divergent Phypa_163091 LHCP Phypa_124625 LHCP 0 convergent Phypa_155384 LHCP Phypa_173457 LHCP 0 divergent Phypa_52279 LHCB5 Phypa_52281 LHCB5 0 convergent Phypa_119427 LHCB6 Phypa_56132 LHCB6 2 divergent Phypa_149967 ELIP Phypa_149966 ELIP 0 divergent Phypa_149966 ELIP Phypa_149976 ELIP 0 convergent Locus scaffold_214:737529) 744820 scaffold_472:146498) 150112 scaffold_186:221732) 231141 scaffold_51:1795431) 1815645 scaffold_463:103253) 127675 scaffold_6:2604863) 2612626 scaffold_28:2016809) 2046814 scaffold_308:493010) 512327 scaffold_308:493010) 512327 ,* # LHCP genes occurring in tandem arrays. The table above provides the accession and genomic location for each LHCP gene tandem array. Additionally, the transcriptional orientation and the number of genes lying between a TAG pair are given. Table S20: groupings The entire collection of identified repeat elements, their lengths, and their family Because of its large size, the table is provided as a separate MS Excel spreadsheet file table_S20.xls. Table S21: Results of different LTR retrotransposon detection methods Method A B C1 C2 Description Focus LTR_par overlap to genes LTR_STRUC default LTR_STRUC no N)split ANGELA with method C library comparison to other plants full length LTRs per genome Copia# like [%] Gypsy# like [%] Undefined [%] 4,795 2.4 45.9 51.7 791 library compilation 1,080 4.4 43.1 52.5 exhaustive annotation for further analyses 3,188 8.7 61.0 30.3 Overview of the results of 4 different LTR retrotransposon detection methods applied to the v1.1 genome. 30 Supporting Online Material for Rensing et al. 2008, Table S22: Classification of novel $' 319, 64 (2008) LTR retrotransposons Number % Average insertion age Median insertion age Gypsy#like 465 43.1 2.4 1.9 GAG)PR)RT)INT, at least RT)INT Copia#like 48 4.4 3.2 3.1 GAG)PR)INT)RT, at least INT)RT Mixed 241 22.3 3.1 2.7 Undefined 303 28.1 2.6 2.4 too many and double domains for clear assignment too few domains for clear assignment 23 2.1 2.5 2.2 no domains (Transposon PFAM) found 1,080 100 2.6 2.3 LTR types from hmm domains No HMM hit Total LTR type definition The LTR transposon types where defined by the composition of their protein signatures. (Capsid protein (GAG); protease (PR); Reverse transcriptase (RT); Integrase (IN)) Table S23: Completely sequenced genomes used as a database for the phylogenies 5#letter code # protein sequences Species name & strain Plants ORYSA .#*4 9 ARATH -# ' # 8 ' 66,710 & , POPTR # , PHYPA ,* :# & CHLRE ", OSTLU . # OSTTA . # CYAME "* & : #& 30,480 # 58,036 # # 35,938 Algae &% % % *& , #& 15,143 # 7,618 # 7,725 % ,*4 GUITH 6 # # , # , & 5,014 * 485 % THAPS +, PHATR , # & & * # 11,397 # 10,025 Protists # ENTHI ; , * # 3)0 () 19,547 " PLAFA & % # 10,261 ( TRYCR +#* # 4 # "1! # # 19,642 31 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) &% MONBR ) NAEGR 9 # ' 9,196 ! # # # 15,753 Sum 322,970 Metazoa FUGRU < XENTR = # # 26,721 # CAEEL " #, DROME $# , HOMSA 3 27,916 & 23,220 # 19,778 34,180 Fungi SACCE SCHPO , # ,4 * , # # ' 5,784 5,045 * 6 PHACH , # PHYBL ,* DICDI $ * , ,#* # 10,048 7 * > 14,792 Mycetozoa & & 13,377 Sum 180,861 Archaea & # % AERPE - # PYRAE *# SULSO % *# ! # 6 ! ! 1,841 # % , # ! # () ! 2,605 2,977 % METAC ) , PYRAB *# THEAC +, # NANEQ 9 CAUCR " # ! ! * !:; ' # ! & !" - 4,540 1,898 , !$ ) 1,482 % # , ! / !? 0) 536 Eubacteria α+$ #! # !" 3,737 32 Supporting Online Material for Rensing et al. 2008, ERYLI ;#* ,# AGRTU - # , ANAVA - NOSSP 9 #! # # !3+"" ! % !" 319, 64 (2008) 3,011 !@A 5,402 & SYNSP !' # ! * , !-+"" 5,661 ! "" * 6,130 ! ! "" 3,569 & # !"0 4,066 8 BACHA !, BACSU ! CLOPE " ESCCO ; ! #& ! #%# ! ! # 4,105 !-+"" 2,876 γ +$ , # , ! & PSESP XANOO ! = , "" !? ! *# 4,243 ! '! , - 5,170 ! #*4 ! ' ! #*4 !?4,080 Sum 67,929 Total 571,760 Completely sequenced genomes comprising the search space for the gene family tree reconstruction. 33 Supporting Online Material for Rensing et al. 2008, 319, 64 (2008) F) References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. N. W. Ashton, D. J. Cove, ) : # : 154, 87 (1977). C. D. Knight, D. J. Cove, A. C. Cuming, R. S. Quatrano, in ) # *. (2002), vol. 2, pp. 285. M. Luo, R. A. Wing, in < : . (2003), vol. 2. S. Aparicio , 297, 1301 (2002). A.)F. A. Smit, R. Hubley, P. Green, , BCCDDD # > # # (2004). P. J. Kersey , # 4, 1985 (2004). K. D. Pruitt, T. Tatusova, D. R. Maglott, 9 - & E # , 35, D61 (2007). S. F. Altschul ,9 - & E 25, 3389 (1997). E. Birney, M. Clamp, R. Durbin, : E 14, 988 (May, 2004). A. A. Salamov, V. V. Solovyev, : E 10, 516 (Apr, 2000). W. J. Kent, : E 12, 656 (Apr, 2002). M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, M. Hattori, 9 - & E 32, D277 (Jan 1, 2004). E. Quevillon ,9 - & E 33, W116 (Jul 1, 2005). M. Ashburner ,9 # : 25, 25 (May, 2000). R. L. Tatusov , )" % # 4, 41 (2003). Z. Bao, S. R. Eddy, : E # , 12, 1269 (2002). A. Kalyanaraman, S. Aluru, 8 % # " 4, 197 (Apr, 2006). E. M. McCarthy, J. F. McDonald, % # 19, 362 (Feb 12, 2003). B. J. Haas ,9 - & E 31, 5654 (Oct 1, 2003). T. D. Wu, C. K. Watanabe, % # 21, 1859 (May 1, 2005). B. B. Wang, V. Brendel, # 9 - & @ - 103, 7175 (May 2, 2006). W. H. Li, Z. Gu, H. Wang, A. Nekrutenko, 9 # 409, 847 (2001). S. Maere , # 9 - & @ - 102, 5454 (2005). I. Dondoshansky, Y. Wolf, in 9" ( %D # $ ' + > . S. M. Van Dongen, Ph.D., University of Utrecht (2000). S. H. Shiu, J. K. Byrnes, R. Pan, P. Zhang, W. H. Li, # 9 - & 103, 2232 (2006). S. H. Shiu, M. C. Shih, W. H. Li, ,* * 139, 18 (2005). S. R. Eddy, % # 14, 755 (1998). J. D. Storey, R. Tibshirani, # 9 - & @ - 100, 9440 (Aug 5, 2003). W. Plaxton, E ' D % ,* * & ) # * 47, 185 (1996). H. H. Kirch, D. Bartels, Y. Wei, P. S. Schnable, A. J. Wood, +# & 9, 371 (2004). J. Hyams, C. Campbell, " ( E 9, 841 (1985). S. Dutcher, " ## . ) # * 6, 634 (2003). M. Kasahara, T. Kagawa, S. Yoshikatsu, K. Tomohiro, M. Wada, ,* 135, 1 (2004). G. Choi , 9 # 401, 610 (1999). T. Imaizumi, A. Kadota, M. Hasebe, M. Wada, " 14, 373 (2002). F.)Y. Bouget, F. Corellou, M. Moulager, C. Schwartz, L. Garnier, paper presented at the FESPB, France 2006. M. Shimizu, K. Ichikawa, S. Aoki, , ,* E " 324, 1296 (2004). O. Zobell, G. Coupland, B. Reiss, 7, 266 (2005). S. Richardt, D. Lang, W. Frank, R. Reski, S. A. Rensing, ,* * 143, 1452 (2007). K. Katoh, K. Kuma, H. Toh, T. Miyata, 9 - & E 33, 511 (2005). A. Bateman ,9 - & E # , 32 Database issue, D138 (Jan 1, 2004). A. Stamatakis, % # 22, 2688 (2006). F. Abascal, R. Zardoya, D. Posada, % # 21, 2104 (2005). 34