Supporting Online Material for Rensing et al. 2008,
Science 319, 64 (2008)
A) Materials, Methods and Analysis
DNA was isolated from cultures derived from a single spore of the Gransden wild-type strain in 2004
(Gransden 2004 strain). Nine-day-old protonemal tissue was grown on BCD + ammonium tartrate medium
overlaid with cellophane. Tissue was frozen in liquid nitrogen and ground to a coarse powder in a mortar and
pestle. Nuclei were isolated from the frozen powder using the methods of Luo and Wing. The nuclear pellet
was suspended in the residual buffer (1 ml) and served as the starting material for DNA isolation. The DNA was
extracted using the Nucleon Phytopure plant DNA extraction kit (RPN 8511) from Amersham Bioscience.
The initial data set was derived from 11 whole-genome shotgun (WGS) libraries: two with an insert size
of 2-3 Kbp, four with an insert size of 6-8 Kbp, and five with an insert size of 35-40 Kbp. The reads were
screened for vector using cross_match, then trimmed for vector and quality. Reads shorter than 100 bases after
trimming were excluded. Data sets before and after trimming are described below:
Library      Reads (raw)              Sequence (raw), Mbp
2-3 Kbp      2,968,735 (3,312,360)    2,133 (3,466)
6-8 Kbp      3,351,584 (3,567,314)    2,539 (3,588)
35-40 Kbp    411,741 (508,990)        245 (523)
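The length filter and the read counts above can be illustrated with a short sketch. The function names and the dictionary layout are illustrative assumptions and are not part of the original pipeline; only the 100-base threshold and the read counts come from the text.

```python
# Minimal sketch: drop reads shorter than 100 bases after trimming and
# recompute the share of raw reads retained per library.
MIN_LENGTH = 100  # threshold stated in the text


def keep_read(trimmed_seq, min_length=MIN_LENGTH):
    """Return True if a trimmed read is long enough to be kept."""
    return len(trimmed_seq) >= min_length


# Read counts from the table: library -> (after trimming, raw)
READS = {
    "2-3 Kbp": (2_968_735, 3_312_360),
    "6-8 Kbp": (3_351_584, 3_567_314),
    "35-40 Kbp": (411_741, 508_990),
}


def retained_fraction(library):
    """Fraction of raw reads that survived screening and trimming."""
    trimmed, raw = READS[library]
    return trimmed / raw


for name in READS:
    print(f"{name}: {retained_fraction(name):.1%} of raw reads retained")
```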
The data were assembled using release 2.9.3 of Jazz, a WGS assembler developed at the Joint
Genome Institute ( ). A word size of 15 was used for seeding alignments between reads. The unhashability
threshold was set to 40, preventing words present in the data set in more than 40 copies from being used to
seed alignments. A mismatch penalty of -30.0 was used, which will tend to assemble sequences that are more
than about 97% identical. The assembly is represented by 2,106 scaffolds, the N50 being 111 scaffolds, the L50
1.32 Mbp. The largest scaffold is 5.39 Mbp in size; the total scaffold length is 480 Mbp and contains 5.4% gaps.
In addition to the nuclear genome, we built 215 chloroplast and 25 mitochondrion scaffolds in the released
assembly. The sequence depth derived from the assembly is 8.63 ± 0.10. To estimate the completeness of the
assembly, a set of 251,086 ESTs was aligned to both the unassembled trimmed data set and the assembly
itself. A total of 247,484 ESTs (98.6%) were covered to more than 80% of their length by the unassembled data,
while 247,613 ESTs (98.6%) yielded hits to the assembly. Based on the presence of start and stop codons,
4,517 genes (29%) are putatively full-length.
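The two half-assembly statistics quoted above can be computed as in the following sketch. It follows the naming used in this text (N50 as a scaffold count, L50 as a length), which differs from the usage in some other publications; the function name and the toy lengths are hypothetical.

```python
def n50_l50(lengths):
    """Return (count, length) at the half-assembly point.

    Naming follows the text above: the first value is the number of
    largest scaffolds that together cover at least half of the total
    length, the second is the length of the smallest of those scaffolds.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("no scaffold lengths given")


toy_lengths = [50, 40, 30, 20, 10]  # hypothetical scaffold lengths
print(n50_l50(toy_lengths))
```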
Several genome analysis, gene prediction, and annotation methods were integrated into the JGI
annotation pipeline to annotate the genome of Physcomitrella patens. First, predicted transposable elements
were masked in the genome assembly using RepeatMasker ( ) and a repeat library composed from a non-redundant
set of (i) overrepresented oligonucleotides identified during the assembly process, (ii) fragments of draft
gene models homologous to known transposable elements, and (iii) manually curated repeats. Second, gene
models were built using several approaches. Initially, 3,154 putative full length genes with ORFs of 150 bp or
longer were derived from 31,951 clusters of P. patens ESTs and mapped to the genomic sequence. Next,
protein sequences from GenBank and IPI ( ) were aligned against the scaffolds using BLASTX ( ) and
post-processed to co-linearize high scoring hits and to select the best non-overlapping set of BLAST alignments.
These alignments were used primarily as seeds for the gene prediction tools Genewise ( ) and Fgenesh+ ( ).
All resulting Genewise models were then extended to include the nearest 5' methionine and 3' stop codons.
Subsequently, gene models were predicted using Fgenesh ( ) with parameters derived from training
using known P. patens genes. In addition, 220,055 ESTs and the consensus sequences of their clusters were
aligned with the scaffolds using BLAT ( ) and used to extend and correct predicted gene models where exons
in the ESTs/cDNAs overlap and extend the gene model into flanking UTR. Over 225,000 putative gene models
were generated using the above mentioned gene predictors. Their translated amino acid sequences were
aligned against known proteins from the NCBI non-redundant set and other databases such as KEGG ( ). In
addition, each predicted model was analyzed for domain content/structure using InterproScan ( ) with a suite of
tools such as Blast/HMM/ScanRegEx against the domain libraries Prints, Prosite, PFAM, ProDom and SMART.
Finally, to produce a non-redundant set of 35,938 gene models, for every locus with overlapping models, the
"best" model was selected according to homology with known proteins and EST support. Annotations for this set
of genes were summarized in terms of Gene Ontology ( ), eukaryotic clusters of orthologs, KOGs, ( ) and
KEGG pathways ( ). Predicted gene models and their annotations were further manually curated and
submitted to GenBank.
The average/median protein lengths are 363 aa/300 aa. The average/median transcript lengths are
1,196 bp/1,215 bp. 30,170 (84%) of the predicted proteins appear complete, based on the presence of start and
stop codons; 4,517 genes (29%) are putatively full length (contain both 5' and 3' UTR). The majority of predicted
genes are supported by various types of evidence: 35% of genes are supported by 220,055 P. patens ESTs and
full length cDNAs; 37% are homologous to Swissprot proteins (table S4). Additionally, 12,129 genes (34%) were
annotated in terms of Gene Ontology (GO) ( ), 15,932 (44%) were assigned to eukaryotic orthologous groups
(KOGs) ( ), and 789 distinct EC numbers were assigned to 4,110 (11%) proteins mapped to KEGG pathways
( ).
Sequences originating from sources other than the intended organism are a common problem of large-scale sequencing
projects. An obvious strategy to isolate such contaminant sequences is to determine their identity or homology
to sequences of already sequenced organisms. The success of this approach relies on the availability of
genomic sequence data of the contaminant or close relatives.
A distribution plot of P. patens scaffold G/C content colored with the taxonomic information gathered by
MegaBLAST searches revealed a suspicious secondary peak, which was used to exclude scaffolds of obvious
prokaryotic origin. However, some candidate genes from the remaining P. patens main genome scaffolds could
not be amplified from genomic DNA, indicating remaining contaminants. In order to identify the scaffolds
representing the contamination, we collected multiple parameters describing the scaffolds (EST alignment
evidence, taxonomic information, gene model statistics, scaffold length, G/C content). Analysis of the taxonomic
information gathered previously indicated the genus
. Thus, we used a
model to predict open
reading frames on all scaffolds and annotated the predicted peptides by homology. Manual inspection revealed
operon-like structures for suspected contaminant scaffolds and nearly no or only fragmentary ORFs for true
P. patens scaffolds. In total, 27 parameters were used in a multivariate analysis, combining principal component
analysis (PCA) and k-means clustering. Using this method, we were able to define four different fractions in the
scaffolds (fig. S7).
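The clustering step can be illustrated with a deliberately small sketch. It is not the analysis used for the paper: the PCA step is omitted, only two hypothetical features stand in for the 27 scaffold parameters, and the k-means routine is a plain Lloyd iteration seeded with the first k points.

```python
import math


def standardize(rows):
    """Scale every feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def kmeans(points, k, iterations=50):
    """Plain Lloyd k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels


# Hypothetical scaffolds: (G/C content, fraction covered by EST alignments)
scaffolds = [(0.34, 0.20), (0.62, 0.00), (0.35, 0.25), (0.60, 0.01), (0.33, 0.22)]
labels = kmeans(standardize(scaffolds), k=2)
print(labels)
```

Standardizing first keeps features on different scales (for example scaffold length versus G/C content) from dominating the distance calculation.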
The predictions from the analysis were tested experimentally. A total of 24 primer pairs were designed
to test the separation of the clusters and to probe for the source of contamination. Based on these data we were
able to confirm that cluster 2 accurately represents a bacterial contamination derived from an unknown
species. By using the primers on the original DNA that was used to create the sequencing libraries we confirmed
that this DNA was contaminated. However, there was in silico and wet-lab evidence for further contaminations
within cluster 3a and 3b. Initial evidence suggested that these sequences may originate from some mislabeled or
switched plates, i.e. that organisms sequenced at the same time as P. patens pollute the data to some extent.
We therefore carried out megaBLAST searches with the P. patens scaffolds against the publicly available
microbial genomes that have been sequenced by JGI. There is evidence for several bacterial species
contributing to scaffolds within cluster 3a/b.
In order to finish the v1.0 genome release, all 407 scaffolds belonging to cluster 2 were removed. In
addition, 23 further scaffolds identified as contaminated by megaBLAST/PCR were removed. Using this
procedure the P. patens partition represented in JGI's genome browser was cleared of the detected
contaminants.
Version 1.1 of the P. patens genome assembly and annotation can be accessed through the JGI
Genome Portal at http://www.jgi.doe.gov/Physcomitrella, where manual curation of this genome continues. The
data are stored in a MySQL database with an interactive genome portal interface that allows a distributed group
of international collaborators to view the genome, predictions, supporting evidence and other underlying data
and make decisions about a particular transcript in any given pathway, gene family or system. This Whole
Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the project accession
ABEU00000000. The version described in this paper is the first version, ABEU01000000. Protein encoding
genes are identified by a unique, six digit number.
An approach based on RECON ( ) was used to identify potential repetitive elements within the P. patens
genome sequence by virtue of their abundance within the assembly. RECON identifies potential
repeat elements and attempts to group identified elements into related families; RECON does not rely on, nor is
influenced by, collections of known repeats or similarity searches to known sequences. An iterative approach
was taken: abundant sequence elements were identified within a 35 Mbp portion of the genome, a second 35
Mbp portion was added to the first, and the combined collection of 70 Mbp was masked with the elements
identified within the first 35 Mbp portion. New elements were identified within the unmasked regions of the 70
Mbp portion, and these were combined with the first set of repeat elements and used to mask the collection of
sequences representing the previous 70 Mbp of P. patens genome plus an additional 35 Mbp portion. This
process was continued until all portions of the P. patens genome assembly had been assessed. The entire
collection of identified elements, their lengths, and their family groupings are represented in table S20.
Distributions of family sizes (A) and identified element sizes (B) are plotted in fig. S8. The scatter plot of
family element number vs. element length (fig. S8C) demonstrates that most families comprise few elements of
modest size (~1 kbp). While families with many members (>100) are present, larger families tend to have smaller
element lengths. The number of repetitive nucleotides is 79,373,843 (16.3%).
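The iterative scheme described above (mask with what is already known, search only the unmasked remainder, then add the next portion) can be sketched as a loop. The repeat finder and the masking function below are toy stand-ins and are not RECON or RepeatMasker; all names and the example sequences are hypothetical.

```python
from collections import Counter


def iterative_repeat_discovery(portions, find_repeats, mask):
    """Grow a repeat library portion by portion.

    portions     -- list of sequence strings (35 Mbp each in the text)
    find_repeats -- callable(sequence) -> set of repeat elements
    mask         -- callable(sequence, library) -> masked sequence
    """
    library = set()
    seen = ""
    for portion in portions:
        seen += portion                   # add the next portion
        masked = mask(seen, library)      # hide elements already known
        library |= find_repeats(masked)   # search only the unmasked rest
    return library


def toy_find_repeats(seq, k=4, min_copies=2):
    """Toy stand-in: report k-mers present at least min_copies times."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {w for w, n in counts.items() if n >= min_copies and "N" not in w}


def toy_mask(seq, library):
    """Toy stand-in: replace every known element with Ns."""
    for element in library:
        seq = seq.replace(element, "N" * len(element))
    return seq


portions = ["ACGTTTACGT", "GGCCAAGGCC"]  # hypothetical genome portions
found = iterative_repeat_discovery(portions, toy_find_repeats, toy_mask)
print(sorted(found))
```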
LTRs were detected by different methods (table S21). The Method A pipeline uses LTRseq ( ) to
identify LTRs followed by a HMMer search of transposable element (TE)-related domains. Method A found 4,795
full-length LTR retrotransposons, including several nested copies, all of which have at least one TE-related
domain. Those that have reverse-transcriptase domains followed by an integrase domain in their internal
region were classified as "Gypsy"; those with the integrase domain followed by the reverse-transcriptase domain
were classified as "Copia"; the rest were classified as "Unknown". Method B used the program
LTR_STRUC ( ) with default parameters. Method C1 also relies on LTR_STRUC, but avoids the splitting of
sequences after N>5 stretches, which occur often in unfinished genome sequences. Under these conditions
LTR_STRUC yielded 1,204 full-length LTR sequences, which were classified by a HMMer
(http://hmmer.janelia.org) search for typical retrotransposon protein domains (GAG, PR, INT, RT). 1,080 (90%)
of them remained after overlap removal and a quality check by the following criteria: the existence of at least one
retrotransposon protein domain, simple sequence percent <=20, inner N percent <=30, soloLTR percent <=2,
left + right soloLTR length <=80 percent of sequence length. They cover 2% (9.7 Mb) of the P. patens genome.
According to their protein signatures, 43% could be assigned to the gypsy and 4% to the copia LTR type; the
remainder are ambiguous (table S21, S22).
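The domain-order rule used by Method A can be written down in a few lines. This is a sketch only: the representation of an element as an ordered list of domain labels is an assumption, with the labels RT and INT taken from the domain names listed above.

```python
def classify_ltr(domains):
    """Classify an LTR retrotransposon by the order of RT and INT.

    domains -- domain labels of the internal region in 5'->3' order,
               e.g. ["GAG", "PR", "RT", "INT"] (hypothetical input format)
    """
    if "RT" in domains and "INT" in domains:
        if domains.index("RT") < domains.index("INT"):
            return "Gypsy"   # reverse transcriptase followed by integrase
        return "Copia"       # integrase followed by reverse transcriptase
    return "Unknown"         # one or both domains missing


print(classify_ltr(["GAG", "PR", "RT", "INT"]))
print(classify_ltr(["GAG", "PR", "INT", "RT"]))
print(classify_ltr(["GAG"]))
```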
Diverged LTR elements and their fragments were detected by RepeatMasker Open-3-1-7 ( ) using a
non-redundant set of the novel method C1 P. patens LTR retrotransposons as repeat library (1,060 sequences,
9.5 Mb). The evolutionary distance between 5' and 3' soloLTR was calculated from a ClustalW alignment by the
emboss distmat package using the Kimura two parameter method. For the conversion of distance to insertion
age, a substitution rate of 1.3E-8 was used. Data integration, final annotation and data extraction were carried
out with the ANGELA (Automated Nested Genetic Element Annotation) pipeline (manuscript in preparation) (fig.
S2). 2,108 full length LTRs were detected by similarity to the LTR retrotransposons library in addition to the
1,080 from LTR_STRUC, thus adding up to 3,188 full length LTRs for which the insertion age could be
calculated (average age 3.3 million years, median 3.0). 12% of those full length LTRs are fragmented by the
insertion of another LTR element. They generally represent an older fraction of the full length LTRs with average
and median ages of 4.6 and 4.3 million years (tables S9, S22).
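The age estimate can be sketched as follows. The text gives only the distance method and the rate; the conversion T = K / (2r), and the reading of the rate as substitutions per site per year, are assumptions based on the usual treatment of two LTRs diverging independently after insertion. The function names are hypothetical, and this is not the emboss distmat implementation.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]  # skip gaps and ambiguity codes
    n = len(pairs)
    # p: proportion of transitions, q: proportion of transversions
    p = sum((a, b) in TRANSITIONS for a, b in pairs) / n
    q = sum(a != b and (a, b) not in TRANSITIONS for a, b in pairs) / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age(distance, rate=1.3e-8):
    """Insertion age in years, assuming T = K / (2r)."""
    return distance / (2 * rate)


d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")  # one transition in ten sites
print(round(d, 4), round(insertion_age(d)))
```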
About half of the P. patens genome consists of LTR retrotransposons (157,127 elements, 51.3% of the
sequence length). Only 5% (3,188) of them still exist as intact full-length elements; the remainder are diverged
and partial remnants, often fragmented by mutual insertions. Nested regions are very common, with 14% of
the LTR elements inserted into another LTR element (table S9).
Helitron transposable elements were sought by structural criteria as follows: the program searches for
Helitron 3' end structures, and then aligns any cases where the same structure is found more than once. If this
alignment indicates additional Helitron properties (e.g. insertion within 5'-AT-3', extension of homology into the 5'
direction, etc.), then the element is judged to be a Helitron.
PASA ( ) was used to identify all potential AS events based on the qualified EST/genome alignments
generated by GMAP ( ) (Criteria: maximum intron length = 4kb, minimal percentage of cDNA aligned = 80%,
minimal average percentage of alignment identity = 97%). To make our results comparable to Wang and
Brendel ( ), only five splicing events used in their study (AltA, AltD, AltP, IntronR, and ExonS) were included for
further analyses. In total, 27,055 potential gene models were detected by EST to genome alignments and
subsequently analyzed.
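The three alignment criteria listed above amount to a simple filter. The record type below is a hypothetical summary of one EST/genome alignment and does not reflect the actual GMAP or PASA output formats.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one EST/genome alignment."""
    max_intron_length: int    # longest intron in the alignment, bp
    percent_aligned: float    # share of the cDNA that aligned
    percent_identity: float   # average alignment identity


def is_qualified(aln):
    """Apply the criteria from the text: intron <= 4 kb, >= 80% aligned, >= 97% identity."""
    return (aln.max_intron_length <= 4000
            and aln.percent_aligned >= 80.0
            and aln.percent_identity >= 97.0)


print(is_qualified(Alignment(3500, 85.0, 98.2)))
print(is_qualified(Alignment(4500, 95.0, 99.0)))
```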
Based on PASA, 21.4% of the analyzed genes show alternative splicing (AS, table S6), a frequency similar
to that in A. thaliana and O. sativa ( ). Most AS events in P. patens use an alternative acceptor, rather
than retaining an intron in the mRNA. Only 7.1% of P. patens genes have intron retention events, in contrast to
A. thaliana (14.3%) and O. sativa (14.6%). Longer introns and/or shorter exons in P. patens may favor splicing
primarily by exon definition (as in humans) rather than by intron definition, which is implied by the larger number
of intron retention events seen in O. sativa and A. thaliana. Exon skipping events are the dominant alternative
splicing isoform in humans (~50%), but are rare in plants, including P. patens, A. thaliana, and O. sativa.
We first identified all paralogs according to the criteria used in Li et al. ( ), and calculated the Ks values
of each paralogous gene pair following the method described in Maere et al. ( ). Since i-ADHoRe runs on
whole assembled chromosomes, we concatenated all the scaffolds into 25 'pseudo-linkage groups', each
separated by stretches of Ns.
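The concatenation step can be sketched as below. The text does not say how scaffolds were assigned to groups or how long the N stretches were, so the round-robin assignment and the spacer length are assumptions made only for illustration.

```python
def build_pseudo_groups(scaffolds, n_groups=25, spacer_length=1000):
    """Join scaffolds into pseudo-linkage groups separated by runs of N.

    scaffolds     -- list of scaffold sequences
    n_groups      -- number of groups (25 in the text)
    spacer_length -- length of each N stretch (assumed value)
    """
    spacer = "N" * spacer_length
    groups = [[] for _ in range(n_groups)]
    for i, seq in enumerate(scaffolds):
        groups[i % n_groups].append(seq)  # round-robin assignment (assumption)
    return [spacer.join(group) for group in groups if group]


print(build_pseudo_groups(["AA", "CC", "GG"], n_groups=2, spacer_length=3))
```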
As TAGs consist of gene family members and thus are paralogs, we started to detect tandemly arrayed
genes by clustering the protein sequences of the P. patens gene models. In a first step, paralogous proteins
were detected using the clustering software BLASTCLUST ( ) with stringent parameters (minimum 75%
identity and 80% length coverage). The resulting gene models were filtered using homology support, and genes
associated with transposable elements (TIGR Plant Repeat Database Project and Repbase) as well as genes
with a high proportion of polyN-stretches and with internal stops were excluded. A maximum of ten spacer genes
was allowed. Details about the TAG clusters are presented in fig. S4. As the fragmentation of the current
genome release could impact the detection of TAGs, we calculated the average density of TAGs in the N50
scaffolds per Mbp. Based on these data the genome was predicted to contain ~190 TAGs, while 201 were
observed, which is not a significant deviation. Therefore, the fragmentary nature of the genome assembly seems to
have no impact on the TAG detection process. KEGG annotation of the TAG genes revealed that 44% of all
photosynthetic antenna proteins are encoded by TAGs.
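The spacer rule can be illustrated with a small sketch that walks along one scaffold and groups paralogs separated by at most ten intervening genes. The input format (one family label per gene, in scaffold order) and the function name are assumptions; the filtering steps described above are not reproduced.

```python
def find_tags(gene_families, max_spacers=10):
    """Group same-family genes separated by at most max_spacers genes.

    gene_families -- family label of each gene in scaffold order
                     (None for genes without paralogs)
    Returns a list of arrays, each a list of gene positions.
    """
    current = {}   # family -> index of the array it currently extends
    arrays = []
    for pos, family in enumerate(gene_families):
        if family is None:
            continue
        idx = current.get(family)
        # Number of genes lying between this gene and the previous member.
        if idx is not None and pos - arrays[idx][-1] - 1 <= max_spacers:
            arrays[idx].append(pos)
        else:
            arrays.append([pos])
            current[family] = len(arrays) - 1
    return [a for a in arrays if len(a) > 1]


# Hypothetical scaffold: two "A" genes close together, a third far away.
families = ["A", "B", "A"] + [None] * 11 + ["A", "B"]
print(find_tags(families))
```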
To determine the degree of lineage-specific gains among P. patens gene families and to address the
question whether genes with certain domains tend to expand at higher rates than others in P. patens, we
identified gene families based on similarities between protein sequences from P. patens and A. thaliana and
defined orthologous groups (OGs) where each group represents an ancestral gene common to the P. patens
and the A. thaliana lineages and contains genes derived from speciation and all subsequent duplication and
retention events. Based on the E values in all-against-all BLAST ( ) searches of P. patens and A. thaliana
protein sequences, we defined similarity clusters with Markov Clustering ( ) and found 5,456 clusters which
were identified in both P. patens and A. thaliana. In each cluster (referred to as gene family), OGs were defined
both based on phylogenetic tree topology ( ) (referred to as tree-based) and based on an iterative search
algorithm applied on a sequence similarity matrix ( ) (referred to as similarity-based). No apparent bias was
introduced by using the NJ method for tree inference, as only 0-10% differences in the number of gains and
losses were found when comparing the results of Bayesian inference on several gene families. Each OG
represents a single ancestral gene from the progenitor of P. patens and A. thaliana and all lineage-specific
duplicates of this ancestral gene. To determine whether genes with certain protein domains tend to expand at
higher rates than expected randomly, we identified domains with HMMER 2.3.2 ( ) based on Release 20.0
of the Pfam database (Pfam_ls; www.sanger.ac.uk/Software/Pfam). Domains with significant lineage-specific
expansion were identified by determining if the number of genes in expanded OGs is significantly higher than in
unexpanded OGs in each domain family with a χ² test ( ). The p values were corrected for multiple testing
with the Q-value software based on false discovery rates ( ). To rule out the possibility that some of the
two-component genes may be bacterial or fungal contaminants, we eliminated genes annotated as two-component
regulators that are more similar to bacterial or fungal genes than they are to plant genes. Even after applying this
conservative criterion, there is still significant over-representation of HisKA and response regulator domain
containing genes in P. patens.
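A χ² test of this kind can be sketched for a single domain family as a 2x2 table. The table layout (genes with or without the domain against expanded or unexpanded OGs) is an assumption, since the text does not spell it out; the counts are hypothetical, and the multiple-testing correction is not shown.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 d.f.) for the table [[a, b], [c, d]].

    Assumed layout: rows = genes with / without the domain,
    columns = genes in expanded / unexpanded OGs.
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    stat = sum((o - e) ** 2 / e for o, e in zip((a, b, c, d), expected))
    # For one degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p = chi_square_2x2(30, 10, 10, 30)  # hypothetical counts
print(round(stat, 2), p)
```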
The aldehyde dehydrogenase (ALDH) superfamily is involved in osmotic protection, NADPH
generation, aldehyde detoxification, and intermediary metabolism ( ). The ALDH superfamily comprises 14
genes in 9 protein families in A. thaliana, and 20 genes in 10 protein families in P. patens. At least two
protein families are not found in other eukaryotic genomes. P. patens has members within 8 of the 9 protein
families found in A. thaliana, and three of these protein families are expanded in P. patens. The expansion and
variety of ALDH gene members suggest that their presence results in an active and robust γ-aminobutyric acid
(GABA) shunt metabolic pathway and the GAPN glycolytic bypass ( ).
The WRKY transcription factor family, regulating responses to stress and a number of developmental processes
in angiosperms, is expanded in P. patens (40 members) as compared to unicellular algae (no more than three
genes), while angiosperms typically contain 75-125 members (table S13).
Many algae and bryophytes share the ancestral trait of having flagellated male gametes, although this
trait has been lost in flowering plants ( ). Consequently, proteins for delta and epsilon tubulins, required for
forming the basal bodies of flagella ( ), are found in P. patens (St 93, 94). Genes were also found for most
proteins of the inner, but not the outer dynein arms (St 91, 92), which are the motors for the motility of flagella.
This observation suggests a lack of outer arms in flagella, as has been shown to be the case for other land
plants ( ). Cytoplasmic dynein genes and their regulatory dynactin complex genes are absent, suggesting that
the dynein-mediated transport system was probably lost in or prior to the last common ancestor of P. patens and
flowering plants.
In vascular plants, photomorphogenic signals are perceived by three sensory photoreceptor families:
phytochrome, cryptochrome and phototropin. P. patens possesses four canonical phototropins, UV/A-blue light
photoreceptors that help optimize photosynthesis in shade while avoiding damage in sunlight ( ). P. patens
has seven phytochromes, more than any organism reported to date. Of the potential phytochrome partners,
neither FHY1, PIF3 nor the PKS family of phytochrome-interacting proteins is present in P. patens, whereas
two copies of NDPK2, implicated in phytochrome signaling in vascular plants ( ), are represented. UV/A-blue
light sensitive cryptochromes and the related photolyase DNA-repair family are represented in all known bacteria
and eukaryotes. Accordingly, in addition to two HY4-like cryptochrome photomorphogenic photoreceptors ( ),
P. patens has one UVR3-like 6-4 photolyase, one ssDNA CRY3-like and several dsDNA PHR-like cyclobutane
pyrimidine dimer photolyases that restore nucleotide structure with the help of UV/A-blue light following
UV/B-induced damage.
Circadian oscillators are found in most organisms, and genes related to TOC1/PRR pseudo-response
regulators (St 69) and LHY/CCA1 single-myb domain transcription factors (St 30) of flowering plant clocks are
present in both P. patens
and .
#
&.
#
. In terms of interpretation of seasonal cues, P. patens
has sequences related to the key photoperiodic regulators CONSTANS (St 69) ( ), ( ), and FT (St 74),
as well as the CONSTANS-regulating cycling DOF factors (St 19), but not their downstream targets. Thus, these
signaling pathways appear to have an ancient origin, with the evolution of specific downstream targets occurring
later, after the divergence from the last common ancestor of land plants.
In order to accurately describe the evolutionary history of the gene families discussed, phylogenetic
inference was performed. The overall pipeline approach to construct gene families starting from candidate
queries was carried out as previously described ( ). The non-redundant search space used for the PSI-BLAST
( ) searches consisted of the predicted proteins of 45 completely sequenced genomes covering organisms from
all super kingdoms, with special focus on plants and algae (table S23).
Using at most four PSI-BLAST iterations, the database was searched for candidate gene family
members (E-value cutoff 1E-4; hit inclusion cutoff 1E-5). The resulting hits were filtered based on 35% identity
and 80 amino acids hit length. Overlapping filtered result sets were merged to recover family relations by single
linkage clustering using a stringent hit-coverage-based distance measure (>=80 aa overlap on the shared hit).
Neighbor joining trees inferred from the automatically generated clusters were manually checked and curated if
necessary by reduction to the subfamily of interest or subclustering by splitting the cluster into multiple
subfamilies. In the latter case, the original cluster id was extended (e.g. 58_A and 58_B). Based on the manually
curated gene families, multiple alignments were calculated using MAFFT L-INSI ( ). In the case of the WRKY
and B3 families, which are defined by a short protein domain and thus are difficult to represent by phylogenies
based on whole protein alignments, the corresponding PFAM ( ) domain (PF03106 and PF02362) HMMer
(http://hmmer.janelia.org) fs profile was used to extract the conserved domain sequence from the gene family
members using hmmpfam with the trusted cutoff. The domain sequences were aligned using MAFFT L-INSI.
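The filtering and merging steps can be sketched as follows. The tuple layout of a hit is a hypothetical format, and the merge rule is simplified: result sets are joined whenever they share any member, whereas the text uses a hit-coverage-based distance (>=80 aa overlap on the shared hit) that is not reproduced here.

```python
def filter_hits(hits, min_identity=35.0, min_length=80):
    """Keep hits passing both cutoffs from the text.

    hits -- iterable of (query, subject, percent_identity, hit_length_aa)
    """
    return [h for h in hits if h[2] >= min_identity and h[3] >= min_length]


def single_linkage(result_sets):
    """Merge result sets that share at least one member (simplified rule)."""
    clusters = []
    for members in result_sets:
        members = set(members)
        overlapping = [c for c in clusters if c & members]
        for c in overlapping:
            members |= c          # absorb every cluster that overlaps
            clusters.remove(c)
        clusters.append(members)
    return clusters


hits = [("q1", "p1", 42.0, 120), ("q1", "p2", 30.0, 200), ("q1", "p3", 60.0, 50)]
print(filter_hits(hits))
print(single_linkage([{"a", "b"}, {"c"}, {"b", "c"}, {"d"}]))
```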
Maximum likelihood tree topologies were created from the final gene families using the RAxML software
( ). For each multiple alignment, the optimal evolutionary model was selected using the ProtTest software ( ).
The best-known likelihood (BKL) tree was selected from a PROTMIX tree search with 100 randomized maximum
parsimony starting topologies, optimization of individual site substitution rates, classification of four discrete rate
categories, and final evaluation using the previously selected model of rate heterogeneity with full parameter
estimation. The BKL tree topology was annotated with confidence (bootstrap) values derived from a multiple
non-parametric bootstrap approach using the PROTCAT procedure and the family-specific model. All generated
trees were mid-point rooted at the longest internal branch, annotated with species information and stored in NHX
format. The annotated tree topologies can be accessed and viewed using the ATV java applet via
http://www.cosmoss.org/bm/supplementary_trees/Rensing_et_al_2007/
B) Authorship
The 70 authors were divided into three tiers. The first tier (1-23) comprises those scientists who
contributed directly to the production of the sequences, their assembly, annotation and analyses, and to the
writing of the paper. Their order is according to the extent of their contribution, the first author making the
greatest contribution overall. The second tier (24-61) is composed of authors arranged alphabetically who
analyzed characteristics of the assembled genome, specific genes and gene families described in the main text.
The third tier (62-70) is composed of authors who assisted in and facilitated the writing of the paper and had
administrative/contact responsibility at the Joint Genome Institute and at the laboratories of the members of the
Moss Genome Consortium (www.mossgenome.org). The corresponding author had a major role in facilitating
and organizing the final assembly of the authors, annotators and writers of this manuscript.
C) Figure Legends
Figure S1:
LTR-retrotransposon length distribution (LTR_STRUC) of P. patens, A. thaliana and rice
Length distributions of the full length LTR retrotransposons for P. patens, A. thaliana and rice as predicted by the
LTR_STRUC software. The blue vertical line indicates the arithmetic mean.
Figure S2:
Nesting architecture and spatial distribution of selected repeat elements
The Apollo Genome Viewer is used with customized color codes for the selective visualization of genetic
elements. Tier 1: ANGELA repeat annotation with nesting display. Tier 2: transposon protein domains. Tier 3:
full length LTR retrotransposons with age color code. Tier 4: solo LTRs.
Figure S3:
P. patens Ks distribution plot
Age distribution of paralogous genes. The height of the bars reflects the number of gene pairs in the respective
bin relative to the total number of Ks values in the distribution.
Figure S4:
Tandemly arrayed gene (TAG) properties
Distribution of 10 tandemly arrayed gene properties: cluster_size (number of paralogous
genes; 75% identity and 80% coverage), original scaffold TAG size (number of genes in array on the same
scaffold, allowing unlimited intervening genes), scaffold TAG size (number of genes in array on the same
scaffold, allowing maximally 10 intervening genes; the following features refer to this stringent definition), delta
number of exons (number of divergent exons between TAG pairs), delta gene length (differences in gene length
between TAG pairs), delta CDS length (differences in coding sequence lengths),
orientation (strand orientation), number of genes in between (number of genes between TAG pairs), distance
(TAG pair distance in bp), distance excluding intermediate genes (TAG pair distance in bp excluding the lengths
of intervening genes).
Figure S5:
TAG functional annotation: Deviating KEGG pathways
Bar chart comparing the significantly deviating KEGG pathway annotations between the TAGs (light blue) and
the non-overlapping remainder of the genes (dark yellow). Differences were compared using Fisher tests
corrected for multiple testing using the Benjamini and Hochberg (BH) method as implemented in R.
Figure S6:
G-proteins of P. patens compared with other eukaryotes
A: For each of the green plant genomes, a box represents a gene present in the genome that encodes a small
G-protein of the indicated phylogenetic group. The closest human homolog is shown at the bottom.
B: Each of the organisms is represented by a column of boxes where each box represents a gene present in the
genome that encodes a SNARE (top) or SM-family protein (bottom), with the color of the box indicating the type
of SNARE protein (orange, Qa; purple, Qb; green, Qb+Qc; red, Qc; blue, R) or SM [brown, Sly1 (ER); cyan,
Vps45 (Golgi/endosomes); light green, Vps33 (vacuole/lysosome); violet, Sec1 (PM)]. Clusters are separated
into the three main functional units of the endomembrane system based upon homology with proteins of known
function in yeast, mammals and plants.
Figure S7:
Contaminant isolation using multivariate clustering analysis of 27 scaffold features
Multivariate clustering analysis of 27 scaffold features, combining principal component analysis (PCA) and
k-means clustering, allowed the isolation of prokaryotic contaminant sequences from the genome assembly.
Cluster 1 (red): true P. patens genomic regions; cluster 2 (blue): bacterial contaminant from a yet unsequenced
species introduced with the genomic DNA (removed entirely from the released assembly); cluster 3:
longer a) (green) / shorter b) (black) repetitive genomic regions (e.g. transposons) without protein coding genes
or EST evidence mixed with some longer a) (green) / shorter b) (black) bacterial sequences possibly introduced
by plate-switch or mis-labelling during sequencing (experimentally confirmed scaffolds were removed from the
released assembly).
Figure S8:
RECON repeat family analysis
A: Distribution plot of repeat family sizes as determined using the RECON repeat finder software.
B: Distribution plot of the average length (bp) of repeat families as determined using the RECON repeat finder
software.
C: Two-dimensional comparison of the RECON repeat families using their sizes (number of elements) and
average element length (bp).
D) Figures
Figure S1: LTR-retrotransposon length distribution (LTR_STRUC) of P. patens, A. thaliana and rice
Figure S2: Nesting architecture and spatial distribution of selected repeat elements (panels: 1.2 Mb of scaffold_4, 0.5 Mb of scaffold_2, 0.5 Mb of scaffold_1; tier 1: complete ANGELA repeat annotation with nesting; tier 2: transposon hmm domains; tier 3: full length LTRs, age color coded; tier 4: solo LTRs)
Figure S3: P. patens Ks distribution plot
Figure S4: Tandemly arrayed gene properties
Figure S5: TAG functional annotation: Deviating KEGG pathways
Figure S6: G-proteins of P. patens compared with other eukaryotes (panels A and B)
Figure S7: Contaminant isolation using multivariate clustering analysis of 27 scaffold features
Figure S8: RECON repeat family analysis (panels A, B and C)
15
Supporting Online Material for Rensing et al. 2008,
319, 64 (2008)
E) Tables
Table S1:
Transcript evidence resources used for genome annotation
Genome size (Mb)                  480
Known cDNA                        3,154
ESTs from NR                      120,702
ESTs from collaborators           96,133
EST clusters from JGI             31,951
Number of EST clusters aligned    31,146 (97%)
The above transcript evidence resources were mapped to the genome using BLAT and were used for genome
structure prediction.
Table S2:
P. patens v1.1 gene model support
Model Types                         Number    Percentage
Known genes                         210       1%
Models based on homology methods    13,150    37%
[label lost in extraction] genes    22,578    63%
Total genes                         35,938
Composition of the final set of gene models forming the released v1.1 genome annotation.
Table S3:
P. patens v1.1 gene properties
Model Statistics          Average
Gene length (bp)          2,389.42
Transcript length (bp)    1,195.77
Protein length (aa)       362.84
Exons per gene            4.87
Exon length (bp)          245.62
Intron length (bp)        310.57
Genes per Mbp             74.9
Some properties of the structure and organization of genes within the P. patens genome v1.1.
Table S4:
Functional annotation of the v1.1 gene models
Model Support                       Number    Percentage    Distinct Categories
Supported by multiple methods       3,754     10%
Supported by homology               13,360    37%
Models with EST support             12,593    35%
Models with Swissprot alignments    13,340    37%
Models with Pfam alignments         13,613    38%
Models with EC assignments          4,110     11%           789
Models with KOG assignments         15,932    44%           3,603
Models with GO assignments          12,129    34%           3,092
Outcome of the functional annotation of the v1.1 gene models using various data sources and methods.
Table S5:
V1.1 gene model quality
Model Quality                                   Number    Percentage
Multi-exon genes                                30,928    86%
Truncated (missing both 5'M and 3'*)            2,206     6%
Partial models (either 5'M or 3'*)              3,562     10%
Complete models (5'M and 3'*)                   30,170    84%
Models extend to either 5' or 3' UTR            8,418     23%
Complete models extend both to 5' and 3' UTR    4,517     13%
Six parameters assessing the v1.1 gene model quality. Completeness of gene models is measured by
considering the existence of a translation initiating 5’ methionine (5’M) and a 3’ terminal stop codon (3’*).
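The completeness categories of Table S5 can be sketched as a small classifier. This is an illustrative sketch only, not the annotation pipeline's code: the function name `model_completeness`, the use of the standard stop codons and the simple prefix/suffix check (which ignores reading frame and internal stops) are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear stop codons (assumed)

def model_completeness(cds: str) -> str:
    """Classify a coding sequence by presence of 5'M (ATG) and 3'* (stop)."""
    cds = cds.upper()
    has_start = cds.startswith("ATG")                         # 5'M
    has_stop = len(cds) >= 3 and cds[-3:] in STOP_CODONS      # 3'*
    if has_start and has_stop:
        return "complete"
    if has_start or has_stop:
        return "partial"
    return "truncated"

# Hypothetical toy sequences, one per category.
examples = {
    "ATGAAATAA": model_completeness("ATGAAATAA"),
    "ATGAAACCC": model_completeness("ATGAAACCC"),
    "AAACCCTAA": model_completeness("AAACCCTAA"),
    "AAACCCGGG": model_completeness("AAACCCGGG"),
}
```

Applied to the table, the three categories are mutually exclusive and add up to the full gene set: 30,170 + 3,562 + 2,206 = 35,938.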
Table S6:
Summary statistics of genome-wide alternative splicing in P. patens genes
Type of alternative splicing    Events            Genes*
AltA                            3,272 (28.1%)     1,446 (5.3%)
AltD                            2,822 (24.3%)     1,221 (4.5%)
AltP                            2,050 (17.6%)     761 (2.8%)
IntronR                         2,892 (24.9%)     1,913 (7.1%)
ExonS                           598 (5.1%)        465 (1.7%)
Total                           11,634            5,806 (21.4%)
Overview of the alternative splicing variants observed in P. patens using the PASA software. The number of
genes described refers to gene loci in terms of PASA subclusters (*).
Table S7:
RECON repeat family sizes and element lengths
                      Average    Low    High
Number of elements    10         1      857
Element size (bp)     1,292      300    43,280
Average and range of element numbers and sizes observed within the 1,381 repeat families identified. Only
families with a minimum of 10 elements were retained for analysis, but sequences shorter than 300 bp were not
used for masking or subsequent statistics; hence some families are ultimately represented by only one
sequence.
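The two filters described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `filter_families`, the input layout (family ID mapped to a list of element lengths), the order of the two filters, and the decision to drop a family with no remaining sequence are all hypothetical.

```python
MIN_FAMILY_SIZE = 10      # families need at least 10 elements (from the caption)
MIN_ELEMENT_LENGTH = 300  # sequences shorter than 300 bp are discarded (from the caption)

def filter_families(families):
    """families: dict mapping family ID -> list of element lengths in bp."""
    kept = {}
    for family_id, lengths in families.items():
        if len(lengths) < MIN_FAMILY_SIZE:
            continue  # family too small to be retained
        long_enough = [n for n in lengths if n >= MIN_ELEMENT_LENGTH]
        if long_enough:  # assumption: a family with nothing left is dropped
            kept[family_id] = long_enough
    return kept

# Hypothetical toy input: "big" keeps a single sequence after length filtering.
toy_families = {
    "big": [250] * 9 + [1200],
    "small": [500] * 3,
    "mixed": [100] * 5 + [400] * 6,
}
toy_kept = filter_families(toy_families)
```

The "big" family shows how a family that passed the size threshold can end up represented by only one sequence, as the caption notes.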
Table S8:
Composition and contribution of the 15 RECON repeat families
Repeat Family ID
Bases
represented
[bp]
Family
sizes
Mean
element
length [bp]
Largest
element
length [bp]
Smallest
element
length [bp]
Family hits
within the
genome
AT_rich#low_complexity
18,074,591
1)6
13,896,516
178
1,551.21
5,585
388
9,834
1)5
10,985,211
88
1,421.78
3,770
386
14,717
1)7
7,973,118
60
1,064.65
1,511
432
9,832
2)6
2,957,609
756
1,435.24
7,116
300
2,260
2)1
2,453,886
857
440.28
689
300
10,096
309,731
1)17
1,910,382
(TA)n#Simple_repeat
1,630,550
1)12
1,603,066
66
948.45
1,696
327
2,893
45,976
47
2,376.57
43,280
331
2,355
2)15
1,268,580
483
756.14
1,371
300
2,427
2)550
1,247,494
310
1,184.24
5,091
303
853
1)47
990,337
11
1,700.73
7,032
333
1,412
2)33
863,697
77
1,809.74
6,855
320
401
1)16
580,258
9
1,472.44
2,930
853
680
2)3
529,037
41
652.54
1,236
310
757
Overview of the individual family composition and their contribution to the repetitive fraction of the P. patens
genome.
Table S9:
Nesting level of transposable elements
Insert level    #          # [%]    Nucleotides [bp]    Nucleotides [%]
0               135,376    86.16    195,529,390         84.02
1               20,286     12.91    34,303,645          14.74
2               1,408      0.9      2,769,226           1.19
3               56         0.04     113,885             0.05
4               1          0        1,328               0
1-4             21,751     13.85    37,188,084          15.98
Sum             157,127    100      232,717,474         100
Level of nesting which was observed among transposable elements in the P. patens genome. Insert Level 0
means that the element is not inserted into another element. Level 1 elements are inserted into level 0 elements,
level 2 elements into level 1 elements and so on. The insertion of a child element into a parent element
fragments the parent into two parts.
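The insert-level definition in this caption amounts to counting how many ancestors an element has. A minimal sketch, assuming a hypothetical `parent_of` mapping from each inserted element to the element it sits in (top-level elements are simply absent from the mapping); the names and toy data are illustrative and no cycle checking is done.

```python
from collections import Counter

def insert_level(element, parent_of):
    """Number of parent elements enclosing `element` (0 = not nested)."""
    level = 0
    while element in parent_of:
        element = parent_of[element]
        level += 1
    return level

def level_histogram(elements, parent_of):
    """Count elements per insert level, as tabulated in Table S9."""
    return Counter(insert_level(e, parent_of) for e in elements)

# Hypothetical toy data: C is inserted into B, which is inserted into A.
toy_elements = ["A", "B", "C", "D"]
toy_parent_of = {"B": "A", "C": "B"}
toy_histogram = level_histogram(toy_elements, toy_parent_of)
```

For the table itself, the level counts add up as expected: 135,376 + 20,286 + 1,408 + 56 + 1 = 157,127 elements.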
Table S10:
Helitrons
Row groups in the original: intact, truncated.
id                from         to
scaffold_366_P    158,402      164,572
scaffold_65_P     445,133      451,276
scaffold_277_N    512,482      518,535
scaffold_201_P    88,530       94,573
scaffold_18_N     2,033,993    2,040,103
scaffold_42_N     1,958,341    1,964,492
scaffold_2_N      3,298,214    3,304,190
scaffold_5_P      159,282      169,833
scaffold_11_N     857,172      868,013
scaffold_14_P     326,051      348,053
scaffold_33_P     932,035      938,246
scaffold_70_P     1,408,895    1,420,173
scaffold_158_P    988,368      994,295
scaffold_183_N    632,506      638,670
scaffold_188_N    487,167      492,531
scaffold_250_N    429,279      442,590
scaffold_269_P    470,661      483,899
scaffold_295_N    218,887      225,071
scaffold_319_P    6,339        9,759
Loci of the single family of Helitrons (rolling-circle DNA transposons) found in the P. patens genome. P and N
represent positive or negative strand.
Table S11:
Comparison of tandemly arrayed genes (TAGs) to non-TAG genes
Feature                                     TAGs normality [p]  Gene models normality [p]  Wilcoxon rank sum test [p]  TAGs max  TAGs mean  TAGs median  TAGs min  TAGs σ    Gene models max  Gene models mean  Gene models median  Gene models min  Gene models σ
Gene length [bp]                            1.23E-55            0                          3.07E-29                    25,629.0  2,198.20   1,706.0      252.0     1,900.56  39,890.0         3,082.53          2,519.0             240.0            2,377.11
CDS length [bp]                             5.36E-48            0                          2.30E-11                    4,002.0   1,065.92   891.0        252.0     676.54    14,577.0         1,306.53          1,080.0             180.0            980.84
Exons                                       0                   0                          8.49E-41                    27.0      3.98       3.0          1         3.45      77.0             6.73              5.0                 1.0              5.76
Average exon length [bp]                    1.16E-82            0                          0                           1,965.0   420.44     312.7        73.2      361.32    4,176.0          308.08            189.5               50.2             329.32
Cluster size                                0                   0                          0                           25.0      5.11       4.0          2.0       4.26      25.0             1.99              1.0                 1.0              1.97
Introns length [bp]                         6.99E-204           0                          2.37E-37                    24,774.0  851.55     440.0        0         1,561.11  24,774.0         1,475.82          1,067.0             0                1,569.09
Average intron length [bp]                  0                   0                          1.42E-06                    12,387.0  253.26     197.8        0         661.93    12,387.0         243.10            227.2               0                259.91
Introns                                     0                   0                          8.49E-41                    26.0      2.98       2.0          0         3.45      76.0             5.73              4.0                 0                5.76
GC exons [%]                                0.00186             0                          0                           69.2      54.76      54.7         30.6      5.86      74.3             49.41             48.8                30.6             3.97
GC introns [%]                              3.39E-184           0                          0                           71.8      36.71      42.8         0         20.72     71.8             35.43             38.8                0                13.76
GC gene [%]                                 0.07164             0                          0                           67.2      51.88      51.8         8.0       6.91      67.2             45.28             44.2                8.0              4.73
GC CDS total [%]                            7.51E-03            0                          0                           67.2      55.46      55.6         30.5      5.68      67.2             49.69             49.0                30.5             3.89
Gene model EST support [%]                  0                   0                          1.68E-07                    647.0     31.52      7.5          0         70.52     1,042.0          12.67             5.0                 1                32.02
Gene model cDNA support [%]                 0                   0                          6.47E-03                    4.0       0.28       0            0         0.64      4.0              0.18              0                   0                0.43
Gene model GenPept best HSP length [bp]     3.04E-43            0                          6.51E-05                    1,330.0   329.10     273.0        50.0      214.45    4,943.0          380.52            315.0               80.0             298.51
Gene model GenPept best HSP identity [%]    0                   0                          0                           100.0%    72.1%      75.0%        32.3%     17.8%     100.0%           58.1%             56.1%               35.0%            14.9%
TIGR and plantrep HSP length [bp]           0                   0                          0                           69.0      0.71       0            0         6.21      79.0             0.04              0                   0                1.75
TIGR and plantrep HSP identity [%]          0                   0                          0                           100.0%    0.8%       0.0%         0.0%      7.2%      34.8%            0.0%              0.0%                0.0%             0.8%
The above table compares 18 features of tandemly arrayed genes (TAGs) with those of non-TAG genes (gene models). First, normality was
tested for the distribution of each feature using the Pearson chi-square test for normality. None of the features were distributed normally. Thus,
biased features between the two populations were compared using the Wilcoxon rank sum test (less; more). In addition, an overview of the
distributions is given showing minimal (min), maximal (max), median, average (mean) values and the standard deviation (σ) for both TAGs and
non-TAG gene models.
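For readers unfamiliar with the Wilcoxon rank sum test used for Table S11, the following is a minimal sketch of the test statistic and a two-sided p-value from the normal approximation. It is not the software the authors used (the caption indicates one-sided "less; more" alternatives); the function name `rank_sum_test`, the tie handling by average ranks, the tie-corrected variance and the absence of a continuity correction are choices made for this illustration. The sketch fails with a division by zero if every value in both samples is identical.

```python
import math

def rank_sum_test(x, y):
    """Wilcoxon rank sum (Mann-Whitney) test, normal approximation.

    Returns (U statistic for x, two-sided p-value). Ties receive average
    ranks and the variance is tie-corrected.
    """
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                       # pooled[i:j] share one value
        avg_rank = (i + 1 + j) / 2       # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u, p_two_sided

# Hypothetical toy samples: every value of the first is below the second.
toy_u, toy_p = rank_sum_test([1, 2, 3], [4, 5, 6])
```

With samples as large as the gene sets compared here, the normal approximation is the usual choice; for very small samples an exact test would be preferable.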
Table S12:
Subfamily
&
&
Type I and type II MADS-box and MADS-like genes in P. patens
(&
.
Genomic locus
(
#box)
Gene Name
(&
Scaffold
Start
End
Strand
$$ )
scaffold_118
1,026,583
1,026,404
+
$$ *
scaffold_55
1,832,462
1,832,283
+
348,851
349,030
)
(&&
$
)
scaffold_267
(&&
$
+
scaffold_171
406,784
406,605
+
(&&
$$ &,
scaffold_26
773,307
773,486
)
(&&
$$ &-
scaffold_209
758,925
758,746
+
(&.
$
*
scaffold_118
802,139
802,318
)
(&.
$
/
scaffold_55
1,740,464
1,740,285
+
(&.
$$ /
scaffold_34
1,943,470
1,943,291
+
(&.
$$ 0
scaffold_163
560,281
560,460
)
(&.
$$ -
scaffold_8
781,587
781,766
)
(&.
$$ 1
scaffold_313
148,169
147,990
+
(&.
$$
,
scaffold_34
1,967,363
1,967,179
+
(&.
$$
2
scaffold_8
789,036
789,215
)
(&.
$$
3
scaffold_55
1,750,072
1,749,893
+
(&.
$$
)4
scaffold_90
799,382
799,561
)
(&.
$$
))
scaffold_163
554,447
554,626
)
(&.
$$
)*
scaffold_273
362,369
362,548
)
Type I
$$
)
scaffold_68
1,691,186
1,691,365
)
Type I
$$
*
scaffold_81
1,205,177
1,204,998
+
Type I
$$
/
scaffold_88
1,179,645
1,179,824
)
Type I
$$
0
scaffold_198
705,696
705,517
+
Type I
$$
,
scaffold_198
708,785
708,964
)
#like
$$
)
scaffold_15
1,752,266
1,752,439
)
#like
$$
*
scaffold_37
2,364,940
2,365,119
)
#like
$$
/
scaffold_122
861,365
861,186
+
Loci of MADS-box domains in the P. patens genome v1.1
Table S13:
WRKY transcription factor gene families
A comparison of the WRKY transcription factor gene families from P. patens with those of [species names lost
in extraction]. The total number of genes for each subfamily is shown. * indicates that the members of the
subfamily form a distinct subgroup in a combined phylogenetic tree.
Table S14:
Inventory of ABC transporter genes in P. patens
Columns: ABC subfamily; Gene name; ABC subfamily group (1); Accession number; EST support (2); TAIR locus of closest homologue
PpABCA1
AOH
Phypa_221752
yes
AT2G41700
PpABCA2
ATH
Phypa_190702
yes
AT3G47730
PpABCA3
ATH
Phypa_190218
yes
AT3G47780
PpABCA4
ATH
Phypa_180906
yes
AT3G47790
PpABCA5
ATH
Phypa_145836
yes
AT3G47790
PpABCA6
ATH
Phypa_147779
no
AT3G47780
PpABCA7
AOH
Phypa_234064
no
AT2G41700
PpABCB1
LLP
Phypa_115784
yes
At5G03910
PpABCB3
TAP
Phypa_129034
yes
AT5G39040
PpABCB4
TAP
Phypa_174637
yes
AT5G39040
PpABCB5
TAP
Phypa_224391
yes
AT1G70610
PpABCB6
TAP
Phypa_224785
yes
AT1G70610
PpABCB7
TAP
Phypa_63650
yes
AT5G39040
PpABCB8
TAP
Phypa_193090
yes
AT4G25450
PpABCB9
ATM
Phypa_108321
yes
AT5G58270
PpABCB10
ATM
Phypa_225750
yes
AT5G58270
PpABCB11
MDR
Phypa_199955
yes
AT3G28345
PpABCB12
MDR
Phypa_198750
yes
AT3G28345
PpABCB13
MDR
Phypa_227047
yes
AT2G47000
PpABCB14
MDR
Phypa_59717
yes
AT1G02520
PpABCB15
MDR
Phypa_110943
yes
AT3G28860
PpABCB16
MDR
Phypa_170613
yes
AT3G28860
PpABCB18
MDR
Phypa_56126
no
AT3G28860
PpABCB20
MDR
Phypa_119621
no
AT2G39480
C
D
F
G
PpABCB22
LLP
Phypa_8856
no
AT3G28860
PpABCB23
ATM
Phypa_91386
no
AT5G58270
PpABCB24
MDR
Phypa_140970
PpABCC1
MRP
Phypa_135574
yes
AT2G07680
PpABCC2
MRP
Phypa_194836
yes
AT2G34660
PpABCC3
MRP
Phypa_199102
yes
AT2G34660
PpABCC4
MRP
Phypa_216010
yes
AT3G62700
PpABCC5
MRP
Phypa_187434
yes
AT3G62700
PpABCC6
MRP
Phypa_137284
yes
AT3G62700
PpABCC7
MRP
Phypa_224600
yes
AT3G21250
PpABCC8
MRP
Phypa_167276
yes
AT1G04120
AT3G28860
PpABCC9
MRP
Phypa_221970
yes
AT1G04120
PpABCC10
half MRP
Phypa_153801
yes
AT1G30410
PpABCC11
MRP
Phypa_145373
no
AT2G34660
PpABCC12
MRP
Phypa_61991
no
AT3G59140
PpABCC13
MRP
Phypa_117638
no
AT3G21250
PpABCC15
MRP
Phypa_101994
no
AT1G04120
PpABCD1
PMP
Phypa_125471
yes
AT4G39850
PpABCD2
PMP
Phypa_134601
yes
AT1G54350
PpABCD3
PMP
Phypa_207071
yes
AT1G54350
PpABCD4
PMP
Phypa_130679
yes
AT1G54350
PpABCD5
double PMP
Phypa_218012
yes
AT4G39850
PpABCD7
PMP
Phypa_144681
no
AT1G54350
PpABCF1
GCN
Phypa_208576
yes
AT1G64550
PpABCF2
GCN
Phypa_223577
yes
AT5G60790
PpABCF3
GCN
Phypa_192602
yes
AT5G60790
PpABCF4
GCN
Phypa_161003
yes
AT5G60790
PpABCF5
GCN
Phypa_185776
yes
AT3G54540
PpABCF6
GCN
Phypa_231060
yes
AT3G54540
PpABCF7
GCN
Phypa_30640
yes
AT5G64840
PpABCF8
GCN
Phypa_201003
yes
AT5G64840
PpABCF10
GCN
Phypa_107004
yes
AT5G64840
PpABCG1
WBC
Phypa_112649
yes
AT5G60740
PpABCG2
WBC
Phypa_147149
yes
AT2G01320
PpABCG3
WBC
Phypa_196641
yes
AT4G27420
PpABCG4
WBC
Phypa_127566
yes
AT5G06530
PpABCG5
WBC
Phypa_197808
yes
AT2G13610
PpABCG6
WBC
Phypa_151127
yes
AT1G17840
PpABCG7
WBC
Phypa_59855
yes
none
PpABCG8
WBC
Phypa_97018
yes
AT1G17840
I3
PpABCG9
WBC
Phypa_11555
yes
AT5G13580
PpABCG10
WBC
Phypa_128675
yes
AT3G53510
PpABCG11
WBC
Phypa_41420
yes
AT3G53510
PpABCG13
WBC
Phypa_153252
yes
AT1G53270
PpABCG14
WBC
Phypa_215170
yes
AT1G53270
PpABCG15
PDR
Phypa_175287
yes
AT2G29940
PpABCG16
PDR
Phypa_128826
yes
AT1G59870
PpABCG17
PDR
Phypa_176017
yes
AT1G15210
PpABCG18
PDR
Phypa_121512
yes
AT1G15210
PpABCG19
PDR
Phypa_140793
yes
AT1G15210
PpABCG20
PDR
Phypa_210034
yes
AT1G59870
PpABCG21
PDR
Phypa_192434
yes
AT1G59870
PpABCG22
PDR
Phypa_226738
yes
AT1G66950
PpABCG23
PDR
Phypa_171206
yes
AT3G16340
PpABCG24
WBC
Phypa_129635
no
AT5G60740
PpABCG25
WBC
Phypa_140499
no
AT5G60740
PpABCG26
PDR
Phypa_102109
no
AT2G29940
PpABCG27
PDR
Phypa_116286
no
AT1G15210
PpABCG28
WBC
Phypa_151478
no
AT1G17840
PpABCG29
WBC
Phypa_131586
no
AT2G39350
PpABCG30
WBC
Phypa_41350
no
AT3G53510
PpABCG31
WBC
Phypa_135027
no
AT2G13610
PpABCG32
PDR
Phypa_112247
no
AT2G29940
PpABCG33
PDR
Phypa_118223
no
AT1G59870
PpABCG34
PDR
Phypa_139762
no
AT1G59870
PpABCG35
PDR
Phypa_128793
no
AT1G59870
PpABCG36
WBC
Phypa_131592
no
AT1G17840
PpABCG37
WBC
Phypa_146773
no
AT1G17840
PpABCG38
WBC
Phypa_71431
no
AT1G17840
PpABCG39
WBC
Phypa_114177
no
AT5G13580
PpABCG40
WBC
Phypa_134830
no
AT5G13580
PpABCG41
WBC
Phypa_140592
no
AT4G27420
AT5G46540
PpABCI1
NO
Phypa_134304
yes
PpABCI2
MKL
Phypa_116997
yes
PpABCI3
MKL
Phypa_180730
yes
AT1G65410
PpABCI4
ADT
Phypa_179405
yes
AT1G03905
PpABCI5
CCM
Phypa_116239
yes
AT1G63270
PpABCI6
O4
CBY
Phypa_17451
yes
PpABCI7
CBY
Phypa_149024
yes
PpABCI8
ABCX
Phypa_106270
yes
AT4G33460
AT3G10670
PpABCI9
ADT
Phypa_218855
yes
AT5G44110
PpABCI10
ABCX
Phypa_3208
yes
AT1G32500
PpABCI11
ABCX
Phypa_121886
yes
AT4G04770
PpABCI12
CBY
Phypa_203642
yes
AT3G21580
PpABCI13
CCM
Phypa_146726
no
AT2G07681
PpABCI14
MKL
Phypa_127149
yes
AT1G19800
PpABCI15
ABCX
Phypa_111022
yes
AT4G04770
PpABCI16
NO
Phypa_157748
no
AT1G67940
Phypa_158315
no
AT1G02520
Phypa_235054
no
AT5G61700
Phypa_158388
no
AT1G28010
ATM)like
fragment
ATH)like
fragment
MDR)like
fragment
PpABCB17
PpABCA8
PpABCB25
Inventory of ABC transporters in the P. patens v1.1 genome. Footnote annotations:
1 The ABC transporter subfamilies are defined in table S15.
2 On comparison with EST collection as of October 2006.
3 Components of ABC transporters with homology to prokaryotic ABC proteins.
4 Includes fragments of ABCs which align with main subfamilies.
Table S15:
ABC subfamily group domain structure
Subfamily    Group       Domain structure
A            AOH         TMD-NBD-TMD-NBD
A            ATH         TMD-NBD
B            MDR(PGP)    TMD-NBD-TMD-NBD
B            ATM(HMT)    TMD-NBD
B            TAP         TMD-NBD
B            LLP         TMD-NBD
C            MRP         TMD-NBD-TMD-NBD
D            PMP         TMD-NBD-TMD-NBD
F            GCN         NBD-NBD
G            WBC         NBD-TMD
G            PDR         NBD-TMD-NBD-TMD
Domain structure of the ABC subfamily groups (TMD = transmembrane domain; NBD = nucleotide binding
domain).
Table S16:
Full names of the chlorophyll and carotenoid biosynthetic enzymes shown in Figure 4
Abbreviation    Full Name
GTS             glutamyl-tRNA synthetase
GTR             glutamyl-tRNA reductase
GSA             glutamate-1-semialdehyde aminotransferase
ALAD            5-aminolevulinic acid dehydratase
PBGD            porphobilinogen deaminase
UROS            uroporphyrinogen III synthase
UMT             uroporphyrinogen III methyltransferase
UROD            uroporphyrinogen III decarboxylase
CPX             coproporphyrinogen III oxidase
PPX             protoporphyrinogen IX oxidase
FC              ferrochelatase
CHLD            protoporphyrin IX Mg-chelatase subunit D
CHLI            protoporphyrin IX Mg-chelatase subunit I
CHLH            protoporphyrin IX Mg-chelatase subunit H
PPMT            Mg-protoporphyrin IX methyltransferase
CHL27           Mg-protoporphyrin IX monomethylester cyclase subunit 1
DCR             divinylprotochlorophyllide reductase
POR             light-dependent NADPH:protochlorophyllide oxidoreductase
CHS             chlorophyll synthase
CAO             chlorophyllide a oxygenase
GGR             geranylgeranyl reductase
DXS             1-deoxy-D-xylulose-5-phosphate synthase
DXR             1-deoxy-D-xylulose-5-phosphate reductoisomerase
CMS             4-diphosphocytidyl-2-C-methyl-D-erythritol synthase
CMK             4-diphosphocytidyl-2-C-methyl-D-erythritol kinase
MCS             2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase
HDS             1-hydroxy-2-methyl-2-(E)-butenyl-4-diphosphate synthase
IDS             isopentenyl- / dimethylallyl-diphosphate synthase
IDI             isopentenyl diphosphate isomerase
GGPS            geranylgeranyl pyrophosphate synthase
PSY             phytoene synthase
PDS             phytoene desaturase
ZDS             ζ-carotene desaturase
CRTISO          carotenoid isomerase
LCYB            lycopene β-cyclase
LCYE            lycopene ε-cyclase
CHYB            carotene β-hydroxylase (non-heme iron)
CYP97A          carotene β-hydroxylase (cytochrome P450)
CYP97C          carotene ε-hydroxylase (cytochrome P450)
ZEP             zeaxanthin epoxidase
VDE             violaxanthin de-epoxidase
Table S17. Gene families involved in auxin homeostasis and signaling
(The names of the three flowering plant species in the last three columns were lost in extraction.)
Gene family                           LCA land plants    P. patens    LCA flowering plants    [species 1]    [species 2]    [species 3]
TIR1/AFB auxin receptors              1                  4            4                       6              8              7
Auxin response factors                3                  14           ~12                     24             27             28
Aux/IAA repressors                    1                  2            7-10                    29             35             32
Auxin binding proteins                1                  1            1                       1              2              2
PIN auxin efflux carriers             1-2                3            6-9                     8              16             13
AUX1/LAX auxin influx transporters    1-3                8            3                       4              8              5
YUCCA/FLOOZY monooxygenases           1-2                6            5-7                     11             12             14
Class II GH3 IAA amidosynthetases     0                  0*           4-5                     8              9              9
IRL1/ILL IAA amidohydrolases          0                  0*           4-6                     7              11             9
Small Auxin-Up RNA (SAUR)             2-3                18           ~20                     76             102            56
Total auxin-related genes                                55                                   174            230            175
Total protein coding loci                                39,796                               26,751         45,555         42,653
Proportion (Auxin signaling)                             0.14%                                0.65%          0.50%          0.41%
The numbers of genes in the ancestral land plant refer to the last common ancestor (LCA) of P. patens
and flowering plants, the ancestral flowering plant LCA to those of monocots and eudicots. These
numbers were estimated from the topologies of RAxML-inferred phylogenetic trees (St 25, 33_A/B, 41,
45, 71, 73, 77, 85, 88, and 89). *Similar P. patens proteins do not group within or directly sister to the
flowering plant genes implicated in auxin homeostasis.
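The "Proportion (Auxin signaling)" row of Table S17 is the total number of auxin-related genes divided by the total number of protein coding loci. The short sketch below recomputes it from the table's totals; the function name and the dictionary keys are illustrative labels only (the three flowering plant species names are not recoverable from the extracted text).

```python
# Totals copied from Table S17; keys are placeholder labels, not species names.
auxin_genes = {
    "P. patens": 55,
    "flowering plant 1": 174,
    "flowering plant 2": 230,
    "flowering plant 3": 175,
}
protein_coding_loci = {
    "P. patens": 39796,
    "flowering plant 1": 26751,
    "flowering plant 2": 45555,
    "flowering plant 3": 42653,
}

def auxin_proportion_percent(n_auxin, n_loci):
    """Share of protein coding loci that are auxin-related, in percent."""
    return 100.0 * n_auxin / n_loci

proportions = {
    name: round(auxin_proportion_percent(auxin_genes[name], protein_coding_loci[name]), 2)
    for name in auxin_genes
}
```

Rounded to two decimals, the recomputed values agree with the percentages printed in the table.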
Total
%
$%
&
5
5
&%
5
$
%
$%
A
%
Table S18:
Taxonomic profile of LHC protein families among 15 plastid-bearing organisms with sequenced nuclear genome
Other
P#value (Fisher
test)
Tailed?
Seed plant average
$%
adjusted using
seed
plant σ
Tree 58_A
0
47
23
16
24
23
14
14
0
5
6
0
0
0
0
0
172
0.004980
greater
21
42.64110
LHCI
0
13
8
7
9
8
5
5
0
0
0
0
0
0
0
0
55
0.349788
greater
8
12
Lhca1
LHCI type 1
0
3
1
1
2
1
1
1
0
0
0
0
0
0
0
0
10
0.596273
greater
1.33333
2.42265
Lhca2
LHCI type 2
0
5
3
2
3
1
2
2
0
0
0
0
0
0
0
0
18
0.700974
greater
2.66667
4.42265
Lhca3
LHCI type 3
0
4
1
1
1
1
1
1
0
0
0
0
0
0
0
0
10
0.340580
greater
1
4
Lhca4
LHCI type 4
0
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
6
1
less
1
0
Lhca5
0
1
1
1
1
3
0
0
0
0
0
0
0
0
0
0
7
1
two.sided
1
1
Lhca6
0
0
1
1
1
1
0
0
0
0
0
0
0
0
0
0
4
1
less
1
0
LHCII major
0
19
9
5
8
0
0
0
0
0
0
0
0
0
0
0
41
0.044250
greater
7.33333
16.91833
Lhcb1
LHCII type 1
0
18
5
3
4
0
0
0
0
0
0
0
0
0
0
0
30
0.011537
greater
4
17
Lhcb2
LHCII type 2
0
0
3
1
2
0
0
0
0
0
0
0
0
0
0
0
6
0.472528
less
2
1
Lhcb3
LHCII type 3
0
1
1
1
2
0
0
0
0
0
0
0
0
0
0
0
5
1
two.sided
1.33333
1.57735
0
11
6
4
7
3
3
3
0
0
0
0
0
0
0
0
37
0.296902
greater
5.66667
9.47247
Lhcb4
CP29 LHCII type 4
0
4
3
1
3
1
1
1
0
0
0
0
0
0
0
0
14
0.660229
greater
2.33333
2.84530
Lhcb5
CP26 LHCII type 5
0
4
1
1
1
1
1
0
0
0
0
0
0
0
0
0
9
0.339356
greater
1
4
Lhcb6
CP29 LHCII type 6
0
2
1
1
2
0
0
1
0
0
0
0
0
0
0
0
7
1
greater
1.33333
1.42265
Lhcb7/Lhcq
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
7
1
two.sided
1
1
Other LHCII#like
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0.466667
greater
0
2
Algal LCHPs
0
2
0
0
0
12
6
6
0
5
6
0
0
0
0
0
37
0.493684
greater
0
2
LhcbM
0
0
0
0
0
9
0
0
0
0
0
0
0
0
0
0
9
1
two.sided
0
0
Lhcx/LI818
0
2
0
0
0
3
1
1
0
5
6
0
0
0
0
0
18
0.487909
greater
0
2
LHCII minor
5
5
&
$%
%
Photoprotective LHC#like
0
30
7
9
12
17
6
8
0
0
0
0
0
7
5
4
105
0.002593
greater
9.33333
27.48339
PsbS
CP22
0
1
1
3
1
4
0
0
0
0
0
0
0
0
0
0
10
1
less
1.66667
2.15470
Lil1
ELIP
0
20
2
3
3
9
4
5
0
0
0
0
0
0
0
0
46
0.001728
greater
2.66667
19.42265
LIL2
OHP1
0
3
0
0
1
0
0
0
0
0
0
0
0
5
3
4
16
0.233613
greater
0.33333
2.42265
LIL3
LIL3
0
3
2
1
4
1
1
1
0
0
0
0
0
0
0
0
13
1
greater
2.33333
1.47247
LIL4
SEP1
)
)
)
)
)
)
)
)
)
)
)
)
)
)
)
)
)
)
)
)
)
LIL5
SEP2
0
2
1
1
1
1
0
2
0
0
0
0
0
0
0
0
8
1
greater
1
2
LIL6
OHP2
0
1
1
1
2
2
1
0
0
0
0
0
0
2
2
0
12
1
two.sided
1.33333
1.57735
Total
%
%
&%
$%
adjusted using
seed
plant σ
$
Seed plant average
5
Tailed?
$%
P#value (Fisher
test)
Other
B
Tree 58_B
The two phylogenetic trees (RAxML, based on a filtered L-INSI alignment) were manually annotated (the original accession numbers are
preserved in {brackets}, see trees: 58A/B). The groups of sequences whose taxonomic profiles are shown above are based on these annotations
and the clustering provided by the tree topology. "Other" refers to the 30 non-plastid bearing organisms, which were present in the PSI-BLAST
search space used to build the initial clusters. P-values were calculated using Fisher tests ("tailed?" shows the alternate hypothesis used for the
test; p<0.05) to compare the number of genes found in P. patens to the average gene family size in the three seed plants (ARATH, POPTR and
ORYSA). Additionally, differences between P. patens and the seed plants are shown by comparing the "seed plant average" vs. the P. patens
frequencies adjusted using the standard deviation σ of the three seed plant frequencies (phypa_adjusted>seed plant average and
phypa_adjusted<seed plant average, last two columns). For species names see table S23.
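The last two columns of Table S18 can be reproduced from the printed counts with the sketch below. The adjustment rule is inferred from the table values rather than stated in the text: the P. patens count appears to be moved toward the seed plant average by one sample standard deviation of the three seed plant counts. The function name `adjust_toward_average` is hypothetical, and the seed plant counts are taken as the three columns that follow the P. patens column.

```python
from statistics import mean, stdev

def adjust_toward_average(count, seed_counts):
    """Return (seed plant average, count shifted by one sigma toward it)."""
    average = mean(seed_counts)
    sigma = stdev(seed_counts)  # sample standard deviation (n - 1)
    if count > average:
        adjusted = count - sigma
    elif count < average:
        adjusted = count + sigma
    else:
        adjusted = float(count)
    return average, adjusted

# Counts as printed in Table S18.
lhca1_avg, lhca1_adj = adjust_toward_average(3, [1, 1, 2])       # Lhca1
lhcb2_avg, lhcb2_adj = adjust_toward_average(0, [3, 1, 2])       # Lhcb2
tree_avg, tree_adj = adjust_toward_average(47, [23, 16, 24])     # Tree 58_A
```

A family counts as expanded in P. patens only if the adjusted value still exceeds the seed plant average (and as reduced only if it still falls below it), which makes the comparison robust to the spread among the three seed plants.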
Table S19:
LHCP genes present in TAGs
Left model      Left name    Right model     Right name    Genes inbetween    TAG orientation    Locus
Phypa_144392    LHCA3        Phypa_60069     LHCA3         0                  divergent          scaffold_214:737529-744820
Phypa_228001    LHCB4        Phypa_228003    LHCB4         0                  convergent         scaffold_472:146498-150112
Phypa_220036    LHCP         Phypa_89671     LHCP          0                  divergent          scaffold_186:221732-231141
Phypa_163091    LHCP         Phypa_124625    LHCP          0                  convergent         scaffold_51:1795431-1815645
Phypa_155384    LHCP         Phypa_173457    LHCP          0                  divergent          scaffold_463:103253-127675
Phypa_52279     LHCB5        Phypa_52281     LHCB5         0                  convergent         scaffold_6:2604863-2612626
Phypa_119427    LHCB6        Phypa_56132     LHCB6         2                  divergent          scaffold_28:2016809-2046814
Phypa_149967    ELIP         Phypa_149966    ELIP          0                  divergent          scaffold_308:493010-512327
Phypa_149966    ELIP         Phypa_149976    ELIP          0                  convergent         scaffold_308:493010-512327
P. patens LHCP genes occurring in tandem arrays. The table above provides the accession and
genomic location for each LHCP gene tandem array. Additionally, the transcriptional orientation and the
number of genes lying between a TAG pair are given.
Table S20:
The entire collection of identified repeat elements, their lengths, and their family groupings
Because of its large size, the table is provided as a separate MS Excel spreadsheet file table_S20.xls.
Table S21:
Results of different LTR retrotransposon detection methods
Method    Description                     Focus                                         Full length LTRs per genome    Copia-like [%]    Gypsy-like [%]    Undefined [%]
A         LTR_par                         overlap to genes                              4,795                          2.4               45.9              51.7
B         LTR_STRUC default               comparison to other plants                    791
C1        LTR_STRUC no N-split            library compilation                           1,080                          4.4               43.1              52.5
C2        ANGELA with method C library    exhaustive annotation for further analyses    3,188                          8.7               61.0              30.3
Overview of the results of 4 different LTR retrotransposon detection methods applied to the v1.1 genome.
Table S22:
Classification of novel P. patens LTR retrotransposons
LTR types from hmm domains    Number    %       Average insertion age    Median insertion age    LTR type definition
Gypsy-like                    465       43.1    2.4                      1.9                     GAG-PR-RT-INT, at least RT-INT
Copia-like                    48        4.4     3.2                      3.1                     GAG-PR-INT-RT, at least INT-RT
Mixed                         241       22.3    3.1                      2.7                     too many and double domains for clear assignment
Undefined                     303       28.1    2.6                      2.4                     too few domains for clear assignment
No HMM hit                    23        2.1     2.5                      2.2                     no domains (Transposon PFAM) found
Total                         1,080     100     2.6                      2.3
The LTR transposon types were defined by the composition of their protein signatures (capsid protein
(GAG); protease (PR); reverse transcriptase (RT); integrase (INT)).
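The type definitions in Table S22 depend on the order of the RT and INT domains. The following sketch shows one way to express that rule; it is a simplified illustration, not the classification code used for the table. The function name `classify_ltr`, the input format (domain names in 5' to 3' order) and the reading of "Mixed" as "any domain occurs more than once" are assumptions.

```python
def classify_ltr(domains):
    """Classify an LTR retrotransposon from its ordered HMM domain hits.

    `domains` lists domain names (e.g. "GAG", "PR", "RT", "INT") in
    5'->3' order along the element.
    """
    if not domains:
        return "No HMM hit"
    if len(domains) != len(set(domains)):
        return "Mixed"  # assumed reading of "too many and double domains"
    if "RT" in domains and "INT" in domains:
        # Gypsy-like: RT before INT; Copia-like: INT before RT.
        if domains.index("RT") < domains.index("INT"):
            return "Gypsy-like"
        return "Copia-like"
    return "Undefined"  # too few domains for a clear assignment

# Hypothetical toy elements, one per category.
toy_calls = [
    classify_ltr(["GAG", "PR", "RT", "INT"]),
    classify_ltr(["INT", "RT"]),
    classify_ltr(["RT", "INT", "RT"]),
    classify_ltr(["GAG"]),
    classify_ltr([]),
]
```

The "at least RT-INT" and "at least INT-RT" wording in the table is what allows elements with missing GAG or PR domains to be classified from the relative order of the two remaining domains.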
Table S23:
Completely sequenced genomes used as a database for the phylogenies
5#letter
code
# protein
sequences
Species name & strain
Plants
ORYSA
.#*4
9
ARATH
-#
'
#
8
'
66,710
&
,
POPTR
# ,
PHYPA
,*
:#
&
CHLRE
",
OSTLU
. #
OSTTA
. #
CYAME
"*
&
:
#&
30,480
#
58,036
#
#
35,938
Algae
&%
%
%
*&
, #&
15,143
#
7,618
#
7,725
%
,*4
GUITH
6
#
#
,
# ,
&
5,014
*
485
%
THAPS
+,
PHATR
,
#
&
&
*
#
11,397
#
10,025
Protists
#
ENTHI
;
,
*
# 3)0 ()
19,547
"
PLAFA
&
%
#
10,261
(
TRYCR
+#*
# 4
# "1! #
#
19,642
&%
MONBR
)
NAEGR
9
# '
9,196
!
#
#
#
15,753
Sum
322,970
Metazoa
FUGRU
<
XENTR
=
#
#
26,721
#
CAEEL
"
#,
DROME
$#
,
HOMSA
3
27,916
&
23,220
#
19,778
34,180
Fungi
SACCE
SCHPO
, #
,4
*
, #
# '
5,784
5,045
*
6
PHACH
,
#
PHYBL
,*
DICDI
$ *
,
,#*
#
10,048
7
*
>
14,792
Mycetozoa
&
&
13,377
Sum
180,861
Archaea
&
#
%
AERPE
- #
PYRAE
*#
SULSO
%
*#
!
# 6
!
!
1,841
#
%
,
#
! # ()
!
2,605
2,977
%
METAC
) ,
PYRAB
*#
THEAC
+, #
NANEQ
9
CAUCR
"
#
!
!
*
!:;
' #
!
&
!" -
4,540
1,898
,
!$ )
1,482
%
# ,
! /
!?
0)
536
Eubacteria
α+$
#! #
!"
3,737
ERYLI
;#* ,#
AGRTU
- #
,
ANAVA
-
NOSSP
9
#!
#
#
!3+""
!
%
!"
3,011
!@A
5,402
&
SYNSP
!' #
!
*
,
!-+""
5,661
! ""
*
6,130
!
! ""
3,569
& #
!"0
4,066
8
BACHA
!,
BACSU
!
CLOPE
"
ESCCO
;
!
#&
!
#%#
!
! #
4,105
!-+""
2,876
γ +$
, # , !
&
PSESP
XANOO
!
= ,
""
!?
! *#
4,243
! '! ,
-
5,170
! #*4
! ' ! #*4
!?4,080
Sum
67,929
Total
571,760
Completely sequenced genomes comprising the search space for the gene family tree reconstruction.