Bioinformatics and Its Applications in Plant Biology: Seung Yon Rhee, Julie Dickerson, and Dong Xu
1 Department of Plant Biology, Carnegie Institution, Stanford, California 94305; email: [email protected]
2 Baker Center for Computational Biology, Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50011-3060; email: [email protected]
3 Digital Biology Laboratory, Computer Science Department and Life Sciences Center, University of Missouri-Columbia, Columbia, Missouri 65211-2060; email: [email protected]
ANRV274-PP57-13 ARI 5 April 2006 19:12
COMPUTATIONAL PROTEOMICS
Electrophoresis Analysis
Protein Identification Through

tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, represent, describe, store, analyze, or visualize such data.”
by University of Delhi on 01/08/09. For personal use only.
Annu. Rev. Plant Biol. 2006.57:335-360. Downloaded from arjournals.annualreviews.org

ranging from database development to curation. In section seven, we discuss a few emerging research topics in bioinformatics.

SEQUENCE ANALYSIS

Biological sequences such as DNA, RNA, and protein sequences are the most fundamental objects of a biological system at the molecular level. Several genomes have been sequenced to a high quality in plants, including Arabidopsis thaliana (130) and rice (52, 147, 148). Draft genome sequences are available for poplar

truncatula, sorghum (11) and close relatives of Arabidopsis thaliana. Researchers also generated expressed sequence tags (ESTs) from many plants including lotus, beet, soybean, cotton, wheat, and sorghum (see http://www.ncbi.nlm.nih.gov/dbEST/).

Genome Sequencing

Advances in sequencing technologies provide opportunities in bioinformatics for managing, processing, and analyzing the sequences. Shotgun sequencing is currently the most common method in genome sequencing: pieces of DNA are sheared randomly, cloned, and sequenced in parallel. Software has been developed to piece together the random, overlapping segments that are sequenced separately into a coherent and accurate contiguous sequence (93). Numerous software packages exist for sequence assembly (51), including Phred/Phrap/Consed (http://www.phrap.org), Arachne (http://www.broad.mit.edu/wga/), and GAP4 (http://staden.sourceforge.net/overview.html). TIGR developed a modular, open-source package called AMOS (http://www.tigr.org/software/AMOS/), which can be used for comparative genome assembly (102). Current limitations in shotgun sequencing and assembly software remain largely in the assembly of highly repetitive sequences, although the cost of sequencing is another limitation. Recently developed methods continue to reduce the cost of sequencing, including sequencing by differential hybridization of oligonucleotide probes (48, 62, 101), polymorphism ratio sequencing (16), four-color DNA sequencing by synthesis on a chip (114), and the “454 method” based on microfabricated high-density picoliter reactors (87). Each of these sequencing technologies poses significant analytical challenges for bioinformatics in terms of experimental design, data interpre-

Gene Finding and Genome Annotation

Gene finding refers to prediction of introns and exons in a segment of DNA sequence. Dozens of computer programs for identifying protein-coding genes are available (150). Some of the well-known ones include Genscan (http://genes.mit.edu/GENSCAN.html), GeneMarkHMM (http://opal.biology.gatech.edu/GeneMark/), GRAIL (http://compbio.ornl.gov/Grail-1.3/), Genie (http://www.fruitfly.org/seq_tools/genie.html), and Glimmer (http://www.tigr.org/softlab/glimmer). Several new gene-finding tools are tailored for applications to plant genomic sequences (112).

Ab initio gene prediction remains a challenging problem, especially for large eukaryotic genomes. For a typical Arabidopsis thaliana gene with five exons, at least one exon is expected to have at least one of its borders predicted incorrectly by the ab initio approach (19). Transcript evidence from full-length cDNA or EST sequences or similarity to potential protein homologs can significantly reduce uncertainty of gene identification (154). Such methods are widely used in “structural annotation” of genomes, which refers to the identification of features such as genes and transposons in a genomic sequence using ab initio algorithms and other
itive sequences exist in almost any genome, and are abundant in most plant genomes (69). The identification and characterization of repeats is crucial to shed light on the evolution, function and organization of genomes and to enable filtering for many types of homology searches. A small library of plant-specific repeats can be found at ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats/; this is likely to grow substantially as more genomes are sequenced. One can use RepeatMasker (http://www.repeatmasker.org/) to search repetitive sequences in a genome. Working from a library of known repeats, RepeatMasker is built upon BLAST and can screen DNA sequences for interspersed repeats and low complexity regions. Repeats with poorly conserved patterns or short sequences are hard to identify using RepeatMasker due to the limitations of BLAST. To identify novel repeats, various algorithms were developed. Some widely used tools include RepeatFinder (http://ser-loopp.tc.cornell.edu/cbsu/repeatfinder.htm) and RECON (http://www.genetics.wustl.edu/eddy/recon/). However, due to the high computational complexity of the problem, none of the programs can guarantee finding all possible repeats as all the programs use some approximations in computation, which will miss some repeats with less distinctive

that it is based on sequence similarity between two strings of text, which may not correspond to homology (relatedness to a common ancestor in evolution), especially when the confidence level of a comparison result is low. Also, homology may not mean conservation in function.

Methods in sequence comparison can be largely grouped into pair-wise, sequence-profile, and profile-profile comparison. For pair-wise sequence comparison, FASTA (http://fasta.bioch.virginia.edu/) and BLAST (http://www.ncbi.nlm.nih.gov/blast/) are popular. To assess the confidence level for an alignment to represent homologous relationship, a statistical measure (Expectation Value) was integrated into pair-wise sequence alignments (71). Remote homologous relationships are often missed by pair-wise sequence alignment due to its insensitivity. Sequence-profile alignment is more sensitive for detecting remote homologs. A protein sequence profile is generated by multiple sequence alignment of a group of closely related proteins. A multiple sequence alignment builds correspondence among residues across all of the sequences simultaneously, where aligned positions in different sequences probably show functional and/or structural relationship. A sequence profile is calculated using the probability of
occurrence for each amino acid at each alignment position. PSI-BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) is a popular example of a sequence-profile alignment tool. Some other sequence-profile comparison methods are slower but even more accurate than PSI-BLAST, including HMMER (http://hmmer.wustl.edu/), SAM (http://www.cse.ucsc.edu/research/compbio/sam.html), and META-MEME (http://metameme.sdsc.edu/). A profile-profile alignment is more sensitive than the sequence-profile-based search programs in detecting

family databases are built from multiple sources. InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates domain information from multiple protein domain databases. Using protein family information to predict gene function is more reliable than using sequence comparison alone. On the other hand, very closely related proteins may not guarantee a functional relationship (97). One can use structure- or function-based protein families (when available) to complement sequence-based family for additional function information.
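
The idea of a sequence profile described in this section, the probability of occurrence of each amino acid at each alignment position, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sequence_profile`, the toy alignment, and the pseudocount of 1 are hypothetical choices made here, and the sketch does not reproduce how PSI-BLAST, HMMER, or SAM build or score their models.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one {residue: probability} dict per alignment column.

    `alignment` is a list of equal-length aligned sequences; gap
    characters (anything outside `alphabet`) are ignored when counting.
    The pseudocount (an assumption of this sketch) keeps unseen
    residues from getting probability zero.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in alphabet}
        )
    return profile


# Hypothetical toy alignment of three closely related sequences.
toy_alignment = ["MKV-L", "MRVAL", "MKIAL"]
profile = sequence_profile(toy_alignment)

# Column 0 is fully conserved (M), so M gets the highest probability there.
best_residue = max(profile[0], key=profile[0].get)
```

A query sequence could then be scored against such a profile column by column, which is the intuition behind why profile-based searches can detect more remote homologs than a single pair-wise comparison.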
Observing the patterns of transcriptional activity that occur under different conditions such as genotypes or time courses reveals genes that have highly correlated patterns of expression. However, correlation cannot distinguish between genes that are under common regulatory control and those whose expression patterns just happen to correlate. Recent efforts in microarray analysis have focused on analysis of microarray data across experiments (91). A study by the Toxicogenomics research consortium indicates that “microarray results can be comparable across multiple laboratories, especially when a common platform and set of procedures are used” (7). Meta-analysis can investigate the effect of the same treatment across different studies to arrive at a single estimate of the true effect of the treatment (106, 123).

Tiling Arrays

Typical microarrays sample known and predicted genes. Tiling arrays cover the genome at regular intervals to measure transcription without bias toward known or predicted gene structures; they also support discovery of polymorphisms, analysis of alternative splicing, and identification of transcription factor-binding sites (90). Whole-genome arrays (WGAs) cover the entire genome with overlapping probes or probes with regular gaps. The WGA ensures that the experimental results do not depend on the level of current genome annotation and allows discovery of new transcripts and unusual forms of transcription. In plants, similar studies have been performed for the entire Arabidopsis genome (127, 143) and parts of the rice genome (70, 79). These studies identified thousands of novel transcription units including genes within the centromeres, substantial antisense gene transcription, and transcription activity in intergenic regions. Tiling array data may also be used to validate predicted intron/exon boundaries (132).

Further work is needed to establish the best practices for determining when transcription has occurred and how to normalize array data across the different chips. Visualization of the output from tiling arrays requires viewing the probe sequences on the array together with the sequence assembly and the probe expression data. The Arabidopsis Tiling Array Transcriptome Express Tool (also known as ChipViewer) (http://signal.salk.edu/cgi-bin/atta) displays information about what type of transcription occurred along the Arabidopsis genome (143). Another tool is the Integrated Genome Browser (IGB) from Affymetrix, a Java program for exploring genomes and combining annotations from multiple data sources. Another option for visualizing such data is provided by collaborations such as those between Gramene (137) and PLEXdb (116), which allow users to overlay probe array information onto a comparative sequence viewer.

The major limitations of WGAs include the requirement of a sequenced genome, the large number of chips required for complete genome coverage, and analysis of recently duplicated (and thus highly homologous) genes.

Regulatory Sequence Analysis

Interpreting the results of microarray experiments involves discovering why genes with similar expression profiles behave in a coordinated fashion. Regulatory sequence analysis approaches this question by extracting motifs that are shared between the upstream sequences of these genes (134). Comparative genomics studies of conserved noncoding sequences (CNSs) may also help to find key motifs (56, 67). There are several methods to search for over-represented motifs upstream of coregulated genes. Roughly, they can be categorized into two classes: oligonucleotide frequency-based (68, 134) and probabilistic sequence-based models (76, 85, 108).

The oligonucleotide frequency-based method calculates the statistical significance of a site based on oligonucleotide frequency tables observed in all noncoding regions of the specific organism’s genome. Usually, the
sequences in the genomes (134).

For the probabilistic-based methods, the motif is represented as a position probability matrix, where the motifs are assumed to be hidden in the noisy background sequences. One of the strengths of probabilistic-based methods is the ability to identify motifs with complex patterns. Many potential motifs can be identified; however, it can be difficult to separate unique motifs from this large pool of potential solutions. Probabilistic-based methods also tend to be computationally intense as they must be run multiple times to get an optimal solution. AlignACE, Aligns Nucleic Acid Conserved Elements (http://atlas.med.harvard.edu/), is a popular motif finding tool that was first developed for yeast but has been expanded to other species (107).

COMPUTATIONAL PROTEOMICS

Proteomics is a leading technology for the qualitative and quantitative characterization of proteins and their interactions on a genome scale. The objectives of proteomics include large-scale identification and quantification of all protein types in a cell or tissue, analysis of post-translational modification and association with other proteins, and characterization of protein activities and structures. Application of proteomics in plants is still in its initial phase, mostly in protein identification (24, 96). Other aspects of proteomics (reviewed in 152), such as identification and prediction

teins under different conditions (54). Several bioinformatics tools have been developed for two-dimensional (2D) electrophoresis analysis (86). SWISS-2DPAGE can locate the proteins on the 2D PAGE maps from Swiss-Prot (http://au.expasy.org/ch2d/). Melanie (http://au.expasy.org/melanie/) can analyze, annotate, and query complex 2D gel samples. Flicker (http://open2dprot.sourceforge.net/Flicker/) is an open-source stand-alone program for visually comparing 2D gel images. PDQuest (http://www.proteomeworks.bio-rad.com) is a popular commercial software package for comparing 2D gel images. Some software platforms handle related data storage and management, including PEDRo (http://pedro.man.ac.uk/), a software package for modeling, capturing, and disseminating 2D gel data and other proteomics experimental data. The main limitations of electrophoresis analysis include limited ability to identify proteins and low accuracy in detecting protein abundance.

Protein Identification Through Mass Spectrometry

After protein separation using 2D electrophoresis or liquid chromatography and protein digestion using an enzyme (trypsin, pepsin, glu-C, etc.), proteins are typically identified using mass spectrometry (MS) (1). In contrast to other protein identification techniques, such as Edman degradation microsequencing, MS provides a high-throughput
approach for large-scale protein identification. The data generated from mass spectrometers are often complicated and computational analyses are critical in interpreting the data for protein identification (17, 55). A major limitation in MS protein identification is the lack of open-source software. Most widely used tools are expensive commercial packages. In addition, current statistical models for matches between MS spectra and protein sequences are generally oversimplified. Hence, the confidence assessments for computational protein identification results are often unreli-

In this case, additional MS/MS experiments are needed to identify the proteins.

Tandem mass spectrometry. MS/MS further breaks each digested peptide into smaller fragments, whose spectra provide effective signatures of individual amino acids in the peptide for protein identification. Many tools have been developed for MS/MS-based peptide/protein identification, the most popular ones being SEQUEST (http://fields.scripps.edu/sequest/) and Mascot
and combine information about connectivity, concentration balances, flux balances, metabolic control, and pathway optimization. Ultimately, one may integrate all of the information and perform analysis and simulation in a cellular modeling environment like E-Cell (http://www.e-cell.org/) or CellDesigner (http://www.systems-biology.org).

ONTOLOGIES

The data that are generated and analyzed as described in the previous sections need to be compared with the existing knowledge in the field in order to place the data in a biologically meaningful context and derive hypotheses. To do this efficiently, data and knowledge need

Types of Bio-Ontologies

A growing number of shared ontologies are being built and used in biology. Examples include ontologies for describing gene and protein function (59), cell types (9), anatomies and developmental stages of organisms (50, 135, 144), microarray experiments (126), and metabolic pathways (84, 151). A list of open-source ontologies used in biology can be found on the Open Biological Ontologies Web site (http://obo.sourceforge.net/). Many ontologies on this site are under development and are subject to frequent change. The Gene Ontology (GO) (www.geneontology.org) is an example of a bio-ontology that has garnered community acceptance. It is a set of more than 16,000 controlled vocabulary terms for the biological domains of “molecular function,” “subcellular compartment,” and “biological process.” GO is organized as a directed acyclic graph, which is a type of hierarchy that allows a term to exist as a specific concept belonging to more than one general term. Other examples of ontologies currently in development are the Sequence Ontology (SO) project (40) and the Plant Ontology (PO) project (www.plantontology.org). The SO project aims to explicitly define all the terms needed to describe features on a nucleotide sequence, which can be used for genome sequence annotation for any organism. The PO project aims to develop shared vocabularies to describe anatomical structures for flowering plants to depict gene expression patterns and plant phenotypes.

Applications of Ontologies

Ontologies are used mainly to annotate data such as sequences, gene expression clusters, experiments, and strains. Ontologies that have such annotations to data in databases can be used in numerous ways, including connecting different databases, refining searching, providing a framework for interpreting the results of functional genomics experiments, and inferring knowledge (8, 10, 47). For example, one can ask which functions and processes are statistically significantly over-represented in an expression cluster of interest compared to the functions and processes carried out by all of the genes from a gene expression array. Because GO is one of the more well-established ontologies, this section focuses on GO to illustrate applications
of ontologies in biology. Ontologies have been used by many model organism databases to annotate genes and gene products (http://www.geneontology.org/GO.current.annotations.shtml, http://www.geneontology.org/GO.biblio.shtml#annots). Function annotations of genes using GO have been used mainly in two ways: predicting protein functions, processes, and localization patterns from various data sources (http://www.geneontology.org/GO.biblio.shtml#predictions) and providing a biological framework or benchmark set for inter-

ies that attempt to define biological processes and functions from gene expression data using the GO annotations should ensure that no annotation with the inferred from expression pattern (IEP) evidence code is used. The other caveat is that annotations to GO are not equivalently represented throughout GO. When looking for statistical over-representation of GO terms in genes of an expression cluster, there is low statistical power for detecting deviations from expectation for terms that are annotated with a small number of genes (74).
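
The over-representation question raised in this section (is a GO term found in an expression cluster more often than expected given all genes on the array?) is commonly framed as a one-sided hypergeometric test. The sketch below is an illustration only, assuming that framing; the function name `hypergeom_enrichment_p` and all gene counts are hypothetical, and the review does not prescribe this particular test.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, cluster, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: number of genes on the array
    annotated:  genes on the array annotated with the GO term
    cluster:    genes in the expression cluster of interest
    overlap:    cluster genes annotated with the GO term
    """
    upper = min(annotated, cluster)
    total = comb(population, cluster)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: a 1000-gene array and a 10-gene cluster.
# A term annotated to 50 genes, 5 of which fall in the cluster:
p_common_term = hypergeom_enrichment_p(1000, 50, 10, 5)

# A term annotated to a single gene that happens to be in the cluster.
# Even this best case cannot give a p-value below 10/1000.
p_rare_term = hypergeom_enrichment_p(1000, 1, 10, 1)
```

The second call illustrates the caveat about statistical power: for a term annotated with very few genes, even the most extreme possible observation yields a p-value that is unlikely to survive correction for the many GO terms tested.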
the first place researchers go to find information. Databases that are available via the Web also became an indispensable tool for biological research. In this section, we describe types and examples of biological databases, how these databases are built and accessed, how data among databases are exchanged, and current challenges and opportunities in biological database development and maintenance.

Types of Biological Databases

Three types of biological databases have been established and are being developed: large-scale public repositories, community-specific databases, and project-specific databases. Nucleic Acids Research (http://nar.oxfordjournals.org/) publishes a database issue in January of every year. Recently, Plant Physiology started publishing articles describing databases (105). Large-scale public repositories are usually developed and maintained by government agencies or international consortia and are places for long-term data storage. Examples include GenBank for sequences (139), UniProt (113) for protein information, Protein Data Bank (32) for protein structure information, and ArrayExpress (100) and Gene Expression Omnibus (GEO) (38) for microarray data. There are a number of community-specific databases, which typically contain information curated with high standards and address the needs of a particular community of researchers. A

(e.g., http://www.phytome.org and Reference 64). The third category of databases includes smaller-scale, and often short-lived, databases that are developed for project data management during the funding period. Often these databases and Web resources are not maintained beyond the funding period of the project and currently there is no standard way of depositing or archiving these databases after the funding period.

There are some issues in database management. First, there is a general lack of good documentation on the rationale of the design and implementation. More effort is needed to share the experiences via conferences and publications. Also, there are no accepted standards for making databases, schemas, software, and standard operating procedures available. In response to this, the National Human Genome Research Institute (NHGRI) has funded a collaborative project called the Generic Model Organism Database (http://www.gmod.org) to promote the development and sharing of software, schemas, and standard operating procedures. The project’s major aim is to build a generic organism database toolkit to allow researchers to set up a genome database “off the shelf.” Another major issue is that there is a general lack of infrastructure for supporting, managing, and using digital data archived in databases and Web sites in the long term (82). One possibility to alleviate this problem is to create a public archive of biological databases and Web sites to which finished projects
could deposit the database, software, and Web sites. There are several projects that are building digital repository systems that can be models for such a repository such as D-Space (http://dspace.org/) and the CalTech Collection of Open Digital Archives (CODA; http://library.caltech.edu/digital/). Some additional challenges in long-term archiving of data were articulated in a recent National Science Board report (http://www.nsf.gov/nsb/documents/2005/LLDDC_report.pdf).

ware, and relational database software. Due to the increasing quantity of data that need to be stored and made accessible using the Internet, relational database management software has become popular and is now the de facto standard in biology. Relational databases provide effective means of storing and retrieving large quantities of data via indexes, normalization, referential integrity, triggers, and transactions. Notable relational database software packages that are freely available and quite popular in bioinformatics are MySQL (http://www.mysql.com/) and PostgreSQL (http://www.postgresql.org/). In relational databases, data are represented as entities, attributes (properties of the entities), and relationships between the entities. This type of representation is called Entity-Relationship (ER) and database schemas are described using ER diagrams (e.g., TAIR schema at http://arabidopsis.org/search/schemas.html). Entities and attributes become tables and columns in the physical implementation of the database, respectively. Data are the values that are stored in the fields of the tables.

Although relational databases are powerful ways of storing large quantities of data, they have limitations. For example, it is not trivial to represent complex relationships between data such as signal transduction pathways. Also, it is difficult to create rich semantic relationships in relational databases to ask the database “what if” types of queries without having extensive software built on top of the database. Another limitation of relational databases is that it is very difficult, if not impossible, to preserve all of the changes that occur to attributes of entities.

Data Access and Exchange

The most direct, powerful, and flexible way of accessing data in a database is using

suited for biologists to learn without a steep learning curve. However, to use SQL, users need to know the database schema. In addition, some queries that are based on less optimized database structure could result in slow performance and can even sometimes lock the database system. In most databases, access to the data is provided via database access software and a graphical user interface (GUI) that allow searching and browsing of the data. In addition to text-based search user interfaces, more sophisticated ways of accessing data such as graphical displays and tree-based browsers are also common.

Although accessing information from a database is fairly easy if one knows which database to go to, it is not as easy to find information if one does not know which database to search. There are several ways to solve this problem such as indexing the content of database-driven pages, developing software that will connect to individual databases directly, or developing a data warehouse of many different data types or databases in one site. A relatively new method that is gaining some attention is to use a registry system where different databases that specialize in particular information can declare what data are available in their system and register methods to access their data. Users can send requests to
the registry system, which then contacts the appropriate databases to retrieve the requested data. Conceptually, this is an elegant way of integrating different databases without depending on the individual databases’ schema. However, this relies on the willingness of individual databases to participate in the registry system. This method is called Web services and has been accepted widely by the Internet industry but has not yet been commonly implemented. Projects like BioMOBY (141) and myGRID (125) are implementing this idea for biological databases, but they have not yet been widely

Data Curation

Data curation is defined as any activity devoted to selecting, organizing, assessing quality, describing, and updating data that result in enhanced quality, trustworthiness, interpretability, and longevity of the data. It is a crucial task in today’s research environment where data are being generated at an ever-increasing rate and an increasing amount of research is based on re-use of data. In general, some level of curation is done by data generators, but most curation activities are carried
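
The registry idea described in this section, in which databases declare what data they hold and register a method for retrieving it, and users query the registry rather than each database, can be sketched as follows. This is a toy, in-process illustration only: the `Registry` class, the provider names, and the gene records are all hypothetical, and the sketch does not use or imitate the actual BioMOBY or myGRID interfaces, which work over the network.

```python
class Registry:
    """Toy registry mapping a data type to the providers that serve it."""

    def __init__(self):
        self._providers = {}

    def register(self, data_type, provider_name, fetch):
        """A database declares a data type and a callable to retrieve it."""
        self._providers.setdefault(data_type, []).append((provider_name, fetch))

    def request(self, data_type, query):
        """Forward a user request to every provider of that data type."""
        results = {}
        for provider_name, fetch in self._providers.get(data_type, []):
            results[provider_name] = fetch(query)
        return results


# Two hypothetical databases, each with its own internal layout.
expression_db = {"AT1G01010": [2.3, 4.1, 0.7]}
annotation_db = {"AT1G01010": {"go_terms": ["GO:0003700"]}}

registry = Registry()
registry.register("expression", "toy_expression_db", expression_db.get)
registry.register("annotation", "toy_annotation_db", annotation_db.get)

# The user asks the registry and never needs to know either schema.
answer = registry.request("annotation", "AT1G01010")
missing = registry.request("phenotype", "AT1G01010")
```

The caller depends only on the registry interface, not on how each database stores its records, which is the property the text highlights; the cost is that a database contributes nothing unless it chooses to register.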
the culture of data and expertise sharing. The now famous Bermuda principle (http://www.gene.ucl.ac.uk/hugo/bermuda.htm) was extended to large-scale data at a recent meeting (131). In this meeting, the policy for publicly releasing large-scale data pre-publication and appropriate conduct and acknowledgment of the uses of these data by the scientific community were discussed. Clearly articulated and community-accepted policies are needed on how data from data repositories should be cited and referenced and how the generators of the data should be acknowledged. Establishing this standard should include journal publishers, database scientists, data generators, funding bodies, and representatives of the user community. Additional challenges and opportunities in database curation were recently articulated (82, 103).

EMERGING AREAS IN BIOINFORMATICS

In addition to some of the challenges and opportunities mentioned in this review, there are many exciting areas of research in bioinformatics that are emerging. In this section, we focus on a few of these areas such as text mining, systems biology, and the semantic web. Some additional emerging areas such as image analysis (117), grid computing (46, 49), directed evolution (29), rational protein design (81), microRNA-related bioinformatics (21), and modeling in epigenomics (43) are not covered due to the limitation of space.

Text Mining

The size of the biological literature is expanding at an increasing rate. The Medline 2004 database had 12.5 million entries and is expanding at a rate of 500,000 new citations each year (26). The goal of text mining is to allow researchers to identify needed information and shift the burden of searching from researchers to the computer. Without automated text mining, much of biomolecular interactions and biological research archived in the literature will remain accessible in principle but underutilized in practice. One key area of text mining is relationship extraction that finds relationships between entities such as genes and proteins. Examples include MedMiner at the National Library of Medicine (128), PreBIND (35), the curated BIND system (2, 6), PathBinderH (155), and iHOP (63). (See Reference 26 for a complete survey of text mining applications.) Results on real-world tasks such as the automatic extraction and assignment of GO annotations are promising, but they are still far from reaching the required performance demanded by real-world applications (15). One key difficulty that needs to be addressed in this field is the complex nature of the names and terminology such as the large range of variants for protein names and GO terms in free text. The current generation of systems is beginning to combine statistical methods with machine learning to capture expert knowledge on how genes and proteins are referred to in scientific papers to create usable systems with high precision and recall for specialized tasks in the near future.

Computational Systems Biology

Classical systems analysis in engineering treats a system as a black box whose inner structure and behavior can be analyzed and modeled by varying internal or external conditions, and studying the effect of the variation on the external observables. The result is an understanding of the inner makeup and working mechanisms of the system (72). Systems biology is the application of this theory to biology. The observables are measurements of what the organism is doing, ranging from phenotypic descriptions to detailed metabolic profiling. A critical issue is how to effectively integrate various types of data, such as sequence, gene expression, protein interactions, and phenotypes to infer biological knowledge. Some areas that require more work include creating coherent validated data sets, developing
common formats for pathway data [SBML (65) and BioPAX (http://www.biopax.org)], and
creating ontologies to define complex interactions, curation, and linkages with
text-mining tools. The Systems Biology Workbench project (http://sbw.kgi.edu/) aims to
develop an open-source software framework for sharing information between different types
of pathway models. Other issues are that biological systems are underdefined (not enough
measurements are available to characterize the system) and samples are not taken often
enough to capture time changes in a sys- [...] a cell is still distant; however, the tools
that are being developed to integrate information from a wide variety of sources will be
valuable in the short term.

Semantic Web
Semantic web is a model to “create a universal mechanism for information exchange by
giving meaning, in a machine-interpretable way, to the content of documents and data on
the Web” (95). This model will enable the development of searching tools that know what
type of information can be obtained from which documents and understand how the
information in each document relates to another, which will allow software agents that can
use reasoning and logic to make decisions automatically based on the constraints provided
in the query (e.g., automatic travel agents, phenotype prediction) (12). Bioinformatics
could benefit enormously from successful implementation of this model and should play a
leading role in realizing it (95). Current efforts to realize the concepts of the semantic
web have been focused on developing standards and specifications for identifying and
describing data, such as the Uniform Resource Identifier (URI) and the Resource Description
Framework (RDF), respectively (http://www.w3c.org/2001/sw). Although implementation of
applications using the semantic web is scarce at this point, there are some useful
examples being developed such as Haystack (a browser that retrieves data from multiple
databases and allows users to annotate and manage the information to reflect their
understanding) (http://www-db.cs.wisc.edu/cidr/cidr2005/papers/P02.pdf) and BioDash (a
drug development user interface that associates diseases, drug progression stages,
molecular biology, and pathway knowledge for users)
(http://www.w3.org/2005/04/swls/BioDash/Demo/).

Research in nanotechnology and electron microscopy is allowing researchers to select
specific areas of cells and tissues and to image spatiotemporal distributions of signaling
receptors, gene expression, and proteins. Laser capture microdissection allows the
selection of specific tissue types for detailed analysis (42). This technique has been
applied to specific plant tissues in maize and Arabidopsis (73, 94). Confocal imaging is
being used to model auxin transport and gene expression patterns in Arabidopsis (60).
Methods in electron microscopy are being applied to image the spatiotemporal distribution
of signaling receptors (149). Improved methods in laser scanning microscopes may allow
measurements of fast diffusion and dynamic processes in the microsecond-to-millisecond
time range in live cells (34). These emerging capabilities will lead to new understanding
of cell dynamics.

CONCLUSION
In this review, we attempt to highlight some of the recent advances made in bioinformatics
in the basic areas of sequence, gene expression, protein, and metabolite analyses,
databases, and ontologies, current limitations in these areas, and some emerging areas. A
number of unsolved problems exist in bioinformatics
today, including data and database integration, automated knowledge extraction, robust
inference of phenotype from genotype, and training and retraining of students and
established researchers in bioinformatics. Bioinformatics is an approach that will be an
essential part of plant research and we hope that every plant researcher will incorporate
more bioinformatics tools and approaches in their research projects.

If the next 50 years of plant biology can be summed up in one word, it would be
“integration.” We will see integration of basic research with applied research in which
plant biotechnology will play an essential role in solving urgent problems in our society
such as developing renewable energy, reducing world hunger and poverty, and preserving the
environment. We will see integration of disparate, specialized areas of plant research
into more comparative, connected, holistic views and approaches in plant biology. We will
also see more integration of plant research and other biological research, from microbes
to human, from a large-scale comparative genomics perspective. Bioinformatics will provide
the glue with which all of these types of integration will occur. However, it will be
people, not tools, who will enable the gluing. Ways in which biological research will be
conducted in 2050 will be much different from the way in which it was done in 2000. Each
researcher will spend more time on the computer and the Internet to generate and describe
data and experiments, to analyze the data and find other people’s data relevant for
comparison, to find existing knowledge in the field and relate his or her results to the
current body of knowledge, and to publish his or her results to the world.
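As a rough illustration of the difficulty with name variants noted in the Text Mining section, the following Python sketch matches gene mentions against a small synonym table and scores the result with precision and recall. The synonym table, the helper names (normalize, find_mentions, precision_recall), and the example sentence are assumptions made for this illustration and are not taken from any of the systems cited in this review; real systems rely on much richer statistical and machine-learning models.

```python
import re

# Illustrative synonym table (an assumption for this sketch):
# canonical symbol -> surface forms that may appear in free text.
SYNONYMS = {
    "FT": ["FT", "FLOWERING LOCUS T"],
    "CO": ["CO", "CONSTANS"],
}


def normalize(text):
    """Lowercase the text and collapse every non-alphanumeric run into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_mentions(sentence, synonyms):
    """Return canonical symbols whose variants occur as whole tokens in the sentence."""
    padded = " " + normalize(sentence) + " "
    found = set()
    for symbol, variants in synonyms.items():
        if any(" " + normalize(v) + " " in padded for v in variants):
            found.add(symbol)
    return found


def precision_recall(predicted, gold):
    """Precision = TP / predicted; recall = TP / gold (0.0 when the denominator is empty)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sentence in which only FT is a genuine gene mention.
sentence = "Co-expression analysis identified FT."
predicted = find_mentions(sentence, SYNONYMS)
print(sorted(predicted), precision_recall(predicted, {"FT"}))
```

In this example the short symbol "CO" also matches the prefix of "co-expression", so precision drops to 0.5 while recall stays at 1.0. Ambiguities of this kind are one reason plain dictionary lookup falls short of the performance that practical applications need.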
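The idea of identifying and describing data with URIs and RDF statements, mentioned in the Semantic Web section, can be sketched with plain Python tuples. Each statement is a (subject, predicate, object) triple whose parts are URIs. The http://example.org/ namespace, the predicate names encodes and participatesIn, and the gene, protein, and pathway identifiers are all hypothetical; no RDF library or published vocabulary is used here.

```python
# Hypothetical namespace used only for this sketch.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple of URIs.
TRIPLES = [
    (EX + "gene/G1", EX + "encodes", EX + "protein/P1"),
    (EX + "protein/P1", EX + "participatesIn", EX + "pathway/W1"),
    (EX + "gene/G2", EX + "encodes", EX + "protein/P2"),
]


def match(triples, s=None, p=None, o=None):
    """Yield triples that agree with every non-None field of the pattern."""
    for subj, pred, obj in triples:
        if s is not None and subj != s:
            continue
        if p is not None and pred != p:
            continue
        if o is not None and obj != o:
            continue
        yield (subj, pred, obj)


def pathways_for_gene(triples, gene):
    """Follow gene -encodes-> protein -participatesIn-> pathway and collect the pathways."""
    pathways = []
    for _, _, protein in match(triples, s=gene, p=EX + "encodes"):
        for _, _, pathway in match(triples, s=protein, p=EX + "participatesIn"):
            pathways.append(pathway)
    return pathways


print(pathways_for_gene(TRIPLES, EX + "gene/G1"))
```

Because every statement has the same three-part shape, data from separate sources can be pooled into one list and queried together, which is the kind of machine-interpretable linkage the semantic web model aims for. A production system would more likely use an RDF store and a query language than hand-written loops.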
ACKNOWLEDGMENTS
We are grateful to Blake Meyers, Dan MacLean, Shijun Li, Scott Peck, Mark Lange, Bill
Beavis, Todd Vision, Stefanie Hartmann, Gary Stacey, Chris Town, Volker Brendel, and Nevin
Young for their critical comments on the manuscript. This work has been supported in part
by NSF grants DBI-9978564, DBI-0417062, DBI-0321666 (SYR); ITR-IIS-0407204 (DX);
DBI-0209809 (JD); USDA grants NRI-2002-35300-12619 (JD) and CSREES 2004-25604-14708 (DX);
NIH grants NHGRI-HG002273, R01-GM65466 (SYR); National Center for
Soybean Biotechnology (DX); Pioneer-Hi-Bred (SYR); and Carnegie Canada (SYR).
LITERATURE CITED
1. Aebersold R, Mann M. 2003. Mass spectrometry-based proteomics. Nature 422:198–207
2. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, et al. 2005. The Biomolecular
Interaction Network Database and related tools 2005 update. Nucleic Acids Res. 33:D418–
24
3. Allen JE, Pertea M, Salzberg SL. 2004. Computational gene prediction using multiple
sources of evidence. Genome Res. 14:142–48
4. Aris-Brosou S. 2005. Determinants of adaptive evolution at the molecular level: the
extended complexity hypothesis. Mol. Biol. Evol. 22:200–9
5. Ashburner M, Ball C, Blake J, Botstein D, Butler H, et al. 2000. Gene ontology: tool for
the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29
6. Bader G, Betel D, Hogue C. 2002. BIND: the Biomolecular Interaction Network
Database. Nucleic Acids Res. 31:248–50
7. Bammler T, Beyer RP, Bhattacharya S, Boorman GA, Boyles A, et al. 2005. Standardizing
global gene expression analysis between laboratories and across platforms. Nat. Methods
2:351–56
16. Blazej RG, Paegel BM, Mathies RA. 2003. Polymorphism ratio sequencing: a new
approach for single nucleotide polymorphism discovery and genotyping. Genome Res.
13:287–93
17. Blueggel M, Chamrad D, Meyer HE. 2004. Bioinformatics in proteomics. Curr. Pharm.
Biotechnol. 5:79–88
18. Boguski MS, Schuler GD. 1995. ESTablishing a human transcript map. Nat. Genet.
10:369–71
19. Brendel V, Zhu W. 2002. Computational modeling of gene structure in Arabidopsis
thaliana. Plant Mol. Biol. 48:49–58
20. Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, et al. 2000. Gene expression
analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat.
Biotechnol. 18:630–34
21. Brown JR, Sanseau P. 2005. A computational view of microRNAs and their targets. Drug
Discov. Today 10:595–601
22. Brown P, Botstein D. 1999. Exploring the new world of the genome with DNA microar-
rays. Nat. Genet. 21:33–37
23. Buck MJ, Lieb JD. 2004. ChIP-chip: considerations for the design, analysis, and applica-
tion of genome-wide chromatin immunoprecipitation experiments. Genomics 83:349–60
24. Canovas FM, Dumas-Gaudot E, Recorbet G, Jorrin J, Mock HP, Rossignol M. 2004.
Plant proteome analysis. Proteomics 4:285–98
25. Chen T, Kao MY, Tepel M, Rush J, Church GM. 2001. A dynamic programming ap-
proach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol.
8:325–37
26. Cohen AM, Hersh WR. 2005. A survey of current work in biomedical text mining. Brief
Bioinform. 6:57–71
27. Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP. 2004. A benchmark for Affymetrix
GeneChip expression measures. Bioinformatics 20:323–31
28. Coughlan SJ, Agrawal V, Meyers B. 2004. A comparison of global gene expression mea-
surement technologies in Arabidopsis thaliana. Comp. Funct. Genomics 5:245–52
29. Dalby PA. 2003. Optimising enzyme function by directed evolution. Curr. Opin. Struct.
Biol. 13:500–5
30. Dancik V, Addona TA, Clauser KR, Vath JE, Pevzner PA. 1999. De novo peptide se-
quencing via tandem mass spectrometry. J. Comput. Biol. 6:327–42
31. Densmore LD 3rd. 2001. Phylogenetic inference and parsimony analysis. Methods Mol.
Biol. 176:23–36
32. Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino W, et al.
2005. The RCSB Protein Data Bank: a redesigned query system and relational database
based on the mmCIF schema. Nucleic Acids Res. 33:D233–37
33. Di X, Matsuzaki H, Webster TA, Hubbell E, Liu G, et al. 2005. Dynamic model based al-
gorithms for screening and genotyping over 100 K SNPs on oligonucleotide microarrays.
Bioinformatics 21:1958–63
34. Digman MA, Brown CM, Sengupta P, Wiseman PW, Horwitz AR, Gratton E. 2005.
Measuring fast dynamics in solutions and cells with a laser scanning microscope. Biophys.
J. 89:1317–27
35. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, et al. 2003. PreBIND and
Textomy—mining the biomedical literature for protein-protein interactions using a sup-
port vector machine. BMC Bioinformatics 4:11
36. Doolittle WF. 1999. Phylogenetic classification and the universal tree. Science 284:2124–
29
37. Draghici S. 2003. Data Analysis Tools for DNA Microarrays. London: Chapman and Hall
38. Edgar R, Domrachev M, Lash AE. 2002. Gene Expression Omnibus: NCBI gene expression
and hybridization array data repository. Nucleic Acids Res. 30:207–10
39. Edwards JS, Palsson BO. 2000. The Escherichia coli MG1655 in silico metabolic geno-
type: its definition, characteristics, and capabilities. Proc. Natl. Acad. Sci. USA 97:5528–33
40. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, et al. 2005. The Sequence Ontol-
ogy: a tool for the unification of genome annotations. Genome Biol. 6:R44
41. Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of
genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–68
42. Emmert-Buck MR, Bonner RF, Smith PD, Chuaqui RF, Zhuang Z, et al. 1996. Laser
capture microdissection. Science 274:998–1001
43. Fazzari MJ, Greally JM. 2004. Epigenomics: beyond CpG islands. Nat. Rev. Genet. 5:446–
55
44. Fiehn O. 2002. Metabolomics—the link between genotypes and phenotypes. Plant Mol.
Biol. 48:155–71
45. Foissac S, Bardou P, Moisan A, Cros MJ, Schiex T. 2003. EUGENE’HOM: a generic
similarity-based gene finder using multiple homologous sequences. Nucleic Acids Res.
31:3742–45
46. Foster I. 2002. What is the Grid? A three point checklist. In GRIDToday, pp. 4. Chicago:
Argonne National Lab & University of Chicago
47. Fraser AG, Marcotte EM. 2004. A probabilistic view of gene function. Nat. Genet. 36:559–
64
48. Frazer KA, Chen X, Hinds DA, Pant PV, Patil N, Cox DR. 2003. Genomic DNA inser-
tions and deletions occur frequently between humans and nonhuman primates. Genome
Res. 13:341–46
49. Gannon D, Alameda J, Chipara O, Christie M, Duke V, et al. 2005. Building grid portal
applications from a Web service component architecture. Proc. IEEE 93:551–63
50. Garcia-Hernandez M, Berardini TZ, Chen G, Crist D, Doyle A, et al. 2002. TAIR: a
resource for integrated Arabidopsis data. Funct. Integr. Genomics 2:239–53
51. Gibbs RA, Weinstock GM. 2003. Evolving methods for the assembly of large genomes.
Cold Spring Harb. Symp. Quant. Biol. 68:189–94
52. Goff SA, Ricke D, Lan TH, Presting G, Wang R, et al. 2002. A draft sequence of the rice
genome (Oryza sativa L. ssp. japonica). Science 296:92–100
53. Gonzales MD, Archuleta E, Farmer A, Gajendran K, Grant D, et al. 2005. The Legume
Information System (LIS): an integrated information resource for comparative legume
biology. Nucleic Acids Res. 33:D660–65
54. Gorg A, Obermaier C, Boguth G, Harder A, Scheibe B, et al. 2000. The current state of
two-dimensional electrophoresis with immobilized pH gradients. Electrophoresis 21:1037–
53
55. Gras R, Muller M. 2001. Computational aspects of protein identification by mass spec-
trometry. Curr. Opin. Mol. Ther. 3:526–32
56. Guo H, Moose SP. 2003. Conserved noncoding sequences among cultivated cereal
genomes identify candidate regulatory sequence elements and patterns of promoter evo-
lution. Plant Cell 15:1143–58
57. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, et al. 2003. Improving the
Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic
73. Kerk NM, Ceserani T, Tausta SL, Sussex IM, Nelson TM. 2003. Laser capture microdis-
section of cells from plant tissues. Plant Physiol. 132:27–35
74. Khatri P, Draghici S. 2005. Ontological analysis of gene expression data: current tools,
limitations, and open problems. Bioinformatics 21:3587–95
75. Klamt S, Stelling J, Ginkel M, Gilles ED. 2003. FluxAnalyzer: exploring structure, path-
ways, and flux distributions in metabolic networks on interactive flux maps. Bioinformatics
19:261–69
76. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. 1993. De-
tecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science
262:208–14
77. Lawrence CJ, Seigfried TE, Brendel V. 2005. The maize genetics and genomics database.
The community resource for access to diverse maize data. Plant Physiol. 138:55–58
78. Lewin B. 2003. Genes VIII. Upper Saddle River, NJ: Prentice Hall
79. Li L, Wang X, Xia M, Stolc V, Su N, et al. 2005. Tiling microarray analysis of rice
chromosome 10 to identify the transcriptome and relate its expression to chromosomal
architecture. Genome Biol. 6:R52
80. Liu X, Noll DM, Lieb JD, Clarke ND. 2005. DIP-chip: rapid and accurate determination
93. Myers EW. 1995. Toward simplifying and accurately formulating fragment assembly. J.
Comput. Biol. 2:275–90
94. Nakazono M, Qiu F, Borsuk LA, Schnable PS. 2003. Laser-capture microdissection, a
tool for the global analysis of gene expression in specific plant cell types: identification of
genes expressed differentially in epidermal cells or vascular tissues of maize. Plant Cell.
15:583–96
95. Neumann E. 2005. A life science Semantic Web: Are we there yet? Sci. STKE 283:pe22
96. Newton RP, Brenton AG, Smith CJ, Dudley E. 2004. Plant proteome analysis by
mass spectrometry: principles, problems, pitfalls and recent developments. Phytochem-
istry 65:1449–85
97. Noel JP, Austin MB, Bomati EK. 2005. Structure-function relationships in plant phenyl-
propanoid biosynthesis. Curr. Opin. Plant Biol. 8:249–53
98. Papin JA, Reed JL, Palsson BO. 2004. Hierarchical thinking in network biology: the
ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic
Acids Res. 33:D553–55
101. Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, et al. 2001. Blocks of limited hap-
lotype diversity revealed by high-resolution scanning of human chromosome 21. Science
294:1719–23
102. Pop M, Phillippy A, Delcher AL, Salzberg SL. 2004. Comparative genome assembly.
Brief Bioinform. 5:237–48
103. Rhee SY. 2004. Carpe diem. Retooling the publish or perish model into the share and
survive model. Plant Physiol. 134:543–47
104. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, et al. 2003. The Arabidopsis
Information Resource (TAIR): a model organism database providing a centralized, cu-
rated gateway to Arabidopsis biology, research materials and community. Nucleic Acids
Res. 31:224–28
105. Rhee SY, Crosby B. 2005. Biological databases for plant research. Plant Physiol. 138:1–3
106. Rhodes D, Yu J, Shanker K, Deshpande N, Varambally R, et al. 2004. Large-scale meta-
analysis of cancer microarray data identifies common transcriptional profiles of neoplastic
transformation and progression. Proc. Natl. Acad. Sci. USA 101:9309–14
107. Roberts C, Nelson B, Marton M, Stoughton R, Meyer M, et al. 2000. Signaling and
circuitry of multiple MAPK pathways revealed by a matrix of global gene expression
profiles. Science 287:873–80
108. Roth FP, Hughes JD, Estep PW, Church GM. 1998. Finding DNA regulatory motifs
within unaligned noncoding sequences clustered by whole-genome mRNA quantitation.
Nat. Biotechnol. 16:939–45
109. Saal LH, Troein C, Vallon-Christersson J, Gruvberger S, Borg Å, Peterson C. 2002.
BioArray Software Environment: a platform for comprehensive management and analysis
of microarray data. Genome Biol. 3:software0003.1–.6
110. Schauer N, Steinhauser D, Strelkov S, Schomburg D, Allison G, et al. 2005. GC-MS
libraries for the rapid identification of metabolites in complex biological samples. FEBS
Lett. 579:1332–37
111. Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene
expression patterns with a complementary DNA microarray. Science 270:467–70
112. Schlueter SD, Dong Q, Brendel V. 2003. GeneSeqer@PlantGDB: gene structure pre-
diction in plant genomes. Nucleic Acids Res. 31:3597–600
113. Schneider M, Bairoch A, Wu CH, Apweiler R. 2005. Plant protein annotation in the
UniProt Knowledgebase. Plant Physiol. 138:59–66
114. Seo TS, Bai X, Kim DH, Meng Q, Shi S, et al. 2005. Four-color DNA sequencing by
synthesis on a chip using photocleavable fluorescent nucleotides. Proc. Natl. Acad. Sci.
USA 102:5926–31
115. Shanks JV. 2005. Phytochemical engineering: combining chemical reaction engineering
with plant science. AIChE J. 51:2–7
116. Shen L, Gong J, Caldo RA, Nettleton D, Cook D, et al. 2005. BarleyBase—an expression
profiling database for plant genomics. Nucleic Acids Res. 33:D614–18
117. Sinha U, Bui A, Taira R, Dionisio J, Morioka C, et al. 2002. A review of medical imaging
informatics. Ann. NY Acad. Sci. 980:168–97
118. Slonim DK. 2002. From patterns to pathways: gene expression data analysis comes of
age. Nat. Genet. 32:502–8
119. SMRS Working Group. 2005. Summary recommendations for standardization and re-
porting of metabolic analyses. Nat. Biotechnol. 23:833–38
120. Sriram G, Fulton DB, Iyer VV, Peterson JM, Zhou R, et al. 2004. Quantification of
compartmented metabolic fluxes in developing soybean embryos by employing
biosynthetically directed fractional 13C labeling, two-dimensional [13C, 1H] nuclear
magnetic resonance, and comprehensive isotopomer balancing. Plant Physiol. 136:3043–57
121. Steuer R, Kurths J, Fiehn O, Weckwerth W. 2003. Interpreting correlations in
metabolomic networks. Biochem. Soc. Trans. 31:1476–78
122. Steuer R, Kurths J, Fiehn O, Weckwerth W. 2003. Observing and interpreting correla-
tions in metabolomic networks. Bioinformatics 19:1019–26
123. Stevens J, Doerge R. 2005. Combining Affymetrix microarray results. BMC Bioinformatics
6:57
124. Stevens R, Goble CA, Bechhofer S. 2000. Ontology-based knowledge representation for
bioinformatics. Brief Bioinform. 1:398–414
125. Stevens RD, Robinson AJ, Goble CA. 2003. myGrid: personalised bioinformatics on the
information grid. Bioinformatics 19(Suppl.)1:i302–4
126. Stoeckert CJ Jr, Causton HC, Ball CA. 2002. Microarray databases: standards and on-
tologies. Nat. Genet. 32(Suppl.):469–73
127. Stolc V, Samanta MP, Tongprasit W, Sethi H, Liang S, et al. 2005. Identification of
transcribed sequences in Arabidopsis thaliana by using high-resolution genome tiling
arrays. Proc. Natl. Acad. Sci. USA 102:4453–58
128. Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN. 1999. MedMiner: an
internet text-mining tool for biomedical information, with application to gene expression
profiling. BioTechniques 27:1210–17
129. Tchieu JH, Fana F, Fink JL, Harper J, Nair TM, et al. 2003. The PlantsP and PlantsT
Functional Genomics Databases. Nucleic Acids Res. 31:342–44
130. The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flow-
ering plant Arabidopsis thaliana. Nature 408:796–815
131. The Wellcome Trust. 2003. Sharing Data from Large-Scale Biological Research Projects: A
System of Tripartite Responsibility. Fort Lauderdale, FL: Wellcome Trust
132. Toyoda T, Shinozaki K. 2005. Tiling array-driven elucidation of transcriptional structures
based on maximum-likelihood and Markov models. Plant J. 43:611–21
133. Trethewey R. 2004. Metabolite profiling as an aid to metabolic engineering in plants.
Curr. Opin. Plant Biol. 7:196–201
134. van Helden J. 2003. Regulatory sequence analysis tools. Nucleic Acids Res. 31:3593–96
135. Vincent PL, Coe EH, Polacco ML. 2003. Zea mays ontology—a database of international
terms. Trends Plant Sci. 8:517–20
136. Wan X, Xu D. 2005. Computational methods for remote homolog identification. Curr.
Protein Peptide Sci. 6:527–46
137. Ware DH, Jaiswal P, Ni J, Yap IV, Pan X, et al. 2002. Gramene, a tool for grass genomics.
Plant Physiol. 130:1606–13
138. Weckwerth W, Loureiro M, Wenzel K, Fiehn O. 2004. Differential metabolic networks
unravel the effects of silent plant phenotypes. Proc. Natl. Acad. Sci. USA 101:7809–14
139. Wheeler DL, Smith-White B, Chetvernin V, Resenchuk S, Dombrowski SM, et al.
2005. Plant genome resources at the national center for biotechnology information. Plant
Physiol. 138:1280–88
140. Wiechert W, Mollney M, Petersen S, de Graaf AA. 2001. A universal framework for 13C
142. Woo Y, Affourtit J, Daigle S, Viale A, Johnson K, et al. 2004. A comparison of cDNA,
oligonucleotide, and affymetrix GeneChip gene expression microarray platforms. J.
Biomol. Tech. 15:276–84
143. Yamada K, Lim J, Dale JM, Chen H, Shinn P, et al. 2003. Empirical analysis of transcrip-
tional activity in the Arabidopsis genome. Science 302:842–46
144. Yamazaki Y, Jaiswal P. 2005. Biological ontologies in rice databases. An introduction to
the activities in Gramene and Oryzabase. Plant Cell Physiol. 46:63–68
145. Yates JR 3rd, Eng JK, McCormack AL, Schieltz D. 1995. Method to correlate tandem
mass spectra of modified peptides to amino acid sequences in the protein database. Anal.
Chem. 67:1426–36
146. Yona G, Levitt M. 2002. Within the twilight zone: a sensitive profile-profile comparison
tool based on information theory. J. Mol. Biol. 315:1257–75
147. Yu J, Hu S, Wang J, Wong GK, Li S, et al. 2002. A draft sequence of the rice genome
(Oryza sativa L. ssp. indica). Science 296:79–92
148. Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, et al. 2005. The institute for genomic
research Osa1 rice genome annotation database. Plant Physiol. 138:18–26
149. Zhang J, Leiderman K, Pfeiffer JR, Wilson BS, Oliver JM, Steinberg SL. 2006. Char-
acterizing the topography of membrane receptors and signaling molecules from spatial
patterns obtained using nanometer-scale electron-dense probes and electron microscopy.
Micron 37:14–34
150. Zhang MQ. 2002. Computational prediction of eukaryotic protein-coding genes. Nat.
Rev. Genet. 3:698–709
151. Zhang P, Foerster H, Tissier CP, Mueller L, Paley S, et al. 2005. MetaCyc and AraCyc.
Metabolic pathway databases for plant research. Plant Physiol. 138:27–37
152. Zhu H, Bilgin M, Snyder M. 2003. Proteomics. Annu. Rev. Biochem. 72:783–812
153. Zhu T, Wang X. 2000. Large-scale profiling of the Arabidopsis transcriptome. Plant
Physiol. 124:1472–76
154. Zhu W, Schlueter SD, Brendel V. 2003. Refined annotation of the Arabidopsis genome
by complete expressed sequence tag mapping. Plant Physiol. 132:469–84
155. Ding J, Viswanathan K, Berleant D, Hughes L, Wurtele ES, et al. 2005. Using the biologi-
cal taxonomy to access biological literature with PathBinderH. Bioinformatics 21:2560–62
DISCLOSURE STATEMENT
J.D. is a PI of the PLEXdb database that focuses on using Affymetrix GeneChips for cross-
species comparison.
ERRATA
An online log of corrections to Annual Review of Plant Biology chapters (if any, 1977 to
the present) may be found at http://plant.annualreviews.org/