Bioinformatics and Its Applications in Plant Biology: Seung Yon Rhee, Julie Dickerson, and Dong Xu

ANRV274-PP57-13 ARI 5 April 2006 19:12
Seung Yon Rhee,1 Julie Dickerson,2 and Dong Xu3

1 Department of Plant Biology, Carnegie Institution, Stanford, California 94305; email: [email protected]
2 Baker Center for Computational Biology, Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50011-3060; email: [email protected]
3 Digital Biology Laboratory, Computer Science Department and Life Sciences Center, University of Missouri-Columbia, Columbia, Missouri 65211-2060; email: [email protected]
Annu. Rev. Plant Biol. 2006. 57:335–60
The Annual Review of Plant Biology is online at plant.annualreviews.org
doi: 10.1146/annurev.arplant.56.032604.144103
Copyright © 2006 by Annual Reviews. All rights reserved
First published online as a Review in Advance on February 28, 2006
1543-5008/06/0602-0335$20.00

Key Words

sequence analysis, computational proteomics, microarray data analysis, bio-ontology, biological database

Abstract

Bioinformatics plays an essential role in today’s plant science. As the amount of data grows exponentially, there is a parallel growth in the demand for tools and methods in data management, visualization, integration, analysis, modeling, and prediction. At the same time, many researchers in biology are unfamiliar with available bioinformatics methods, tools, and databases, which could lead to missed opportunities or misinterpretation of the information. In this review, we describe some of the key concepts, methods, software packages, and databases used in bioinformatics, with an emphasis on those relevant to plant science. We also cover some fundamental issues related to biological sequence analyses, transcriptome analyses, computational proteomics, computational metabolomics, bio-ontologies, and biological databases. Finally, we explore a few emerging research topics in bioinformatics.
Contents

INTRODUCTION
SEQUENCE ANALYSIS
    Genome Sequencing
    Gene Finding and Genome Annotation
    Sequence Comparison
TRANSCRIPTOME ANALYSIS
    Microarray Analysis
    Tiling Arrays
    Regulatory Sequence Analysis
COMPUTATIONAL PROTEOMICS
    Electrophoresis Analysis
    Protein Identification Through Mass Spectrometry
METABOLOMICS AND METABOLIC FLUX
ONTOLOGIES
    Types of Bio-Ontologies
    Applications of Ontologies
    Software for Accessing and Analyzing Ontologies and Annotations
DATABASES
    Types of Biological Databases
    Data Representation and Storage
    Data Access and Exchange
    Data Curation
EMERGING AREAS IN BIOINFORMATICS
    Text Mining
    Computational Systems Biology
    Semantic Web
    Cellular Localization and Spatially Resolved Data
CONCLUSION

INTRODUCTION

Recent developments in technologies and instrumentation, which allow large-scale as well as nano-scale probing of biological samples, are generating an unprecedented amount of digital data. This sea of data is too much for the human brain to process and thus there is an increasing need to use computational methods to process and contextualize these data.

Bioinformatics refers to the study of biological information using concepts and methods in computer science, statistics, and engineering. It can be divided into two categories: biological information management and computational biology. The National Institutes of Health (NIH) (http://www.bisti.nih.gov/) defines the former category as “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, represent, describe, store, analyze, or visualize such data.” The latter category is defined as “the development and application of data-analytical and theoretical methods, mathematical modeling, and computational simulation techniques to the study of biological, behavioral, and social systems.” The boundaries of these categories are becoming more diffuse and other categories will no doubt surface in the future as this field matures.

The intention of this article is not to provide an exhaustive summary of all the advances made in bioinformatics. Rather, we describe some of the key concepts, methods, and tools used in this field, particularly those relevant to plant science, and their current limitations and opportunities for new development and improvement. The first section introduces sequence-based analyses, including gene finding, gene family and phylogenetic analyses, and comparative genomics approaches. The second section presents computational transcriptome analysis, ranging from analyses of various array technologies to regulatory sequence prediction. In section three, we focus on computational proteomics, including gel analysis and protein identification from mass-spectrometry data. Section four describes computational metabolomics. Section five introduces biological ontologies and their applications. Section six addresses various issues related to biological databases

ranging from database development to curation. In section seven, we discuss a few emerging research topics in bioinformatics.

SEQUENCE ANALYSIS

Biological sequences such as DNA, RNA, and protein sequences are the most fundamental objects of a biological system at the molecular level. Several plant genomes have been sequenced to a high quality, including Arabidopsis thaliana (130) and rice (52, 147, 148). Draft genome sequences are available for poplar (http://genome.jgi-psf.org/Poptr1/) and lotus (http://www.kazusa.or.jp/lotus/), and sequencing efforts are in progress for several others including tomato, maize, Medicago truncatula, sorghum (11), and close relatives of Arabidopsis thaliana. Researchers have also generated expressed sequence tags (ESTs) from many plants including lotus, beet, soybean, cotton, wheat, and sorghum (see http://www.ncbi.nlm.nih.gov/dbEST/).

Genome Sequencing

Advances in sequencing technologies provide opportunities in bioinformatics for managing, processing, and analyzing the sequences. Shotgun sequencing is currently the most common method in genome sequencing: pieces of DNA are sheared randomly, cloned, and sequenced in parallel. Software has been developed to piece together the random, overlapping segments that are sequenced separately into a coherent and accurate contiguous sequence (93). Numerous software packages exist for sequence assembly (51), including Phred/Phrap/Consed (http://www.phrap.org), Arachne (http://www.broad.mit.edu/wga/), and GAP4 (http://staden.sourceforge.net/overview.html). TIGR developed a modular, open-source package called AMOS (http://www.tigr.org/software/AMOS/), which can be used for comparative genome assembly (102). Current limitations in shotgun sequencing and assembly software remain largely in the assembly of highly repetitive sequences, although the cost of sequencing is another limitation. Recently developed methods continue to reduce the cost of sequencing, including sequencing by differential hybridization of oligonucleotide probes (48, 62, 101), polymorphism ratio sequencing (16), four-color DNA sequencing by synthesis on a chip (114), and the “454 method” based on microfabricated high-density picoliter reactors (87). Each of these sequencing technologies poses significant analytical challenges for bioinformatics in terms of experimental design, data interpretation, and analysis of the data in conjunction with other data (33).

Gene Finding and Genome Annotation

Gene finding refers to prediction of introns and exons in a segment of DNA sequence. Dozens of computer programs for identifying protein-coding genes are available (150). Some of the well-known ones include Genscan (http://genes.mit.edu/GENSCAN.html), GeneMarkHMM (http://opal.biology.gatech.edu/GeneMark/), GRAIL (http://compbio.ornl.gov/Grail-1.3/), Genie (http://www.fruitfly.org/seq_tools/genie.html), and Glimmer (http://www.tigr.org/softlab/glimmer). Several new gene-finding tools are tailored for applications to plant genomic sequences (112).

Ab initio gene prediction remains a challenging problem, especially for large eukaryotic genomes. For a typical Arabidopsis thaliana gene with five exons, at least one exon is expected to have at least one of its borders predicted incorrectly by the ab initio approach (19). Transcript evidence from full-length cDNA or EST sequences or similarity to potential protein homologs can significantly reduce uncertainty of gene identification (154). Such methods are widely used in “structural annotation” of genomes, which refers to the identification of features such as genes and transposons in a genomic sequence using ab initio algorithms and other

information. Several software packages have been developed for structural annotation (3, 45, 57, 66). In addition, one can use genome comparison tools such as SynBrowse (http://www.synbrowser.org/) and VISTA (http://genome.lbl.gov/vista/index.shtml) to enhance the accuracy of gene identification. Current limitations of structural annotation include accurate prediction of transcript start sites and identification of small genes encoding fewer than 100 amino acids, noncoding genes (such as microRNA precursors), and alternative splicing sites.

An important aspect of genome annotation is the analysis of repetitive DNAs, which are copies of identical or nearly identical sequences present in the genome (78). Repetitive sequences exist in almost any genome, and are abundant in most plant genomes (69). The identification and characterization of repeats is crucial to shed light on the evolution, function, and organization of genomes and to enable filtering for many types of homology searches. A small library of plant-specific repeats can be found at ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats/; this is likely to grow substantially as more genomes are sequenced. One can use RepeatMasker (http://www.repeatmasker.org/) to search for repetitive sequences in a genome. Working from a library of known repeats, RepeatMasker is built upon BLAST and can screen DNA sequences for interspersed repeats and low complexity regions. Repeats with poorly conserved patterns or short sequences are hard to identify using RepeatMasker due to the limitations of BLAST. To identify novel repeats, various algorithms were developed. Some widely used tools include RepeatFinder (http://ser-loopp.tc.cornell.edu/cbsu/repeatfinder.htm) and RECON (http://www.genetics.wustl.edu/eddy/recon/). However, due to the high computational complexity of the problem, none of the programs can guarantee finding all possible repeats, as all the programs use some approximations in computation, which will miss some repeats with less distinctive patterns. Inevitably, a combination of repeat finding tools is required to obtain a satisfactory overview of repeats found in an organism.

Sequence Comparison

Comparing sequences provides a foundation for many bioinformatics tools and may allow inference of the function, structure, and evolution of genes and genomes. For example, sequence comparison provides a basis for building a consensus gene model like UniGene (18). Also, many computational methods have been developed for homology identification (136). Although sequence comparison is highly useful, it should be noted that it is based on sequence similarity between two strings of text, which may not correspond to homology (relatedness to a common ancestor in evolution), especially when the confidence level of a comparison result is low. Also, homology may not mean conservation in function.

Methods in sequence comparison can be largely grouped into pair-wise, sequence-profile, and profile-profile comparison. For pair-wise sequence comparison, FASTA (http://fasta.bioch.virginia.edu/) and BLAST (http://www.ncbi.nlm.nih.gov/blast/) are popular. To assess the confidence level for an alignment to represent a homologous relationship, a statistical measure (Expectation Value) was integrated into pair-wise sequence alignments (71). Remote homologous relationships are often missed by pair-wise sequence alignment due to its insensitivity. Sequence-profile alignment is more sensitive for detecting remote homologs. A protein sequence profile is generated by multiple sequence alignment of a group of closely related proteins. A multiple sequence alignment builds correspondence among residues across all of the sequences simultaneously, where aligned positions in different sequences probably show a functional and/or structural relationship. A sequence profile is calculated using the probability of

occurrence for each amino acid at each alignment position. PSI-BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) is a popular example of a sequence-profile alignment tool. Some other sequence-profile comparison methods are slower but even more accurate than PSI-BLAST, including HMMER (http://hmmer.wustl.edu/), SAM (http://www.cse.ucsc.edu/research/compbio/sam.html), and META-MEME (http://metameme.sdsc.edu/). A profile-profile alignment is more sensitive than the sequence-profile-based search programs in detecting remote homologs (146). However, due to its high false positive rate, profile-profile comparison is not widely used. Given potential false positive predictions, it is helpful to correlate the sequence comparison results with the relationship observed in functional genomic data, especially the widely available microarray data as discussed in the section Transcriptome Analysis below. For example, when a gene is predicted to have a particular function through sequence comparison, one can gain confidence in the prediction if the gene has strong correlation in gene expression profile with other genes known to have the same function.

Proteins can be generally classified based on sequence, structure, or function. Several sequence-based methods were developed based on sizable protein sequence (typically longer than 100 amino acids), including Pfam (http://pfam.wustl.edu/), ProDom (http://protein.toulouse.inra.fr/prodom/current/html/home.php), and Clusters of Orthologous Groups (COG) (http://www.ncbi.nlm.nih.gov/COG/new/). Other methods are based on “fingerprints” of small conserved motifs in sequences, as with PROSITE (http://au.expasy.org/prosite/), PRINTS (http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/), and BLOCKS (http://www.psc.edu/general/software/packages/blocks/blocks.html). The false positive rate of motif assignment is high due to the high probability of matching short motifs in unrelated proteins by chance. Other sequence-based protein family databases are built from multiple sources. InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates domain information from multiple protein domain databases. Using protein family information to predict gene function is more reliable than using sequence comparison alone. On the other hand, very closely related proteins may not guarantee a functional relationship (97). One can use structure- or function-based protein families (when available) to complement sequence-based families for additional function information. SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/) and CATH (http://cathwww.biochem.ucl.ac.uk/) are the two well-known structure-based family resources. ENZYME (http://us.expasy.org/enzyme/) is a typical example of a function family.

A protein family can be represented in a phylogenetic tree that shows the evolutionary relationships among proteins. Phylogenetic analysis can be used in comparative genomics, gene function prediction, and inference of lateral gene transfer, among other things (36). The analysis typically starts by aligning the related proteins using tools like ClustalW (http://bips.u-strasbg.fr/fr/Documentation/ClustalX/). Among the popular methods to build phylogenetic trees are minimum distance (also called neighbor joining), maximum parsimony, and maximum likelihood trees (reviewed in 31). Some programs provide options to use any of the three methods, e.g., the two widely used packages PAUP (http://paup.csit.fsu.edu) and PHYLIP (http://evolution.genetics.washington.edu/phylip.html). Although phylogenetic analysis is a research topic with a long history and many methods have been developed, various heuristics and approximations are used in constructing a phylogenetic tree, as the exact methods are too computationally intense. Hence, different methods sometimes produce significantly different phylogenetic trees. Manual assessment of different results is generally required.
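The sequence profile described earlier in this section (the probability of occurrence of each amino acid at each alignment position) can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the method used by PSI-BLAST or any other tool named above: the function name `sequence_profile`, the `pseudocount` parameter, and the toy alignment are hypothetical, gap characters are simply ignored, and no sequence weighting is applied.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(alignment, pseudocount=0.0):
    """Return, for each alignment column, a dict of amino acid probabilities.

    Assumptions (not from the review): gaps and unknown symbols are skipped,
    every sequence has equal weight, and an optional uniform pseudocount
    can be added to every amino acid.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        # Count only the 20 standard amino acids in this column.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        if total == 0:
            # Column made only of gaps and no pseudocount: no information.
            profile.append({aa: 0.0 for aa in AMINO_ACIDS})
            continue
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment of four hypothetical, closely related peptides.
toy_alignment = ["MKV-A", "MRVLA", "MKILA", "MKVLG"]
profile = sequence_profile(toy_alignment)
print(profile[1]["K"], profile[1]["R"])
```

In this toy case the second column contains three K and one R, so the profile assigns them probabilities 0.75 and 0.25. Real tools typically add sequence weighting and background-aware pseudocounts, which this sketch omits.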

TRANSCRIPTOME ANALYSIS

The primary goal of transcriptome analysis is to learn about how changes in transcript abundance control growth and development of an organism and its response to the environment. DNA microarrays have proved a powerful technology for observing the transcriptional profile of genes at a genome-wide level (22, 111). Microarray data are also being combined with other information such as regulatory sequence analysis, gene ontology, and pathway information to infer coregulated processes. Whole-genome tiled arrays are used to detect transcription without bias toward known or predicted gene structures and alternative splice variants. Other types of analysis include ChIP-chip [chromatin immunoprecipitation (ChIP) and microarray analysis (chip)] analysis, which combines microarrays with methods for detecting the chromosomal locations at which protein-DNA interactions occur across the genome (23). A related technique uses DNA immunoprecipitation (DIP-chip) to predict DNA-binding sites (80). This review does not cover all available technologies for measuring expression data such as tag-based transcriptional profiling technologies like massively parallel signature sequencing (MPSS) and SAGE (20, 28).

Microarray Analysis

Microarray analysis allows the simultaneous measurement of transcript abundance for thousands of genes (153). Two general types of microarrays are high-density oligonucleotide arrays that contain a large number (thousands or often millions) of relatively short (25–100-mer) probes synthesized directly on the surface of the arrays, and arrays with amplified polymerase chain reaction products or cloned DNA fragments mechanically spotted directly on the array surface. Many different technologies are being developed, which have been recently surveyed by Meyers and colleagues (89). Competition among microarray platforms has led to lower costs and increased numbers of genes per array. Unfortunately, the diversity of array platforms makes it difficult to compare results between microarray formats that use different probe sequences, RNA sample labeling, and data collection methods (142).

Other important issues in microarray analysis are in processing and normalizing data. Some journals require multiple biological replicates (typically at least three) and statistically valid results before publishing microarray results. Replication of the microarray experiment and appropriate statistical design are needed to minimize the false discovery rate. The microarray data must also be deposited into a permanent public repository with open access. A good overview of microarray data analysis can be found in References 37 and 118. The main difficulty of dealing with microarray data is the sheer amount of data resulting from a single experiment. This makes it very difficult to decide which transcripts to focus on for interpreting the results. Even for standardized arrays such as those from Affymetrix, there are still arguments on the optimal statistical treatment for the sets of probes designed for each gene. For example, the Affycomp software compares Affymetrix results using two spike-in experiments and a dilution experiment for different methods of normalization under different assessment criteria (27). This information can be used to select the appropriate normalization methods.

Many tools are available that perform a variety of analyses on large microarray data sets. Examples include commercial software such as Gene Traffic, GeneSpring (http://www.agilent.com/chem/genespring), and Affymetrix’s GeneChip Operating Software (GCOS), and public software such as Cluster (41), CaARRAY (http://caarray.nci.nih.gov/), and BASE (109). A notable example is Bioconductor (http://www.bioconductor.org), which is an open-source and open-development set of routines written for the open-source R statistical analysis package (http://www.r-project.org).
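To make the false discovery rate mentioned above concrete, the following sketch applies the Benjamini-Hochberg step-up adjustment to a list of per-gene p-values. This is one commonly used procedure and is named here as an assumption; the review does not prescribe a specific method, and the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For the i-th smallest p-value (rank i of m), the raw adjustment is
    p * m / i; a running minimum taken from the largest rank downward
    keeps the adjusted values monotone.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for five genes from a differential expression test.
p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(p)
# Genes whose adjusted p-value passes a 5% false discovery rate threshold.
significant = [i for i, v in enumerate(q) if v <= 0.05]
print(q, significant)
```

With these made-up inputs, only the first two genes remain significant at a 5% false discovery rate, even though four of the five raw p-values are below 0.05. This illustrates why multiple-testing correction matters when thousands of transcripts are tested at once.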

Observing the patterns of transcriptional activity that occur under different conditions such as genotypes or time courses reveals genes that have highly correlated patterns of expression. However, correlation cannot distinguish between genes that are under common regulatory control and those whose expression patterns just happen to correlate. Recent efforts in microarray analysis have focused on analysis of microarray data across experiments (91). A study by the Toxicogenomics research consortium indicates that “microarray results can be comparable across multiple laboratories, especially when a common platform and set of procedures are used” (7). Meta-analysis can investigate the effect of the same treatment across different studies to arrive at a single estimate of the true effect of the treatment (106, 123).

Tiling Arrays

Typical microarrays sample known and predicted genes. Tiling arrays cover the genome at regular intervals to measure transcription without bias toward known or predicted gene structures, and they support the discovery of polymorphisms, the analysis of alternative splicing, and the identification of transcription factor-binding sites (90). Whole-genome arrays (WGAs) cover the entire genome with overlapping probes or probes with regular gaps. The WGA ensures that the experimental results are not dependent on the level of current genome annotation and enables the discovery of new transcripts and unusual forms of transcription. In plants, similar studies have been performed for the entire Arabidopsis genome (127, 143) and parts of the rice genome (70, 79). These studies identified thousands of novel transcription units including genes within the centromeres, substantial antisense gene transcription, and transcription activity in intergenic regions. Tiling array data may also be used to validate predicted intron/exon boundaries (132).

Further work is needed to establish the best practices for determining when transcription has occurred and how to normalize array data across the different chips. Visualization of the output from tiling arrays requires viewing the probe sequences on the array together with the sequence assembly and the probe expression data. The Arabidopsis Tiling Array Transcriptome Express Tool (also known as ChipViewer) (http://signal.salk.edu/cgi-bin/atta) displays information about what type of transcription occurred along the Arabidopsis genome (143). Another tool is the Integrated Genome Browser (IGB) from Affymetrix, a Java program for exploring genomes and combining annotations from multiple data sources. Another option for visualizing such data is offered by collaborations such as those between Gramene (137) and PLEXdb (116), which allow users to overlay probe array information onto a comparative sequence viewer.

The major limitations of WGAs include the requirement of a sequenced genome, the large number of chips required for complete genome coverage, and analysis of recently duplicated (and thus highly homologous) genes.

Regulatory Sequence Analysis

Interpreting the results of microarray experiments involves discovering why genes with similar expression profiles behave in a coordinated fashion. Regulatory sequence analysis approaches this question by extracting motifs that are shared between the upstream sequences of these genes (134). Comparative genomics studies of conserved noncoding sequences (CNSs) may also help to find key motifs (56, 67). There are several methods to search for over-represented motifs upstream of coregulated genes. Roughly, they can be categorized into two classes: oligonucleotide frequency-based (68, 134) and probabilistic sequence-based models (76, 85, 108).

The oligonucleotide frequency-based method calculates the statistical significance of a site based on oligonucleotide frequency tables observed in all noncoding regions of the specific organism’s genome. Usually, the

length of the oligonucleotide varies from 4 to 9 bases. Hexanucleotide (oligonucleotide length of 6) analysis is most widely used. The significant oligonucleotides can then be grouped as longer consensus motifs. Frequency-based methods tend to be simple, efficient, and exhaustive (all over-represented patterns of chosen length are detected). The main limitation is the difficulty of identifying complex motif patterns. The public Web resource, Regulatory Sequence Analysis Tools (RSAT), performs sequence similarity searches and analyzes the noncoding sequences in the genomes (134).

For the probabilistic-based methods, the motif is represented as a position probability matrix, where the motifs are assumed to be hidden in the noisy background sequences. One of the strengths of probabilistic-based methods is the ability to identify motifs with complex patterns. Many potential motifs can be identified; however, it can be difficult to separate unique motifs from this large pool of potential solutions. Probabilistic-based methods also tend to be computationally intense as they must be run multiple times to get an optimal solution. AlignACE (Aligns Nucleic Acid Conserved Elements; http://atlas.med.harvard.edu/) is a popular motif finding tool that was first developed for yeast but has been expanded to other species (107).

COMPUTATIONAL PROTEOMICS

Proteomics is a leading technology for the qualitative and quantitative characterization of proteins and their interactions on a genome scale. The objectives of proteomics include large-scale identification and quantification of all protein types in a cell or tissue, analysis of post-translational modification and association with other proteins, and characterization of protein activities and structures. Application of proteomics in plants is still in its initial phase, mostly in protein identification (24, 96). Other aspects of proteomics (reviewed in 152), such as identification and prediction of protein-protein interactions, protein activity profiling, protein subcellular localization, and protein structure, have not been widely used in plant science. However, recent efforts such as the structural genomics initiative that includes Arabidopsis (http://www.uwstructuralgenomics.org/) are encouraging.

Electrophoresis Analysis

Electrophoresis analysis can qualitatively and quantitatively investigate expression of proteins under different conditions (54). Several bioinformatics tools have been developed for two-dimensional (2D) electrophoresis analysis (86). SWISS-2DPAGE can locate the proteins on the 2D PAGE maps from Swiss-Prot (http://au.expasy.org/ch2d/). Melanie (http://au.expasy.org/melanie/) can analyze, annotate, and query complex 2D gel samples. Flicker (http://open2dprot.sourceforge.net/Flicker/) is an open-source stand-alone program for visually comparing 2D gel images. PDQuest (http://www.proteomeworks.bio-rad.com) is a popular commercial software package for comparing 2D gel images. Some software platforms handle related data storage and management, including PEDRo (http://pedro.man.ac.uk/), a software package for modeling, capturing, and disseminating 2D gel data and other proteomics experimental data. The main limitations of electrophoresis analysis include limited ability to identify proteins and low accuracy in detecting protein abundance.

Protein Identification Through Mass Spectrometry

After protein separation using 2D electrophoresis or liquid chromatography and protein digestion using an enzyme (trypsin, pepsin, glu-C, etc.), proteins are typically identified using mass spectrometry (MS) (1). In contrast to other protein identification techniques, such as Edman degradation microsequencing, MS provides a high-throughput

ANRV274-PP57-13 ARI 5 April 2006 19:12
approach for large-scale protein identifica- In this case, additional MS/MS experiments
tion. The data generated from mass spec- are needed to identify the proteins.
trometers are often complicated and compu-
tational analyses are critical in interpreting the
data for protein identification (17, 55). A major limitation in MS protein identification is the lack of open-source software. Most widely used tools are expensive commercial packages. In addition, current statistical models for matches between MS spectra and protein sequences are generally oversimplified. Hence, the confidence assessments for computational protein identification results are often unreliable. There are two types of MS-based protein identification methods: peptide mass fingerprinting (PMF) and tandem mass spectrometry (MS/MS).

Peptide mass fingerprinting. PMF peptide/protein identification compares the masses of peptides derived from the experimental spectral peaks with each of the possible peptides computationally digested from proteins in the sequence database. The proteins in the sequence database with a significant number of peptide matches are considered candidates for the proteins in the experimental sample. MOWSE (99) was an earlier software package for PMF protein identification, and Emowse (http://emboss.sourceforge.net/) is the latest implementation of the MOWSE algorithm. Several other computational tools have also been developed for PMF protein identification. MS-Fit in the Protein Prospector (http://prospector.ucsf.edu/) uses a variant of the MOWSE scoring scheme incorporating new features, including constraints on the minimum number of peptides to be matched for a possible hit, the number of missed cleavages, and the target protein’s molecular weight range. Mascot (http://www.matrixscience.com/) is an extension of the MOWSE algorithm. It incorporates the same scoring scheme with the addition of a probability-based score. A limitation of PMF protein identification is that it sometimes cannot identify proteins because multiple proteins in the database can fit the PMF spectra.

Tandem mass spectrometry. MS/MS further breaks each digested peptide into smaller fragments, whose spectra provide effective signatures of individual amino acids in the peptide for protein identification. Many tools have been developed for MS/MS-based peptide/protein identification, the most popular ones being SEQUEST (http://fields.scripps.edu/sequest/) and Mascot (http://www.matrixscience.com/). Both rely on the comparison between theoretical peptides derived from the database and experimental mass spectrometric tandem spectra. SEQUEST, one of the earliest tools developed for this, produces a list of possible peptide/protein assignments in a protein mixture based on a correlation scoring scheme (145). Mascot, together with its PMF protein identification capacity, uses an algorithm similar to that of SEQUEST for MS/MS peptide/protein identification. The limitations of these programs are that a significant portion of MS/MS spectra cannot be assigned due to various factors, including sequencing and annotation errors in the search database. In addition, post-translational modifications are currently not handled well using computational approaches.

The de novo sequencing approach based on MS/MS spectra is an active research area (30). Typically the algorithms match the separations of peaks by the mass of one or several amino acids and infer the probable peptide sequences that are consistent with the matched amino acids (25). There are a few popular software packages for peptide de novo sequencing using MS/MS data, including Lutefisk (http://www.hairyfatguy.com/lutefisk/) and PEAKS (http://www.bioinformaticssolutions.com/products/peaks). One limitation of current de novo methods is that they often cannot provide the exact sequence of a peptide. Instead, several top candidate sequences are suggested.
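The peptide mass fingerprinting comparison described in this section can be illustrated with a minimal Python sketch. It is not the MOWSE, MS-Fit, or Mascot scoring scheme: it simply counts how many observed peak masses fall within a tolerance of an in silico tryptic peptide, and ranks database proteins by that count. The function names, the toy sequences, the trypsin cleavage rule without missed cleavages, and the 0.5 Da tolerance are all assumptions made for illustration.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide accounts for the free termini


def tryptic_digest(sequence):
    """Cut after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed_masses, sequence, tol=0.5):
    """Number of observed masses within tol Da of any theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(obs - theo) <= tol for theo in theoretical)
        for obs in observed_masses
    )


def rank_candidates(observed_masses, database, tol=0.5):
    """Rank database proteins by matched-peak count (illustrative score only)."""
    scores = {
        name: count_matches(observed_masses, seq, tol)
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy database; the "observed" masses are simulated from protA itself.
    database = {"protA": "MKAAGLRSSK", "protB": "GGGGKPPPR"}
    observed = [peptide_mass(p) for p in tryptic_digest(database["protA"])]
    print(rank_candidates(observed, database))
```

A raw match count favors long proteins, which yield more peptides. This is one reason the tools cited above use frequency-based or probability-based scores rather than simple counting.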

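The de novo idea described above, matching the separation between peaks to amino acid masses, can be sketched in a few lines of Python. This is a toy under strong assumptions: the input is a complete, noise-free ladder from a single ion series, only single-residue gaps are considered, and the function names and tolerances are hypothetical. It is not the algorithm used by Lutefisk or PEAKS.

```python
from itertools import product

# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_for_gap(gap, tol=0.02):
    """All residues whose mass lies within tol Da of a peak separation."""
    return [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - gap) <= tol]


def candidate_sequences(ladder, tol=0.02):
    """Enumerate sequences consistent with an ascending mass ladder.

    Returns an empty list if any gap matches no single residue.
    """
    if len(ladder) < 2:
        return []
    options = []
    for low, high in zip(ladder, ladder[1:]):
        matches = residues_for_gap(high - low, tol)
        if not matches:
            return []
        options.append(matches)
    # Every combination of per-gap matches is an equally valid candidate.
    return ["".join(combo) for combo in product(*options)]


if __name__ == "__main__":
    # Simulated ladder for the peptide GLA, starting from an offset of 0.
    ladder = [0.0]
    for aa in "GLA":
        ladder.append(ladder[-1] + RESIDUE_MASS[aa])
    print(candidate_sequences(ladder))
```

Because leucine and isoleucine have identical masses, and lysine and glutamine differ by only about 0.036 Da, even this idealized ladder yields more than one candidate. This is consistent with the limitation noted above that de novo methods often suggest several top sequences rather than a single exact one.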
METABOLOMICS AND METABOLIC FLUX

Metabolomics is the analysis of the complete pool of small metabolites in a cell at any given time. Metabolomics may prove to be particularly important in plants due to the proliferation of secondary metabolites. As of 2004, more than 100,000 metabolites have been identified in plants, with estimates that this may be less than 10% of the total (133). In a metabolite profiling experiment, metabolites are extracted from tissues, separated, and analyzed in a high-throughput manner (44). Metabolic fingerprinting looks at a few metabolites to help differentiate samples according to their phenotype or biological relevance (58, 115). Technology has now advanced to semiautomatically quantify >1000 compounds from a single leaf extract (138).

The key challenge in metabolite profiling is the rapid, consistent, and unambiguous identification of metabolites from complex plant samples (110). Identification is routinely performed by time-consuming standard addition experiments using commercially available or purified metabolite preparations. A publicly accessible database that contains the evidence and underlying metabolite identification for gas chromatography-mass spectrometry (GC–MS) profiles from diverse biological sources is needed. Standards for experimental metadata and data quality in metabolomics experiments are still in a very early stage and a large-scale public repository is not yet available. The ArMet (architecture for metabolomics) proposal (61) gives a description of plant metabolomics experiments and their results along with a database schema. MIAMET (Minimum Information About a Metabolomics Experiment) (13) gives reporting requirements with the aim of standardizing experiment descriptions, particularly within publications. The Standard Metabolic Reporting Structures (SMRS) working group (119) has developed standards for describing the biological sample origin, analytical technologies, and methods used in a metabolite profiling experiment.

Metabolite data have been used to construct metabolic correlation networks (121). Such correlations may reflect the net partitioning of carbon and nitrogen resulting from direct enzymatic conversions and indirect cellular regulation by transcriptional or biochemical processes. However, metabolic correlation matrices cannot be used to infer that a change in one metabolite led to a change in another metabolite in a metabolic reaction network (122).

Metabolic flux analysis measures the steady-state flow between metabolites. Fluxes, however, are even more difficult to measure than metabolite levels due to complications in modeling intracellular transport of metabolites and the incomplete knowledge about the topology and location of the pathways in vivo (115). The most basic approach to metabolic flux analysis is stoichiometric analysis that calculates the quantities of reactants and products of a chemical reaction to determine the flux of each metabolite (39). However, this method is numerically difficult to solve for large networks and it has problems if parallel metabolic pathways, metabolic cycles, and reversible reactions are present (140). FluxAnalyzer is a package for MATLAB that integrates pathway and flux analysis for metabolic networks (75).

Flux analysis using 13C carbon labeling data seeks to overcome some of the disadvantages of stoichiometric flux analysis described above (120). More rigorous analysis is needed for full determination of fluxes from all of the experimental data in 13C constrained flux analysis (stoichiometric model with a few flux ratios as constraints) and the stoichiometric and isotopomer balances. Iterative methods have been used to solve the resulting matrix of isotopomer balances, with the nuclear magnetic resonance or gas chromatography measurements used to provide consistency. As more reliable data are collected, one can use ordinary differential equations for dynamic simulations of metabolic networks

and combine information about connectivity, concentration balances, flux balances, metabolic control, and pathway optimization. Ultimately, one may integrate all of the information and perform analysis and simulation in a cellular modeling environment like E-Cell (http://www.e-cell.org/) or CellDesigner (http://www.systems-biology.org).

ONTOLOGIES

The data that are generated and analyzed as described in the previous sections need to be compared with the existing knowledge in the field in order to place the data in a biologically meaningful context and derive hypotheses. To do this efficiently, data and knowledge need to be described in explicit and unambiguous ways that must be comprehensible to both humans and computer programs. An ontology is a set of vocabulary terms whose meanings and relations with other terms are explicitly stated and which are used to annotate data (5, 10, 14, 124). This section introduces the types of ontologies in development and use today and some applications and caveats of using the ontologies in biology.

Types of Bio-Ontologies

A growing number of shared ontologies are being built and used in biology. Examples include ontologies for describing gene and protein function (59), cell types (9), anatomies and developmental stages of organisms (50, 135, 144), microarray experiments (126), and metabolic pathways (84, 151). A list of open-source ontologies used in biology can be found on the Open Biological Ontologies Web site (http://obo.sourceforge.net/). Many ontologies on this site are under development and are subject to frequent change. The Gene Ontology (GO) (www.geneontology.org) is an example of a bio-ontology that has garnered community acceptance. It is a set of more than 16,000 controlled vocabulary terms for the biological domains of “molecular function,” “subcellular compartment,” and “biological process.” GO is organized as a directed acyclic graph, which is a type of hierarchy that allows a term to exist as a specific concept belonging to more than one general term. Other examples of ontologies currently in development are the Sequence Ontology (SO) project (40) and the Plant Ontology (PO) project (www.plantontology.org). The SO project aims to explicitly define all the terms needed to describe features on a nucleotide sequence, which can be used for genome sequence annotation for any organism. The PO project aims to develop shared vocabularies to describe anatomical structures for flowering plants to depict gene expression patterns and plant phenotypes.

A few challenges in the development and use of ontologies remain to be addressed, including redundancies in the ontologies, minimal or absent formal, computer-comprehensible definitions of the terms in the ontologies, and general acceptance by the research and publishing community (10, 14). There is an opportunity for an international repository of ontology standards that could oversee the development and maintenance of the ontologies.

Applications of Ontologies

Ontologies are used mainly to annotate data such as sequences, gene expression clusters, experiments, and strains. Ontologies that have such annotations to data in databases can be used in numerous ways, including connecting different databases, refining searching, providing a framework for interpreting the results of functional genomics experiments, and inferring knowledge (8, 10, 47). For example, one can ask which functions and processes are statistically significantly over-represented in an expression cluster of interest compared to the functions and processes carried out by all of the genes from a gene expression array. Because GO is one of the more well-established ontologies, this section focuses on GO to illustrate applications

of ontologies in biology. Ontologies have been used by many model organism databases to annotate genes and gene products (http://www.geneontology.org/GO.current.annotations.shtml, http://www.geneontology.org/GO.biblio.shtml#annots). Function annotations of genes using GO have been used mainly in two ways: predicting protein functions, processes, and localization patterns from various data sources (http://www.geneontology.org/GO.biblio.shtml#predictions) and providing a biological framework or benchmark set for interpreting results of large-scale probing of samples such as gene expression profiles and protein-protein interactions (http://www.geneontology.org/GO.biblio.shtml#geneexp). In addition, GO annotations have been used to test the robustness of semantic similarity searching methods (83) and to study adaptive evolution (4).

There are several issues in using GO annotations to predict function and as a benchmark for large-scale data. One is the misuse or lack of use of evidence codes, which provide the type of evidence that was used to make the annotation (http://www.geneontology.org/GO.evidence.shtml). Only about half of the evidence codes refer to direct experimental evidence. Also, several evidence codes are used for indirect evidence, which indicates less certainty in the assertion of the annotation than annotations made with direct experimental evidence. Other codes are used for computationally derived annotations, which have no experimental support and a higher probability of being incorrect. Researchers and computer programs that use the annotations for inferring knowledge or analyzing functional genomics data should be familiar with these evidence codes in order to minimize misinterpretation of the data. For example, methods that assess the relationship between sequence conservation and coexpression of genes and use GO annotations to validate their results should ensure that no annotations using the ISS and IEA evidence codes are used, to avoid circular arguments. Similarly, studies that attempt to define biological processes and functions from gene expression data using the GO annotations should ensure that no annotation with the inferred from expression pattern (IEP) evidence code is used. The other caveat is that annotations to GO are not equivalently represented throughout GO. When looking for statistical over-representation of GO terms in genes of an expression cluster, there is low statistical power for detecting deviations from expectation for terms that are annotated with a small number of genes (74).

Software for Accessing and Analyzing Ontologies and Annotations

There are a number of software tools for visualizing, editing, and analyzing ontologies and their annotations. The GO Web site maintains a comprehensive list of these tools (http://www.geneontology.org/GO.tools.shtml). Some of them are accessible via Web browsers and others have to be installed locally. Tools are also needed to facilitate data integrity checks and more flexible and customizable searching and browsing capabilities to explore these complex networks of concepts. Most of the tools that facilitate analysis of the GO annotations are developed to help interpret gene expression studies. These applications allow researchers to compare a list of genes (for example, from an expression cluster) and identify over-represented GO terms in this list as compared to the whole genome or whole list of genes under study. Most of these software programs use statistical models to provide significance in the over-representation. Recently, Khatri and colleagues reported comparisons of 14 of these tools on their functionalities, advantages, and limitations (74). Finally, most of the bio-ontologies are informal in their semantic representation. Definitions of the terms are provided in natural language, which is fine for human comprehension but does not easily allow computers and software to be developed that can help check for ontology integrity and

provide more semantically powerful search functions. More tools are needed that can facilitate the conversion of bio-ontologies to be more formal and computer-comprehensible.

DATABASES

Traditionally, biologists relied on textbooks and research articles published in scientific journals as the main source of information. This has changed dramatically in the past decade as the Internet and Web browsers became commonplace. Today, the Internet is the first place researchers go to find information. Databases that are available via the Web also became an indispensable tool for biological research. In this section, we describe types and examples of biological databases, how these databases are built and accessed, how data among databases are exchanged, and current challenges and opportunities in biological database development and maintenance.

Types of Biological Databases

Three types of biological databases have been established and continue to be developed: large-scale public repositories, community-specific databases, and project-specific databases. Nucleic Acids Research (http://nar.oxfordjournals.org/) publishes a database issue in January of every year. Recently, Plant Physiology started publishing articles describing databases (105). Large-scale public repositories are usually developed and maintained by government agencies or international consortia and are places for long-term data storage. Examples include GenBank for sequences (139), UniProt (113) for protein information, Protein Data Bank (32) for protein structure information, and ArrayExpress (100) and Gene Expression Omnibus (GEO) (38) for microarray data. There are a number of community-specific databases, which typically contain information curated with high standards and address the needs of a particular community of researchers. Prominent examples of community-specific databases are those that cater to researchers focused on studying model organisms (77, 104, 144) or clade-oriented comparative databases (53, 88, 92, 137). Other examples of community-specific databases include databases focused on specific types of data such as metabolism (151) and protein modification (129). The concept of community-specific databases is subject to change as researchers are widening their scope of research. For example, databases focused on comparing genome sequences recently emerged (e.g., http://www.phytome.org and Reference 64). The third category of databases includes smaller-scale, and often short-lived, databases that are developed for project data management during the funding period. Often these databases and Web resources are not maintained beyond the funding period of the project and currently there is no standard way of depositing or archiving these databases after the funding period.

There are some issues in database management. First, there is a general lack of good documentation on the rationale of the design and implementation. More effort is needed to share the experiences via conferences and publications. Also, there are no accepted standards in making databases, schema, software, and standard operating procedures available. In response to this, the National Human Genome Research Institute (NHGRI) has funded a collaborative project called the Generic Model Organism Database (http://www.gmod.org) to promote the development and sharing of software, schemas, and standard operating procedures. The project’s major aim is to build a generic organism database toolkit to allow researchers to set up a genome database “off the shelf.” Another major issue is that there is a general lack of infrastructure for supporting, managing, and using digital data archived in databases and Web sites in the long term (82). One possibility to alleviate this problem is to create a public archive of biological databases and Web sites to which finished projects

could deposit the database, software, and Web sites. There are several projects that are building digital repository systems that can be models for such a repository such as D-Space (http://dspace.org/) and the CalTech Collection of Open Digital Archives (CODA; http://library.caltech.edu/digital/). Some additional challenges in long-term archiving of data were articulated in a recent National Science Board report (http://www.nsf.gov/nsb/documents/2005/LLDDC report.pdf).

Data Representation and Storage

Databases can be developed using a number of different methods including simple file directories, object-oriented database software, and relational database software. Due to the increasing quantity of data that need to be stored and made accessible using the Internet, relational database management software has become popular and has become the de facto standard in biology. Relational databases provide effective means of storing and retrieving large quantities of data via indexes, normalization, referential integrity, triggers, and transactions. Notable relational database software that is freely available and quite popular in bioinformatics is MySQL (http://www.mysql.com/) and PostgreSQL (http://www.postgresql.org/). In relational databases, data are represented as entities, attributes (properties of the entities), and relationships between the entities. This type of representation is called Entity-Relationship (ER) and database schemas are described using ER diagrams (e.g., TAIR schema at http://arabidopsis.org/search/schemas.html). Entities and attributes become tables and columns in the physical implementation of the database, respectively. Data are the values that are stored in the fields of the tables.

Although relational databases are powerful ways of storing large quantities of data, they have limitations. For example, it is not trivial to represent complex relationships between data such as signal transduction pathways. Also, it is difficult to create rich semantic relationships in relational databases to ask the database “what if” types of queries without having extensive software built on top of the database. Another limitation of relational databases is that it is very difficult, if not impossible, to preserve all of the changes that occur to attributes of entities.

Data Access and Exchange

The most direct, powerful, and flexible way of accessing data in a database is using structured query language (SQL) (http://databases.about.com/od/sql/). SQL has a reasonably intuitive and simple syntax that requires no programming knowledge and is suited for biologists to learn without a steep learning curve. However, to use SQL, users need to know the database schema. In addition, some queries that are based on less optimized database structure could result in slow performance and can even sometimes lock the database system. In most databases, access to the data is provided via database access software and graphical user interface (GUI) that allow searching and browsing of the data. In addition to text-based search user interfaces, more sophisticated ways of accessing data such as graphical displays and tree-based browsers are also common.

Although accessing information from a database is fairly easy if one knows which database to go to, it is not as easy to find information if one does not know which database to search. There are several ways to solve this problem such as indexing the content of database-driven pages, developing software that will connect to individual databases directly, or developing a data warehouse of many different data types or database in one site. A relatively new method that is gaining some attention is to use a registry system where different databases that specialize on particular information can declare what data are available in their system and register methods to access their data. Users can send requests to

the registry system, which then contacts the appropriate databases to retrieve the requested data. Conceptually, this is an elegant way of integrating different databases without depending on the individual databases’ schema. However, this relies on the willingness of individual databases to participate in the registry system. This method is called Web services and has been accepted widely by the Internet industry but has not yet been commonly implemented. Projects like BioMOBY (141) and myGRID (125) are implementing this idea for biological databases, but they have not yet been widely used.

Semantics (meaning) and syntax (format) of data need to be made explicit in order to exchange data for analysis and mining. A simple way of formatting data is using a tag and value system (called markup language). An emerging standard for exchanging data and information via the Web is Extensible Markup Language (XML), which allows information providers to define new tag and attribute names at will and to nest document structures to any level of complexity, among other features. The document that defines the tags and their permitted structure for an XML document is called a Document Type Definition (DTD). The use of a common DTD allows different users and applications to exchange data in XML. Although many databases and bioinformatics projects present their data in XML, currently almost every group has its own DTD. Standardization and common use of DTDs for exchanging common data types will be pivotal. There are notable exceptions to this rule including the specification of microarray data, MAGEML (Microarray Gene Expression Markup Language), provided by the Microarray Gene Expression Database Society (MGED) (http://www.mged.org/). To a lesser extent, BIOPAX (http://www.biopax.org/) is also becoming a community-accepted standard to describe pathways and reactions. Other than DTDs, biological database communities do not yet have a standard system in software engineering to communicate with each other.

Data Curation

Data curation is defined as any activity devoted to selecting, organizing, assessing quality, describing, and updating data that results in enhanced quality, trustworthiness, interpretability, and longevity of the data. It is a crucial task in today’s research environment where data are being generated at an ever-increasing rate and an increasing amount of research is based on re-use of data. In general, some level of curation is done by data generators, but most curation activities are carried out in data repositories. A number of different strategies for curation are used, including computational, manual, in-house, and those that involve external expertise. Assessing data quality involves both determining the criteria for measuring quality and performing the measurements. Data quality criteria for raw data are tied with methods of data acquisition. In many databases, these criteria are not made explicit and the information on the metrics of data-quality assessment is rare.

Curation of data into public repositories should be a parallel and integrated process with publication in peer-reviewed journals. Although much progress has been made in electronic publication and open-access publishing, there is still a gap between the major conclusions in papers and the data that were used to draw the conclusions. In a few cases, data are required to be submitted to public repositories (e.g., sequence data to GenBank, microarray data to ArrayExpress/GEO, and Arabidopsis stock data to ABRC). However, there are no such standards established for other data types (e.g., proteomics data, metabolomics data, protein localization, in situ hybridization, phenotype description, protein function information). Standards, specifications, and requirements for publication of data into repositories should be made more accessible to researchers early on in their data-generation and research-activity processes.

One of the most important aspects of today’s changing research landscape is

the culture of data and expertise sharing. The now famous Bermuda principle (http://www.gene.ucl.ac.uk/hugo/bermuda.htm) was extended to large-scale data at a recent meeting (131). In this meeting, the policy for publicly releasing large-scale data prepublication and appropriate conduct and acknowledgment of the uses of these data by the scientific community were discussed. Clearly articulated and community-accepted policies are needed on how data from data repositories should be cited and referenced and how the generators of the data should be acknowledged. Establishing this standard should include journal publishers, database scientists, data generators, funding bodies, and representatives of the user community. Additional challenges and opportunities in database curation were recently articulated (82, 103).

EMERGING AREAS IN BIOINFORMATICS

In addition to some of the challenges and opportunities mentioned in this review, there are many exciting areas of research in bioinformatics that are emerging. In this section, we focus on a few of these areas such as text mining, systems biology, and the semantic web. Some additional emerging areas such as image analysis (117), grid computing (46, 49), directed evolution (29), rational protein design (81), microRNA-related bioinformatics (21), and modeling in epigenomics (43) are not covered due to the limitation of space.

Text Mining

The size of the biological literature is expanding at an increasing rate. The Medline 2004 database had 12.5 million entries and is expanding at a rate of 500,000 new citations each year (26). The goal of text mining is to allow researchers to identify needed information and shift the burden of searching from researchers to the computer. Without automated text mining, much of biomolecular interactions and biological research archived in the literature will remain accessible in principle but underutilized in practice. One key area of text mining is relationship extraction that finds relationships between entities such as genes and proteins. Examples include MedMiner at the National Library of Medicine (128), PreBIND (35), the curated BIND system (2, 6), PathBinderH (155), and iHOP (63). (See Reference 26 for a complete survey of text mining applications.) Results on real-world tasks such as the automatic extraction and assignment of GO annotations are promising, but they are still far from reaching the required performance demanded by real-world applications (15). One key difficulty that needs to be addressed in this field is the complex nature of the names and terminology such as the large range of variants for protein names and GO terms in free text. The current generation of systems is beginning to combine statistical methods with machine learning to capture expert knowledge on how genes and proteins are referred to in scientific papers to create usable systems with high precision and recall for specialized tasks in the near future.

Computational Systems Biology

Classical systems analysis in engineering treats a system as a black box whose inner structure and behavior can be analyzed and modeled by varying internal or external conditions, and studying the effect of the variation on the external observables. The result is an understanding of the inner makeup and working mechanisms of the system (72). Systems biology is the application of this theory to biology. The observables are measurements of what the organism is doing, ranging from phenotypic descriptions to detailed metabolic profiling. A critical issue is how to effectively integrate various types of data, such as sequence, gene expression, protein interactions, and phenotypes to infer biological knowledge. Some areas that require more work include creating coherent validated data sets, developing

common formats for pathway data [SBML (65) and BioPAX (http://www.biopax.org)], and creating ontologies to define complex interactions, curation, and linkages with text-mining tools. The Systems Biology Workbench project (http://sbw.kgi.edu/) aims to develop an open-source software framework for sharing information between different types of pathway models. Other issues are that biological systems are underdefined (not enough measurements are available to characterize the system) and samples are not taken often enough to capture time changes in a system that may occur at vastly different time scales in different networks such as signaling and regulatory networks (98). The long-term goal to create a complete in silico model of a cell is still distant; however, the tools that are being developed to integrate information from a wide variety of sources will be valuable in the short term.

Semantic Web

Semantic web is a model to “create a universal mechanism for information exchange by giving meaning, in a machine-interpretable way, to the content of documents and data on the Web” (95). This model will enable the development of searching tools that know what type of information can be obtained from which documents and understand how the information in each document relates to another, which will allow software agents to use reasoning and logic to make decisions automatically based on the constraints provided in the query (e.g., automatic travel agents, phenotype prediction) (12). Bioinformatics could benefit enormously from successful implementation of this model and should play a leading role in realizing it (95). Current efforts to realize the concepts of the semantic web have been focused on developing standards and specifications for identifying and describing data such as Uniform Resource Identifier (URI) and Resource Description Framework (RDF), respectively (http://www.w3c.org/2001/sw). Although implementation of applications using the semantic web is scarce at this point, there are some useful examples being developed such as Haystack (a browser that retrieves data from multiple databases and allows users to annotate and manage the information to reflect their understanding) (http://www-db.cs.wisc.edu/cidr/cidr2005/papers/P02.pdf) and BioDash (a drug development user interface that associates diseases, drug progression stages, molecular biology, and pathway knowledge for users) (http://www.w3.org/2005/04/swls/BioDash/Demo/).

Cellular Localization and Spatially Resolved Data

Research in nanotechnology and electron microscopy is allowing researchers to select specific areas of cells and tissues and to image spatiotemporal distributions of signaling receptors, gene expression, and proteins. Laser capture microdissection allows the selection of specific tissue types for detailed analysis (42). This technique has been applied to specific plant tissues in maize and Arabidopsis (73, 94). Confocal imaging is being used to model auxin transport and gene expression patterns in Arabidopsis (60). Methods in electron microscopy are being applied to image the spatiotemporal distribution of signaling receptors (149). Improved methods in laser scanning microscopes may allow measurements of fast diffusion and dynamic processes in the microsecond-to-millisecond time range in live cells (34). These emerging capabilities will lead to new understanding of cell dynamics.

CONCLUSION

In this review, we attempt to highlight some of the recent advances made in bioinformatics in the basic areas of sequence, gene expression, protein, and metabolite analyses, databases, and ontologies, current limitations in these areas, and some emerging areas. A number of unsolved problems exist in bioinformatics

ANRV274-PP57-13 ARI 5 April 2006 19:12
today, including data and database integration, automated knowledge extraction, robust
inference of phenotype from genotype, and training and retraining of students and
established researchers in bioinformatics. Bioinformatics is an approach that will be
an essential part of plant research, and we hope that every plant researcher will
incorporate more bioinformatics tools and approaches in their research projects.

If the next 50 years of plant biology can be summed up in one word, it would be
“integration.” We will see integration of basic research with applied research in
which plant biotechnology will play an essential role in solving urgent problems in
our society such as developing renewable energy, reducing world hunger and poverty,
and preserving the environment. We will see integration of disparate, specialized
areas of plant research into more comparative, connected, holistic views and
approaches in plant biology. We will also see more integration of plant research and
other biological research, from microbes to human, from a large-scale comparative
genomics perspective. Bioinformatics will provide the glue with which all of these
types of integration will occur. However, it will be people, not tools, who will
enable the gluing. Ways in which biological research will be conducted in 2050 will
be much different from the way in which it was done in 2000. Each researcher will
spend more time on the computer and the Internet to generate and describe data and
experiments, to analyze the data and find other people’s data relevant for
comparison, to find existing knowledge in the field and relate his or her results to
the current body of knowledge, and to publish his or her results to the world.
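
As a concrete illustration of the Semantic Web standards discussed earlier (URIs for identifying data and RDF for describing it), the following minimal Python sketch shows the underlying idea: each statement is a subject-predicate-object triple whose parts are named by URIs, and a software agent answers a query by matching patterns against those triples. The URIs, the predicate names, and the helper functions `match` and `genes_with_product_in` are hypothetical and chosen only for illustration; a real application would more likely rely on an RDF library and a published ontology than on this hand-rolled store.

```python
# Toy triple store: every fact is a (subject, predicate, object) triple.
# All URIs below are hypothetical placeholders, not real identifiers.
EX = "http://example.org/plant/"

TRIPLES = {
    (EX + "geneA", EX + "encodes", EX + "proteinA"),
    (EX + "proteinA", EX + "locatedIn", EX + "chloroplast"),
    (EX + "geneB", EX + "encodes", EX + "proteinB"),
    (EX + "proteinB", EX + "locatedIn", EX + "nucleus"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def genes_with_product_in(triples, compartment):
    """Chain two patterns: gene -encodes-> protein -locatedIn-> compartment."""
    proteins = {
        s for s, _, _ in match(triples, predicate=EX + "locatedIn", obj=compartment)
    }
    return sorted(
        s for s, _, o in match(triples, predicate=EX + "encodes") if o in proteins
    )


if __name__ == "__main__":
    # Which genes encode a protein found in the chloroplast?
    print(genes_with_product_in(TRIPLES, EX + "chloroplast"))
```

Because every term is a URI, triples published by independent databases could in principle be pooled into one set and queried together, which is the kind of integration the Semantic Web model is meant to support.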
ACKNOWLEDGMENTS
We are grateful to Blake Meyers, Dan MacLean, Shijun Li, Scott Peck, Mark Lange, Bill
Beavis, Todd Vision, Stefanie Hartmann, Gary Stacey, Chris Town, Volker Brendel, and Nevin
Young for their critical comments on the manuscript. This work has been supported in part
by NSF grants DBI-9978564, DBI-0417062, DBI-0321666 (SYR); ITR-IIS-0407204 (DX);
DBI-0209809 (JD); USDA grants NRI-2002-35300-12619 (JD) and CSREES 2004-25604-14708
(DX); NIH grants NHGRI-HG002273, R01-GM65466 (SYR); National Center for
Soybean Biotechnology (DX); Pioneer-Hi-Bred (SYR); and Carnegie Canada (SYR).
LITERATURE CITED
1. Aebersold R, Mann M. 2003. Mass spectrometry-based proteomics. Nature 422:198–207
2. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, et al. 2005. The Biomolecular
Interaction Network Database and related tools 2005 update. Nucleic Acids Res. 33:D418–
24
3. Allen JE, Pertea M, Salzberg SL. 2004. Computational gene prediction using multiple
sources of evidence. Genome Res. 14:142–48
4. Aris-Brosou S. 2005. Determinants of adaptive evolution at the molecular level: the
extended complexity hypothesis. Mol. Biol. Evol. 22:200–9
5. Ashburner M, Ball C, Blake J, Botstein D, Butler H, et al. 2000. Gene ontology: tool for
the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29
6. Bader G, Betel D, Hogue C. 2002. BIND: the Biomolecular Interaction Network
Database. Nucleic Acids Res. 31:248–50
7. Bammler T, Beyer RP, Bhattacharya S, Boorman GA, Boyles A, et al. 2005. Standardizing
global gene expression analysis between laboratories and across platforms. Nat. Methods
2:351–56

8. Bard J. 2003. Ontologies: formalising biological knowledge for bioinformatics. Bioessays
25:501–6
9. Bard J, Rhee SY, Ashburner M. 2005. An ontology for cell types. Genome Biol. 6:R21
10. Bard JB, Rhee SY. 2004. Ontologies in biology: design, applications and future challenges.
Nat. Rev. Genet. 5:213–22
11. Bedell JA, Budiman MA, Nunberg A, Citek RW, Robbins D, et al. 2005. Sorghum
genome sequencing by methylation filtration. PLoS Biol. 3:e13
12. Berners-Lee T, Hendler J, Lassila O. 2001. The Semantic Web. Sci. Am. 284:34–43
13. Bino R, Hall R, Fiehn O, Kopka J, Saito K, et al. 2004. Potential of metabolomics as a
functional genomics tool. Trends Plant Sci. 9:418–25
14. Blake J. 2004. Bio-ontologies-fast and furious. Nat. Biotechnol. 22:773–74
15. Blaschke C, Krallinger M, Leon E, Valencia A. 2005. Evaluation of BioCreAtIvE assess-
ment of task 2. BMC Bioinformatics 6:S16
16. Blazej RG, Paegel BM, Mathies RA. 2003. Polymorphism ratio sequencing: a new
approach for single nucleotide polymorphism discovery and genotyping. Genome Res.
13:287–93
17. Blueggel M, Chamrad D, Meyer HE. 2004. Bioinformatics in proteomics. Curr. Pharm.
Biotechnol. 5:79–88
18. Boguski MS, Schuler GD. 1995. ESTablishing a human transcript map. Nat. Genet.
10:369–71
19. Brendel V, Zhu W. 2002. Computational modeling of gene structure in Arabidopsis
thaliana. Plant Mol. Biol. 48:49–58
20. Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, et al. 2000. Gene expression
analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat.
Biotechnol. 18:630–34
21. Brown JR, Sanseau P. 2005. A computational view of microRNAs and their targets. Drug
Discov. Today 10:595–601
22. Brown P, Botstein D. 1999. Exploring the new world of the genome with DNA microar-
rays. Nat. Genet. 21:33–37
23. Buck MJ, Lieb JD. 2004. ChIP-chip: considerations for the design, analysis, and applica-
tion of genome-wide chromatin immunoprecipitation experiments. Genomics 83:349–60
24. Canovas FM, Dumas-Gaudot E, Recorbet G, Jorrin J, Mock HP, Rossignol M. 2004.
Plant proteome analysis. Proteomics 4:285–98
25. Chen T, Kao MY, Tepel M, Rush J, Church GM. 2001. A dynamic programming ap-
proach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol.
8:325–37
26. Cohen AM, Hersh WR. 2005. A survey of current work in biomedical text mining. Brief
Bioinform. 6:57–71
27. Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP. 2004. A benchmark for Affymetrix
GeneChip expression measures. Bioinformatics 20:323–31
28. Coughlan SJ, Agrawal V, Meyers B. 2004. A comparison of global gene expression mea-
surement technologies in Arabidopsis thaliana. Comp. Funct. Genomics 5:245–52
29. Dalby PA. 2003. Optimising enzyme function by directed evolution. Curr. Opin. Struct.
Biol. 13:500–5
30. Dancik V, Addona TA, Clauser KR, Vath JE, Pevzner PA. 1999. De novo peptide se-
quencing via tandem mass spectrometry. J. Comput. Biol. 6:327–42
31. Densmore LD 3rd. 2001. Phylogenetic inference and parsimony analysis. Methods Mol.
Biol. 176:23–36

32. Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino W, et al.
2005. The RCSB Protein Data Bank: a redesigned query system and relational database
based on the mmCIF schema. Nucleic Acids Res. 33:D233–37
33. Di X, Matsuzaki H, Webster TA, Hubbell E, Liu G, et al. 2005. Dynamic model based al-
gorithms for screening and genotyping over 100 K SNPs on oligonucleotide microarrays.
Bioinformatics 21:1958–63
34. Digman MA, Brown CM, Sengupta P, Wiseman PW, Horwitz AR, Gratton E. 2005.
Measuring fast dynamics in solutions and cells with a laser scanning microscope. Biophys.
J. 89:1317–27
35. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, et al. 2003. PreBIND and
Textomy—mining the biomedical literature for protein-protein interactions using a sup-
port vector machine. BMC Bioinformatics 4:11
36. Doolittle WF. 1999. Phylogenetic classification and the universal tree. Science 284:2124–
29
37. Draghici S. 2003. Data Analysis Tools for DNA Microarrays. London: Chapman and Hall
38. Edgar R, Domrachev M, Lash AE. 2002. Gene Expression Omnibus: NCBI gene expression
and hybridization array data repository. Nucleic Acids Res. 30:207–10
39. Edwards JS, Palsson BO. 2000. The Escherichia coli MG1655 in silico metabolic geno-
type: its definition, characteristics, and capabilities. Proc. Natl. Acad. Sci. USA 97:5528–33
40. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, et al. 2005. The Sequence Ontol-
ogy: a tool for the unification of genome annotations. Genome Biol. 6:R44
41. Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of
genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863–68
42. Emmert-Buck MR, Bonner RF, Smith PD, Chuaqui RF, Zhuang Z, et al. 1996. Laser
capture microdissection. Science 274:998–1001
43. Fazzari MJ, Greally JM. 2004. Epigenomics: beyond CpG islands. Nat. Rev. Genet. 5:446–
55
44. Fiehn O. 2002. Metabolomics—the link between genotypes and phenotypes. Plant Mol.
Biol. 48:155–71
45. Foissac S, Bardou P, Moisan A, Cros MJ, Schiex T. 2003. EUGENE’HOM: a generic
similarity-based gene finder using multiple homologous sequences. Nucleic Acids Res.
31:3742–45
46. Foster I. 2002. What is the Grid? A three point checklist. In GRIDToday, pp. 4. Chicago:
Argonne National Lab & University of Chicago
47. Fraser AG, Marcotte EM. 2004. A probabilistic view of gene function. Nat. Genet. 36:559–
64
48. Frazer KA, Chen X, Hinds DA, Pant PV, Patil N, Cox DR. 2003. Genomic DNA inser-
tions and deletions occur frequently between humans and nonhuman primates. Genome
Res. 13:341–46
49. Gannon D, Alameda J, Chipara O, Christie M, Duke V, et al. 2005. Building grid portal
applications from a Web service component architecture. Proc. IEEE 93:551–63
50. Garcia-Hernandez M, Berardini TZ, Chen G, Crist D, Doyle A, et al. 2002. TAIR: a
resource for integrated Arabidopsis data. Funct. Integr. Genomics 2:239–53
51. Gibbs RA, Weinstock GM. 2003. Evolving methods for the assembly of large genomes.
Cold Spring Harb. Symp. Quant. Biol. 68:189–94
52. Goff SA, Ricke D, Lan TH, Presting G, Wang R, et al. 2002. A draft sequence of the rice
genome (Oryza sativa L. ssp. japonica). Science 296:92–100

53. Gonzales MD, Archuleta E, Farmer A, Gajendran K, Grant D, et al. 2005. The Legume
Information System (LIS): an integrated information resource for comparative legume
biology. Nucleic Acids Res. 33:D660–65
54. Gorg A, Obermaier C, Boguth G, Harder A, Scheibe B, et al. 2000. The current state of
two-dimensional electrophoresis with immobilized pH gradients. Electrophoresis 21:1037–
53
55. Gras R, Muller M. 2001. Computational aspects of protein identification by mass spec-
trometry. Curr. Opin. Mol. Ther. 3:526–32
56. Guo H, Moose SP. 2003. Conserved noncoding sequences among cultivated cereal
genomes identify candidate regulatory sequence elements and patterns of promoter evo-
lution. Plant Cell 15:1143–58
57. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, et al. 2003. Improving the
Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic
Acids Res. 31:5654–66

58. Harrigan GG, Goodacre R, eds. 2003. Metabolic Profiling: Its Role in Biomarker Discovery
and Gene Function Analysis. Boston: Plenum
59. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, et al. 2004. The Gene Ontology
(GO) database and informatics resource. Nucleic Acids Res. 32:D258–61

60. Heisler MG, Ohno C, Das P, Sieber P, Reddy GV, et al. 2005. Patterns of auxin transport
and gene expression during primordium development revealed by live imaging of the
Arabidopsis inflorescence meristem. Curr. Biol. 15:1899–911
61. Jenkins H, Hardy N, Beckmann D, Draper J, Smith AR, et al. 2004. A proposed framework
for the description of plant metabolomics experiments and their results. Nat. Biotechnol.
22:1601–6
62. Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, et al. 2005. Whole-genome
patterns of common DNA variation in three human populations. Science 307:1072–79
63. Hoffmann R, Valencia A. 2004. A gene network for navigating the literature. Nat. Genet.
36:664
64. Horan K, Lauricha J, Bailey-Serres J, Raikhel N, Girke T. 2005. Genome cluster database.
A sequence family analysis platform for Arabidopsis and rice. Plant Physiol. 138:47–54
65. Hucka M, Finney A, Bornstein BJ, Keating SM, Shapiro BE, et al. 2004. Evolving a
lingua franca and associated software infrastructure for computational systems biology:
the Systems Biology Markup Language (SBML) Project. Syst. Biol. 1:41–53
66. Hudek AK, Cheung J, Boright AP, Scherer SW. 2003. Genescript: DNA sequence an-
notation pipeline. Bioinformatics 19:1177–78
67. Inada DC, Bashir A, Lee C, Thomas BC, Ko C, et al. 2003. Conserved noncoding
sequences in the grasses. Genome Res. 13:2030–41
68. Jensen LJ, Knudsen S. 2000. Automatic discovery of regulatory patterns in promoter
regions based on whole cell expression data and functional annotation. Bioinformatics
16:326–33
69. Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR. 2004. Pack-MULE transposable ele-
ments mediate gene evolution in plants. Nature 431:569–73
70. Jiao Y, Jia P, Wang X, Su N, Yu S, et al. 2005. A tiling microarray expression analysis of
rice chromosome 4 suggests a chromosome-level regulation of transcription. Plant Cell
17:1641–57
71. Karlin S, Altschul SF. 1990. Methods for assessing the statistical significance of molecular
sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264–68
72. Kell DB, Brown M, Davey HM, Dunn WB, Spasic I, Oliver SG. 2005. Metabolic foot-
printing and systems biology: the medium is the message. Nat. Rev. Microbiol. 3:557–65

73. Kerk NM, Ceserani T, Tausta SL, Sussex IM, Nelson TM. 2003. Laser capture microdis-
section of cells from plant tissues. Plant Physiol. 132:27–35
74. Khatri P, Draghici S. 2005. Ontological analysis of gene expression data: current tools,
limitations, and open problems. Bioinformatics 21:3587–95
75. Klamt S, Stelling J, Ginkel M, Gilles ED. 2003. FluxAnalyzer: exploring structure, path-
ways, and flux distributions in metabolic networks on interactive flux maps. Bioinformatics
19:261–69
76. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. 1993. De-
tecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science
262:208–14
77. Lawrence CJ, Seigfried TE, Brendel V. 2005. The maize genetics and genomics database.
The community resource for access to diverse maize data. Plant Physiol. 138:55–58
78. Lewin B. 2003. Genes VIII. Upper Saddle River, NJ: Prentice Hall
79. Li L, Wang X, Xia M, Stolc V, Su N, et al. 2005. Tiling microarray analysis of rice
chromosome 10 to identify the transcriptome and relate its expression to chromosomal
architecture. Genome Biol. 6:R52
80. Liu X, Noll DM, Lieb JD, Clarke ND. 2005. DIP-chip: rapid and accurate determination
of DNA-binding specificity. Genome Res. 15:421–27

81. Looger LL, Dwyer MA, Smith JJ, Hellinga HW. 2003. Computational design of receptor
and sensor proteins with novel functions. Nature 423:185–90
82. Lord P, Macdonald A. 2003. e-Science Curation Report–Data Curation for e-Science in the
UK: An Audit to Establish Requirements for Future Curation and Provision. Twickenham,
UK: Digital Archiving Consultancy Ltd.
83. Lord PW, Stevens RD, Brass A, Goble CA. 2003. Investigating semantic similarity mea-
sures across the Gene Ontology: the relationship between sequence and annotation.
Bioinformatics 19:1275–83
84. Mao X, Cai T, Olyarchuk JG, Wei L. 2005. Automated genome annotation and pathway
identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics
21:3787–93
85. Marchal K, Thijs G, De Keersmaecker S, Monsieurs P, De Moor B, Vanderleyden J.
2003. Genome-specific higher-order background models to improve motif detection.
Trends Microbiol. 11:61–66
86. Marengo E, Robotti E, Antonucci F, Cecconi D, Campostrini N, Righetti PG. 2005.
Numerical approaches for quantitative analysis of two-dimensional maps: a review of
commercial software and home-made systems. Proteomics 5:654–66
87. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. 2005. Genome sequencing
in microfabricated high-density picolitre reactors. Nature 437:376–80
88. Matthews DE, Carollo VL, Lazo GR, Anderson OD. 2003. GrainGenes, the genome
database for small-grain crops. Nucleic Acids Res. 31:183–86
89. Meyers BC, Galbraith DW, Nelson T, Agrawal V. 2004. Methods for transcriptional
profiling in plants. Be fruitful and replicate. Plant Physiol. 135:637–52
90. Mockler TC, Ecker JR. 2005. Applications of DNA tiling arrays for whole-genome anal-
ysis. Genomics 85:1–15
91. Moreau Y, Aerts S, Moor B, Strooper B, Dabrowski M. 2003. Comparison and meta-
analysis of microarray data: from the bench to the computer desk. Trends Genet. 19:570–77
92. Mueller LA, Solow TH, Taylor N, Skwarecki B, Buels R, et al. 2005. The SOL Genomics
Network. A comparative resource for Solanaceae biology and beyond. Plant Physiol.
138:1310–17

93. Myers EW. 1995. Toward simplifying and accurately formulating fragment assembly. J.
Comput. Biol. 2:275–90
94. Nakazono M, Qiu F, Borsuk LA, Schnable PS. 2003. Laser-capture microdissection, a
tool for the global analysis of gene expression in specific plant cell types: identification of
genes expressed differentially in epidermal cells or vascular tissues of maize. Plant Cell
15:583–96
95. Neumann E. 2005. A life science Semantic Web: Are we there yet? Sci. STKE 283:pe22
96. Newton RP, Brenton AG, Smith CJ, Dudley E. 2004. Plant proteome analysis by
mass spectrometry: principles, problems, pitfalls and recent developments. Phytochem-
istry 65:1449–85
97. Noel JP, Austin MB, Bomati EK. 2005. Structure-function relationships in plant phenyl-
propanoid biosynthesis. Curr. Opin. Plant Biol. 8:249–53
98. Papin JA, Reed JL, Palsson BO. 2004. Hierarchical thinking in network biology: the
unbiased modularization of biochemical networks. Trends Biochem. Sci. 29:641–47

99. Pappin DJ, Hojrup P, Bleasby AJ. 1993. Rapid identification of proteins by peptide-mass
fingerprinting. Curr. Biol. 3:327–32
100. Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, et al. 2005.
ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic
Acids Res. 33:D553–55
101. Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, et al. 2001. Blocks of limited hap-
lotype diversity revealed by high-resolution scanning of human chromosome 21. Science
294:1719–23
102. Pop M, Phillippy A, Delcher AL, Salzberg SL. 2004. Comparative genome assembly.
Brief Bioinform. 5:237–48
103. Rhee SY. 2004. Carpe diem. Retooling the publish or perish model into the share and
survive model. Plant Physiol. 134:543–47
104. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, et al. 2003. The Arabidopsis
Information Resource (TAIR): a model organism database providing a centralized, cu-
rated gateway to Arabidopsis biology, research materials and community. Nucleic Acids
Res. 31:224–28
105. Rhee SY, Crosby B. 2005. Biological databases for plant research. Plant Physiol. 138:1–3
106. Rhodes D, Yu J, Shanker K, Deshpande N, Varambally R, et al. 2004. Large-scale meta-
analysis of cancer microarray data identifies common transcriptional profiles of neoplastic
transformation and progression. Proc. Natl. Acad. Sci. USA 101:9309–14
107. Roberts C, Nelson B, Marton M, Stoughton R, Meyer M, et al. 2000. Signaling and
circuitry of multiple MAPK pathways revealed by a matrix of global gene expression
profiles. Science 287:873–80
108. Roth FP, Hughes JD, Estep PW, Church GM. 1998. Finding DNA regulatory motifs
within unaligned noncoding sequences clustered by whole-genome mRNA quantitation.
Nat. Biotechnol. 16:939–45
109. Saal LH, Troein C, Vallon-Christersson J, Gruvberger S, Borg Å, Peterson C. 2002.
BioArray Software Environment: a platform for comprehensive management and analysis
of microarray data. Genome Biol. 3:software0003.1–.6
110. Schauer N, Steinhauser D, Strelkov S, Schomburg D, Allison G, et al. 2005. GC-MS
libraries for the rapid identification of metabolites in complex biological samples. FEBS
Lett. 579:1332–37
111. Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene
expression patterns with a complementary DNA microarray. Science 270:467–70

112. Schlueter SD, Dong Q, Brendel V. 2003. GeneSeqer@PlantGDB: gene structure pre-
diction in plant genomes. Nucleic Acids Res. 31:3597–600
113. Schneider M, Bairoch A, Wu CH, Apweiler R. 2005. Plant protein annotation in the
UniProt Knowledgebase. Plant Physiol. 138:59–66
114. Seo TS, Bai X, Kim DH, Meng Q, Shi S, et al. 2005. Four-color DNA sequencing by
synthesis on a chip using photocleavable fluorescent nucleotides. Proc. Natl. Acad. Sci.
USA 102:5926–31
115. Shanks JV. 2005. Phytochemical engineering: combining chemical reaction engineering
with plant science. AIChE J. 51:2–7
116. Shen L, Gong J, Caldo RA, Nettleton D, Cook D, et al. 2005. BarleyBase—an expression
profiling database for plant genomics. Nucleic Acids Res. 33:D614–18
117. Sinha U, Bui A, Taira R, Dionisio J, Morioka C, et al. 2002. A review of medical imaging
informatics. Ann. NY Acad. Sci. 980:168–97
118. Slonim DK. 2002. From patterns to pathways: gene expression data analysis comes of
age. Nat. Genet. 32:502–8
119. SMRS Working Group. 2005. Summary recommendations for standardization and re-
porting of metabolic analyses. Nat. Biotechnol. 23:833–38
120. Sriram G, Fulton DB, Iyer VV, Peterson JM, Zhou R, et al. 2004. Quantification of
compartmented metabolic fluxes in developing soybean embryos by employing biosynthetically
directed fractional 13C labeling, two-dimensional [13C, 1H] nuclear magnetic
resonance, and comprehensive isotopomer balancing. Plant Physiol. 136:3043–57
121. Steuer R, Kurths J, Fiehn O, Weckwerth W. 2003. Interpreting correlations in
metabolomic networks. Biochem. Soc. Trans. 31:1476–78
122. Steuer R, Kurths J, Fiehn O, Weckwerth W. 2003. Observing and interpreting correla-
tions in metabolomic networks. Bioinformatics 19:1019–26
123. Stevens J, Doerge R. 2005. Combining Affymetrix microarray results. BMC Bioinformatics
6:57
124. Stevens R, Goble CA, Bechhofer S. 2000. Ontology-based knowledge representation for
bioinformatics. Brief Bioinform. 1:398–414
125. Stevens RD, Robinson AJ, Goble CA. 2003. myGrid: personalised bioinformatics on the
information grid. Bioinformatics 19(Suppl. 1):i302–4
126. Stoeckert CJ Jr, Causton HC, Ball CA. 2002. Microarray databases: standards and on-
tologies. Nat. Genet. 32(Suppl.):469–73
127. Stolc V, Samanta MP, Tongprasit W, Sethi H, Liang S, et al. 2005. Identification of
transcribed sequences in Arabidopsis thaliana by using high-resolution genome tiling
arrays. Proc. Natl. Acad. Sci. USA 102:4453–58
128. Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN. 1999. MedMiner: an
internet text-mining tool for biomedical information, with application to gene expression
profiling. BioTechniques 27:1210–17
129. Tchieu JH, Fana F, Fink JL, Harper J, Nair TM, et al. 2003. The PlantsP and PlantsT
Functional Genomics Databases. Nucleic Acids Res. 31:342–44
130. The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flow-
ering plant Arabidopsis thaliana. Nature 408:796–815
131. The Wellcome Trust. 2003. Sharing Data from Large-Scale Biological Research Projects: A
System of Tripartite Responsibility. Fort Lauderdale, FL: Wellcome Trust
132. Toyoda T, Shinozaki K. 2005. Tiling array-driven elucidation of transcriptional structures
based on maximum-likelihood and Markov models. Plant J. 43:611–21
133. Trethewey R. 2004. Metabolite profiling as an aid to metabolic engineering in plants.
Curr. Opin. Plant Biol. 7:196–201

134. van Helden J. 2003. Regulatory sequence analysis tools. Nucleic Acids Res. 31:3593–96
135. Vincent PL, Coe EH, Polacco ML. 2003. Zea mays ontology—a database of international
terms. Trends Plant Sci. 8:517–20
136. Wan X, Xu D. 2005. Computational methods for remote homolog identification. Curr.
Protein Peptide Sci. 6:527–46
137. Ware DH, Jaiswal P, Ni J, Yap IV, Pan X, et al. 2002. Gramene, a tool for grass genomics.
Plant Physiol. 130:1606–13
138. Weckwerth W, Loureiro M, Wenzel K, Fiehn O. 2004. Differential metabolic networks
unravel the effects of silent plant phenotypes. Proc. Natl. Acad. Sci. USA 101:7809–14
139. Wheeler DL, Smith-White B, Chetvernin V, Resenchuk S, Dombrowski SM, et al.
2005. Plant genome resources at the National Center for Biotechnology Information. Plant
Physiol. 138:1280–88
140. Wiechert W, Mollney M, Petersen S, de Graaf AA. 2001. A universal framework for 13C
metabolic flux analysis. Metab. Eng. 3:265–83

141. Wilkinson M, Schoof H, Ernst R, Haase D. 2005. BioMOBY successfully integrates
distributed heterogeneous bioinformatics Web Services. The PlaNet exemplar case. Plant
Physiol. 138:5–17
142. Woo Y, Affourtit J, Daigle S, Viale A, Johnson K, et al. 2004. A comparison of cDNA,
oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. J.
Biomol. Tech. 15:276–84
143. Yamada K, Lim J, Dale JM, Chen H, Shinn P, et al. 2003. Empirical analysis of transcrip-
tional activity in the Arabidopsis genome. Science 302:842–46
144. Yamazaki Y, Jaiswal P. 2005. Biological ontologies in rice databases. An introduction to
the activities in Gramene and Oryzabase. Plant Cell Physiol. 46:63–68
145. Yates JR 3rd, Eng JK, McCormack AL, Schieltz D. 1995. Method to correlate tandem
mass spectra of modified peptides to amino acid sequences in the protein database. Anal.
Chem. 67:1426–36
146. Yona G, Levitt M. 2002. Within the twilight zone: a sensitive profile-profile comparison
tool based on information theory. J. Mol. Biol. 315:1257–75
147. Yu J, Hu S, Wang J, Wong GK, Li S, et al. 2002. A draft sequence of the rice genome
(Oryza sativa L. ssp. indica). Science 296:79–92
148. Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, et al. 2005. The Institute for Genomic
Research Osa1 rice genome annotation database. Plant Physiol. 138:18–26
149. Zhang J, Leiderman K, Pfeiffer JR, Wilson BS, Oliver JM, Steinberg SL. 2006. Char-
acterizing the topography of membrane receptors and signaling molecules from spatial
patterns obtained using nanometer-scale electron-dense probes and electron microscopy.
Micron 37:14–34
150. Zhang MQ. 2002. Computational prediction of eukaryotic protein-coding genes. Nat.
Rev. Genet. 3:698–709
151. Zhang P, Foerster H, Tissier CP, Mueller L, Paley S, et al. 2005. MetaCyc and AraCyc.
Metabolic pathway databases for plant research. Plant Physiol. 138:27–37
152. Zhu H, Bilgin M, Snyder M. 2003. Proteomics. Annu. Rev. Biochem. 72:783–812
153. Zhu T, Wang X. 2000. Large-scale profiling of the Arabidopsis transcriptome. Plant
Physiol. 124:1472–76
154. Zhu W, Schlueter SD, Brendel V. 2003. Refined annotation of the Arabidopsis genome
by complete expressed sequence tag mapping. Plant Physiol. 132:469–84
155. Ding J, Viswanathan K, Berleant D, Hughes L, Wurtele ES, et al. 2005. Using the biologi-
cal taxonomy to access biological literature with PathBinderH. Bioinformatics 21:2560–62

DISCLOSURE STATEMENT
J.D. is a PI of the PLEXdb database that focuses on using Affymetrix GeneChips for cross-
species comparison.

Contents ARI 5 April 2006 18:47
Annual Review
Contents of Plant Biology
Volume 57, 2006
Looking at Life: From Binoculars to the Electron Microscope

Sarah P. Gibbs p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p 1
MicroRNAs and Their Regulatory Roles in Plants

Matthew W. Jones-Rhoades, David P. Bartel, and Bonnie Bartel p p p p p p p p p p p p p p p p p p p p p p p p p p19
Chlorophyll Degradation During Senescence
S. Hörtensteiner p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p55
Quantitative Fluorescence Microscopy: From Art to Science

Mark Fricker, John Runions, and Ian Moore p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p79
Control of the Actin Cytoskeleton in Plant Cell Growth
Patrick J. Hussey, Tijs Ketelaar, and Michael J. Deeks p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p 109
Responding to Color: The Regulation of Complementary Chromatic
Adaptation
David M. Kehoe and Andrian Gutu p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p 127
Seasonal Control of Tuberization in Potato: Conserved Elements with
the Flowering Response
Mariana Rodríguez-Falcón, Jordi Bou, and Salomé Prat p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p 151
Laser Microdissection of Plant Tissue: What You See Is What You Get
Timothy Nelson, S. Lori Tausta, Neeru Gandotra, and Tie Liu p p p p p p p p p p p p p p p p p p p p p p p p p p 181
Integrative Plant Biology: Role of Phloem Long-Distance
Macromolecular Trafficking
Tony J. Lough and William J. Lucas p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p 203
The Role of Root Exudates in Rhizosphere Interactions with Plants
and Other Organisms
Harsh P. Bais, Tiffany L. Weir, Laura G. Perry, Simon Gilroy,
and Jorge M. Vivanco p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p 233
Genetics of Meiotic Prophase I in Plants
Olivier Hamant, Hong Ma, and W. Zacheus Cande p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p 267
Biology and Biochemistry of Glucosinolates
Barbara Ann Halkier and Jonathan Gershenzon p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p p 303
v
Bioinformatics and Its Applications in Plant Biology
Seung Yon Rhee, Julie Dickerson, and Dong Xu . . . . 335
Leaf Hydraulics
Lawren Sack and N. Michele Holbrook . . . . 361
Plant Uncoupling Mitochondrial Proteins
Aníbal Eugênio Vercesi, Jiri Borecký, Ivan de Godoy Maia, Paulo Arruda, Iolanda Midea Cuccovia, and Hernan Chaimovich . . . . 383
Genetics and Biochemistry of Seed Flavonoids
Loïc Lepiniec, Isabelle Debeaujon, Jean-Marc Routaboul, Antoine Baudry, Lucille Pourcel, Nathalie Nesi, and Michel Caboche . . . . 405
Cytokinins: Activity, Biosynthesis, and Translocation
Hitoshi Sakakibara . . . . 431
Global Studies of Cell Type-Specific Gene Expression in Plants
David W. Galbraith and Kenneth Birnbaum . . . . 451
Mechanism of Leaf-Shape Determination
Hirokazu Tsukaya . . . . 477
Mosses as Model Systems for the Study of Metabolism and Development
David Cove, Magdalena Bezanilla, Phillip Harries, and Ralph Quatrano . . . . 497
Structure and Function of Photosystems I and II
Nathan Nelson and Charles F. Yocum . . . . 521
Glycosyltransferases of Lipophilic Small Molecules
Dianna Bowles, Eng-Kiat Lim, Brigitte Poppenberger, and Fabián E. Vaistij . . . . 567
Protein Degradation Machineries in Plastids
Wataru Sakamoto . . . . 599
Molybdenum Cofactor Biosynthesis and Molybdenum Enzymes
Günter Schwarz and Ralf R. Mendel . . . . 623
Peptide Hormones in Plants
Yoshikatsu Matsubayashi and Youji Sakagami . . . . 649
Sugar Sensing and Signaling in Plants: Conserved and Novel Mechanisms
Filip Rolland, Elena Baena-Gonzalez, and Jen Sheen . . . . 675
Vitamin Synthesis in Plants: Tocopherols and Carotenoids
Dean DellaPenna and Barry J. Pogson . . . . 711
Plastid-to-Nucleus Retrograde Signaling
Ajit Nott, Hou-Sung Jung, Shai Koussevitzky, and Joanne Chory . . . . 739
The Genetics and Biochemistry of Floral Pigments
Erich Grotewold . . . . 761
Transcriptional Regulatory Networks in Cellular Responses and Tolerance to Dehydration and Cold Stresses
Kazuko Yamaguchi-Shinozaki and Kazuo Shinozaki . . . . 781
Pyrimidine and Purine Biosynthesis and Degradation in Plants
Rita Zrenner, Mark Stitt, Uwe Sonnewald, and Ralf Boldt . . . . 805
Phytochrome Structure and Signaling Mechanisms
Nathan C. Rockwell, Yi-Shin Su, and J. Clark Lagarias . . . . 837
Microtubule Dynamics and Organization in the Plant Cortical Array
David W. Ehrhardt and Sidney L. Shaw . . . . 859
INDEXES
Subject Index . . . . 877
Cumulative Index of Contributing Authors, Volumes 47–57 . . . . 915
Cumulative Index of Chapter Titles, Volumes 47–57 . . . . 920
ERRATA
An online log of corrections to Annual Review of Plant Biology chapters (if any, 1977 to the present) may be found at http://plant.annualreviews.org/
