2000-Gene Expression Data Analysis


FEBS 23893 FEBS Letters 480 (2000) 17–24

Minireview
Gene expression data analysis
Alvis Brazma*, Jaak Vilo
European Molecular Biology Laboratory, Outstation Hinxton – the European Bioinformatics Institute, Cambridge CB10 1SD, UK

Received 5 June 2000

Edited by Gianni Cesareni

Abstract: Microarrays are one of the latest breakthroughs in experimental molecular biology, which allow monitoring of gene expression for tens of thousands of genes in parallel and are already producing huge amounts of valuable data. Analysis and handling of such data is becoming one of the major bottlenecks in the utilization of the technology. The raw microarray data are images, which have to be transformed into gene expression matrices – tables where rows represent genes, columns represent various samples such as tissues or experimental conditions, and numbers in each cell characterize the expression level of the particular gene in the particular sample. These matrices have to be analyzed further, if any knowledge about the underlying biological processes is to be extracted. In this paper we concentrate on discussing bioinformatics methods used for such analysis. We briefly discuss supervised and unsupervised data analysis and its applications, such as predicting gene function classes and cancer classification. Then we discuss how the gene expression matrix can be used to predict putative regulatory signals in the genome sequences. In conclusion we discuss some possible future directions. © 2000 Federation of European Biochemical Societies. Published by Elsevier Science B.V. All rights reserved.

Key words: Microarray; Gene expression; Promoter sequence; Pattern discovery; Clustering

*Corresponding author. Fax: +44 1223 494468.
E-mail: [email protected]; [email protected]

0014-5793/00/$20.00 © 2000 Federation of European Biochemical Societies. Published by Elsevier Science B.V. All rights reserved.
PII: S0014-5793(00)01772-5

1. Introduction

With several eukaryotic genomes completed and the draft human genome published, we are now entering the postgenomic age. The main focus in genomic research is switching from sequencing to using the genome sequences in order to understand how genomes are functioning. Some questions we would like to ask are:

• what are the functional roles of different genes and in what cellular processes do they participate;
• how are genes regulated, how do genes and gene products interact, what are these interaction networks;
• how does gene expression level differ in various cell types and states, how is gene expression changed by various diseases or compound treatments.

Knowing the gene transcript abundance in various tissues, developmental stages and under various conditions is important for attacking these questions. Although mRNA is not the ultimate product of a gene, transcription is the first step in gene regulation, and information about the transcript levels is needed for understanding gene regulatory networks. Moreover, the measurement of mRNA levels currently is considerably cheaper and can be done in a more high-throughput way than direct measurements of the protein levels. The correlation between the mRNA and protein abundance in the cell may not be straightforward; nevertheless, the absence of mRNA in a cell is likely to imply a not very high level of the respective protein, and thus at least qualitative estimates about the proteome can be based on the transcriptome information. The mRNA and protein level correlation studies are under way (see [1]).

The ability to monitor gene expression at the transcript level has become possible due to the advent of DNA microarray technologies (see [2]). A microarray is a glass slide, onto which single-stranded DNA molecules are attached at fixed locations (spots). There may be tens of thousands of spots on an array, each related to a single gene. Microarrays exploit the preferential binding of complementary single-stranded nucleic acid sequences. There are several variations of microarray technologies, each used in a specific way.

One of the most popular experimental platforms is used for comparing mRNA abundance in two different samples (or a sample and a control). RNA from the sample and control cells is extracted and labeled with two different fluorescent labels, e.g. a red dye for the RNA from the sample population and a green dye for that from the control population. Both extracts are washed over the microarray. Gene sequences from the extracts hybridize to their complementary sequences in the spots.

To measure the relative abundance of the hybridized RNA the array is excited by a laser. If the RNA from the sample population is in abundance, the spot will be red; if the RNA from the control population is in abundance, it will be green. If sample and control bind equally, the spot will be yellow, while if neither binds, it will not fluoresce and appear black. Thus, from the fluorescence intensities and colors for each spot, the relative expression levels of the genes in the sample and control populations can be estimated.

By measuring transcription levels of genes in an organism under various conditions, at different developmental stages and in different tissues, we can build up 'gene expression profiles' which characterize the dynamic functioning of each gene in the genome. We can imagine the expression data represented in a matrix with rows representing genes, columns representing samples (e.g. various tissues, developmental stages and treatments), and each cell containing a number characterizing the expression level of the particular gene in the particular sample. We will call such a table a gene expression matrix. Building up a database of such matrices will help us to understand gene regulation, metabolic and signaling pathways, the genetic mechanisms of disease, and the response to drug treatments. For instance, if overexpression of certain genes is correlated with a certain cancer, we can explore which other conditions affect the expression of these genes and which other genes have similar expression profiles. We can also investigate which compounds (potential drugs) lower the expression level of these genes.

2. From raw data to gene expression matrix

Like many experimental technologies, microarrays measure the target quantity (i.e. relative or absolute mRNA abundance) indirectly by measuring another physical quantity – the intensity of the fluorescence of the spots on the array for each fluorescent dye, i.e. for each optical wavelength (so-called channel). Therefore the raw data produced by microarrays are in fact monochrome images (Fig. 1). Transforming these images into the gene expression matrix is a non-trivial process: the spots corresponding to genes on the microarray should be identified, their boundaries determined, the fluorescence intensity from each spot measured and compared to the background intensity and to these intensities for other channels. The software for this initial image processing is often provided with the image scanner, since it will depend on particular properties of the hardware. Often laborious manual adjustment of the grid for spots is used. We will not discuss the raw data processing in detail in this paper; a survey of image analysis software can be found at http://cmpteam4.unil.ch/biocomputing/array/software/MicroArray_Software.html.

In any physical experiment it is important to know not only the value of the measurement, but also the standard error or

Fig. 1. A sample image from scanning a hybridized rat microarray containing over 5000 genes. Each spot features a pool of identical single-stranded DNA molecules representing a single gene. The brightness of the spot is proportional to the amount of fluorescent mRNA hybridized to the DNA of the spot. Automated image analysis software should identify these fluorescence spots, determine their boundaries, and the fluorescence intensity from each spot should be measured and compared to the background fluorescence. Moreover, the image should be compared to a similar image obtained from the control measurements and the ratio of background-subtracted intensities calculated. In this way images are transformed into a gene expression matrix, which can be analyzed further by numerical methods. The image was kindly provided by Tom Freeman (Sanger Centre, Cambridge, UK).
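The last step named in the caption, the ratio of background-subtracted intensities, can be illustrated with a minimal sketch. It is not the procedure of any particular scanner software: the function name `log_ratio`, the base-2 logarithm, the `min_signal` cut-off and the intensity values are all assumptions made for illustration.

```python
import math

def log_ratio(sample_fg, sample_bg, control_fg, control_bg, min_signal=50.0):
    """Return log2(sample / control) of background-subtracted intensities.

    Returns None when either background-subtracted signal falls below
    min_signal (a hypothetical cut-off), because a ratio of two weak
    signals can be large without being reliable.
    """
    sample = sample_fg - sample_bg
    control = control_fg - control_bg
    if sample < min_signal or control < min_signal:
        return None
    return math.log2(sample / control)

# Hypothetical spots: (sample foreground, sample background,
#                      control foreground, control background)
spots = {
    "geneA": (1250.0, 250.0, 500.0, 250.0),   # 1000 vs 250: higher in sample
    "geneB": (260.0, 250.0, 255.0, 250.0),    # both signals near background
    "geneC": (600.0, 100.0, 1100.0, 100.0),   # 500 vs 1000: lower in sample
}

# One column of a gene expression matrix: one value per gene for this sample.
column = {gene: log_ratio(*values) for gene, values in spots.items()}
print(column)
```

Taking the logarithm is a design choice of this sketch rather than something stated in the caption: it makes a two-fold increase and a two-fold decrease symmetric around zero.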

some other indicator of reliability for each data point. For most microarray technology platforms only the ratio of the background-subtracted signals of the given sample and the control is meaningful. If the spot intensity is low, the ratio of these numbers may be high, but the measurement may not be reliable. The spot quality can be assessed not only by the absolute intensity in each channel, but also by many other factors, such as uniformity of the individual pixel intensities, or the shape of the spot. Unfortunately there is currently no standard way of assessing the spot measurement reliability. If experiments have been done in replicates, they can be used to assess the standard errors in addition to the single measurement quality assessments. Little has been published yet on how to use the reliability of gene expression measurements by combining the information about the spot image in each channel and the replicate images.

Another difficulty in creating a gene expression matrix comes from the necessity to identify each spot with the respective gene. This is not always possible, since spots are typically based on EST sequences, and linking the EST to the respective gene may be non-trivial. Typically it is done through EST clustering. Additionally, the same gene may be represented by several spots on the array, either by exactly the same or by a different sequence. What expression level to attribute to the gene, if measurements from these different spots differ?

Microarray-based gene expression measurements are still far from giving estimates of mRNA counts per cell in the sample. The measurements are relative by nature: essentially we can compare the expression level either of the same gene in different samples, or of different genes in the same sample. Moreover, appropriate normalization should be applied to enable any data comparisons. Typically it is assumed that abundance ratios of 1.5–2 are indicative of a change in gene expression, but such estimates are very crude. The reliability of ratios depends on the absolute intensity values, as well as varying from spot to spot due to specificity of the sequence and cross-hybridization of homologous sequences (for instance see [3]). This should be kept in mind while analyzing the gene expression matrix. The value of microarray-based gene expression measurements would be considerably higher if reliability and limitations of particular microarray platforms for particular kinds of measurements, as well as cross-platform comparison and normalization, were studied and published.

After we have processed the raw image data into the gene expression matrix, the next task is to analyze this matrix and to try to extract from it some knowledge about the underlying biological processes.

3. Gene expression matrix analysis

There are two straightforward ways in which a gene expression matrix can be studied:

1. comparing expression profiles of genes by comparing rows in the expression matrix;
2. comparing expression profiles of samples by comparing columns in the matrix.

Additionally both methods can be combined (provided that the data normalization allows it). When comparing rows or columns, we can look either for similarities or for differences. If we find that two rows are similar, we can hypothesize that the respective genes are co-regulated and possibly functionally related. By comparing samples, we can find which genes are differentially expressed and, for instance, study effects of various compounds.

Before we can perform any comparisons, we need a way to measure the similarity (or distance) between the objects we are comparing. We can regard these objects (rows or columns in the matrix) as points in n-dimensional space or as n-dimensional vectors, where n is the number of samples for gene comparison, or number of genes for sample comparison. The natural, so-called Euclidean distance (for definition see [4]) between these points in the n-dimensional space may be the most obvious, but not necessarily the best choice. It is intuitively appealing to use the correlation coefficient calculated by treating the two n-dimensional vectors as series of random variables. In fact this distance is related to the angle between the two n-dimensional vectors. Euclidean and correlation distance measures are related, if we normalize the length of the n-dimensional vectors to 1. This makes it possible to use correlation distance even in the cases when Euclidean properties are important. Some other distance measures, including rank correlation coefficient and mutual information-based measure, are proposed in D'haeseleer et al. [5]. Currently, to the best of our knowledge, there is no theory how to choose the best distance measure. Possibly one 'right' distance measure in the expression profile space does not exist, and the choice should depend on the questions that we are asking. Standard sets of known co-regulated genes in various organisms and gene regulatory network modeling can potentially help in finding theoretically substantiated similarity measures.

After having chosen the similarity measure in the expression profile space we can study the expression matrix in either a supervised or an unsupervised manner. The supervised approach assumes that for some (or all) profiles we have additional information, such as functional classes for the genes, or diseased/normal states attributed to the samples. We can view this additional information as labels attached to the rows or columns. Having this information, a typical task is to build a classifier able to predict the labels from the expression profile. A typical example of unsupervised data analysis is expression profile clustering to find groups of co-regulated genes or related samples. For conceptual illustration of unsupervised and supervised analysis see Fig. 2. First we discuss the clustering approach.

Fig. 2. Supervised and unsupervised data analysis. In the unsupervised case (left) we are given data points in n-dimensional space (n = 2 in the example) and we are trying to find ways how to group together points with similar features. For instance, there are three natural clusters in the example, each consisting of data points close to each other in a sense of Euclidean distance. A clustering algorithm should identify these clusters. In the supervised case (right), the objects are labelled (e.g. we have filled and unfilled points in the example), and the task is to find a set of classification rules allowing us to discriminate between these points as precisely as possible. For instance, the dotted line in the drawing discriminates most of the points correctly, allowing us to predict their 'labels' – filled or unfilled – by their position above or below the dotted line.

3.1. Unsupervised analysis

The goal of clustering is to group together objects (genes or samples) with similar properties. This can also be viewed as the reduction of the dimensionality of the system. Clustering is not a new technique; many algorithms have been developed for it and many of these algorithms have been applied to analyze expression data. The hierarchical [6] and K-means clustering algorithms [7,8] as well as self-organizing maps [9] have all been used for clustering expression profiles. Even a simple clustering algorithm based on binning (i.e. discretizing the expression profile space and clustering together the profiles that map into the same bin) has been shown to be useful for clustering genes and subsequent discovering of transcription factor binding sites [10]. More recently new algorithms have been developed specifically for gene expression profile clustering, for instance based on finding approximate cliques in graphs [11].

Hierarchical clustering works by iteratively joining the two closest clusters starting from singleton clusters [6] or iteratively partitioning clusters starting with the complete set [12], see Fig. 3. After each joining of two clusters, the distances between all the other clusters and a new joined cluster are recalculated. The complete linkage, average linkage, and single linkage methods use maximum, average, and minimum distances between the members of two clusters respectively. Note that to obtain a particular partitioning into clusters, the threshold distance should be chosen by independent means (typically by the user himself).

The K-means clustering algorithm typically uses the Euclidean properties of the vector space. The desired number of clusters K has to be chosen a priori. After the initial partitioning of the vector space into K parts, the algorithm calculates the center points in each subspace and adjusts the partition so that each vector is assigned to the cluster the center of which is the closest. This is repeated iteratively until either the partitioning stabilizes or the given number of iterations is exceeded. The approaches for the initial selection of the first set of K cluster centers can vary.

Clustering of expression profiles has been used for grouping genes as well as samples. The clustering of genes for finding co-regulated and functionally related groups is particularly interesting in the cases when we have complete sets of an organism's genes. In a frequently quoted paper DeRisi et al. [13] used a DNA array containing a complete set of yeast genes to study the diauxic shift time course. They selected small groups of genes with similar expression profiles and showed that these genes are functionally related and contain relevant transcription factor binding sites upstream of their open reading frames (ORFs). More systematic studies of this dataset for regulatory elements were done by Brazma et al. [10] and van Helden et al. [14].

Later more expression studies of yeast under various conditions were carried out, including sporulation [15], cell cycle [16] and yeast gene regulatory machinery [17]. Clustering has been applied to the obtained gene expression matrices, and groups of functionally related and co-regulated genes have been revealed. Tavazoie et al. [8] clustered expression profiles of 3000 most variable yeast genes during the cell cycle (15 time points, data from Cho et al. [18]) into 30 clusters by the K-means algorithm. They found that for half of these clusters, strong sequence patterns are present in the gene upstream sequences. Note that expression profiles of cell cycle-dependent genes are periodic and Fourier analysis has been used to discover these genes [16].

Eisen et al. [6] have developed a hierarchical clustering-based algorithm and visualization software package, which is currently one of the most frequently used tools for expression profile clustering and data visualization. They applied their software to gene expression matrices obtained by combining 80 different yeast samples (experimental conditions) studied in various hybridization experiments at Stanford University (including the ones mentioned above).

Gene expression profile clustering does not necessarily require the full genome. For instance Iyer et al. [19] studied 8600 genes in human fibroblasts and obtained 10 distinct gene clusters each associated with genes with particular functional roles, such as signal transduction, coagulation, hemostasis, inflammation etc.

A simple method for finding sets of interesting genes is comparing expression profiles of two or more samples for differentially expressed genes. For instance, Lee et al. [20] used this method to find genes that are differentially expressed in skeletal muscle of adult (5 months) and old (30 months) mice. Of the 6347 mouse genes surveyed by a microarray, 58 displayed a greater than two-fold increase, whereas 55 displayed a greater than two-fold decrease in expression in the skeletal muscles of the old mice. Of the genes that increased in expression, 16% were mediators or stress response genes and 9% were involved in neuronal growth. Of genes that decreased in expression, 13% were participating in energy metabolism. In the same study gene expression profiles from 30 month old mice with restricted caloric intake (76% of that of a control population) were compared to the 30 month old control population, and it was shown that the expression profile of restricted caloric intake mice was closer to that of younger mice.

Hierarchical clustering [6] has also been used for sample clustering. An interesting application of this approach is the clustering of tumors to find new possible tumor subclasses. In a recent paper by Alizadeh et al. [21], diffuse large B-cell lymphoma (DLBCL) was studied using 96 samples of normal and malignant lymphocytes. Applying a hierarchical clustering algorithm to these samples they showed that there is a diversity in gene expression among the tumors of DLBCL patients. They identified two molecularly distinct forms of DLBCL, which had gene expression patterns indicative of different stages of B-cell differentiation. Interestingly, these two groups correlated well with patient survival rates, thus confirming that the clusters are meaningful.

Sample clustering has been combined with gene clustering to identify which genes are the most important for sample clustering [12,21]. Alon et al. [12] have applied a partitioning-based clustering algorithm to study 6500 genes of 40 tumor and 22 normal colon tissues for clustering both genes and samples. They call this method two-way clustering.
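The K-means procedure described earlier in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in any of the cited studies: the names `k_means`, `squared_distance` and `mean_vector`, the choice of K randomly picked profiles as initial centers, the fixed `seed` and the toy profiles are all hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean (center point) of a non-empty list of profiles."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(profiles, k, max_iterations=100, seed=0):
    """Cluster profiles into k groups; returns (assignment, centers)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen profiles as cluster centers.
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assign each profile to the cluster whose center is closest.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        if new_assignment == assignment:   # partitioning has stabilized
            break
        assignment = new_assignment
        # Recalculate the center point of every non-empty cluster.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centers[c] = mean_vector(members)
    return assignment, centers

# Toy expression matrix: six genes (rows) measured in three samples (columns).
profiles = [
    [2.0, 1.8, 0.1], [1.9, 2.1, 0.0], [2.2, 2.0, -0.1],
    [-1.0, -1.2, 1.5], [-0.9, -1.1, 1.7], [-1.1, -0.8, 1.6],
]
labels, centers = k_means(profiles, k=2)
print(labels)
```

Which integer label each group receives depends on the random initialization, so only the grouping itself is meaningful. An empty cluster simply keeps its previous center in this sketch; other strategies are possible.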

Fig. 3. Hierarchical clustering of gene expression matrices. The image shows an average linkage (UPGMA) clustering of 505 yeast genes during three different cell cycle studies with a total of 60 different time points analyzed. The color image on the left shows the numerical values encoded by color according to the method introduced by Mike Eisen. Red is used to represent the positive values and green the negative values. Blue shows the missing values in the respective experiments. The clustering and the image are produced using WWW-based tools in Expression Profiler (http://www.ebi.ac.uk/microarray/). The interface is interactive and further information about the genes in each subtree is available by clicking on the respective nodes in the tree.
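Average linkage clustering of the kind shown in the figure can be sketched as follows. This is an illustrative toy version, not the Expression Profiler code: the function names, the use of one minus the Pearson correlation as the distance, the `threshold` value and the four example profiles are assumptions. Profiles are assumed to be non-constant, since the correlation is undefined otherwise.

```python
import math

def correlation_distance(a, b):
    """1 - Pearson correlation coefficient of two non-constant profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def average_linkage(profiles, distance, threshold):
    """Join the two closest clusters repeatedly, starting from singletons.

    The distance between two clusters is the average distance between
    their members. Joining stops once the closest pair of clusters is
    farther apart than the user-chosen threshold.
    """
    clusters = [[i] for i in range(len(profiles))]

    def cluster_distance(c1, c2):
        total = sum(distance(profiles[i], profiles[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]   # join the closest pair
        del clusters[j]
    return clusters

# Toy time courses: genes 0 and 1 rise together, genes 2 and 3 fall together.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 5.9, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 3.9, 2.0],
]
clusters = average_linkage(profiles, correlation_distance, threshold=0.5)
print(clusters)
```

Recomputing every pairwise average in each round keeps the sketch short but is slow for large matrices; practical tools update the distances incrementally after each join.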

3.2. Supervised analysis

One of the goals of supervised expression data analysis is to construct classifiers, such as linear discriminants, decision trees or support vector machines (SVM), which assign predefined classes to a given expression profile. For instance, if a classifier can be constructed based on gene expression profiles that is able to distinguish between two different, but morphologically closely related tumor tissues, such a classifier can be used for diagnostics. Moreover, if such a classifier is based on a set of relatively simple rules, it can help to understand what the mechanisms involved in each tumor are. Typically, such classifiers are trained on a subset of data with a priori given classification and tested on another subset with known classification. After assessing the quality of the prediction they can be applied to data the classification of which is unknown.

Brown et al. [22] have applied various supervised learning algorithms to six functional classes of yeast genes using gene expression matrices from 79 samples [6]. Genes from some of the classes, such as ribosomal proteins and histones, are expected to be co-expressed. For these classes a good classification accuracy was achieved. Some other functional classes, such as protein kinases, are not expected to have distinct gene expression profiles. It was shown that SVM provides the best prediction accuracy for the functional classes that are expected to be co-regulated.

Golub et al. [23] applied neighborhood analysis to construct class predictors for samples, concretely for leukemias. They were looking for genes the expression of which is best correlated with two known classes of leukemias, acute myeloid leukemia and acute lymphoblastic leukemia. They constructed a classifier based on 50 genes (out of 6817) using 38 samples and applied it to a collection of 34 new samples. The classifier correctly predicted 29 of these 34 samples.

Note that when classifying samples, we are confronted with a problem that there are many more attributes (genes) than objects (samples) that we are trying to classify. This makes it always possible to find a perfect discriminator if we are not careful in restricting the complexity of the permitted classifiers. To avoid this problem we must look for very simple classifiers, compromising between simplicity and classification accuracy. Ben-Dor et al. [24] applied a new clustering algorithm for classification of colon and ovarian cancer data sets. They used unsupervised clustering to find a hierarchical structure in the expression profile space, and supervised learning to find the best threshold to correlate the clustering structure with the known cancer classes.

Whether we use supervised or unsupervised expression profile analysis, they are only the first steps in expression data analysis. It is a long way from finding gene clusters to finding functional roles of the respective genes, and moreover, understanding the underlying biological processes. A natural step downstream of expression profile clustering is the usage of putative promoter sequences of similarly expressed genes for finding regulatory sequence elements in genomes. This is easier for yeast, since typically yeast promoters are relatively close to ORFs. In the next section we describe an approach which uses gene expression data to find regulatory sequence elements in yeast.

4. Identification of putative regulatory signals

It seems reasonable to hypothesize that genes with similar expression profiles, i.e. genes that are co-expressed, may have something in common in their regulatory mechanisms, i.e. may be co-regulated. Therefore by clustering together genes with similar expression profiles one can find groups of potentially co-regulated genes and search for putative regulatory signals. The outline of such a discovery method is as follows:

1. cluster the genes based on a selection of expression measurements;
2. extract putative promoter sequences for the genes in the clusters;
3. search for sequence patterns overrepresented in these clusters;
4. assess the quality of discovered patterns using some statistical significance criteria.

A systematic application of this approach has been reported for the yeast Saccharomyces cerevisiae using a public data set from Stanford University [6] combining various yeast expression experiments with a total of 80 conditions for 6221 genes (http://rana.stanford.edu/). The computational analysis consisted of the following steps [25].

1. Clustering the expression data. In the absence of theoretically 'correct' similarity measures and clustering algorithms, the simplest measure was selected and different clusterings carried out. All genes were clustered based on their expression profiles by the K-means clustering algorithm using Euclidean distances. Instead of fixing the number of clusters K it was varied between 2 and 1000. For each K the clustering was repeated 10 times with different random initial cluster centers. In total over 900 separate clusterings were made and clusters of size between 20 and 100 genes were selected, totaling over 52 100 different clusters.
2. Sequence pattern discovery. For each cluster the set of gene upstream sequences of length 600 bp was taken for analysis. All substring patterns of unrestricted length occurring in at least 10 sequences in a cluster were scored according to the binomial probability of their occurrence in the cluster. The background probability was estimated based on the number of occurrences of each pattern in upstream sequences of all 6221 genes.
3. Finding the significance threshold by control experiment. To determine the statistical significance threshold for the patterns, step 2 was repeated on randomized data by replacing the cluster contents by upstream sequences from random sets of genes. A threshold probability of 10^-8 was chosen as patterns with higher probability were also observable from random clusters.
4. Pattern selection. Of the over 6000 significant patterns many were observed to occur in clusters of genes with high homology in the respective upstream sequences. These clusters, totaling 169 genes, were easily identifiable and they were removed. The remaining clusters of genes with non-homologous upstream sequences contained 3727 ORFs and together they produced 1498 significant patterns.
5. Grouping the patterns. As 1498 substring patterns is still too many for human study, they were clustered using a similarity measure based on common information content [26]. This produced 62 clusters of similar patterns. For each cluster of patterns an approximate alignment and a consensus pattern were calculated.
6. Evaluation of discovered patterns against known transcription factor binding sites. All 1498 interesting patterns were matched against experimentally verified DNA binding sites of yeast as given in SCPD ([27], http://cgsigma.cshl.org/jian/).

Of the 62 clusters of patterns 48 had matches in SCPD and 14 were such that they did not have a match in any site reported in the SCPD database. Table 1 shows the partial consensus patterns that were calculated from pattern alignments for these 14 clusters. The nucleotide groups (IUPAC groups represented here using a regular expression notation) were introduced when the frequency of the less frequent nucleotide in the respective column was over 25% of the frequency of the more frequent nucleotide. Inside the groups nucleotides are ordered based on their frequency. Lowercase letters are used when the majority of the patterns do not have any nucleotide in that position, i.e. when the most frequent nucleotide in the respective alignment column is a dash.

The fact that 48 out of 62 pattern classes have matches in experimentally verified yeast transcription factor binding sites indicates the validity of the described computational discovery method. Potentially the most interesting patterns, however, are the ones that do not have matches in the known binding sites, and they can be targets for further research (see Table 1). In this way, the described computational experiment has come up with targets for further research by more conventional methods. Automatic or semiautomatic generation of such hypotheses is one of the main tasks of bioinformatics and data mining approaches.

The tools used for the experiments outlined above, as well as the complete results of the experiments, are available online (http://www.ebi.ac.uk/microarray). All the tools, including the clustering and visualization methods for expression data analysis and the regulatory region extraction for the yeast, have a web interface. The individual tools are interconnected so that similar analyses can be carried out over the web for any expression and sequence data.

infancy. Even the rather obvious approaches, such as cluster analysis and finding differentially expressed genes, have been used only rather crudely. For instance, the appropriateness of similarity measures has not been systematically explored and these measures are used on an ad-hoc basis. The information characterizing the measurement quality of different data points is typically not used. Advances in this area are hindered by the lack of systematic research in ways of assessing the measurement quality and comparing data from various technology platforms. These shortcomings can be overcome only if the journals encourage publications exploring the gene expression measurement technologies themselves, rather than always concentrating on the biological subject. In the long run the advancement of biological knowledge will be accelerated by technology-centric studies, with biology becoming a more quantitative science.

Gene expression data analysis methods will develop similarly to the way sequence analysis methods have developed over the past decades. The amounts of gene expression data will continue growing and the data will become more systematic. Currently the gene expression profiling is similar to gene sequencing before the era of genome sequencing: the measurements are carried out to attack particular questions or sometimes just to demonstrate the concept.

With the technology becoming more reliable, with the introduction of standard controls in experiments and developing generally accepted data normalization and quality control methods, it will become possible to systematically profile genes in various organisms, tissues, developmental stages and conditions. Various chemical compounds will be profiled for their possible toxicity and other effects on organisms, and various signatures will be associated with various toxicity mechanisms or cellular processes. This approach will resemble systematic genome sequencing. Algorithms for reliable searching of similar expression profiles, or analyzing sets of related profiles to discover common signatures, will be needed, just as searching and pattern discovery algorithms are needed to explore sequences.

However, there is a major difference between gene sequence and expression data. Even if eventually we are able to over-
come various technological limitations, and even if we are
5. Conclusions able to measure gene expression in terms of absolute units
such as mRNA counts, the gene expression pro¢les are mean-
Expression data analysis methods are currently only in their ingful only in the context of the experimental conditions in
which they have been measured. This requires detailed and
Table 1 systematic annotation of samples and experimental condi-
Consensus sequences of the pattern clusters that do not have tions. For this to become a reality, agreed ontologies and
matches in the SCPD database controlled vocabularies for tissues, cell types, and treatments,
Cluster Consensus pattern as well as for array designs, image analyses and hybridization
2 aaTCTTCATGt protocols, have to be developed. Systematic building up of
5 cgTACCTCTa gene expression matrices for various organisms would be fa-
8 gACAGCTAc cilitated by establishing a public repository for gene expres-
17 tAT[TAC]GTTAAgc
sion data [28].
20 ACTTTATTT
21 [ag]TAACTT[AT]Ca Like genome sequencing, the systematic gene expression
26 TATCGAG (singleton) pro¢le is not an end in itself. It is a long way from having
29 t[ta]CGAATA[AG]aaaa detailed gene expression pro¢les to real understanding of
42 [ta]TGCATGAAc underlying cellular processes. Bioinformatics methods and
43 a[TG][GC]GTATAc
45 [ag][ga][AG]ATATG[TG][ga][ag]g tools will be needed to cope with the huge amounts of data,
46 tag[AG]TAGA[TA]A[ga]aaaa but they will not bring any deep understanding by themselves.
50 ATCCAAGAg On the other hand, the traditional `gene by gene' methods will
59 tTTTTCTG[CT][TA]c not be su¤cient to understand gene regulatory networks con-
See text for explanations. sisting of thousands or tens of thousands of genes. One of the

most challenging downstream goals of gene expression profiling and data analysis is the reverse engineering and modeling of gene regulatory networks (see for instance [29–31]). With biology becoming a more quantitative science, modeling approaches will become more and more common.

References

[1] Celis, J.E., Kruhøffer, M., Gromova, I., Frederiksen, C., Østergaard, M., Thykjaer, T., Gromov, P., Yu, Y., Pálsdóttir, H. and Ørntoft, T.F. (2000) FEBS Lett. 480, 2–16.
[2] The Chipping Forecast (1999) Nature Genet. 21, Suppl.
[3] Claverie, J.-M. (1999) Hum. Mol. Genet. 8, 1821–1832.
[4] Legendre, P. and Legendre, L. (1998) Numerical Ecology. Developments in Environmental Modelling, Elsevier, Amsterdam.
[5] D'haeseleer, P., Wen, X., Fuhrman, S. and Somogyi, R. (1998) in: Information Processing in Cells and Tissues, Plenum Press, New York.
[6] Eisen, M., Spellman, P.T., Botstein, D. and Brown, P.O. (1998) Proc. Natl. Acad. Sci. USA 95, 14863–14867.
[7] Hartigan, J.A. (1975) Clustering Algorithms, John Wiley and Sons, New York.
[8] Tavazoie, S., Hughes, D., Campbell, M.J., Cho, R.J. and Church, G.M. (1999) Nature Genet. 22, 281–285.
[9] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. and Golub, T. (1999) Proc. Natl. Acad. Sci. USA 96, 2907–2912.
[10] Brazma, A., Jonassen, I., Vilo, J. and Ukkonen, E. (1998) Genome Res. 8, 1202–1215.
[11] Ben-Dor, A. and Yakhini, Z. (1999) Proceedings of the Third Annual International Conference on Computational Molecular Biology RECOMB-1999, pp. 33–42, ACM Press, Lyon.
[12] Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D. and Levine, A.J. (1999) Proc. Natl. Acad. Sci. USA 96, 6745–6750.
[13] DeRisi, J.L., Iyer, V.R. and Brown, P.O. (1997) Science 278, 680–686.
[14] van Helden, J., André, B. and Collado-Vides, J. (1998) J. Mol. Biol. 281, 827–842.
[15] Chu, S., DeRisi, J.L., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O. and Herskowitz, I. (1998) Science 282, 699–705.
[16] Spellman, P.T., Sherlock, G., Zhang, M., Iyer, V.R., Anders, K., Eisen, M., Brown, P.O., Botstein, D. and Futcher, B. (1998) Mol. Biol. Cell 9, 3273.
[17] Holstege, F., Jennings, E., Wyrick, J., Lee, T., Hengartner, C., Green, M., Golub, T., Lander, E. and Young, R. (1998) Cell 95, 717–728.
[18] Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J. and Davis, R.W. (1998) Mol. Cell 2, 65–73.
[19] Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C.F., Trent, J.M., Staudt, L.M., Hudson Jr., J., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D. and Brown, P.O. (1999) Science 283, 83–87.
[20] Lee, C., Klopp, R.G., Weindruch, R. and Prolla, T.A. (1999) Science 285, 1390–1393.
[21] Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson Jr., J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O. and Staudt, L.M. (2000) Nature 403, 503–511.
[22] Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares, M.J. and Haussler, D. (2000) Proc. Natl. Acad. Sci. USA 97, 262–267.
[23] Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D. and Lander, E.S. (1999) Science 286, 531–537.
[24] Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M. and Yakhini, Z. (2000) The Fourth Annual International Conference on Computational Molecular Biology RECOMB-2000, pp. 54–64, ACM Press, Tokyo.
[25] Vilo, J., Brazma, A., Jonassen, I., Robinson, A. and Ukkonen, E. (2000) The Eighth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, La Jolla, CA, in press.
[26] Hertz, G.Z. and Stormo, G.D. (1995) in: Proceedings of the Third International Conference on Bioinformatics and Genome Research, pp. 201–216, World Scientific Publishing, Singapore.
[27] Zhu, J. and Zhang, M.Q. (1999) Bioinformatics 15, 607–611.
[28] Brazma, A., Robinson, A., Cameron, G. and Ashburner, M. (2000) Nature 403, 699–700.
[29] Akutsu, T., Miyano, S. and Kuhara, S. (1999) The Pacific Symposium on Biocomputing '99 (PSB'99), pp. 17–28, World Scientific, Hawaii.
[30] Liang, S., Fuhrman, S. and Somogyi, R. (1998) The Pacific Symposium on Biocomputing, Vol. 3, pp. 18–29, World Scientific, Hawaii.
[31] Thieffry, D., Colet, M. and Thomas, R. (1993) Math. Model. Sci. Comput. 55, 144–151.
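Note on the consensus notation of Table 1. The rule given in the text for building the partial consensus patterns (nucleotide groups above a 25% frequency ratio, ordering by frequency, lowercase where the dash dominates a column) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions and not the authors' implementation: the function name `consensus`, the alphabetical tie-breaking, the reading of "most frequent symbol is a dash" as "dashes strictly outnumber the top nucleotide", and the example alignment are all hypothetical.

```python
from collections import Counter


def consensus(aligned, threshold=0.25):
    """Build a partial consensus string from equal-length aligned patterns.

    `aligned` is a list of strings over A/C/G/T with '-' marking a gap.
    """
    if len({len(p) for p in aligned}) != 1:
        raise ValueError("aligned patterns must all have the same length")
    out = []
    for column in zip(*aligned):
        counts = Counter(column)
        gaps = counts.pop("-", 0)
        if not counts:  # column holds only dashes: nothing to report
            continue
        # Most frequent nucleotide first; ties broken alphabetically
        # (an assumption, the text does not specify tie-breaking).
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        top = ranked[0][1]
        # Keep every nucleotide whose count exceeds 25% of the top count.
        group = "".join(n for n, c in ranked if c > threshold * top)
        # Lowercase when dashes outnumber every nucleotide in the column
        # (one possible reading of the rule in the text).
        group = group.lower() if gaps > top else group.upper()
        out.append(group if len(group) == 1 else f"[{group}]")
    return "".join(out)


# Hypothetical alignment, not taken from the paper's data.
example = ["-TGCATG", "ATGCATG", "-TGAATG", "-TGCATG"]
print(consensus(example))  # aTG[CA]ATG
```

Because the output uses regular expression bracket groups, a string produced this way can be passed to a case-insensitive regular expression search against a sequence, which is one simple way to look for occurrences of a consensus pattern.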
