2000-Gene Expression Data Analysis
Minireview
Gene expression data analysis
Alvis Brazma*, Jaak Vilo
European Molecular Biology Laboratory, Outstation Hinxton - the European Bioinformatics Institute, Cambridge CB10 1SD, UK
0014-5793/00/$20.00 © 2000 Federation of European Biochemical Societies. Published by Elsevier Science B.V. All rights reserved.
PII: S0014-5793(00)01772-5
A. Brazma, J. Vilo/FEBS Letters 480 (2000) 17-24
sion matrix. Building up a database of such matrices will help us to understand gene regulation, metabolic and signaling pathways, the genetic mechanisms of disease, and the response to drug treatments. For instance, if overexpression of certain genes is correlated with a certain cancer, we can explore which other conditions affect the expression of these genes and which other genes have similar expression profiles. We can also investigate which compounds (potential drugs) lower the expression level of these genes.

2. From raw data to gene expression matrix

Like many experimental technologies, microarrays measure the target quantity (i.e. relative or absolute mRNA abundance) indirectly by measuring another physical quantity: the intensity of the fluorescence of the spots on the array for each fluorescent dye, i.e. for each optical wavelength (so-called channel). Therefore the raw data produced by microarrays are in fact monochrome images (Fig. 1). Transforming these images into the gene expression matrix is a non-trivial process: the spots corresponding to genes on the microarray should be identified, their boundaries determined, and the fluorescence intensity from each spot measured and compared to the background intensity and to the corresponding intensities for other channels. The software for this initial image processing is often provided with the image scanner, since it depends on particular properties of the hardware. Laborious manual adjustment of the spot grid is often needed. We will not discuss the raw data processing in detail in this paper; a survey of image analysis software can be found at http://cmpteam4.unil.ch/biocomputing/array/software/MicroArray_Software.html.

In any physical experiment it is important to know not only the value of the measurement, but also the standard error or
Fig. 1. A sample image from scanning a hybridized rat microarray containing over 5000 genes. Each spot features a pool of identical single-stranded DNA molecules representing a single gene. The brightness of the spot is proportional to the amount of fluorescent mRNA hybridized to the DNA of the spot. Automated image analysis software should identify these fluorescent spots, determine their boundaries, and measure the fluorescence intensity of each spot relative to the background fluorescence. Moreover, the image should be compared to a similar image obtained from the control measurements and the ratio of background-subtracted intensities calculated. In this way images are transformed into a gene expression matrix, which can be analyzed further by numerical methods. The image was kindly provided by Tom Freeman (Sanger Centre, Cambridge, UK).
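As a rough illustration of the last step described in the caption, the sketch below turns per-spot foreground and background intensities into one matrix entry per gene and array. The function names, the input layout, the base-2 logarithm and the `min_signal` cutoff are all assumptions made for this example; they are not taken from any particular image analysis package.

```python
import math


def spot_log_ratio(sample_fg, sample_bg, control_fg, control_bg, min_signal=50.0):
    """Return log2(sample / control) of background-subtracted intensities.

    Returns None when either background-subtracted signal is below
    min_signal (a hypothetical cutoff), because ratios of very weak
    signals may be large but are not reliable.
    """
    sample = sample_fg - sample_bg
    control = control_fg - control_bg
    if sample < min_signal or control < min_signal:
        return None
    return math.log2(sample / control)


def build_expression_matrix(arrays):
    """Build a genes x arrays matrix of log ratios.

    arrays is a list with one dict per hybridized array, mapping a gene
    name to (sample_fg, sample_bg, control_fg, control_bg).
    Genes missing from an array get None in that column.
    """
    genes = sorted(set().union(*[a.keys() for a in arrays]))
    matrix = [
        [spot_log_ratio(*a[g]) if g in a else None for a in arrays]
        for g in genes
    ]
    return genes, matrix


# Two toy arrays with made-up intensities.
arrays = [
    {"geneA": (1300.0, 100.0, 400.0, 100.0), "geneB": (120.0, 100.0, 110.0, 100.0)},
    {"geneA": (500.0, 100.0, 900.0, 100.0), "geneB": (700.0, 100.0, 700.0, 100.0)},
]
genes, matrix = build_expression_matrix(arrays)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Rows of the resulting matrix correspond to genes and columns to arrays. The weak spot for `geneB` on the first array is left as a missing value instead of being given a misleading ratio.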
some other indicator of reliability for each data point. For most microarray technology platforms only the ratio of the background-subtracted signals of the given sample and the control is meaningful. If the spot intensity is low, the ratio of these numbers may be high, but the measurement may not be reliable. The spot quality can be assessed not only by the absolute intensity in each channel, but also by many other factors, such as uniformity of the individual pixel intensities, or the shape of the spot. Unfortunately there is currently no standard way of assessing the spot measurement reliability. If experiments have been done in replicates, they can be used to assess the standard errors in addition to the single measurement quality assessments. Little has been published yet on how to use the reliability of gene expression measurements by combining the information about the spot image in each channel and the replicate images.

Another difficulty in creating a gene expression matrix comes from the necessity to identify each spot with the respective gene. This is not always possible, since spots are typically based on EST sequences, and linking the EST to the respective gene may be non-trivial. Typically it is done through EST clustering. Additionally, the same gene may be represented by several spots on the array, either by exactly the same or by a different sequence. What expression level should be attributed to the gene if measurements from these different spots differ?

Microarray-based gene expression measurements are still far from giving estimates of mRNA counts per cell in the sample. The measurements are relative by nature: essentially we can compare the expression level either of the same gene in different samples, or of different genes in the same sample. Moreover, appropriate normalization should be applied to enable any data comparisons. Typically it is assumed that abundance ratios of 1.5-2 are indicative of a change in gene expression, but such estimates are very crude. The reliability of ratios depends on the absolute intensity values, and also varies from spot to spot due to specificity of the sequence and cross-hybridization of homologous sequences (for instance see [3]). This should be kept in mind while analyzing the gene expression matrix. The value of microarray-based gene expression measurements would be considerably higher if reliability and limitations of particular microarray platforms for particular kinds of measurements, as well as cross-platform comparison and normalization, were studied and published.

After we have processed the raw image data into the gene expression matrix, the next task is to analyze this matrix and to try to extract from it some knowledge about the underlying biological processes.

3. Gene expression matrix analysis

There are two straightforward ways in which a gene expression matrix can be studied:

1. comparing expression profiles of genes by comparing rows in the expression matrix;
2. comparing expression profiles of samples by comparing columns in the matrix.

Additionally both methods can be combined (provided that the data normalization allows it). When comparing rows or columns, we can look either for similarities or for differences. If we find that two rows are similar, we can hypothesize that the respective genes are co-regulated and possibly functionally related. By comparing samples, we can find which genes are differentially expressed and, for instance, study effects of various compounds.

Before we can perform any comparisons, we need a way to measure the similarity (or distance) between the objects we are comparing. We can regard these objects (rows or columns in the matrix) as points in n-dimensional space or as n-dimensional vectors, where n is the number of samples for gene comparison, or the number of genes for sample comparison. The natural, so-called Euclidean distance (for definition see [4]) between these points in the n-dimensional space may be the most obvious, but not necessarily the best choice. It is intuitively appealing to use the correlation coefficient calculated by treating the two n-dimensional vectors as series of random variables. In fact this distance is related to the angle between the two n-dimensional vectors. Euclidean and correlation distance measures are related if we normalize the length of the n-dimensional vectors to 1. This makes it possible to use correlation distance even in the cases when Euclidean properties are important. Some other distance measures, including the rank correlation coefficient and a mutual information-based measure, are proposed in D'haeseleer et al. [5]. Currently, to the best of our knowledge, there is no theory of how to choose the best distance measure. Possibly one 'right' distance measure in the expression profile space does not exist, and the choice should depend on the questions that we are asking. Standard sets of known co-regulated genes in various organisms and gene regulatory network modeling can potentially help in finding theoretically substantiated similarity measures.

After having chosen the similarity measure in the expression profile space we can study the expression matrix in either a supervised or an unsupervised manner. The supervised approach assumes that for some (or all) profiles we have additional information, such as functional classes for the genes, or diseased/normal states attributed to the samples. We can view this additional information as labels attached to the rows or columns. Having this information, a typical task is to build a classifier able to predict the labels from the expression profile. A typical example of unsupervised data analysis is expression profile clustering to find groups of co-regulated genes or related samples. For conceptual illustration of unsupervised and supervised analysis see Fig. 2. First we discuss the clustering approach.

3.1. Unsupervised analysis

The goal of clustering is to group together objects (genes or samples) with similar properties. This can also be viewed as the reduction of the dimensionality of the system. Clustering is not a new technique; many algorithms have been developed for it and many of these algorithms have been applied to analyze expression data. The hierarchical [6] and K-means clustering algorithms [7,8] as well as self-organizing maps [9] have all been used for clustering expression profiles. Even a simple clustering algorithm based on binning (i.e. discretizing the expression profile space and clustering together the profiles that map into the same bin) has been shown to be useful for clustering genes and subsequent discovery of transcription factor binding sites [10]. More recently new algorithms
Fig. 3. Hierarchical clustering of gene expression matrices. The image shows an average linkage (UPGMA) clustering of 505 yeast genes during three different cell cycle studies with a total of 60 different time points analyzed. The color image on the left shows the numerical values encoded by color according to the method introduced by Mike Eisen. Red is used to represent the positive values and green the negative values. Blue shows the missing values in the respective experiments. The clustering and the image are produced using WWW-based tools in Expression Profiler (http://www.ebi.ac.uk/microarray/). The interface is interactive and further information about the genes in each subtree is available by clicking on the respective nodes in the tree.
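To make the average linkage clustering named in the caption and the choice of distance measure more concrete, here is a minimal pure-Python sketch. It is not the Expression Profiler implementation: the function names and the four toy profiles are invented for illustration, the correlation distance is assumed to be one minus the Pearson coefficient, and the naive search over cluster pairs is only practical for small inputs.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(u, v):
    """Pearson correlation coefficient (profiles must not be constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def correlation_distance(u, v):
    """Assumed definition for this sketch: 1 - Pearson correlation."""
    return 1.0 - pearson(u, v)


def standardize(u):
    """Center a profile and scale it to unit length."""
    mu = sum(u) / len(u)
    centered = [a - mu for a in u]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def average_linkage(profiles, distance):
    """Agglomerative clustering with average linkage (UPGMA).

    Returns the merges in order as (members_a, members_b, distance),
    where members are indices into profiles.
    """
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                # Average of all pairwise distances between the two clusters.
                d = sum(distance(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Toy expression profiles (rows of an expression matrix); values are made up.
profiles = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0: rising
    [2.0, 4.0, 6.0, 8.0],  # gene 1: rising, larger amplitude
    [4.0, 3.0, 2.0, 1.0],  # gene 2: falling
    [8.0, 6.0, 4.0, 2.1],  # gene 3: falling, larger amplitude
]

print(average_linkage(profiles, correlation_distance))
print(average_linkage(profiles, euclidean))
```

With the correlation distance the first merge joins genes 0 and 1, which share a trend; with the Euclidean distance it joins genes 0 and 2, which are close in magnitude but have opposite trends. For profiles that have been centered and scaled to unit length with `standardize`, the squared Euclidean distance equals twice the correlation distance, which is the sense in which the two measures are related in this sketch.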
3.2. Supervised analysis

One of the goals of supervised expression data analysis is to construct classifiers, such as linear discriminants, decision trees or support vector machines (SVM), which assign predefined classes to a given expression profile. For instance, if a classifier can be constructed based on gene expression profiles that is able to distinguish between two different, but morphologically closely related tumor tissues, such a classifier can be used for diagnostics. Moreover, if such a classifier is based on a set of relatively simple rules, it can help to understand what the mechanisms involved in each tumor are. Typically, such classifiers are trained on a subset of data with a priori given classification and tested on another subset with known classification. After assessing the quality of the prediction they can be applied to data the classification of which is unknown.

Brown et al. [22] have applied various supervised learning algorithms to six functional classes of yeast genes using gene expression matrices from 79 samples [6]. Genes from some of the classes, such as ribosomal proteins and histones, are expected to be co-expressed. For these classes a good classification accuracy was achieved. Some other functional classes, such as protein kinases, are not expected to have distinct gene expression profiles. It was shown that SVM provides the best prediction accuracy for the functional classes that are expected to be co-regulated.

Golub et al. [23] applied neighborhood analysis to construct class predictors for samples, specifically for leukemias. They were looking for genes the expression of which is best correlated with two known classes of leukemias, acute myeloid leukemia and acute lymphoblastic leukemia. They constructed a classifier based on 50 genes (out of 6817) using 38 samples and applied it to a collection of 34 new samples. The classifier correctly predicted 29 of these 34 samples.

Note that when classifying samples, we are confronted with the problem that there are many more attributes (genes) than objects (samples) that we are trying to classify. This makes it always possible to find a perfect discriminator if we are not careful in restricting the complexity of the permitted classifiers. To avoid this problem we must look for very simple classifiers, compromising between simplicity and classification accuracy. Ben-Dor et al. [24] applied a new clustering algorithm for classification of colon and ovarian cancer data sets. They used unsupervised clustering to find a hierarchical structure in the expression profile space, and supervised learning to find the best threshold to correlate the clustering structure with the known cancer classes.

Whether we use supervised or unsupervised expression profile analysis, they are only the first steps in expression data analysis. It is a long way from finding gene clusters to finding functional roles of the respective genes, and moreover, understanding the underlying biological processes. A natural step downstream of expression profile clustering is the use of putative promoter sequences of similarly expressed genes for finding regulatory sequence elements in genomes. This is easier for yeast, since typically yeast promoters are relatively close to ORFs. In the next section we describe an approach which uses gene expression data to find regulatory sequence elements in yeast.

4. Identification of putative regulatory signals

It seems reasonable to hypothesize that genes with similar expression profiles, i.e. genes that are co-expressed, may have something in common in their regulatory mechanisms, i.e. may be co-regulated. Therefore by clustering together genes with similar expression profiles one can find groups of potentially co-regulated genes and search for putative regulatory signals. The outline of such a discovery method is as follows:

1. cluster the genes based on a selection of expression measurements;
2. extract putative promoter sequences for the genes in the clusters;
3. search for sequence patterns overrepresented in these clusters;
4. assess the quality of discovered patterns using some statistical significance criteria.

A systematic application of this approach has been reported for the yeast Saccharomyces cerevisiae using a public data set from Stanford University [6] combining various yeast expression experiments with a total of 80 conditions for 6221 genes (http://rana.stanford.edu/). The computational analysis consisted of the following steps [25].

1. Clustering the expression data. In the absence of theoretically 'correct' similarity measures and clustering algorithms, the simplest measure was selected and different clusterings carried out. All genes were clustered based on their expression profiles by the K-means clustering algorithm using Euclidean distances. Instead of fixing the number of clusters K it was varied between 2 and 1000. For each K the clustering was repeated 10 times with different random initial cluster centers. In total over 900 separate clusterings were made and clusters of size between 20 and 100 genes were selected, totaling over 52 100 different clusters.
2. Sequence pattern discovery. For each cluster the set of gene upstream sequences of length 600 bp was taken for analysis. All substring patterns of unrestricted length occurring in at least 10 sequences in a cluster were scored according to the binomial probability of their occurrence in the cluster. The background probability was estimated based on the number of occurrences of each pattern in upstream sequences of all 6221 genes.
3. Finding the significance threshold by control experiment. To determine the statistical significance threshold for the patterns, step 2 was repeated on randomized data by replacing the cluster contents by upstream sequences from random sets of genes. A threshold probability of 10⁻⁸ was chosen as patterns with higher probability were also observable from random clusters.
4. Pattern selection. Of the over 6000 significant patterns many were observed to occur in clusters of genes with high homology in the respective upstream sequences. These clusters, totaling 169 genes, were easily identifiable and they were removed. The remaining clusters of genes with non-homologous upstream sequences contained 3727 ORFs and together they produced 1498 significant patterns.
5. Grouping the patterns. As 1498 substring patterns is still too many for human study, they were clustered using a similarity measure based on common information content [26]. This produced 62 clusters of similar patterns. For each
cluster of patterns an approximate alignment and a consensus pattern were calculated.
6. Evaluation of discovered patterns against known transcription factor binding sites. All 1498 interesting patterns were matched against experimentally verified DNA binding sites of yeast as given in SCPD ([27], http://cgsigma.cshl.org/jian/).

Of the 62 clusters of patterns 48 had matches in SCPD and 14 did not match any site reported in the SCPD database. Table 1 shows the partial consensus patterns that were calculated from pattern alignments for these 14 clusters. The nucleotide groups (IUPAC groups represented here using a regular expression notation) were introduced when the frequency of the less frequent nucleotide in the respective column was over 25% of the frequency of the more frequent nucleotide. Inside the groups nucleotides are ordered based on their frequency. Lowercase letters are used when the majority of the patterns do not have any nucleotide in that position, i.e. when the most frequent nucleotide in the respective alignment column is a dash.

The fact that 48 out of 62 pattern classes have matches in experimentally verified yeast transcription factor binding sites indicates the validity of the described computational discovery method. Potentially the most interesting patterns, however, are the ones that do not have matches in the known binding sites, and they can be targets for further research (see Table 1). In this way, the described computational experiment has come up with targets for further research by more conventional methods. Automatic or semiautomatic generation of such hypotheses is one of the main tasks of bioinformatics and data mining approaches.

The tools used for the experiments outlined above, as well as the complete results of the experiments, are available online (http://www.ebi.ac.uk/microarray). All the tools, including the clustering and visualization methods for expression data analysis and the regulatory region extraction for yeast, have a web interface. The individual tools are interconnected so that similar analyses can be carried out over the web for any expression and sequence data.

Table 1
Consensus sequences of the pattern clusters that do not have matches in the SCPD database

Cluster   Consensus pattern
2         aaTCTTCATGt
5         cgTACCTCTa
8         gACAGCTAc
17        tAT[TAC]GTTAAgc
20        ACTTTATTT
21        [ag]TAACTT[AT]Ca
26        TATCGAG (singleton)
29        t[ta]CGAATA[AG]aaaa
42        [ta]TGCATGAAc
43        a[TG][GC]GTATAc
45        [ag][ga][AG]ATATG[TG][ga][ag]g
46        tag[AG]TAGA[TA]A[ga]aaaa
50        ATCCAAGAg
59        tTTTTCTG[CT][TA]c

See text for explanations.

5. Conclusions

Expression data analysis methods are currently only in their infancy. Even the rather obvious approaches, such as cluster analysis and finding differentially expressed genes, have been used only rather crudely. For instance, the appropriateness of similarity measures has not been systematically explored and these measures are used on an ad-hoc basis. The information characterizing the measurement quality of different data points is typically not used. Advances in this area are hindered by the lack of systematic research in ways of assessing the measurement quality and comparing data from various technology platforms. These shortcomings can be overcome only if the journals encourage publications exploring the gene expression measurement technologies themselves, rather than always concentrating on the biological subject. In the long run the advancement of biological knowledge will be accelerated by technology-centric studies, with biology becoming a more quantitative science.

Gene expression data analysis methods will develop much as sequence analysis methods have developed over the past decades. The amounts of gene expression data will continue growing and the data will become more systematic. Currently gene expression profiling is similar to gene sequencing before the era of genome sequencing: the measurements are carried out to attack particular questions or sometimes just to demonstrate the concept.

With the technology becoming more reliable, with the introduction of standard controls in experiments and the development of generally accepted data normalization and quality control methods, it will become possible to systematically profile genes in various organisms, tissues, developmental stages and conditions. Various chemical compounds will be profiled for their possible toxicity and other effects on organisms, and various signatures will be associated with various toxicity mechanisms or cellular processes. This approach will resemble systematic genome sequencing. Algorithms for reliable searching of similar expression profiles, or analyzing sets of related profiles to discover common signatures, will be needed, just as searching and pattern discovery algorithms are needed to explore sequences.

However, there is a major difference between gene sequence and expression data. Even if eventually we are able to overcome various technological limitations, and even if we are able to measure gene expression in terms of absolute units such as mRNA counts, the gene expression profiles are meaningful only in the context of the experimental conditions in which they have been measured. This requires detailed and systematic annotation of samples and experimental conditions. For this to become a reality, agreed ontologies and controlled vocabularies for tissues, cell types, and treatments, as well as for array designs, image analyses and hybridization protocols, have to be developed. Systematic building up of gene expression matrices for various organisms would be facilitated by establishing a public repository for gene expression data [28].

Like genome sequencing, the systematic gene expression profile is not an end in itself. It is a long way from having detailed gene expression profiles to real understanding of underlying cellular processes. Bioinformatics methods and tools will be needed to cope with the huge amounts of data, but they will not bring any deep understanding by themselves. On the other hand, the traditional 'gene by gene' methods will not be sufficient to understand gene regulatory networks consisting of thousands or tens of thousands of genes. One of the
most challenging downstream goals of gene expression profiling and data analysis is the reverse engineering and modeling of gene regulatory networks (see for instance [29-31]). With biology becoming a more quantitative science, modeling approaches will become more and more common.

References

[1] Celis, J.E., Kruhøffer, M., Gromova, I., Frederiksen, C., Østergaard, M., Thykjaer, T., Gromov, P., Yu, Y., Pálsdóttir, H. and Ørntoft, T.F. (2000) FEBS Lett. 480, 2-16.
[2] The Chipping Forecast (1999) Nature Genet. 21, Suppl.
[3] Claverie, J.-M. (1999) Hum. Mol. Genet. 8, 1821-1832.
[4] Legendre, P. and Legendre, L. (1998) Numerical Ecology. Developments in Environmental Modelling, Elsevier, Amsterdam.
[5] D'haeseleer, P., Wen, X., Fuhrman, S. and Somogyi, R. (1998) in: Information Processing in Cells and Tissues, Plenum Press, New York.
[6] Eisen, M., Spellman, P.T., Botstein, D. and Brown, P.O. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14867.
[7] Hartigan, J.A. (1975) Clustering Algorithms, John Wiley and Sons, New York.
[8] Tavazoie, S., Hughes, D., Campbell, M.J., Cho, R.J. and Church, G.M. (1999) Nature Genet. 22, 281-285.
[9] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. and Golub, T. (1999) Proc. Natl. Acad. Sci. USA 96, 2907-2912.
[10] Brazma, A., Jonassen, I., Vilo, J. and Ukkonen, E. (1998) Genome Res. 8, 1202-1215.
[11] Ben-Dor, A. and Yakhini, Z. (1999) Proceedings of the Third Annual International Conference on Computational Molecular Biology RECOMB-1999, pp. 33-42, ACM Press, Lyon.
[12] Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D. and Levine, A.J. (1999) Proc. Natl. Acad. Sci. USA 96, 6745-6750.
[13] DeRisi, J.L., Iyer, V.R. and Brown, P.O. (1997) Science 278, 680-686.
[14] van Helden, J., André, B. and Collado-Vides, J. (1998) J. Mol. Biol. 281, 827-842.
[15] Chu, S., DeRisi, J.L., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O. and Herskowitz, I. (1998) Science 282, 699-705.
[16] Spellman, P.T., Sherlock, G., Zhang, M., Iyer, V.R., Anders, K., Eisen, M., Brown, P.O., Botstein, D. and Futcher, B. (1998) Mol. Biol. Cell 9, 3273.
[17] Holstege, F., Jennings, E., Wyrick, J., Lee, T., Hengartner, C., Green, M., Golub, T., Lander, E. and Young, R. (1998) Cell 95, 717-728.
[18] Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J. and Davis, R.W. (1998) Mol. Cell 2, 65-73.
[19] Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C.F., Trent, J.M., Staudt, L.M., Hudson Jr., J., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D. and Brown, P.O. (1999) Science 283, 83-87.
[20] Lee, C., Klopp, R.G., Weindruch, R. and Prolla, T.A. (1999) Science 285, 1390-1393.
[21] Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson Jr., J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O. and Staudt, L.M. (2000) Nature 403, 503-511.
[22] Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares, M.J. and Haussler, D. (2000) Proc. Natl. Acad. Sci. USA 97, 262-267.
[23] Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D. and Lander, E.S. (1999) Science 286, 531-537.
[24] Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M. and Yakhini, Z. (2000) The Fourth Annual International Conference on Computational Molecular Biology RECOMB-2000, pp. 54-64, ACM Press, Tokyo.
[25] Vilo, J., Brazma, A., Jonassen, I., Robinson, A. and Ukkonen, E. (2000) The Eighth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, La Jolla, CA, in press.
[26] Hertz, G.Z. and Stormo, G.D. (1995) in: Proceedings of the Third International Conference on Bioinformatics and Genome Research, pp. 201-216, World Scientific Publishing, Singapore.
[27] Zhu, J. and Zhang, M.Q. (1999) Bioinformatics 15, 607-611.
[28] Brazma, A., Robinson, A., Cameron, G. and Ashburner, M. (2000) Nature 403, 699-700.
[29] Akutsu, T., Miyano, S. and Kuhara, S. (1999) The Pacific Symposium on Biocomputing '99 (PSB'99), pp. 17-28, World Scientific, Hawaii.
[30] Liang, S., Fuhrman, S. and Somogyi, R. (1998) The Pacific Symposium on Biocomputing, Vol. 3, pp. 18-29, World Scientific, Hawaii.
[31] Thieffry, D., Colet, M. and Thomas, R. (1993) Math. Model. Sci. Comput. 55, 144-151.