TMP 9 AA7
TMP 9 AA7
TMP 9 AA7
Address: Department of Biotechnology and Bioscience, University of Milano-Bicocca, Piazza della Scienza 2, 20126 Milan, Italy
Email: Norman Pavelka - [email protected]; Mattia Pelizzola - [email protected];
Caterina Vizzardelli - [email protected]; Monica Capozzoli - [email protected];
Andrea Splendiani - [email protected]; Francesca Granucci - [email protected]; Paola Ricciardi-
Castagnoli* - [email protected]
* Corresponding author †Equal contributors
Abstract
Background: High-density oligonucleotide microarray technology enables the discovery of genes
that are transcriptionally modulated in different biological samples due to physiology, disease or
intervention. Methods for the identification of these so-called "differentially expressed genes"
(DEG) would largely benefit from a deeper knowledge of the intrinsic measurement variability.
Though it is clear that variance of repeated measures is highly dependent on the average expression
level of a given gene, there is still a lack of consensus on how signal reproducibility is linked to signal
intensity. The aim of this study was to empirically model the variance versus mean dependence in
microarray data to improve the performance of existing methods for identifying DEG.
Results: In the present work we used data generated by our lab as well as publicly available data
sets to show that dispersion of repeated measures depends on location of the measures themselves
following a power law. This enables us to construct a power law global error model (PLGEM) that
is applicable to various Affymetrix GeneChip data sets. A new DEG identification method is
therefore proposed, consisting of a statistic designed to make explicit use of model-derived
measurement spread estimates and a resampling-based hypothesis testing algorithm.
Conclusions: The new method provides a control of the false positive rate, a good sensitivity vs.
specificity trade-off and consistent results with varying number of replicates and even using single
samples.
Page 1 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
An ideal DEG identification method should limit both More recently, Tu et al. [8] empirically modeled the vari-
false positives, i.e. genes wrongly called significant (type 1 ance versus mean dependence from a data set consisting
errors), and false negatives, i.e. genes wrongly called not of ten replicated oligonucleotide microarray experiments.
significant (type 2 errors). To this end, understanding According to the authors, the variance of the genes should
how gene expression values measured in replicated exper- decay exponentially with the mean, but only for moder-
iments are spread around the true expression level of each ately expression values. Taken together, all these aspects
gene, would help to distinguish biologically relevant gene could limit the applicability of these error models.
expression changes from fluctuations due to different
sources of variability that are unrelated to the biological Independently from the choice of the error model,
phenomenon under investigation. Measurement error another point that remains to be faced is on how to esti-
estimates can be obtained in two ways: either by empiri- mate residual error. A discussed by Wright et al. [9] the
cally inferring noise from highly replicated data or by possibilities range between two extremes: either obtaining
deducing noise from a theoretical error model [1]. Espe- a single variance estimate across all genes or obtaining a
cially when the experimental design requires the investi- gene-specific residual variance. In the same paper a hybrid
gation of a high number of conditions, the former strategy approach is proposed in which information from all
is not always feasible, because of the high cost of these genes is used to fit a single linear model from which the
experiments or due to the availability of biological mate- gene-specific variance estimates can be deduced. In the
rial. In addition, there is still a lack of consensus on how present work we chose to follow an approach similar to
gene expression values from replicated experiments the latter.
should be theoretically distributed, which restricts the
application also of the latter strategy. The aim of this study was to use highly replicated micro-
array data to empirically determine the true variance ver-
The most widely used methods for identifying DEG range sus mean dependence that exists in this type of data. This
from purely empirical filtering techniques (e.g. selecting knowledge enabled the proposal of PLGEM as a simple
genes that show a fold change higher than a fixed thresh- but powerful error model. We fitted the proposed model
old) to more sophisticated statistical tests such as the sig- on various data sets without pre-filtering the data, deriv-
nal-to-noise ratio described by Golub et al. [2] or the ing an improved test statistics and identifying DEG even
Significance Analysis of Microarrays (SAM) method by in data sets with very little number of replicates.
Tusher et al. [3]. While empirical filtering techniques rely
on arbitrarily chosen thresholds and are unable to provide Results
any type of control on the significance of the results, the Variance versus mean dependence
more sophisticated statistical tests usually need a high The relationship between measurement variability and
degree of replication in the data to accurately measure average expression values was investigated by means of
gene-specific variability. scatter plots where different measures of spread were dis-
played against different measures of location. For each
In the past years various authors have proposed compet- gene absolute or relative standard deviation was plotted
ing error models for microarray data from which discord- against the mean expression value in either linear or log-
ant implications for the variance versus mean dependence log plots using data from the 16iDC data set (Figure 1).
can be deduced. Chen et al. [4] first proposed a simple Independently from the choice of standard deviation or
Gaussian model, more recently Ideker et al. [5] and Li and inter-quartile range as the estimate of spread and of mean
Wong [6] introduced two-component models containing or median as the location estimate we obtained qualita-
a multiplicative and an additive error term. All of these tively similar plots (data not shown). Log-log plots of
models implicitly or explicitly assume a constant coeffi- both absolute and relative spread estimates revealed a
cient of variation (CV), implying that standard deviation strikingly linear dependency, indicating that measure-
should vary proportionally with the mean. More recently, ment spread could depend on signal location following a
Rocke and Durbin [7] proposed a variation of the two- power law.
component model from which they derived that variance
of repeated microarray measures is a quadratic function of The power law global error model
the mean. Dealing specifically with spotted cDNA micro- Based on the previous observation, we chose to empiri-
array technology Baggerly et al. [1] proposed a beta-bino- cally model measurement noise through linear
mial model, from which it can be derived that variance is regressions:
a second-order polynomial function of the mean. Unfor-
tunately, most of these models are based on theoretical ln( s) = k.ln( x ) + c + ε eqn.1
assumptions that have been verified on simulated data or
on data sets consisting of small numbers of replicates.
Page 2 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
1600
A B8 3
C D2
6 1
1200
2
ln(st.dev.)
4 0
st.dev.
ln(CV)
CV
-2 0 2 4 6 8 10
800
2 -1
1
400
0 -2
-2 0 2 4 6 8 10
0 -2 0 -3
0 1000 2000 3000 4000 5000 6000 7000 0 1000 2000 3000 4000 5000 6000 7000
mean ln(mean) mean ln(mean)
Page 3 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
Table 1: Effect of the presence of DEG when applying the classic permutation strategy to the PLGEM-STN statistic.
# of slope intercept adj. R^2 slope intercept adj. R^2 slope intercept adj. R^2 slope intercept adj. R^2
genes in
data set
22300 1.690 3.070 0.856 0.871 -0.114 0.938 0.187 -0.383 0.314 1.090 -0.692 0.826
10000 1.710 2.470 0.881 0.888 -0.083 0.931 0.224 -0.247 0.433 1.040 -0.555 0.824
5000 1.460 1.270 0.880 0.877 -0.147 0.934 0.093 -0.228 0.067 1.080 -0.425 0.836
2500 1.590 1.180 0.888 0.864 -0.155 0.939 0.092 -0.176 0.135 1.110 -0.348 0.853
1500 1.670 0.935 0.908 0.876 -0.166 0.948 0.078 -0.122 0.170 1.130 -0.182 0.880
1000 1.720 0.689 0.874 0.944 0.002 0.956 0.038 -0.122 0.082 0.991 -0.226 0.897
500 1.900 0.263 0.864 0.857 -0.255 0.956 0.062 -0.030 0.307 1.160 0.038 0.909
200 2.490 -0.059 0.875 0.905 -0.378 0.946 0.064 -0.012 0.426 1.050 0.233 0.914
Eight data sets with different percentages of DEG were constructed from the Latin Square data set by keeping the 62 known spiked-in probe sets,
but randomly removing increasing amounts of the remaining probe sets reaching the total indicated in the first column. The null distributions of the
PLGEM-STN statistics were evaluated through the classic permutation strategy either including or excluding DEG. A wide range of significance
levels was used to select DEG and correlation between the FPR and the significance level or between estimated and observed FDR was evaluated
through linear regressions in log-log plots. Table reports slopes, intercepts and adjusted r2 of linear models. See text for details on estimation of
FDR.
In order to verify the feasibility of the former application, cal hypothesis testing, in which we test against the null
fitting of PLGEM on real-life data as well as distribution hypothesis of non-differential expression. First of all, we
properties of the random variable E were investigated by chose to implement as the test statistic the signal-to-noise
analyzing the residuals of the model, i.e. differences ratio (STN) already used by Golub et al. [2], because it
between observed and expected values: explicitly takes unequal variances into account and
because it penalizes genes that have higher variance in
ε g = ln(s g ) − (k10,0.5 ⋅ ln(x g ) + c10,0.5 ) eqn. 5 each class more than those genes that have a high variance
in one class and a low variance in another [11]:
Figure 2A–D shows distribution of residuals εg computed
from the 16iDC data set and its dependency on the rank x (g2) − x (g1)
of mean expression values. Figure 2E–L summarizes STN = eqn. 6
model validation on other two completely unrelated s(g1) + s(g2)
high-density oligonucleotide microarray data sets, the
HG-U133A Leigh syndrome data set (Figure 2E–H) and
x (1) x (2)
the HG-U95Av2 Muscle biopsies data set (Figure 2I–L). where in the original version g and g represent,
For each tested data set an individual model was fitted and respectively, the mean of the replicated expression meas-
a distinct set of parameters α and β was determined. In all
of the three independent data sets measurement variabil- s(1) s(2)
ures of gene g in condition 1 and 2, whereas g and g
ity could be accurately modeled through equation 2, with are the corresponding standard deviations. Instead, we
a power coefficient β that was always between 0 and 1 and propose to use model-derived standard deviation esti-
a random variable E that appeared to be normally distrib- mates predicted by PLGEM in equation 2 for the corre-
uted with zero-mean and constant standard deviation sponding signal mean, rather than data-derived standard
over the whole range of expression values. Of course, deviation values calculated independently from the few
these findings were eventually expected only for q = 0.5, data points that are usually available for every single gene.
and their occurrence demonstrated a goodness of fit of
PLGEM on a series of unrelated real-life data sets. The improvement of the test statistic in ranking DEG was
evaluated as done by Broberg [12] through receiver oper-
Improved test-statistics for detecting differential ator characteristic (ROC) plots on the HG-U133A Latin
expression Square data set, where there is an a priori knowledge on
In order to identify DEG, we implemented the following the truly differentially expressed transcripts. ROC plots
general algorithm derived from the framework of statisti- investigate the relationship between false positive rates
Page 4 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
Figure 2of residuals of PLGEM fitted on three different real-life data sets
Analysis
Analysis of residuals of PLGEM fitted on three different real-life data sets. (A) PLGEM is fitted on the 16iDC data set
(MG-U74Av2) following the method described in the text, setting p = 10 and q = 0.5. (B) Model residuals are plotted as a func-
tion of the rank of the mean absolute expression level. (C) Distribution of pooled residuals. (D) The quantiles of the distribu-
tion of pooled residuals are plotted against the quantiles of a standard normal distribution. The same procedure is applied to
the Leigh syndrome data set (HG-U133A, panels E-H) and to the Muscle biopsies data set (HG-U95Av2, panels I-L).
(FPR) and false negative rates (FNR) at different signifi- statistics for each tested value of n (data not shown). In
cance levels; in this way the performance of the PLGEM- addition, the ROC curve of PLGEM-STN always had the
derived STN statistic (PLGEM-STN) has been compared shortest distance from origin, indicating that it resulted in
with the original STN statistic (CLASSIC-STN) and the the best trade-off between sensitivity and specificity. Inter-
statistic implemented in the commonly accepted Signifi- estingly, improved sensitivity was observed especially
cance Analysis of Microarrays (SAM) DEG identification when the nominal fold change was particularly low (see
method (SAM-STAT). To this purpose Exp01 of the Latin Exp02 vs. Exp01 and Exp14 vs. Exp01).
Square was taken as the baseline to which the remaining
13 experiments were compared. For each comparison Apart from discriminating between significant and not
absolute values of each statistic were ranked in decreasing significant gene expression changes, an optimal test-statis-
order and first n genes selected (where n ranged from 5 to tic should additionally provide an accurate quantification
200). Figure 3 summarizes results only for the most of the actual degree of differential expression. Figure 4
informative comparisons, but in each tested comparison shows that PLGEM-STN outperforms the competing sta-
analysis PLGEM-STN was at least as good as the other two tistics in correlating the value of the statistic with the
Page 5 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
0.0020
False positive rate
0.0010
0.001
0.0000
0.000
0.0000 0.0005 0.0010 0.0015 0.0020 0.0025 0.0000 0.0005 0.0010 0.0015 0.0020 0.0025
False negative rate False negative rate
0.001
0.0000
0.000
0.0000 0.0005 0.0010 0.0015 0.0020 0.0025 0.0000 0.0005 0.0010 0.0015 0.0020 0.0025
False negative rate False negative rate
Figure 3
Performance of PLGEM-STN in ranking differentially expressed genes
Performance of PLGEM-STN in ranking differentially expressed genes. ROC plots were used to compare the sensi-
tivity vs. specificity trade-off of the following three statistics: PLGEM-STN (black), CLASSIC-STN (blue) and SAM-STAT (red).
SAM-STAT values were obtained using the R package "siggenes" [21]. Absolute values of the corresponding statistics were
sorted in decreasing order, first n genes were selected (where n ranged from 5 to 200) and false positive and false negative
rates were evaluated on the HG-U133A Latin Square dataset. Note that, while the transcripts in Exp02 (Exp14) are spiked-in
at twice (half) the concentration than in Exp01, in both Exp06 vs. Exp01 and in Exp10 vs. Exp01 comparisons the nominal fold-
change of spiked-in transcripts ranged from 32 to 512.
Page 6 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
40
strategy may not be optimal for estimating the actual FPR
0
0
when the test-statistic makes use of a global error model
such as PLGEM. We can in fact hypothesize that measure-
-40
-40
ment spread of DEG may not be accurately described by
-10 -5 0 5 10 0.1 1.0 10.0 500.0 means of a global error model that was designed to
describe signal variability in absence of differential expres-
100
CLASSIC STN
A classic approach to empirically obtain the null distribu- • Compute resampled test-statistics between A* and B* at
tion of a test-statistic is running a series of random permu- each cycle.
tations of the chip indexes of the full data set and re-
computing the test-statistics at each permutation. Permu- The previous resampling should be repeated a sufficiently
tated test-statistics can then be pooled and significance large number of times – as large as possible compared to
thresholds (i.e. expected false positive rates) are found as the total number of possible combinations and compati-
specific quantiles of the null distribution. bly with available computational resources – and the
Page 7 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
1.0
-3 -2 -1 0 1 2 3
0.8
Resampled PLGEM-STN (quantiles)
Figure
Comparison
tion of the
5 PLGEM-STN
of two methods
statistic
for inferring the null distribu- 0.6
Frequency
ing DEG.
2 4 6 8 10
Counts
Page 8 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
Page 9 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
Page 10 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
can be found at [23]. Considering the redundancy of Performance of modeling method using different combinations of
some probe sets, there are a total of 62 distinct probe sets parameters p and q. The modeling method described in this study was
tested on the 16iDC data set using different combinations of partitions (5,
designed to match the 42 spiked transcripts.
10, 20, 50, 100, 200 and 500), and quantiles (0.01, 0.02, 0.05, 0.1,
0.2, 0.5, 0.8, 0.9, 0.95, 0.98 and 0.99). For all 77 analyzed combina-
16iDC+LPS tions of p and q regression lines were fitted to the data as described in the
This data set consists of the same samples of the 16iDC text. Goodness of fit was evaluated from the resulting slope (panel A),
data set, but includes additional four samples as a second intercept (panel B) and adjusted r2 (panel C).
experimental condition. To this end dendritic cells were Click here for file
[http://www.biomedcentral.com/content/supplementary/1471-
stimulated to mature with lipopolysaccharide (LPS) for 2105-5-203-S1.xls]
24 hours. Two independent biological samples were har-
vested and individually processed by the same two opera-
tors that prepared the samples for the 16iDC data set: one
applied the total RNA protocol, the other one applied the Acknowledgements
purified mRNA protocol. Each cRNA sample was This research was supported by AIRC (Italian Association for Cancer
hybridized twice, thus using a total of four Affymetrix MG- Research) and by a grant from the CARIPLO Foundation ("Development of
U74Av2 chips. Functional Genomics and Bioinformatics platforms aimed to foster novel
approaches in immunotherapy and molecolar diagnostics"). Authors would
Software like to kindly acknowledge Stefano Monti (Broad Institute, Cambridge, MA,
USA) and Peter J. Park (Harvard Medical School, Boston, MA, USA) for
All chips mentioned in the present study were hybridized
helpful discussion and Ottavio Beretta, Gianpiero Cattaneo, Maria Foti,
and scanned following Affymetrix recommendations and Giorgio Moro and Ettore Virzi (University of Milano-Bicocca, Milan, Italy)
MicroArray Suite 5.0 (MAS5) was used as the image acqui- for critical reading of the manuscript. We are also grateful to Valeria
sition and analysis software. All data sets used passed Tiranti, Rossana Mineri and Massimo Zeviani (Unit of Molecular Neuroge-
quality control tests and probe set signals were scaled so netics, National Neurological Institute "Carlo Besta", Milano, Italy) for
that the 4%-trimmed mean of all expression values of kindly providing the HG-U133A Leigh syndrome data set.
each chip was equal to a predefined reference intensity
(called TGT) following manufacturer's recommendations: References
1. Baggerly KA, Coombes KR, Hess KR, Stivers DN, Abruzzo LV, Zhang
W: Identifying differentially expressed genes in cDNA micro-
TGT = 100 for MG-U74Av2 and HG-U133A chips and array experiments. J Comput Biol 2001, 8:639-659.
TGT = 500 for HG-U95Av2 chips. 2. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov
JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD,
Lander ES: Molecular classification of cancer: class discovery
All procedures for fitting PLGEM, for calculating observed and class prediction by gene expression monitoring. Science
PLGEM-based signal-to-noise ratios (STN), for obtaining 1999, 286:531-537.
3. Tusher VG, Tibshirani R, Chu G: Significance analysis of micro-
expected PLGEM-STN through the resampling-based arrays applied to the ionizing radiation response. Proc Natl
approach and for comparing observed with expected STN Acad Sci U S A 2001, 98:5116-5121.
values have been implemented as R functions [24] and 4. Chen Y, Dougherty ER, Bittner ML: Ratio based decisions and the
quantitative analysis of cDNA microarray images. J Biomed
will be soon submitted for integration into the Biocon- Opt 1997, 2:364-374.
ductor project [25]. 5. Ideker T, Thorsson V, Siegel AF, Hood LE: Testing for differen-
tially-expressed genes by maximum-likelihood analysis of
microarray data. J Comput Biol 2000, 7:805-817.
Authors' contributions 6. Li C, Wong WH: Model-based analysis of oligonucleotide
NP conceived the study and drafted the manuscript. MP arrays: expression index computation and outlier detection.
Proc Natl Acad Sci U S A 2001, 98:31-36.
wrote the software, participated in the design of the study 7. Rocke DM, Durbin B: A model for measurement error for gene
and in the editing of the manuscript. CV performed the expression arrays. J Comput Biol 2001, 8:557-569.
8. Tu Y, Stolovitzky G, Klein U: Quantitative noise analysis for
microarray experiments, participated in the design of the gene expression microarray experiments. Proc Natl Acad Sci U
study and the editing of the manuscript. MC participated S A 2002, 99:14031-14036.
in the microarray experiments, AS participated in the 9. Wright GW, Simon RM: A random variance model for detec-
tion of differential gene expression in small microarray
design of the algorithms, FG and PRC coordinated the experiments. Bioinformatics 2003, 19:2448-2455.
study. All authors read and approved the final manuscript. 10. Mutch DM, Berger A, Mansourian R, Rytz A, Roberts MA: The limit
fold change model: A practical approach for selecting differ-
entially expressed genes from microarray data. BMC
Additional material Bioinformatics 2002, 3:17.
11. Cancer Genomics Software: GeneCluster 2.0b Reference
Manual [http://www.broad.mit.edu/cancer/software/genecluster2/
Additional File 1 gc_ref.html]
12. Broberg P: Statistical methods for ranking differentially
expressed genes. Genome Biol 2003, 4:. epub May 29
13. Storey JD, Tibshirani R: Statistical significance for genomewide
studies. Proc Natl Acad Sci U S A 2003, 100:9440-9445.
14. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ,
Scherf U, Speed TP: Exploration, normalization, and summa-
Page 11 of 12
(page number not for citation purposes)
BMC Bioinformatics 2004, 5:203 http://www.biomedcentral.com/1471-2105/5/203
Page 12 of 12
(page number not for citation purposes)