
Gene ranking using bootstrapped P-values

2003, ACM SIGKDD Explorations Newsletter

Recent research has shown that it is possible to find genes involved in the pathogenesis of a particular condition on the basis of microarray experiments. Genes which are differentially expressed, for example between healthy and diseased tissues, are likely to be relevant to the disease under study. Some of the properties of microarray datasets make the task of finding these genes a challenging one. This paper proposes a gene-ranking algorithm whose main novelty is the use of bootstrapped P-values. We present an analysis of the algorithm, showing how it takes account of small-sample variability in observed values of the test statistic, in a way conventional statistical tests cannot. Experimental results show that our algorithm outperforms the widely-used two-sample t-test on challenging artificial data. Gene ranking is then performed on two well-known microarray datasets, with encouraging results. For example, a number of genes from one of the datasets, whose differential expression was subsequently confirmed by a more reliable biochemical analysis, are found to be ranked higher by the bootstrapped algorithm than by the conventional t-test, suggesting that the proposed algorithm may be better able to exploit the limited data available to infer biologically useful information.

S. N. Mukherjee*, P. Sykacek, S. J. Roberts
Department of Engineering Science, University of Oxford, Oxford OX1 3PJ, U.K.

S. J. Gurr
Department of Plant Sciences, University of Oxford, Oxford OX1 3RB, U.K.

* To whom correspondence should be addressed.

Sigkdd Explorations. Volume 5, Issue 2 - Page 14

Keywords: microarrays, differential expression, t-test, bootstrap

1. INTRODUCTION

In recent years a number of seminal studies [7; 9; 12] have demonstrated the feasibility of using global expression analyses to better understand various diseases. Genes relevant to the pathology under investigation are expected to be up- or down-regulated between healthy and diseased tissues. An important task in microarray data analysis is therefore identifying genes which are differentially expressed in this way.

Statistical analysis of gene expression data relating to complex diseases is of course not expected to yield results of the form 'gene X causes disease Y'. A realistic goal is to narrow the field for further analysis, to give geneticists a short-list of genes which are worth investing hard-won funds into analysing.

What makes it hard to find differentially expressed genes? Quite simply, experimental noise and biological variability. Experimental noise, including errors in fabrication, hybridization, image analysis and so on, means that the real-valued expression levels returned by a microarray experiment do not exactly reflect true mRNA levels. Biological variability refers to the natural variation we would expect to encounter even under ideal experimental conditions. That is to say, even if we could sidestep experimental issues, magically looking inside the cell and counting the RNA molecules of interest, we would still expect some variation in counts between cells in the same category. All this means we cannot simply look at expression levels of genes in diseased and healthy tissues and choose the ones which are most different, but must treat those values as random variables, and the task of gene selection as essentially statistical.

A variety of two-sample statistical tests have been applied to microarray data, including conventional [13] and non-parametric [15; 16] tests. However, with typically many thousands of genes to choose from and perhaps a few dozen to be selected, this can be a little like looking for a needle in the proverbial haystack (footnote 1).

Footnote 1: It turns out that the scale of this mismatch means that it is computationally entirely infeasible to actually consider every possible subset of genes as a candidate solution. Most research in this area (ours included) essentially looks at one gene at a time.

In this paper, taking a classical two-sample test as our starting point, we focus on accounting for small-sample variability in the observed value of the test statistic. Canonical tests do not explicitly address this issue, even when parametric assumptions hold: in light of the properties of microarray data we argue that the consequences of such variability may be considerable. We use the bootstrap [5] to take account of this variability. The method developed is based on the two-sample t-test, which is widely used in microarray analysis [13], but we emphasise that our algorithm, and many of the observations made here, generalise to other two-sample tests.

Making a brief digression, we note that the task addressed here is subtly different from that of feature selection for gene expression based classifiers [10; 18]. Two-sample tests aim to find all genes which are significantly up- or down-regulated between tissue classes; feature selection algorithms try to find genes which best explain class labels. As an example, consider a hypothetical dataset where a single gene fully explains the class labels, but a hundred genes are nonetheless consistently up-regulated in one set of tissues. A two-sample test will aim to identify all the up-regulated genes, while a feature selection algorithm should return the single explanatory gene.
The distinction is biologically important: all hundred genes may have pathological effects of interest to the investigator, despite the fact that a single gene captures the class information.

2. BACKGROUND AND MOTIVATION

Let us introduce some notation to state more clearly the questions we wish to answer. Consider microarray slides (or chips) belonging to two classes, say, healthy and diseased, with $G$ gene expression levels measured on each slide. Recent work has shown that microarray data from higher organisms are very close to log-normally distributed [11]; in order to justify the assumptions of the t-test we therefore work in a log space. The data consist of $m$ $G$-dimensional vectors $x_i$ (collectively referred to as $X$), and $n$ $G$-dimensional vectors $y_j$ (collectively referred to as $Y$). Here $m$ and $n$ are the numbers of slides in each class, and the vector elements are log expression levels. We now assume these data are drawn from two (possibly different) multivariate normal distributions $q$ and $r$ respectively:

$$x_i \sim q(x), \qquad X = [x_1, x_2, \ldots, x_m] \tag{1}$$

$$y_j \sim r(y), \qquad Y = [y_1, y_2, \ldots, y_n] \tag{2}$$

Thus, each gene has a pair of true (but unknown) class means, one from each of the distributions $q$ and $r$. Our task is to rank the genes according to how likely it is that these means are distinct. One way of accomplishing this is through the use of a test statistic, such as the t-statistic. In this paper we use the canonical form of the t-statistic throughout, assuming normally distributed data with equal but unknown variances in the two classes. We briefly present the essentials of the t-test below, emphasising the functional relationships between the data, statistic and P-value. A comprehensive account of the test can be found in most statistics textbooks, e.g. [4].

Let $t_k$ represent the t-statistic for gene $k$. $t_k$ is of course just a function of the data for the $k$th gene ($X_k$ and $Y_k$ respectively).
Let $\mu_X$ and $\mu_Y$ represent the sample means of $X_k$ and $Y_k$ respectively, $\sigma_X^2$ and $\sigma_Y^2$ the (unbiased) sample variances (footnote 2), and $T(\cdot)$ the t-statistic function. Then $t_k$ is given by:

$$t_k = T(X_k, Y_k) \tag{3}$$

$$t_k = \frac{\mu_X - \mu_Y}{\left(\frac{1}{m} + \frac{1}{n}\right)^{1/2} \left(\frac{\sigma_X^2 (m-1) + \sigma_Y^2 (n-1)}{m+n-2}\right)^{1/2}} \tag{4}$$

Footnote 2: For typographic simplicity we have taken the liberty of dropping the subscript $k$ from the means and variances in Equation 4.

The form of Equation 4 means that it is possible to analytically obtain the distribution of the statistic, which in turn allows the probability of type I errors (false positives) to be calculated. This probability is called the P-value. Under the assumptions of the canonical test, $t_k$ has a non-central t-distribution [4], with degrees of freedom $v = (m + n - 2)$ and non-centrality parameter $\psi_k$. In the special case of the distribution under the null hypothesis, $\psi_k = 0$ and $t_k$ has the familiar t-distribution with $v$ degrees of freedom.

The observed value of the statistic is thus mapped to a P-value ($p_k$) by a function (which we shall call $f$) which depends on the t-distribution. For the two-sided test being used, $f$ is given by:

$$f(t_k) = 2[1 - C_v(|t_k|)] \tag{5}$$

where $C_v(\cdot)$ represents the cumulative distribution function (cdf) for a t-distribution with $v$ degrees of freedom.

The method proposed in this paper is motivated by the following observations about the t-test:

• If the assumptions of the test hold, the function $f$ truly represents an error probability: $p_k$ is the probability of making a type I error, or false positive, if the significance level of the test is set just high enough to include gene $k$.

• But the ranking of a particular gene depends on the observed value $t_k$, which itself represents a single draw from a non-central t-distribution with unknown parameter $\psi_k$. A series of microarray experiments pertaining to the same clinical condition, with a fixed number of slides in each case, will produce varying t-statistics for the genes under study, and consequently quite different P-values and rankings.

• Thus, although the t-test captures the variability of the statistic and P-value under the null hypothesis, it cannot tell how reliably the observed values $t_k$ and $p_k$ actually represent gene $k$. In particular, if the observed statistics are atypical values under their distributions, the conclusions drawn from them may not generalise well to subsequent microarray experiments.

In classical hypothesis testing settings, the number of datapoints is relatively high, making the statistic $t_k$ and corresponding P-value $p_k$ good representatives of the $k$th feature. In contrast, the imbalance between slides and genes in microarray experiments places a considerable burden on the ability of ranking algorithms to discriminate between relevant and irrelevant genes.

[Figure 1: two panels, each plotting P-value against t-statistic. Left panel curves: $f(t)$; $E^*[f(t)]$ with var = 1; $E^*[f(t)]$ with var = 0.5. Right panel annotations: $E^*[f(t)]$, $f(E^*[t])$, $p = f(t)$, and the level $p = 0.05$.]

Figure 1: The effect of considering variation in the value of the observed t-statistic: the figure on the left shows the tail probability mass curve $f$, which maps a t-statistic to a P-value (Equation 5), and an illustration of the bootstrap P-value function $E^*[f(t)]$ (approximated following Equation 10), as functions of the bootstrap mean t-statistic $E^*[t]$, with two different variances. The figure on the right shows in detail how the bootstrap P-value relates to the estimated distribution of the t-statistic in the region of interest of the curve. In this illustrative example, the observed statistic $t$ is located to the right of the bootstrap mean $E^*[t]$; considering the variability in $t$ we thus find a P-value considerably higher than the conventional one.

3. METHODS

As shown in Equation 5, the P-value is a statistic of the data; we use the bootstrap [5] to obtain an estimate of its value. The bootstrap is a widely-used resampling technique, by which an empirical estimate of the distribution of a statistic of interest can be obtained by repeatedly computing its value from datasets sampled with replacement from the original. Let $E^*[F(Z)]$ represent the bootstrap average of a function $F$ of data $Z$:

$$E^*[F(Z)] \equiv \lim_{B \to \infty} \frac{1}{B} \sum_{b=1}^{B} F(Z^{*b}) \tag{6}$$

where the $Z^{*b}$'s are datasets obtained by resampling $Z$ with replacement. In practice, $B$ is set to a large finite value; in all our experiments $B = 500$.

Following Equations 5 and 6, the bootstrap estimate of the P-value for gene $k$, $p_k^*$, is given by:

$$p_k^* \equiv E^*[f(t_k)] = \frac{1}{B} \sum_{b=1}^{B} f\left(T(X_k^{*b}, Y_k^{*b})\right) \tag{7}$$

where $X_k^{*b}$ and $Y_k^{*b}$ represent data for gene $k$ from the $b$th bootstrap iteration and $T(\cdot)$ the t-statistic function. The bootstrap mean of $t_k$ may be obtained in a similar manner. We note that as the tail probability mass $f$ (Equation 5) is a non-linear function, its bootstrap average will not, in general, be identical to either the observed P-value $p_k$, or the tail probability mass corresponding to the bootstrap mean of $t_k$, $f(E^*[t_k])$.

Statistical power analysis [2] is superficially similar to our method, in that it explicitly deals with the distribution of the test statistic $t_k$. However, the aim of such analysis is quite different, namely to quantify the probability of type II error at a given significance level.

Previous applications of resampling, and the bootstrap in particular, to testing, have included P-value adjustments and non-parametric tests [3], as well as multiplicity corrections [17] (also, in the context of microarray analysis [6]). Our algorithm is closest in spirit to bootstrap P-value adjustments [1; 8], insofar as it treats the P-value itself as the statistic of interest.

Analysing the bootstrapped P-value

A useful approach to understanding bootstrapped P-values is to explicitly think of the P-value $p$ (we drop the subscript $k$ for clarity) and its bootstrap estimate $p^*$ as realisations of random variables. Let $P$ represent the P-value, and $g(P)$ its density function; $P^*$ represents the bootstrap estimate and $h(P^*)$ the corresponding density. Following [14] we approximate $g(P)$ by a beta distribution with parameters $\xi$ ($0 < \xi \le 1$) and 1. Thus:

$$g(p \mid \xi) = \xi p^{\xi - 1} \tag{8}$$

Viewed in this way, conventional tests draw a conclusion, as to whether a gene is differentially expressed or not, on the basis of a single draw from the P-value density $g(P)$; errors in the procedure can be thought of as arising due to the uncertainty in $g(P)$ (footnote 3). Under the null hypothesis, $\xi = 1$, and the P-value is uniformly distributed in the range [0, 1]; consequently the probability of type I error at a given significance level $\alpha$ does not depend on any unknowns and is simply equal to $\alpha$. Under the alternative hypothesis, the parameter $\xi$ is unknown. The probability of type II error then depends on $\xi$ and is given by:

$$\Pr(P > \alpha \mid H_1) = 1 - F_g(\alpha; \xi) \tag{9}$$

where $F_g$ is the cdf corresponding to the P-value density $g$ in Equation 8. In general, for fixed significance level $\alpha$, this second error probability rises with the parameter $\xi$.

Footnote 3: A simple example of a less uncertain density: if $g$ was a delta function located at the true mean of $P$, a single draw would unerringly reveal which hypothesis was correct.

The effect of taking account of variation in the t-statistic via the bootstrap is that a draw from the density function of the bootstrapped P-value, $h(P^*)$, is more likely to be close to the true mean of $P$ than a draw from the density of the standard P-value, $g(P)$. Indeed, we find via simulation that the average deviation of the bootstrapped P-value $P^*$ around the true mean of $P$ (i.e., $(E[(P^* - E[P])^2])^{1/2}$) is lower than the corresponding figure for the conventional P-value (i.e., the standard deviation of $P$, $(E[(P - E[P])^2])^{1/2}$).

We further illustrate the operation of the algorithm by making some simplifying assumptions, obtaining a qualitative picture of the interplay between the various quantities involved. We assume that the bootstrap distribution of the t-statistic is approximately Gaussian, with mean $\mu_t^*$ (defined as the bootstrap mean of $t$, $E^*[t]$) and variance $\sigma^2$. Under these assumptions, the bootstrap P-value $p^*$ can be thought of as being the integral of the product of the Gaussian with the tail probability mass $f$ (Equation 5). The integral is then itself a function, $\hat{f}^*$, of $\mu_t^*$ and $\sigma^2$; for a given variance, $\hat{f}^*$ can be thought of as mapping a bootstrap mean t-statistic $\mu_t^*$ to an expected P-value:

$$\hat{f}^*(\mu_t^*, \sigma) = \int_{-\infty}^{\infty} f(x)\, N(x; \mu_t^*, \sigma^2)\, dx \tag{10}$$

We simulate Equation 10 directly, by sampling from the Gaussian and computing mean P-values via the function $f$ (Equation 5). Figure 1 shows the integral version of the bootstrapped P-value $\hat{f}^*$ (with variances 1 and 0.5) and $f$, as functions of the bootstrap mean t-statistic $\mu_t^*$. The right side of the figure shows an illustrative example for a single (hypothetical) gene with the observed t-statistic $t = 2.5$ and the corresponding bootstrap mean $\mu_t^* = 2.25$. The form of the function $f$ makes it clear why variation around high absolute values of $t$ has little effect on P-values, but for moderate values, fluctuations in $t$ can profoundly affect the final P-value and ranking of the gene. Many genes of interest have moderate t-statistics and are highly sensitive to the exact observed value. For example, if the 3000 genes analysed from the colon cancer dataset [12] are arranged in descending order of absolute bootstrap mean t-statistics $|\mu_t^*|$, only the 72 highest ranked genes have $|\mu_t^*|$ in excess of 6.

The bootstrap distribution of the t-statistic on which the P-value of Equation 7 is based is purely empirical. Under the assumptions of a canonical t-test, however, the form of the distribution is known. We note that an alternative approach to the one taken above would thus be to estimate the non-centrality parameter $\psi$ via the bootstrap, using the estimated value to obtain the appropriate non-central t-distribution. Using suitable approximations, the integral of Equation 10 could then be evaluated to obtain a P-value for each gene. One advantage of the approach we have taken is that it generalises easily to the non-parametric case: the cdf $C_v$ used in Equation 5 need only be replaced by an empirical (for example, permutation-based) cdf. The P-value could then be obtained as before from Equation 7.

4. RESULTS

[Figure 2: % correct (0 to 100) plotted against noise level (1 to 3) for two methods, the t-test and the bootstrapped t-test.]

Figure 2: Results of bootstrapped and conventional t-tests on artificial data. The score reported is the proportion of two hundred iterations in which the algorithm in question ranked the correct features in the top two places. The bootstrapped test is able to take account of small-sample variability in the observed statistic and outperforms the conventional test at various noise levels.

4.1 Artificial data

We assess the ability of the proposed algorithm to detect differentially expressed genes on artificially generated data. Six-dimensional data in two classes are generated from six pairs of univariate Gaussians, only two of which have distinct means. The class variances are equal and are made to vary incrementally between 1 and 3, to simulate increasing noise levels; the numbers of samples in the two classes are 10 and 5. We choose a relatively small number of samples to mimic microarray data; to account for small-sample effects in the results we report average scores over 200 iterations.
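As an illustration of how Equations 4, 5 and 7 fit together with this artificial-data setup, the following Python sketch may help. It is not the authors' implementation, and several details are assumptions made only for the example: the separation `delta` between the class means of the two informative features (the text does not state it), resampling each class independently for each feature, the seed, all function names, and the use of a simple numerical integration for the t-distribution cdf so that only the standard library is needed.

```python
import math
import random


def t_statistic(x, y):
    """Pooled-variance two-sample t-statistic, as in Equation 4."""
    m, n = len(x), len(y)
    mean_x, mean_y = sum(x) / m, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (m - 1)  # unbiased sample variance
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    pooled = (var_x * (m - 1) + var_y * (n - 1)) / (m + n - 2)
    diff = mean_x - mean_y
    if pooled == 0.0:  # degenerate resample: all values identical within each class
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1.0 / m + 1.0 / n))


def t_cdf(t, dof, steps=200):
    """Student-t cdf via Simpson's rule, after substituting t = sqrt(dof) * tan(theta)."""
    theta = math.atan(t / math.sqrt(dof))
    norm = math.exp(math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)) / math.sqrt(math.pi)
    h = theta / steps  # steps must be even for Simpson's rule
    total = 1.0 + math.cos(theta) ** (dof - 1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (dof - 1)
    return 0.5 + norm * total * h / 3.0


def p_value(t, dof):
    """Two-sided P-value f(t) = 2[1 - C_v(|t|)], as in Equation 5."""
    return max(0.0, 2.0 * (1.0 - t_cdf(abs(t), dof)))


def bootstrap_p_value(x, y, n_boot=500, rng=random):
    """Average of f(T(x*, y*)) over bootstrap resamples, as in Equation 7."""
    dof = len(x) + len(y) - 2
    total = 0.0
    for _ in range(n_boot):
        xb = rng.choices(x, k=len(x))  # assumption: resample within each class
        yb = rng.choices(y, k=len(y))
        total += p_value(t_statistic(xb, yb), dof)
    return total / n_boot


def make_data(variance, delta=2.0, m=10, n=5, n_features=6, rng=random):
    """Six features; only features 0 and 1 differ in mean (by the assumed `delta`)."""
    sd = math.sqrt(variance)
    X = [[rng.gauss(0.0, sd) for _ in range(m)] for _ in range(n_features)]
    Y = [[rng.gauss(delta if k < 2 else 0.0, sd) for _ in range(n)]
         for k in range(n_features)]
    return X, Y


def top_two_correct(p_values):
    """True if the two smallest P-values belong to features 0 and 1."""
    order = sorted(range(len(p_values)), key=p_values.__getitem__)
    return set(order[:2]) == {0, 1}


def run_experiment(variance, n_iter=200, n_boot=500, seed=0):
    """Fraction of iterations in which each method ranks the right features on top."""
    rng = random.Random(seed)
    hits_t = hits_boot = 0
    for _ in range(n_iter):
        X, Y = make_data(variance, rng=rng)
        dof = len(X[0]) + len(Y[0]) - 2
        hits_t += top_two_correct([p_value(t_statistic(x, y), dof) for x, y in zip(X, Y)])
        hits_boot += top_two_correct([bootstrap_p_value(x, y, n_boot, rng) for x, y in zip(X, Y)])
    return hits_t / n_iter, hits_boot / n_iter
```

Calling `run_experiment(variance)` for variances between 1 and 3 would give one pair of scores per noise level. Because `delta` and the resampling scheme are assumed, the numbers produced by this sketch should not be expected to reproduce the reported results; in pure Python the full setting (200 iterations, B = 500) is also slow, so smaller values of `n_iter` and `n_boot` are convenient for a first look.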
At each iteration, 15 samples are drawn from the Gaussians and passed to two ranking algorithms: a conventional t-test and the proposed algorithm, performance being assessed in terms of the proportion of runs in which the two highest ranked features are the correct ones. Repeatedly drawing data in this way provides an estimate of generalisation accuracy under small-sample conditions: the proposed algorithm scores ∼6.5% higher than the t-test, averaged over a range of variances and all 200 runs. Results are shown in Figure 2.

4.2 Microarray data

Datasets

We select two well-known and widely analysed microarray datasets: the colorectal cancer data of Notterman et al. [12] and the leukaemia data of Golub et al. [7]. The colorectal cancer dataset consists of 36 labeled slides with 6600 complementary DNAs (cDNAs) and expressed sequence tags (ESTs) represented on each. The leukaemia data consists of a training set of bone marrow samples taken from patients suffering from Acute Myeloid Leukaemia (AML) and Acute Lymphoid Leukaemia (ALL), and a separate test set with bone marrow as well as peripheral blood samples. The training set contains data from 38 samples taken from patients with AML or ALL, and the test set 34 patient samples. Expression levels are given for 7129 genes/ESTs. For this work we use only the test dataset.

Table 1: The top five genes identified by the bootstrapped t-test from a colon cancer dataset: as a consequence of the extremely high absolute values of the t-statistic for these genes, the rankings do not vary much between the two algorithms.

| Rank | Acc. No. | Description | Bootstrapped P-value | Rank under t-test |
|---|---|---|---|---|
| 1 | M77836 | Human pyrroline 5-carboxylate reductase mRNA | $8.41 \times 10^{-9}$ | 2 |
| 2 | M83670 | Human carbonic anhydrase IV mRNA | $1.32 \times 10^{-8}$ | 1 |
| 3 | U17077 | Human BENE mRNA | $2.67 \times 10^{-8}$ | 3 |
| 4 | T96548 | Actin, Gamma-enteric smooth muscle (Homo sapiens) | $5.07 \times 10^{-8}$ | 4 |
| 5 | T64297 | Fatty acid binding protein, liver (Homo sapiens) | $6.07 \times 10^{-8}$ | 5 |

Table 2: Genes, whose differential expression was subsequently confirmed by RT-PCR, ranked by bootstrapped and conventional P-values. In most cases, the bootstrap algorithm ranks these genes higher than the t-test; the average number of places between the ranks is 25.5.

| Bootstrap rank | Rank under t-test | Acc. No. | Description |
|---|---|---|---|
| 76 | 117 | X54489 | Human gene for melanoma growth stimulatory activity (MGSA) |
| 43 | 60 | U22055 | Human 100 kDa coactivator mRNA |
| 166 | 261 | L23808 | Human metalloproteinase (HME) mRNA |
| 158 | 212 | H50438 | M-Phase inducer phosphatase 2 (Homo sapiens) |
| 52 | 72 | X54942 | Homo sapiens CKSHS2 mRNA for CKS1 protein homologue |
| 6 | 5 | M97496 | Homo sapiens guanylin mRNA |
| 47 | 62 | X64559 | Homo sapiens mRNA for tetranectin |
| 50 | 49 | L02785 | Homo sapiens colon mucosa-associated (DRA) mRNA |
| 93 | 83 | X86693 | Homo sapiens mRNA for hevin like protein |

Pre-processing

We pre-process the gene expression data according to current practice [19], removing within-slide location by changing from absolute to relative expression values. As noted previously, microarray data from higher organisms are very close to log-normally distributed [11]; we therefore transform the data into a log-space.

Results

Colorectal cancer data: Table 1 shows P-values and identities of the five highest ranked genes identified by the bootstrap algorithm. These genes have extreme t-statistics, so as expected, their ranks under both algorithms are similar. Rankings returned by our algorithm and the t-test are noticeably different across the entire dataset: on average, there are 65 places between the positions of the same genes; when the top 100 genes returned by the bootstrap algorithm are considered, the average displacement reduces to 15. The difference between the results of the tests thus lies in the positions assigned to genes with high, but not extreme, t-statistics.
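The average displacement quoted for the two rankings can be computed as a mean absolute difference in rank. The sketch below assumes that "places between the positions" means the absolute rank difference for each gene, and that each method supplies one P-value per gene in a common gene order; the function names are hypothetical.

```python
def ranks(p_values):
    """Rank genes by P-value (1 = smallest); ties are broken by gene index."""
    order = sorted(range(len(p_values)), key=lambda g: (p_values[g], g))
    rank = [0] * len(p_values)
    for position, gene in enumerate(order, start=1):
        rank[gene] = position
    return rank


def mean_displacement(p_boot, p_ttest, top=None):
    """Mean |bootstrap rank - t-test rank|, optionally over the bootstrap top `top` genes."""
    r_boot, r_t = ranks(p_boot), ranks(p_ttest)
    genes = [g for g in range(len(r_boot)) if top is None or r_boot[g] <= top]
    return sum(abs(r_boot[g] - r_t[g]) for g in genes) / len(genes)
```

With `top=None` the average runs over the entire dataset; `top=100` restricts it to the 100 genes ranked highest by the bootstrap algorithm, matching the two figures reported in the text.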
Genes with high but not extreme t-statistics may be of great practical importance: indeed, from a biological perspective the aim of microarray experiments (which are high-throughput but noisy) is essentially to guide further investigation. More accurate transcript abundance analyses, for example quantitative real-time RT-PCR (footnote 4), can be used to confirm differential expression. RT-PCR is too expensive to be used to assess every gene; thus one major objective of microarray data analysis is to identify a subset of genes for such assessment. Table 2 compares the ranks of genes whose differential expression was subsequently confirmed by RT-PCR [12]. In most cases we find the proposed algorithm ranks these genes higher than the t-test, suggesting that if used to select a subset for further assessment it is more likely to uncover relevant genes.

Leukaemia data: Table 3 shows P-values and identities of the five highest ranked genes identified by the bootstrap algorithm on the leukaemia dataset. Once again, the extreme t-statistics of these top-ranked genes mean that their ranks under both algorithms are similar. For this dataset there are, on average, 244 places between the ranks assigned to the same genes by the two algorithms; when only the top 100 genes returned by the bootstrap algorithm are considered, the average displacement is 9. Table 4 shows a selection of highly ranked genes whose ranks were higher under the bootstrapped test than the t-test. Some of these genes have been implicated in other studies too: for example, the Human myeloperoxidase gene has recently been found to be ranked much higher, compared to the t-test, by well-founded methods including information gain and a one-dimensional support vector machine [20].
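The selection rule stated in the caption of Table 4 (genes in the top 100 under the bootstrap test and ranked at least fifteen places higher than by the t-test) amounts to a simple filter over the two rankings. The sketch below assumes each ranking is available as a mapping from accession number to rank; the function name and the input layout are hypothetical.

```python
def promoted_genes(rank_boot, rank_ttest, top=100, min_gain=15):
    """Return (bootstrap rank, t-test rank, accession) for genes that are in the
    bootstrap top `top` and sit at least `min_gain` places higher than under the t-test."""
    picked = [
        (rank_boot[acc], rank_ttest[acc], acc)
        for acc in rank_boot
        if rank_boot[acc] <= top and rank_ttest[acc] - rank_boot[acc] >= min_gain
    ]
    return sorted(picked)  # ordered by bootstrap rank
```

For instance, a gene ranked 39 by the bootstrap test and 56 by the t-test (the first row of Table 4) passes the filter, while a gene ranked 10 and 12 respectively does not.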
Footnote 4: Reverse transcription polymerase chain reaction.

Table 3: The top five genes identified by the bootstrapped t-test from the leukaemia dataset: once again, the extremely high absolute values of the t-statistic for these genes mean that the rankings do not vary much between the two algorithms.

| Rank | Acc. No. | Description | Bootstrapped P-value | Rank under t-test |
|---|---|---|---|---|
| 1 | D26361 | Homo sapiens KIAA0042 | $2.09 \times 10^{-4}$ | 1 |
| 2 | X64330 | Homo sapiens ATP-citrate lyase | $5.00 \times 10^{-4}$ | 3 |
| 3 | U28758 | Human NMDA receptor subtype 2B subunit (GRIN2B) mRNA, partial cds | $5.05 \times 10^{-4}$ | 4 |
| 4 | J03171 | Human interferon-alpha/beta receptor alpha chain precursor | $5.06 \times 10^{-4}$ | 5 |
| 5 | U95006 | Human D9 splice variant A mRNA | $6.34 \times 10^{-4}$ | 2 |

Table 4: Comparative results using the bootstrap and conventional t-tests on the leukaemia dataset. The genes shown were in the top 100 under the bootstrap test, and were ranked at least fifteen places higher than by the t-test.

| Bootstrap rank | Rank under t-test | Acc. No. | Description |
|---|---|---|---|
| 39 | 56 | U49957 | LIM protein (LPP) mRNA, partial cds |
| 41 | 60 | Y10313 | Nerve growth factor-inducible PC4 homologue |
| 58 | 79 | AB000450 | Homo sapiens VRK2 |
| 62 | 102 | X63578 | Homo sapiens gene for Parvalbumin |
| 66 | 84 | L38951 | Importin beta subunit |
| 72 | 103 | U20536 | Cysteine protease Mch2 isoform alpha (Mch2) |
| 75 | 107 | Z46376 | HK2 mRNA for hexokinase II |
| 77 | 140 | S70609 | Glycine transporter type 1b [human, substantia nigra, mRNA, 2364 nt] |
| 83 | 98 | U26648 | STX5A Syntaxin 5A |
| 85 | 124 | U61734 | Human protein trafficking protein (clone S31i125) |
| 96 | 144 | M19508 | MPO from Human myeloperoxidase gene |
| 99 | 135 | U07563 | Human proto-oncogene tyrosine-protein kinase (ABL) gene, exon 1b and intron 1b, and putative M8604 Met protein (M8604 Met) gene |

5. CONCLUSIONS

In this paper we have proposed a novel gene ranking method based on bootstrapped P-values, and shown that it can successfully account for small-sample effects in the observed test statistic for a gene.
While it is premature to draw definitive biological conclusions from our results, experiments on both artificial and real data suggest that our algorithm is better able to deal with the level of uncertainty inherent in microarray data than a classical two-sample test. In particular, results concerning the comparative ranks of genes from the colon cancer dataset [12] whose differential expression was confirmed using RT-PCR (Table 2) are encouraging, and suggest that the proposed algorithm may be able to guide further investigation more accurately than the t-test. In essence, our method obtains a more accurate P-value at the cost of computational efficiency, but we feel that in this particular domain compute-time should not be an over-riding concern: with sometimes millions of dollars being spent on designing experiments and acquiring data, a few extra minutes or even hours of processing should be acceptable if better results can be obtained!

Clearly, many questions remain to be addressed. Further theoretical analysis is required to fully understand the distributional properties of the bootstrapped P-value. Our results are promising but preliminary; a thorough empirical evaluation of the proposed algorithm, potentially using a different two-sample test, as well as further investigation of the biological impact of the results reported here, will be informative. Also, the extension of the method proposed to a fully non-parametric setting may prove useful in analysing data, for instance from lower organisms, which do not conform to the assumptions made here.

Acknowledgements

SNM gratefully acknowledges the support of the Biotechnology and Biological Sciences Research Council (BBSRC); thanks also to Dr. Sayan Mukherjee.

6. REFERENCES

[1] R. Beran. Prepivoting Test Statistics: A Bootstrap View of Asymptotic Refinements. Journal of the American Statistical Association, 83(403):687–697, 1988.

[2] J. Cohen. Statistical power analysis for the behavioral sciences. Lawrence Erlbaum, Hillsdale, NJ, 1988.

[3] A. C. Davison and D. V. Hinkley. Bootstrap Methods and their Applications. Cambridge University Press, 1997.

[4] M. H. DeGroot and M. J. Schervish. Probability and Statistics. Addison Wesley, 3rd edition, 2002.

[5] B. Efron. Bootstrap methods: another look at the jackknife. Ann Stat., 7:1–26, 1979.

[6] Y. Ge, S. Dudoit, and T. P. Speed. Resampling-based multiple testing for microarray data analysis. Technical Report 633, Department of Statistics, University of California, Berkeley, 2003.

[7] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

[8] P. H. Hall and M. A. Martin. On Bootstrap Resampling and Iteration. Biometrika, 75:661–671, 1988.

[9] I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, O. P. Kallioniemi, B. Wilfond, A. Borg, and J. Trent. Gene-Expression Profiles in Hereditary Breast Cancer. N. Engl. J. Med., 344(8):539–548, 2001.

[10] S. Hochreiter and K. Obermayer. Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers. In S. T. Suzanna Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 16, Cambridge, MA, 2003. MIT Press.

[11] D. C. Hoyle, M. Rattray, R. Jupp, and A. Brass. Making sense of microarray data distributions. Bioinformatics, 18:576–584, 2002.

[12] D. Notterman, U. Alon, A. J. Sierk, and A. J. Levine. Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Research, 61(7):3124–30, 2001.

[13] W. Pan. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics, 18:546–554, 2002.

[14] T. Sellke, M. J. Bayarri, and J. O. Berger. Calibration of P-values for testing precise null hypotheses. The American Statistician, 55:62–71, 2001.

[15] O. G. Troyanskaya, M. E. Garber, P. O. Brown, D. Botstein, and R. B. Altman. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18:1454–1461, 2002.

[16] V. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98:5116–5121, 2001.

[17] P. H. Westfall and S. S. Young. Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. John Wiley & Sons, 1993.

[18] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature Selection for SVMs. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 668–674. MIT Press, 2001.

[19] Y. H. Yang, S. Dudoit, P. Luu, and T. P. Speed. Normalization for cDNA microarray data. Technical report 589, Department of Statistics, University of California, Berkeley, 2001. Available at: http://statwww.berkeley.edu/tech-reports/index.html.

[20] Yang Su, T. M. Murali, V. Pavlovic, M. Schaffer, and S. Kasif. RankGene: identification of diagnostic genes based on expression data. Bioinformatics, 19(12):1578–1579, 2003.