Mathematical Statistics
Stockholm University

Genetic association studies with complex ascertainment

Maria Grünewald

Research Report 2004:5
Licentiate thesis
ISSN 1650-0377

Postal address: Mathematical Statistics, Dept. of Mathematics, Stockholm University, SE-106 91 Stockholm, Sweden. Internet: http://www.math.su.se/matstat

Mathematical Statistics, Stockholm University, Research Report 2004:5, http://www.math.su.se/matstat

Genetic association studies with complex ascertainment

Maria Grünewald*

May 2004

Abstract

In genetic association studies outcome dependent sampling is often used in order to increase power. When analyzing the data, correction for the ascertainment scheme generally has to be made to avoid bias. Such correction is however not available in standard statistical methods when the data structure and/or the ascertainment scheme is complex. In this report three simulation based approaches that can be used for correction of known ascertainment schemes are described. These methods provide parameter estimates and are flexible in terms of what statistical models and ascertainment schemes can be handled. Some simulations are conducted to evaluate the methods.

KEY WORDS: genetic association study, diabetes, metabolic syndrome, ascertainment, outcome dependent sampling, importance sampling, stochastic EM

* Postal address: Mathematical Statistics, Stockholm University, SE-106 91, Sweden. E-mail: [email protected].

Acknowledgements

Financial support from STINT, SSF, Nancy Pedersen and Peter Arner is gratefully acknowledged.

First of all I want to thank Keith Humphreys for supervising me. Keith has put a lot of effort into guiding me through the licentiate and he is a very nice person to work with. I would also like to thank my formal supervisor Juni Palmgren for introducing me to the area of statistical genetics, and Nancy Pedersen for co-supervising me. I got some very valuable feedback on the report from Ola Hössjer, and Iza Roos helped me to get the genetics right. Thank you!

I would like to thank all my colleagues, both at Mathematical Statistics at Stockholm University and in the Biostatistics group at MEB, KI. Special thanks go to Esbjörn Ohlsson for being a very good boss, to Christina Nordgren for making things work, to Roland Orre for always being fun and for using his super-powers to fix my computer, to my friends in the "statbiol" group for proving how much fun work can be, and to Gudrun Jonasdottir, Anna Svensson and Alexandra Ekman for cheering me up and for being such nice friends. ...and if anybody finds that I have thanked them more than once, that is because you deserve it.

A warm thank you also goes to everybody at the Department of Biostatistics at Harvard School of Public Health for making me feel welcome when I was an exchange student in Boston.

...but most of all I would like to thank my family for all their love and support: my parents Arne Rydberg and Rakel Grünewald, my beloved fiancé Erik Nilsson, and his parents Lars-Göran and Gun. I would also like to thank all my friends, especially Viveka Seitz and Linda Wickman for immense support and patience!

Contents

1 Introduction
2 Background of ascertainment
  2.1 Sampling strategies in epidemiology
  2.2 Likelihoods of data under non-random ascertainment
  2.3 Some special cases where correction for non-random ascertainment is straightforward
  2.4 Disadvantages of categorizing continuous variables
  2.5 Comorbidity
3 Background of genetic association studies
  3.1 Terminology and basic concepts of genetics
  3.2 Genetic association studies
  3.3 Study design of genetic association studies
  3.4 Statistical analysis of association studies
  3.5 Example of complex ascertainment in genetic association studies of unrelated individuals
4 Estimation under complex ascertainment schemes
  4.1 Importance sampling
  4.2 A data augmentation approach to the ascertainment problem
  4.3 Other sampling based algorithms, missing data
  4.4 Simulations
  4.5 Results of simulations
5 Discussion
A Power calculations for dichotomized data
B Information matrix in Clayton's method

Notation

A    Ascertainment indicator
AA, Aa, aa    Genotype outcomes
df    Degrees of freedom
G    Genotype score (exposure)
H0    Null hypothesis
HA    Alternative hypothesis
L    Likelihood
l    Log likelihood
M    Number of simulated observations
n    Sample size
P    Probability
Ph    Phenotype (outcome)
w    Weight
Y    Data
α    Significance level
β0X    Intercept for X
βX1X2    Effect of X1 on X2
θ    True parameter value
θ′    Assumed parameter value or starting value
θ̂    Estimated parameter value
π    Probability of success
σ    Standard deviation

Abbreviations

BMI    Body mass index
DNA    Deoxyribonucleic acid
EM    Expectation maximization
HWE    Hardy-Weinberg equilibrium
OR    Odds ratio
RR    Relative risk
SDPP    Stockholm Diabetes Prevention Program
SEM    Stochastic EM
SNP    Single nucleotide polymorphism
SRS    Simple random sample
TDT    Transmission disequilibrium test
WHO    World Health Organization

1 Introduction

Statistical efficiency is a major concern in both study design and analysis of data. This report is concerned with the design and analysis of genetic association studies. Typically the costs of such studies are high and statistical power low, so making efficient use of available resources is of great importance. One approach to increasing efficiency is to assign differential ascertainment probabilities according to specific characteristics of the sample units. An example of such a scheme is the case-control study, where ascertainment probabilities depend on disease status. The sampling scheme must however be taken into account in the analysis. If it is not corrected for, bias will generally result.

The importance of correcting for the sampling scheme when estimating parameters in human genetics studies has been known for a long time. Fisher (1934) pointed out the need for ascertainment correction as early as 1934 and also commented on the lack of such correction in genetic studies. When the data structure is simple or only testing is of interest, ascertainment correction is sometimes straightforward or even unnecessary, but for more complex problems there is still no consensus on how to proceed. Unfortunately the lack of easily accessible statistical tools to correct for ascertainment sometimes causes investigators to ignore the ascertainment scheme and perform analyses that are statistically biased. It may also prevent the investigator from using an ascertainment scheme that would be beneficial to the power of the analysis.
An aim of this report is to describe the benefits of using differential ascertainment probabilities in genetic association studies, and to stress the need for flexible methods for ascertainment correction. Some background on ascertainment is provided in Section 2. Different likelihoods for ascertained data are presented in Section 2.2 and some methods to correct for ascertainment are outlined in Section 2.3. The disadvantage of categorizing continuous variables to facilitate analysis is discussed in Section 2.4. Section 3 contains some background on genetic association studies. In Section 3.1 terminology and basic concepts of genetics are described and in Section 3.2 genetic association studies are introduced. Study design and ascertainment schemes in genetic association studies are discussed in Section 3.3 and some commonly used methods to analyze association studies are presented in Section 3.4. A data-set concerning the metabolic syndrome is described in Section 3.5; this data-set was collected using a complex ascertainment scheme and is presented in order to emphasize the need for statistical methods which can handle complex data structures.

A further aim of this report is to present and evaluate some estimation methods which correct for complex ascertainment. Three simulation based methods are described in Section 4. These methods can handle more general data structures and ascertainment schemes than the methods described in Section 2.3. In Section 4.4 some simulations which were performed to evaluate these methods are presented. The results of these simulations are discussed in Section 4.5.

The work in this report was inspired by sampling schemes which have been used in genetic association studies. The methods presented in Section 4 are however not restricted to the analysis of genetic association studies, but could be useful whenever sampling is performed in a complex manner. For readers interested in other areas than genetic association studies it is possible to skip Section 3 without any major loss of understanding of the statistical reasoning.

2 Background of ascertainment

The word ascertain means "to discover with certainty, as through examination or experimentation" (Houghton-Mifflin-Company 1993). In statistics ascertainment refers to measuring variables on study units. The ascertainment probability is the probability for a study unit to be included in a particular sample. This probability may be the same for all units in the study population, as in a simple random sample (SRS), or may depend on some characteristics of the study units. In this report ascertainment probabilities will depend on the value of the outcome variable, but not on the exposure. Ideally the ascertainment probabilities are known, and even controlled, by the investigator. In this report the word ascertainment refers to selection with unequal selection probabilities, where these probabilities will be assumed to be known to the investigator, and is considered to be distinct from situations where subjects that are selected are not observed. Such non-response may be the case for example when working with human volunteers, since they are free to refuse to participate. The failure to observe data on selected units can sometimes be handled by missing data methodology such as multiple imputation, see for example Rubin & Schenker (1991) for an overview. Heckman (1979) handles similar problems in the area of econometrics.
The concept of ascertainment was originally introduced into genetics in family studies where selection probability typically depended on the number of affected children in families. This sampling scheme increases the efficiency whilst retaining a valid test, since under the null the test statistic distribution is invariant to the sampling scheme. But, as Fisher (1934) pointed out for segregation analysis, the ascertainment scheme will bias estimates if not accounted for. This is true for association studies as well: if the ascertainment scheme is ignored in the analysis, it will generally lead to biased prevalence estimates, biased effect estimates and biased variance estimates. Distributional assumptions for the residuals are also likely to be unrealistic. In some testing situations the consequence of non-random ascertainment can be an increased false positive rate. For an example in the genetics context see Smoller, Lunetta & Robins (2000).

How to select subjects to increase power in genetic studies has been discussed by several authors. Morton & Collins (1998) discuss the benefits of different approaches of how to select cases and controls in genetic association studies. For example, cases and controls can be chosen from the extremes of the distribution, or selection probability can be decided by a combination of disease status and some other variable, such as age at onset of the disease, family history of the disease or some environmental exposure. Which selection scheme is optimal depends on the relationship between the gene and the studied trait. Selection on extreme values of the outcome in genetic studies has also been discussed by for example Purcell, Cherny, Hewitt & Sham (2001), who investigate optimal selection of sib-pairs for linkage analysis, and Schork, Nath, Fallin & Chakravarti (2000), who conclude that sampling from extremes may give a substantial increase in power in studies with unrelated individuals. Allison, Heo, Schork, Wong & Elston (1998) do however argue that selection on extreme values is not always optimal in genetic studies. In simulations they observe that under some genetic models sampling on extremes reduces power. There may also be biological reasons for not sampling on extreme values of the outcome; sometimes there is reason to believe that really extreme values are the result of rare environmental factors, such as accidents.

In this section some background for the statistical issues of analyzing non-randomly ascertained samples will be presented. First, in Section 2.1, some background on sampling strategies in epidemiology will be provided. Different likelihoods for the ascertained data will then be presented in Section 2.2 and then some methods for ascertainment correction are presented in Section 2.3. These methods do however only work in special cases or are not very efficient. Continuous outcomes often complicate analysis and it is often tempting to categorize variables to facilitate the analysis. Categorizing continuous outcomes does however have disadvantages, such as loss of power. In Section 2.4 the consequences of categorizing continuous outcomes are described in more detail. Another complication in the analysis is comorbidity, the association of diseases. The implications of comorbidity will be described briefly in Section 2.5. The diabetes data-set that will be presented in Section 3.5 is an example of a situation where both continuous outcomes and comorbidity complicate the analysis.
The methods of analysis for association studies that typically appear in the literature cannot handle this level of complexity.

The statistical issues presented in this section are not specific to genetic association studies and the section can be read without knowledge about such. For notational coherency some notation referring to genetic terminology will however be used: G will be used to denote exposure (genotype score) and Ph will be used to denote outcome (phenotype). The words genotype and phenotype will be defined in Section 3.1.

2.1 Sampling strategies in epidemiology

Epidemiology is "the branch of medicine that deals with the causes, distribution, and control of disease in populations" (Houghton-Mifflin-Company 1993). Genetic association studies can be considered a particular subclass of epidemiological studies. In epidemiology the outcome of interest is often a rare disease, such as a specific kind of cancer. A simple random sample (SRS) would thus have to be very large to attain reasonable statistical efficiency and power. To reduce costs but maintain power a study can be designed so that the inclusion probabilities of individuals depend in some way on their particular characteristics. The most common non-random ascertainment scheme that is employed in epidemiology is the case-control study, where affected individuals are assigned probabilities of being ascertained that are higher than those assigned to unaffected individuals. One advantage of the case-control design is that under a specific statistical model the non-random ascertainment scheme can be ignored at the data analysis stage without getting biased effect estimates; this is further discussed in Section 2.3. Disease status is often straightforward to assess, as is typically the case when sampling is done from registers. If the disease status of subjects is unknown at the sampling stage, another variable that is associated with the disease is sometimes used to determine sampling probabilities. Studies where ascertainment probabilities are determined by more than one variable are not uncommon.

Sometimes non-random ascertainment is not a result of design but of inability to sample according to a desired sampling scheme. An example of this is when cases and controls are sampled from different populations. In an ideal situation the study population can be chosen directly from a well-defined study base. In some countries, such as Sweden, there are population registries that facilitate sample selection, but often such registries are not available, or the cost of sampling from them would be too large. Cases are sometimes sampled from hospital admissions for a specific disease, and it is then very difficult to define the study base in order to select controls. A variant of the case-control study that is often used to avoid selection bias is the matched case-control study, where cases and controls are matched in pairs or larger groups on the basis of some characteristic of the subjects. Common matching variables in epidemiology are gender and age. Ethnicity is a relevant matching variable in genetic studies if available. One way to match for ethnicity is to choose controls within the same family as the case. The matching always has to be taken into account in the analysis, and it is not possible to estimate the effect of a matching variable.
A problem in case-control studies is that controls are often chosen from patients having other diseases, meaning that a discovered disease marker could just as well be a marker for not having the other disease. There is also the possibility that there is differential non-response in for example different ethnic groups. Since non-response is often larger in controls than in cases, this can also introduce bias. Non-random sampling due to anything but a planned and well-documented sampling scheme cannot be corrected for by statistical analysis. The importance of documentation of the sampling procedure should be stressed, since there is often neither a practical nor an economic reason not to do this, yet it is often not done.

2.2 Likelihoods of data under non-random ascertainment

In likelihood-based analysis of data, ascertainment can be handled in a number of different ways: four different likelihoods for data with outcome-dependent ascertainment are written below as (2.1)-(2.4). The genetic exposure is denoted G, the outcome is denoted Ph, and A is an indicator variable signifying whether ascertainment has/has not occurred. The conditional likelihood, (2.4), is appropriate only for discrete outcomes whilst likelihoods (2.1), (2.2) and (2.3) are appropriate for continuous as well as discrete outcomes.

Prospective likelihood

L(θ; Ph, G) = P(Ph | G, A = 1, θ)    (2.1)

Retrospective likelihood

L(θ; Ph, G) = P(G | Ph, A = 1, θ) = P(G | Ph, θ)    (2.2)

Joint likelihood

L(θ; Ph, G) = P(Ph, G | A = 1, θ)    (2.3)

Conditional likelihood for matched data

L(θ; Ph, G) = P(Ph | G, \sum Ph, θ)    (2.4)

Kraft & Thomas (2000) compare the efficiency of the different likelihoods for family based case-control studies and conclude that the conditional likelihood is the least efficient of the four, and that the joint likelihood is the most efficient. The relative efficiency of the prospective and the retrospective likelihoods varied with data structure and genetic model. Kraft & Thomas (2000) also describe properties of the likelihoods: the prospective likelihood and the joint likelihood demand explicit modelling of the ascertainment rule while the conditional likelihood and the retrospective likelihood do not. For the retrospective likelihood this is however only true when the ascertainment probability does not depend on the gene or on modelled covariates. Another disadvantage of the retrospective likelihood is that it is not always possible to obtain parameter estimates of the effect of exposure on outcome. It is possible to write the retrospective likelihood in terms of P(Ph | G, θ) as in (2.5), but the model is only identifiable under specific parameterizations (Chen 2003).

P(G | Ph, θ) = \frac{P(Ph | G, θ) P(G | θ)}{P(Ph | θ)} = \frac{P(Ph | G, θ) P(G | θ)}{\sum_G P(Ph | G, θ) P(G | θ)}    (2.5)

The conclusion is that, in general, the two likelihoods, (2.1) and (2.3), that require information about the ascertainment scheme, are unfortunately the only ones that are flexible enough to handle complex data structures and complex ascertainment schemes.

2.3 Some special cases where correction for non-random ascertainment is straightforward

For binary outcome data odds ratios (OR) are commonly used to analyze data under non-random ascertainment. An odds ratio for exposure i compared with a reference exposure j is defined as

OR_i = \frac{\pi_i/(1-\pi_i)}{\pi_j/(1-\pi_j)}    (2.6)

where π_i is the probability of success for exposure i. If there are differential ascertainment probabilities depending on outcome, odds ratios will still give unbiased estimates.
If we denote the probability of success for exposure i in the ascertained sample by π_i*, proportional to π_i P(A = 1 | success), then the odds ratio calculated from the sample will be

\frac{\pi_i^*/(1-\pi_i^*)}{\pi_j^*/(1-\pi_j^*)} = \frac{\pi_i P(A=1|\mathrm{success}) / [(1-\pi_i) P(A=1|\mathrm{failure})]}{\pi_j P(A=1|\mathrm{success}) / [(1-\pi_j) P(A=1|\mathrm{failure})]} = \frac{\pi_i/(1-\pi_i)}{\pi_j/(1-\pi_j)}    (2.7)

which is the same as the population odds ratio.

Binary outcomes can also be modelled with logistic regression. The logistic link models the odds and shares the odds ratio's convenient property of giving unbiased effect estimates in case-control sampling (Prentice & Pyke 1979). The intercept is biased under non-random ascertainment even when a logistic link is used and can thus not be used to estimate prevalence of disease. The logistic link is the only link function that will give unbiased effect estimates without taking the ascertainment into account. Kagan (2001) proves this by comparing the likelihood under simple random sampling, (2.8), with the likelihood under an ascertainment scheme, (2.9); for both of these sampling schemes a prospective likelihood is applied.

P(Ph | G = g) = \prod_{i=1}^{n} \{h(\beta_{0Ph} + \beta_{GPh} g)\}^{Ph_i} \{1 - h(\beta_{0Ph} + \beta_{GPh} g)\}^{1-Ph_i}    (2.8)

P(Ph | G = g, A = 1) = \frac{P(A = 1 | Ph = ph) P(Ph | G = g)}{P(A = 1)} = \prod_{i=1}^{n} \frac{\{h(\beta_{0Ph} + \beta_{GPh} g)\}^{Ph_i} \{r(1 - h(\beta_{0Ph} + \beta_{GPh} g))\}^{1-Ph_i}}{h(\beta_{0Ph} + \beta_{GPh} g) + r(1 - h(\beta_{0Ph} + \beta_{GPh} g))}    (2.9)

where h(\beta_{0Ph} + \beta_{GPh} g) is the inverse of the link function and r = P(A = 1 | Ph = 0, g)/P(A = 1 | Ph = 1, g). Kagan (2001) concludes that these two likelihoods are equal, except for the intercept, if and only if the link function is of the form h(u) = \frac{\exp(\lambda + \mu u)}{1 + \exp(\lambda + \mu u)} for some λ and µ. If λ = 0 and µ = 1 this gives h(\beta_{0Ph} + \beta_{GPh} g) = \frac{\exp(\beta_{0Ph} + \beta_{GPh} g)}{1 + \exp(\beta_{0Ph} + \beta_{GPh} g)}. The bias of the intercept will be -\frac{1}{\mu} \log(r).

Neuhaus (2000) describes how link functions can be adjusted to correct for ascertainment in binary regression models. The correction is based on modelling a prospective likelihood as in (2.1). The link function is corrected by replacing the mean by a function of the mean and the sampling probabilities. Neuhaus (2002) also specifies what the bias will be when ascertainment is ignored for some common non-logistic link functions, and concludes that this bias can be substantial.

A simple, and more general, way to get unbiased estimates from data under non-random ascertainment would be to weight each observation with w_i = 1/(inclusion probability of subject i), see for example Armitrage & Colton (1999). In likelihood terms the weighted log likelihood contribution of individual i then is w_i \log(L(θ; y_i)). This method works for continuous outcomes as well as categorical. Weighting will however not give fully efficient results since the observations contribute with an unequal amount of information to the estimates.

Example of weighting: Linear regression

Since using the usual estimating equation for linear regression

\sum_i (Ph_i - (\beta_{0Ph} + \beta_{GPh} g_i))    (2.10)

will give biased results in data with non-random ascertainment, a weighted regression solving a weighted estimating equation is used:

\sum_i (Ph_i - (\beta_{0Ph} + \beta_{GPh} g_i)) w_i    (2.11)

where w_i is proportional to 1/(inclusion probability of subject i). This will give unbiased results but is not fully efficient, since the highest efficiency of a weighted regression is obtained when w_i is proportional to 1/(variance of subject i), see for example Armitrage & Colton (1999). If the variance is equal for all individuals, then the efficiency is highest if all weights are also equal. □
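To make the weighting idea concrete, the following sketch (not taken from the report) simulates outcome-dependent ascertainment and compares a naive linear regression with one weighted by the inverse inclusion probabilities. It is a minimal illustration only: the parameter values, the simple threshold sampling rule and the use of Python with numpy and statsmodels are assumptions made for the example.

```python
# Illustration only: a minimal sketch of the weighting approach of Section 2.3,
# with invented parameter values and an invented threshold sampling rule.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_pop, beta_0, beta_G = 200_000, 0.0, 0.5

# Population: genotype score G ~ Bin(2, 0.2), phenotype Ph ~ N(beta_0 + beta_G*G, 1)
G = rng.binomial(2, 0.2, n_pop)
Ph = beta_0 + beta_G * G + rng.normal(size=n_pop)

# Outcome-dependent ascertainment: keep everyone with Ph above a cut-off,
# and only a fraction p0 of those below it (probabilities known by design).
cut, p0 = 1.0, 0.05
p_incl = np.where(Ph >= cut, 1.0, p0)
A = rng.uniform(size=n_pop) < p_incl

X = sm.add_constant(G[A])
y = Ph[A]

# Naive analysis ignoring the sampling scheme: biased estimates.
naive = sm.OLS(y, X).fit()

# Weighted analysis, w_i = 1 / (inclusion probability of subject i):
# approximately unbiased, although not fully efficient.
weighted = sm.WLS(y, X, weights=1.0 / p_incl[A]).fit()

print("naive    :", naive.params.round(3))
print("weighted :", weighted.params.round(3))
print("true     :", [beta_0, beta_G])
```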
2.4 Disadvantages of categorizing continuous variables

Continuous variables are often categorized for computational simplicity, reduced cost, or due to a lacking understanding of the nature of the variable of interest. Categorization is very common in case-control studies since treating the outcome as binary allows a logistic regression model to be used without further ascertainment correction. However, there is an information loss in the categorization of continuous variables and this often leads to an unacceptably high power loss. Cohen (1983) compares the product-moment correlation between two normally distributed variables with the correlation when one of the variables is categorized and concludes that the reduction in correlation is about 20 percent when the data is split at the median, and even larger when the categories are of unequal size. If more than one variable is dichotomized the reduction in correlation follows a more complicated pattern and Cohen's formula should not be used (Vargha, Rudas, Delaney & Maxwell 1996). The reduction in efficiency has also been investigated in applications, for example by Neale, Eaves & Kendler (1994) who compare the power of continuous and categorized traits in genetic twin studies. If the dichotomized variable is a confounder, a variable that affects the outcome of interest and that is also associated with the exposure, the information loss due to the dichotomizing can lead to insufficient confounder correction. Insufficient correction due to dichotomizing is discussed by Vargha et al. (1996).

Since the focus of this report is on ascertainment, it would be of interest to see how efficiency is affected by categorization when data is non-randomly ascertained. If individuals are chosen from the extremes of the outcome distribution, as discussed by for example Morton & Collins (1998), dichotomization is not likely to make a big difference, since there is little variation within the groups. If individuals are instead chosen from the whole range of the outcome variable the result of categorizing is less obvious. To illustrate how dichotomizing outcomes can affect power in an association study with non-random ascertainment the following example is considered.

Example: Dichotomizing an underlying continuous phenotype

This example is designed so that the statistical model has a similar structure to that commonly estimated in genetic association studies. Some genetic vocabulary will be used to complement the statistical explanation. The genetic concepts are defined in Section 3.1. The model is represented by Figure 2.1, where arrows indicate a directed causal relationship.

[Figure 2.1: Data with ascertainment on phenotype]

An explanatory variable, G, is assumed to be categorical and an outcome, Ph, is assumed to be conditionally normally distributed. Ascertainment, A, is dependent on the value of Ph according to a scheme described below. The explanatory variable has three categories based on values from a binomial distribution Bin(2, 0.2). In genetic vocabulary this could be a biallelic SNP with an allele frequency of 0.2. The categories of the explanatory variable are then genotypes AA, Aa, aa. The relative effects of the three genotypes on the outcome variable determine the genetic model. Here the scores that are used are G = (0, 0, 1) for a recessive genetic effect and (0, 0.5, 1) for an additive genetic effect. For a dominant effect, (0, 1, 1) is used.
The outcome Ph is assumed to be conditionally normally distributed with a mean that depends on G, Ph ∼ N(β_{GPh} g, 1), where β_{GPh} = 0.5. In genetic terms we would call Ph a phenotype. Individuals that have a phenotypic value over some cut-off are considered cases and all other individuals are controls. We suppose that the ascertainment scheme is such that the expected numbers of cases and controls are equal regardless of cut-off, that is

P(A = 1 | Ph < cut-off) = \frac{1 - P(Ph < cut-off)}{P(Ph < cut-off)}    (2.12)

and

P(A = 1 | Ph ≥ cut-off) = 1.    (2.13)

Simple random samples are also considered. The cut-off values are described in terms of quantiles, that is, the proportion of the data that are controls for a SRS, P(Ph < cut-off). The normal distribution of the phenotype is used for computational simplicity although in real data phenotypes do not always have this characteristic. It is also worth noting that if the genetic data is from a marker that is associated with the gene rather than a coding part of the gene, the phenotypes given the genetic data will be a mixture of distributions, as a result of the misclassification in the genotype. It is likely that the reduction in power due to categorization will be smaller when the data is not normally distributed and the results below should be interpreted with that in mind.

Power is calculated for detecting a genetic effect on the phenotype for both continuous phenotypes and dichotomized phenotypes. The power is calculated for a log likelihood ratio test with one degree of freedom, T² = 2[l_A − l_0], where l_0 is the log likelihood under the null hypothesis (β_{GPh} = 0) and l_A the log likelihood under the alternative (β_{GPh} ≠ 0). Since the data has a simple structure, modelling of the joint likelihood, defined in (2.3), is possible. For the details of how the power calculations were performed, see Appendix A.

In Figure 2.2 it can be seen that the power using the dichotomized phenotype is lower than using the continuous. This reduction in power is seen for each of the applied ascertainment schemes but the difference is most pronounced in the simple random sample, especially when a high cut-off is used. For both continuous and dichotomized phenotypes there is a gain in power in using the differential ascertainment probabilities compared with the SRS. This gain is larger the more extreme the cut-off point that is chosen. For high cut-off points an ascertainment sample with a dichotomized phenotype will give a higher power than a continuous variable in a SRS, but for a lower cut-off this will not be the case. □

[Figure 2.2: Power for different cut-off values for Example: Dichotomizing an underlying continuous phenotype. Panels show the additive, dominant and recessive models, with power for n = 100 plotted against the cut-off quantile for four settings: SRS with continuous phenotype, SRS with dichotomized phenotype, ascertained sample with continuous phenotype, and ascertained sample with dichotomized phenotype.]
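A rough Monte Carlo version of this kind of comparison is sketched below; the report itself uses the analytical calculations described in Appendix A. For simplicity the sketch only considers an SRS, uses Wald tests in place of the likelihood ratio test, and fixes an arbitrary cut-off and number of replicates; the Python libraries (numpy, statsmodels) are assumptions made for the illustration.

```python
# Illustration only: Monte Carlo power of continuous vs dichotomized analysis
# under an SRS, a simplified stand-in for the Appendix A calculations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, beta_G, cut_q, n_rep, alpha = 100, 0.5, 0.8, 1000, 0.05

def one_replicate():
    g = rng.binomial(2, 0.2, n) * 0.5          # additive scores (0, 0.5, 1)
    ph = beta_G * g + rng.normal(size=n)       # continuous phenotype
    x = sm.add_constant(g)
    p_cont = sm.OLS(ph, x).fit().pvalues[1]    # test using the continuous phenotype
    y = (ph >= np.quantile(ph, cut_q)).astype(float)  # dichotomize at the cut-off
    try:
        p_dich = sm.Logit(y, x).fit(disp=0).pvalues[1]
    except Exception:                          # e.g. separation in small samples
        p_dich = 1.0
    return p_cont < alpha, p_dich < alpha

hits = np.array([one_replicate() for _ in range(n_rep)])
print("power, continuous phenotype  :", hits[:, 0].mean().round(3))
print("power, dichotomized phenotype:", hits[:, 1].mean().round(3))
```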
2.5 Comorbidity

A complication when correcting for ascertainment is comorbidity, the association of two or more diseases. Robins, Smoller & Lunetta (2001) argue that if comorbidity is present in data with non-random ascertainment it can result in non-valid tests. Robins et al. (2001) use causal directed acyclic graphs to show when tests are valid/not valid, and to illustrate how conditioning will affect the validity of tests. The tests discussed are TDT tests (Terwilliger & Ott 1992), which are tests of whether parental genetic material is inherited in different proportions in cases and controls.

An example of comorbidity is the metabolic syndrome which will be described in Section 3.5. In the metabolic syndrome there are many outcomes that may affect each other to different extents and in different directions. Here we will however only consider an extremely simplified model of the metabolic syndrome, as in Figure 2.3, incorporating only BMI (body mass index) and plasma glucose level. We will also assume that there is no causal effect of plasma glucose level on BMI, but only of BMI on plasma glucose level. Let the genotype score G affect both outcomes and let ascertainment probability A depend on both outcomes.

[Figure 2.3: Data with comorbidity]

Now consider a testing situation, investigating whether there is an effect of the gene on plasma glucose. Figure 2.4 illustrates how the model would look without such an effect. Even when there is no causal effect of the gene on plasma glucose the gene and plasma glucose will be dependent through BMI, so we will have to condition on BMI to obtain a valid test for the direct relationship.

[Figure 2.4: Data under H0: No effect of G on plasma glucose]

If we are instead testing whether there is a causal effect of the gene on BMI the situation is more complicated. If there is no such effect, as in Figure 2.5, the gene and BMI will be dependent through plasma glucose and the ascertainment scheme. Conditioning on plasma glucose is however not a solution; such a conditioning would instead introduce dependence between the gene and BMI. In situations like this other methods could instead be considered, such as modelling the ascertainment corrected joint or prospective likelihood as suggested in Section 2.2.

[Figure 2.5: Data under H0: No effect of G on BMI]

3 Background of genetic association studies

3.1 Terminology and basic concepts of genetics

In this section some fundamental concepts concerning the structure of DNA, chromosomes and genes, which are relevant to the central theme of this report, are described.

The genetic code consists of strands of nucleotides, which consist of phosphate, sugar and bases. The bases in the nucleotides bind in pairs so that the nucleotides build a double helix that forms a chromosome. There are four bases in DNA: adenine (a), guanine (g), cytosine (c) and thymine (t). In the base-pairs adenine always binds to thymine on the opposite strand and guanine binds to cytosine. The human genome consists of more than 3,000,000,000 base-pairs, see for example Strachan & Reed (1999). Triplets of these base pairs code for the 20 amino acids that are used to build the about 90,000 different kinds of proteins that our bodies produce. A segment of bases that code for a protein is called a gene. Most gene products in the human genome are identical in all individuals. Sometimes when gene products vary between individuals, some variants are harmful and may cause disease. In this report statistical methods that can be used for identifying and characterizing genes with disease causing variants are considered.

Humans have 23 chromosome pairs. A location on a chromosome is called a locus (pl. loci). The locus can be either a single nucleotide or a string of nucleotides.
Different variants that are present in the population at a specific locus are called alleles; if there are only two variants the locus is biallelic. When an individual inherits DNA from his/her parents one copy of each chromosome is inherited from each parent. This means that for every individual there are two alleles at each locus; the number of a specific allele observed at a locus is thus 0, 1 or 2. The combined outcome of the two alleles at a locus is called a genotype. The two allele outcomes are often assumed to be independent and the number of a specific allele at a biallelic locus can then be considered to be binomially distributed with two trials, p denoting the population allele frequency. In the case of more than two alleles genotypes will be multinomially distributed under independence. The independence property is referred to as Hardy-Weinberg equilibrium (HWE). It can be shown that genotype frequencies stabilize at Hardy-Weinberg proportions after a single mating, starting with any genotype frequencies that are equal in males and females, if assumptions are met about infinite population size, discrete generations and no selection, migration or mutation. Exceptions to HWE can result if a sample is derived from a mixture of different populations (population stratification), if there is non-random mating or if the probability of ascertainment or survival is not independent of genotype. If a genotype consists of two copies of the same allele it is homozygous and otherwise heterozygous.

Traits resulting from genotypes are called phenotypes. In this report the word phenotype will be used to indicate an observed variable in a broad sense. The phenotype can for example appear in a causal pathway for some final endpoint. Phenotypes that are considered final endpoints could be for example disease status, and phenotypes in a causal pathway of the endpoint could for example be the produced level of a specific protein.

When an individual inherits a chromosome from a parent it is not one of the two parental copies; each parental chromosome recombines on average about 1.5 times, see for example Strachan & Reed (1999), so that the inherited chromosome is a patchwork of the parental chromosomes. The locations at which parental chromosomes recombine differ from generation to generation so that after several generations only small fragments of the original chromosomes remain and only bases that are located close together are inherited together. Recombination does not occur uniformly over chromosomes and distances between loci are often described in terms of recombination rather than physical distance, see for example Strachan & Reed (1999). Dependence between loci is called linkage disequilibrium.

It is typically not feasible to collect information on whole areas of the genome. Loci are instead selected which have previously been confirmed to be polymorphic, that is, where there exists variation between individuals. Areas that are segregating, but not necessarily coding for the gene of interest, are called genetic markers. When searching for a gene the hope is that markers are either in a coding part of the gene or are in linkage disequilibrium with the gene. Commonly used genetic markers are single nucleotide polymorphisms (SNPs) and microsatellites. A SNP is a variation between individuals in a single nucleotide base.
The mapping of SNPs has progressed rapidly and currently approximately 2.7 million SNPs have been mapped (Carlson, Eberle, Rieder, Smith, Kruglyak & Nickerson 2003), most having been discovered in recent years. Microsatellites consist of tandem repeats of between one and five base-pairs. Microsatellites are more informative than SNPs since there are many possible alleles, but there are not as many microsatellites in the genome as there are SNPs, see for example Strachan & Reed (1999).

Combinations of alleles from different loci which reside on the same chromosome are called haplotypes. Typically genetic markers are measured one at a time so that it is not always possible to infer haplotype phase with certainty, that is, which alleles belong together on the same chromosome. When information about parental genotype is available phase can sometimes be inferred with certainty by observing which combinations are possible to inherit. Otherwise, when phase is unknown, the estimation of haplotypes can be viewed as a missing data problem and classical statistical methods can be used, see for example Excoffier & Slatkin (1995).

The way in which phenotype is related to genotype is referred to as penetrance. In the early days of genetics a few traits were found to be determined by single uncommon genes with full penetrance, that is, the gene fully determined the phenotype. Genes like these are often called Mendelian after the monk Gregor Mendel who is often referred to as the father of genetics. Mendel studied traits in Moravian peas, such as color and texture (Mendel 1865), which are typical examples of traits that are determined by full penetrance genes. The full penetrance genes are relatively easy to identify since they create recognizable inheritance patterns in families, and the gene locations were often found via linkage analysis. Linkage analysis is based on samples of related subjects and information about gene location is obtained by observing if the marker and the phenotype are inherited together in relatives. Most genes in humans do however not have full penetrance. A trait is called complex if multiple genes and/or environmental factors determine the trait. To be of clinical relevance a gene that affects a complex trait typically has to have a disease predisposing variant with a higher allele frequency than a full penetrance gene does. The relative efficiency of the study designs depends among other things on the frequency of the disease causing allele and the penetrance. Rare alleles with big effects are detected most efficiently in related individuals while for common, low penetrance genes, unrelated individuals are well suited.

As mentioned there are three possible genotypes, (AA, Aa, aa), at a single biallelic locus. If AA and Aa have the same effect on the phenotype the genetic model is referred to as recessive, while if Aa and aa have the same effect on the phenotype the genetic model is referred to as dominant. There are also models where Aa has an effect intermediate between those of AA and aa. Such a genetic model is called co-dominant. In statistical analysis the co-dominant model will also be referred to as an additive or multiplicative model, depending on how effects are parameterized. To avoid making assumptions about the nature of the relationship between genotype and phenotype two variables can be used to describe the dependence instead of one. In this report the co-dominant model will be assumed most of the time but the methods described can be used for any genetic model.
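The following small sketch (not from the report) illustrates the Hardy-Weinberg genotype proportions and the score codings just described; the allele frequency 0.2 mirrors the example in Section 2.4 but is otherwise an arbitrary choice, as is the use of Python with scipy.

```python
# Illustration only: HWE genotype proportions at a biallelic locus and the
# genotype score codings for the recessive, additive and dominant models.
from scipy.stats import binom

p = 0.2  # population frequency of the risk allele a (assumed value)

# Under HWE the number of copies of allele a is Binomial(2, p), so
# P(AA), P(Aa), P(aa) = (1 - p)^2, 2p(1 - p), p^2.
hwe = {g: round(binom.pmf(k, 2, p), 3) for k, g in enumerate(["AA", "Aa", "aa"])}

# Genotype scores over (AA, Aa, aa) for the three genetic models in the report.
scores = {
    "recessive": {"AA": 0.0, "Aa": 0.0, "aa": 1.0},
    "additive":  {"AA": 0.0, "Aa": 0.5, "aa": 1.0},
    "dominant":  {"AA": 0.0, "Aa": 1.0, "aa": 1.0},
}

print("HWE genotype frequencies:", hwe)
for model, s in scores.items():
    print(model, s)
```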
3.2 Genetic association studies

This report covers some methods which are potentially of use for fine-mapping and characterization of genetic factors. We focus on what are known as genetic association studies. For a review of genetic association studies see Hirschhorn, Lohmueller, Byrne & Hirschhorn (2002). We concentrate on studies of unrelated individuals, but will discuss the use of related individuals in Section 3.3.1. We do not cover linkage analysis, which is more relevant for mapping genes to larger regions of the genome.

The aim of genetic association studies is to find or characterize relationships between genes and phenotypes. The analysis is carried out by comparing phenotype distributions between persons with different genotypes. Typically the exact location of interest on the genome is not known and genetic markers, such as SNPs, are measured instead of the gene of interest. To interpret the results of an association study it is important to understand which mechanisms can lead to association between marker and phenotype. The most desired reason is that the marker is causally related to the phenotype. It is also possible that the marker is in linkage disequilibrium with a causal gene. There may also be sources of bias in the data; population stratification, which was mentioned in Section 3.1, is an example of this. Comorbidity, which was discussed in Section 2.5, may also bias the analysis. Some of the sources of bias in epidemiological studies, like recall bias (differential recollection of events in cases and controls), are however not an issue, since genotype is constant over life and since measurement is not affected by subjective judgement.

3.3 Study design of genetic association studies

In recent years the fast development in the technology for analyzing genetic samples has created great optimism for finding and characterizing genetic causes of diseases. Statistical power and precision of genetic studies of complex diseases in humans are however typically low, since the gene of interest is only one of several possible reasons for getting the disease. Genetic factors are likely to be individually of small importance and as a consequence it is important that study designs are well thought through. In gene characterization estimates of gene effects are relevant and it may also be of importance to disentangle the effect of related phenotypes and environmental factors to understand the biological effect of the gene. If gene characterization is of interest, the study should be designed with this in mind so that power is calculated for realistic scenarios, all variables of interest are measured and the structure of the data allows estimation of effects. How sampling strategies can be used to increase power was described in Section 2.1, and in Section 3.3.1 designs with unrelated individuals will be compared with designs with related individuals.

3.3.1 Related or unrelated individuals?

The difficulties of gene characterization and of identifying genetic associations are clearly apparent, as is pointed out by for example Terwilliger & K.M. (2003). The structure of the data material will determine how well these difficulties are dealt with; in particular it is important whether the selected individuals are known to be related to each other or are 'unrelated' subjects from the population. If the studied population consists of more than one ethnic subgroup, population stratification may be a problem in unrelated individuals.
Bias can be introduced by population stratification if the subgroups differ in prevalence of the studied disease and in allele frequency of the studied marker, even though the marker is not in linkage disequilibrium with the gene. In family data population stratification will not pose a problem since the information about association between gene and trait comes from within families.

Practical considerations will also often affect the choice of design. In family based studies it is desirable to have multiple affected individuals in each family. If the phenotype is complex with multiple measurements this is practically difficult to obtain. For late onset diseases it is also often hard to find family data since the parents and siblings of the affected individual are unlikely to be alive. On the other hand, relatives of persons affected by a disease are often more willing to participate in a study than unrelated individuals. In reality persons labelled as unrelated do have common ancestors, generally so many generations back that the nature of the kinship is not known. In family data the shared regions in the genome will be large since few recombination events will have happened in closely related individuals, while in unrelated individuals the shared regions will be small since many recombination events will have occurred in the chromosomes since the common ancestors. For that reason studies of related individuals are used to search large areas for a gene location while unrelated individuals are used for fine-mapping when candidate areas are already identified.

In family studies effect estimates are generally not meaningful since subjects are deliberately sampled in such a manner that individuals that are likely to have a large gene effect are included in the study (Burton 2003). In unrelated individuals effect estimates can be computed, but for multiple-testing reasons there is reason to be cautious about the interpretation of the results if combined with testing. If several studies investigate the same effect, the studies that by chance had the highest effect estimates are most likely to get significant results. Given that almost only significant results get published it is likely that the sizes of the effects in published studies are somewhat inflated.

3.4 Statistical analysis of association studies

In this section some fundamentals of statistical methods used in the analysis of association studies are briefly described. Let the probability of disease in a case-control setting be π_AA, π_Aa and π_aa, given genotype (AA, Aa, aa). The effect of the genotype on the risk of disease can for example be described either with a relative risk or an odds ratio. For both measures a reference group has to be chosen. If we let AA be the reference group the relative risk for genotype kl compared with the reference is

RR_kl = \frac{\pi_{kl}}{\pi_{AA}}    (3.1)

for π_kl = probability of disease given genotype kl. A multiplicative model in terms of relative risks would mean RR_aa = RR_Aa². The odds ratio, which was mentioned in Section 2.3, is in similar notation as the relative risk

OR_kl = \frac{\pi_{kl}/(1-\pi_{kl})}{\pi_{AA}/(1-\pi_{AA})}.    (3.2)

Asymptotic confidence intervals can be calculated for the odds ratios if a normal distribution is assumed for the logarithm of the odds ratio (Balding, Bishop & Cannings 2001). While relative risk is a somewhat more intuitive measure, odds ratios are convenient since the estimate of the odds ratio will be unbiased in a case-control setting, as described in Section 2.3.
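The sketch below (not from the report) works through (3.1), (3.2) and the asymptotic interval on the log odds ratio scale for a small genotype-by-disease table. The counts are invented and are treated as coming from a simple random sample, so that the risks are meaningful; under case-control ascertainment only the odds ratios would remain directly interpretable, as discussed in Section 2.3. Python with numpy is an assumption made for the illustration.

```python
# Illustration only: RR, OR and 95% Wald intervals for log OR, AA as reference.
import numpy as np

cases    = np.array([ 30,  45, 25])     # diseased, by genotype (AA, Aa, aa) -- made up
controls = np.array([270, 205, 75])     # non-diseased, by genotype (AA, Aa, aa)

risk = cases / (cases + controls)       # estimated pi_kl
rr = risk / risk[0]                     # relative risk vs AA, as in (3.1)
odds = cases / controls
odds_ratio = odds / odds[0]             # odds ratio vs AA, as in (3.2)

# 95% Wald interval: se(log OR) = sqrt(1/a + 1/b + 1/c + 1/d)
for i, g in enumerate(["Aa", "aa"], start=1):
    se = np.sqrt(1 / cases[i] + 1 / controls[i] + 1 / cases[0] + 1 / controls[0])
    lo, hi = np.exp(np.log(odds_ratio[i]) + np.array([-1.96, 1.96]) * se)
    print(f"{g}: RR = {rr[i]:.2f}, OR = {odds_ratio[i]:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```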
If the disease is rare for all genotypes, odds ratios in the sample will give very similar values to relative risks in the population (Balding et al. 2001).

Discrete phenotypes are often modelled with logistic regression. Most commonly a logistic link function is used but other link functions can be considered; for example, if the outcome is a dichotomized normal variable a probit link would be appropriate. The advantage of the logistic link in case-control data was discussed in Section 2.3. When phenotypes are continuous, standard statistical methods such as linear regression are often used, but since these methods require ascertainment correction continuous phenotypes are sometimes categorized to avoid this problem. This approach is however not efficient, as was discussed in Section 2.4.

Another way of testing for association is to use a log likelihood ratio test to test if a null hypothesis, such as H0: β_{GPh} = 0 where β_{GPh} is the effect of the gene on the phenotype, can be rejected in favor of some alternative hypothesis, such as HA: β_{GPh} ≠ 0. The test statistic is of the usual form

2[l_A - l_0]    (3.3)

where the log likelihoods l_A and l_0 are computed using the maximum likelihood estimates of the parameters. Under the alternative hypothesis all parameters are estimated, while under the null the parameters hypothesized about are fixed. The test statistic will be asymptotically chi-square distributed with degrees of freedom equal to the number of parameters used to describe the genetic effect (Balding et al. 2001).

3.5 Example of complex ascertainment in genetic association studies of unrelated individuals

There are numerous examples of genetic association studies where a complex ascertainment scheme has been used. Here one such study, concerning the metabolic syndrome, will be described briefly. The work presented in this report is inspired by this data-set and the goal of the evaluation of the methods in Section 4 is to find ways in which this kind of data can be analyzed in practice. The intention has been that the methods should be flexible enough to handle both discrete and continuous variables, correction for confounding and complex ascertainment.

The metabolic syndrome consists of a number of co-dependent phenotypes like insulin production and sensitivity, glucose levels, cholesterol levels, BMI, body fat distribution and hypertension. Diseases closely connected with these phenotypes are for example diabetes and coronary heart disease. The dependence between the phenotypes is complex and not yet fully disentangled; they may affect each other and they may have common causes. Common causes could be for example fetal malnutrition or genetic effects (Stern 1995). While some relationships between phenotypes are likely to be causal, like the effect of BMI on diabetes, others are more controversial. Jarrett (1984) argues that while there is a statistical association between diabetes and risk for coronary heart disease, it is more likely that the two diseases have a common cause than that diabetes affects the risk for coronary heart disease. Lifestyle factors such as diet and exercise also have a big impact on the metabolic syndrome. Since diabetes and coronary heart disease are common diseases, research about the metabolic syndrome is of high relevance to public health. One study concerning the metabolic syndrome is the Stockholm Diabetes Prevention Program (SDPP).
For a more detailed description of the SDPP see for example Agardh, Ahlbom, Andersson, Efendic, Grill, Hallqvist, Norman & Ostenson (2003) or Gu, Abulaiti, Ostenson, Humphreys, Wahlestedt, Brookes & Efendic (2004). A part of the SDPP is to study genes which are believed to affect the metabolic syndrome and also to describe the effect of the genes on the different phenotypes to increase understanding of the biological mechanisms.

As indicated in Figure 3.1, ascertainment probability depended on BMI, fasting glucose and 2 hour fasting glucose. The selection on the last two variables was done in two stages: first, persons with known diabetes or impaired glucose tolerance were excluded, since they were likely to be on medication affecting these values; then a selection was made oversampling persons that qualified for a diabetes diagnosis or an impaired glucose tolerance diagnosis based on measurements made in the study. Fasting glucose and 2 hour fasting glucose are used to diagnose diabetes and impaired glucose tolerance according to the WHO diagnostic criteria for diabetes: a plasma glucose level of at least 7.8 or a 2 hour fasting glucose level of at least 11.1 gives a diabetes diagnosis, while a fasting glucose level of less than 7.8 combined with a 2 hour fasting glucose level between 7.8 and 11.1 gives an impaired glucose tolerance diagnosis. For controls there was an over-sampling of persons with a low BMI. Using this ascertainment scheme 500 controls, 339 persons with impaired glucose tolerance and 106 persons with diabetes were selected.

[Figure 3.1: Study on the metabolic syndrome. Variables: gene; body mass index (BMI); fasting glucose; 2 hour fasting glucose; fasting insulin; 2 hour fasting insulin; diagnosis known to subject; diagnosis from lab values; ascertainment.]

Fasting plasma glucose level and 2 hour plasma glucose level are measured by separating plasma from blood taken from a fasting subject and then measuring the amount of glucose in the plasma, and BMI is calculated by dividing body weight in kilograms by the squared height in meters. Other phenotypes, for example fasting insulin and 2 hour fasting insulin, were also of interest in the analysis.

4 Estimation under complex ascertainment schemes

In this section some methods that can be used for estimation of parameters in statistical models, accounting for ascertainment, will be described. We will here assume that the probability that a unit is ascertained is known given phenotype, but typically this probability is estimated based on external information, for example registry data. As before, we use G to denote genotype score, Ph to denote phenotype and A to represent an indicator variable signifying that ascertainment has/has not occurred. Ph can be multivariate. We assume that the distribution of Ph conditional on G is parameterized by θ and that the probability of ascertainment is independent of θ given the observed data, that is P(A = 1 | G, Ph, θ) = P(A = 1 | G, Ph). Furthermore, we assume that ascertainment is independent of G conditional on Ph, P(A = 1 | G, Ph) = P(A = 1 | Ph). The extension to let ascertainment depend on G is straightforward. We represent the dependence graphically as in Figure 4.1.

[Figure 4.1: Data with ascertainment on phenotype]

Where the likelihood of the data under non-random ascertainment is modelled, we will deal with the joint likelihood, (2.3), rather than the prospective likelihood, (2.1).
The methods described below are not restricted to the joint likelihood, but it is convenient to model the data jointly for treatment of missing data on G, via the EM algorithm (Dempster, Laird & Rubin 1977). An example of when this is of importance is in the estimation of haplotypes (Excoffier & Slatkin 1995). The possibility of extending the methods to handle haplotype estimation will however not be investigated in this report.

The joint likelihood of the data (G, Ph) that is ascertained is

L(θ) = f(G, Ph | θ, A = 1) = \frac{P(A = 1 | Ph) f(G, Ph | θ)}{P(A = 1 | θ)}    (4.1)

which corresponds to the log likelihood

\log(L) = \log(P(A = 1 | Ph)) + \log(f(G, Ph | θ)) - \log(P(A = 1 | θ))    (4.2)

where log(P(A = 1 | Ph)) does not depend on the model parameters. The complicated form of this likelihood makes standard likelihood-based estimation difficult. The computational problem is essentially that P(A = 1 | θ) = \int P(A = 1 | Ph) f(G, Ph | θ) \, dPh is typically intractable, which is for example the case when Ph, conditional on G, is normally distributed. One way to solve this problem is to use simulation based methods and it is such methods that we concentrate on and describe in the following sections.

The methods described below all use simulation to correct for ascertainment but they differ in the distribution from which data is simulated. In what we refer to as the stochastic EM algorithm (Section 4.3), the missing data is simulated. In the data augmentation method due to Clayton (2003) (Section 4.2), the ascertained data is simulated, and in the importance sampler (Section 4.1), the data is simulated from the population distribution. A way in which the stochastic EM algorithm differs from the importance sampling and Clayton's method is that the stochastic EM algorithm is an iterative procedure while the other methods do not need to be iterated, even if it is possible to do so.

4.1 Importance sampling

As mentioned above the difficulty in calculating the likelihood of the ascertained data lies in the integration in

P(A = 1 | θ) = \int P(A = 1 | Ph) f(Ph, G | θ) \, dPh.    (4.3)

Importance sampling (Hammersly & Handscomb 1964) is a Monte Carlo method used for numerical integration. The basic idea is to sample from one distribution to obtain the expectation of another. This is advantageous for sampling efficiently but also when drawing samples from the target distribution is difficult. In general terms, for a random variable X which has density f_1(x), the expectation of some function of X, g(x), can be written as

\mu = E_{f_1}[g(x)] = \int g(x) f_1 \, dx = \int \frac{f_1}{f_2} g(x) f_2 \, dx = E_{f_2}\left[\frac{f_1}{f_2} g(x)\right]    (4.4)

for f_2 > 0 whenever g(x) f_1 > 0. This means that samples can be drawn from f_2 to obtain the expectation of g(x). Two possible estimates of the expectation above are

\hat{\mu} = \frac{\sum_{i}^{M} w_i g(x_i)}{\sum_{i}^{M} w_i}    (4.5)

and

\tilde{\mu} = \frac{\sum_{i}^{M} w_i g(x_i)}{M}    (4.6)

where w = f_1/f_2 and M is the number of simulated observations. For the importance sampler to give a good approximation of f_1, M should be large. The estimate \hat{\mu} is sometimes more efficient than \tilde{\mu}, but while \tilde{\mu} is unbiased \hat{\mu} has a bias of order 1/n, see for example Elston, Olson & Palmer (2002). For other estimates of E_{f_1}[g(x)] see Hesterberg (1995). The choice of f_2 does affect the efficiency of the estimates; for \tilde{\mu} the theoretically best distribution of f_2 is c|g(x)|f_1 for some constant c, see for example Hesterberg (1995). If f_2 is badly chosen a large variance for the estimate may result.
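A toy numerical check of the two estimators (4.5) and (4.6) is sketched below. The target f_1, the sampling distribution f_2 and the function g are arbitrary choices made for the illustration (here E_{f_1}[g(X)] = 1 exactly, so the output can be compared with the truth); Python with numpy and scipy is assumed.

```python
# Illustration only: comparing the estimators (4.5) and (4.6) on a toy problem.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
M = 100_000

g = lambda x: x ** 2
f1 = norm(0.0, 1.0)                  # target density
f2 = norm(0.0, 2.0)                  # sampling density, heavier tailed than f1

x = f2.rvs(size=M, random_state=rng)
w = f1.pdf(x) / f2.pdf(x)            # importance weights w = f1/f2

mu_hat = np.sum(w * g(x)) / np.sum(w)    # (4.5), self-normalized, small bias
mu_tilde = np.sum(w * g(x)) / M          # (4.6), unbiased

print(round(mu_hat, 4), round(mu_tilde, 4))   # both should be close to 1
```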
Ways of choosing f2 efficiently have been proposed by, for example, Torrie & Valleu (1977), Green (1992) and Geyer (1993).

We can apply the importance sampling technique to approximate P(A = 1|θ) in (4.1). It may be advantageous to choose a distribution to simulate from which is close to the target distribution. One way to implement importance sampling in this context is to draw observations from a distribution which has the same parametric form as the target distribution f(y|θ), where Y = (G, Ph), but in the place of the unknown θ use naive guesses of the values of θ, which we call θ′. In this case P(A = 1|θ) is estimated by noting that

\[ P(A = 1 \mid \theta) = \int P(A = 1 \mid y)\, f(y \mid \theta)\, dy = \int \left[ P(A = 1 \mid y)\, \frac{f(y \mid \theta)}{f(y \mid \theta')} \right] f(y \mid \theta')\, dy, \quad (4.7) \]

so that if we draw M observations from f(y|θ′), which we denote y′_1, . . . , y′_M, we can estimate P(A = 1|θ) using the estimator in (4.6). We then get

\[ \hat{P}(A = 1 \mid \theta) = \frac{1}{M} \sum_{j=1}^{M} P(A = 1 \mid y'_j)\, \frac{f(y'_j \mid \theta)}{f(y'_j \mid \theta')}. \quad (4.8) \]

As a consequence we can approximate the log likelihood contribution of individual i,

\[ \log(L) \propto \log(f(y_i \mid \theta)) - \log(P(A = 1 \mid \theta)), \quad (4.9) \]

up to a constant, by replacing P(A = 1|θ) by (4.8), thereby obtaining

\[ \log(f(y_i \mid \theta)) - \log\!\left( \sum_{j=1}^{M} P(A = 1 \mid y'_j)\, \frac{f(y'_j \mid \theta)}{f(y'_j \mid \theta')} \right). \quad (4.10) \]

Since the approximation of the likelihood is expressed in terms of θ, an approximation of the information matrix can be computed as minus the second derivative of the log likelihood as usual.

It would also be possible to construct an importance sampler by drawing from f(y|θ, A = 1) instead of f(y|θ), and basing estimation of P(A = 1|θ) on noting that

\[ P(A = 1 \mid \theta) = \int P(A = 1 \mid y)\, f(y \mid \theta)\, dy = \int \left[ P(A = 1 \mid y)\, \frac{f(y \mid \theta)}{f(y \mid A = 1, \theta')} \right] f(y \mid A = 1, \theta')\, dy, \quad (4.11) \]

but since the denominator, f(y|A = 1, θ′) = P(A = 1|y) f(y|θ′)/P(A = 1|θ′), requires calculation of P(A = 1|θ′), it is not practical to do so. We mention this because it is advocated as a possible approach by Clayton (2003) to correct for ascertainment. Clayton introduces this approach in the framework described in Geyer & Thompson (1992), where what is essentially importance sampling is used in the approximation of exponential family likelihoods, although it is described as Monte Carlo likelihood approximation. Clayton uses the notation from this paper and derives a likelihood that also incorporates correction for ascertainment. In Clayton's likelihood P(A = 1|θ) is written as

\[ c(\theta) = P(A = 1 \mid \theta) = \int_{y \in A} f(y \mid \theta)\, dy = \int_{y \in A} \left[ \frac{f(y \mid \theta)}{f(y \mid \theta')} \right] f(y \mid \theta')\, dy, \quad (4.12) \]

which is the same as (4.11) when ascertainment probabilities are defined to be 0/1 only, since

\[ (4.11) = \int \left[ P(A = 1 \mid y)\, \frac{f(y \mid \theta)}{f(y \mid A = 1, \theta')} \right] f(y \mid A = 1, \theta')\, dy = \int \left[ P(A = 1 \mid y)\, \frac{f(y \mid \theta)}{\dfrac{P(A = 1 \mid y)\, f(y \mid \theta')}{P(A = 1 \mid \theta')}} \right] \frac{P(A = 1 \mid y)\, f(y \mid \theta')}{P(A = 1 \mid \theta')}\, dy \]
\[ = \int \left[ P(A = 1 \mid y)\, \frac{f(y \mid \theta)}{f(y \mid \theta')} \right] f(y \mid \theta')\, dy = \int_{y \in A} \left[ \frac{f(y \mid \theta)}{f(y \mid \theta')} \right] f(y \mid \theta')\, dy = (4.12). \]

Based on this Clayton then uses

\[ \frac{1}{M} \sum_{j=1}^{M} \frac{f(y_j \mid \theta)}{f(y_j \mid \theta')} \quad (4.13) \]

as an estimator of P(A = 1|θ). This estimator does, however, not concur with the estimator (4.6), since the sampling distribution f2 was f(y|A = 1, θ′) and not f(y|θ′), as is suggested by using (4.13) as an estimator of (4.12). Clayton's data augmentation method will be used in this report (see Section 4.2), but we will use a matched case-control likelihood described in Section 3 of Clayton's paper and not the likelihood resulting from the reasoning above.
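As an indication of how (4.8) and (4.10) could be coded, the sketch below assumes a single normally distributed phenotype, a binomially distributed genotype score and a known ascertainment probability function; the parameter ordering, the function names and the particular ascertainment rule are illustrative assumptions and not taken from the report's own programs.

## Sketch of the importance sampling correction (4.8)-(4.10).
## theta = c(b0G, b0Ph, bGPh, sigmaPh): logit allele frequency, phenotype
## intercept, genotype effect and phenotype standard deviation (an assumed
## parameterization, used only in these sketches).
asc.prob <- function(ph) ifelse(ph >= 30, 1, 0.067)   # example ascertainment scheme

log.f <- function(y, theta) {                         # log f(y | theta), y = (G, Ph)
  dbinom(y$g, 2, plogis(theta[1]), log = TRUE) +
    dnorm(y$ph, theta[2] + theta[3] * y$g, theta[4], log = TRUE)
}

draw.pop <- function(M, theta) {                      # M draws from f(y | theta)
  g  <- rbinom(M, 2, plogis(theta[1]))
  ph <- rnorm(M, theta[2] + theta[3] * g, theta[4])
  list(g = g, ph = ph)
}

## Ascertainment corrected log likelihood (4.10); the pseudo-observations are
## drawn once from f(y | theta') and reused for every value of theta.
is.loglik <- function(theta, y.obs, pseudo, theta.prime) {
  if (theta[4] <= 0) return(-1e10)                    # keep the standard deviation positive
  log.w  <- log.f(pseudo, theta) - log.f(pseudo, theta.prime)
  log.PA <- log(sum(asc.prob(pseudo$ph) * exp(log.w)))  # log of (4.8), up to the constant log(M)
  sum(log.f(y.obs, theta)) - length(y.obs$g) * log.PA
}
## Estimates can then be obtained with, for example,
## optim(theta.prime, function(th) -is.loglik(th, y.obs, pseudo, theta.prime)).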
4.2 A data augmentation approach to the ascertainment problem

Clayton (2003) derives an ascertainment corrected likelihood by using an analogy to the conditional likelihood for matched case-control data. The idea behind this approach is to simulate a number of pseudo-observations for each real observation and use these in combination with the real data to build the likelihood. As in the importance sampling, the true parameter values θ are unknown and are substituted by guesses, θ′. We will first review the conditional likelihood for matched case-control data and then describe how it applies to the ascertainment problem. For a further description of the conditional likelihood for matched case-control data see Clayton & Hills (1996). The resulting likelihood resembles the importance sampling likelihood in Section 4.1, but a major difference is that data is simulated from the population distribution in the importance sampling, but under the ascertainment scheme in the likelihood (4.17) below.

The conditional likelihood is commonly used in the analysis of data from matched case-control studies. The data is matched based on characteristics that are believed to be potential confounders, such as age and gender. In each group there is a number of cases, y_1, . . . , y_k = 1, and a number of controls, y_{k+1}, . . . , y_J = 0. The notation Z = \sum_{j=1}^{J} y_j = k will be used to indicate the number of cases in the set of J observations. For simplicity we let k = 1. The likelihood is based on the joint probability that y_1 = 1 and y_2, . . . , y_J = 0, given that Z = 1:

\[ P(y_1 = 1, y_2, \ldots, y_J = 0 \mid Z = 1) = \frac{P(y_1 = 1, y_2, \ldots, y_J = 0)}{P(Z = 1)} = \frac{P(y_1 = 1)\, P(y_2, \ldots, y_J = 0)}{\sum_{j=1}^{J} P(y_j = 1)\, P(y_{l \neq j} = 0)} \]
\[ = \frac{\frac{P(y_1 = 1)}{P(y_1 = 0)}\, P(y_1 = 0)\, P(y_2, \ldots, y_J = 0)}{\sum_{j=1}^{J} \frac{P(y_j = 1)}{P(y_j = 0)}\, P(y_j = 0)\, P(y_{l \neq j} = 0)} = \frac{\frac{P(y_1 = 1)}{P(y_1 = 0)}}{\sum_{j=1}^{J} \frac{P(y_j = 1)}{P(y_j = 0)}} \quad (4.14) \]

In Clayton's model it is not the probability of being a case but the probability for the observations to be real that is modelled. We can evaluate this probability using the same reasoning as that used in the case-control situation if we define an indicator variable R_j, where R_j = 1 if y_j is a real observation, generated from f(y|θ, A), and R_j = 0 if it is a pseudo-observation, generated from f(y|θ′, A). Let y_1 be the real observation. Then, for M pseudo-observations,

\[ P(R_1 = 1, R_2, \ldots, R_{M+1} = 0 \mid Z = 1, y) = \frac{P(R_1 = 1 \mid y_1)/P(R_1 = 0 \mid y_1)}{\sum_{j=1}^{M+1} P(R_j = 1 \mid y_j)/P(R_j = 0 \mid y_j)} = \frac{\dfrac{f(y_1 \mid R_1 = 1)\, P(R_1 = 1)/f(y_1)}{f(y_1 \mid R_1 = 0)\, P(R_1 = 0)/f(y_1)}}{\sum_{j=1}^{M+1} \dfrac{f(y_j \mid R_j = 1)\, P(R_j = 1)/f(y_j)}{f(y_j \mid R_j = 0)\, P(R_j = 0)/f(y_j)}} \]
\[ \stackrel{*}{=} \frac{f(y_1 \mid R_1 = 1)/f(y_1 \mid R_1 = 0)}{\sum_{j=1}^{M+1} f(y_j \mid R_j = 1)/f(y_j \mid R_j = 0)} = \frac{f(y_1 \mid \theta)/f(y_1 \mid \theta')}{\sum_{j=1}^{M+1} f(y_j \mid \theta)/f(y_j \mid \theta')} \quad (4.15) \]

* The probability of R = 1 for an observation in the set of M + 1 observations is a constant, 1/(M + 1), if no information is provided about the observation, so the factors P(R_j = 1) and P(R_j = 0) are the same for all j and cancel. In the last step the factors P(A = 1|y_j) cancel within each ratio, and the normalizing constants P(A = 1|θ) and P(A = 1|θ′) are the same for all j and cancel between the numerator and the sum in the denominator, so that f(y_j|θ, A = 1)/f(y_j|θ′, A = 1) can be replaced by f(y_j|θ)/f(y_j|θ′).

Given the pseudo-observations, the log likelihood contribution of individual i is

\[ \log\!\left( \frac{f(y_i \mid \theta)}{f(y_i \mid \theta')} \right) - \log\!\left( \sum_{j=1}^{M+1} \frac{f(y_{ij} \mid \theta)}{f(y_{ij} \mid \theta')} \right) \quad (4.16) \]

which, up to a constant, can be written as

\[ \log(f(y_i \mid \theta)) - \log\!\left( \sum_{j=1}^{M+1} \frac{f(y_{ij} \mid \theta)}{f(y_{ij} \mid \theta')} \right). \quad (4.17) \]

Since an expression for the likelihood is available, parameter estimates can be obtained using maximum likelihood. Variances of these estimates are obtained as usual by calculating the information matrix from the likelihood. For details see Appendix B.
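The sketch below indicates how the likelihood (4.17) could be computed; it reuses the illustrative functions asc.prob(), log.f() and draw.pop() from the importance sampling sketch in Section 4.1, and it is only a sketch of the idea, not Clayton's actual implementation.

## Sketch of Clayton's data augmentation likelihood (4.17).

## Draw M pseudo-observations under the ascertainment scheme, i.e. from
## f(y | theta', A = 1), by rejection sampling from the population distribution.
draw.ascertained <- function(M, theta.prime) {
  g <- numeric(0); ph <- numeric(0)
  while (length(g) < M) {
    cand <- draw.pop(2 * M, theta.prime)
    keep <- runif(length(cand$ph)) < asc.prob(cand$ph)
    g  <- c(g, cand$g[keep])
    ph <- c(ph, cand$ph[keep])
  }
  list(g = g[1:M], ph = ph[1:M])
}

## Log likelihood (4.17): for each real observation the sum in the second term
## runs over that observation together with its own M pseudo-observations.
clayton.loglik <- function(theta, y.obs, pseudo.list, theta.prime) {
  if (theta[4] <= 0) return(-1e10)
  sum(sapply(seq_along(y.obs$g), function(i) {
    yi   <- list(g = y.obs$g[i], ph = y.obs$ph[i])
    yset <- list(g = c(yi$g, pseudo.list[[i]]$g),
                 ph = c(yi$ph, pseudo.list[[i]]$ph))
    log.f(yi, theta) - log(sum(exp(log.f(yset, theta) - log.f(yset, theta.prime))))
  }))
}
## pseudo.list would contain one call to draw.ascertained(M, theta.prime) per
## real observation, so that M x n pseudo-observations are produced in total.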
The likelihood (4.17) is similar to the likelihood approximated with the importance sampler, (4.10), especially when ascertainment probabilities are 0/1. The essential differences are that:

• Data is drawn under non-random ascertainment in (4.17), using Clayton's method, while it is drawn from the population distribution in (4.10), using the importance sampler.

• The sum in the second term is over the pseudo-observations only in (4.10), while the real observation is also included in (4.17).

• In (4.17) a separate estimate of P(A = 1) is calculated for each real observation, while in (4.10) P(A = 1) is calculated only once.

The last of these differences means that while M pseudo-observations are produced in the importance sampler, for a sample size of n real observations, M × n pseudo-observations are produced in Clayton's method.

As Clayton suggests, the data augmentation method can be iterated to refine the values of θ′ in each step by using the parameter estimates from the previous iteration. There is however no guarantee that this will overcome problems of convergence that result from poor choices of θ′; the point estimates may diverge if θ′ is too far from θ in the first iteration. As we will illustrate in Section 4.4, Clayton's method suffers, not surprisingly, from problems common to other simulation based methods, namely that a poor choice of sampling distribution may result in large variability in the estimates. Approaches for choosing the sampling distribution that have been proposed in other contexts, such as importance sampling, may be useful.

4.3 Other sampling based algorithms, missing data

Although it does not completely fit into the classical framework of missing data problems (Little & Rubin 1987), non-random ascertainment can still be viewed in terms of a missing data problem. In the classical framework of a missing data problem there is a well-defined set of observations for which some of the values are not observed, and the data is partitioned into observed data, Y^{Obs}, and missing data, Y^{Miss}. Usually there is partially complete information on each sample unit. In the ascertainment problem the unobserved observations are often of a different nature. If data is missing on an individual it is usually missing altogether, and it is not always even obvious how many individuals are unobserved. Nevertheless it is useful to consider algorithms used in missing data problems, such as the Expectation Maximization (EM) algorithm (Dempster et al. 1977) and its extensions. We start by briefly describing the EM-algorithm as background. An algorithm similar in spirit to the stochastic EM-algorithm is then described and applied to the ascertainment problem. These methods are frequentist. Bayesian approaches to the missing data problem will not be covered in this report.

The EM algorithm has been applied extensively to missing data problems. The EM algorithm can be used to obtain maximum likelihood estimates using a numerical technique. First, starting values for the parameter estimates are decided upon, and then the following two steps are iterated. In the E-step the expectation of the complete data is calculated using the parameter values, θ̂, from the previous M-step. In the M-step the maximum likelihood estimates, θ̂, are calculated from the complete data created in the E-step. It can be shown that if the likelihood has a unique maximum the EM-algorithm converges to that value (Wu 1983).
A problem that sometimes occurs when using the EM-algorithm is convergence to local maxima; the algorithm is therefore sensitive to what starting values are chosen. Another disadvantage of the EM-algorithm is that there is no direct way to calculate standard errors. One way to tackle this problem is to compute an asymptotic covariance matrix (Louis 1982). This approach uses the property that

\[ -l''_{Obs}(\theta, y) = E_\theta[-l''_C(\theta, x) \mid y] - \mathrm{cov}_\theta[l'_C(\theta, x) \mid y]. \quad (4.18) \]

The variance is obtained by taking the inverse of the observed information matrix −l''_{Obs}(θ, y).

If calculating the expected value of the missing data requires computationally demanding numerical integration, one way to side-step the problem is to simulate the missing data and to use the observed mean instead of the calculated expectation. This is the Monte Carlo EM-algorithm (Wei & Tanner 1990). The algorithm is performed in two steps: in the S-step the missing data is simulated M times, and in the M-step maximum likelihood estimates θ̂ are calculated using the combined data set containing observed and simulated data. Since the maximum likelihood estimates are calculated using the combined data set, the likelihood for the full data is used.

The stochastic EM-algorithm (SEM) (Celeux & Diebolt 1985) is a special case of the Monte Carlo EM-algorithm with only one simulation step per maximization step (McLachlan & Krishnan 1997). In iteration i the calculations are performed according to the following algorithm:

S-step: Simulate M = 1 set of the missing data Y^{Miss} using the current parameter estimates θ̂_{i−1}, and construct the complete data likelihood using the observed data and the simulated data:

\[ L(\theta; Y^{Complete}) = L(\theta; Y^{Obs}_{A=1}, Y^{Miss}_{A=0}) \approx L(\theta; Y^{Obs}_{A=1}, Y^{Sim}_{A=0}) = \prod f(Y^{Obs}_{A=1}, Y^{Sim}_{A=0} \mid \theta) = \prod_{Obs} f(Y^{Obs}_{A=1} \mid \theta) \prod_{Sim} f(Y^{Sim}_{A=0} \mid \theta) \quad (4.19) \]

M-step: Get new parameter estimates θ̂_i from (4.19) using maximum likelihood.

Repeat: Go to iteration i + 1 and repeat the steps above.

The stochastic EM will not converge to a single value, but will have random variation, induced by the simulated data, around the estimate, and the result will be similar to that of a stationary Markov Chain Monte Carlo. See for example Gilks, Richardson & Speigelhalter (1996) for a description of Markov Chain Monte Carlo. We will use the word convergence in a non-stringent manner to denote the process of the estimates moving towards a distribution around the correct value. In analogy with the terminology of Markov Chain Monte Carlo we will use the word burn-in to refer to the initial iterations of the chain that should be excluded from the analysis in order to ensure that the estimates are produced by the right distribution.

Missing data problems such as censored data sets are applications where the stochastic EM-algorithm is useful (Ip 1994). Starting values are required for the stochastic EM-algorithm, but it is more robust to misspecified starting values than the deterministic EM-algorithm (Gilks et al. 1996). One way of estimating parameters in the stochastic EM algorithm is to choose the set of parameter values in the iteration that gives the highest value of the likelihood for the observed data. The likelihood for the observed data might however be so complicated that this is unfeasible. A simpler way is to compute the mean, θ̃, of the parameter values in the iterations after an appropriate burn-in period. An approximation of the variance of θ̃ can according to Gilks et al.
(1996) be computed using the method for the EM-algorithm by Louis that is described above. There are however some suspicions that this method may underestimate the variance (Gilks et al. 1996). For further reading about the stochastic EM-algorithm see for example Gilks et al. (1996) or McLachlan & Krishnan (1997).

We can implement an algorithm similar in spirit to the stochastic EM algorithm for the ascertainment problem. The non-ascertained data is considered missing and is imputed in the S-step using the parameter estimates from the previous M-step. Normally, when the stochastic EM algorithm is used to fill in missing data, there is a fixed sample size and data is filled in for those individuals where data is missing. Here we assume that the sample size of the full data is not known, but only the ascertainment probabilities conditional on the data. This is however not a problem if the data is simulated as described below. In the S-step the missing data is filled in by rejection sampling (see for example Gilks et al. 1996), using a reverse ascertainment scheme:

Simulate: Simulate data from the population distribution f(Y|θ̂) and sort the observations into data that would have been ascertained, Y^{Sim}_{A=1}, and data that would not have been ascertained, Y^{Sim}_{A=0}. Stop when n observations in Y^{Sim}_{A=1} have been obtained.

Reject: Throw out the observations in Y^{Sim}_{A=1} and keep those in Y^{Sim}_{A=0}.

The sample size, n, of the observed ascertained data, Y^{Obs}_{A=1}, is a known value, but the size of the simulated data set is random and will depend on θ̂. If the population size is known, data is instead simulated until the combined data set of Y^{Obs}_{A=1} and Y^{Sim}_{A=0} has the appropriate sample size. In the M-step maximum likelihood estimates are obtained from the likelihood of the real ascertained data combined with the simulated non-ascertained data, as described above.

If Y^{Sim}_{A=0} is large this algorithm will be slow. An alternative to sampling the whole set of missing data is to simulate only a portion of the data and weight up the likelihood contribution of the simulated data. If a proportion 1/k of the missing data is desired, data is simulated as above until n/k observations in Y^{Sim}_{A=1} have been produced. If n/k is not an integer, randomization can be used to determine whether it should be rounded up or down. Alternatively, n/k can be fixed to an integer value and the value of k calculated. Small values of n/k will cause large variability in the estimates, so the choice of k is a balance between sample size and the number of steps in the chain. Ripatti, Larsen & Palmgren (2002) suggest a rule for increasing the number of samples in a Monte Carlo EM-algorithm when approaching convergence. In this context, simulating a proportion 1/k of the missing data would correspond to simulating 1/k of a sample. The basic idea of altering the number of samples when approaching the estimate might however be used also in this context. If the size of the missing data is small it is of course also possible to choose M > 1, giving an algorithm similar to the Monte Carlo EM. In the simulations in Section 4.4, M = 1 will be used.

4.4 Simulations

To illustrate how the different methods perform in our context some simulations are performed. A simple model with only one phenotype is first investigated, and then a few runs for a model with two co-dependent phenotypes are made. The phenotypes are simulated to mimic two phenotypes in the metabolic syndrome, BMI and fasting plasma glucose level.
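Before the models are specified, the following sketch indicates how one iteration of the SEM scheme described in Section 4.3 (the rejection sampling S-step followed by a maximum likelihood M-step) could be coded for a single phenotype; it reuses the illustrative functions log.f(), draw.pop() and asc.prob() from the earlier sketches, and it is a sketch under those assumptions rather than the code used for the simulations reported below.

## Sketch of one SEM iteration for the ascertainment problem.

## S-step: draw from the population distribution f(Y | theta.hat), one unit at
## a time, until n simulated units would have been ascertained; the units that
## would not have been ascertained are kept as the imputed missing data.
s.step <- function(n, theta.hat) {
  miss <- list(g = numeric(0), ph = numeric(0))
  n.asc <- 0
  while (n.asc < n) {
    y <- draw.pop(1, theta.hat)
    if (runif(1) < asc.prob(y$ph)) {
      n.asc <- n.asc + 1                      # would have been ascertained: counted, then discarded
    } else {
      miss$g  <- c(miss$g,  y$g)              # would not have been ascertained: kept
      miss$ph <- c(miss$ph, y$ph)
    }
  }
  miss
}

## M-step: maximize the complete data log likelihood (4.19).
m.step <- function(y.obs, y.miss, theta.start) {
  nll <- function(theta) {
    if (theta[4] <= 0) return(1e10)           # keep the standard deviation positive
    -(sum(log.f(y.obs, theta)) + sum(log.f(y.miss, theta)))
  }
  optim(theta.start, nll)$par
}

sem.iteration <- function(y.obs, theta.hat) {
  y.miss <- s.step(length(y.obs$g), theta.hat)
  m.step(y.obs, y.miss, theta.hat)
}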
According to the WHO, a BMI of at least 30 indicates obesity. About 10 percent of the Swedish population in the ages 25-64 have such a BMI according to the WHO MONICA project (WHO 2000). A fasting plasma glucose level of at least 7.8 or a 2 hour fasting glucose level of at least 11.1 gives a diabetes diagnosis according to the WHO diagnostic criteria for diabetes. We will use a BMI of 30 and a plasma glucose level of 7.8 as cut-offs in the ascertainment schemes.

For each method a number of simulations are performed, where a new data set is produced under the ascertainment scheme in each simulation; this is our 'real' data that is then analyzed. The sample size in each data set is 300 if no other sample size is indicated. The parameter estimates presented in Tables 4.2-4.9 are the means of the estimates in 100 simulations under the same conditions, and the standard errors reported measure the variation between parameter estimates in these simulations.

Starting values are required for the stochastic EM-algorithm, and parameter values for the simulated data have to be specified for the other methods. Simulations are run both for correctly specified values and for misspecified values, to investigate how the methods perform under ideal and not so ideal conditions. For simplicity we will denote both the parameter values for the sampling distributions in the importance sampler and in Clayton's method and the starting values in the SEM as θ′.

The number M has to be decided for Clayton's method and the importance sampler; recall that in Clayton's method M is the number of pseudo-observations per 'real' observation, while in the importance sampler it is the total number of pseudo-observations. Clayton's model and the importance sampler are run under some different values of M for the simpler model, to investigate the effect of M on the variability of the estimates. In the comparison between methods, M = 50 pseudo-observations are used for Clayton's method, and in the importance sampler M = 30000 has been used.

In the analysis the phenotype and the genotype frequency are modelled with a joint likelihood, as in (2.3). Genotype scores, here representing outcomes of SNPs, will be assumed to be binomially distributed, G ∼ Bin(2, exp(β_{0G})/(1 + exp(β_{0G}))), and phenotypes will be normally distributed. The simulations are carried out using the software R (The-R-Development-Core-Team 2001). In all three methods the R optimizing function optim() is used to calculate maximum likelihood estimates. The starting values used in the algorithm are θ′ in the importance sampler and in Clayton's method. In the stochastic EM-algorithm the current value of θ̂_{i−1} is used.

Figure 4.2: Model i (G → BMI → A).

4.4.1 Model i

In this model we have a genotype represented by a biallelic SNP with an allele frequency of exp(β_{0G})/(1 + exp(β_{0G})) ≈ 0.2 (β_{0G} = −1.4), and the genotype score is G ∼ Bin(2, exp(β_{0G})/(1 + exp(β_{0G}))). The genotype is assumed to have an additive effect of β_{GPh} = 4 per copy of the rare allele on a normally distributed phenotype (Ph), BMI, with standard deviation σ_{Ph} = √2. The intercept, the mean phenotype value for genotype AA, is β_{0Ph} = 24. That is, Ph|G = g ∼ N(β_{0Ph} + β_{GPh} × g, σ_{Ph}). The ascertainment is made on the phenotype: all individuals with a BMI larger than or equal to 30 are selected, while individuals with a lower BMI have an ascertainment probability of about 0.067. This will give approximately equal numbers of cases and controls. The model is illustrated by the graph in Figure 4.2.
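As an indication of how a 'real' data set for model i could be generated under this ascertainment scheme, the following sketch draws units from the population model and keeps each one with the ascertainment probability just described; it is a sketch under the stated parameter values, not the report's own simulation code.

## Sketch: one data set of size n under model i and its ascertainment scheme.
simulate.model.i <- function(n = 300, b0G = -1.4, b0Ph = 24, bGPh = 4,
                             sigmaPh = sqrt(2)) {
  g <- numeric(0); ph <- numeric(0)
  while (length(g) < n) {
    gi    <- rbinom(1, 2, plogis(b0G))            # genotype score, allele frequency about 0.2
    phi   <- rnorm(1, b0Ph + bGPh * gi, sigmaPh)  # BMI given genotype
    p.asc <- if (phi >= 30) 1 else 0.067          # ascertainment probability
    if (runif(1) < p.asc) {
      g  <- c(g, gi)
      ph <- c(ph, phi)
    }
  }
  list(g = g, ph = ph)
}
y.obs <- simulate.model.i()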
4.4.2 Model ii The genotype in this model is also represented by a biallelic SNP, with the same allele frequency as in model i. Instead of one phenotype as in model i we here have two, P h1 =BMI and P h2 = plasma glucose level. BMI is here a co-morbid disease of plasma glucose level, that is, BMI is affected by the gene and will in turn affect the plasma glucose level. The genotype is assumed to have an additive effect on both phenotypes, and P h1 will have an additive effect on P h2 . Given the genotype, P h1 has distribution √ N (β0P h1 + βGP h1 × g, σP h1 ) where β0P h1 = 24, βGP h1 = 4 and σP h1 = 2 43 BMI G A Plasma Glucose Figure 4.3: Model ii while P h2 will, given genotype score, G = g, and BMI value, P h1 = ph1 , have distribution N (β0P h2 +βGP h2 ×g +βP h1 P h2 ×ph1 , σP h2 ), where β0P h2 = 3, βGP h2 = 1, βP h1 P h2 = 1/15 and σP h2 = 0.5. The ascertainment probability is dependent upon both phenotypes, according to Table 4.1. Model ii is illustrated by the graph in Figure 4.3. P h2 < 7.8 P h2 ≥ 7.8 P h1 < 30 P h1 ≥ 30 0.1 0.3 0.3 1 Table 4.1: Ascertainment probabilities in model ii 4.5 4.5.1 Results of simulations Results for model i Clayton’s model and the importance sampler for different values of M : To investigate how the number of pseudo-observations affects the parameter estimates in Clayton’s method and the importance sampler, these models 44 were run for different values of M . The true parameter values θ were here used as θ′ . In the tables standard errors of the estimates are presented in parentheses. Clayton’s method was run for M =2, 5, 10 and 50 as seen in Table 4.2. Since M pseudo-observations was produced for each real observation the total number of pseudo-observations was M × n = 600, 1500, 3000 and 15000. As Clayton (2003) points out, the information loss in the method seems to be of the order of M/(M + 1). β̂0G β̂0P h β̂GP h σ̂P h True -1.4 M =2 -1.388 (0.096) 24 23.999 (0.174) 4 3.996 (0.152) q (2) ≈ 1.414 1.420 (0.072) M =5 -1.395 (0.092) 23.995 (0.141) 3.996 (0.130) 1.415 (0.065) M =10 M =50 -1.408 -1.403 (0.073) (0.062) 24.009 24.007 (0.146) (0.110 ) 3.988 3.991 (0.117) (0.098 ) 1.415 1.416 (0.060) (0.054) Table 4.2: Clayton’s method run for different number of pseudo-observations, n = 300 The importance sampler was also investigated with respect to M . For the Importance sampler M is the total number of pseudo-observations so M = 600, 1500, 3000 and 15000 were chosen. Since the importance sampler seemed to need a larger total number of pseudo-observations than Clayton’s method the importance sample was also run for M = 30000. In the analysis below M = 50 will be used in Clayton’s method and M = 30000 will be used in the importance sampler. Comparison of models: Results of the three simulations based methods, calculated where true parameter values were used as θ′ , are presented in Table 4.4. Naive estimates, calculated by optimizing the likelihood of the data without ascertainment 45 β̂0G β̂0P h β̂GP h σ̂P h True -1.4 M =600 M =1500 M =3000 M =15000 M =30000 -1.359 -1.384 -1.399 -1.404 -1.394 ( 0.146 ) ( 0.097 ) ( 0.090 ) ( 0.087 ) ( 0.075 ) 24 24.031 24.025 24.022 24.022 23.996 ( 0.200 ) ( 0.138 ) ( 0.131 ) ( 0.139 ) ( 0.131 ) 4 3.979 3.987 3.991 3.991 4.008 ( 0.214 ) ( 0.159 ) ( 0.131 ) ( 0.115 ) ( 0.119 ) q (2) ≈ 1.414 1.429 1.406 1.407 1.406 1.416 ( 0.108 ) ( 0.071 ) ( 0.065 ) ( 0.050 ) ( 0.054 ) Table 4.3: Importance sampler run for different values of M , n = 300 correction, are also presented. 
Standard errors of the estimates are presented in parenthesis. β̂0G β̂0P h β̂GP h σ̂P h True values -1.4 Naive estimates -0.107 (0.083 ) 24 24.409 (0.141 ) 4 4.152 (0.105 ) q (2) ≈ 1.414 1.589 (0.059 ) Importance sampling -1.394 (0.075 ) 23.996 (0.131) 4.008 (0.119) 1.416 (0.054) Clayton’s method SEM -1.403 -1.400 (0.062 ) (0.068 ) 24.007 23.988 (0.110 ) (0.100 ) 3.991 4.006 (0.098 ) (0.106 ) 1.416 1.408 (0.054) (0.055) Table 4.4: Comparison of models when θ′ = θ. n = 300 Any observed differences in variances of estimates between models in Table 4.4 should be interpreted with caution since the variance of an estimate based on Clayton’s method and the importance sampler depends on M , and the variance of an estimate based on using the SEM depends on chain length. 46 Analysis with misspecified θ′ In Table 4.5 parameter estimates and standard errors of estimates for the ′ importance sampler are presented for βGP h = 4, 2 and 0, the correct value ′ of βGP h is 4. For the other parameters θ = θ was used. In Table 4.6 corresponding results are presented for Clayton’s model. Misspecified θ′ affect the standard errors of the estimates in both importance sampling and in Clayton’s model; the effect seems to be more pronounced in the importance sampler than in Clayton’s model. The effect of the misspecification on the standard errors is not linear. It is worth noting that the standard error of β̂GP h in the importance sampler actually seems to be larger for slightly mis′ ′ specified βGP h than for more severely misspecified βGP h , as can be seen in Table 4.5, Table 4.7 or Table 4.8. The effect estimates seem however to be biased for misspecified parameter values so the standard errors may not be a satisfying tool for comparison of the methods. Since the importance sampler estimator is claimed to be unbiased it may seem surprising that the parameter estimates are biased. As stated in Section 4.1, a condition for the importance sampler is that the sampling distribution f2 should be positive whenever g(x) × f1 > 0. This condition is fulfilled in the simulations above, but when θ′ is misspecified f2 may be so small for some g(x) × f1 , that no observations are actually sampled from these regions. Further investigation is however required determine if this is the cause of the bias in the parameter estimates. β̂0G β̂0P h β̂GP h σ̂P h ′ ′ ′ βGP h = βGP h = 4 βGP h = 2 βGP h = 0 -1.394 -1.189 3.213 (0.075 ) (0.310 ) (2.081 ) 24 23.996 23.850 26.342 (0.131) (0.212 ) (1.833 ) 4 4.008 4.292 4.687 (0.119) (0.352 ) (0.304 ) q (2) ≈ 1.414 1.416 1.401 2.001 (0.054) (0.156) (0.370) True -1.4 Table 4.5: Misspecified θ′ in importance sampling. n = 300 M =30000 47 β̂0G β̂0P h β̂GP h σ̂P h ′ ′ ′ βGP h = βGP h = 4 βGP h = 2 βGP h = 0 -1.403 -1.409 -1.316 (0.062) (0.098) (0.344) 24 24.007 24.000 23.991 (0.110) (0.130) (0.133) 4 3.991 3.984 4.094 (0.098) (0.136) (0.414) q (2) ≈ 1.414 1.416 1.392 1.434 (0.054) (0.070) (0.093) True -1.4 Table 4.6: Misspecified θ′ in Clayton’s model. n = 300 M = 50 For the SEM misspecified θ′ does not have the effect of inflated standard errors of the estimates. If the SEM converges to a distribution around the parameter estimates the standard errors of the estimates after the convergence will be the same as for correctly specified starting values. The running time of the SEM will be longer when θ′ is misspecified to allow for convergence and similar as to in Marcov Chain Monte Carlo simulations an appropriate burn in period has to be identified. 
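To indicate how such a run could be carried out and inspected, the sketch below iterates the SEM from misspecified starting values, plots the chains and averages the estimates after a burn-in; it reuses sem.iteration() and the model i data sketch from earlier, and the chain length and burn-in period are arbitrary illustrative choices.

## Sketch: running the SEM chain from misspecified starting values, inspecting
## the trace plots and averaging after a burn-in period.
theta  <- c(0, 24, 0, sqrt(2))                  # starting values as in Figure 4.4
n.iter <- 200
chain  <- matrix(NA, nrow = n.iter, ncol = length(theta))
for (i in 1:n.iter) {
  theta      <- sem.iteration(y.obs, theta)
  chain[i, ] <- theta
}
matplot(chain, type = "l")                      # trace plots used to judge the burn-in
theta.tilde <- colMeans(chain[101:n.iter, ])    # mean after discarding a burn-in of 100 iterations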
There is a risk that the parameter estimates in the SEM will diverge when θ′ is misspecified, this did however not happen in any of our simulations. To illustrate the behavior of the SEM under misspecified starting values a ′ ′ ′ ′ = 0, β0P chain was run under β0G h = β0P h , βGP h = 0 and σP h = σP h and the parameter estimates plotted in Figure 4.4. The values of θ′ chosen here are meant to resemble a situation where there is some knowledge of the distribution of the phenotype in the population but no knowledge of the allele frequency or the gene’s effect on the phenotype. If Clayton’s method is run for the same values of θ′ as the SEM the estimates seem to be unbiased but with large variances while the Importance sampling did not give useful estimates at these values of θ′ . A possible way to obtain better starting values is to use the naive parameter estimates as θ′ . Running Clayton’s method from naive estimates did however still give large variances. For Clayton’s model theoretical variance estimates were also calculated, for technical details see Appendix B. The theoretical variances seemed to concur with the observed variances and will therefore not be reported separately. If 48 Intercept, Starting value 24 23.4 23.8 ^ βOPh −1.0 −1.5 ^ βOG −0.5 24.2 0.0 Gene freq, Starting value 0 50 100 150 200 0 50 100 150 Iteration Iteration Gene effect, Starting value 0 SD, Starting value 1.41 200 2.0 ^ Ph σ 2 0 1.5 1 ^ βGPh 3 2.5 4 0 0 50 100 150 200 Iteration 0 50 100 150 Iteration Figure 4.4: The first 200 iterations in SEM model i for misspecified θ′ , True parameter values as solid line, n = 300 49 200 θ′ is severely misspecified so that the variances of the estimates are very large the information matrix needed for the theoretical variance estimates will sometimes be singular. The importance sampler for different sample sizes: It would be of interest to see if the effect of misspecification of θ′ is dependent upon the sample-size, n, of the data. To investigate this the importance ′ sampler with misspecified βGP h was run for n = 600, and n = 1200. The results are presented in Table 4.7 and 4.8. As can be seen by comparing Table 4.7 and 4.8 with Table 4.5, where n = 300, the standard errors seem to be of similar size regardless of sample-size when ′ ′ βGP h = 0. When βGP h is not misspecified the standard errors decrease with sample-size in the usual rate. An interpretation of the results is that the variability resulting from misspecification dominates over the variability from ′ the data when βGP h = 0, so that the decrease of the standard deviation due to the increase in sample-size, n, is harder to detect. β̂0G β̂0P h β̂GP h σ̂P h ′ ′ ′ βGP h = βGP h = 4 βGP h = 2 βGP h = 0 -1.399 -1.197 3.039 ( 0.050 ) ( 0.301 ) ( 1.676 ) 24 24.003 23.862 26.459 (0.095 ) (0.204 ) ( 1.929 ) 4 3.999 4.239 4.705 (0.079 ) ( 0.372 ) ( 0.278 ) q (2) ≈ 1.414 1.414 1.389 2.083 ( 0.039) ( 0.156) ( 0.386 ) True -1.4 Table 4.7: Misspecified θ′ in importance sampling. n = 600 M = 30000 4.5.2 Results for model ii When using θ′ = θ in model ii both the importance sampler, Clayton’s method and the SEM gave reasonable estimates . The SEM was somewhat time-consuming to run in for this model. 
50 β̂0G β̂0P h β̂GP h σ̂P h ′ ′ ′ βGP h = βGP h = 4 βGP h = 2 βGP h = 0 ] -1.395 -1.186 2.968 (0.045 ) (0.286 ) ( 1.657 ) 24 24.000 23.837 26.482 ( 0.055 ) (0.194 ) ( 3.050 ) 4 3.997 4.261 4.661 (0.049 ) (0.373 ) ( 0.272 ) q (2) ≈ 1.414 1.415 1.390 2.068 ( 0.026) ( 0.124) ( 0.445 ) True -1.4 Table 4.8: Misspecified θ′ in importance sampling. n = 1200 M = 30000 ′ Model ii was also run for misspecified θ′ , the values of θ′ were β0G = 0 β0P h1 = ′ ′ ′ ′ ′ β0P h1 , βGP h1 = 0, σP h1 = σP h1 , β0P h2 = β0P h2 , βGP h2 = 0, βP h1 P h2 = 0, and σP′ h2 = σP h2 . As in model i these values were chosen as if some knowledge was available about the distribution of the phenotypes while no knowledge was available about the allele frequency or the effect of the gene. The effect of BMI on plasma glucose level was also assumed unknown. As can be seen in Table 4.9 these values of θ′ do not give adequate parameter values in neither Clayton’s method nor the importance sampler. The SEM does converge but does take longer to converge than in model i, a run of the SEM is shown in Figure 4.5. 51 Intercept Ph1, Starting value 24 0 100 200 300 400 4 0 −1.5 23.8 1 2 ^ βGPh1 24.0 ^ βOPh1 3 24.2 0.0 −0.5 −1.0 ^ βOG Gene effect Ph1, Starting value 0 24.4 Gene freq, Starting value 0 0 100 200 300 400 0 100 200 300 400 Iteration Iteration SD(Ph1), Starting value 1.41 Intercept Ph2, Starting value 3 Gene effect Ph2, Starting value 0 0.8 0.4 ^ βGPh2 2 0 100 200 300 400 0.0 −1 0 0 100 200 300 400 Iteration Effect of Ph1 on Ph2, Starting value 0 SD(Ph2), Starting value 0.5 0 100 200 Iteration 1.0 0.8 0.6 0.10 ^ Ph σ 2 0.20 1.2 Iteration 0.00 ^ βPh1Ph2 1 ^ βOPh2 1.8 1.4 ^ Ph σ 1 2.2 3 2.6 Iteration 0 100 200 300 400 0 100 Iteration 200 300 400 Iteration Figure 4.5: The first 400 iterations in SEM model ii for misspecified θ′ , True parameter values as solid line. n = 300 52 300 400 β̂0G β̂0P h1 β̂GP h1 σ̂P h1 β̂0P h2 β̂GP h2 β̂P h1 P h2 σ̂P h2 True θ -1.4 θ′ 0 Clayton Importance sampling -0.376 -1.001 (1.049 ) (0.935 ) 24 24 23.654 24.371 (1.094 ) (0.898 ) 4 0 0.304 0.584 (0.432 ) (1.015 ) q q (2) ≈ 1.414 (2) 1.719 1.704 (0.200 ) (0.382 ) 3 3 4.825 4.977 (0.325 ) (0.515 ) 1 0 0.557 0.645 (0.191 ) (0.390 ) 1/15 ≈ 0.067 0 0.061 0.015 (0.028 ) (0.020 ) 0.5 0.5 0.064 0.002 (0.077) (0.040) Table 4.9: Clayton’s method and importance sampling for model ii under misspecified θ′ . n = 300 - 53 5 Discussion The Stochastic EM algorithm converges in the tested examples. If the starting values are misspecified we have to run the algorithm for a longer time to allow it to converge but the starting values will not affect the estimates once convergence is achieved. If the model is complex the Stochastic EM algorithm therefore seems to be preferable to Clayton’s method and the importance sampler since both of these methods seem prone to break down when θ′ is misspecified. The importance sampler seems to be more sensitive to misspecified θ′ than Clayton’s method but it is not investigated here how much of that difference that is due to the choices of the size, M , of the simulated data. It would be interesting to investigate if a larger M would give adequate estimates in model ii with misspecified θ′ , where poor estimates were obtained in the simulations in Section 4.5.2. The importance sampler may perform better if another importance sampling estimate than (4.6) is used. Hesterberg (1995) describes such alternatives and argues that (4.6) is unreliable since the weights, wi , do not sum to one. 
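One way such a weighted average could be used in the present setting is sketched below: the estimate of P(A = 1|θ) in the importance sampling correction is replaced by the self-normalized form (4.5). The sketch reuses the illustrative functions from Section 4.1 and is only an indication of the idea, not an evaluated alternative.

## Sketch: the weighted average estimate (4.5) applied to P(A = 1 | theta).
p.hat.weighted <- function(theta, pseudo, theta.prime) {
  w <- exp(log.f(pseudo, theta) - log.f(pseudo, theta.prime))  # w = f(y|theta)/f(y|theta')
  sum(w * asc.prob(pseudo$ph)) / sum(w)                        # weights forced to sum to one
}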
Hesterberg (1995) also points out that a mixture of sampling distributions can be used in importance sampling for better coverage of the sample space. This approach may be beneficial in the ascertainment problem since the choice of sampling distribution turned out to be a major difficulty. Using a similar approach in Clayton’s method may also be considered. Another possible strategy to avoid the effects of seriously misspecified θ′ in Clayton’s method and the importance sampler is to iterate the procedure using the estimates from the previous step as θ′ . This however demands that the parameter estimates from the first step are somehow reasonable, otherwise the iterative procedure could diverge. Both the importance sampler, Clayton’s method and the Stochastic EM algorithm demands prior knowledge of sampling probabilities given the data. These probabilities are often not known and approximations may have to be made using for example registry data or prior knowledge about disease occurrence. If the sampling probabilities are known the complexity of ascertainment scheme does however hardly affect the complexity of the calculations. In Clayton’s method and the SEM the ascertainment probabilities are used only when simulating data, and not in the likelihood, while in the importance sampler the sampling probabilities are used in the estimator of P (A = 1|θ) in a computationally simple manner. Another advantage of the three simulation based methods is that they are not restricted to any specific 54 kind of model while some of the traditional methods handle only specific kinds of data. The results of the simulation based methods described here will be sensitive to distributional assumptions. Phenotypes are here assumed to be normally distributed given genotype scores but in real data they are often not, so nonparametric extensions of the models would be of interest. It is not possible to check distributional assumptions using standard procedures such as normal QQ-plots since the ascertained data is not assumed to follow the distribution in the population. When missing data is filled in, as in the SEM, checks of distributional assumptions can be misleading since the combined data is a mixture of data from the population distribution and data simulated according to the distributional assumptions. If distributional assumptions are to be checked custom made checks have to be constructed. Another useful extension would be to incorporate analysis of ambiguous haplotypes in the models. 55 A Power calculations for dichotomized data Description of the power calculations presented in Section 2.4: The log likelihood ratio is computed, but instead of real data the expected data under the ascertainment scheme is used, as described below. The procedure has been described for generalized linear models by Self, Mauritsen, & Ohara (1992) and applied to binary case-control data by Longmate (2001) who calls it the exemplary data method. Under the null hypothesis the log likelihood ratio statistic T 2 is asymptotically χ2 distributed and the null hypothesis is rejected when T 2 > q ∼ χ2 (df, 1 − α) where df is the discrepancy in parameters between the alternative and the null hypothesis and α is the intended significance level. The power is calculated by observing how large proportion of samples would be larger than q under the alternative hypothesis. The distribution of T 2 under the alternative hypothesis is asymptotically non-central χ2 . T 2 is also called the non-centrality parameter. 
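In R the rejection limit and the power can be obtained directly from the central and noncentral χ2 distributions, as the following sketch indicates; the noncentrality value used here is an arbitrary illustrative number, not one of the report's results.

## Sketch of the power calculation based on the noncentral chi-square.
alpha <- 0.05
df    <- 1                                   # one parameter tested, beta_GPh = 0
ncp   <- 8.5                                 # illustrative expected value of T^2 under the alternative
q     <- qchisq(1 - alpha, df)               # rejection limit under the null
power <- 1 - pchisq(q, df, ncp = ncp)        # P(T^2 > q) under the alternative
power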
The expectation of T 2 is obtained by calculating T 2 using the expected data. The power is calculated using the noncentral χ2 probability function Fχ2 (q, df, ν) where ν is the expected value of T 2 under the alternative. The non-central χ2 distribution is available in for example the statistical software R (The-R-Development-Core-Team 2001). Calculation of T 2 : The power calculation are performed by computing the expected value of the log likelihood ratio statistic testing if βGP h = 0 T 2 = 2[lA − l0 ] (A.1) where l is the log likelihood. In the alternative hypothesis the parameters β0P h and βGP h are estimated using maximum likelihood while under the null only β0P h is estimated. The test statistic will be asymptotically chisquare distributed with one degree of freedom. The expected value of the test statistic is E(2[lA − l0 ]) = 2[E(lA ) − E(l0 )]. 56 (A.2) For continuous data ascertainment correction has to be made and the likelihood for a single observation is thus LA = f (P h, G|A, θ̂) = P (A|G, P h, θ̂)f (P h|G, θ̂)P (G|θ̂) P (A|θ̂) P (A|G, P h, θ0 )f (P h|G, θ0 )P (G|θ0 ) L0 = f (P h, G|A, θ0 ) = P (A|θ0 ) (A.3) where θ̂ = θ for the exemplary data. The parameter values depend on the hypothesis in L but not in the other parts of (A.6). The expressions P (A|G, P h) and P (G) do not depend on the parameter values at all and will be the same for both hypothesis. The ascertainment probability A is calculated as P (A) = = Z Ph X G P (A|P h)f (P h)dy = P (G)[P (A|P h < c) Z X G c −∞ P (G) Z Ph P (A|P h)f (P h|G)dy f (P h|G)dy + P (A|P h ≥ c) Z ∞ c f (P h|G)dy]. (A.4) Some numerical integration will have to be made to obtain P (A). For simplicity the dichotomized data is here assumed to fit a logistic regression model with a logistic link even though a probit link would be more appropriate since the outcome data is generated by a normal distribution. Using the logistic link function ascertainment correction does not have to be made, as argued in Section 2.3, and the likelihood will be l∝ exp(θ) . 1 + exp(θ) (A.5) Note however that β̂0P h is affected by the ascertainment scheme and has to be calculated with this in mind. The expectation of the log likelihoods are also calculated under the ascertainment scheme. Here the true value of θ should be used, even when calculation 57 the expectation of l0 . E[l] = XZ G Ph log(L) P (A|G, P h, θ)f (P h|G, θ)P (G, θ) dy P (A|θ) Z X 1 log(L)P (A|P h, θ)P (P h|G, θ)dy P (G) = P (A|θ) G Ph Z c X 1 log(L)P (P h|G, θ)dy = P (G|θ)P (A|P h < c, θ) P (A|θ) G −∞ Z ∞ X 1 + log(L)P (P h|G, θ)dy P (G|θ)P (A|P h ≥ c, θ) P (A|θ) G c (A.6) Since the observations are iid the expected log likelihood of the whole data set obtained by multiplying the log likelihood with n. B Information matrix in Clayton’s method The contribution of the ith individual to the information matrix is according to Clayton (2003) m X 1 wi.∗ wi.∗ ∗ ∗ 2 ∗ ¯ (I − I ) + ( w (u − ū ) + wij (u∗ij − ū∗i )2 ) (B.1) i i i i i wi + wi.∗ wi + wi.∗ wi + wi.∗ j=1 where ∗ indicates simulated data 58 f (yi ; θ) f (yi ; θ′ ) f (yij∗ ; θ) ∗ wij = f (yij∗ ; θ′ ) wi = wi.∗ = m X ∗ wij j=1 δ log f (yi ; θ) δθ δ log f (yij∗ ; θ) ∗ uij = δθ ui = and wi.∗ (Ii − I¯i∗ ) = 0. wi + wi.∗ If θ is a vector of parameters θ = (β0G , β0P h , βGP h , σ) then for row vectors u, (ui −ū∗i )2 is replaced by (ui −ū∗i )T (ui −ū∗i ) and (u∗ij −ū∗i )2 by (u∗ij −ū∗i )T (u∗ij −ū∗i ) in the expression above. 
√ 2 The derivatives of log(f (P h|θ)) = − log( 2π) − log(σ) − (y−(β0P h2σ+β2 GP h g)) + exp(β0G ) exp(β0G ) g log( 1+exp(β ) + (2 − g) log(1 − 1+exp(β ) denoted 0G ) 0G ) uθ = (uβ0G , uβ0P h , uβGP h , uσ ) are δ log(f ) δβ0G δ log(f ) uβ0P h = δβ0P h δ log(f ) uβGP h = δβGP h δ log(f ) and uσ = δσ uβ0G = (g exp(−β0G ) − 2 + g) (exp(−β0G ) + 1) (ph − (β0P h + βGP h g)) = σ2 g(ph − (β0P h + βGP h g)) = σ2 1 (ph − (β0P h + βGP h g))2 =− + . σ σ3 = 59 References Agardh, E., Ahlbom, A., Andersson, T., Efendic, S., Grill, V., Hallqvist, J., Norman, A. & Ostenson, C. (2003), ‘Work stress and low sense of coherence is associated with type 2 diabetes in middle-aged swedish women.’, Diabetes Care 26(3), 719–24. Allison, D., Heo, M., Schork, N., Wong, S. & Elston, R. (1998), ‘Extreme selection strategies in gene mapping studies of oligogenic quantitative traits do not always increase power.’, Human Hered. 48, 97–107. Armitrage, P. & Colton, T. (1999), Encyclopedia of biostatistics, Vol. 4 & 6, John Wiley & sons, Great Britain, pp. 2791–2793, 4735–4738. Balding, D. J., Bishop, M. & Cannings, C. (2001), Handbook of statistical genetics, Chichester : Wiley, Midsomer Norton, chapter 19 Population Association, David Clayton, pp. 519–540. Burton, P. (2003), ‘Correcting for nonrandom ascertainment in generalized linear mixed models (GLMMs), fitted using gibbs sampling’, Genetic Epidemiology 24, 24–35. Carlson, C. S., Eberle, M. A., Rieder, M. J., Smith, J. D., Kruglyak, L. & Nickerson, D. A. (2003), ‘Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans.’, Nat Genet. 33(4), 518–21. Celeux, G. & Diebolt, J. (1985), ‘The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem’, Computational Statistics [Formerly: Computational Statistics Quarterly] 2, 73–82. Chen, H. Y. (2003), ‘A note on the prospective analysis of outcome-dependent samples’, J.R. Statist. Soc. B 65, Part 2, 575–584. Clayton, D. (2003), ‘Conditional likelihood inference under complex ascertainment using data augmentation.’, Biometrika 90(4), 976–981. Clayton, D. & Hills, M. (1996), Statistical models in epidemiology, Oxford University Press, New York, chapter 29, pp. 294–295. Cohen, J. (1983), ‘The cost of dichotomization’, Applied psychological measurement 7(3), 249–253. 60 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977), ‘Maximum likelihood from incomplete data via the EM algorithm (with discussion)’, Journal of the Royal Statistical Society, Series B, Methodological 39, 1–37. Elston, R., Olson, J. & Palmer, L. (2002), Biostatistical Genetics and genetic epidemiology, Wiley reference series in biostatistics, Wiley, Chippenham Wiltshire, p. 399. Excoffier, L. & Slatkin, M. (1995), ‘Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population’, Mol Biol Evol 12(5), 921–927. Fisher, R. (1934), ‘The effects of methods of ascertainment upon the estimation of frequencies’, Annals of Eugenics 6, 13–25. Geyer, C. J. (1993), ‘Estimating normalizing constants and reweighting mixtures in markov chain monte carlo.’, Technical Report 568, School of Statistics, University of Minnesota . Geyer, C. J. & Thompson, E. A. (1992), ‘Constrained monte carlo maximum likelihood for dependent data’, J.R. Statist. Soc. B 54(3), 657–99. Gilks, W. R., Richardson, S. & Speigelhalter, D. J. (1996), Markov Chain Monte Carlo in practice, first edn, Chapman & Hall, London. Green, P. 
(1992), ‘Discussion on constrained monte carlo maximum likelihood for dependent data (by Geyer, C. J. and Thompson, E. A.)’, J.R. Statist. Soc. B 54, 683–684. Gu, H., Abulaiti, A., Ostenson, C., Humphreys, K., Wahlestedt, C., Brookes, A. & Efendic, S. (2004), ‘Single nucleotide polymorphisms in the proximal promoter region of the adiponectin (APM1) gene are associated with type 2 diabetes in swedish caucasians.’, Diabetes 53, Suppl 1:S31– 5. Hammersly, J. M. & Handscomb, D. C. (1964), Monte Carlo methods, Methuen, London. Heckman, J. (1979), ‘Sample selection bias as a specification error’, Econometrica 47, 153–161. Hesterberg, T. (1995), ‘Weighted average importance sampling and defensive mixture distributions’, Technometrics 37, 185–194. 61 Hirschhorn, J., Lohmueller, K., Byrne, E. & Hirschhorn, K. (2002), ‘A comprehensive review of genetic association studies’, Genetics in Medicine 4, 45–61. Houghton-Mifflin-Company (1993), The American heritage college dictionary, 3rd edn, Houghton Mifflin, Boston. Ip, E. H. S. (1994), ‘A stochastic EM estimator in the presense of missing data- theory and applications.’, Technical report, Department of Statistics, Stanford University . Jarrett, R. (1984), ‘Type 2 (non-insulin-dependent) diabetes mellitus and coronary heart disease - chicken, egg or neither?’, Diabetologia 26, 99– 102. Kagan, A. (2001), ‘A note on the logistic link function’, Biometrika 88(2), 599–601. Kraft, P. & Thomas, D. C. (2000), ‘Bias and efficiency in family-based genecharacterization studies: Conditional, prospective, retrospective, and joint likelihoods’, Am. J. Hum. Genet 66, 1119–1131. Little, R. J. A. & Rubin, D. (1987), Statistical analysis with missing data, John Wiley & Sons, New York; Chichester. Longmate, J. (2001), ‘Complexity and power in case-control association studies.’, Am J Hum Genet. 68(5), 1229–1237. Louis, T. (1982), ‘Finding the observed information matrix when using the EM algorithm’, J.R. Statist. Soc. B 44, 226–233. McLachlan, G. J. & Krishnan, T. (1997), The EM algorithm and extensions, John Wiley & sons Inc, chapter 6. Mendel, J. (1865), ‘Verhandlungen des naturforschenden vereines in brünn’, Abhandlungen 4, 3–47. Morton, N. & Collins, A. (1998), ‘Tests and estimates of allelic association in complex inheritance’, Proc. Natl. Acad. Sci. USA 95, 11389–11393. Neale, M., Eaves, L. & Kendler, K. (1994), ‘The power of the classical twin study to resolve variation in threshold traits.’, Behav Genet 24, 239–258. Neuhaus, J. M. (2000), ‘Closure of the class of binary generalized linear models in some non-standard settings’, J.R. Statist. Soc. B 62(Part 4), 839–846. 62 Neuhaus, J. M. (2002), ‘Bias due to ignoring the sample design in case-control studies.’, Aust. N.Z.J. Stat 44(3), 285–293. Prentice, R. L. & Pyke, R. (1979), ‘Logistic disease incidence models and case-control studies’, Biometrika 66, 403–412. Purcell, S., Cherny, S., Hewitt, J. & Sham, P. (2001), ‘Optimal sibship selection for genotyping in quantitative trait locus linkage analysis’, Hum Hered. 52(1), 1–13. Ripatti, S., Larsen, K. & Palmgren, J. (2002), ‘Maximum likelihood inference inference for multivariate frailty models using an automated monte carlo EM algorithm’, Lifetime Data Analysis 8, 349–360. Robins, J. M., Smoller, J. W. & Lunetta, K. L. (2001), ‘On the validity of the TDT test in the presence of comorbidity and ascertainment bias’, Genetic Epidemiology 21(4), 326–336. Rubin, D. B. & Schenker, N. 
(1991), ‘Multiple imputation in health-care databases: An overview and some applications’, Statistics in Medicine 10, 585–598. Schork, N. J., Nath, S. K., Fallin, D. & Chakravarti, A. (2000), ‘Linkage disequilibrium analysis of biallelic DNA markers, human quantitative trait loci, and threshold-defined case and control subjects’, Am. J. Hum. Genet. 67, 1208–1218. Self, S. G., Mauritsen, R. H., & Ohara, J. (1992), ‘Power calculations for likelihood ratio tests in generalized linear models’, Biometrics 48, 31– 39. Smoller, J., Lunetta, K. & Robins, J. (2000), ‘Implications of comorbidity and ascertainment bias for identifying disease genes.’, Am J Med Genet. 96(6), 817–22. Stern, M. P. (1995), ‘Perspectives in diabetes. diabetes and cardiovascular disease. the ”common soil” hypothesis’, Diabetes 44. Strachan, T. & Reed, A. P. (1999), Human Molecular Genetics 2, second edn, BIOS Scientific Publishers Ltd, Bath, pp. 40,271,273,303. Terwilliger, J. & K.M., W. (2003), ‘Confounding, ascertainment bias, and the blind quest for a genetic ’fountain of youth”, Annals of Medicine 35, 1–13. 63 Terwilliger, J. & Ott, J. (1992), ‘A haplotype-based ’haplotype relative risk’ approach to detecting allelic associations.’, Hum Hered. 42(6), 337–46. The-R-Development-Core-Team (2001), ‘R’, Version 1.4.0. Torrie, G. & Valleu, J. (1977), ‘Nonphysical sampling distributions in monte carlo free-energy estimation:umbrella sampling.’, J. Comput. Phys. 23, 187–199. Vargha, A., Rudas, T., Delaney, H. D. & Maxwell, S. E. (1996), ‘Dichotomization, partial correlation, and conditional independence’, Journal of educational and behavioral statistics 21(3), 264–282. Wei, G. C. G. & Tanner, M. A. (1990), ‘A monte carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms’, Journal of the American Statistical Association 85, 699–704. WHO (2000), ‘Monica project’, ”http://www.ktl.fi/publications/ monica/surveydb/title.htm”. Wu, C. (1983), ‘On the convergence properties of the EM algorithm’, Annals of Statistics 11, 95–103. 64