Statistical Methods For High Dimensional Biology: Stat/Biof/Gsat 540

Statistical Methods for High
Dimensional Biology
STAT/BIOF/GSAT 540
Lecture 13 – Association Studies
Sara Mostafavi
March 18,2019
Outline
• eQTL analysis cont’d
– Including Multiple SNPs
– Prediction in high dimension and overfitting
– Regularized regression
• Epigenome-wide association studies (EWAS)

– Epigenetics and gene regulation
– Challenges: cellular heterogeneity (CH) and
approaches for adjusting for CH
• Oral presentations April 1 & 3

What’s next after finding your GWAS hits?
GWAS “hit”
gene
Expression quantitative trait (eQTL) studies
• eQTL study: identification of associations between genetic variants (single
nucleotide polymorphisms; SNPs) and gene expression levels.
• eQTL: pairwise association between SNP i’s genotype and gene j’s
expression levels
Corr
elati
Genotype data on : e
QTL
Samples
P Expression data
SN A
me
to
Samples
P sc rip
SN C n vel
s
Tra NA
l e
mR
4
Expression quantitative trait (eQTL) studies
• eQTL study: identification of associations between genetic variants (single
nucleotide polymorphisms; SNPs) and gene expression levels.
Corr
elati
Genotype data on : e
QTL
Samples
• Could be “static” or “context specific”

Expression data
• i.e., Because gene
SN A expression is dynamic, some
P
e
m
eQTLs areNonly observable ipin
t the appropriate
o
Samples
P c r
S C a ns vel
s
context Tr
m R NA
l e
• Most eQTLs are “static” and hence “reference

population” studies are typically used.
5
• Consortium project looking doing eQTL analysis based on gene
expression from over 50 different tissues
• >80% of eQTLs are found in multiple tissues
• Tissue-specific eQTLs are fewer, but still very informative of gene

regulation mechanisms.
Data: eQTL studies
Expression data
Genotype data
Samples
Samples
• Simplify the problem: look at one SNP and one gene at a

time
– E.g., are gene expression level of gene i and genotype of
SNP j associated?
– Take a 1Mb “window” around the start of each gene (TSS:
transcription start side). Typically, every SNPs in a 1Mb
window is tested for association in an eQTL analysis.
Statistics of eQTL studies
• Simplify the problem: look at one SNP and one gene at a
time
– E.g., are gene expression level of gene i and genotype of SNP j
associated?
– Continuous outcome, categorical explanatory variable(s) (Linear
regression or Anova)
!" = $ + &'( + *⃗"(
Considering multiple SNPs to model gene
expression levels
SNP k
Gene i
• Can we include all the SNPs in one model?

expression levels
SNP k
Gene i

• What are some solutions?
expression levels
SNP k
Gene i

• What are some solutions?
- Dimensionality reduction (PCA)
- Clustering of SNPs
- “Clumping” based on correlation/LD regions
- Regularized regression
Why you shouldn’t include “many” variables in
your model
• Generate completely random data where no relationship
between X and y exist:
– 100 observations (y~N(0,1))
– Variable number of “explanatory variables” {10,30, 90}
(X~N(0,I))
• Fit a linear regression model, and get the coefficients
"! = (% & %)() % & *
• Predict y
*+ = %"!
your model
• Generate completely random data where no relationship between X and y
exist:
– Variable number of “explanatory variables” {10,30, 90} (X~N(0,I))
your model
• Completely random data:
(X~N(0,I))
your model
(X~N(0,I))
your model
100 observations (y~N(0,1))
• Be– aware of how many variables you are using in the model
(X~N(0,I))
• Use adjusted R-squared instead of R-squared (i.e., %
variance explained is not meaningful when you have too
many variables)
• Rule of thumb: In Ordinary Linear Regression, you need 20

data points (observations) to estimate each additional
parameter/coefficient
Reminder Regularized Regression
• Determine the model’s parameter by optimizing
penalized likelihood (i.e., not the “raw” likelihood)
• Determine the model’s parameter by optimizing penalized
likelihood (i.e., not the “raw” likelihood)
• Review of MLE for fitting the parameters:

! "# ~%('# (, * +)
Log likelihood of data is proportional to
“squared error”
(- = argmin/ ||1 − 3(||++

! "# ~%('# (, * +)
Log likelihood of data is proportional to
“squared error”
(- = argmin/ ||1 − 3(||++
Overfitting in high D; Empirically, all estimates will have

measures the fit of the non-zero values
model
Tuning parameter
"! = argmin$ ||& − ("||)) + +,(")
Model fit Penalization/regularization

(~log data likelihood) term
"! = argmin$ ||& − ("||)) + +,(")
3
ridge , " = ||"||)) = / "0)
012
• Ridge Regression still has a closed form solution and a

probabilistic interpretation
2
"! = argmin$ ||& − ("||)) + +, " ; , " = ||"||)) = . "/)

/01
"! = argmin$ ||& − ("||)) such that ||"||)) ≤ :
Mathematically equivalent statements for appropriate tuning

parameters + and :
2
"! = argmin$ ||& − ("||)) + +, " ; , " = ||"||)) = . "/)

/01
(Probabilistic derivation on board)

"! = argmin$ ||& − ("||)) + +,(")
3
ridge , " = ||"||)) = / "0)
012
lasso , " = |"|= ∑3012 |"0 |
"! = argmin$ ||& − ("||)) + +,(")
3
ridge , " = ||"||)) = / "0)
012
lasso , " = |"|= ∑3012 |"0 |
Elastic , " = 5|"|+ (1 − 5)||"||)) = 5 ∑3012 "0 + (1 − 5) ∑3012 "0)

net
Regularized regression in practice
• Need to select/set tuning constant !
• Common mistake:
– Do “feature selection” outside of cross validation
(e.g., correlate each feature/SNP with outcome, select
the correlated ones and then fit the model)
– ! needs to be selected in nested CV. i.e., performance
should be reported on test set and not validation set.
Outline
Deriving mechanisms: from genomics data to mechanism to
complex disease
Epigenome
SNP DNAm
A
SNP
C
Genome
mRNA Complex
Transcriptome
disease
Proteome
Pathways, Gene Networks;

“Interactome”
• Histone molecules help package and order DNA into structural units
• Addition of other molecules (histone modification) to histones

provides a code for interpreting the “state” of a piece of DNA (e.g.,
active)
https://www.shmoop.com/dna/dna-packaging.html
…. Epigenetic marks can also sit on the DNA (DNA methylation)
Epigenetic modifications of DNA are associated with
how tightly packed a region of DNA is
… Which is then associated with transcription of genes

in that region
…. On average, genes in tightly packed regions have

lower transcription, but this relationship depends on
what and where the epigenetic marks are.
Measuring DNA methylation with
Measuring%DNAm
arrays
Measuring%DNAm
In adult mammals, DNA methylation is largely
In adult mammals, DNA methylation occurs largely at CpG dinucleotides
restricted to CpG dinucleotides
AC G G C GTCAGTTAC G
CpG A
1 C G
CpGG2 C G T CAG T T 3A C G
CpG
Subject 1
Subject 2
Subject n
CpG 1 CpG 2 CpG 3
Subject 1
Subject 2
.
.
.
.
.
.
CpG 1
CpG 2
CpG 1
…
CpG 2
…
20% 100%
…
CpG m
…
20% 100%
CpG m
Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7

Measuring%DNAm
arrays
Measuring%DNAm
CpG A
1 C G
CpGG2 C G T CAG T T 3A C G
CpG
Subject 1
Subject 2
Subject n
CpG 1 CpG 2 CpG 3
Subject 1
Subject 2
.
.
.
.
.
.
CpG 1
CpG 2
• What would the data look like if we could
…
measure
CpG 1 it exactly?
CpG 2
…
20% 100%
…
CpG m
…
20% 100%
CpG m

Measuring%DNAm
arrays
CpG
CpG 11 CpG
CpG 22 CpG
CpG 33
Subject11
Subject22
Subjectnn
Subject
Subject
Subject
..
..
..
Arrays average the signal across multiple cells
presentsCpG
in a sample from one
CpG 11
subject/individual
CpG
CpG 22
…
…
…
…
20%
20% 100%
100%
CpG
CpG m
m

Measuring%DNAm
arrays
CpG
CpG 11 CpG
CpG 22 CpG
CpG 33
Subject11
Subject22
Subjectnn
Subject
Subject
Subject
..
..
..
CpG
CpG 11
CpG
CpG 22
…
…
…
…
20%
20% 100%
100%
CpG
CpG m
m

xWAS
• GWAS (genome-wide association study)
assessing the association between genetic
factors and phenotypic traits
• EWAS: epigenome wide association study
• TWAS: transcriptome wide association study

xWAS
• GWAS (genome-wide association study)
assessing the association between genetic
factors and phenotypic traits
• EWAS: epigenome wide association study
• TWAS: transcriptome wide association study

E/TWAS opportunities and challenges
Pros
• Capture joint effect of genetics and environment (past and
present exposures)
• Closer to disease in the chain of associations and hence
potential for larger effect sizes (and more statistical power)
Cons
• Causality can not be inferred (in a typical human study)
• Impact on gene function needs to be worked out (as in
GWAS)
• Tissue of choice (relevant vs accessible)
• Cellular heterogeneity
Human tissues are complex
• Large number of unique cell types in any given tissue
• Cell types differ from each other in their epigenomes
and transcriptomes
Human tissues are complex
DMs sites and DE genes detected in bulk tissue
can represent:
• Differential expression within the cell
• Differential cell type composition between
Genomic studies in complex tissues
samples
Sample/individual 1 Sample/individual 2
vs.
Adapted from
Sandberg, 2014
Cellular heterogeneity
• Blood as a motivating example
Cellular heterogeneity
• Gene expression (and other molecular trait)
vary across cell types
“cell type stratification”
Approaches for correcting for cell type
heterogeneity
• Unsupervised: “reference-free”
– No need for cell sorted data, or specifying which
cells to correct for
– Over correction?
• Semi-supervised / Supervised
– Informed by known cell types and their “markers”
Reference-free (i.e., unsupervised)
deconvolution
• Assume “primary” signal in your data is related to cell counts
• Motivation: few studies have shown such primary signals
correlate with cell type proportions
X = SDU … -log P-values

T cell prop
Monocyte Prop
..
Sex
Principal components
…
...
...
Batch
Top 30 Principal Components
Reference-free procedure
1. Compute SVD decomposition, and get the
top k principal components (PCs) (hint: k
vectors of length n; n=sample size)
2. Include the top k PCs in your regular linear

regression analysis as extra covariates
y = X β + S1:kγ
Include K PCs in your model
Reference-free procedure
1. Compute SVD decomposition, and get the
top k principal components (PCs) (hint: k
vectors of length n; n=sample size)
2. You could also compute the residual after

regressing out the top k PCs
yres = y − S1:kγ
OLS coefficient vector
Reference free methods
• Based on linear mixed effect models
– EWASHER (Zou, Nature Methods 2013)
• Based on probabilistic factor analysis
– PEER (Stegle, PLOS Comp Bio 2010)
– PANAMA (Fusi, PLOS Comp Bio 2011)
• Based on PCA
– SVA (Leek and Storey, PLOS Genetics 2007)
– Sparse PCA (Rahmani, Nature Methods 2016)
Reference free: more nuanced methods
• Huge amount of variability is removed, is there a more subtle way

to do this?
• Sparse PCA (ReFractor)
– Perform PCA on a subset of probes
– L1 sparsity penalty on PCs, so to simulate the behavior of small
signature gene/probe list
• What is the main issue with completely
unsupervised approaches?
• What are some potential issues with
completely unsupervised approaches?
– What if top PCs also correlate with outcome of
interest?
Reference based approaches
All reference based approaches consists of two
steps:
1) Signature selection: select a set of marker genes
for each cell type of interest using cell sorted
data.
2) Estimate the relative score for each individual
and each cell types based on factor analysis on
limited set of probes/genes.
Signature gene selection
• Use “existing” cell sorted data for K

cell types.
• Simple differential
expression/methylation analysis
(e.g. t-test) comparing one cell type
to the rest.
• Identifying 10-100 signature

genes/CpGs.
[Abbas et al., PLOS ONE 2009]

Same story for DNAm data
DNAm signature for 6 immune cell types
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-86]
Semi-supervised
• Example: RUV
• Motivation: batch effects
RUV
1. Define a set of negative control “features”
2. Log transform and Standardize data
3. Take SVD of data, only considering the negative
control features
4. Use linear regression the remove the effect of PCs
(from SVD) from the entire normalized data
5. Obtain the residuals
Other semi-supervised based methods
• DSA
• CellCode
How good is the predictions?
• Generate cell counts with FACS and compare
predictions to observed.
• Some times validation data are too easy:

“artificial mixture”
• Compare quantile-quantile plot of “before”

and “after” correction.
[Abbas et al., PLOS ONE 2009]
Discussion
• What “resolution” are you aiming to correct at?
– “broad” cell types? Or refined ones?
– We can potentially view DNAm/gene expression data as
mostly containing signal about proportion of broad and
narrow cell types
– Implicit decision (number of PCs) vs explicit (reference-
based number of cells)
• Absolute values vs relative ranking
• Do you apply the algorithm on raw or corrected data?
Outline

• Oral presentations April 1 & 3

Plan for oral presentations
• 14 groups; 20 min each – 15min presentation
+ 5min QA
• Need to start at 9:10 on April 1 and April 3
• We’ll take up the entire seminar session (11-
1pm) on Wed April 3
• Schedule and details will be on GitHub
Discussion repo
Link to doc describing oral presentation
assignment/notes
https://docs.google.com/document/d/1K_QWFNb3Kn8rGoelk837l-
FPLwbkLPB4oWEeEZmwBvc/edit?usp=sharing

Statistical Methods For High Dimensional Biology: Stat/Biof/Gsat 540

Uploaded by

Copyright:

Available Formats

Statistical Methods For High Dimensional Biology: Stat/Biof/Gsat 540

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Statistical Methods For High Dimensional Biology: Stat/Biof/Gsat 540

Uploaded by

Copyright:

Available Formats

Statistical Methods for High

Lecture 13 – Association Studies

• Epigenome-wide association studies (EWAS)

• Oral presentations April 1 & 3

• Could be “static” or “context specific”

• Most eQTLs are “static” and hence “reference

• >80% of eQTLs are found in multiple tissues

• Tissue-specific eQTLs are fewer, but still very informative of gene

• Simplify the problem: look at one SNP and one gene at a

• Can we include all the SNPs in one model?

• Can we include all the SNPs in one model?

• Can we include all the SNPs in one model?

• Fit a linear regression model, and get the coefficients

"! = (% & %)() % & *

• Rule of thumb: In Ordinary Linear Regression, you need 20

• Review of MLE for fitting the parameters:

(- = argmin/ ||1 − 3(||++

(- = argmin/ ||1 − 3(||++

Overfitting in high D; Empirically, all estimates will have

"! = argmin$ ||& − ("||)) + +,(")

Model fit Penalization/regularization

"! = argmin$ ||& − ("||)) + +,(")

• Ridge Regression still has a closed form solution and a

"! = argmin$ ||& − ("||)) + +, " ; , " = ||"||)) = . "/)

"! = argmin$ ||& − ("||)) such that ||"||)) ≤ :

Mathematically equivalent statements for appropriate tuning

"! = argmin$ ||& − ("||)) + +, " ; , " = ||"||)) = . "/)

(Probabilistic derivation on board)

"! = argmin$ ||& − ("||)) + +,(")

"! = argmin$ ||& − ("||)) + +,(")

Elastic , " = 5|"|+ (1 − 5)||"||)) = 5 ∑3012 "0 + (1 − 5) ∑3012 "0)

Pathways, Gene Networks;

• Addition of other molecules (histone modification) to histones

… Which is then associated with transcription of genes

…. On average, genes in tightly packed regions have

Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7

Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7

Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7

Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7

• EWAS: epigenome wide association study

• TWAS: transcriptome wide association study

• EWAS: epigenome wide association study

• TWAS: transcriptome wide association study

X = SDU … -log P-values

2. Include the top k PCs in your regular linear

2. You could also compute the residual after

• Huge amount of variability is removed, is there a more subtle way

• Use “existing” cell sorted data for K

• Identifying 10-100 signature

[Abbas et al., PLOS ONE 2009]

• Some times validation data are too easy:

• Compare quantile-quantile plot of “before”

• Epigenome-wide association studies (EWAS)

• Oral presentations April 1 & 3

You might also like