Statistical Methods For High Dimensional Biology: Stat/Biof/Gsat 540

Download as pdf or txt
Download as pdf or txt
You are on page 1of 63

Statistical Methods for High

Dimensional Biology
STAT/BIOF/GSAT 540

Lecture 13 – Association Studies

Sara Mostafavi
March 18,2019
Outline
• eQTL analysis cont’d
– Including Multiple SNPs
– Prediction in high dimension and overfitting
– Regularized regression

• Epigenome-wide association studies (EWAS)


– Epigenetics and gene regulation
– Challenges: cellular heterogeneity (CH) and
approaches for adjusting for CH

• Oral presentations April 1 & 3


What’s next after finding your GWAS hits?
GWAS “hit”
gene
Expression quantitative trait (eQTL) studies
• eQTL study: identification of associations between genetic variants (single
nucleotide polymorphisms; SNPs) and gene expression levels.

• eQTL: pairwise association between SNP i’s genotype and gene j’s
expression levels
Corr
elati
Genotype data on : e
QTL
Samples

P Expression data
SN A
me
to

Samples
P sc rip
SN C n vel
s
Tra NA
l e
mR

4
Expression quantitative trait (eQTL) studies
• eQTL study: identification of associations between genetic variants (single
nucleotide polymorphisms; SNPs) and gene expression levels.
Corr
elati
Genotype data on : e
QTL
Samples

• Could be “static” or “context specific”


Expression data
• i.e., Because gene
SN A expression is dynamic, some
P
e
m
eQTLs areNonly observable ipin
t the appropriate
o

Samples
P c r
S C a ns vel
s
context Tr
m R NA
l e

• Most eQTLs are “static” and hence “reference


population” studies are typically used.

5
• Consortium project looking doing eQTL analysis based on gene
expression from over 50 different tissues

• >80% of eQTLs are found in multiple tissues

• Tissue-specific eQTLs are fewer, but still very informative of gene


regulation mechanisms.
Data: eQTL studies
Expression data
Genotype data

Samples
Samples

• Simplify the problem: look at one SNP and one gene at a


time
– E.g., are gene expression level of gene i and genotype of
SNP j associated?
– Take a 1Mb “window” around the start of each gene (TSS:
transcription start side). Typically, every SNPs in a 1Mb
window is tested for association in an eQTL analysis.
Statistics of eQTL studies
• Simplify the problem: look at one SNP and one gene at a
time
– E.g., are gene expression level of gene i and genotype of SNP j
associated?
– Continuous outcome, categorical explanatory variable(s) (Linear
regression or Anova)
!" = $ + &'( + *⃗"(
Considering multiple SNPs to model gene
expression levels

SNP k
Gene i

• Can we include all the SNPs in one model?


Considering multiple SNPs to model gene
expression levels

SNP k
Gene i

• Can we include all the SNPs in one model?


• What are some solutions?
Considering multiple SNPs to model gene
expression levels

SNP k
Gene i

• Can we include all the SNPs in one model?


• What are some solutions?
- Dimensionality reduction (PCA)
- Clustering of SNPs
- “Clumping” based on correlation/LD regions
- Regularized regression
Why you shouldn’t include “many” variables in
your model
• Generate completely random data where no relationship
between X and y exist:
– 100 observations (y~N(0,1))
– Variable number of “explanatory variables” {10,30, 90}
(X~N(0,I))

• Fit a linear regression model, and get the coefficients

"! = (% & %)() % & *

• Predict y
*+ = %"!
Why you shouldn’t include “many” variables in
your model
• Generate completely random data where no relationship between X and y
exist:
– 100 observations (y~N(0,1))
– Variable number of “explanatory variables” {10,30, 90} (X~N(0,I))
Why you shouldn’t include “many” variables in
your model
• Completely random data:
– 100 observations (y~N(0,1))
– Variable number of “explanatory variables” {10,30, 90}
(X~N(0,I))
Why you shouldn’t include “many” variables in
your model
• Completely random data:
– 100 observations (y~N(0,1))
– Variable number of “explanatory variables” {10,30, 90}
(X~N(0,I))
Why you shouldn’t include “many” variables in
your model
• Completely random data:
100 observations (y~N(0,1))
• Be– aware of how many variables you are using in the model
– Variable number of “explanatory variables” {10,30, 90}
(X~N(0,I))
• Use adjusted R-squared instead of R-squared (i.e., %
variance explained is not meaningful when you have too
many variables)

• Rule of thumb: In Ordinary Linear Regression, you need 20


data points (observations) to estimate each additional
parameter/coefficient
Reminder Regularized Regression
• Determine the model’s parameter by optimizing
penalized likelihood (i.e., not the “raw” likelihood)
Reminder Regularized Regression
• Determine the model’s parameter by optimizing penalized
likelihood (i.e., not the “raw” likelihood)

• Review of MLE for fitting the parameters:


! "# ~%('# (, * +)
Log likelihood of data is proportional to
“squared error”

(- = argmin/ ||1 − 3(||++


Reminder Regularized Regression
• Determine the model’s parameter by optimizing
penalized likelihood (i.e., not the “raw” likelihood)

! "# ~%('# (, * +)
Log likelihood of data is proportional to
“squared error”

(- = argmin/ ||1 − 3(||++

Overfitting in high D; Empirically, all estimates will have


measures the fit of the non-zero values
model
Reminder Regularized Regression
• Determine the model’s parameter by optimizing
penalized likelihood (i.e., not the “raw” likelihood)

Tuning parameter

"! = argmin$ ||& − ("||)) + +,(")

Model fit Penalization/regularization


(~log data likelihood) term
Reminder Regularized Regression
• Determine the model’s parameter by optimizing
penalized likelihood (i.e., not the “raw” likelihood)

"! = argmin$ ||& − ("||)) + +,(")

3
ridge , " = ||"||)) = / "0)
012

• Ridge Regression still has a closed form solution and a


probabilistic interpretation
Reminder Regularized Regression
• Determine the model’s parameter by optimizing
penalized likelihood (i.e., not the “raw” likelihood)
2

"! = argmin$ ||& − ("||)) + +, " ; , " = ||"||)) = . "/)


/01

"! = argmin$ ||& − ("||)) such that ||"||)) ≤ :

Mathematically equivalent statements for appropriate tuning


parameters + and :
Reminder Regularized Regression
• Determine the model’s parameter by optimizing
penalized likelihood (i.e., not the “raw” likelihood)
2

"! = argmin$ ||& − ("||)) + +, " ; , " = ||"||)) = . "/)


/01

(Probabilistic derivation on board)


Reminder Regularized Regression
• Determine the model’s parameter by optimizing
penalized likelihood (i.e., not the “raw” likelihood)

"! = argmin$ ||& − ("||)) + +,(")

3
ridge , " = ||"||)) = / "0)
012
lasso , " = |"|= ∑3012 |"0 |
Reminder Regularized Regression
• Determine the model’s parameter by optimizing
penalized likelihood (i.e., not the “raw” likelihood)

"! = argmin$ ||& − ("||)) + +,(")

3
ridge , " = ||"||)) = / "0)
012
lasso , " = |"|= ∑3012 |"0 |

Elastic , " = 5|"|+ (1 − 5)||"||)) = 5 ∑3012 "0 + (1 − 5) ∑3012 "0)


net
Regularized regression in practice
• Need to select/set tuning constant !

• Common mistake:
– Do “feature selection” outside of cross validation
(e.g., correlate each feature/SNP with outcome, select
the correlated ones and then fit the model)
– ! needs to be selected in nested CV. i.e., performance
should be reported on test set and not validation set.
Outline
• eQTL analysis cont’d
– Including Multiple SNPs
– Prediction in high dimension and overfitting
– Regularized regression
• Epigenome-wide association studies (EWAS)
– Epigenetics and gene regulation
– Challenges: cellular heterogeneity (CH) and
approaches for adjusting for CH
Deriving mechanisms: from genomics data to mechanism to
complex disease
Epigenome

SNP DNAm
A

SNP
C
Genome

mRNA Complex
Transcriptome
disease

Proteome

Pathways, Gene Networks;


“Interactome”
• Histone molecules help package and order DNA into structural units

• Addition of other molecules (histone modification) to histones


provides a code for interpreting the “state” of a piece of DNA (e.g.,
active)
https://www.shmoop.com/dna/dna-packaging.html
…. Epigenetic marks can also sit on the DNA (DNA methylation)
Epigenetic modifications of DNA are associated with
how tightly packed a region of DNA is

… Which is then associated with transcription of genes


in that region

…. On average, genes in tightly packed regions have


lower transcription, but this relationship depends on
what and where the epigenetic marks are.
Measuring DNA methylation with
Measuring%DNAm
arrays
Measuring%DNAm
In adult mammals, DNA methylation is largely
In adult mammals, DNA methylation occurs largely at CpG dinucleotides
restricted to CpG dinucleotides
In adult mammals, DNA methylation is largely
restricted to CpG dinucleotides

AC G G C GTCAGTTAC G
CpG A
1 C G
CpGG2 C G T CAG T T 3A C G
CpG

Subject 1
Subject 2

Subject n
CpG 1 CpG 2 CpG 3

Subject 1
Subject 2
.
.
.
.
.
.
CpG 1
CpG 2
CpG 1

CpG 2

20% 100%

CpG m

20% 100%
CpG m

Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7


Measuring DNA methylation with
Measuring%DNAm
arrays
Measuring%DNAm
In adult mammals, DNA methylation is largely
In adult mammals, DNA methylation occurs largely at CpG dinucleotides
restricted to CpG dinucleotides
In adult mammals, DNA methylation is largely
restricted to CpG dinucleotides

AC G G C GTCAGTTAC G
CpG A
1 C G
CpGG2 C G T CAG T T 3A C G
CpG

Subject 1
Subject 2

Subject n
CpG 1 CpG 2 CpG 3

Subject 1
Subject 2
.
.
.
.
.
.
CpG 1
CpG 2
• What would the data look like if we could

measure
CpG 1 it exactly?
CpG 2

20% 100%

CpG m

20% 100%
CpG m

Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7


Measuring DNA methylation with
Measuring%DNAm
arrays
In adult mammals, DNA methylation is largely
In adult mammals, DNA methylation occurs largely at CpG dinucleotides
restricted to CpG dinucleotides

AC G G C GTCAGTTAC G
CpG
CpG 11 CpG
CpG 22 CpG
CpG 33

Subject11
Subject22

Subjectnn
Subject
Subject

Subject
..
..
..
Arrays average the signal across multiple cells
presentsCpG
in a sample from one
CpG 11
subject/individual
CpG
CpG 22




20%
20% 100%
100%
CpG
CpG m
m

Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7


Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7
Measuring DNA methylation with
Measuring%DNAm
arrays
In adult mammals, DNA methylation is largely
In adult mammals, DNA methylation occurs largely at CpG dinucleotides
restricted to CpG dinucleotides

AC G G C GTCAGTTAC G
CpG
CpG 11 CpG
CpG 22 CpG
CpG 33

Subject11
Subject22

Subjectnn
Subject
Subject

Subject
..
..
..
CpG
CpG 11
CpG
CpG 22




20%
20% 100%
100%
CpG
CpG m
m

Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7


Adapted from Ge et al., Environ Health Perspect; DOI:10.1289/ehp.1307047 7
xWAS
• GWAS (genome-wide association study)
assessing the association between genetic
factors and phenotypic traits

• EWAS: epigenome wide association study

• TWAS: transcriptome wide association study


xWAS
• GWAS (genome-wide association study)
assessing the association between genetic
factors and phenotypic traits

• EWAS: epigenome wide association study

• TWAS: transcriptome wide association study


E/TWAS opportunities and challenges
Pros
• Capture joint effect of genetics and environment (past and
present exposures)
• Closer to disease in the chain of associations and hence
potential for larger effect sizes (and more statistical power)

Cons
• Causality can not be inferred (in a typical human study)
• Impact on gene function needs to be worked out (as in
GWAS)
• Tissue of choice (relevant vs accessible)
• Cellular heterogeneity
Human tissues are complex
• Large number of unique cell types in any given tissue
• Cell types differ from each other in their epigenomes
and transcriptomes
Human tissues are complex
DMs sites and DE genes detected in bulk tissue
can represent:
• Differential expression within the cell
• Differential cell type composition between
Genomic studies in complex tissues
samples
Sample/individual 1 Sample/individual 2

vs.

Adapted from
Sandberg, 2014
Cellular heterogeneity
• Blood as a motivating example
Cellular heterogeneity
• Gene expression (and other molecular trait)
vary across cell types
“cell type stratification”
Approaches for correcting for cell type
heterogeneity

• Unsupervised: “reference-free”
– No need for cell sorted data, or specifying which
cells to correct for
– Over correction?

• Semi-supervised / Supervised
– Informed by known cell types and their “markers”
Reference-free (i.e., unsupervised)
deconvolution
• Assume “primary” signal in your data is related to cell counts
• Motivation: few studies have shown such primary signals
correlate with cell type proportions

X = SDU … -log P-values


T cell prop
Monocyte Prop
..
Sex
Principal components

...
...
Batch
Top 30 Principal Components
Reference-free procedure
1. Compute SVD decomposition, and get the
top k principal components (PCs) (hint: k
vectors of length n; n=sample size)

2. Include the top k PCs in your regular linear


regression analysis as extra covariates

y = X β + S1:kγ
Include K PCs in your model
Reference-free procedure
1. Compute SVD decomposition, and get the
top k principal components (PCs) (hint: k
vectors of length n; n=sample size)

2. You could also compute the residual after


regressing out the top k PCs

yres = y − S1:kγ
OLS coefficient vector
Reference free methods
• Based on linear mixed effect models
– EWASHER (Zou, Nature Methods 2013)
• Based on probabilistic factor analysis
– PEER (Stegle, PLOS Comp Bio 2010)
– PANAMA (Fusi, PLOS Comp Bio 2011)
• Based on PCA
– SVA (Leek and Storey, PLOS Genetics 2007)
– Sparse PCA (Rahmani, Nature Methods 2016)
Reference free: more nuanced methods

• Huge amount of variability is removed, is there a more subtle way


to do this?
• Sparse PCA (ReFractor)
– Perform PCA on a subset of probes
– L1 sparsity penalty on PCs, so to simulate the behavior of small
signature gene/probe list
• What is the main issue with completely
unsupervised approaches?
• What are some potential issues with
completely unsupervised approaches?
– What if top PCs also correlate with outcome of
interest?
Reference based approaches
All reference based approaches consists of two
steps:
1) Signature selection: select a set of marker genes
for each cell type of interest using cell sorted
data.
2) Estimate the relative score for each individual
and each cell types based on factor analysis on
limited set of probes/genes.
Signature gene selection

• Use “existing” cell sorted data for K


cell types.

• Simple differential
expression/methylation analysis
(e.g. t-test) comparing one cell type
to the rest.

• Identifying 10-100 signature


genes/CpGs.

[Abbas et al., PLOS ONE 2009]


Same story for DNAm data
DNAm signature for 6 immune cell types

[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-86]
Semi-supervised
• Example: RUV
• Motivation: batch effects
RUV
1. Define a set of negative control “features”
2. Log transform and Standardize data
3. Take SVD of data, only considering the negative
control features
4. Use linear regression the remove the effect of PCs
(from SVD) from the entire normalized data
5. Obtain the residuals
Other semi-supervised based methods
• DSA

• CellCode
How good is the predictions?
• Generate cell counts with FACS and compare
predictions to observed.

• Some times validation data are too easy:


“artificial mixture”

• Compare quantile-quantile plot of “before”


and “after” correction.
[Abbas et al., PLOS ONE 2009]
Discussion
• What “resolution” are you aiming to correct at?
– “broad” cell types? Or refined ones?
– We can potentially view DNAm/gene expression data as
mostly containing signal about proportion of broad and
narrow cell types
– Implicit decision (number of PCs) vs explicit (reference-
based number of cells)
• Absolute values vs relative ranking
• Do you apply the algorithm on raw or corrected data?
Outline
• eQTL analysis cont’d
– Including Multiple SNPs
– Prediction in high dimension and overfitting
– Regularized regression

• Epigenome-wide association studies (EWAS)


– Epigenetics and gene regulation
– Challenges: cellular heterogeneity (CH) and
approaches for adjusting for CH

• Oral presentations April 1 & 3


Plan for oral presentations
• 14 groups; 20 min each – 15min presentation
+ 5min QA
• Need to start at 9:10 on April 1 and April 3
• We’ll take up the entire seminar session (11-
1pm) on Wed April 3
• Schedule and details will be on GitHub
Discussion repo
Link to doc describing oral presentation
assignment/notes
https://docs.google.com/document/d/1K_QWFNb3Kn8rGoelk837l-
FPLwbkLPB4oWEeEZmwBvc/edit?usp=sharing

You might also like