Statistical Methods For High Dimensional Biology: Stat/Biof/Gsat 540
Sara Mostafavi
March 18, 2019
Outline
• eQTL analysis cont’d
– Including Multiple SNPs
– Prediction in high dimension and overfitting
– Regularized regression
• eQTL: pairwise association between SNP i’s genotype and gene j’s
expression levels
[Figure: genotype data (SNPs × samples) and expression data (transcript mRNA levels × samples); correlation between the two: eQTL]
Expression quantitative trait (eQTL) studies
• eQTL study: identification of associations between genetic variants (single
nucleotide polymorphisms; SNPs) and gene expression levels.
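A minimal sketch of one pairwise SNP-gene association test, assuming genotypes are coded as allele dosage (0/1/2) and the association is measured by simple linear regression. The function name `eqtl_association`, the simulated data, and the effect size are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def eqtl_association(genotype, expression):
    """Regress expression on genotype dosage; return slope, Pearson r, t-statistic."""
    g = np.asarray(genotype, dtype=float)
    e = np.asarray(expression, dtype=float)
    n = g.size
    r = np.corrcoef(g, e)[0, 1]
    # In simple linear regression, slope = r * sd(e) / sd(g)
    slope = r * e.std(ddof=1) / g.std(ddof=1)
    # t-statistic for testing slope = 0, with n - 2 degrees of freedom
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return slope, r, t_stat

rng = np.random.default_rng(0)
n_samples = 200
genotype = rng.binomial(2, 0.3, size=n_samples)            # simulated allele dosage
expression = 1.0 * genotype + rng.normal(size=n_samples)   # simulated true effect of 1.0
slope, r, t_stat = eqtl_association(genotype, expression)
print(f"slope={slope:.2f}, r={r:.2f}, t={t_stat:.2f}")
```

In a full eQTL scan this test would be repeated for every SNP-gene pair, which is why multiple-testing correction matters.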
• Consortium project performing eQTL analysis based on gene
expression from over 50 different tissues
[Figure: association between SNP k and Gene i across samples, repeated over several panels]
• Predict y: $\hat{y} = X\hat{\beta}$
Why you shouldn’t include “many” variables in
your model
• Generate completely random data where no relationship between X and y
exists:
– 100 observations (y ~ N(0,1))
– Variable number of “explanatory variables” {10, 30, 90} (X ~ N(0,I))
Why you shouldn’t include “many” variables in
your model
• Be aware of how many variables you are using in the model
• Use adjusted R-squared instead of R-squared (i.e., %
variance explained is not meaningful when you have too
many variables)
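The experiment described above can be sketched as follows: fit ordinary least squares to pure noise with 10, 30, and 90 predictors and compare R-squared with adjusted R-squared. The helper `r_squared` and the random seed are illustrative assumptions. The adjusted form used is 1 - (1 - R²)(n - 1)/(n - p - 1).

```python
import numpy as np

def r_squared(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit of y on X with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2

rng = np.random.default_rng(0)
n = 100
y = rng.normal(size=n)               # outcome is pure noise
results = {}
for p in (10, 30, 90):
    X = rng.normal(size=(n, p))      # predictors are unrelated to y
    results[p] = r_squared(X, y)
    print(f"p={p}: R2={results[p][0]:.2f}, adjusted R2={results[p][1]:.2f}")
```

R-squared grows with the number of predictors even though there is no signal, while adjusted R-squared penalizes the extra parameters.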
$y_i \sim N(x_i^\top \beta, \sigma^2)$
Log likelihood of data is proportional to
“squared error”
Tuning parameter: $\lambda$
ridge penalty: $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$
lasso penalty: $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$
Reminder: Regularized Regression
• Determine the model’s parameters by optimizing the
penalized likelihood (i.e., not the “raw” likelihood)
ridge penalty: $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$
lasso penalty: $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$
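As a rough illustration of the ridge case, assume the penalized objective is the squared error plus λ times the ridge penalty. Ridge then has the closed-form solution (XᵀX + λI)⁻¹Xᵀy, which exists even when there are more predictors than samples. The function `ridge_fit` and the simulated data are assumptions for illustration. The lasso has no closed form and is not shown.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 200                        # more predictors than samples
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                   # only five predictors carry signal
y = X @ beta_true + rng.normal(size=n)

# Larger tuning parameter values shrink the coefficient vector toward zero
norms = {lam: np.linalg.norm(ridge_fit(X, y, lam)) for lam in (0.1, 1.0, 10.0, 100.0)}
for lam, norm in norms.items():
    print(f"lambda={lam}: ||beta||_2 = {norm:.3f}")
```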
• Common mistake:
– Doing “feature selection” outside of cross validation
(e.g., correlate each feature/SNP with the outcome, select
the correlated ones, and then fit the model)
– $\lambda$ needs to be selected in nested CV, i.e., performance
should be reported on the test set and not the validation set.
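A small sketch of why selecting features outside of cross validation is a mistake, using pure-noise data so that any apparent predictive performance is an artifact. The helpers `top_k_features` and `cv_predict`, and the sizes n, p, and k, are illustrative assumptions. The sketch covers only the feature-selection leak and does not implement nested CV for the tuning parameter.

```python
import numpy as np

def top_k_features(X, y, k):
    """Indices of the k features with the largest absolute correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]

def cv_predict(X, y, k, folds, select_inside_cv):
    """Held-out OLS predictions; features chosen inside or outside the CV loop."""
    n = len(y)
    preds = np.empty(n)
    leaky_idx = top_k_features(X, y, k)          # uses ALL samples, including test folds
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        idx = top_k_features(X[train], y[train], k) if select_inside_cv else leaky_idx
        A_train = np.column_stack([np.ones(train.size), X[np.ix_(train, idx)]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([np.ones(test.size), X[np.ix_(test, idx)]])
        preds[test] = A_test @ coef
    return preds

rng = np.random.default_rng(0)
n, p, k = 100, 2000, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                           # no true relationship with X
folds = np.array_split(rng.permutation(n), 5)

corr_leaky = np.corrcoef(cv_predict(X, y, k, folds, False), y)[0, 1]
corr_inside = np.corrcoef(cv_predict(X, y, k, folds, True), y)[0, 1]
print(f"selection outside CV: r={corr_leaky:.2f}; selection inside CV: r={corr_inside:.2f}")
```

With selection done on all samples, the held-out predictions look informative even though y is noise. Repeating the selection inside each training fold removes that optimism.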
Outline
• eQTL analysis cont’d
– Including Multiple SNPs
– Prediction in high dimension and overfitting
– Regularized regression
• Epigenome-wide association studies (EWAS)
– Epigenetics and gene regulation
– Challenges: cellular heterogeneity (CH) and
approaches for adjusting for CH
Deriving mechanisms: from genomics data to mechanism to
complex disease
[Figure: layers linking the genome (SNPs), epigenome (DNAm), transcriptome (mRNA), and proteome to complex disease]
[Figure: DNA sequence with CpG 1, CpG 2, and CpG 3 marked; methylation data matrix of Subjects 1..n by CpGs 1..m, color scale 20% to 100%]
• What would the data look like if we could
measure it exactly?
[Figure: methylation data matrix of Subjects 1..n by CpGs 1..m, color scale 20% to 100%]
Arrays average the signal across multiple cells
present in a sample from one
subject/individual
Cons
• Causality cannot be inferred (in a typical human study)
• Impact on gene function needs to be worked out (as in
GWAS)
• Tissue of choice (relevant vs accessible)
• Cellular heterogeneity
Human tissues are complex
• Large number of unique cell types in any given tissue
• Cell types differ from each other in their epigenomes
and transcriptomes
Human tissues are complex
DM sites and DE genes detected in bulk tissue
can represent:
• Differential expression within the cell
• Differential cell type composition between
samples
[Figure: Genomic studies in complex tissues. Sample/individual 1 vs. Sample/individual 2. Adapted from Sandberg, 2014]
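A toy simulation of the second case, assuming two cell types whose methylation level at one CpG is identical in cases and controls, while the cell type proportions differ between groups. All numbers (methylation levels, proportions, noise) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 50
# Assumed cell-type-specific methylation at one CpG, the SAME in cases and controls
meth_by_cell_type = np.array([0.2, 0.8])

def simulate_bulk(mean_prop_type1, n):
    """Bulk methylation = proportion-weighted average over cell types plus noise."""
    prop1 = np.clip(rng.normal(mean_prop_type1, 0.05, size=n), 0.0, 1.0)
    props = np.column_stack([prop1, 1.0 - prop1])
    bulk = props @ meth_by_cell_type + rng.normal(0.0, 0.02, size=n)
    return bulk, prop1

controls, prop_controls = simulate_bulk(0.6, n_per_group)
cases, prop_cases = simulate_bulk(0.4, n_per_group)

# Naive comparison: bulk levels differ although no cell type changed
bulk_diff = cases.mean() - controls.mean()

# Adjusting for the cell type proportion removes the apparent group effect
bulk = np.concatenate([controls, cases])
group = np.concatenate([np.zeros(n_per_group), np.ones(n_per_group)])
prop1 = np.concatenate([prop_controls, prop_cases])
A = np.column_stack([np.ones(bulk.size), group, prop1])
coef, *_ = np.linalg.lstsq(A, bulk, rcond=None)
adjusted_effect = coef[1]
print(f"unadjusted difference={bulk_diff:.3f}, adjusted group effect={adjusted_effect:.3f}")
```

The unadjusted difference is driven entirely by composition, which is the confounding that the correction approaches later in the lecture try to remove.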
Cellular heterogeneity
• Blood as a motivating example
Cellular heterogeneity
• Gene expression (and other molecular traits)
varies across cell types
“cell type stratification”
Approaches for correcting for cell type
heterogeneity
• Unsupervised: “reference-free”
– No need for cell sorted data, or specifying which
cells to correct for
– Over correction?
• Semi-supervised / Supervised
– Informed by known cell types and their “markers”
Reference-free (i.e., unsupervised)
deconvolution
• Assume the “primary” signal in your data is related to cell counts
• Motivation: a few studies have shown that such primary signals
correlate with cell type proportions
$y = X\beta + S_{1:k}\gamma$
Include k PCs in your model
Reference-free procedure
1. Compute the SVD and get the
top k principal components (PCs) (hint: k
vectors of length n; n = sample size)
$y_{res} = y - S_{1:k}\gamma$, where $\gamma$ is the
OLS coefficient vector
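The procedure above can be sketched as follows, assuming the data matrix is arranged as samples × features and that features are centered before the SVD. The function names `top_pcs` and `residualize`, the choice k = 3, and the simulated hidden factor are illustrative assumptions.

```python
import numpy as np

def top_pcs(Y, k):
    """Top-k principal components in sample space: an n x k matrix S."""
    Yc = Y - Y.mean(axis=0)                       # center each feature
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :k]                               # k vectors of length n

def residualize(y, S):
    """Regress y on the PCs by OLS and return y_res = y - S @ gamma."""
    gamma, *_ = np.linalg.lstsq(S, y, rcond=None)
    return y - S @ gamma

rng = np.random.default_rng(0)
n, m, k = 60, 500, 3
hidden = rng.normal(size=(n, 1))                  # stand-in for a cell-composition signal
loadings = rng.normal(size=(1, m))
Y = hidden @ loadings + rng.normal(size=(n, m))   # simulated samples x features matrix

S = top_pcs(Y, k)
y_res = residualize(Y[:, 0], S)                   # one feature with the PCs removed
print(S.shape, np.abs(S.T @ y_res).max())
```

By construction the residual is orthogonal to the k PCs. Whether those PCs capture cell composition or also part of the signal of interest is the "over correction" concern.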
Reference free methods
• Based on linear mixed effect models
– EWASHER (Zou, Nature Methods 2013)
• Based on probabilistic factor analysis
– PEER (Stegle, PLOS Comp Bio 2010)
– PANAMA (Fusi, PLOS Comp Bio 2011)
• Based on PCA
– SVA (Leek and Storey, PLOS Genetics 2007)
– Sparse PCA (Rahmani, Nature Methods 2016)
Reference free: more nuanced methods
• Simple differential
expression/methylation analysis
(e.g. t-test) comparing one cell type
to the rest.
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-86]
Semi-supervised
• Example: RUV
• Motivation: batch effects
RUV
1. Define a set of negative control “features”
2. Log-transform and standardize the data
3. Take the SVD of the data, considering only the negative
control features
4. Use linear regression to remove the effect of the PCs
(from the SVD) from the entire normalized data
5. Obtain the residuals
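A simplified sketch of the five steps above, not the implementation in any RUV software package. It assumes a samples × features matrix of non-negative values, a log2 transform with a pseudocount of 1, and a user-chosen number of factors k. The function name `ruv_like_residuals` and the simulated batch effect are hypothetical.

```python
import numpy as np

def ruv_like_residuals(counts, control_idx, k):
    """Remove factors estimated from negative control features; return (residuals, W)."""
    # Step 2: log-transform (assumed log2 with pseudocount 1) and standardize each feature
    logged = np.log2(counts + 1.0)
    Z = (logged - logged.mean(axis=0)) / logged.std(axis=0, ddof=1)
    # Step 3: SVD restricted to the negative control features
    U, s, Vt = np.linalg.svd(Z[:, control_idx], full_matrices=False)
    W = U[:, :k]                                  # estimated unwanted factors (n x k)
    # Steps 4-5: regress every feature on W and keep the residuals
    gamma, *_ = np.linalg.lstsq(W, Z, rcond=None)
    return Z - W @ gamma, W

rng = np.random.default_rng(0)
n, m = 40, 100
batch = rng.normal(size=(n, 1))                   # simulated unwanted factor
loading = rng.normal(size=(1, m))
counts = rng.poisson(50.0 * np.exp(0.3 * batch @ loading))
control_idx = np.arange(20)                       # step 1: assumed negative controls

residuals, W = ruv_like_residuals(counts, control_idx, k=2)
print(residuals.shape, np.abs(W.T @ residuals).max())
```

The result depends on the negative controls being affected by the unwanted variation but not by the variable of interest.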
Other semi-supervised based methods
• DSA
• CellCode
How good are the predictions?
• Generate cell counts with FACS and compare
predicted to observed counts.
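One simple way to make this comparison is a per-cell-type Pearson correlation and root mean squared error between predicted and measured proportions. The function `evaluate_predictions` is an illustrative assumption, and both matrices below are simulated stand-ins, not real FACS measurements.

```python
import numpy as np

def evaluate_predictions(predicted, observed):
    """Per-cell-type Pearson r and RMSE; inputs are samples x cell types."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    r = np.array([np.corrcoef(predicted[:, j], observed[:, j])[0, 1]
                  for j in range(observed.shape[1])])
    rmse = np.sqrt(((predicted - observed) ** 2).mean(axis=0))
    return r, rmse

rng = np.random.default_rng(0)
observed = rng.dirichlet([5.0, 3.0, 2.0], size=40)                 # stand-in for FACS proportions
predicted = observed + rng.normal(0.0, 0.03, size=observed.shape)  # stand-in for estimates
r, rmse = evaluate_predictions(predicted, observed)
print("r per cell type:", np.round(r, 2), "RMSE per cell type:", np.round(rmse, 3))
```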