
Decision Trees and Random Forests

Prediction of Phenotype by Transcriptome Classification using Random Forest Machine Learning
Uwe Menzel, 2012, www.matstat.org

Data
o Transcriptomes (RNA-Seq, Illumina), different ages, 5 species
o > 1000 transcriptomes

"Finclip study"
o Fin biopsies from 152 individuals of N. furzeri aged 10 or 20 weeks (max. lifespan 60 weeks), taken without sacrificing the fish → the lifespan of each fish could be recorded
o → for each fish: transcriptome at 10 & 20 weeks + lifespan
o The fish were subdivided into 3 lifespan groups: short-lived (G1), medium-lived (G2), long-lived (G3)
o Note: grouping into (G1, G2, G3) is not mandatory; RF can also be used for regression.

Transcriptome data
o Rows: samples, observations (fishes); columns: predictors, features, variables (genes); one column holds the lifespan group (G1, G2, G3)
o Task: predict this symbol (G1, G2, G3) from the other columns
o TODO: identify those genes whose expression is predictive for lifespan → establish a statistical model

Decision tree: training set
o After the 1st split, we have one pure leaf containing only G1, and one leaf with samples from lifespan groups G2 & G3
o With 2 splits, "pure" leaves are created (each containing one class only)
o These splits are now our decision rules – the model

Test set, prediction
o A new sample arrives whose class (lifespan group) is still unknown; we apply the decision rules "learned" above
o Using the model (the decision tree), we can predict the lifespan class of an individual based on its transcriptome (at early age)
o In order to make unique predictions, we must achieve pure leaves (with a few splits)

Creating pure leaves: measuring purity
o Purity is measured by the entropy H = − Σ_i p_i · log2(p_i), where p_i is the probability (fraction) of class i in a leaf
o "Purity" = low entropy & high information content; "impurity" = higher entropy & lower information content

Creating pure leaves: entropy before and after a split
o The entropy after a split is the weighted average of the child-node entropies: H_split = Σ_i (N_i / N) · H_i, with N_i = number of samples in leaf i, H_i = entropy of leaf i, and N = Σ_i N_i
o Example: information gain = 0.925 − 0.792 = 0.133; information gain = entropy reduction from parent to child nodes

Creating pure leaves: finding the (best) splits
o Try out all variables (Nfu_…) and all splits (midpoints of the ordered expression values of the variable being probed); choose the split with the highest information gain
o ID3 and C4.5 algorithms (Ross Quinlan*)
o * Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106
o https://en.wikipedia.org/wiki/ID3_algorithm

Decision tree for the finclip study (RWeka):

library(RWeka)                           # provides J48, an interface to Weka's C4.5 implementation
tree <- J48(lifespan ~ ., data = df)     # grow a tree: lifespan group ~ all gene expression columns
summary(tree)                            # accuracy and confusion matrix on the training set
plot(tree)                               # plot the fitted tree

J48 is an implementation of the C4.5 algorithm, which in turn is an extension of Quinlan's earlier ID3 algorithm. Keep the tree reasonably small to avoid overfitting, even if impure leaves remain!

Classification by Machine Learning
o Once the tree is created, we can classify new samples by "running them down the tree" (Breiman, Cutler*) → classification (see the sketch below)
o * https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
o We have established a "model" and are now able to automatically classify new samples → machine learning (figure: http://gureckislab.org/blog)

Use Random Forest!
Advantages of the decision tree:
o a classifier that is favourable when the number of variables (genes) is higher than the number of observations (samples) – here 90 samples vs. 15,000 genes
o the tree pinpoints genes that are informative with regard to some attribute (here with regard to age)
Drawbacks:
o overfitting, bias-variance dilemma
o with huge data of large variance, correlations between predictors and response can occur that are purely coincidental
Solution: Use a Random Forest! (on pruning single trees, see https://en.wikipedia.org/wiki/Pruning_(decision_trees))
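Before moving on to Random Forests: a minimal sketch of how new samples could be run down the fitted J48 tree. This is an illustration only; the data frame new.df (held-out samples with the same gene columns as df, but without a known lifespan group) is a hypothetical object, not part of the original slides.

# Hypothetical example: classify held-out samples with the fitted J48 tree.
# 'new.df' is assumed to contain the same gene expression columns as 'df'.
predicted <- predict(tree, newdata = new.df)                         # predicted lifespan groups (G1, G2, G3)
probs     <- predict(tree, newdata = new.df, type = "probability")   # class membership probabilities
table(predicted)                                                     # samples per predicted lifespan group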
Random Forest (RF)
o Breiman, Cutler, 2001*
o ensemble classifier: creates many decision trees (typically 1000)
o prediction by aggregating multiple deep (unpruned) trees
o prevents overfitting to the training set
o uses a subset of the data for every tree:
  1. sample with replacement from the observations → reduces variance
  2. select a random subset of the variables → identify additional predictors
* Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5–32

Sampling in RF
o Ignore about 1/3 of the observations (fishes) by sampling with replacement
o Randomly choose √N of the N variables (genes)
o → every tree is built with a subset of the data

Error estimation for RF
o Plots: error estimates for a Random Forest (from Liaw & Wiener, "Classification and Regression by randomForest")

Variable Importance in RF
o a measure of the explanatory power of a variable (gene) with regard to the response (lifespan)
o Permutation importance: scramble the i-th variable in the training set and record the change in the out-of-bag error
o Plot: permutation importance vs. gene names

Feature selection (variable selection)
o selecting a subset of relevant features (variables, predictors) for use in the model (Wikipedia)
o here: select relevant genes (= variables) for prediction
o R package varSelRF (Diaz-Uriarte, 2006)
o builds RF models recursively, abandoning the variables (genes) with the lowest predictive power at every step
o predictive power is measured by the out-of-bag (OOB) error

library(varSelRF)                                  # variable selection with Random Forests
set.seed(99)
rf.model <- varSelRF(expr.data.df, class,          # expression data frame (samples x genes), lifespan groups
                     ntree = 5000,                 # trees per forest
                     vars.drop.frac = 0.2,         # drop the 20% least important genes in each iteration
                     keep.forest = TRUE)
best.variables <- rf.model$selected.vars           # genes retained in the final model
length(best.variables)

"Averaged tree"
The tree is based on the genes with the largest importance values calculated by the Random Forest model.

Important Genes = Biomarkers
o biomarkers for ageing
o enrichment analysis: GO, KEGG
o #1 "spectrin alpha 2": scaffold proteins that stabilize the plasma membrane
o #4 "clathrin": major protein component of the cytoplasmic face of intracellular organelles
o #7 "mitochondrial ribosomal protein" ...
o Plot: permutation importance vs. genes

Proximity of the samples
o RF calculates a proximity matrix (how close is every pair of samples?)
o if samples end up in the same leaf frequently, they are considered similar
o proximity matrix → MultiDimensional Scaling (MDS) plot
o the MDS plot presents the samples in a 2- or 3-D plot while preserving the relative distances between samples
o a point represents the whole transcriptome of a sample
o the samples cluster well according to lifespan group (see the R sketch below)
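The steps above (growing the forest, inspecting the OOB error, the permutation importance, and the proximity-based MDS plot) can be sketched in R as follows. This is an illustrative sketch only: the objects expr.data.df (expression data frame) and class (lifespan-group factor) are those assumed in the varSelRF call above, and all parameter values are arbitrary.

library(randomForest)

set.seed(99)
# Illustrative call: expression data frame and lifespan-group factor as in the varSelRF example above
rf <- randomForest(x = expr.data.df, y = class,
                   ntree = 1000,        # number of trees in the forest
                   importance = TRUE,   # compute permutation importance
                   proximity = TRUE)    # compute the proximity matrix

print(rf)                   # OOB error estimate and confusion matrix
plot(rf)                    # OOB error as a function of the number of trees
varImpPlot(rf)              # permutation importance per gene
MDSplot(rf, class, k = 2)   # MDS plot of the proximities, samples coloured by lifespan group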
Cross Validation
o Training set: build the RF model; test set: "forget" the response (for a while)
o 10-fold CV: subdivide the samples randomly into 10 parts; use 90% as training set, 10% as test set
o ca. 70% of the test samples were correctly classified in the finclip study
o Repeat the classification 10 times and calculate the average classification error

Pitfall in Cross Validation
Common mistake: knowledge leaking¹
o Feature selection must be made on the training set only
o The test set must not be included in the feature selection (see the sketch in the appendix below)
o http://gemler.fzv.uni-mb.si/results.php
¹ http://www.alfredo.motta.name/cross-validation-done-wrong

Variable Selection and Cross Validation

Summary
o Expression levels at early age (10/20 weeks) can be used to predict, with some accuracy, the lifespan of individual killifish (N. furzeri).
o Samples cluster fairly well according to the recorded lifespan (in the MDS plot), confirming the lifespan-predictive power of the identified biomarkers.
o 10-fold cross validation shows that almost 70% of the samples of the test set can be classified correctly (but this has not been confirmed on an independent held-out dataset!).
o Classification performance is similar for the 10- and 20-week transcriptomes, and for the change of expression between 10 and 20 weeks.
o Complete separation of the groups is unlikely, as some of the short-lived animals may not have survived for reasons unrelated to ageing.
o Validation using an independent test set is desirable in order to obtain a more solid assessment of the prediction performance.

Resources
o RWeka: https://cran.r-project.org/web/packages/RWeka/index.html
o randomForest: https://cran.r-project.org/web/packages/randomForest/index.html
o varSelRF: https://cran.r-project.org/web/packages/varSelRF/index.html
o Breiman/Cutler RF: https://www.stat.berkeley.edu/~breiman/RandomForests/
o "RWeka Odds and Ends" by Kurt Hornik (R core team), 2014
o Liaw, Wiener: "Classification and Regression by randomForest", R News, 2002
o Diaz-Uriarte, "GeneSrF and varSelRF ...", http://www.ncbi.nlm.nih.gov/pubmed/17767709
o "A Brief Tour of the Trees and Forests", R-Bloggers, http://www.r-bloggers.com/a-brief-tour-of-the-trees-and-forests/

Appendix: ID3, C4.5 (Quinlan), RF (Breiman)
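To illustrate the knowledge-leaking pitfall discussed above, here is a minimal, hypothetical sketch of 10-fold cross validation in which feature selection is repeated inside every fold, so that the test fold never influences which genes are selected. The objects expr.data.df and class are assumed as before; ranking genes by permutation importance and keeping the top 50 are arbitrary illustrative choices (the study itself used varSelRF for feature selection).

library(randomForest)

set.seed(99)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(expr.data.df)))   # random assignment of samples to 10 folds
acc <- numeric(k)

for (i in 1:k) {
  train.x <- expr.data.df[folds != i, ]
  train.y <- class[folds != i]
  test.x  <- expr.data.df[folds == i, ]
  test.y  <- class[folds == i]

  # Feature selection on the training fold ONLY (no knowledge leaking):
  rf.sel <- randomForest(x = train.x, y = train.y, ntree = 1000, importance = TRUE)
  imp    <- importance(rf.sel, type = 1)                           # permutation importance per gene
  top    <- rownames(imp)[order(imp, decreasing = TRUE)][1:50]     # e.g. keep the 50 top-ranked genes

  # Refit on the selected genes and evaluate on the held-out fold:
  rf.fit <- randomForest(x = train.x[, top], y = train.y, ntree = 1000)
  pred   <- predict(rf.fit, newdata = test.x[, top])
  acc[i] <- mean(pred == test.y)
}
mean(acc)   # average fraction of correctly classified test samples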